The 7 a.m. Dashboard That Lied — And the Data Quality Guardrails That Shut It Up

If a single schema change can nuke your CFO’s dashboard, you don’t have analytics—you have a roulette wheel. Here’s the data quality monitoring that actually prevents downstream failures, with code you can ship this week.

Treat data quality like reliability engineering: define SLOs, measure relentlessly, and make the pipeline refuse to ship garbage.

The 7 a.m. dashboard that lied

I’ve watched a CFO text the exec thread at 7:12 a.m.: “Why is MRR zero?” We were on Snowflake with Fivetran pulling Stripe and Salesforce, dbt modeling, and Looker on top. Overnight, Stripe added a nullable column that Fivetran propagated. A downstream dbt model used SELECT * (yep), re-ordered columns, and a Looker Explore joined on the wrong key. No job failed. No alert fired. The dashboard cached a perfect, empty world.

That morning we did what most teams do: hotfix the model, invalidate Looker cache, post a Slack apology, and swear we’ll “add tests.” Six weeks later, different table, same fire.

What finally worked wasn’t more heroics. It was treating data quality like SRE treats service reliability—explicit SLOs, automated checks, and alerts routed to owners who can fix them before the CFO wakes up.

Why downstream failures keep blindsiding analytics

Analytics pipelines fail quietly. Services blow up loudly. That asymmetry is why leaders get ambushed by “quiet data incidents.” The usual culprits:

  • Freshness drift: late or stalled upstream loads; cron skew; ingestion backlog on Kafka/Kinesis; warehouse credits exhausted.
  • Schema/contract drift: vendors add columns, change types, or repurpose enums; SELECT * and implicit casts make it worse.
  • Completeness gaps: partial loads, dropped partitions, missing days in incremental models.
  • Statistical anomalies: row counts spike or crater; distributions shift (e.g., amount suddenly 10x due to currency change).
  • Business rule violations: referential integrity breaks; negative revenue; duplicate orders.

When this goes wrong, you get:

  • Bad decisions: exec dashboards lying for hours—call it “data downtime.”
  • Wasted spend: BI queries churn on garbage; engineers burn cycles re-running backfills.
  • Slower delivery: fear-driven sign-offs and manual validations pile up.

You need monitoring that detects issues where they occur, blocks propagation, and pages the owner with context and a runbook. Not another “quality initiative”—operational guardrails.

What to actually monitor (and where)

Forget 200 checks you’ll never maintain. Start with five categories and make them measurable:

  1. Freshness (source, staging, mart)
    • Example SLO: “99% of fact_payments partitions land within 30 minutes of source close.”
    • Signals: max partition timestamp, ingestion lag, last successful run.
  2. Completeness
    • Row-count deltas vs trailing average; partition coverage; distinct keys vs expected cardinality.
  3. Schema/contract
    • Detect adds/drops/type changes; enforce accepted_values for enums; prohibit SELECT * in critical models.
  4. Statistical anomalies
    • Distribution shifts (KS test or KL divergence), outlier rates, null rates on critical columns.
  5. Business rules
    • Referential integrity (e.g., every order has a customer), dedup keys, no negative amounts, currency constraints.

And one meta-control:

  • Lineage and blast radius
    • Use OpenLineage/Marquez or built-in lineage (dbt Cloud, Databricks) to halt downstream jobs when upstream fails.

Define a small set of SLOs and page on them. Everything else can be warnings in Slack or a dashboard.
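
Freshness, in particular, needs no extra tooling to start: dbt source freshness checks cover the source layer. A minimal sketch, assuming Fivetran-loaded Stripe tables with a `_fivetran_synced` timestamp (adjust names to your loader):

# models/staging/stripe/sources.yml (hypothetical paths and names)
version: 2
sources:
  - name: stripe
    database: raw
    schema: stripe
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 15, period: minute}
      error_after: {count: 30, period: minute}
    tables:
      - name: payment

`dbt source freshness` then reports pass/warn/error per table, which you can export as metrics or gate on in the orchestrator.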

A reference architecture that works without a six-figure platform

I’ve deployed this stack at startups and post-IPO shops without buying another “AI observability” SKU:

  • Transform/tests: dbt (Core or Cloud). Put basic tests in schema.yml and tag critical models.
  • Deeper checks: Great Expectations or Soda Core. Use for distributions, completeness, and contracts.
  • Orchestration: Airflow or Dagster (or Prefect). Make quality checks first-class tasks that gate downstream.
  • Metrics/alerts: Prometheus + Alertmanager (or DataDog) to scrape/export freshness and test results, alert to Slack/PagerDuty.
  • Lineage: OpenLineage/Marquez or dbt’s built-in lineage to compute blast radius and auto-stop dependent tasks.
  • GitOps: IaC with Terraform; deployments via ArgoCD for infra and dbt jobs.

Are Monte Carlo, Bigeye, and Anomalo good? Sure, if you can justify the spend and need auto-detection at scale. But you can cover 80% of the risk with the stack above and some discipline.

Implementation: from tests to SLOs (with code)

Start by codifying non-negotiables in dbt. These run fast, catch the obvious, and live with the model.

# models/marts/payments/schema.yml
version: 2
models:
  - name: fact_payments
    description: "Finalized payments with one row per payment_id"
    config:
      tags: [critical, owner_finance]
    columns:
      - name: payment_id
        tests:
          - not_null
          - unique
      - name: amount_cents
        tests:
          - not_null
      - name: currency_code
        tests:
          - not_null
          - relationships:
              to: ref('dim_currency')
              field: currency_code
      - name: status
        tests:
          - accepted_values:
              values: ['succeeded','failed','refunded','disputed']

Add deeper checks with Great Expectations or Soda. Example with GE ensuring completeness and distribution sanity:

# great_expectations/expectations/fact_payments.yml
expectation_suite_name: fact_payments_suite
expectations:
  - expect_table_row_count_to_be_between:
      min_value: 50000
      max_value: 200000
  - expect_column_values_to_not_be_null:
      column: payment_id
  - expect_column_values_to_be_between:
      column: amount_cents
      min_value: 0
      mostly: 0.999
  - expect_column_kl_divergence_to_be_less_than:
      column: amount_cents
      partition_object:
        # baseline partition: bin edges plus the observed share of rows in each bin
        bins: [0, 1000, 5000, 10000, 50000, 100000]
        weights: [0.35, 0.30, 0.20, 0.10, 0.05]
      threshold: 0.1

Run the suite as a gate in your orchestrator. Airflow example that exports metrics and fails fast:

# dags/payments_quality.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
from airflow.utils.dates import days_ago
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import os

with DAG(
    'payments_quality', start_date=days_ago(1), schedule_interval='*/15 * * * *', catchup=False
) as dag:
    dbt_test = KubernetesPodOperator(
        task_id='dbt_tests',
        name='dbt-tests',
        namespace='data',
        image='ghcr.io/org/dbt:1.7.6',
        cmds=['bash','-lc'],
        arguments=['dbt deps && dbt test --select tag:critical'],
    )

    ge_check = KubernetesPodOperator(
        task_id='ge_suite',
        name='ge-suite',
        namespace='data',
        image='ghcr.io/org/great-expectations:latest',
        cmds=['bash','-lc'],
        arguments=['great_expectations checkpoint run fact_payments_suite'],
    )

    def publish_metrics():
        reg = CollectorRegistry()
        freshness = Gauge('fact_payments_freshness_minutes', 'Lag to max partition', registry=reg)
        failures = Gauge('fact_payments_quality_failures', 'Number of failed checks', registry=reg)
        # In reality, query warehouse for max partition and parse GE/DBT artifacts
        freshness.set(12)
        failures.set(int(os.getenv('QUALITY_FAILURES', '0')))
        push_to_gateway('prometheus-pushgateway.data:9091', job='payments_quality', registry=reg)

    publish = PythonOperator(
        task_id='publish_metrics',
        python_callable=publish_metrics,
        trigger_rule='all_done',  # publish even when checks fail so the alert rules below can fire
    )

    # Downstream mart/BI refresh tasks should depend on these with the default
    # all_success trigger rule so a failed check stops propagation.
    dbt_test >> ge_check >> publish

Wire Prometheus to alert on freshness and failures:

# prometheus/alerting-rules.yml
groups:
- name: data-quality
  rules:
  - alert: FactPaymentsFreshnessBreached
    expr: fact_payments_freshness_minutes > 30
    for: 10m
    labels:
      severity: critical
      team: finance
    annotations:
      summary: fact_payments freshness SLO breached
      runbook: https://runbooks.internal/finance/fact_payments
  - alert: FactPaymentsQualityFailures
    expr: fact_payments_quality_failures > 0
    for: 0m
    labels:
      severity: page
      team: finance
    annotations:
      summary: dbt/GE checks failing for fact_payments

If you prefer Soda Core, scanning Snowflake is one file and one command:

# soda/checks/fact_payments.yml
checks for fact_payments:
  - freshness(finished_at) < 30m
  - row_count between 50000 and 200000
  - missing_count(payment_id) = 0
  - duplicate_count(payment_id) = 0
  - invalid_count(status) = 0:
      valid_values: [succeeded, failed, refunded, disputed]

# run the scan; check files are passed as positional arguments
soda scan -d snowflake -c soda/config.yml soda/checks/fact_payments.yml

Prevent propagation: if quality fails, don't let marts refresh or BI caches warm. In Airflow/Dagster, have downstream tasks depend on the quality tasks with the default all_success trigger rule. In dbt, prefer a single `dbt build` over separate run/test invocations: it interleaves tests with models in DAG order and skips anything downstream of a failed test, and selectors keep that scoped to the critical marts.
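
A selectors file keeps that scope versioned instead of living in someone's shell history. A minimal sketch, assuming the `critical` tag from the schema.yml above:

# selectors.yml (hypothetical)
selectors:
  - name: critical_marts
    description: Critical models plus the parents they depend on
    definition:
      method: tag
      value: critical
      parents: true

Then `dbt build --selector critical_marts` runs models and tests together in DAG order and skips anything downstream of a failed test.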

Rollout plan that sticks (4–6 weeks)

  1. Inventory and tag
    • Identify top 20 critical tables driving exec dashboards. Tag owners in code (tags: [owner_finance]).
  2. Baseline
    • Collect freshness, row counts, null/dup rates for 2 weeks. No alerts yet. Just learn.
  3. Define SLOs with the business
    • “99% of fact_payments within 30 minutes.” “<0.5% duplicate order_id per day.” Put SLOs in a doc and in code.
  4. Gate pipelines
    • Add dbt tests and GE/Soda checks. Fail the job and stop downstream on critical breaches.
  5. Alert the right people
    • Route PagerDuty by tag/owner. Slack warnings to the data channel. Attach runbooks and recent changes (see the Alertmanager sketch after this list).
  6. Close the loop
    • Weekly review: incidents, MTTR, data downtime minutes, SLO compliance. Fix root causes (e.g., remove SELECT *, add CDC retries).
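
For step 5, routing by owner is a few lines of Alertmanager config. A sketch, assuming the `team` labels set in the alert rules above; receiver names and keys are placeholders:

# alertmanager/config.yml (routing sketch)
route:
  receiver: slack-data-warnings          # default: warnings go to the data channel
  routes:
    - matchers:
        - team = finance
        - severity =~ "critical|page"
      receiver: pagerduty-finance        # pages the finance data product owner
receivers:
  - name: slack-data-warnings
    slack_configs:
      - channel: '#data-quality'         # requires slack_api_url in the global config
  - name: pagerduty-finance
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'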

Most teams see results within two sprints without buying new infra.

Results you can expect (from real rollouts)

At a B2B SaaS (Snowflake + Fivetran + dbt + Looker, 40 analysts):

  • Reduced data MTTR from 6 hours to 45 minutes in 30 days.
  • Cut freshness breaches on executive dashboards from 9/week to 1/week in 60 days.
  • Eliminated two categories of incidents entirely (schema drift on Stripe tables, duplicate orders in EU region) by adding contracts and dedup keys.
  • Senior leadership stopped asking “Is the data right?” every standup. That’s a cultural KPI.

At a marketplace (BigQuery + Kafka + Dataproc + Tableau):

  • Implemented partition completeness checks and lineage-based blast radius; prevented three downstream rebuilds/week, saving ~120 engineer-hours/month.
  • Data downtime minutes on fact_gmv reduced by 83%.

No fairy dust. Just checks, SLOs, and accountability that’s automated.

Common traps and how to dodge them

  • Spray-and-pray checks: 500 assertions nobody maintains. Start with 20 that matter.
  • Paging the world: a single Slack firehose. Route to owners with runbooks.
  • Ignoring lineage: without blast radius, you keep shipping bad data to BI.
  • Silent retries: Airflow/Dagster default retries mask real failures. Alert on first failure, then retry.
  • SELECT * in critical models: makes schema drift invisible and dangerous.
  • No CI: tests that only run in prod. Add GitHub Actions to run dbt test and Soda/GE on PRs (see the workflow sketch below).
  • Overfitting thresholds: don’t hard-code today’s row count; use trailing averages and confidence bands.

When in doubt, ask: “Would I page someone for this?” If not, it’s a warning, not a gate.
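
On the CI gap above: a minimal GitHub Actions workflow that runs the critical checks on every pull request. A sketch, assuming warehouse credentials live in repo secrets and a `ci` target in profiles.yml reads them from the environment:

# .github/workflows/data-quality.yml (hypothetical; adjust adapter and secrets)
name: data-quality
on:
  pull_request:
    paths: ['models/**', 'soda/**', 'great_expectations/**']
jobs:
  quality:
    runs-on: ubuntu-latest
    env:
      SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
      SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
      SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-snowflake soda-core-snowflake
      # builds and tests critical models into a throwaway CI schema
      - run: dbt deps && dbt build --select tag:critical --target ci
      - run: soda scan -d snowflake -c soda/config.yml soda/checks/fact_payments.yml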

A short checklist you can ship this week

  • Add not_null, unique, and key accepted_values tests to your top 10 models in dbt.
  • Instrument freshness and row-count deltas to Prometheus or DataDog.
  • Add one GE/Soda suite for a revenue-critical table.
  • Stop downstream jobs on critical check failures.
  • Route alerts to owners; include runbooks and last-commit info.
  • Review incidents weekly; track MTTR and data downtime.

You don’t need another platform. You need SLOs, signals, and someone to own the beep.


Key takeaways

  • Treat data quality like SRE treats service reliability: define SLOs, measure, alert, improve.
  • Monitor five things: freshness, completeness, schema/contract, statistical anomalies, and business rules.
  • Start small: top 20 tables, baseline metrics for two weeks, then set thresholds.
  • Integrate tests into the orchestrator and CI; break the build on critical failures and page owners, not a shared channel.
  • Use lineage to assess blast radius and stop downstream jobs automatically.
  • Track MTTR and ‘bad data minutes’ to prove business value and drive prioritization.

Implementation checklist

  • Pick 10–20 critical tables and tag owners.
  • Instrument freshness and row-count deltas ASAP.
  • Add `dbt` tests for not_null/unique/accepted_values; fail PRs without tests.
  • Add Great Expectations/Soda checks for distributions and completeness.
  • Publish metrics to Prometheus; create SLOs and alert rules.
  • Wire alerts to PagerDuty/Slack by data product owner; include runbook links.
  • Enable lineage (OpenLineage/Marquez) and auto-halt downstream on critical failures.
  • Review incidents weekly; track MTTR, incident rate, and data downtime.

Questions we hear from teams

Do I need a commercial data observability platform to start?
No. You can cover most risk with dbt tests, Great Expectations or Soda for deeper checks, Prometheus for metrics, and your existing orchestrator for gating. Add a vendor later if you need ML-based anomaly detection at scale.
How do I pick thresholds without false positives?
Baseline for two weeks, then set thresholds using trailing averages and percent bands. Page on SLOs (e.g., freshness >30 min for 10 min) and send everything else as warnings.
Who should get paged for data quality failures?
The data product owner. Tag models with owners in code and route alerts via Alertmanager/PagerDuty based on those tags. Shared channels cause alert fatigue and no accountability.
What metrics convince executives this is working?
MTTR for data incidents, SLO compliance rate, number of prevented downstream builds, and ‘data downtime minutes’ on executive metrics. Tie each incident to a dollar impact where possible.
How do we prevent schema drift from breaking models?
Stop using SELECT *. Add dbt model contracts (`contract: enforced: true` with declared column types and constraints), enable GE/Soda schema checks, and run CI on PRs for sources and models. For sources like Kafka, use a schema registry and enforce compatibility.
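
Concretely, a contract on a staging model turns an upstream type change into a build failure instead of a silent downstream surprise. A sketch (dbt 1.5+), assuming a stg_stripe__payments model on Snowflake:

# models/staging/stripe/schema.yml (contract sketch)
version: 2
models:
  - name: stg_stripe__payments
    config:
      contract:
        enforced: true
    columns:
      - name: payment_id
        data_type: varchar
        constraints:
          - type: not_null
      - name: amount_cents
        data_type: number
      - name: currency_code
        data_type: varchar

With the contract enforced, dbt compares the model's returned columns and types against the declared ones at build time and errors on any mismatch.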

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Fix your data reliability gaps. Bring us your ugliest pipeline.
