The Tuesday Morning Dashboard Fire We Never Fought Again: Data Quality Guardrails That Block Bad Data Upstream
Data reliability isn’t a dashboard; it’s a contract, a circuit breaker, and a runbook. Here’s how we stop bad data before it burns your quarter.
“Bad data is worse than no data. Build tripwires that fail fast before the business does.”
The outage you’ve lived through
Two quarters ago, a retail client’s CFO pinged at 7:08 AM: the revenue dashboard was blank. A late CDC update dropped an extra NULL in order_total and the dbt model “helpfully” filtered out the entire dataset. The pipeline ran green. The Looker tiles were empty. Finance paused a promo, marketing held back spend, and the CEO’s staff meeting was chaos. I’ve seen this movie. What actually prevented a repeat wasn’t more observability dashboards—it was a set of guardrails that stopped bad data at the gate.
What follows is the battle-tested pattern we implemented at GitPlumbers: contracts + SLOs, checks at each hop, CI enforcement, orchestration circuit breakers, and SRE-grade alerting. It’s not pretty, but it works.
Why downstream analytics keep failing (and what to fix)
The root causes aren’t mysterious:
- Schema drift: Someone adds a column, changes a type, or sends an unexpected enum. Producers ship; consumers crash quietly.
- Freshness debt: Cron-based ingestion slips; daily dashboards silently use stale data.
- Silent truncation and null creep: Casts, over-aggressive filters, and late arrivals drop records with no alarm.
- AI-generated pipelines: ‘Vibe-coded’ SQL from Copilot that passes the happy path but crumbles at month-end edge cases.
- No contracts, no owners: The data domain, not the platform, owns truth—but nobody wrote it down.
Here’s what actually works:
- Write data contracts with SLOs for your top 10 business-critical datasets.
- Instrument checks where they belong: producer, ingest, transform, warehouse, BI.
- Enforce in CI/CD and orchestration with hard stops.
- Alert on SLO burn with clear runbooks.
- Measure impact: MTTR, failed-orders prevented, forecast accuracy, on-time dashboard availability.
Start with the contract: schema, semantics, SLOs
Stop arguing about “what’s correct” at 7 AM. Agree in advance.
- Schema: data types, nullability, enumerations, PK/FK, partition keys.
- Semantics: business definitions (e.g., `order_total` includes discounts, excludes tax), timeliness expectations, idempotency rules.
- SLOs: measurable targets for freshness, completeness, accuracy, and uniqueness.
A simple contract stored with code (GitOps-style), versioned, and reviewed:
```yaml
# contracts/orders.yaml
name: orders
owner: commerce-data@company.com
schema:
  id: string            # UUID v4, not null
  created_at: timestamp
  customer_id: string
  order_total: decimal(12,2)
  currency: string      # ISO-4217
  status: [PLACED, FULFILLED, CANCELLED]
rules:
  - name: pk_unique
    expression: unique(id)
  - name: non_negative_total
    expression: order_total >= 0
  - name: currency_enum
    expression: currency in ["USD","EUR","GBP"]
slos:
  freshness:
    target: "<= 15m"    # end-to-end
    probe: max(created_at) vs now()
  completeness:
    target: ">= 99.5%"  # per 1h window vs source of truth
  accuracy:
    target: ">= 99.9%"  # reconciliation vs payments
  uniqueness:
    target: ">= 99.99%"
lineage:
  upstream: [kafka://orders.v3]
  downstream: [snowflake://analytics.fact_orders]
```
Bind these SLOs to business KPIs. For this client, hitting 99.5% hourly completeness reduced promo misallocation by 12% because budgets weren’t reacting to phantom troughs.
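The completeness SLO above reduces to a small computation: compare per-window row counts in the warehouse against the source of truth. A minimal sketch, assuming the counts have already been fetched (the window keys and numbers are illustrative):

```python
# Sketch: computing the hourly completeness SLI for the contract above.
# Assumes per-hour row counts are already fetched from the warehouse and the
# source of truth; here they are plain dicts (window -> count).

def completeness_sli(warehouse_counts: dict, source_counts: dict) -> dict:
    """Completeness ratio per hourly window: warehouse rows vs source rows."""
    slis = {}
    for window, expected in source_counts.items():
        observed = warehouse_counts.get(window, 0)
        slis[window] = 1.0 if expected == 0 else min(observed / expected, 1.0)
    return slis

def breaches(slis: dict, target: float = 0.995) -> list:
    """Windows that miss the contract's >= 99.5% completeness target."""
    return sorted(w for w, ratio in slis.items() if ratio < target)

source = {"07:00": 1000, "08:00": 1200}     # payments (source of truth)
warehouse = {"07:00": 1000, "08:00": 1100}  # CDC lag dropped 100 rows
print(breaches(completeness_sli(warehouse, source)))  # -> ['08:00']
```

Emit the per-window ratios as metrics (see the Prometheus section below) so breaches page someone instead of surfacing in a Monday retro.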
Put checks where the breakage starts (not just where it’s visible)
You can’t test quality only in the warehouse and expect resilience. Instrument each hop.
1) Producer and ingest
- Schema enforcement: use Confluent Schema Registry with Avro/Protobuf and compatibility mode `BACKWARD`.
```shell
# Example: enforce backward compatibility on orders topic
confluent schema-registry subject update-config orders-value --compatibility BACKWARD
```
- Reject bad messages with DLQs and metrics.
```properties
# Kafka Connect setting
errors.tolerance=none
errors.deadletterqueue.topic.name=dlq.orders
```
- Volume and lag checks with ksqlDB or Kafka Streams.
```sql
-- ksqlDB anomaly-ish detector: low-traffic alert window
CREATE TABLE order_volume AS
  SELECT WINDOWSTART AS ts, COUNT(*) AS cnt
  FROM orders
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY 1;
```
2) Transform (dbt)
- Schema tests catch nulls, dupes, and broken relationships. Fail the run.
```yaml
# models/fact_orders.yml
version: 2
models:
  - name: fact_orders
    columns:
      - name: id
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```
- Freshness checks using dbt sources.
```yaml
# models/src_orders.yml
version: 2
sources:
  - name: raw
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 10, period: minute}
          error_after: {count: 20, period: minute}
```
3) Warehouse semantic checks (Great Expectations or Soda)
These are great for accuracy and invariants that don’t fit dbt’s schema lens.
```yaml
# great_expectations/checkpoints/fact_orders.yml
name: fact_orders_checkpoint
config_version: 1
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: snowflake
      data_asset_name: analytics.fact_orders
    expectation_suite_name: fact_orders_suite
```

```python
# expectations/fact_orders_suite.py
import great_expectations as gx
from great_expectations.core.expectation_configuration import ExpectationConfiguration

context = gx.get_context()
suite = context.create_expectation_suite("fact_orders_suite", overwrite_existing=True)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "order_total", "min_value": 0, "max_value": 100000},
    )
)
context.save_expectation_suite(suite)
```
4) BI layer
- Smoke tests on critical dashboards: run parameterized queries and validate row counts, last-updated times, and key segments present. We use a tiny pytest suite hitting Looker or Metabase APIs.
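The core of such a smoke test is a pure validator you can assert against. A sketch, assuming a hypothetical payload shape (in real life, fetch it from your BI tool’s API and adapt the fields):

```python
# Sketch of a dashboard smoke test. The payload shape, segments, and
# thresholds are illustrative -- adapt to what your BI API actually returns.
from datetime import datetime, timedelta, timezone

def validate_tile(payload: dict, min_rows: int, max_age: timedelta) -> list:
    """Return a list of human-readable failures (empty list == healthy)."""
    failures = []
    if payload["row_count"] < min_rows:
        failures.append(f"row_count {payload['row_count']} < {min_rows}")
    age = datetime.now(timezone.utc) - payload["last_updated"]
    if age > max_age:
        failures.append(f"stale: last updated {age} ago")
    missing = {"US", "EU"} - set(payload["segments"])
    if missing:
        failures.append(f"missing segments: {sorted(missing)}")
    return failures

# pytest-style usage on a healthy tile:
payload = {
    "row_count": 1840,
    "last_updated": datetime.now(timezone.utc) - timedelta(minutes=5),
    "segments": ["US", "EU"],
}
assert validate_tile(payload, min_rows=100, max_age=timedelta(minutes=30)) == []
```

Run it on a schedule just before the 8 AM dashboard SLA so a blank tile pages an engineer, not the CFO.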
Enforce quality in CI/CD and orchestration (fail fast, loudly)
Treat quality gates like build gates.
CI on every model/schema change
- Run `dbt build` + `soda scan` or `great_expectations checkpoint run` on a sampled environment.
```yaml
# .github/workflows/data-ci.yml
name: data-ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - name: Install deps
        run: |
          pip install dbt-snowflake great_expectations soda-core-snowflake
      - name: dbt build (sample)
        run: dbt build --profiles-dir profiles --vars '{env: ci}'
      - name: Great Expectations
        run: great_expectations checkpoint run fact_orders
```
Airflow: circuit breaker and lineage-aware retries
- Short-circuit downstream tasks if checks fail. Don’t “keep calm and continue.”
```python
# dags/orders_quality.py
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

with DAG(
    "orders_pipeline",
    schedule_interval="*/15 * * * *",
    start_date=...,  # set a fixed historical date
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_orders")
    ge_check = GreatExpectationsOperator(
        task_id="ge_fact_orders",
        data_context_root_dir="/opt/airflow/great_expectations",
        checkpoint_name="fact_orders_checkpoint",
        fail_task_on_validation_failure=True,
    )
    stop_on_fail = ShortCircuitOperator(
        task_id="circuit_breaker",
        python_callable=lambda ti: ti.xcom_pull(task_ids="ge_fact_orders")["success"],
    )
    transform = EmptyOperator(task_id="dbt_run")
    publish = EmptyOperator(task_id="publish_metrics")

    ingest >> ge_check >> stop_on_fail >> transform >> publish
```
- Bake lineage into retries: if the upstream freshness SLA is missed, skip instead of retrying forever. We’ve seen Airflow spend five hours reprocessing garbage.
Alert like an SRE: SLO burn, not noise
We convert checks into metrics and alert on burn rate and user impact.
- Expose SLIs to Prometheus using a tiny exporter or pushgateway from your checks.
```text
# /metrics
data_freshness_seconds{dataset="orders"} 340
data_completeness_ratio{dataset="orders",window="1h"} 0.996
data_quality_failures_total{dataset="fact_orders",type="expectation"} 0
```
- Alert on fast burn (severe regression) and slow burn (sustained miss).
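The exposition format above is simple enough that the "tiny exporter" can be a few lines of stdlib Python. A sketch with canned values (in production, derive them from your check results):

```python
# Tiny /metrics exporter sketch using only the stdlib. slis() returns canned
# values here; in production, compute them from GE results / dbt artifacts.
from http.server import BaseHTTPRequestHandler, HTTPServer

def slis() -> dict:
    return {
        'data_freshness_seconds{dataset="orders"}': 340,
        'data_completeness_ratio{dataset="orders",window="1h"}': 0.996,
        'data_quality_failures_total{dataset="fact_orders",type="expectation"}': 0,
    }

def render_metrics(metrics: dict) -> str:
    """Prometheus text exposition format: one 'name{labels} value' per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(slis()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 9102), MetricsHandler).serve_forever()
```

Point a Prometheus scrape job at it and the alert rules below work unchanged.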
```yaml
# alerting/data-rules.yml
- alert: OrdersFreshnessFastBurn
  expr: data_freshness_seconds{dataset="orders"} > 1200
  for: 5m
  labels: { severity: page }
  annotations:
    summary: Orders freshness SLO fast-burn
    runbook: https://runbooks.company.com/orders-freshness
- alert: FactOrdersCompletenessSlowBurn
  expr: avg_over_time(data_completeness_ratio{dataset="orders"}[2h]) < 0.995
  for: 30m
  labels: { severity: ticket }
  annotations:
    summary: Completeness slow-burn; investigate CDC delay
```
- Route alerts to Slack/Jira with clear owners and escalation. No mystery alerts.
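The fast/slow thresholds come from simple burn-rate math: how quickly are you spending the error budget, i.e. 1 minus the SLO target? A sketch, assuming the 99.5% completeness target from the contract:

```python
# Burn rate = observed error rate / error budget. A rate of 1.0 spends the
# budget exactly at the allowed pace; anything >1 will blow the SLO.
def burn_rate(sli: float, slo_target: float) -> float:
    return (1.0 - sli) / (1.0 - slo_target)

# Against a 99.5% completeness target: an hour at 99.0% burns budget 2x too
# fast, and 98.0% burns it 4x too fast (the page-worthy fast burn).
print(round(burn_rate(0.99, 0.995), 1))  # -> 2.0
print(round(burn_rate(0.98, 0.995), 1))  # -> 4.0
```

Page on high burn over short windows, ticket on modest burn over long windows; that split is what keeps the pager quiet without hiding sustained misses.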
If you don’t include a runbook link and a single accountable owner per dataset, you’re just generating anxiety, not reliability.
Measure outcomes: prove the value or it won’t last
Here’s what changed for that retail client in four weeks:
- On-time dashboard availability (8 AM daily) moved from 92% to 99.4%.
- MTTR for data incidents dropped from ~8h to 45m.
- Promo budget misfires decreased by 12% due to timely and complete revenue data.
- Analyst rework for “why is this number wrong?” tickets fell by 40%.
We reviewed SLOs in a weekly 30-minute ops review with product, analytics, and platform. The business now funds contracts as first-class backlog items. That’s the unlock.
Pitfalls and what actually works
I’ve seen these fail repeatedly:
- Observability-only plays: Adding Monte Carlo/Soda without contracts turns into pretty charts about chaos you already knew existed.
- All-or-nothing MDM: You don’t need a year-long master data project to stop nulls today.
- ‘Allow nulls then fix later’: Later never comes.
- AI-generated SQL in prod without tests: vibe code belongs behind feature flags and staging checks.
And here’s what works consistently:
- Contract the top 10 datasets tied to money and customers first.
- Start with dbt tests + source freshness, then add Great Expectations for semantics.
- Wire a circuit breaker in orchestration so bad data doesn’t flow downhill.
- Expose SLIs to Prometheus and alert on burn with runbooks.
- Make owners explicit: a distribution list is not ownership.
If you’re underwater and need it done yesterday, GitPlumbers drops in with a contract starter kit, dbt/GE templates, Airflow wiring, and a one-week pilot that leaves behind pipelines you can actually maintain.
Key takeaways
- Data quality that prevents failures begins with explicit contracts and SLOs, not ad-hoc checks.
- Instrument checks where issues originate: at the producer, ingest, transform, and warehouse layers.
- Automate enforcement via CI/CD and orchestration with fail-fast circuit breakers.
- Alert on SLO burn rate with runbooks to reduce MTTR and avoid alert fatigue.
- Tie reliability metrics to business KPIs so leadership actually cares and funds the work.
Implementation checklist
- Define data contracts (schema + semantics + SLOs) for high-impact datasets.
- Implement dbt tests and Great Expectations checks for core models.
- Wire Airflow/Orchestration with short-circuit stops on quality failures.
- Expose Prometheus metrics for data SLIs and create actionable alerts.
- Version, review, and test contracts in CI before schema changes reach prod.
- Add runbooks with owners, escalation paths, and rollback steps.
- Review reliability metrics in weekly ops with product and analytics leaders.
Questions we hear from teams
- What if our data producers won’t agree to a contract?
- Start at the consumer boundary. Define the contract and enforce it at ingest with schema validation, DLQs, and compatibility checks. Publish breaches as metrics tied to business impact. Once producers see their changes page someone, they’ll show up to the table.
- We already have Monte Carlo/Soda. Isn’t that enough?
- Those are great, but they’re observability. Without explicit contracts and enforcement in CI/orchestration, you’re watching failures, not preventing them. Use them to power SLIs and runbooks.
- How do we avoid alert fatigue?
- Alert on SLO burn rates (fast and slow), not every failed check. Group alerts by dataset with a single owner/runbook. Add quiet hours for non-critical datasets. Review noise weekly and prune.
- Can we do this in streaming?
- Yes. Enforce schemas via Schema Registry, route invalid events to DLQs, and run windowed completeness/freshness checks with Kafka Streams or Flink. Expose the same SLIs and wire the same alerting.
- What does the first week with GitPlumbers look like?
- Day 1-2: pick top 3 datasets, draft contracts. Day 3: dbt tests + source freshness. Day 4: Great Expectations + Prometheus metrics. Day 5: Airflow circuit breaker + runbooks. You get a working slice plus a backlog to scale.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
