The Tuesday Morning Dashboard Fire We Never Fought Again: Data Quality Guardrails That Block Bad Data Upstream
Data reliability isn’t a dashboard; it’s a contract, a circuit breaker, and a runbook. Here’s how we stop bad data before it burns your quarter.
“Bad data is worse than no data. Build tripwires that fail fast before the business does.”
The outage you’ve lived through
Two quarters ago, a retail client’s CFO pinged at 7:08 AM: the revenue dashboard was blank. A late CDC update dropped an extra NULL in order_total and the dbt model “helpfully” filtered out the entire dataset. The pipeline ran green. The Looker tiles were empty. Finance paused a promo, marketing held back spend, and the CEO’s staff meeting was chaos. I’ve seen this movie. What actually prevented a repeat wasn’t more observability dashboards—it was a set of guardrails that stopped bad data at the gate.
What follows is the battle-tested pattern we implemented at GitPlumbers: contracts + SLOs, checks at each hop, CI enforcement, orchestration circuit breakers, and SRE-grade alerting. It’s not pretty, but it works.
Why downstream analytics keep failing (and what to fix)
The root causes aren’t mysterious:
- Schema drift: Someone adds a column, changes a type, or sends an unexpected enum. Producers ship; consumers crash quietly.
- Freshness debt: Cron-based ingestion slips; daily dashboards silently use stale data.
- Silent truncation and null creep: Casts, over-aggressive filters, and late arrivals drop records with no alarm.
- AI-generated pipelines: ‘Vibe-coded’ SQL from Copilot that passes the happy path but crumbles at month-end edge cases.
- No contracts, no owners: The data domain, not the platform, owns truth—but nobody wrote it down.
Here’s what actually works:
- Write data contracts with SLOs for your top 10 business-critical datasets.
- Instrument checks where they belong: producer, ingest, transform, warehouse, BI.
- Enforce in CI/CD and orchestration with hard stops.
- Alert on SLO burn with clear runbooks.
- Measure impact: MTTR, failed-orders prevented, forecast accuracy, on-time dashboard availability.
Start with the contract: schema, semantics, SLOs
Stop arguing about “what’s correct” at 7 AM. Agree in advance.
- Schema: data types, nullability, enumerations, PK/FK, partition keys.
- Semantics: business definitions (e.g., `order_total` includes discounts, excludes tax), timeliness expectations, idempotency rules.
- SLOs: measurable targets for freshness, completeness, accuracy, and uniqueness.
A simple contract stored with code (GitOps-style), versioned, and reviewed:
```yaml
# contracts/orders.yaml
name: orders
owner: commerce-data@company.com
schema:
  id: string            # UUID v4, not null
  created_at: timestamp
  customer_id: string
  order_total: decimal(12,2)
  currency: string      # ISO-4217
  status: [PLACED, FULFILLED, CANCELLED]
rules:
  - name: pk_unique
    expression: unique(id)
  - name: non_negative_total
    expression: order_total >= 0
  - name: currency_enum
    expression: currency in ["USD","EUR","GBP"]
slos:
  freshness:
    target: "<= 15m"    # end-to-end
    probe: max(created_at) vs now()
  completeness:
    target: ">= 99.5%"  # per 1h window vs source of truth
  accuracy:
    target: ">= 99.9%"  # reconciliation vs payments
  uniqueness:
    target: ">= 99.99%"
lineage:
  upstream: [kafka://orders.v3]
  downstream: [snowflake://analytics.fact_orders]
```
Bind these SLOs to business KPIs. For this client, hitting 99.5% hourly completeness reduced promo misallocation by 12% because budgets weren’t reacting to phantom troughs.
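The completeness SLO above reduces to a small computation: compare per-window row counts in the warehouse against the source of truth. A minimal sketch, assuming the counts have already been fetched (the window keys and numbers are illustrative):

```python
# Sketch: computing the hourly completeness SLI for the contract above.
# Assumes per-hour row counts are already fetched from the warehouse and the
# source of truth; here they are plain dicts (window -> count).

def completeness_sli(warehouse_counts: dict, source_counts: dict) -> dict:
    """Completeness ratio per hourly window: warehouse rows vs source rows."""
    slis = {}
    for window, expected in source_counts.items():
        observed = warehouse_counts.get(window, 0)
        slis[window] = 1.0 if expected == 0 else min(observed / expected, 1.0)
    return slis

def breaches(slis: dict, target: float = 0.995) -> list:
    """Windows that miss the contract's >= 99.5% completeness target."""
    return sorted(w for w, ratio in slis.items() if ratio < target)

source = {"07:00": 1000, "08:00": 1200}     # payments (source of truth)
warehouse = {"07:00": 1000, "08:00": 1100}  # CDC lag dropped 100 rows
print(breaches(completeness_sli(warehouse, source)))  # -> ['08:00']
```

Emit the per-window ratios as metrics (see the Prometheus section below) so breaches page someone instead of surfacing in a Monday retro.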
Put checks where the breakage starts (not just where it’s visible)
You can’t test quality only in the warehouse and expect resilience. Instrument each hop.
1) Producer and ingest
- Schema enforcement: use Confluent Schema Registry with Avro/Protobuf and compatibility mode `BACKWARD`.
```shell
# Example: enforce backward compatibility on orders topic
confluent schema-registry subject update-config orders-value --compatibility BACKWARD
```
- Reject bad messages with DLQs and metrics.
```properties
# Kafka Connect setting
errors.tolerance=none
errors.deadletterqueue.topic.name=dlq.orders
```
- Volume and lag checks with ksqlDB or Kafka Streams.
```sql
-- ksqlDB anomaly-ish detector: low-traffic alert window
CREATE TABLE order_volume AS
  SELECT WINDOWSTART AS ts, COUNT(*) AS cnt
  FROM orders
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY 1;
```
2) Transform (dbt)
- Schema tests catch nulls, dupes, and broken relationships. Fail the run.
```yaml
# models/fact_orders.yml
version: 2
models:
  - name: fact_orders
    columns:
      - name: id
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: id
      - name: order_total
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```
- Freshness checks using dbt sources.
```yaml
# models/src_orders.yml
version: 2
sources:
  - name: raw
    tables:
      - name: orders
        loaded_at_field: created_at
        freshness:
          warn_after: {count: 10, period: minute}
          error_after: {count: 20, period: minute}
```
3) Warehouse semantic checks (Great Expectations or Soda)
These are great for accuracy and invariants that don’t fit dbt’s schema lens.
```yaml
# great_expectations/checkpoints/fact_orders.yml
name: fact_orders_checkpoint
config_version: 1
class_name: SimpleCheckpoint
validations:
  - batch_request:
      datasource_name: snowflake
      data_asset_name: analytics.fact_orders
    expectation_suite_name: fact_orders_suite
```

```python
# expectations/fact_orders_suite.py
import great_expectations as gx
from great_expectations.core.expectation_configuration import ExpectationConfiguration

context = gx.get_context()
suite = context.create_expectation_suite("fact_orders_suite", overwrite_existing=True)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "order_total", "min_value": 0, "max_value": 100000},
    )
)
context.save_expectation_suite(suite)
```
4) BI layer
- Smoke tests on critical dashboards: run parameterized queries and validate row counts, last-updated times, and key segments present. We use a tiny pytest suite hitting Looker or Metabase APIs.
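The core of such a smoke test is a pure validator you can assert against. A sketch, assuming a hypothetical payload shape (in real life, fetch it from your BI tool’s API and adapt the fields):

```python
# Sketch of a dashboard smoke test. The payload shape, segments, and
# thresholds are illustrative -- adapt to what your BI API actually returns.
from datetime import datetime, timedelta, timezone

def validate_tile(payload: dict, min_rows: int, max_age: timedelta) -> list:
    """Return a list of human-readable failures (empty list == healthy)."""
    failures = []
    if payload["row_count"] < min_rows:
        failures.append(f"row_count {payload['row_count']} < {min_rows}")
    age = datetime.now(timezone.utc) - payload["last_updated"]
    if age > max_age:
        failures.append(f"stale: last updated {age} ago")
    missing = {"US", "EU"} - set(payload["segments"])
    if missing:
        failures.append(f"missing segments: {sorted(missing)}")
    return failures

# pytest-style usage on a healthy tile:
payload = {
    "row_count": 1840,
    "last_updated": datetime.now(timezone.utc) - timedelta(minutes=5),
    "segments": ["US", "EU"],
}
assert validate_tile(payload, min_rows=100, max_age=timedelta(minutes=30)) == []
```

Run it on a schedule just before the 8 AM dashboard SLA so a blank tile pages an engineer, not the CFO.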
Enforce quality in CI/CD and orchestration (fail fast, loudly)
Treat quality gates like build gates.
CI on every model/schema change
- Run `dbt build` + `soda scan` or `great_expectations checkpoint run` on a sampled environment.
```yaml
# .github/workflows/data-ci.yml
name: data-ci
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - name: Install deps
        run: |
          pip install dbt-snowflake great_expectations soda-core-snowflake
      - name: dbt build (sample)
        run: dbt build --profiles-dir profiles --vars '{env: ci}'
      - name: Great Expectations
        run: great_expectations checkpoint run fact_orders
```
Airflow: circuit breaker and lineage-aware retries
- Short-circuit downstream tasks if checks fail. Don’t “keep calm and continue.”
```python
# dags/orders_quality.py
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import ShortCircuitOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

with DAG(
    "orders_pipeline",
    schedule_interval="*/15 * * * *",
    start_date=...,  # set a fixed historical date
    catchup=False,
) as dag:
    ingest = EmptyOperator(task_id="ingest_orders")
    ge_check = GreatExpectationsOperator(
        task_id="ge_fact_orders",
        data_context_root_dir="/opt/airflow/great_expectations",
        checkpoint_name="fact_orders_checkpoint",
        fail_task_on_validation_failure=True,
    )
    stop_on_fail = ShortCircuitOperator(
        task_id="circuit_breaker",
        python_callable=lambda ti: ti.xcom_pull(task_ids="ge_fact_orders")["success"],
    )
    transform = EmptyOperator(task_id="dbt_run")
    publish = EmptyOperator(task_id="publish_metrics")

    ingest >> ge_check >> stop_on_fail >> transform >> publish
```
- Bake lineage into retries: if the upstream freshness SLA is missed, skip instead of retrying forever. We’ve seen Airflow spend five hours reprocessing garbage.
Alert like an SRE: SLO burn, not noise
We convert checks into metrics and alert on burn rate and user impact.
- Expose SLIs to Prometheus using a tiny exporter or pushgateway from your checks.
```text
# /metrics
data_freshness_seconds{dataset="orders"} 340
data_completeness_ratio{dataset="orders",window="1h"} 0.996
data_quality_failures_total{dataset="fact_orders",type="expectation"} 0
```
- Alert on fast burn (severe regression) and slow burn (sustained miss).
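The exposition format above is simple enough that the "tiny exporter" can be a few lines of stdlib Python. A sketch with canned values (in production, derive them from your check results):

```python
# Tiny /metrics exporter sketch using only the stdlib. slis() returns canned
# values here; in production, compute them from GE results / dbt artifacts.
from http.server import BaseHTTPRequestHandler, HTTPServer

def slis() -> dict:
    return {
        'data_freshness_seconds{dataset="orders"}': 340,
        'data_completeness_ratio{dataset="orders",window="1h"}': 0.996,
        'data_quality_failures_total{dataset="fact_orders",type="expectation"}': 0,
    }

def render_metrics(metrics: dict) -> str:
    """Prometheus text exposition format: one 'name{labels} value' per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render_metrics(slis()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("0.0.0.0", 9102), MetricsHandler).serve_forever()
```

Point a Prometheus scrape job at it and the alert rules below work unchanged.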
```yaml
# alerting/data-rules.yml
- alert: OrdersFreshnessFastBurn
  expr: data_freshness_seconds{dataset="orders"} > 1200
  for: 5m
  labels: { severity: page }
  annotations:
    summary: Orders freshness SLO fast-burn
    runbook: https://runbooks.company.com/orders-freshness
- alert: FactOrdersCompletenessSlowBurn
  expr: avg_over_time(data_completeness_ratio{dataset="orders"}[2h]) < 0.995
  for: 30m
  labels: { severity: ticket }
  annotations:
    summary: Completeness slow-burn; investigate CDC delay
```
- Route alerts to Slack/Jira with clear owners and escalation. No mystery alerts.
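The fast/slow thresholds come from simple burn-rate math: how quickly are you spending the error budget, i.e. 1 minus the SLO target? A sketch, assuming the 99.5% completeness target from the contract:

```python
# Burn rate = observed error rate / error budget. A rate of 1.0 spends the
# budget exactly at the allowed pace; anything >1 will blow the SLO.
def burn_rate(sli: float, slo_target: float) -> float:
    return (1.0 - sli) / (1.0 - slo_target)

# Against a 99.5% completeness target: an hour at 99.0% burns budget 2x too
# fast, and 98.0% burns it 4x too fast (the page-worthy fast burn).
print(round(burn_rate(0.99, 0.995), 1))  # -> 2.0
print(round(burn_rate(0.98, 0.995), 1))  # -> 4.0
```

Page on high burn over short windows, ticket on modest burn over long windows; that split is what keeps the pager quiet without hiding sustained misses.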
If you don’t include a runbook link and a single accountable owner per dataset, you’re just generating anxiety, not reliability.
Measure outcomes: prove the value or it won’t last
Here’s what changed for that retail client in four weeks:
- On-time dashboard availability (8 AM daily) moved from 92% to 99.4%.
- MTTR for data incidents dropped from ~8h to 45m.
- Promo budget misfires decreased by 12% due to timely and complete revenue data.
- Analyst rework for “why is this number wrong?” tickets fell by 40%.
We reviewed SLOs in a weekly 30-minute ops review with product, analytics, and platform. The business now funds contracts as first-class backlog items. That’s the unlock.
Pitfalls and what actually works
I’ve seen these fail repeatedly:
- Observability-only plays: Adding Monte Carlo/Soda without contracts turns into pretty charts about chaos you already knew existed.
- All-or-nothing MDM: You don’t need a year-long master data project to stop nulls today.
- ‘Allow nulls then fix later’: Later never comes.
- AI-generated SQL in prod without tests: vibe code belongs behind feature flags and staging checks.
And here’s what works consistently:
- Contract the top 10 datasets tied to money and customers first.
- Start with dbt tests + source freshness, then add Great Expectations for semantics.
- Wire a circuit breaker in orchestration so bad data doesn’t flow downhill.
- Expose SLIs to Prometheus and alert on burn with runbooks.
- Make owners explicit: a distribution list is not ownership.
If you’re underwater and need it done yesterday, GitPlumbers drops in with a contract starter kit, dbt/GE templates, Airflow wiring, and a one-week pilot that leaves behind pipelines you can actually maintain.
Key takeaways
- Data quality that prevents failures begins with explicit contracts and SLOs, not ad-hoc checks.
- Instrument checks where issues originate: at the producer, ingest, transform, and warehouse layers.
- Automate enforcement via CI/CD and orchestration with fail-fast circuit breakers.
- Alert on SLO burn rate with runbooks to reduce MTTR and avoid alert fatigue.
- Tie reliability metrics to business KPIs so leadership actually cares and funds the work.
Implementation checklist
- Define data contracts (schema + semantics + SLOs) for high-impact datasets.
- Implement dbt tests and Great Expectations checks for core models.
- Wire Airflow/Orchestration with short-circuit stops on quality failures.
- Expose Prometheus metrics for data SLIs and create actionable alerts.
- Version, review, and test contracts in CI before schema changes reach prod.
- Add runbooks with owners, escalation paths, and rollback steps.
- Review reliability metrics in weekly ops with product and analytics leaders.
Questions we hear from teams
- What if our data producers won’t agree to a contract?
- Start at the consumer boundary. Define the contract and enforce it at ingest with schema validation, DLQs, and compatibility checks. Publish breaches as metrics tied to business impact. Once producers see their changes page someone, they’ll show up to the table.
- We already have Monte Carlo/Soda. Isn’t that enough?
- Those are great, but they’re observability. Without explicit contracts and enforcement in CI/orchestration, you’re watching failures, not preventing them. Use them to power SLIs and runbooks.
- How do we avoid alert fatigue?
- Alert on SLO burn rates (fast and slow), not every failed check. Group alerts by dataset with a single owner/runbook. Add quiet hours for non-critical datasets. Review noise weekly and prune.
- Can we do this in streaming?
- Yes. Enforce schemas via Schema Registry, route invalid events to DLQs, and run windowed completeness/freshness checks with Kafka Streams or Flink. Expose the same SLIs and wire the same alerting.
- What does the first week with GitPlumbers look like?
- Day 1-2: pick top 3 datasets, draft contracts. Day 3: dbt tests + source freshness. Day 4: Great Expectations + Prometheus metrics. Day 5: Airflow circuit breaker + runbooks. You get a working slice plus a backlog to scale.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
