The Night the CFO’s Dashboard Went Dark: Building Data Quality Gates That Actually Prevent Analytics Failures
If your “source of truth” is lying by 2%, the business will find out before you do. Here’s how to build data reliability into the pipeline so downstream analytics don’t crater at 2 a.m.
Small outages are cheaper than silent corruption. Fail fast, quarantine, and publish with confidence.
The incident you’ve lived through
Quarter-end. The CFO’s dashboard shows revenue off by a few points. Finance swears sales is fine. BI blames “lag.” Ops says the pipelines are green. I’ve been in that war room. In one case, an upstream partner renamed amount_usd to amount_cents, nulls exploded in a Dim join, and a late-arriving partition overwrote yesterday’s good data. Nothing “failed”—it just silently lied.
If you rely on hope and SQL duct tape, this will keep happening. What actually works is treating data reliability like SRE treats production: contracts, SLOs, quality gates, lineage, and incident response. Build guardrails that prevent bad data from ever reaching downstream analytics—before the CFO notices.
What to monitor (and why it pays the bills)
Data quality that protects analytics boils down to a few SLIs you can measure and enforce with SLOs:
- Freshness: time since last successful, validated load (`p99 < 30m` during business hours for operational dashboards)
- Completeness: expected rows/fields present (e.g., `>= 99.5%` of expected orders per hour)
- Validity: values match constraints (e.g., `amount > 0`, `status IN ('paid','refunded')`)
- Uniqueness: keys are unique (`order_id`, composite keys)
- Consistency: referential integrity and distribution stability (e.g., `country_code` conforms, join coverage `>= 99.9%`)
Tie these to real commitments:
- Finance close: no late partitions beyond 6 hours; completeness `>= 99.9%` for `orders` by noon UTC.
- Marketing spend: spend attribution table freshness `<= 15m` with `> 99%` click-to-order join coverage.
- Exec dashboards: p95 time-to-fix for data incidents (MTTR) under 2 hours.
You don’t need 500 checks. You need the right ones on the tables that move revenue and risk.
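To make these SLIs concrete, here’s a minimal sketch of computing freshness and completeness for an `orders` table. The `run_scalar` helper, the `loaded_at` column, and the Snowflake-flavored SQL are all assumptions; swap in your warehouse client and schema.

```python
# Sketch: compute freshness and completeness SLIs for `orders`.
# `run_scalar(sql)` is a hypothetical helper that runs a query and returns one value;
# table/column names and the expected-orders baseline are illustrative.
FRESHNESS_SLO_SECONDS = 30 * 60   # p99 < 30m during business hours
COMPLETENESS_SLO = 0.995          # >= 99.5% of expected orders per hour

def orders_slis(run_scalar, expected_orders_last_hour: int) -> dict:
    # Seconds since the most recent load (Snowflake-flavored SQL; adjust for your warehouse).
    freshness_lag = run_scalar(
        "SELECT DATEDIFF('second', MAX(loaded_at), CURRENT_TIMESTAMP()) FROM raw.orders"
    )
    loaded_last_hour = run_scalar(
        "SELECT COUNT(*) FROM raw.orders WHERE loaded_at >= DATEADD('hour', -1, CURRENT_TIMESTAMP())"
    )
    completeness = loaded_last_hour / max(expected_orders_last_hour, 1)
    return {
        "freshness_lag_seconds": freshness_lag,
        "freshness_ok": freshness_lag < FRESHNESS_SLO_SECONDS,
        "completeness": completeness,
        "completeness_ok": completeness >= COMPLETENESS_SLO,
    }
```

Wire the two booleans into your alerting and publish gates; everything else is reporting.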
Design the guardrails: contracts, gates, quarantine, lineage
I’ve seen teams bolt on a data observability tool and call it a day. Then the tool pages at 3 a.m. and no one knows whether to stop the pipeline or ship. Here’s the design that works:
Data contracts at boundaries
- Define schemas and semantics for upstream producers (API, event topics, CSV drops). Enforce them strictly at ingestion.
- Version schemas; refuse unknown fields by default. Keep a `v1`/`v2` co-existence window if needed.
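One way to make “refuse unknown fields by default” concrete is strict validation in the contract model itself; a minimal pydantic (v1 API) sketch with illustrative field names follows, and the fuller contract example appears in the implementation section below.

```python
# Sketch: strict event contract that rejects unknown fields (pydantic v1 API).
# Field names are illustrative.
from pydantic import BaseModel, ValidationError

class OrderEventV1(BaseModel):
    order_id: str
    amount_cents: int

    class Config:
        extra = "forbid"  # unknown fields fail validation instead of slipping through silently

try:
    OrderEventV1(order_id="o-12345678", amount_cents=1299, amount_usd=12.99)
except ValidationError as err:
    # A renamed or added upstream field surfaces here, at ingestion, not in a 2 a.m. dashboard.
    print(err)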
Quality gates at three choke points
- Before transforms (raw to bronze): reject or quarantine rows that fail basic validity.
- Before joins (silver): enforce uniqueness and referential integrity.
- Before serving (gold/BI): final aggregate sanity and distribution checks; block publish if violated.
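Here’s a minimal sketch of that last gate, assuming a hypothetical `run_scalar` warehouse helper and illustrative table names: compare today’s gold aggregate to yesterday’s and block the publish on a suspicious swing.

```python
# Sketch: pre-publish sanity gate on the gold layer.
# `run_scalar(sql)` is a hypothetical warehouse helper; table/column names are illustrative.
def gate_gold_revenue(run_scalar, tolerance: float = 0.05) -> None:
    today = run_scalar(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM gold.fct_orders WHERE order_date = CURRENT_DATE"
    )
    yesterday = run_scalar(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM gold.fct_orders WHERE order_date = CURRENT_DATE - 1"
    )
    if yesterday and abs(today - yesterday) / yesterday > tolerance:
        # Block the publish task; a human or an automated rollback decides what ships.
        raise RuntimeError(
            f"Gold revenue moved {abs(today - yesterday) / yesterday:.1%} day-over-day; publish blocked"
        )
```

Tune the tolerance per table; the point is that a violated check stops the publish task instead of updating the dashboard.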
Quarantine & backfill
- Route bad rows to a `quarantine` table with reason codes; don’t delete. Fix upstream or patch, then backfill.
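A minimal sketch of that routing step, assuming pandas DataFrames in the bronze job and a hypothetical `write_table` loader:

```python
# Sketch: split a bronze batch into clean rows and quarantined rows with reason codes.
# Assumes pandas; `write_table(name, df)` is a hypothetical warehouse loader.
import pandas as pd

def split_quarantine(batch: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    reasons = pd.Series("", index=batch.index)
    reasons[batch["order_id"].isna()] += "missing_order_id;"
    reasons[batch["amount_cents"].le(0)] += "non_positive_amount;"
    reasons[~batch["status"].isin(["paid", "refunded"])] += "invalid_status;"

    bad = batch[reasons != ""].assign(quarantine_reason=reasons[reasons != ""])
    good = batch[reasons == ""]
    return good, bad

def load_bronze(batch: pd.DataFrame, write_table) -> None:
    good, bad = split_quarantine(batch)
    write_table("bronze.orders", good)
    write_table("quarantine.orders", bad)  # keep the evidence; report weekly by source
```

The reason codes are what make the weekly “which upstream keeps breaking the contract” report possible.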
Lineage & ownership
- Capture lineage with OpenLineage and surface it in DataHub/OpenMetadata; every dataset must have an owner, SLOs, and a runbook.
Fail fast with clear blast radius
- If an SLO is violated, fail the job early and don’t feed downstream tables. Small outages are cheaper than silent corruption.
This is how Netflix, Airbnb, and Shopify describe their internal patterns (publicly): contracts, SLOs, and aggressive gating beat “monitor everything” vanity graphs.
Concrete implementation (dbt, GE/Soda, Airflow/Dagster, Snowflake/BigQuery)
You don’t need to boil the ocean. Here’s a practical stack that works on Snowflake, BigQuery, or Databricks.
- Contracts at ingestion
```python
# python - pydantic schema for an event, used at ingestion (e.g., Databricks Autoloader or Kafka consumer)
from pydantic import BaseModel, conint, constr, validator
from datetime import datetime

class OrderEvent(BaseModel):
    order_id: constr(min_length=8)
    user_id: constr(min_length=8)
    amount_cents: conint(gt=0)
    currency: constr(regex=r"^[A-Z]{3}$")
    status: constr(regex=r"^(paid|refunded)$")
    created_at: datetime

    @validator('currency')
    def usd_only(cls, v):
        # example: temporary contract: only USD supported this quarter
        if v != 'USD':
            raise ValueError('currency != USD not supported')
        return v
```
- dbt tests for silver/gold
```yaml
# models/orders.yml
version: 2
models:
  - name: fct_orders
    description: Fact table for paid/refunded orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [order_id]
      - dbt_utils.expression_is_true:
          expression: "amount_cents > 0"
    columns:
      - name: user_id
        tests:
          - relationships:
              name: fk_user_id_exists
              to: ref('dim_users')
              field: user_id
      - name: status
        tests:
          - accepted_values:
              values: ['paid', 'refunded']
```
- Great Expectations (or Soda) for distribution drift
```python
# expectations/order_amounts.py - legacy (v2-style) Great Expectations API
import great_expectations as ge

def validate_order_amounts(orders_df):
    dataset = ge.from_pandas(orders_df)
    # 99.9% of order amounts must land in a sane range (values are in cents)
    dataset.expect_column_values_to_be_between(
        "amount_cents", min_value=100, max_value=5000000, mostly=0.999
    )
    # the paid/refunded split must stay close to the expected distribution
    dataset.expect_column_kl_divergence_to_be_less_than(
        "status",
        partition_object={"values": ["paid", "refunded"], "weights": [0.97, 0.03]},
        threshold=0.05,
    )
    return dataset.validate()
```
- SodaCL quick syntax
```yaml
# soda/checks/orders.yml
checks for orders:
  - row_count > 0
  - schema:
      fail:
        when required column missing: [order_id, user_id, amount_cents, status]
  - missing_count(status) = 0
  - duplicate_count(order_id) = 0
  - invalid_count(currency) = 0:
      valid values: [USD]
  # distribution checks need soda-core-scientific plus a distribution reference file
  - distribution_difference(status) > 0.05:
      method: chi_square
      distribution reference file: ./orders_distribution_reference.yml
```
- Gate in Airflow
```python
# dags/orders_quality.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG('orders_pipeline', start_date=datetime(2024, 1, 1), schedule_interval='*/15 * * * *', catchup=False) as dag:
    load_raw = PythonOperator(task_id='load_raw', python_callable=ingest_orders)
    ge_check = PythonOperator(task_id='ge_check', python_callable=run_ge_suite)  # fail DAG on violation
    transform = PythonOperator(task_id='transform', python_callable=run_dbt)
    publish = PythonOperator(task_id='publish', python_callable=publish_gold)

    load_raw >> ge_check >> transform >> publish
```
- Or gate in Dagster assets
```python
# dagster asset with expectations
from dagster import asset, Output, ExpectationResult

@asset
def silver_orders(raw_orders):
    df = raw_orders.dropna(subset=['order_id', 'user_id', 'amount_cents'])
    passed = df['amount_cents'].gt(0).all() and df['order_id'].is_unique
    yield ExpectationResult(success=passed, description="positive amounts and unique order_id")
    if not passed:
        # actually gate: fail the step so downstream assets never materialize from bad data
        raise ValueError("silver_orders failed its quality gate")
    yield Output(df)
```
- Warehouse constraints
```sql
-- Snowflake: declare constraints at the table level where possible.
-- Note: Snowflake enforces only NOT NULL; PRIMARY KEY is informational and CHECK constraints
-- aren't supported, so value rules like amount_cents > 0 belong in the dbt/GE/Soda gates above.
ALTER TABLE FCT_ORDERS ADD CONSTRAINT PK_ORDER_ID PRIMARY KEY (ORDER_ID);
ALTER TABLE FCT_ORDERS ALTER COLUMN AMOUNT_CENTS SET NOT NULL;
```
- Table versioning and safe deletes (lakehouse)
```sql
-- Iceberg/Delta: keep history to enable rollback and time travel
-- (Iceberg table property shown; 604800000 ms = 7 days of snapshots)
ALTER TABLE orders SET TBLPROPERTIES ('history.expire.max-snapshot-age-ms'='604800000');
```
These aren’t academic. They’re cheap, fast to implement, and stop entire classes of failures before BI sees them.
Alerting, SLOs, and incident response like an SRE team
Dashboards don’t fix themselves. Instrument SLIs and wire alerts the way you would for an API.
- Metrics: emit freshness lag, row counts, failure counts as Prometheus metrics or push to Grafana Cloud.
- Alerting: page on breach of SLO windows, not on single job failures. Route to on-call via PagerDuty.
- Runbooks: every alert links to a wiki page: known breakages, backfill scripts, who to call upstream.
- Auto-remediation: quarantine partition, roll back gold view to last good snapshot, open a Jira + Slack channel.
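The example rule below keys off a gauge named `orders_last_success_timestamp_seconds`, so the load job has to publish it somewhere Prometheus can scrape. Here’s a minimal sketch using `prometheus_client` and a Pushgateway; the gateway address and job name are assumptions.

```python
# Sketch: publish freshness/row-count gauges after a successful, validated load.
# Assumes a Prometheus Pushgateway at pushgateway:9091; metric and job names are illustrative.
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_load_metrics(rows_loaded: int) -> None:
    registry = CollectorRegistry()
    last_success = Gauge(
        "orders_last_success_timestamp_seconds",
        "Unix time of the last successful, validated orders load",
        registry=registry,
    )
    row_count = Gauge(
        "orders_rows_loaded", "Rows loaded in the last successful batch", registry=registry
    )
    last_success.set(time.time())
    row_count.set(rows_loaded)
    push_to_gateway("pushgateway:9091", job="orders_pipeline", registry=registry)
```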
Example Prometheus rule:
```yaml
# prometheus/alerts.yml
groups:
  - name: data-quality
    rules:
      - alert: OrdersFreshnessSLOBreached
        expr: (time() - orders_last_success_timestamp_seconds) > 1800
        for: 10m
        labels:
          severity: page
        annotations:
          summary: Orders freshness SLO breached (>30m)
          runbook: https://internal/wiki/orders-freshness
```
Lineage helps triage. With OpenLineage feeding DataHub or OpenMetadata, you can see exactly which dashboards depend on fct_orders and decide whether to hide or freeze them automatically until the fix lands.
Results you can actually measure
When we implement this pattern at clients, a typical 60–90 day outcome looks like:
- 90% fewer CFO surprises: no silent corruption in top-10 tables; issues are quarantined and visible.
- MTTR down from 8h to <1.5h: on-call can locate the blast radius in minutes with lineage + runbooks.
- Alert noise down 70%: we alert on SLO violations, not every transient task blip.
- Backfill confidence up: versioned tables, consistent contracts, and tested transforms mean backfills don’t break downstream.
It’s not magic—it’s engineering guardrails plus ruthless scoping to what the business actually uses.
Common failure modes (and how to avoid them)
I’ve watched teams repeat these mistakes:
- “Monitor everything” syndrome: 1,000 checks and zero accountability. Fix: monitor the top-10 tables that drive decisions; each has an owner and SLOs.
- Flaky tests: nondeterministic checks cause alert fatigue. Fix: stabilize with deterministic windows and canary samples; only promote checks after a week burn-in.
- Slow, expensive checks: scanning petabytes hourly for a null count is wasteful. Fix: partitioned checks, sampling, `approx_count_distinct`, or BigQuery `INFORMATION_SCHEMA` metadata checks (see the sketch after this list).
- Schema drift from AI-generated code: I’ve seen “vibe coding” ETLs silently introduce new columns or types. Fix: enforce contracts in CI, run data linters on PR, and block merges if contracts change without review.
- No quarantine: dropping bad rows hides systemic issues. Fix: quarantine with reason codes and track by source; report weekly to upstream owners.
- Ownership gaps: “Who owns customers_dim?” If you don’t know, that’s why trust is low. Fix: publish ownership in metadata and on dashboards.
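For the “slow, expensive checks” fix above, here’s a minimal sketch of a metadata-only freshness/row-count check against BigQuery’s `INFORMATION_SCHEMA.PARTITIONS`; the project and dataset names are placeholders.

```python
# Sketch: freshness/row-count check from BigQuery partition metadata - no table scan.
# Replace my-project.analytics with your project and dataset; `orders` is illustrative.
from google.cloud import bigquery

def latest_orders_partition(client: bigquery.Client):
    query = """
        SELECT partition_id, total_rows, last_modified_time
        FROM `my-project.analytics.INFORMATION_SCHEMA.PARTITIONS`
        WHERE table_name = 'orders'
        ORDER BY last_modified_time DESC
        LIMIT 1
    """
    row = next(iter(client.query(query).result()), None)
    return row  # compare row.total_rows and row.last_modified_time against your SLOs
```

This reads catalog metadata instead of the table itself, so it costs effectively nothing to run every few minutes.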
A pragmatic rollout plan (30/60/90)
You can do this without a platform committee or a seven-figure tool spend.
First 30 days
- Pick top 10 revenue/risk tables. Write SLIs/SLOs. Assign owners and create runbooks.
- Add 5 checks/table (uniqueness, nulls, referential integrity, value ranges, freshness).
- Gate pipelines in Airflow/Dagster on failures; enable quarantine tables.
Days 31–60
- Add distribution drift checks to the top 3 joins that tend to break.
- Wire Prometheus/Grafana/PagerDuty and connect alerts to runbooks.
- Deploy lineage (OpenLineage + DataHub/OpenMetadata). Start tagging dependents.
Days 61–90
- Expand contracts upstream; require schema versioning for producers.
- Automate backfill jobs with safe roll-forward/rollback.
- Quarterly review of SLOs and alert noise; prune, tune, and promote canary checks.
If you only do one thing this quarter: put a quality gate before your BI publish step and fail when SLIs are out of SLO. You’ll prevent the midnight “why is revenue down” call.
Key takeaways
- Quality gates must run before transformations, at joins, and before serving to downstream analytics.
- Define SLIs/SLOs for freshness, completeness, validity, uniqueness, and consistency—tie them to business commitments.
- Use data contracts at upstream boundaries; fail fast and quarantine bad data instead of shipping silent corruption.
- Automate tests in CI (dbt, GE, Soda); gate Airflow/Dagster jobs on these checks; alert through Prometheus/Grafana or PagerDuty.
- Make lineage and ownership first-class: every dataset needs an owner, runbook, and measurable SLOs.
- Start small: top 10 tables by business impact, 5 checks per table, and a 90-day rollout that sticks.
Implementation checklist
- Map top 10 business-critical tables and define owners and SLOs.
- Implement data contracts at ingestion with strict schema enforcement.
- Add dbt/GE/Soda tests for uniqueness, nulls, referential integrity, and distribution drift.
- Gate pipelines in Airflow/Dagster to quarantine or halt on failure.
- Instrument SLIs as Prometheus metrics and wire alerts to on-call.
- Adopt lineage (OpenLineage + DataHub/OpenMetadata) and publish runbooks.
- Pilot canary checks on a 1% sample before scaling to full volumes.
- Review alert noise monthly; adjust thresholds and auto-remediation steps.
Questions we hear from teams
- What’s the minimum viable setup to stop most analytics outages?
- Add contracts at ingestion, 5 dbt/GE/Soda checks on your top 10 tables, and a publish gate that fails if SLIs breach SLOs (freshness, completeness, uniqueness). Wire alerts to PagerDuty with a runbook. That gets you 80% of the benefit quickly.
- Do we need to buy a data observability tool?
- Not necessarily. Start with dbt/GE/Soda + OpenLineage + Grafana/Prometheus. If your estate is massive or multi-tenant, tools like Monte Carlo or Databand can help with automated anomaly detection and ownership workflows.
- How do we handle schema changes from upstream teams?
- Use versioned data contracts. Default to additive changes only; require explicit review for breaking changes. Implement a CI check that blocks merges when contracts change without version bumps. Run both v1 and v2 during a migration window and backfill.
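A minimal sketch of that CI check, assuming contracts live as JSON files under `contracts/` with a top-level `version` field (paths and field names are assumptions):

```python
# Sketch: fail CI when a contract file changes without a version bump.
# Assumes contracts/*.json each carry a top-level "version" field; run in the PR pipeline.
import json
import subprocess
import sys

def changed_contracts(base_ref: str = "origin/main") -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=d", base_ref, "--", "contracts/"],
        capture_output=True, text=True, check=True,
    )
    return [p for p in out.stdout.splitlines() if p.endswith(".json")]

def version_at(ref: str, path: str) -> str:
    shown = subprocess.run(["git", "show", f"{ref}:{path}"], capture_output=True, text=True)
    if shown.returncode != 0:
        return ""  # brand-new contract file; nothing to compare against
    return json.loads(shown.stdout).get("version", "")

if __name__ == "__main__":
    failures = []
    for path in changed_contracts():
        old_version = version_at("origin/main", path)
        with open(path) as f:
            new_version = json.load(f).get("version", "")
        if old_version and old_version == new_version:
            failures.append(f"{path}: contract changed but version stayed at {new_version or 'unset'}")
    if failures:
        print("\n".join(failures))
        sys.exit(1)
```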
- We have AI-generated ETL code that keeps drifting. Can this help?
- Yes. Enforce contracts in CI, add schema compatibility checks, and run quality suites on every PR. GitPlumbers does AI code refactoring and vibe code cleanup to make AI-generated code safe before it hits production pipelines.
- How do we keep costs in check for large datasets?
- Use partition pruning, metadata-based checks (row counts from `_PARTITIONS_SUMMARY`/`INFORMATION_SCHEMA`), approximate functions, and canary sampling. Only run full scans on nightly windows or after anomalies.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
