Circuit Breakers for Data: Quality Monitoring That Stops Bad Loads Before They Wreck Analytics
You don’t need another dashboard about your dashboards. You need guardrails that block bad data before it hits finance, product, or ML.
“Quality monitoring without enforcement is a pager with the ringer off.”
The quarter-end dashboard outage you’ve lived through
I was in the war room at 6:12am when the CFO’s revenue dashboard went blank. Overnight, a partner shipped an extra column and “helpfully” changed order_total from cents to dollars. Our ELT dutifully loaded it, dbt transformed it, and Looker happily graphed fiction. No red lights, just bad charts. Engineering apologized. Finance missed guidance. I’ve seen this movie at unicorns and Fortune 100s.
Here’s the punchline: monitoring existed. It just didn’t prevent anything. Logs were green. The “quality” dashboard was stale. Alerts went to a dead Slack channel. What actually works is putting circuit breakers on data—guardrails that stop bad loads before they land in downstream analytics.
What “preventing failures” actually means
Preventing failures is not “alert at 3am.” It’s designing the system so bad data can’t advance.
- SLOs for data products: Treat each table/model as a product with published SLOs:
  - Freshness: 99.9% of hours within 15 minutes of source event time.
  - Completeness: 99% of days with row count within 10% of forecast.
  - Accuracy: Business rule conformance (e.g., `order_total = sum(line_items)` within 0.1%).
- Data contracts: Explicit schemas and semantics at ingress. Any drift requires a version bump and migration, not a surprise at 2am.
- Circuit breakers: Pipeline gates that fail fast when checks fail. No “warn-only” for tier-1 data.
- Observability with ownership: Metrics in Prometheus/Grafana, on-call mapped to data owners, and alerts tied to SLO budget burn, not one-off blips.
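To make "SLO budget burn" concrete, here is a minimal sketch (the helper name and shape are mine, not from any library) that scores hourly freshness samples against the 15-minute target and reports how much of a 99.9% error budget the window has consumed:

```python
def freshness_slo_burn(lag_minutes_by_hour, target_lag=15, slo=0.999):
    """Score hourly freshness-lag samples against a freshness SLO.

    Returns (compliance_ratio, budget_burn); budget_burn > 1.0 means
    the error budget for this window is exhausted -- time to page.
    """
    total = len(lag_minutes_by_hour)
    good = sum(1 for lag in lag_minutes_by_hour if lag <= target_lag)
    compliance = good / total
    error_budget = 1.0 - slo  # e.g. only 0.1% of hours may be late
    burn = (1.0 - compliance) / error_budget
    return compliance, burn
```

Alerting on `burn` rather than on a single late hour is what keeps the pager quiet during one-off blips.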
If that sounds like SRE for data, it is. We’ve applied it at GitPlumbers across Snowflake, BigQuery, Databricks, and Postgres warehouses. It works.
The checks that matter (and the ones that don’t)
Skip the vanity metrics. Focus on signals that correlate with business outages:
- Schema drift: Columns added/removed/typed differently. Contract violation => block.
- Freshness: `max(event_time)` lag vs. now. If it's late, block dependent jobs.
- Volume/Completeness: Row deltas vs. baselines. Spike or cliff? Block.
- Nulls/Uniqueness: Primary keys must be unique/non-null; critical dims not null.
- Referential integrity: Fact foreign keys exist in dim tables.
- Distribution/Anomalies: Value ranges, z-scores vs rolling window. Catch unit flips and skew.
- Business rules: Recompute invariants in SQL. E.g., totals match component sums, currency codes valid.
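As a sketch of the distribution check, a rolling z-score against a trailing baseline is enough to catch a cents-to-dollars unit flip (the function and numbers below are illustrative, not from any specific tool):

```python
import statistics

def zscore_alert(history, today, threshold=3.0):
    """Flag today's value if it sits more than `threshold` standard
    deviations from the trailing baseline."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample stdev of the baseline window
    z = (today - mean) / stdev
    return abs(z) > threshold, z

# 28 days of daily-mean order_total in cents, with weekly seasonality
baseline = [5000 + (i % 7) * 40 for i in range(28)]
# A cents->dollars flip shows up as a ~100x drop in the daily mean
flagged, z = zscore_alert(baseline, today=51.2)
```

Deterministic rules catch known failure modes; a check like this catches the ones nobody wrote a rule for yet.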
Things I’ve seen add noise:
- Counting “failed Airflow tasks” as a quality metric. That’s availability, not correctness.
- One-off anomaly detectors without baselines. They page constantly and erode trust.
- Screenshot tests of BI dashboards. Fun demo, little signal.
Build it with tools you already run
You don’t need to buy a platform tomorrow. Use dbt, SQL, and either Soda Core or Great Expectations. Keep checks as code, versioned with the model.
- dbt tests for schema + primary constraints

models/schema.yml (generic tests belong in a schema file, not dbt_project.yml):

```yaml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: currency
        tests:
          - accepted_values:
              values: ['USD', 'EUR', 'GBP']
```

- Freshness + completeness in SQL (Snowflake/BigQuery)
```sql
-- freshness_check.sql
SELECT
  TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_time), MINUTE) AS lag_minutes
FROM raw.orders_stream;
```
```sql
-- completeness_check.sql (compare to 7-day moving average)
WITH baseline AS (
  SELECT AVG(daily_rows) AS avg_rows
  FROM (
    SELECT DATE(event_time) AS d, COUNT(*) AS daily_rows
    FROM raw.orders_stream
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY d
  )
)
SELECT
  COUNT(*) AS today_rows,
  (SELECT avg_rows FROM baseline) AS avg_rows
FROM raw.orders_stream
WHERE DATE(event_time) = CURRENT_DATE();
```

- Soda Core for distributions and freshness gates
```yaml
# soda/scan.yml
checks for orders_clean:
  - freshness(event_time) < 15m
  - row_count > 0
  # For row-count-vs-baseline, use a change-over-time check (Soda Cloud)
  # or the SQL baseline query above
  - missing_count(order_total) = 0
  - duplicate_count(order_id) = 0
  - invalid_count(currency) = 0:
      valid values: ['USD', 'EUR', 'GBP']
  # Distribution guardrails: catch unit flips and skew
  - avg(order_total) between 40 and 80
  - stddev(order_total) between 5 and 50
```

Run it in CI and in prod:

```bash
soda scan -d warehouse -c soda/config.yml soda/scan.yml
```

- Great Expectations if you prefer Python
```python
# expectations/orders_clean_expectations.py (legacy GE Dataset API)
from great_expectations.dataset import PandasDataset

class OrdersClean(PandasDataset):
    _data_asset_type = "OrdersClean"

    def run_quality_checks(self, baseline_partition):
        # Each expectation returns a result object with a `success` flag
        results = [
            self.expect_column_values_to_not_be_null("order_id"),
            self.expect_column_values_to_be_unique("order_id"),
            self.expect_column_values_to_be_in_set("currency", ["USD", "EUR", "GBP"]),
            self.expect_column_values_to_be_between("order_total", 0, 10000),
            # baseline_partition comes from a profiled reference sample, e.g.
            # great_expectations.dataset.util.build_continuous_partition_object
            self.expect_column_kl_divergence_to_be_less_than(
                column="order_total",
                partition_object=baseline_partition,
                threshold=0.3,
            ),
        ]
        return all(r.success for r in results)
```

- Data contracts at ingestion
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schemas.acme.com/orders/1.2.0",
  "type": "object",
  "properties": {
    "order_id": {"type": "string"},
    "event_time": {"type": "string", "format": "date-time"},
    "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
    "order_total_cents": {"type": "integer", "minimum": 0}
  },
  "required": ["order_id", "event_time", "currency", "order_total_cents"],
  "additionalProperties": false
}
```

Validate at the edge (Kafka consumers, CDC jobs) with confluent-schema-registry or fastjsonschema and reject payloads that don't match. No silent coercions.
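In production you would compile the schema with fastjsonschema or enforce it via Schema Registry; as a dependency-free sketch of what the contract gate does at the consumer (names and dead-letter shape are illustrative):

```python
REQUIRED = {
    "order_id": str,
    "event_time": str,
    "currency": str,
    "order_total_cents": int,
}
VALID_CURRENCIES = {"USD", "EUR", "GBP"}

def validate_order(payload):
    """Return a list of contract violations; an empty list means pass."""
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in payload:
            errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], ftype):
            errors.append(f"wrong type for {field}")
    if isinstance(payload.get("order_total_cents"), int) and payload["order_total_cents"] < 0:
        errors.append("order_total_cents must be >= 0")
    if payload.get("currency") not in VALID_CURRENCIES:
        errors.append("invalid currency")
    extras = set(payload) - set(REQUIRED)
    if extras:  # additionalProperties: false -- the surprise-column case
        errors.append(f"unexpected fields: {sorted(extras)}")
    return errors

def consume(payload, dead_letter):
    errors = validate_order(payload)
    if errors:
        dead_letter.append({"payload": payload, "errors": errors})
        return False  # reject at ingest -- no silent coercion in transform
    return True
```

Note the `additionalProperties` check: it is exactly what would have caught the partner's "helpful" extra column before it reached the warehouse.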
Gate the pipeline in your orchestrator
Here’s where most teams fail: they run checks but don’t enforce them. Gate promotions in Airflow, Dagster, or Prefect.
- Airflow example with a ShortCircuit gate

```python
import json

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

with DAG(
    dag_id="orders_quality",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --select orders_clean",
    )
    # -srf writes scan results as JSON; soda also exits non-zero on
    # failed checks, which fails this task outright
    soda_check = BashOperator(
        task_id="soda_scan",
        bash_command=(
            "soda scan -d warehouse -c soda/config.yml "
            "-srf /tmp/soda_result.json soda/scan.yml"
        ),
    )

    def pass_if_quality_ok():
        # Simplified: inspect the Soda scan results written by the previous task
        with open("/tmp/soda_result.json") as f:
            result = json.load(f)
        return not result.get("hasFailures", True)

    gate = ShortCircuitOperator(
        task_id="quality_gate",
        python_callable=pass_if_quality_ok,
    )
    promote = BashOperator(
        task_id="promote_model",
        bash_command="dbt run --select orders_prod && dbt source freshness",
    )

    run_dbt >> soda_check >> gate >> promote
```

Canary, then promote
- Stage data into `orders_canary` from a 5–10% sample.
- Run the same checks; if they pass, swap to `orders_prod` via a view swap, `ALTER TABLE ... SWAP WITH` (Snowflake), or `CREATE OR REPLACE VIEW` (BigQuery).
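A sketch of that check-then-swap step, assuming a `run_sql` callable bound to your warehouse (Snowflake syntax shown; the function and table names are illustrative):

```python
def promote_if_clean(run_sql, checks, canary="orders_canary", prod="orders_prod"):
    """Run quality checks against the canary table and atomically swap it
    into place only when every check passes.

    `checks` maps a check name to a callable taking the table name and
    returning True/False; `run_sql` executes a DDL statement.
    """
    failures = [name for name, check in checks.items() if not check(canary)]
    if failures:
        # Leave prod untouched; the canary holds the bad load for debugging
        return {"promoted": False, "failures": failures}
    # Snowflake: metadata-only, atomic swap. On BigQuery, issue
    # CREATE OR REPLACE VIEW orders_prod AS SELECT * FROM orders_canary
    run_sql(f"ALTER TABLE {prod} SWAP WITH {canary}")
    return {"promoted": True, "failures": []}
```

The point of the canary is the middle ground: junk never reaches prod, but a failed load is still queryable for root-cause analysis.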
Block on contract
- At ingest, if the JSON Schema/Avro check fails, route the payload to a dead-letter queue and page the producer. Don’t “fix it in transform.”
Make it observable and ownable
Checks without context become noise. Publish metrics and alert on SLO burn, not single failures.
Export metrics
- Emit `freshness_lag_minutes`, `row_count`, `null_rate`, and `test_failures_total` as Prometheus gauges/counters.
- Use labels like `table="orders_clean"`, `env="prod"`, `owner="revops"`.
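If you'd rather not pull the prometheus_client dependency into a batch job, the text exposition format is simple enough to emit directly (metric and label names follow the convention above; the writer itself is a sketch):

```python
def prom_line(metric, labels, value):
    """Render one sample in Prometheus text exposition format.

    Labels are sorted so repeated scrapes produce stable output.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_str}}} {value}"

labels = {"table": "orders_clean", "env": "prod", "owner": "revops"}
line = prom_line("freshness_lag_minutes", labels, 12)
# Ship via the Pushgateway or a node_exporter textfile collector
```

The `owner` label is what lets alert routing page the right team instead of a shared channel.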
Prometheus alert on multi-signal issues

```yaml
# prometheus/alerts.yml
groups:
  - name: data_quality
    rules:
      - alert: DataFreshnessSLOBreach
        expr: |
          freshness_lag_minutes{table="orders_clean",env="prod"} > 15
          and on(table)
          increase(test_failures_total{table="orders_clean"}[15m]) > 0
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Orders data freshness SLO breach"
          description: "Freshness > 15m with concurrent test failures in last 15m."
```

Grafana dashboard
- SLO burn down, last swap time, last contract violation, top failing checks.
- Link to run logs and data owner PagerDuty schedule.
Ownership
- Each tier-1 model has an on-call owner (Eng + Data) and a budget. If you don’t staff it, it’s not tier-1.
If you’re routing alerts to a channel no one watches, you’ve chosen “eventual truth.”
Results you can take to the CFO (and pitfalls)
What we’ve delivered with this setup at GitPlumbers:
- 65–85% reduction in downstream incident count within two quarters.
- <10 min MTTR for data regressions detected pre-promotion.
- 0 surprise schema drifts in prod for contracted sources.
- Freshness SLO from “whenever Airflow finishes” to 99.9% within 15 minutes.
- Finance-facing dashboard “apology rate” near zero in Q3.
Common pitfalls I’ve seen:
- Warn-only tests on tier-1 tables. If it’s worth checking, it’s worth blocking.
- Too many flaky anomaly checks. Start with deterministic rules, then add anomaly detection with baselines and backtesting.
- No canary path. You either block everything or let junk flow. Canary gives you a safe middle.
- Hidden owners. If every team owns it, no one does. Put names on models.
A fast path you can execute this quarter
- Identify 5–10 tier-1 tables (revenue, payments, inventory). Define SLOs and owners.
- Add dbt tests for schema/null/unique and relationships. Fail the build on error.
- Add Soda or GE checks for freshness, volume, and distribution of key measures.
- Create a canary stage; add an orchestrator gate that blocks promotion on failures.
- Export metrics to Prometheus; alert on SLO breach with multi-signal logic.
- Enforce a JSON Schema/Avro contract at ingestion for your noisiest source.
- Publish the SLOs in Confluence/Backstage. Hold a weekly review; burn down tech debt.
If you want help, this is literally what we do all day at GitPlumbers. We fix the brittle bits and leave you with guardrails, not a mystery box.
Key takeaways
- Quality monitoring without enforcement is noise—wire checks to gates that stop promotions.
- Define SLOs for your data products: freshness, completeness, and accuracy tied to dollars.
- Use layered checks: schema, freshness, volume, nulls/uniqueness, referential integrity, distribution, and business rules.
- Put checks next to code (dbt/Soda/GE) and run them in CI and your orchestrator.
- Publish metrics to Prometheus/Grafana and alert on SLO burn, not single failures.
- Start with tier-1 tables. Aim for <15 min freshness lag, <1% null spike, and zero schema drift without a version bump.
Implementation checklist
- Pick tier-1 tables and define explicit SLOs (freshness, completeness, accuracy).
- Add schema + null/unique tests in dbt; add distribution and referential checks via Soda/GE.
- Create a canary pipeline and gate promotions with a ShortCircuit/Fail task.
- Publish metrics to Prometheus; alert on budget burn and multiple-signal correlation.
- Enforce data contracts at ingestion using JSON Schema/Avro + Schema Registry.
- Run everything in CI on PRs and in prod on every run; fail fast when SLOs are at risk.
Questions we hear from teams
- Do I need to buy a data observability platform to start?
- No. Start with dbt tests for schema/null/unique, add Soda Core or Great Expectations for distributions and freshness, wire gates in your orchestrator, and export metrics to Prometheus/Grafana. You can add a managed tool later if you need lineage and auto-detection at scale.
- How do I handle schema changes without breaking everything?
- Use versioned data contracts. Introduce `orders_v1_3` with a clear migration path, backfill, and a cutover window. Block producers from mutating `v1_2` without a version bump. In the warehouse, use views to present a stable interface while you migrate.
- What SLOs should I start with?
- Start simple: Freshness (99.9% within 15 min), Completeness (row count within ±10% of 7-day average), and Accuracy (one or two core business invariants). Publish them, attach owners, and review weekly. Add more only when they change business outcomes.
- How do I avoid alert fatigue?
- Alert on multi-signal and SLO burn, not individual test failures. Route pages to the owning team, and send everything else as FYI. Track precision/recall of alerts: if >30% are false positives, tune thresholds or move to canary-only paging.
- Where should the checks live?
- With the code. Put dbt tests next to dbt models, Soda/GE scans in the repo, and contract validators in the ingest service. Run them in CI on PRs and in prod on every run. Checks that aren’t versioned drift out of truth.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
