The Lineage System That Turned 3‑Hour Fire Drills Into 15‑Minute Fixes
If you can’t trace where a number came from, you can’t trust it. Here’s how we build data lineage that makes impact analysis and debugging fast, boring, and reliable — without boiling the ocean.
“If you can’t trace where a number came from in under five minutes, you don’t own it — it owns you.”
The 2 a.m. dashboard page I couldn’t explain
Two quarters ago, an exec pinged me in Slack at 2:03 a.m.: “Why did revenue drop 18% today in the board deck?” I’d seen this movie. We had Snowflake, dbt, Airflow, Databricks, Kafka — all the right stickers. But no one could tell me which upstream change nuked the number. We had a pretty catalog, but it was aspirational, not operational.
We turned it around by installing runtime lineage end to end. Not a spreadsheet. Not a tribal-knowledge Miro board. Actual, queryable lineage emitted by the jobs that ran, tied to data quality checks and SLOs. Our MTTR fell from hours to minutes, and the “who touched what?” witch hunts stopped. Here’s exactly how we build that at GitPlumbers.
What good lineage actually looks like
Most lineage decks talk about boxes and arrows. The systems that work in production share a few non-negotiables:
- Runtime fidelity: lineage comes from what executed (Airflow, Spark, dbt) using `OpenLineage` events — not from hand-curated docs.
- Column-level detail: table-level lineage is fine for posters. For impact analysis, you need to know that `orders.total_usd` depends on `fx_rates.usd_rate` and `order_items.qty`.
- Cross-mode coverage: batch (dbt, Spark), streaming (Kafka, Flink), and BI extracts (Looker, Tableau) at minimum for your critical paths.
- Versioning and ownership: dataset `contract_version`, `owner`, `pii`, and `criticality` tags are queryable and enforced in CI.
- Queryable graph: lineage is stored in a system you can query via API to drive alerts, dashboards, and runbooks.
- Actionable hooks: tie lineage to `dbt test`, `Great Expectations`, Soda/Monte Carlo alerts, and SLOs like freshness and null rate.
If you don’t have these, you don’t have lineage — you have art.
Architecture that doesn’t crumble under load
I’ve watched teams spend a year building custom lineage that collapses on first contact with Databricks notebooks. Use the standards and glue that exist:
- Event schema: `OpenLineage` (1.16+) for consistent events across tools.
- Collectors:
  - `openlineage-airflow` for Airflow 2.7+ (captures DAG/task inputs and outputs).
  - `openlineage-dbt` for dbt Core 1.7+ (captures column lineage from the DAG plus the parser).
  - `openlineage-spark` for Spark 3.3+/Databricks 13+ (autocollects read/write ops).
  - Kafka Connect/Flink exporters where applicable.
- Lineage store / UI:
  - Open-source: `Marquez` (0.30+) — solid API, graph, and UI.
  - Enterprise: `Apache Atlas` or `DataHub` if you already run them.
- Quality and contracts:
  - `dbt test`, `Great Expectations`, or `Soda` for checks.
  - Data contracts in repo (`schema.yml`, `expectations.yml`) with CI gates.
- Cloud data platforms: Snowflake, BigQuery, or Databricks Lakehouse — use tags, masking policies, and audit logs to enrich lineage.
Choose one lineage store and make it the source of truth. Forking events into five catalogs guarantees drift.
Instrumentation: ship events, not hope
Get events flowing in a week. You don’t need to rewrite your stack.
Airflow (2.7+) with OpenLineage
- Install the provider and configure the backend: `pip install openlineage-airflow`.
- Set `OPENLINEAGE_URL` to your Marquez/DataHub/Atlas endpoint and `OPENLINEAGE_NAMESPACE` to something stable (e.g., `prod`).
- Airflow will emit events for tasks with `inlets`/`outlets`. For SQL operators, define inputs/outputs or wrap with lineage-aware operators.
Example Airflow task with explicit datasets:
```python
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator
from airflow.lineage.entities import Table

# airflow.lineage.entities.Table takes cluster/database/name keyword args,
# which the OpenLineage extractor maps to a dataset identifier.
extract = PythonOperator(
    task_id="load_fx_rates",
    python_callable=load_rates,
    outlets=[Table(cluster="warehouse", database="public", name="fx_rates")],
)

transform = PythonOperator(
    task_id="compute_orders",
    python_callable=build_orders,
    inlets=[
        Table(cluster="warehouse", database="public", name="fx_rates"),
        Table(cluster="warehouse", database="raw", name="orders"),
    ],
    outlets=[Table(cluster="warehouse", database="analytics", name="orders")],
)

chain(extract, transform)
```
dbt (Core 1.7+) with OpenLineage
- `pip install openlineage-dbt`
- Run models through the bundled `dbt-ol` wrapper (e.g., `dbt-ol run`) with the `OPENLINEAGE_*` env vars set; it wraps `dbt --log-format json` and parses the run artifacts.
- You’ll get model, source, and column-level lineage emitted per run.
Spark/Databricks (3.3+/DBR 13+)
- `spark-submit --packages io.openlineage:openlineage-spark_2.12:1.16.0`
- Configure `spark.openlineage.url` and a namespace (newer releases use `spark.openlineage.transport.url`; see the sketch below).
- Autocollects reads/writes from `DataFrame` ops, Delta I/O, etc.
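If you configure at the session (or cluster policy) level, every notebook and job inherits the listener. A minimal PySpark sketch, assuming a Marquez endpoint at `http://marquez:5000` and openlineage-spark 1.16 (config keys per that release’s docs; verify against yours):

```python
from pyspark.sql import SparkSession

# Attach the OpenLineage listener so reads/writes are reported automatically.
# Endpoint, namespace, app name, and lake paths are illustrative.
spark = (
    SparkSession.builder.appName("orders_nightly")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.16.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000")
    .config("spark.openlineage.namespace", "prod")
    .getOrCreate()
)

# Both the read and the write below show up as lineage events; no job changes needed.
orders = spark.read.parquet("/lake/raw/orders")
orders.write.mode("overwrite").parquet("/lake/analytics/orders")
```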
Streaming with Kafka
- Emit lineage for connectors (e.g., Debezium -> Kafka -> Spark Structured Streaming -> Delta Lake) via connectors’ OpenLineage plugins or a lightweight wrapper that reports topic in/out and schemas.
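When a hop has no collector, a thin wrapper can still report it. A sketch using the `openlineage-python` client (class names per its docs; the endpoint, job, and dataset names here are illustrative):

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez:5000")  # illustrative endpoint

# One COMPLETE event per micro-batch or deploy: topic in, Delta table out.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="prod", name="stream.orders_cdc_to_delta"),
    producer="https://github.com/your-org/lineage-wrappers",  # hypothetical repo
    inputs=[Dataset(namespace="kafka://broker:9092", name="orders.cdc")],
    outputs=[Dataset(namespace="prod", name="lake.analytics.orders")],
)
client.emit(event)
```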
Enrich with tags and versions
- Add `contract_version`, `owner`, `pii`, and `criticality` as dataset facets. For Snowflake:

```sql
alter table analytics.orders set tag contract_version = '3';
alter table analytics.orders set tag owner = 'data-platform@company.com';
alter table analytics.orders set tag criticality = 'tier1';
```
Now your lineage graph is being built from what actually runs, not hopes and dreams.
Make impact analysis a 5‑minute task
Once events flow, wire impact analysis into your on-call. The playbook we deploy:
- When a test fails or an SLO is violated, query the lineage store to find all downstreams of the broken node at the column level.
- Page the right owner using tags and include the affected dashboards/models in the alert.
Example: investigate a sudden dip in `revenue_daily.total_usd`.

1. Find upstream dependencies of the column
   - Query the Marquez API for the lineage of `analytics.revenue_daily.total_usd` back to the root sources. You’ll typically see `fx_rates.usd_rate` and `orders.total_usd`.
2. Check recent changes
   - Pull the last successful runs that touched those columns: dbt `run_results.json`, Airflow DAG runs, Spark job IDs.
   - We often surface this via a simple Slack command: `!lineage analytics.revenue_daily.total_usd --since 24h` (a sketch of the query behind it follows this list).
3. Diff schema/contracts
   - Did someone rename `rate` to `usd_rate`? Did the null rate spike on `orders.total_usd`? Lineage + tests will show the exact node where distribution or schema changed.
4. Contain blast radius
   - Pause downstream DAGs that depend on the broken column (`airflow dags pause`), or deploy a canary fix with `dbt run --select state:modified`.
5. Fix forward with confidence
   - Update the transformation, add a test to prevent regression, and verify the freshness SLO is green before unpausing.
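That Slack command is a thin layer over the lineage API. A minimal sketch against Marquez (endpoint path and `nodeId` format per the Marquez docs; verify against your deployment):

```python
import requests

MARQUEZ = "http://marquez:5000"  # illustrative endpoint
node_id = "dataset:prod:analytics.revenue_daily"

# Pull the lineage graph a few hops out from the broken dataset.
resp = requests.get(
    f"{MARQUEZ}/api/v1/lineage",
    params={"nodeId": node_id, "depth": 3},
    timeout=10,
)
resp.raise_for_status()

# Each node is a DATASET or JOB; the edges give you the upstreams and
# downstreams to include in the page.
for node in resp.json()["graph"]:
    print(node["type"], node["id"])
```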
With this in place, the exec page becomes: “Revenue dip traced to missing `fx_rates` update; deploying fix; ETA 12 minutes.”
Quality and governance that ride on lineage
Lineage is the backbone; quality and governance make it useful.
SLOs linked to lineage
- Freshness: `analytics.revenue_daily` must be < 30m stale during business hours.
- Null rate: `orders.total_usd` nulls < 0.5%.
- Volume: row count delta within 3σ of the trailing 14 days.
- Alert rules include downstream impacted assets via a lineage graph query (see the sketch below).
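As a shape for the lineage-aware alert, a sketch where the staleness check enriches its own page with downstream assets (`downstreams_of` and `freshness_alert` are our hypothetical helpers, not a library API):

```python
from datetime import datetime, timedelta, timezone

def downstreams_of(dataset: str) -> list[str]:
    """Stub: in practice, a lineage query like the Marquez sketch above."""
    return ["looker.revenue_board", "analytics.revenue_weekly"]

def freshness_alert(dataset: str, last_loaded_at: datetime, max_staleness: timedelta):
    """Return an alert payload when the dataset breaches its freshness SLO."""
    staleness = datetime.now(timezone.utc) - last_loaded_at
    if staleness <= max_staleness:
        return None
    return {
        "dataset": dataset,
        "staleness_minutes": int(staleness.total_seconds() // 60),
        "impacted": downstreams_of(dataset),  # lands in the page, not a wiki
    }

print(freshness_alert(
    "analytics.revenue_daily",
    last_loaded_at=datetime.now(timezone.utc) - timedelta(minutes=45),
    max_staleness=timedelta(minutes=30),
))
```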
Tests that block bad merges
- `dbt test` or `Great Expectations` runs in CI on PRs that modify contracts.
- Use lineage to compute the minimum test set: only run tests for impacted models plus critical downstreams (sketched below).
Data contracts and schema control
- Store contracts next to code (`schema.yml`). Fail CI if a PR changes a `contract_version` without migration steps.
- For Snowflake, enforce tags via `TAG` policies and check in CI with `INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS` (see the gate sketch below).
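A minimal CI gate sketch using the Snowflake Python connector (the `TAG_REFERENCES_ALL_COLUMNS` table function is Snowflake’s; the required-tag list and connection details are our assumptions):

```python
import os
import snowflake.connector

REQUIRED = {"CONTRACT_VERSION", "OWNER", "CRITICALITY"}

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="ANALYTICS",
)
cur = conn.cursor()
cur.execute(
    "select tag_name from table("
    "information_schema.tag_references_all_columns('analytics.orders', 'table'))"
)
found = {row[0] for row in cur.fetchall()}

# Fail the build if a Tier 1 table lost any of its required tags.
missing = REQUIRED - found
if missing:
    raise SystemExit(f"analytics.orders missing required tags: {sorted(missing)}")
```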
PII and access
- Mark PII columns (`customer.email`) and propagate masks/policies downstream based on lineage. With Snowflake, attach a `MASKING POLICY` to source columns; use lineage to validate that derived columns retain policies or get blocked at review (sketched below).
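A sketch of that review gate, with plain dicts standing in for the lineage store and the policy catalog (in practice both come from APIs):

```python
# Column lineage: derived column -> its direct upstream columns.
column_parents = {
    "analytics.customers.email_hash": ["raw.customer.email"],
    "analytics.orders.total_usd": ["raw.orders.amount"],
}
pii_columns = {"raw.customer.email"}     # tagged pii=true at the source
masked_columns = {"raw.customer.email"}  # columns with a masking policy attached

def derived_from_pii(column: str) -> bool:
    """Walk upstream through column lineage looking for PII-tagged sources."""
    stack = list(column_parents.get(column, []))
    while stack:
        parent = stack.pop()
        if parent in pii_columns:
            return True
        stack.extend(column_parents.get(parent, []))
    return False

violations = [
    col for col in column_parents
    if derived_from_pii(col) and col not in masked_columns
]
print(violations)  # ['analytics.customers.email_hash'] -> block at review
```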
AI features sanity
- If you ship features to models via `feature_store.customer_ltv`, tie lineage to the model registry (`mlflow`) so you can answer: “Which data drifted?” and “Which upstream contract broke the model?” (a tagging sketch follows).
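A minimal sketch of the model-side half: stamp upstream dataset contract versions onto the training run so you can walk from a bad model version back to the data (the `lineage.input.*` tag names are our convention, not an `mlflow` API):

```python
import mlflow

with mlflow.start_run(run_name="ltv_model_train"):
    # Record the data contracts this training run consumed.
    mlflow.set_tags({
        "lineage.input.feature_store.customer_ltv": "contract_version=3",
        "lineage.input.analytics.orders": "contract_version=3",
    })
    # ... train the model, then log it to the registry, e.g.:
    # mlflow.sklearn.log_model(model, "model")
```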
A tiny bit of code that pays for itself
Sometimes you need to see the shape of the event. Here’s a trimmed `OpenLineage` event you’ll see from a dbt run touching column lineage:
```json
{
  "eventType": "COMPLETE",
  "job": {"namespace": "prod", "name": "dbt.analytics.orders"},
  "run": {"runId": "9e6c..."},
  "inputs": [{
    "namespace": "prod",
    "name": "raw.orders",
    "facets": {"schema": {"fields": [{"name": "total_usd"}]}}
  }],
  "outputs": [{
    "namespace": "prod",
    "name": "analytics.orders",
    "facets": {
      "columnLineage": {
        "fields": {
          "total_usd": {
            "inputFields": [
              {"namespace": "prod", "name": "raw.orders", "field": "amount"},
              {"namespace": "prod", "name": "warehouse.fx_rates", "field": "usd_rate"}
            ]
          }
        }
      }
    }
  }]
}
```
This is what lets you answer “If `fx_rates.usd_rate` is stale, which columns are wrong?” without guesswork.
Results you can take to the CFO
We’ve implemented this pattern at a fintech on Snowflake/dbt/Airflow and an adtech on Databricks/Spark/Dagster. The numbers were consistent:
- MTTR for data incidents: 3–6 hours ➝ 15–30 minutes (70–90% reduction).
- Data downtime (hours per month): down 40–60% once SLOs were wired to lineage.
- On-call load: 30% fewer pages because alerts were scoped to actually impacted assets.
- Deployment speed: PRs merged faster because CI surfaced contract breaks before they hit prod.
- Trust: “Unknown source” dashboard issues dropped to near zero within 60 days.
And yes, cloud costs went down. When you can see exactly which models feed a dashboard no one uses, you turn off a lot of waste.
30‑day rollout plan that survives reality
Week 1
- Stand up `Marquez` or point to your existing `Atlas`/`DataHub`.
- Instrument Airflow and dbt with `openlineage-*` collectors in the lowest-risk env.
- Pick 3 critical assets (revenue, signups, AI features) and tag them `criticality=tier1`.
Week 2
- Add Spark/Databricks collector for jobs that touch the 3 assets.
- Add basic SLOs (freshness, null) and route alerts to a lineage-aware Slack channel.
- Publish the lineage UI link in every relevant runbook.
Week 3
- Turn on column-level lineage for the 3 assets. Add `contract_version`, `owner`, `pii` tags.
- Wire CI checks to fail PRs that break contracts or remove owners.
- Add a Slack slash command to query lineage for a dataset/column.
Week 4
- Expand coverage to top 20 Tier 1 datasets.
- Start weekly coverage and MTTR reporting.
- Do one planned incident game day: break a contract on purpose and measure response.
If you can’t get this done in 30 days, your problem isn’t tooling — it’s ownership. Call us; we’ve fixed that, too.
Key takeaways
- Stop “catalog-first” lineage. Instrument at runtime with `OpenLineage` so lineage reflects what actually ran.
- Target **column-level** lineage across batch and streaming; table-level isn’t enough for impact analysis.
- Store lineage in a queryable graph (Marquez/Atlas/DataHub) and wire it into your incident workflows (Slack, on-call, runbooks).
- Make lineage actionable: connect to tests (`dbt test`, `Great Expectations`) and SLOs (data freshness, null rate, schema stability).
- Version and tag datasets (`contract_version`, `pii`, owners) to prevent silent breakage and accelerate debugging.
- Measure success with MTTR, % of assets covered by lineage, and reduction in “unknown source” incidents.
Implementation checklist
- Adopt `OpenLineage` event schema and instrument orchestrators (Airflow/Dagster) and compute (Spark/Databricks, dbt).
- Stand up a lineage store: `Marquez` (open-source) or enterprise (`Atlas`, `DataHub`).
- Enable column-level lineage for critical paths (revenue metrics, AI features).
- Connect tests and SLOs to lineage and alert on upstream impacts.
- Expose a self-serve graph UI + API search for analysts, PMs, and on-call.
- Add lineage checks to CI/CD: fail merges that break contracts or owners.
- Publish weekly coverage and MTTR reports to keep momentum.
Questions we hear from teams
- Do we need a new catalog to get lineage?
- No. Start by emitting `OpenLineage` events from Airflow/dbt/Spark into a single store. If you already run Atlas or DataHub, use it. If not, stand up Marquez in a day and move on. Don’t build custom collectors unless a vendor lacks support.
- Is table-level lineage enough?
- Not for impact analysis. You’ll waste time triaging false positives. Column-level lineage for Tier 1 assets is the difference between a 15-minute fix and a multi-hour blame game.
- How do we handle notebooks (the perennial Databricks pain)?
- Use `openlineage-spark` configured at cluster/job level so reads/writes are captured regardless of notebook style. For ad-hoc notebooks, gate promotion into scheduled jobs unless they emit lineage and pass tests.
- Won’t this slow down our pipelines or cost a fortune?
- OpenLineage events are tiny. Collectors add negligible overhead. The real costs are in storage/compute from bad jobs; lineage reduces that by catching breakages early and scoping reruns. Most teams see net cost down and MTTR down 70–90%.
- What KPIs should we track to prove value?
- Track these five:
  - MTTR for data incidents
  - % of Tier 1 assets with column-level lineage
  - Data downtime hours/month per domain
  - % of alerts with identified root cause within 30 minutes
  - Contract violation rate per month