The Lineage System That Turned 3‑Hour Fire Drills Into 15‑Minute Fixes

If you can’t trace where a number came from, you can’t trust it. Here’s how we build data lineage that makes impact analysis and debugging fast, boring, and reliable — without boiling the ocean.

“If you can’t trace where a number came from in under five minutes, you don’t own it — it owns you.”

The 2 a.m. dashboard page I couldn’t explain

Two quarters ago, an exec pinged me in Slack at 2:03 a.m.: “Why did revenue drop 18% today in the board deck?” I’d seen this movie. We had Snowflake, dbt, Airflow, Databricks, Kafka — all the right stickers. But no one could tell me which upstream change nuked the number. We had a pretty catalog, but it was aspirational, not operational.

We turned it around by installing runtime lineage end to end. Not a spreadsheet. Not a tribal-knowledge Miro board. Actual, queryable lineage emitted by the jobs that ran, tied to data quality checks and SLOs. Our MTTR fell from hours to minutes, and the “who touched what?” witch hunts stopped. Here’s exactly how we build that at GitPlumbers.

What good lineage actually looks like

Most lineage decks talk about boxes and arrows. The systems that work in production share a few non-negotiables:

  • Runtime fidelity: lineage comes from what executed (Airflow, Spark, dbt) using OpenLineage events — not from hand-curated docs.
  • Column-level detail: table-level lineage is fine for posters. For impact analysis, you need to know that orders.total_usd depends on fx_rates.usd_rate and order_items.qty.
  • Cross-mode coverage: batch (dbt, Spark), streaming (Kafka, Flink), and BI extracts (Looker, Tableau) at minimum for your critical paths.
  • Versioning and ownership: dataset contract_version, owner, pii, and criticality tags are queryable and enforced in CI.
  • Queryable graph: lineage is stored in a system you can query via API to drive alerts, dashboards, and runbooks.
  • Actionable hooks: tie lineage to dbt test, Great Expectations, Soda/Monte Carlo alerts, and SLOs like freshness and null rate.

If you don’t have these, you don’t have lineage — you have art.

Architecture that doesn’t crumble under load

I’ve watched teams spend a year building custom lineage that collapses on first contact with Databricks notebooks. Use the standards and glue that exist:

  • Event schema: OpenLineage (1.16+) for consistent events across tools.
  • Collectors:
    • apache-airflow-providers-openlineage for Airflow 2.7+ (captures DAG/task inputs and outputs; the standalone openlineage-airflow package covers older Airflow releases).
    • openlineage-dbt for dbt Core 1.7+ (captures column lineage from DAG + parser).
    • openlineage-spark for Spark 3.3+/Databricks 13+ (autocollects read/write ops).
    • Kafka Connect/Flink exporters where applicable.
  • Lineage store / UI:
    • Open-source: Marquez (0.30+) — solid API, graph, and UI.
    • Enterprise: Apache Atlas or DataHub if you already run them.
  • Quality and contracts:
    • dbt test, Great Expectations, or Soda for checks.
    • Data contracts in repo (schema.yml, expectations.yml) with CI gates.
  • Cloud data platforms: Snowflake, BigQuery, or Databricks Lakehouse — use tags, masking policies, and audit logs to enrich lineage.

Choose one lineage store and make it the source of truth. Forking events into five catalogs guarantees drift.

Instrumentation: ship events, not hope

Get events flowing in a week. You don’t need to rewrite your stack.

  1. Airflow (2.7+) with OpenLineage

    • Install the provider and configure the backend:
      • pip install apache-airflow-providers-openlineage
      • Point it at your Marquez/DataHub/Atlas endpoint and pick a stable namespace (e.g., OPENLINEAGE_URL and OPENLINEAGE_NAMESPACE=prod, or the [openlineage] section in airflow.cfg).
    • The provider extracts lineage automatically for supported operators (SQL operators, transfers). For PythonOperator and custom code, declare inlets/outlets explicitly.

    Example Airflow task with explicit datasets:

    from airflow.models.baseoperator import chain
    from airflow.operators.python import PythonOperator
    from airflow.lineage.entities import Table
    
    # load_rates and build_orders are your existing callables. Table(cluster, database, name)
    # is Airflow's lineage entity; the OpenLineage provider converts it to datasets.
    fx_rates = Table(cluster="snowflake", database="warehouse", name="public.fx_rates")
    raw_orders = Table(cluster="snowflake", database="raw", name="orders")
    orders = Table(cluster="snowflake", database="analytics", name="orders")
    
    extract = PythonOperator(
        task_id="load_fx_rates",
        python_callable=load_rates,
        outlets=[fx_rates],
    )
    transform = PythonOperator(
        task_id="compute_orders",
        python_callable=build_orders,
        inlets=[fx_rates, raw_orders],
        outlets=[orders],
    )
    chain(extract, transform)
  2. dbt (Core 1.7+) with OpenLineage

    • pip install openlineage-dbt
    • Set the OPENLINEAGE_* env vars and run models through the dbt-ol wrapper (dbt-ol run in place of dbt run); it emits events from the manifest and run results after each invocation.
    • You’ll get model, source, and column-level lineage emitted per run.
  3. Spark/Databricks (3.3+/DBR 13+)

    • spark-submit --packages io.openlineage:openlineage-spark_2.12:1.16.0 --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
    • Configure the transport (spark.openlineage.transport.type=http, spark.openlineage.transport.url pointed at your lineage store) and spark.openlineage.namespace.
    • Autocollects reads/writes from DataFrame ops, Delta I/O, etc.
  4. Streaming with Kafka

    • Emit lineage for connectors (e.g., Debezium -> Kafka -> Spark Structured Streaming -> Delta Lake) via connectors’ OpenLineage plugins or a lightweight wrapper that reports topic in/out and schemas. A minimal wrapper sketch follows this list.
  5. Enrich with tags and versions

    • Add contract_version, owner, pii, and criticality as dataset facets. For Snowflake (after a one-time create tag for each):
    alter table analytics.orders set tag contract_version = '3';
    alter table analytics.orders set tag owner = 'data-platform@company.com';
    alter table analytics.orders set tag criticality = 'tier1';
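
For the lightweight wrapper in step 4, here is a minimal sketch using the openlineage-python client. The endpoint, namespaces, topic, and job names are illustrative; swap in your own.

# Report one Kafka-topic-in / table-out hop to the lineage store.
# Requires `pip install openlineage-python`; names and URL below are illustrative.
from datetime import datetime, timezone
from uuid import uuid4
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

client = OpenLineageClient(url="http://marquez:5000")  # your Marquez/DataHub endpoint

def report_stream_hop(job_name: str, in_topic: str, out_table: str) -> None:
    run, job = Run(runId=str(uuid4())), Job(namespace="prod", name=job_name)
    for state in (RunState.START, RunState.COMPLETE):
        client.emit(RunEvent(
            eventType=state,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=run,
            job=job,
            inputs=[Dataset(namespace="kafka://prod-cluster", name=in_topic)],
            outputs=[Dataset(namespace="prod", name=out_table)],
            producer="https://github.com/your-org/stream-lineage-wrapper",
        ))

report_stream_hop("stream.orders_enrichment", "orders.cdc.v1", "delta.analytics.orders")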

Now your lineage graph is being built from what actually runs, not hopes and dreams.

Make impact analysis a 5‑minute task

Once events flow, wire impact analysis into your on-call. The playbook we deploy:

  • When a test fails or an SLO is violated, query the lineage store to find all downstreams of the broken node at the column level.
  • Page the right owner using tags and include the affected dashboards/models in the alert.

Example: investigate a sudden dip in revenue_daily.total_usd.

  1. Find upstream dependencies of the column

    • Query the Marquez API for the upstream lineage of analytics.revenue_daily.total_usd back to its root sources (see the sketch after this list). You’ll typically land on fx_rates.usd_rate and orders.total_usd.
  2. Check recent changes

    • Pull the last successful runs that touched those columns: dbt run results, Airflow DAG runs, Spark job IDs.
    • We often surface this via a simple Slack command: !lineage analytics.revenue_daily.total_usd --since 24h.
  3. Diff schema/contracts

    • Did someone rename rate to usd_rate? Did null rate spike on orders.total_usd? Lineage + tests will show the exact node where distribution or schema changed.
  4. Contain blast radius

    • Pause downstream DAGs that depend on the broken column (airflow dags pause), or deploy a canary fix with dbt run --select state:modified+ against the last known-good state manifest.
  5. Fix forward with confidence

    • Update the transformation, add a test to prevent regression, and verify freshness SLO is green before unpausing.
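
That first lineage query is just an HTTP call. A minimal sketch against the Marquez lineage API; the /api/v1/lineage endpoint and dataset:namespace:name node IDs come from Marquez, while the URL, names, and depth are illustrative.

# Walk the Marquez lineage graph backwards from a dataset to its upstream sources.
import requests

MARQUEZ_URL = "http://marquez:5000"

def upstream_datasets(namespace: str, dataset: str, depth: int = 5) -> set[str]:
    node_id = f"dataset:{namespace}:{dataset}"
    resp = requests.get(
        f"{MARQUEZ_URL}/api/v1/lineage",
        params={"nodeId": node_id, "depth": depth},
        timeout=10,
    )
    resp.raise_for_status()
    nodes = {n["id"]: n for n in resp.json()["graph"]}
    seen, frontier = set(), [node_id]
    while frontier:
        for edge in nodes.get(frontier.pop(), {}).get("inEdges", []):
            if edge["origin"] not in seen:  # origin is an upstream job or dataset node id
                seen.add(edge["origin"])
                frontier.append(edge["origin"])
    return {n for n in seen if n.startswith("dataset:")}

# upstream_datasets("prod", "analytics.revenue_daily")
# -> {"dataset:prod:warehouse.fx_rates", "dataset:prod:raw.orders", ...}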

With this in place, the exec page becomes: “Revenue dip traced to missing fx_rates update; deploying fix; ETA 12 minutes.”

Quality and governance that ride on lineage

Lineage is the backbone; quality and governance make it useful.

  • SLOs linked to lineage

    • Freshness: analytics.revenue_daily must be < 30m stale in business hours.
    • Null rate: orders.total_usd nulls < 0.5%.
    • Volume: row count delta within 3σ of trailing 14 days.
    • Alert rules include downstream impacted assets via lineage graph query.
  • Tests that block bad merges

    • dbt test or Great Expectations runs in CI on PRs that modify contracts.
    • Use lineage to compute the minimum test set: only run tests for impacted models plus critical downstreams.
  • Data contracts and schema control

    • Store contracts next to code (schema.yml). Fail CI if a PR changes a contract_version without migration steps (a minimal gate sketch follows this list).
    • For Snowflake, enforce tags via TAG policies and check in CI with INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS.
  • PII and access

    • Mark PII columns (customer.email) and propagate masks/policies downstream based on lineage. With Snowflake, attach MASKING POLICY to source columns; use lineage to validate that derived columns retain policies or get blocked at review.
  • AI features sanity

    • If you ship features to models via feature_store.customer_ltv, tie lineage to model registry (mlflow) so you can answer: “Which data drifted?” and “Which upstream contract broke the model?”
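
The contract gate can be a ten-line script in CI. A minimal sketch, assuming contracts live in schema.yml files and migration notes land under a migrations/ directory; adapt the paths and base branch to your repo.

# Fail the build when a PR bumps contract_version without adding a migration note.
import subprocess
import sys

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

changed_files = git("diff", "--name-only", "origin/main...HEAD").splitlines()
contract_bump = "contract_version" in git("diff", "origin/main...HEAD", "--", "*schema.yml")
migration_added = any(f.startswith("migrations/") for f in changed_files)
if contract_bump and not migration_added:
    sys.exit("contract_version changed without a migration note under migrations/ -- failing the build.")
print("contract check passed")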

A tiny bit of code that pays for itself

Sometimes you need to see the shape of the event. Here’s a trimmed OpenLineage event you’ll see from a dbt run touching column lineage:

{
  "eventType": "COMPLETE",
  "job": {"namespace": "prod", "name": "dbt.analytics.orders"},
  "run": {"runId": "9e6c..."},
  "inputs": [{
    "namespace": "prod",
    "name": "raw.orders",
    "facets": {"schema": {"fields": [{"name": "total_usd"}]}}
  }],
  "outputs": [{
    "namespace": "prod",
    "name": "analytics.orders",
    "facets": {
      "columnLineage": {
        "fields": {
          "total_usd": {"inputFields": [{"namespace": "prod","name": "raw.orders","field": "amount"}, {"namespace": "prod","name": "warehouse.fx_rates","field": "usd_rate"}]}
        }
      }
    }
  }]
}

This is what lets you answer “If fx_rates.usd_rate is stale, which columns are wrong?” without guesswork.
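
The answer itself is a short dictionary walk. Here is a sketch that inverts the columnLineage facet from events like the one above; event stands in for whatever your lineage store returns.

# Given a parsed OpenLineage event, list output columns that depend on one upstream column.
from collections import defaultdict

def impacted_columns(event: dict, source: str) -> dict[str, list[str]]:
    impacted = defaultdict(list)
    for output in event.get("outputs", []):
        fields = output.get("facets", {}).get("columnLineage", {}).get("fields", {})
        for out_col, lineage in fields.items():
            if any(f'{i["name"]}.{i["field"]}' == source for i in lineage["inputFields"]):
                impacted[output["name"]].append(out_col)
    return dict(impacted)

# impacted_columns(event, "warehouse.fx_rates.usd_rate") -> {"analytics.orders": ["total_usd"]}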

Results you can take to the CFO

We’ve implemented this pattern at a fintech on Snowflake/dbt/Airflow and an adtech on Databricks/Spark/Dagster. The numbers were consistent:

  • MTTR for data incidents: 3–6 hours ➝ 15–30 minutes (70–90% reduction).
  • Data downtime (hours per month): down 40–60% once SLOs were wired to lineage.
  • On-call load: 30% fewer pages because alerts were scoped to actually impacted assets.
  • Deployment speed: PRs merged faster because CI surfaced contract breaks before they hit prod.
  • Trust: “Unknown source” dashboard issues dropped to near zero within 60 days.

And yes, cloud costs went down. When you can see exactly which models feed a dashboard no one uses, you turn off a lot of waste.

30‑day rollout plan that survives reality

Week 1

  • Stand up Marquez or point to your existing Atlas/DataHub.
  • Instrument Airflow and dbt with openlineage-* collectors in the lowest-risk env.
  • Pick 3 critical assets (revenue, signups, AI features) and tag them criticality=tier1.

Week 2

  • Add Spark/Databricks collector for jobs that touch the 3 assets.
  • Add basic SLOs (freshness, null) and route alerts to a lineage-aware Slack channel.
  • Publish the lineage UI link in every relevant runbook.

Week 3

  • Turn on column-level lineage for the 3 assets. Add contract_version, owner, pii tags.
  • Wire CI checks to fail PRs that break contracts or remove owners.
  • Add a Slack slash command to query lineage for a dataset/column.

Week 4

  • Expand coverage to top 20 Tier 1 datasets.
  • Start weekly coverage and MTTR reporting.
  • Do one planned incident game day: break a contract on purpose and measure response.

If you can’t get this done in 30 days, your problem isn’t tooling — it’s ownership. Call us; we’ve fixed that, too.


Key takeaways

  • Stop “catalog-first” lineage. Instrument at runtime with `OpenLineage` so lineage reflects what actually ran.
  • Target **column-level** lineage across batch and streaming; table-level isn’t enough for impact analysis.
  • Store lineage in a queryable graph (Marquez/Atlas/DataHub) and wire it into your incident workflows (Slack, on-call, runbooks).
  • Make lineage actionable: connect to tests (`dbt test`, `Great Expectations`) and SLOs (data freshness, null rate, schema stability).
  • Version and tag datasets (`contract_version`, `pii`, owners) to prevent silent breakage and accelerate debugging.
  • Measure success with MTTR, % of assets covered by lineage, and reduction in “unknown source” incidents.

Implementation checklist

  • Adopt `OpenLineage` event schema and instrument orchestrators (Airflow/Dagster) and compute (Spark/Databricks, dbt).
  • Stand up a lineage store: `Marquez` (open-source) or enterprise (`Atlas`, `DataHub`).
  • Enable column-level lineage for critical paths (revenue metrics, AI features).
  • Connect tests and SLOs to lineage and alert on upstream impacts.
  • Expose a self-serve graph UI + API search for analysts, PMs, and on-call.
  • Add lineage checks to CI/CD: fail merges that break contracts or owners.
  • Publish weekly coverage and MTTR reports to keep momentum.

Questions we hear from teams

Do we need a new catalog to get lineage?
No. Start by emitting `OpenLineage` events from Airflow/dbt/Spark into a single store. If you already run Atlas or DataHub, use it. If not, stand up Marquez in a day and move on. Don’t build custom collectors unless a vendor lacks support.
Is table-level lineage enough?
Not for impact analysis. You’ll waste time triaging false positives. Column-level lineage for Tier 1 assets is the difference between a 15-minute fix and a multi-hour blame game.
How do we handle notebooks (the perennial Databricks pain)?
Use `openlineage-spark` configured at cluster/job level so reads/writes are captured regardless of notebook style. For ad-hoc notebooks, gate promotion into scheduled jobs unless they emit lineage and pass tests.
Won’t this slow down our pipelines or cost a fortune?
OpenLineage events are tiny. Collectors add negligible overhead. The real costs are in storage/compute from bad jobs; lineage reduces that by catching breakages early and scoping reruns. Most teams see net cost down and MTTR down 70–90%.
What KPIs should we track to prove value?
  • MTTR for data incidents
  • % of Tier 1 assets with column-level lineage
  • Data downtime hours/month per domain
  • % of alerts with identified root cause within 30 minutes
  • Contract violation rate per month

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a lineage readiness assessment
See how we wire lineage into your on-call