The KPI Broke at 9:03 AM — Lineage Was the Only Thing Between Us and Guesswork

If you can’t answer “what depends on this table/column?” in under 60 seconds, you don’t have data reliability — you have data vibes. Here’s how we implement lineage that actually enables impact analysis and fast debugging.


The morning your CEO pings you about “wrong numbers”

I’ve watched this exact scene play out at fintechs, retailers, and a couple of “we’re basically a data company” SaaS shops: it’s 9:03 AM, the exec dashboard is off by 12%, and someone asks the question that should be easy — “what changed?”

If your answer is some blend of SELECT *, Slack archaeology, and hoping the one staff engineer who “knows the pipelines” isn’t on PTO, you don’t have observability. You have folklore.

Data lineage is what turns that chaos into a tractable debugging exercise: you can trace the metric back to the datasets and jobs that produced it, see what ran, what changed, and what else is now at risk.

Lineage that matters: impact analysis and debugging (not slideware)

Let’s be blunt: lineage diagrams that look pretty in a governance deck but don’t help on-call are a waste of money.

The lineage you want has two operational use cases:

  • Impact analysis (pre-change): “If I alter orders.discount_amount, what dashboards/models break?”
  • Debugging (post-incident): “This metric is wrong; show me the upstream run, inputs, and last known good state.”

To do that, lineage needs a few things beyond a DAG screenshot:

  • Stable dataset identity: platform plus database.schema.table (and ideally stable column identifiers)
  • Run context: a unique run_id, timestamps, orchestrator task IDs, git SHA, and environment (prod/staging)
  • Ownership + tiering: who gets paged, and whether this dataset is a Tier-0 KPI or a nice-to-have
  • Queryable graph: an API/graph store you can query in seconds, not a PDF
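
Concretely, picture a small record attached to every dataset a run produces. A minimal sketch of that record in Python (field names are illustrative, not a standard; adapt them to whatever backend you choose):

# Illustrative shape of the lineage metadata for one produced dataset.
# None of these field names are a spec -- they just map to the bullets above.
lineage_record = {
    "dataset": {
        "platform": "snowflake",
        "name": "analytics_db.analytics.orders",       # stable, fully qualified identity
        "columns": ["order_id", "discount_amount"],     # optional column-level detail
    },
    "run": {
        "run_id": "9f3c2a1e-0b7d-4c55-9a21-6f1e2d3c4b5a",  # shared across orchestrator, dbt, Spark
        "orchestrator_task": "orders_pipeline.load_orders",
        "git_sha": "a1b2c3d",
        "environment": "prod",
        "started_at": "2024-05-01T09:00:00Z",
        "ended_at": "2024-05-01T09:07:12Z",
        "status": "COMPLETED",
    },
    "ownership": {
        "team": "order-analytics",
        "oncall": "pagerduty:order-analytics",
        "tier": "tier-0",                               # drives paging and change policy
    },
}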

The reliability payoff is measurable. When teams implement this well, we typically see:

  • 30–70% reduction in MTTR for data incidents (because you stop guessing)
  • Fewer “blast radius” surprises during schema changes (change failure rate drops)
  • Higher trust and faster delivery: analytics teams spend less time reconciling and more time shipping

The minimal viable lineage architecture (that doesn’t become a science project)

Here’s what actually works in the real world when you have Airflow, dbt, some Spark, and a warehouse like Snowflake or BigQuery.

  1. Emit lineage events from the systems that know what happened
    • Orchestrator: Airflow (or Dagster)
    • Transformation layer: dbt
    • Compute: Spark (including EMR/Databricks)
  2. Ingest into a lineage backend
    • Pragmatic options: Marquez (OpenLineage native) or DataHub (broader metadata)
  3. Enrich the graph
    • Ownership from your org model (PagerDuty, Opsgenie, Teams)
    • Business metadata (domain, KPI tier, PII tags)
  4. Expose “blast radius”
    • CLI/API that answers dependency questions quickly
    • Optional: wire into CI for “unsafe change” detection

The trick: start with job-level lineage. It’s fast and gives you 80% of the value. Add column-level lineage only where it pays (finance, rev-rec, core product KPIs). I’ve seen teams boil the ocean with column-level parsing and end up with nothing shippable.

Implementation: OpenLineage with Airflow + Spark + dbt (concrete wiring)

Airflow: emit OpenLineage events

If you’re on apache-airflow>=2.3, the openlineage-airflow package registers a task-run listener so every task produces consistent events (on Airflow 2.7+, the official apache-airflow-providers-openlineage provider is the longer-term path). A common setup points the listener at your lineage backend via environment variables:

pip install "openlineage-airflow>=1.9.0"  # version as of many 2024/2025 deployments
# airflow.cfg
[openlineage]
transport = http
url = http://marquez:5000/api/v1/namespaces/prod/events
api_key =

Now each task run can emit:

  • inputs: tables read
  • outputs: tables written
  • run: runId, startTime, endTime, status
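
The listener covers the operators it knows how to extract lineage from; for ad-hoc or custom jobs you can emit the same events yourself with the Python client. A rough sketch, assuming the openlineage-python client (constructor and field names can shift between client versions, so treat this as a starting point rather than copy-paste):

# Emit one OpenLineage run event by hand -- useful for scripts the listener can't see.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez:5000")   # same backend the listener uses

client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),                        # correlate with your orchestrator run ID
    job=Job(namespace="prod", name="adhoc.backfill_orders"),
    producer="https://github.com/your-org/data-pipelines",   # placeholder producer URI
    inputs=[Dataset(namespace="snowflake://your-account", name="analytics.raw_orders")],
    outputs=[Dataset(namespace="snowflake://your-account", name="analytics.orders")],
))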

Spark: capture datasets read/written

For Spark jobs (Databricks, EMR, K8s Spark Operator), attach an OpenLineage listener.

spark-submit \
  --packages io.openlineage:openlineage-spark_2.12:1.9.0 \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=http \
  --conf spark.openlineage.transport.url=http://marquez:5000 \
  --conf spark.openlineage.namespace=prod \
  your_job.py

This is where lineage starts paying off in debugging. When someone “optimizes” a Spark job and accidentally changes join semantics, you can see exactly which inputs/outputs changed and correlate to the run.

dbt: ingest artifacts (and don’t forget git SHA)

dbt already knows model dependencies; you just need to publish them. At minimum, store manifest.json and run_results.json (plus catalog.json if you want schema-level detail) somewhere durable (S3/GCS) and ingest them into your metadata store.

dbt deps
DBT_ENV_CUSTOM_ENV_run_sha=$(git rev-parse --short HEAD) dbt build --target prod
aws s3 cp target/run_results.json s3://data-metadata/dbt/prod/run_results-$(date +%F).json
dbt docs generate   # writes catalog.json, which the ingestion config below expects
aws s3 cp target/manifest.json s3://data-metadata/dbt/prod/manifest.json
aws s3 cp target/catalog.json s3://data-metadata/dbt/prod/catalog.json
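
Even before you stand up a metadata platform, manifest.json alone answers a lot of dependency questions. A small sketch that walks its child_map to find everything downstream of a model (the helper function is hypothetical, not part of dbt):

import json
from collections import deque

def downstream(manifest_path: str, node_id: str) -> set[str]:
    # manifest.json's "child_map" maps each node id (e.g. "model.analytics.stg_orders")
    # to the node ids that depend on it directly; walk it breadth-first.
    with open(manifest_path) as f:
        child_map = json.load(f)["child_map"]
    seen, queue = set(), deque([node_id])
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# e.g. downstream("target/manifest.json", "model.analytics.stg_orders")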

If you use DataHub, you can ingest dbt artifacts with its ingestion framework.

# datahub-dbt-ingest.yaml
source:
  type: dbt
  config:
    manifest_path: "s3://data-metadata/dbt/prod/manifest.json"
    catalog_path: "s3://data-metadata/dbt/prod/catalog.json"
    run_results_paths:
      - "s3://data-metadata/dbt/prod/run_results-*.json"
    target_platform: "snowflake"
    env: "PROD"

sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"

Non-negotiable: propagate a single run identifier across orchestrator → dbt/spark → lineage events. If you can’t correlate runs, you’ll still be guessing.
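
One low-tech way to do that from Airflow is to template the DAG’s run ID into everything a task launches. A sketch (operator choices and variable names are just examples, not the only way to wire it):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("orders_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --target prod",
        # DBT_ENV_CUSTOM_ENV_* values are recorded in dbt's artifact metadata
        env={"DBT_ENV_CUSTOM_ENV_pipeline_run_id": "{{ run_id }}"},
        append_env=True,
    )

    spark_enrich = BashOperator(
        task_id="spark_enrich",
        # the Spark job reads PIPELINE_RUN_ID and tags its outputs/metrics with it
        bash_command="spark-submit enrich_orders.py",
        env={"PIPELINE_RUN_ID": "{{ run_id }}"},
        append_env=True,
    )

    dbt_build >> spark_enrich

The same PIPELINE_RUN_ID can feed the Great Expectations run_name_template shown later, which is what lets you jump from a dataset run straight to its validation results.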

Impact analysis: make “blast radius” a first-class check

Impact analysis is where senior engineering leaders see ROI fast, because it turns risky changes into controlled ones.

A pattern we implement at GitPlumbers:

  1. On PRs that touch dbt models or warehouse DDL, run a CI job that:
    • Detects changed models/columns
    • Queries lineage to compute downstream dependencies
    • Fails the build (or requires approval) if Tier-0 assets are impacted
  2. Post the blast radius into the PR as a comment (engineers actually read it)

Here’s an example using DataHub GraphQL to fetch downstream datasets.

curl -s -X POST "http://datahub-gms:8080/api/graphql" \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "query downstream($urn: String!) {\n  lineage(input: { urn: $urn, direction: DOWNSTREAM, start: 0, count: 50 }) {\n    relationships { entity { urn type } }\n  }\n}",
    "variables": {
      "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,PROD.analytics.orders,PROD)"
    }
  }' | jq

Operationalize it with simple policy:

  • If downstream includes finance.* or exec_kpis.*, require:
    • a migration plan
    • a backfill estimate
    • and an owner sign-off

That’s how you stop “harmless refactors” from nuking quarter-end reporting.
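
A minimal sketch of that gate as a CI step, assuming the lineage query’s downstream URNs have been dumped to a JSON file (the Tier-0 patterns and file path are placeholders for your own policy):

import fnmatch
import json
import sys

TIER0_PATTERNS = ["*finance.*", "*exec_kpis.*"]   # placeholder policy patterns

def main(downstream_file: str) -> int:
    # downstream.json: a flat list of dataset URNs from the lineage query in CI
    with open(downstream_file) as f:
        urns = json.load(f)

    impacted = [u for u in urns if any(fnmatch.fnmatch(u, p) for p in TIER0_PATTERNS)]
    if impacted:
        print("Tier-0 assets in blast radius; approval, migration plan, and backfill estimate required:")
        for urn in impacted:
            print(f"  - {urn}")
        return 1   # non-zero exit fails the PR check

    print("No Tier-0 impact detected.")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "downstream.json"))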

Debugging workflow: from broken dashboard to the exact bad run

When lineage is wired correctly, the incident workflow becomes boring (the best kind of reliability).

A practical playbook:

  1. Start at the symptom
    • Dashboard panel → metric definition → source dataset(s)
  2. Jump to lineage graph
    • Find upstream jobs/models that produced the dataset
  3. Identify the last known good run
    • Compare run timestamps and git SHAs
  4. Check data quality signals
    • Freshness, row counts, null rates, distribution drift
  5. Fix forward with confidence
    • Roll back the offending change or patch the transformation
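
Step 3 is just a query once run events are flowing. A sketch against Marquez’s REST API (the endpoint path and response fields are assumptions based on Marquez’s job-runs listing; verify against your backend’s API docs):

import requests

MARQUEZ = "http://marquez:5000/api/v1"

def recent_runs(namespace: str, job: str, limit: int = 20) -> list[dict]:
    # Assumed Marquez endpoint that lists recent runs for a job, newest first.
    resp = requests.get(f"{MARQUEZ}/namespaces/{namespace}/jobs/{job}/runs",
                        params={"limit": limit}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("runs", [])

def last_known_good(namespace: str, job: str) -> dict | None:
    # "Last known good" = the most recent run that completed successfully.
    for run in recent_runs(namespace, job):
        if run.get("state") == "COMPLETED":
            return run
    return None

# Compare timestamps, parameters, and git SHAs between the failing run and
# last_known_good("prod", "orders_pipeline.load_orders") to find what changed.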

You can attach data quality checks to lineage runs so you can answer: “did this dataset degrade, or did the business actually change?”

An example Great Expectations checkpoint tied to a pipeline run:

# great_expectations/checkpoints/orders_reliability.yml
name: orders_reliability
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-${PIPELINE_RUN_ID}"
validations:
  - batch_request:
      datasource_name: snowflake_ds
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: analytics.orders
    expectation_suite_name: orders_suite
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: notify_slack
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer

Now your lineage backend can store a link from the dataset run to the validation result. During an incident, you’re not asking “do we have tests?” — you’re clicking straight to the failing expectation.

Measurable outcomes we’ve seen after implementing this loop:

  • Data incident MTTR drops from hours to tens of minutes (common: 2.5h → 45m)
  • Fewer repeat incidents because the root cause is visible and documented
  • Better on-call load: fewer “all-hands data fire drills”

What to do when your lineage is garbage (and it will be at first)

Every lineage rollout hits the same landmines:

  • Inconsistent naming (prod_analytics.orders vs analytics.orders_prod): fix with conventions and normalization.
  • Phantom dependencies from dynamic SQL: tag them as “unknown edges” until you can instrument query logs.
  • Missing ownership: no owner means no reliability. Add owner as metadata and enforce it for Tier-0.
  • AI-generated pipelines (“vibe-coded dbt macros”): they often hide dependencies in Jinja spaghetti. We’ve had to refactor these into explicit models just to get trustworthy lineage.

Here’s what actually works:

  • Start with Tier-0 datasets (exec KPIs, billing, rev-rec) and instrument end-to-end.
  • Create a simple contract: no Tier-0 table without an owner, SLO tier, and lineage coverage.
  • Use warehouse query logs (Snowflake ACCOUNT_USAGE, BigQuery INFORMATION_SCHEMA) to backfill lineage when instrumentation is incomplete.
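
For that last bullet, table-to-table edges can be recovered from Snowflake’s SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY view (it lags by a few hours). A rough sketch of the extraction, wrapped for a backfill script; the query follows the documented ACCESS_HISTORY shape, but verify column names against your account:

# Backfill "query wrote Y while reading X" edges from Snowflake's access history.
BACKFILL_EDGES_SQL = """
SELECT
    ah.query_id,
    ah.query_start_time,
    wr.value:"objectName"::string AS output_table,
    rd.value:"objectName"::string AS input_table
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.objects_modified)        AS wr,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS rd
WHERE ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND wr.value:"objectName" IS NOT NULL
  AND rd.value:"objectName" IS NOT NULL
"""

def backfill_edges(cursor) -> list[tuple[str, str]]:
    # cursor: any DB-API cursor from the Snowflake connector.
    cursor.execute(BACKFILL_EDGES_SQL)
    # (input_table, output_table) pairs -- load these as "unresolved" edges in your backend.
    return [(row[3], row[2]) for row in cursor.fetchall()]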

The reliability ROI: fewer broken dashboards, safer changes, faster delivery

Lineage isn’t “governance.” It’s a reliability control.

When you can answer impact analysis questions quickly, you ship changes with less fear. When you can debug from a metric back to a run ID and git SHA, you stop burning senior time on guesswork.

At GitPlumbers, we usually implement a first usable lineage system in 2–4 weeks for an existing stack (Airflow + dbt + Snowflake/BigQuery), then iterate:

  • Week 1: instrument orchestrator + dbt artifacts, normalize dataset IDs
  • Week 2: lineage backend + ownership + basic blast radius query
  • Weeks 3–4: wire into CI, add Tier-0 quality gates, tighten run correlation

If you’re dealing with legacy pipelines, half-migrated “modern data stack,” or AI-assisted code that nobody fully trusts, this is one of the highest-leverage reliability upgrades you can make.

If you want a second set of eyes: GitPlumbers helps teams retrofit lineage, observability, and quality gates into messy real-world stacks — without stopping delivery. Start with a lineage/impact-analysis assessment and we’ll tell you what’s worth doing, what’s theater, and where the bodies are buried.

Key takeaways

  • Lineage is only useful if it’s tied to operational workflows: incident response, change management, and ownership.
  • Start with **job-level lineage** (fast to implement), then graduate to **column-level lineage** where it pays for itself (finance metrics, core KPIs).
  • OpenLineage events + a lineage backend (Marquez or DataHub) is a pragmatic spine for multi-tool stacks (Airflow/dbt/Spark).
  • Impact analysis becomes a product feature for engineers: “show me blast radius” before merges and before deploys.
  • Teams that operationalize lineage typically cut data incident MTTR by 30–70% within a quarter.

Implementation checklist

  • Pick a lineage backbone: `OpenLineage` + `Marquez` or `DataHub`
  • Instrument orchestrator runs (`Airflow`/`Dagster`) to emit run IDs and lineage events
  • Ingest transformation lineage (`dbt` artifacts) and compute lineage (`Spark` listener)
  • Normalize dataset identifiers (warehouse + schema + table) and enforce naming conventions
  • Attach ownership (`oncall`, `slack`, `team`) and SLO tier to critical datasets
  • Create a blast-radius query/API for impact analysis and wire it into PR checks
  • Add incident drill workflow: dashboard → metric → dataset → upstream job runs → bad change
  • Track outcomes: MTTR, number of incidents, change failure rate, time-to-impact analysis

Questions we hear from teams

Do we need column-level lineage to get value?
No. Start with **job-level lineage** (tables in/out + run IDs). Add **column-level lineage** only for Tier-0 domains (finance, billing, core KPIs) where it materially reduces incident time or change risk.
Marquez or DataHub — which should we pick?
If your main goal is OpenLineage-based run + dataset lineage with minimal surface area, `Marquez` is straightforward. If you also want broader metadata (ownership, glossary, search, schema history) across many systems, `DataHub` usually wins. We often start with the simplest option and integrate upward.
How do we handle dynamic SQL and “unknown” dependencies?
Treat them as first-class: emit lineage edges marked as **unknown/unresolved**, then backfill with warehouse query logs (`Snowflake ACCOUNT_USAGE`, `BigQuery INFORMATION_SCHEMA`). Over time, refactor the worst offenders (often AI-generated macros) into explicit, testable models.
