The KPI Broke at 9:03 AM — Lineage Was the Only Thing Between Us and Guesswork
If you can’t answer “what depends on this table/column?” in under 60 seconds, you don’t have data reliability — you have data vibes. Here’s how we implement lineage that actually enables impact analysis and fast debugging.
The morning your CEO pings you about “wrong numbers”
I’ve watched this exact scene play out at fintechs, retailers, and a couple of “we’re basically a data company” SaaS shops: it’s 9:03 AM, the exec dashboard is off by 12%, and someone asks the question that should be easy — “what changed?”
If your answer is some blend of SELECT *, Slack archaeology, and hoping the one staff engineer who “knows the pipelines” isn’t on PTO, you don’t have observability. You have folklore.
Data lineage is what turns that chaos into a tractable debugging exercise: you can trace the metric back to the datasets and jobs that produced it, see what ran, what changed, and what else is now at risk.
Lineage that matters: impact analysis and debugging (not slideware)
Let’s be blunt: lineage diagrams that look pretty in a governance deck but don’t help on-call are a waste of money.
The lineage you want has two operational use cases:
- Impact analysis (pre-change): “If I alter `orders.discount_amount`, what dashboards/models break?”
- Debugging (post-incident): “This metric is wrong; show me the upstream run, inputs, and last known good state.”
To do that, lineage needs a few things beyond a DAG screenshot:
- Stable dataset identity: `warehouse.schema.table` (and ideally column IDs)
- Run context: a unique `run_id`, timestamps, orchestrator task IDs, git SHA, and environment (prod/staging)
- Ownership + tiering: who gets paged, and whether this dataset is a Tier-0 KPI or a nice-to-have
- Queryable graph: an API/graph store you can query in seconds, not a PDF
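Concretely, the sketch below (Python, with field names that are our conventions rather than any standard) is roughly the metadata we require per dataset and per run before a lineage backend enters the picture.

```python
# A minimal sketch of per-dataset / per-run metadata that makes lineage queryable.
# Field names (dataset_id, run_id, tier, owner) are our conventions, not a standard.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    platform: str   # e.g. "snowflake"
    database: str   # e.g. "prod"
    schema: str     # e.g. "analytics"
    table: str      # e.g. "orders"

    @property
    def dataset_id(self) -> str:
        return f"{self.platform}.{self.database}.{self.schema}.{self.table}".lower()


@dataclass
class RunContext:
    run_id: str            # one ID shared by orchestrator, dbt/Spark, and lineage events
    task_id: str           # orchestrator task identifier
    git_sha: str           # code version that produced the data
    environment: str       # "prod" / "staging"
    started_at: str        # ISO-8601 timestamps
    ended_at: str | None = None


@dataclass
class DatasetMetadata:
    ref: DatasetRef
    owner: str             # who gets paged
    tier: int              # 0 = exec KPI / revenue, 2 = nice-to-have
    tags: list[str] = field(default_factory=list)  # e.g. ["pii", "finance"]
```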
The reliability payoff is measurable. When teams implement this well, we typically see:
- 30–70% reduction in MTTR for data incidents (because you stop guessing)
- Fewer “blast radius” surprises during schema changes (change failure rate drops)
- Higher trust and faster delivery: analytics teams spend less time reconciling and more time shipping
The minimal viable lineage architecture (that doesn’t become a science project)
Here’s what actually works in the real world when you have Airflow, dbt, some Spark, and a warehouse like Snowflake or BigQuery.
- Emit lineage events from the systems that know what happened
  - Orchestrator: `Airflow` (or `Dagster`)
  - Transformation layer: `dbt`
  - Compute: `Spark` (including EMR/Databricks)
- Ingest into a lineage backend
  - Pragmatic options: `Marquez` (OpenLineage native) or `DataHub` (broader metadata)
- Enrich the graph
  - Ownership from your org model (PagerDuty, Opsgenie, Teams)
  - Business metadata (domain, KPI tier, PII tags)
- Expose “blast radius”
  - CLI/API that answers dependency questions quickly
  - Optional: wire into CI for “unsafe change” detection
The trick: start with job-level lineage. It’s fast and gives you 80% of the value. Add column-level lineage only where it pays (finance, rev-rec, core product KPIs). I’ve seen teams boil the ocean with column-level parsing and end up with nothing shippable.
Implementation: OpenLineage with Airflow + Spark + dbt (concrete wiring)
Airflow: emit OpenLineage events
If you’re on apache-airflow>=2.3, you can add OpenLineage via provider integrations. A common pattern is to use the OpenLineage listener so every task run produces consistent events.
pip install "openlineage-airflow>=1.9.0" # version as of many 2024/2025 deployments# airflow.cfg
[openlineage]
transport = http
url = http://marquez:5000/api/v1/namespaces/prod/events
api_key =Now each task run can emit:
- `inputs`: tables read
- `outputs`: tables written
- `run`: `runId`, `startTime`, `endTime`, status
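For one-off jobs that the Airflow and Spark integrations don’t cover, you can emit the same event shape yourself with the `openlineage-python` client. A minimal sketch, using class names from the ~1.x client (verify against the version you pin); the namespaces and job name are examples:

```python
# Emitting an OpenLineage run event by hand for a job outside Airflow/Spark coverage.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez:5000")  # same Marquez endpoint as above

run_id = str(uuid4())  # correlate this with your orchestrator's run identifier
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="prod", name="adhoc.orders_backfill"),
    producer="https://example.com/adhoc-scripts",  # any URI identifying the emitter
    inputs=[Dataset(namespace="snowflake://acct", name="analytics.raw_orders")],
    outputs=[Dataset(namespace="snowflake://acct", name="analytics.orders")],
)
client.emit(event)
```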
Spark: capture datasets read/written
For Spark jobs (Databricks, EMR, K8s Spark Operator), attach an OpenLineage listener.
spark-submit \
--packages io.openlineage:openlineage-spark_2.12:1.9.0 \
--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
--conf spark.openlineage.transport.type=http \
--conf spark.openlineage.transport.url=http://marquez:5000/api/v1/namespaces/prod/events \
--conf spark.openlineage.namespace=prod \
your_job.py

This is where lineage starts paying off in debugging. When someone “optimizes” a Spark job and accidentally changes join semantics, you can see exactly which inputs/outputs changed and correlate to the run.
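If you build the session inside PySpark (Databricks jobs, notebooks) instead of via `spark-submit`, the same settings apply as Spark confs. A sketch assuming the Marquez endpoint above and a Scala 2.12 build; adjust the artifact coordinates to your cluster:

```python
# Same OpenLineage listener wiring, set programmatically before the session starts.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_enrichment")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000/api/v1/namespaces/prod/events")
    .config("spark.openlineage.namespace", "prod")
    .getOrCreate()
)
```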
dbt: ingest artifacts (and don’t forget git SHA)
dbt already knows model dependencies. You just need to publish them. At minimum, store manifest.json + run_results.json somewhere durable (S3/GCS) and ingest into your metadata store.
dbt deps
DBT_ENV_CUSTOM_ENV_run_sha=$(git rev-parse --short HEAD) dbt build --target prod
aws s3 cp target/manifest.json s3://data-metadata/dbt/prod/manifest.json
aws s3 cp target/run_results.json s3://data-metadata/dbt/prod/run_results-$(date +%F).json

If you use DataHub, you can ingest dbt artifacts with its ingestion framework.
# datahub-dbt-ingest.yaml
source:
  type: dbt
  config:
    manifest_path: "s3://data-metadata/dbt/prod/manifest.json"
    catalog_path: "s3://data-metadata/dbt/prod/catalog.json"
    run_results_paths:
      - "s3://data-metadata/dbt/prod/run_results-*.json"
    target_platform: "snowflake"
    env: "PROD"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"

Non-negotiable: propagate a single run identifier across orchestrator → dbt/spark → lineage events. If you can’t correlate runs, you’ll still be guessing.
Impact analysis: make “blast radius” a first-class check
Impact analysis is where senior engineering leaders see ROI fast, because it turns risky changes into controlled ones.
A pattern we implement at GitPlumbers:
- On PRs that touch `dbt` models or warehouse DDL, run a CI job that:
  - Detects changed models/columns
  - Queries lineage to compute downstream dependencies
  - Fails the build (or requires approval) if Tier-0 assets are impacted
- Post the blast radius into the PR as a comment (engineers actually read it)
Here’s an example using DataHub GraphQL to fetch downstream datasets.
curl -s -X POST "http://datahub-gms:8080/api/graphql" \
-H 'Content-Type: application/json' \
-d '{
"query": "query downstream($urn: String!) {\n lineage(input: { urn: $urn, direction: DOWNSTREAM, start: 0, count: 50 }) {\n relationships { entity { urn type } }\n }\n}",
"variables": {
"urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,PROD.analytics.orders,PROD)"
}
}' | jq

Operationalize it with a simple policy:
- If downstream includes `finance.*` or `exec_kpis.*`, require:
  - a migration plan
  - a backfill estimate
  - an owner sign-off
That’s how you stop “harmless refactors” from nuking quarter-end reporting.
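In CI, that policy can be a short script against the same GraphQL endpoint as the curl example above. A sketch; the Tier-0 patterns, environment variable, and CLI shape are ours to define:

```python
# Fail the build when a changed dataset has Tier-0 downstream dependencies in DataHub.
import fnmatch
import os
import sys

import requests

DATAHUB_GRAPHQL = os.environ.get("DATAHUB_GRAPHQL", "http://datahub-gms:8080/api/graphql")
TIER0_PATTERNS = ["*finance.*", "*exec_kpis.*"]

QUERY = """
query downstream($urn: String!) {
  lineage(input: { urn: $urn, direction: DOWNSTREAM, start: 0, count: 50 }) {
    relationships { entity { urn type } }
  }
}
"""


def downstream_urns(dataset_urn: str) -> list[str]:
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"urn": dataset_urn}},
        timeout=30,
    )
    resp.raise_for_status()
    rels = resp.json()["data"]["lineage"]["relationships"]
    return [r["entity"]["urn"] for r in rels]


def main(changed_urns: list[str]) -> None:
    impacted = [
        urn
        for changed in changed_urns
        for urn in downstream_urns(changed)
        if any(fnmatch.fnmatch(urn.lower(), p) for p in TIER0_PATTERNS)
    ]
    if impacted:
        print("Tier-0 blast radius detected; require sign-off:")
        print("\n".join(sorted(set(impacted))))
        sys.exit(1)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Wire it to run on PRs that touch dbt models or DDL; the non-zero exit code is what gates the merge.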
Debugging workflow: from broken dashboard to the exact bad run
When lineage is wired correctly, the incident workflow becomes boring (the best kind of reliability).
A practical playbook:
- Start at the symptom: dashboard panel → metric definition → source dataset(s)
- Jump to the lineage graph: find upstream jobs/models that produced the dataset
- Identify the last known good run: compare run timestamps and git SHAs
- Check data quality signals: freshness, row counts, null rates, distribution drift
- Fix forward with confidence: roll back the offending change or patch the transformation
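Step three (“identify the last known good run”) is the one worth scripting. A sketch against Marquez’s REST API; the endpoint and field names match recent Marquez releases but are worth verifying against yours, and the namespace/job names are examples:

```python
# Find the most recent successful run of a job in Marquez.
import requests

MARQUEZ = "http://marquez:5000/api/v1"


def last_known_good_run(namespace: str, job: str) -> dict | None:
    """Return the most recent COMPLETED run for a job, or None if there isn't one."""
    resp = requests.get(f"{MARQUEZ}/namespaces/{namespace}/jobs/{job}/runs", timeout=30)
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    # Sort defensively by creation time in case the API's ordering changes
    runs.sort(key=lambda r: r.get("createdAt", ""), reverse=True)
    for run in runs:
        if run.get("state") == "COMPLETED":
            return run
    return None


good = last_known_good_run("prod", "dbt.analytics.orders")
if good:
    print(good["id"], good.get("endedAt"))
```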
You can attach data quality checks to lineage runs so you can answer: “did this dataset degrade, or did the business actually change?”
Example with Great Expectations checkpoint tied to a pipeline run:
# great_expectations/checkpoints/orders_reliability.yml
name: orders_reliability
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-${PIPELINE_RUN_ID}"
validations:
  - batch_request:
      datasource_name: snowflake_ds
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: analytics.orders
    expectation_suite_name: orders_suite
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: notify_slack
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}

Now your lineage backend can store a link from the dataset run to the validation result. During an incident, you’re not asking “do we have tests?” — you’re clicking straight to the failing expectation.
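To stamp the validation with the same run ID as the lineage events, trigger the checkpoint where that ID is already in the environment. A sketch using the Great Expectations 0.15+ Python API (`gx.get_context`, `run_checkpoint`); it assumes the `${PIPELINE_RUN_ID}` substitution in the checkpoint’s `run_name_template` resolves from that environment variable:

```python
# Run the checkpoint from the pipeline so the validation result and the lineage
# events share a run identifier.
import os

import great_expectations as gx

os.environ.setdefault("PIPELINE_RUN_ID", "manual-debug")  # normally injected by the orchestrator

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="orders_reliability")

if not result.success:
    # Fail the task so the orchestrator marks the run (and the dataset) as degraded
    raise RuntimeError("orders_reliability checkpoint failed; see the validation results")
```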
Measurable outcomes we’ve seen after implementing this loop:
- Data incident MTTR drops from hours to tens of minutes (common: 2.5h → 45m)
- Fewer repeat incidents because the root cause is visible and documented
- Better on-call load: fewer “all-hands data fire drills”
What to do when your lineage is garbage (and it will be at first)
Every lineage rollout hits the same landmines:
- Inconsistent naming (`prod_analytics.orders` vs `analytics.orders_prod`): fix with conventions and normalization (see the sketch after this list).
- Phantom dependencies from dynamic SQL: tag them as “unknown edges” until you can instrument query logs.
- Missing ownership: no owner means no reliability. Add owner as metadata and enforce it for Tier-0.
- AI-generated pipelines (“vibe-coded dbt macros”): they often hide dependencies in Jinja spaghetti. We’ve had to refactor these into explicit models just to get trustworthy lineage.
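The naming fix is boring but worth writing down. A sketch of dataset-ID normalization; the canonical form and alias map are conventions to adapt to your warehouse, not a standard:

```python
# Map the messy names emitted by different tools onto one canonical identifier.
ALIASES = {
    "prod_analytics.orders": "snowflake.prod.analytics.orders",
    "analytics.orders_prod": "snowflake.prod.analytics.orders",
}


def normalize_dataset_id(raw: str, default_platform: str = "snowflake", default_db: str = "prod") -> str:
    key = raw.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    parts = key.split(".")
    if len(parts) == 2:    # schema.table: assume default platform and database
        parts = [default_platform, default_db] + parts
    elif len(parts) == 3:  # database.schema.table: assume default platform
        parts = [default_platform] + parts
    return ".".join(parts)


assert normalize_dataset_id("PROD_ANALYTICS.ORDERS") == "snowflake.prod.analytics.orders"
assert normalize_dataset_id("analytics.orders") == "snowflake.prod.analytics.orders"
```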
Here’s what actually works:
- Start with Tier-0 datasets (exec KPIs, billing, rev-rec) and instrument end-to-end.
- Create a simple contract: no Tier-0 table without an owner, SLO tier, and lineage coverage.
- Use warehouse query logs (`Snowflake ACCOUNT_USAGE`, `BigQuery INFORMATION_SCHEMA`) to backfill lineage when instrumentation is incomplete (example below).
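For Snowflake, that backfill can come straight from `SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY`. A sketch; the column names are Snowflake’s, while the connection details, service user, and seven-day window are placeholders:

```python
# Derive table-level lineage edges (source -> target) from Snowflake query logs.
import os

import snowflake.connector

EDGE_SQL = """
select
    w.value:"objectName"::string  as target_table,
    r.value:"objectName"::string  as source_table,
    max(ah.query_start_time)      as last_seen
from snowflake.account_usage.access_history ah,
     lateral flatten(input => ah.objects_modified)        w,
     lateral flatten(input => ah.direct_objects_accessed) r
where ah.query_start_time >= dateadd('day', -7, current_timestamp())
group by 1, 2
"""

conn = snowflake.connector.connect(
    account="your_account",
    user="lineage_svc",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="lineage_wh",
)
try:
    for target, source, last_seen in conn.cursor().execute(EDGE_SQL):
        # Emit/merge these as "derived from query logs" edges in your lineage backend
        print(f"{source} -> {target} (last seen {last_seen})")
finally:
    conn.close()
```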
The reliability ROI: fewer broken dashboards, safer changes, faster delivery
Lineage isn’t “governance.” It’s a reliability control.
When you can answer impact analysis questions quickly, you ship changes with less fear. When you can debug from a metric back to a run ID and git SHA, you stop burning senior time on guesswork.
At GitPlumbers, we usually implement a first usable lineage system in 2–4 weeks for an existing stack (Airflow + dbt + Snowflake/BigQuery), then iterate:
- Week 1: instrument orchestrator + dbt artifacts, normalize dataset IDs
- Week 2: lineage backend + ownership + basic blast radius query
- Weeks 3–4: wire into CI, add Tier-0 quality gates, tighten run correlation
If you’re dealing with legacy pipelines, half-migrated “modern data stack,” or AI-assisted code that nobody fully trusts, this is one of the highest-leverage reliability upgrades you can make.
If you want a second set of eyes: GitPlumbers helps teams retrofit lineage, observability, and quality gates into messy real-world stacks — without stopping delivery. Start with a lineage/impact-analysis assessment and we’ll tell you what’s worth doing, what’s theater, and where the bodies are buried.
Key takeaways
- Lineage is only useful if it’s tied to operational workflows: incident response, change management, and ownership.
- Start with **job-level lineage** (fast to implement), then graduate to **column-level lineage** where it pays for itself (finance metrics, core KPIs).
- OpenLineage events + a lineage backend (Marquez or DataHub) is a pragmatic spine for multi-tool stacks (Airflow/dbt/Spark).
- Impact analysis becomes a product feature for engineers: “show me blast radius” before merges and before deploys.
- Teams that operationalize lineage typically cut data incident MTTR by 30–70% within a quarter.
Implementation checklist
- Pick a lineage backbone: `OpenLineage` + `Marquez` or `DataHub`
- Instrument orchestrator runs (`Airflow`/`Dagster`) to emit run IDs and lineage events
- Ingest transformation lineage (`dbt` artifacts) and compute lineage (`Spark` listener)
- Normalize dataset identifiers (warehouse + schema + table) and enforce naming conventions
- Attach ownership (`oncall`, `slack`, `team`) and SLO tier to critical datasets
- Create a blast-radius query/API for impact analysis and wire it into PR checks
- Add incident drill workflow: dashboard → metric → dataset → upstream job runs → bad change
- Track outcomes: MTTR, number of incidents, change failure rate, time-to-impact analysis
Questions we hear from teams
- Do we need column-level lineage to get value?
- No. Start with **job-level lineage** (tables in/out + run IDs). Add **column-level lineage** only for Tier-0 domains (finance, billing, core KPIs) where it materially reduces incident time or change risk.
- Marquez or DataHub — which should we pick?
- If your main goal is OpenLineage-based run + dataset lineage with minimal surface area, `Marquez` is straightforward. If you also want broader metadata (ownership, glossary, search, schema history) across many systems, `DataHub` usually wins. We often start with the simplest option and integrate upward.
- How do we handle dynamic SQL and “unknown” dependencies?
- Treat them as first-class: emit lineage edges marked as **unknown/unresolved**, then backfill with warehouse query logs (`Snowflake ACCOUNT_USAGE`, `BigQuery INFORMATION_SCHEMA`). Over time, refactor the worst offenders (often AI-generated macros) into explicit, testable models.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
