The KPI Broke at 9:03 AM — Lineage Was the Only Thing Between Us and Guesswork
If you can’t answer “what depends on this table/column?” in under 60 seconds, you don’t have data reliability — you have data vibes. Here’s how we implement lineage that actually enables impact analysis and fast debugging.
The morning your CEO pings you about “wrong numbers”
I’ve watched this exact scene play out at fintechs, retailers, and a couple of “we’re basically a data company” SaaS shops: it’s 9:03 AM, the exec dashboard is off by 12%, and someone asks the question that should be easy — “what changed?”
If your answer is some blend of SELECT *, Slack archaeology, and hoping the one staff engineer who “knows the pipelines” isn’t on PTO, you don’t have observability. You have folklore.
Data lineage is what turns that chaos into a tractable debugging exercise: you can trace the metric back to the datasets and jobs that produced it, see what ran, what changed, and what else is now at risk.
Lineage that matters: impact analysis and debugging (not slideware)
Let’s be blunt: lineage diagrams that look pretty in a governance deck but don’t help on-call are a waste of money.
The lineage you want has two operational use cases:
- Impact analysis (pre-change): “If I alter `orders.discount_amount`, what dashboards/models break?”
- Debugging (post-incident): “This metric is wrong; show me the upstream run, inputs, and last known good state.”
To do that, lineage needs a few things beyond a DAG screenshot:
- Stable dataset identity: `warehouse.schema.table` (and ideally column IDs)
- Run context: a unique `run_id`, timestamps, orchestrator task IDs, git SHA, and environment (prod/staging)
- Ownership + tiering: who gets paged, and whether this dataset is a Tier-0 KPI or a nice-to-have
- Queryable graph: an API/graph store you can query in seconds, not a PDF
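Concretely, the sketch below (Python, with field names that are our conventions rather than any standard) is roughly the metadata we require per dataset and per run before a lineage backend enters the picture.

```python
# A minimal sketch of per-dataset / per-run metadata that makes lineage queryable.
# Field names (dataset_id, run_id, tier, owner) are our conventions, not a standard.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    platform: str   # e.g. "snowflake"
    database: str   # e.g. "prod"
    schema: str     # e.g. "analytics"
    table: str      # e.g. "orders"

    @property
    def dataset_id(self) -> str:
        return f"{self.platform}.{self.database}.{self.schema}.{self.table}".lower()


@dataclass
class RunContext:
    run_id: str            # one ID shared by orchestrator, dbt/Spark, and lineage events
    task_id: str           # orchestrator task identifier
    git_sha: str           # code version that produced the data
    environment: str       # "prod" / "staging"
    started_at: str        # ISO-8601 timestamps
    ended_at: str | None = None


@dataclass
class DatasetMetadata:
    ref: DatasetRef
    owner: str             # who gets paged
    tier: int              # 0 = exec KPI / revenue, 2 = nice-to-have
    tags: list[str] = field(default_factory=list)  # e.g. ["pii", "finance"]
```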
The reliability payoff is measurable. When teams implement this well, we typically see:
- 30–70% reduction in MTTR for data incidents (because you stop guessing)
- Fewer “blast radius” surprises during schema changes (change failure rate drops)
- Higher trust and faster delivery: analytics teams spend less time reconciling and more time shipping
The minimal viable lineage architecture (that doesn’t become a science project)
Here’s what actually works in the real world when you have Airflow, dbt, some Spark, and a warehouse like Snowflake or BigQuery.
- Emit lineage events from the systems that know what happened
  - Orchestrator: `Airflow` (or `Dagster`)
  - Transformation layer: `dbt`
  - Compute: `Spark` (including EMR/Databricks)
- Ingest into a lineage backend
  - Pragmatic options: `Marquez` (OpenLineage native) or `DataHub` (broader metadata)
- Enrich the graph
  - Ownership from your org model (PagerDuty, Opsgenie, Teams)
  - Business metadata (domain, KPI tier, PII tags)
- Expose “blast radius”
  - CLI/API that answers dependency questions quickly
  - Optional: wire into CI for “unsafe change” detection
The trick: start with job-level lineage. It’s fast and gives you 80% of the value. Add column-level lineage only where it pays (finance, rev-rec, core product KPIs). I’ve seen teams boil the ocean with column-level parsing and end up with nothing shippable.
Implementation: OpenLineage with Airflow + Spark + dbt (concrete wiring)
Airflow: emit OpenLineage events
If you’re on apache-airflow>=2.3, you can add OpenLineage via provider integrations. A common pattern is to use the OpenLineage listener so every task run produces consistent events.
pip install "openlineage-airflow>=1.9.0" # version as of many 2024/2025 deployments# airflow.cfg
[openlineage]
transport = http
url = http://marquez:5000/api/v1/namespaces/prod/events
api_key =Now each task run can emit:
- `inputs`: tables read
- `outputs`: tables written
- `run`: `runId`, `startTime`, `endTime`, status
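For one-off jobs that the Airflow and Spark integrations don’t cover, you can emit the same event shape yourself with the `openlineage-python` client. A minimal sketch, using class names from the ~1.x client (verify against the version you pin); the namespaces and job name are examples:

```python
# Emitting an OpenLineage run event by hand for a job outside Airflow/Spark coverage.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez:5000")  # same Marquez endpoint as above

run_id = str(uuid4())  # correlate this with your orchestrator's run identifier
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=run_id),
    job=Job(namespace="prod", name="adhoc.orders_backfill"),
    producer="https://example.com/adhoc-scripts",  # any URI identifying the emitter
    inputs=[Dataset(namespace="snowflake://acct", name="analytics.raw_orders")],
    outputs=[Dataset(namespace="snowflake://acct", name="analytics.orders")],
)
client.emit(event)
```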
Spark: capture datasets read/written
For Spark jobs (Databricks, EMR, K8s Spark Operator), attach an OpenLineage listener.
spark-submit \
--packages io.openlineage:openlineage-spark_2.12:1.9.0 \
--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
--conf spark.openlineage.transport.type=http \
--conf spark.openlineage.transport.url=http://marquez:5000/api/v1/namespaces/prod/events \
--conf spark.openlineage.namespace=prod \
your_job.py

This is where lineage starts paying off in debugging. When someone “optimizes” a Spark job and accidentally changes join semantics, you can see exactly which inputs/outputs changed and correlate to the run.
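If you build the session inside PySpark (Databricks jobs, notebooks) instead of via `spark-submit`, the same settings apply as Spark confs. A sketch assuming the Marquez endpoint above and a Scala 2.12 build; adjust the artifact coordinates to your cluster:

```python
# Same OpenLineage listener wiring, set programmatically before the session starts.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orders_enrichment")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.9.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000/api/v1/namespaces/prod/events")
    .config("spark.openlineage.namespace", "prod")
    .getOrCreate()
)
```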
dbt: ingest artifacts (and don’t forget git SHA)
dbt already knows model dependencies. You just need to publish them. At minimum, store manifest.json + run_results.json somewhere durable (S3/GCS) and ingest into your metadata store.
dbt deps
DBT_ENV_CUSTOM_ENV_run_sha=$(git rev-parse --short HEAD) dbt build --target prod
aws s3 cp target/manifest.json s3://data-metadata/dbt/prod/manifest.json
aws s3 cp target/run_results.json s3://data-metadata/dbt/prod/run_results-$(date +%F).json

If you use DataHub, you can ingest dbt artifacts with its ingestion framework.
# datahub-dbt-ingest.yaml
source:
  type: dbt
  config:
    manifest_path: "s3://data-metadata/dbt/prod/manifest.json"
    catalog_path: "s3://data-metadata/dbt/prod/catalog.json"
    run_results_paths:
      - "s3://data-metadata/dbt/prod/run_results-*.json"
    target_platform: "snowflake"
    env: "PROD"
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms:8080"

Non-negotiable: propagate a single run identifier across orchestrator → dbt/spark → lineage events. If you can’t correlate runs, you’ll still be guessing.
Impact analysis: make “blast radius” a first-class check
Impact analysis is where senior engineering leaders see ROI fast, because it turns risky changes into controlled ones.
A pattern we implement at GitPlumbers:
- On PRs that touch `dbt` models or warehouse DDL, run a CI job that:
  - Detects changed models/columns
  - Queries lineage to compute downstream dependencies
  - Fails the build (or requires approval) if Tier-0 assets are impacted
- Post the blast radius into the PR as a comment (engineers actually read it)
Here’s an example using DataHub GraphQL to fetch downstream datasets.
curl -s -X POST "http://datahub-gms:8080/api/graphql" \
-H 'Content-Type: application/json' \
-d '{
"query": "query downstream($urn: String!) {\n lineage(input: { urn: $urn, direction: DOWNSTREAM, start: 0, count: 50 }) {\n relationships { entity { urn type } }\n }\n}",
"variables": {
"urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,PROD.analytics.orders,PROD)"
}
}' | jq

Operationalize it with a simple policy:
- If downstream includes `finance.*` or `exec_kpis.*`, require:
  - a migration plan
  - a backfill estimate
  - an owner sign-off
That’s how you stop “harmless refactors” from nuking quarter-end reporting.
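In CI, that policy can be a short script against the same GraphQL endpoint as the curl example above. A sketch; the Tier-0 patterns, environment variable, and CLI shape are ours to define:

```python
# Fail the build when a changed dataset has Tier-0 downstream dependencies in DataHub.
import fnmatch
import os
import sys

import requests

DATAHUB_GRAPHQL = os.environ.get("DATAHUB_GRAPHQL", "http://datahub-gms:8080/api/graphql")
TIER0_PATTERNS = ["*finance.*", "*exec_kpis.*"]

QUERY = """
query downstream($urn: String!) {
  lineage(input: { urn: $urn, direction: DOWNSTREAM, start: 0, count: 50 }) {
    relationships { entity { urn type } }
  }
}
"""


def downstream_urns(dataset_urn: str) -> list[str]:
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={"query": QUERY, "variables": {"urn": dataset_urn}},
        timeout=30,
    )
    resp.raise_for_status()
    rels = resp.json()["data"]["lineage"]["relationships"]
    return [r["entity"]["urn"] for r in rels]


def main(changed_urns: list[str]) -> None:
    impacted = [
        urn
        for changed in changed_urns
        for urn in downstream_urns(changed)
        if any(fnmatch.fnmatch(urn.lower(), p) for p in TIER0_PATTERNS)
    ]
    if impacted:
        print("Tier-0 blast radius detected; require sign-off:")
        print("\n".join(sorted(set(impacted))))
        sys.exit(1)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Wire it to run on PRs that touch dbt models or DDL; the non-zero exit code is what gates the merge.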
Debugging workflow: from broken dashboard to the exact bad run
When lineage is wired correctly, the incident workflow becomes boring (the best kind of reliability).
A practical playbook:
- Start at the symptom: dashboard panel → metric definition → source dataset(s)
- Jump to the lineage graph: find upstream jobs/models that produced the dataset
- Identify the last known good run: compare run timestamps and git SHAs
- Check data quality signals: freshness, row counts, null rates, distribution drift
- Fix forward with confidence: roll back the offending change or patch the transformation
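Step three (“identify the last known good run”) is the one worth scripting. A sketch against Marquez’s REST API; the endpoint and field names match recent Marquez releases but are worth verifying against yours, and the namespace/job names are examples:

```python
# Find the most recent successful run of a job in Marquez.
import requests

MARQUEZ = "http://marquez:5000/api/v1"


def last_known_good_run(namespace: str, job: str) -> dict | None:
    """Return the most recent COMPLETED run for a job, or None if there isn't one."""
    resp = requests.get(f"{MARQUEZ}/namespaces/{namespace}/jobs/{job}/runs", timeout=30)
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    # Sort defensively by creation time in case the API's ordering changes
    runs.sort(key=lambda r: r.get("createdAt", ""), reverse=True)
    for run in runs:
        if run.get("state") == "COMPLETED":
            return run
    return None


good = last_known_good_run("prod", "dbt.analytics.orders")
if good:
    print(good["id"], good.get("endedAt"))
```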
You can attach data quality checks to lineage runs so you can answer: “did this dataset degrade, or did the business actually change?”
Example with Great Expectations checkpoint tied to a pipeline run:
# great_expectations/checkpoints/orders_reliability.yml
name: orders_reliability
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-${PIPELINE_RUN_ID}"
validations:
  - batch_request:
      datasource_name: snowflake_ds
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: analytics.orders
    expectation_suite_name: orders_suite
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: notify_slack
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}

Now your lineage backend can store a link from the dataset run to the validation result. During an incident, you’re not asking “do we have tests?” — you’re clicking straight to the failing expectation.
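To stamp the validation with the same run ID as the lineage events, trigger the checkpoint where that ID is already in the environment. A sketch using the Great Expectations 0.15+ Python API (`gx.get_context`, `run_checkpoint`); it assumes the `${PIPELINE_RUN_ID}` substitution in the checkpoint’s `run_name_template` resolves from that environment variable:

```python
# Run the checkpoint from the pipeline so the validation result and the lineage
# events share a run identifier.
import os

import great_expectations as gx

os.environ.setdefault("PIPELINE_RUN_ID", "manual-debug")  # normally injected by the orchestrator

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="orders_reliability")

if not result.success:
    # Fail the task so the orchestrator marks the run (and the dataset) as degraded
    raise RuntimeError("orders_reliability checkpoint failed; see the validation results")
```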
Measurable outcomes we’ve seen after implementing this loop:
- Data incident MTTR drops from hours to tens of minutes (common: 2.5h → 45m)
- Fewer repeat incidents because the root cause is visible and documented
- Better on-call load: fewer “all-hands data fire drills”
What to do when your lineage is garbage (and it will be at first)
Every lineage rollout hits the same landmines:
- Inconsistent naming (`prod_analytics.orders` vs `analytics.orders_prod`): fix with conventions and normalization (see the sketch after this list).
- Phantom dependencies from dynamic SQL: tag them as “unknown edges” until you can instrument query logs.
- Missing ownership: no owner means no reliability. Add owner as metadata and enforce it for Tier-0.
- AI-generated pipelines (“vibe-coded dbt macros”): they often hide dependencies in Jinja spaghetti. We’ve had to refactor these into explicit models just to get trustworthy lineage.
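The naming fix is boring but worth writing down. A sketch of dataset-ID normalization; the canonical form and alias map are conventions to adapt to your warehouse, not a standard:

```python
# Map the messy names emitted by different tools onto one canonical identifier.
ALIASES = {
    "prod_analytics.orders": "snowflake.prod.analytics.orders",
    "analytics.orders_prod": "snowflake.prod.analytics.orders",
}


def normalize_dataset_id(raw: str, default_platform: str = "snowflake", default_db: str = "prod") -> str:
    key = raw.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    parts = key.split(".")
    if len(parts) == 2:    # schema.table: assume default platform and database
        parts = [default_platform, default_db] + parts
    elif len(parts) == 3:  # database.schema.table: assume default platform
        parts = [default_platform] + parts
    return ".".join(parts)


assert normalize_dataset_id("PROD_ANALYTICS.ORDERS") == "snowflake.prod.analytics.orders"
assert normalize_dataset_id("analytics.orders") == "snowflake.prod.analytics.orders"
```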
Here’s what actually works:
- Start with Tier-0 datasets (exec KPIs, billing, rev-rec) and instrument end-to-end.
- Create a simple contract: no Tier-0 table without an owner, SLO tier, and lineage coverage.
- Use warehouse query logs (`Snowflake ACCOUNT_USAGE`, `BigQuery INFORMATION_SCHEMA`) to backfill lineage when instrumentation is incomplete (example below).
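For Snowflake, that backfill can come straight from `SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY`. A sketch; the column names are Snowflake’s, while the connection details, service user, and seven-day window are placeholders:

```python
# Derive table-level lineage edges (source -> target) from Snowflake query logs.
import os

import snowflake.connector

EDGE_SQL = """
select
    w.value:"objectName"::string  as target_table,
    r.value:"objectName"::string  as source_table,
    max(ah.query_start_time)      as last_seen
from snowflake.account_usage.access_history ah,
     lateral flatten(input => ah.objects_modified)        w,
     lateral flatten(input => ah.direct_objects_accessed) r
where ah.query_start_time >= dateadd('day', -7, current_timestamp())
group by 1, 2
"""

conn = snowflake.connector.connect(
    account="your_account",
    user="lineage_svc",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="lineage_wh",
)
try:
    for target, source, last_seen in conn.cursor().execute(EDGE_SQL):
        # Emit/merge these as "derived from query logs" edges in your lineage backend
        print(f"{source} -> {target} (last seen {last_seen})")
finally:
    conn.close()
```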
The reliability ROI: fewer broken dashboards, safer changes, faster delivery
Lineage isn’t “governance.” It’s a reliability control.
When you can answer impact analysis questions quickly, you ship changes with less fear. When you can debug from a metric back to a run ID and git SHA, you stop burning senior time on guesswork.
At GitPlumbers, we usually implement a first usable lineage system in 2–4 weeks for an existing stack (Airflow + dbt + Snowflake/BigQuery), then iterate:
- Week 1: instrument orchestrator + dbt artifacts, normalize dataset IDs
- Week 2: lineage backend + ownership + basic blast radius query
- Weeks 3–4: wire into CI, add Tier-0 quality gates, tighten run correlation
If you’re dealing with legacy pipelines, half-migrated “modern data stack,” or AI-assisted code that nobody fully trusts, this is one of the highest-leverage reliability upgrades you can make.
If you want a second set of eyes: GitPlumbers helps teams retrofit lineage, observability, and quality gates into messy real-world stacks — without stopping delivery. Start with a lineage/impact-analysis assessment and we’ll tell you what’s worth doing, what’s theater, and where the bodies are buried.
Key takeaways
- Lineage is only useful if it’s tied to operational workflows: incident response, change management, and ownership.
- Start with **job-level lineage** (fast to implement), then graduate to **column-level lineage** where it pays for itself (finance metrics, core KPIs).
- OpenLineage events + a lineage backend (Marquez or DataHub) is a pragmatic spine for multi-tool stacks (Airflow/dbt/Spark).
- Impact analysis becomes a product feature for engineers: “show me blast radius” before merges and before deploys.
- Teams that operationalize lineage typically cut data incident MTTR by 30–70% within a quarter.
Implementation checklist
- Pick a lineage backbone: `OpenLineage` + `Marquez` or `DataHub`
- Instrument orchestrator runs (`Airflow`/`Dagster`) to emit run IDs and lineage events
- Ingest transformation lineage (`dbt` artifacts) and compute lineage (`Spark` listener)
- Normalize dataset identifiers (warehouse + schema + table) and enforce naming conventions
- Attach ownership (`oncall`, `slack`, `team`) and SLO tier to critical datasets
- Create a blast-radius query/API for impact analysis and wire it into PR checks
- Add incident drill workflow: dashboard → metric → dataset → upstream job runs → bad change
- Track outcomes: MTTR, number of incidents, change failure rate, time-to-impact analysis
Questions we hear from teams
- Do we need column-level lineage to get value?
- No. Start with **job-level lineage** (tables in/out + run IDs). Add **column-level lineage** only for Tier-0 domains (finance, billing, core KPIs) where it materially reduces incident time or change risk.
- Marquez or DataHub — which should we pick?
- If your main goal is OpenLineage-based run + dataset lineage with minimal surface area, `Marquez` is straightforward. If you also want broader metadata (ownership, glossary, search, schema history) across many systems, `DataHub` usually wins. We often start with the simplest option and integrate upward.
- How do we handle dynamic SQL and “unknown” dependencies?
- Treat them as first-class: emit lineage edges marked as **unknown/unresolved**, then backfill with warehouse query logs (`Snowflake ACCOUNT_USAGE`, `BigQuery INFORMATION_SCHEMA`). Over time, refactor the worst offenders (often AI-generated macros) into explicit, testable models.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
