The Lineage System That Turned 3‑Hour Fire Drills Into 15‑Minute Fixes
If you can’t trace where a number came from, you can’t trust it. Here’s how we build data lineage that makes impact analysis and debugging fast, boring, and reliable — without boiling the ocean.
“If you can’t trace where a number came from in under five minutes, you don’t own it — it owns you.”
The 2 a.m. dashboard page I couldn’t explain
Two quarters ago, an exec pinged me in Slack at 2:03 a.m.: “Why did revenue drop 18% today in the board deck?” I’d seen this movie. We had Snowflake, dbt, Airflow, Databricks, Kafka — all the right stickers. But no one could tell me which upstream change nuked the number. We had a pretty catalog, but it was aspirational, not operational.
We turned it around by installing runtime lineage end to end. Not a spreadsheet. Not a tribal-knowledge Miro board. Actual, queryable lineage emitted by the jobs that ran, tied to data quality checks and SLOs. Our MTTR fell from hours to minutes, and the “who touched what?” witch hunts stopped. Here’s exactly how we build that at GitPlumbers.
What good lineage actually looks like
Most lineage decks talk about boxes and arrows. The systems that work in production share a few non-negotiables:
- Runtime fidelity: lineage comes from what executed (Airflow, Spark, dbt) using `OpenLineage` events — not from hand-curated docs.
- Column-level detail: table-level lineage is fine for posters. For impact analysis, you need to know that `orders.total_usd` depends on `fx_rates.usd_rate` and `order_items.qty`.
- Cross-mode coverage: batch (dbt, Spark), streaming (Kafka, Flink), and BI extracts (Looker, Tableau) at minimum for your critical paths.
- Versioning and ownership: dataset `contract_version`, `owner`, `pii`, and `criticality` tags are queryable and enforced in CI.
- Queryable graph: lineage is stored in a system you can query via API to drive alerts, dashboards, and runbooks.
- Actionable hooks: tie lineage to `dbt test`, `Great Expectations`, Soda/Monte Carlo alerts, and SLOs like freshness and null rate.
If you don’t have these, you don’t have lineage — you have art.
Architecture that doesn’t crumble under load
I’ve watched teams spend a year building custom lineage that collapses on first contact with Databricks notebooks. Use the standards and glue that exist:
- Event schema: `OpenLineage` (1.16+) for consistent events across tools.
- Collectors:
  - `openlineage-airflow` for Airflow 2.7+ (captures DAG/task inputs and outputs).
  - `openlineage-dbt` for dbt Core 1.7+ (captures column lineage from the DAG plus the parser).
  - `openlineage-spark` for Spark 3.3+/Databricks 13+ (autocollects read/write ops).
  - Kafka Connect/Flink exporters where applicable.
- Lineage store / UI:
  - Open-source: `Marquez` (0.30+) — solid API, graph, and UI.
  - Enterprise: `Apache Atlas` or `DataHub` if you already run them.
- Quality and contracts:
  - `dbt test`, `Great Expectations`, or `Soda` for checks.
  - Data contracts in repo (`schema.yml`, `expectations.yml`) with CI gates.
- Cloud data platforms: Snowflake, BigQuery, or Databricks Lakehouse — use tags, masking policies, and audit logs to enrich lineage.
Choose one lineage store and make it the source of truth. Forking events into five catalogs guarantees drift.
Instrumentation: ship events, not hope
Get events flowing in a week. You don’t need to rewrite your stack.
Airflow (2.7+) with OpenLineage
- Install the provider and configure the backend: `pip install openlineage-airflow`.
- Set `OPENLINEAGE_URL` to your Marquez/DataHub/Atlas endpoint and `OPENLINEAGE_NAMESPACE` to something stable (e.g., `prod`).
- Airflow will emit events for tasks with `inlets`/`outlets`. For SQL operators, define inputs/outputs or wrap with lineage-aware operators.
Example Airflow task with explicit datasets:
```python
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator
from airflow.lineage.entities import Table

# airflow.lineage.entities.Table takes cluster/database/name keyword args,
# which the OpenLineage extractor maps to a dataset identifier.
extract = PythonOperator(
    task_id="load_fx_rates",
    python_callable=load_rates,
    outlets=[Table(cluster="warehouse", database="public", name="fx_rates")],
)

transform = PythonOperator(
    task_id="compute_orders",
    python_callable=build_orders,
    inlets=[
        Table(cluster="warehouse", database="public", name="fx_rates"),
        Table(cluster="warehouse", database="raw", name="orders"),
    ],
    outlets=[Table(cluster="warehouse", database="analytics", name="orders")],
)

chain(extract, transform)
```
dbt (Core 1.7+) with OpenLineage
- `pip install openlineage-dbt`
- Run models through the bundled `dbt-ol` wrapper (e.g., `dbt-ol run`) with the `OPENLINEAGE_*` env vars set; it wraps `dbt --log-format json` and parses the run artifacts.
- You’ll get model, source, and column-level lineage emitted per run.
Spark/Databricks (3.3+/DBR 13+)
- `spark-submit --packages io.openlineage:openlineage-spark_2.12:1.16.0`
- Configure `spark.openlineage.url` and a namespace (newer releases use `spark.openlineage.transport.url`; see the sketch below).
- Autocollects reads/writes from `DataFrame` ops, Delta I/O, etc.
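If you configure at the session (or cluster policy) level, every notebook and job inherits the listener. A minimal PySpark sketch, assuming a Marquez endpoint at `http://marquez:5000` and openlineage-spark 1.16 (config keys per that release’s docs; verify against yours):

```python
from pyspark.sql import SparkSession

# Attach the OpenLineage listener so reads/writes are reported automatically.
# Endpoint, namespace, app name, and lake paths are illustrative.
spark = (
    SparkSession.builder.appName("orders_nightly")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.16.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://marquez:5000")
    .config("spark.openlineage.namespace", "prod")
    .getOrCreate()
)

# Both the read and the write below show up as lineage events; no job changes needed.
orders = spark.read.parquet("/lake/raw/orders")
orders.write.mode("overwrite").parquet("/lake/analytics/orders")
```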
Streaming with Kafka
- Emit lineage for connectors (e.g., Debezium -> Kafka -> Spark Structured Streaming -> Delta Lake) via connectors’ OpenLineage plugins or a lightweight wrapper that reports topic in/out and schemas.
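When a hop has no collector, a thin wrapper can still report it. A sketch using the `openlineage-python` client (class names per its docs; the endpoint, job, and dataset names here are illustrative):

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://marquez:5000")  # illustrative endpoint

# One COMPLETE event per micro-batch or deploy: topic in, Delta table out.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="prod", name="stream.orders_cdc_to_delta"),
    producer="https://github.com/your-org/lineage-wrappers",  # hypothetical repo
    inputs=[Dataset(namespace="kafka://broker:9092", name="orders.cdc")],
    outputs=[Dataset(namespace="prod", name="lake.analytics.orders")],
)
client.emit(event)
```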
Enrich with tags and versions
- Add `contract_version`, `owner`, `pii`, and `criticality` as dataset facets. For Snowflake:

```sql
alter table analytics.orders set tag contract_version = '3';
alter table analytics.orders set tag owner = 'data-platform@company.com';
alter table analytics.orders set tag criticality = 'tier1';
```
Now your lineage graph is being built from what actually runs, not hopes and dreams.
Make impact analysis a 5‑minute task
Once events flow, wire impact analysis into your on-call. The playbook we deploy:
- When a test fails or an SLO is violated, query the lineage store to find all downstreams of the broken node at the column level.
- Page the right owner using tags and include the affected dashboards/models in the alert.
Example: investigate a sudden dip in `revenue_daily.total_usd`.

1. Find upstream dependencies of the column
   - Query the Marquez API for the lineage of `analytics.revenue_daily.total_usd` back to the root sources. You’ll typically see `fx_rates.usd_rate` and `orders.total_usd`.
2. Check recent changes
   - Pull the last successful runs that touched those columns: dbt `run_results.json`, Airflow DAG runs, Spark job IDs.
   - We often surface this via a simple Slack command: `!lineage analytics.revenue_daily.total_usd --since 24h` (a sketch of the query behind it follows this list).
3. Diff schema/contracts
   - Did someone rename `rate` to `usd_rate`? Did the null rate spike on `orders.total_usd`? Lineage + tests will show the exact node where distribution or schema changed.
4. Contain blast radius
   - Pause downstream DAGs that depend on the broken column (`airflow dags pause`), or deploy a canary fix with `dbt run --select state:modified`.
5. Fix forward with confidence
   - Update the transformation, add a test to prevent regression, and verify the freshness SLO is green before unpausing.
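That Slack command is a thin layer over the lineage API. A minimal sketch against Marquez (endpoint path and `nodeId` format per the Marquez docs; verify against your deployment):

```python
import requests

MARQUEZ = "http://marquez:5000"  # illustrative endpoint
node_id = "dataset:prod:analytics.revenue_daily"

# Pull the lineage graph a few hops out from the broken dataset.
resp = requests.get(
    f"{MARQUEZ}/api/v1/lineage",
    params={"nodeId": node_id, "depth": 3},
    timeout=10,
)
resp.raise_for_status()

# Each node is a DATASET or JOB; the edges give you the upstreams and
# downstreams to include in the page.
for node in resp.json()["graph"]:
    print(node["type"], node["id"])
```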
With this in place, the exec page becomes: “Revenue dip traced to missing `fx_rates` update; deploying fix; ETA 12 minutes.”
Quality and governance that ride on lineage
Lineage is the backbone; quality and governance make it useful.
SLOs linked to lineage
- Freshness: `analytics.revenue_daily` must be < 30m stale during business hours.
- Null rate: `orders.total_usd` nulls < 0.5%.
- Volume: row count delta within 3σ of the trailing 14 days.
- Alert rules include downstream impacted assets via a lineage graph query (see the sketch below).
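As a shape for the lineage-aware alert, a sketch where the staleness check enriches its own page with downstream assets (`downstreams_of` and `freshness_alert` are our hypothetical helpers, not a library API):

```python
from datetime import datetime, timedelta, timezone

def downstreams_of(dataset: str) -> list[str]:
    """Stub: in practice, a lineage query like the Marquez sketch above."""
    return ["looker.revenue_board", "analytics.revenue_weekly"]

def freshness_alert(dataset: str, last_loaded_at: datetime, max_staleness: timedelta):
    """Return an alert payload when the dataset breaches its freshness SLO."""
    staleness = datetime.now(timezone.utc) - last_loaded_at
    if staleness <= max_staleness:
        return None
    return {
        "dataset": dataset,
        "staleness_minutes": int(staleness.total_seconds() // 60),
        "impacted": downstreams_of(dataset),  # lands in the page, not a wiki
    }

print(freshness_alert(
    "analytics.revenue_daily",
    last_loaded_at=datetime.now(timezone.utc) - timedelta(minutes=45),
    max_staleness=timedelta(minutes=30),
))
```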
Tests that block bad merges
- `dbt test` or `Great Expectations` runs in CI on PRs that modify contracts.
- Use lineage to compute the minimum test set: only run tests for impacted models plus critical downstreams (sketched below).
Data contracts and schema control
- Store contracts next to code (`schema.yml`). Fail CI if a PR changes a `contract_version` without migration steps.
- For Snowflake, enforce tags via `TAG` policies and check in CI with `INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS` (see the gate sketch below).
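A minimal CI gate sketch using the Snowflake Python connector (the `TAG_REFERENCES_ALL_COLUMNS` table function is Snowflake’s; the required-tag list and connection details are our assumptions):

```python
import os
import snowflake.connector

REQUIRED = {"CONTRACT_VERSION", "OWNER", "CRITICALITY"}

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="ANALYTICS",
)
cur = conn.cursor()
cur.execute(
    "select tag_name from table("
    "information_schema.tag_references_all_columns('analytics.orders', 'table'))"
)
found = {row[0] for row in cur.fetchall()}

# Fail the build if a Tier 1 table lost any of its required tags.
missing = REQUIRED - found
if missing:
    raise SystemExit(f"analytics.orders missing required tags: {sorted(missing)}")
```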
PII and access
- Mark PII columns (`customer.email`) and propagate masks/policies downstream based on lineage. With Snowflake, attach a `MASKING POLICY` to source columns; use lineage to validate that derived columns retain policies or get blocked at review (sketched below).
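A sketch of that review gate, with plain dicts standing in for the lineage store and the policy catalog (in practice both come from APIs):

```python
# Column lineage: derived column -> its direct upstream columns.
column_parents = {
    "analytics.customers.email_hash": ["raw.customer.email"],
    "analytics.orders.total_usd": ["raw.orders.amount"],
}
pii_columns = {"raw.customer.email"}     # tagged pii=true at the source
masked_columns = {"raw.customer.email"}  # columns with a masking policy attached

def derived_from_pii(column: str) -> bool:
    """Walk upstream through column lineage looking for PII-tagged sources."""
    stack = list(column_parents.get(column, []))
    while stack:
        parent = stack.pop()
        if parent in pii_columns:
            return True
        stack.extend(column_parents.get(parent, []))
    return False

violations = [
    col for col in column_parents
    if derived_from_pii(col) and col not in masked_columns
]
print(violations)  # ['analytics.customers.email_hash'] -> block at review
```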
AI features sanity
- If you ship features to models via `feature_store.customer_ltv`, tie lineage to the model registry (`mlflow`) so you can answer: “Which data drifted?” and “Which upstream contract broke the model?” (a tagging sketch follows).
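A minimal sketch of the model-side half: stamp upstream dataset contract versions onto the training run so you can walk from a bad model version back to the data (the `lineage.input.*` tag names are our convention, not an `mlflow` API):

```python
import mlflow

with mlflow.start_run(run_name="ltv_model_train"):
    # Record the data contracts this training run consumed.
    mlflow.set_tags({
        "lineage.input.feature_store.customer_ltv": "contract_version=3",
        "lineage.input.analytics.orders": "contract_version=3",
    })
    # ... train the model, then log it to the registry, e.g.:
    # mlflow.sklearn.log_model(model, "model")
```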
A tiny bit of code that pays for itself
Sometimes you need to see the shape of the event. Here’s a trimmed `OpenLineage` event you’ll see from a dbt run touching column lineage:
```json
{
  "eventType": "COMPLETE",
  "job": {"namespace": "prod", "name": "dbt.analytics.orders"},
  "run": {"runId": "9e6c..."},
  "inputs": [{
    "namespace": "prod",
    "name": "raw.orders",
    "facets": {"schema": {"fields": [{"name": "total_usd"}]}}
  }],
  "outputs": [{
    "namespace": "prod",
    "name": "analytics.orders",
    "facets": {
      "columnLineage": {
        "fields": {
          "total_usd": {
            "inputFields": [
              {"namespace": "prod", "name": "raw.orders", "field": "amount"},
              {"namespace": "prod", "name": "warehouse.fx_rates", "field": "usd_rate"}
            ]
          }
        }
      }
    }
  }]
}
```
This is what lets you answer “If `fx_rates.usd_rate` is stale, which columns are wrong?” without guesswork.
Results you can take to the CFO
We’ve implemented this pattern at a fintech on Snowflake/dbt/Airflow and an adtech on Databricks/Spark/Dagster. The numbers were consistent:
- MTTR for data incidents: 3–6 hours ➝ 15–30 minutes (70–90% reduction).
- Data downtime (hours per month): down 40–60% once SLOs were wired to lineage.
- On-call load: 30% fewer pages because alerts were scoped to actually impacted assets.
- Deployment speed: PRs merged faster because CI surfaced contract breaks before they hit prod.
- Trust: “Unknown source” dashboard issues dropped to near zero within 60 days.
And yes, cloud costs went down. When you can see exactly which models feed a dashboard no one uses, you turn off a lot of waste.
30‑day rollout plan that survives reality
Week 1
- Stand up `Marquez` or point to your existing `Atlas`/`DataHub`.
- Instrument Airflow and dbt with `openlineage-*` collectors in the lowest-risk env.
- Pick 3 critical assets (revenue, signups, AI features) and tag them `criticality=tier1`.
Week 2
- Add Spark/Databricks collector for jobs that touch the 3 assets.
- Add basic SLOs (freshness, null) and route alerts to a lineage-aware Slack channel.
- Publish the lineage UI link in every relevant runbook.
Week 3
- Turn on column-level lineage for the 3 assets. Add `contract_version`, `owner`, `pii` tags.
- Wire CI checks to fail PRs that break contracts or remove owners.
- Add a Slack slash command to query lineage for a dataset/column.
Week 4
- Expand coverage to top 20 Tier 1 datasets.
- Start weekly coverage and MTTR reporting.
- Do one planned incident game day: break a contract on purpose and measure response.
If you can’t get this done in 30 days, your problem isn’t tooling — it’s ownership. Call us; we’ve fixed that, too.
Key takeaways
- Stop “catalog-first” lineage. Instrument at runtime with `OpenLineage` so lineage reflects what actually ran.
- Target **column-level** lineage across batch and streaming; table-level isn’t enough for impact analysis.
- Store lineage in a queryable graph (Marquez/Atlas/DataHub) and wire it into your incident workflows (Slack, on-call, runbooks).
- Make lineage actionable: connect to tests (`dbt test`, `Great Expectations`) and SLOs (data freshness, null rate, schema stability).
- Version and tag datasets (`contract_version`, `pii`, owners) to prevent silent breakage and accelerate debugging.
- Measure success with MTTR, % of assets covered by lineage, and reduction in “unknown source” incidents.
Implementation checklist
- Adopt `OpenLineage` event schema and instrument orchestrators (Airflow/Dagster) and compute (Spark/Databricks, dbt).
- Stand up a lineage store: `Marquez` (open-source) or enterprise (`Atlas`, `DataHub`).
- Enable column-level lineage for critical paths (revenue metrics, AI features).
- Connect tests and SLOs to lineage and alert on upstream impacts.
- Expose a self-serve graph UI + API search for analysts, PMs, and on-call.
- Add lineage checks to CI/CD: fail merges that break contracts or owners.
- Publish weekly coverage and MTTR reports to keep momentum.
Questions we hear from teams
- Do we need a new catalog to get lineage?
- No. Start by emitting `OpenLineage` events from Airflow/dbt/Spark into a single store. If you already run Atlas or DataHub, use it. If not, stand up Marquez in a day and move on. Don’t build custom collectors unless a vendor lacks support.
- Is table-level lineage enough?
- Not for impact analysis. You’ll waste time triaging false positives. Column-level lineage for Tier 1 assets is the difference between a 15-minute fix and a multi-hour blame game.
- How do we handle notebooks (the perennial Databricks pain)?
- Use `openlineage-spark` configured at cluster/job level so reads/writes are captured regardless of notebook style. For ad-hoc notebooks, gate promotion into scheduled jobs unless they emit lineage and pass tests.
- Won’t this slow down our pipelines or cost a fortune?
- OpenLineage events are tiny. Collectors add negligible overhead. The real costs are in storage/compute from bad jobs; lineage reduces that by catching breakages early and scoping reruns. Most teams see net cost down and MTTR down 70–90%.
- What KPIs should we track to prove value?
- Track these five:
  - MTTR for data incidents
  - % of Tier 1 assets with column-level lineage
  - Data downtime hours/month per domain
  - % of alerts with identified root cause within 30 minutes
  - Contract violation rate per month