The RCA That Ate Our Weekend: Data Lineage for AI Training and Inference That Actually Works

If you can’t answer “which data, which model, which prompt, which version, which trace” in 60 seconds, you don’t have AI in production—you have risk. Here’s the lineage stack we deploy so leaders sleep at night.

Lineage isn’t a diagram—it’s the receipt for every decision your AI makes, linked to a trace you can click in the middle of an incident.

The incident you can’t RCA without lineage

We had a fintech client roll out a slick RAG assistant to explain fees. Friday afternoon, p95 latency spikes to 3s and the bot starts inventing fee tiers that don’t exist. We couldn’t answer basic questions:

  • Which model_version served the bad responses?

  • Which vector_index and corpus_version fueled retrieval?

  • Which prompt_template revision went live?

No lineage. No trace IDs across components. We spent the weekend diffing S3 folders and Slack archeology. The root cause was a silent index rebuild from the staging corpus and a prompt tweak merged without a template version bump. Classic.

If you can’t get from a single incident to the exact data, prompt, and model that produced it in under a minute, you’re flying blind.

This is the playbook we now deploy at GitPlumbers. It’s boring on purpose: strong instrumentation, queryable lineage, and guardrails that trip before customers do.

What “lineage” actually means for AI (training and inference)

Lineage isn’t a PDF diagram. It’s machine-emitted events and immutable IDs that let you reconstruct any run. For AI, think two planes:

  • Training plane: raw_data → feature_engineering → train_run → evaluation → model_registry → artifact_store

  • Inference plane: request → retrieval/features → prompt_template + model_version → guardrails → response

What you must capture:

  • Datasets and features: dataset_uri, data_version (e.g., lakeFS commit, DVC tag), schema_hash, owner.

  • Transform runs: job_id, code_sha, dbt/Spark lineage, input/output datasets.

  • Training runs: mlflow_run_id, code_sha, hyperparams, model_uri, evaluation metrics.

  • Registry events: model_name, model_version, stage (Staging/Prod), signer/attestation (use sigstore if you’re fancy).

  • Inference context: trace_id, request_id, prompt_template_id + template_version, retrieval_corpus_version, top-K doc IDs, temperature, provider, token counts, guardrail outcomes.

  • Operational metadata: region, instance type, Istio revision, docker_image_sha, ArgoCD app revision.

The trick is to stitch all of this with a shared trace—OpenTelemetry spans that reference OpenLineage run IDs and your model registry. That’s the difference between a slide and an RCA.
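If you want a concrete shape for the inference side, here is a minimal sketch of the per-request record; the fields mirror the list above, and the class name and defaults are illustrative, not a standard schema.

# lineage_context.py — illustrative per-request inference context record
from dataclasses import dataclass, field

@dataclass(frozen=True)
class InferenceLineageContext:
    trace_id: str                      # OpenTelemetry trace, reused as the OpenLineage runId
    request_id: str
    model_version: str                 # registry version or provider model string
    prompt_template_id: str
    template_version: str
    retrieval_corpus_version: str      # e.g., "fees_corpus@commit:abc123"
    retrieved_doc_ids: list[str] = field(default_factory=list)
    temperature: float = 0.0
    provider: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    guardrail_outcome: str = "pass"    # pass / blocked / degraded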

The stack that actually works (and why)

I’ve seen teams glue together spreadsheets and Confluence pages. Don’t. Use tools that emit lineage for you:

  • Backbone: OpenLineage + Marquez (or DataHub/OpenMetadata if you’re already in that ecosystem). They understand datasets, jobs, and runs out of the box.

  • Pipelines: Airflow + openlineage-airflow, dbt + openlineage-dbt, Spark + openlineage-spark. They push lineage automatically.

  • Model registry: MLflow (self-hosted or Databricks). Tag runs with git_sha, data_version, and training_code_sha.

  • Data versioning: lakeFS or DVC so your “dataset” is an immutable commit instead of “s3://bucket/latest”.

  • Tracing and metrics: OpenTelemetry for distributed traces; Prometheus + Grafana for SRE-grade metrics.

  • Deployment controls: ArgoCD + Argo Rollouts for GitOps and canaries.

  • Service mesh: Istio for mTLS and out-of-the-box telemetry.

  • Safety rails: Guardrails/jsonschema/pydantic for schema adherence, OPA for policy, plus content safety (e.g., Azure AI Content Safety) where needed.

Here’s the minimum viable wiring in an inference service to connect traces and lineage:

# fastapi_middleware.py
import uuid
from datetime import datetime, timezone

from fastapi import Request
from opentelemetry import trace
from openlineage.client import OpenLineageClient
# Event classes live in openlineage.client.run on recent Python clients;
# adjust the import if your client version has moved them.
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

ol = OpenLineageClient(url="http://marquez:5000")
tracer = trace.get_tracer(__name__)

async def with_lineage(request: Request, call_next):
    with tracer.start_as_current_span("inference_request") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        request_id = request.headers.get("x-request-id", trace_id[:16])
        response = await call_next(request)

        # Emit a minimal OpenLineage run linking prompt, model, and retrieval.
        # OpenLineage runIds are UUIDs; the 128-bit trace_id maps onto one cleanly.
        ol.emit(RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(
                runId=str(uuid.UUID(trace_id)),
                facets={
                    # A typed custom facet is better for production; a plain dict
                    # shows what we stamp on every request.
                    "inferenceContext": {
                        "model_version": "gpt-4o-2024-06-18",
                        "prompt_template_version": "fees_v7",
                        "temperature": 0.2,
                        "request_id": request_id,
                    }
                },
            ),
            job=Job(namespace="inference", name="fee-assistant"),
            inputs=[Dataset(namespace="vector", name="fees_corpus@commit:abc123")],
            outputs=[Dataset(namespace="responses", name=f"fee_assistant_out/{request_id}")],
            producer="service://fee-assistant/1.4.2",
        ))
        return response
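Registering the middleware is one line on the app (module names here are assumed):

# app.py — assumed entrypoint; wire the lineage middleware onto the FastAPI app
from fastapi import FastAPI
from fastapi_middleware import with_lineage

app = FastAPI()
app.middleware("http")(with_lineage)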

And Prometheus metrics to watch what matters:

import time

from prometheus_client import Counter, Histogram

TOKENS = Counter('llm_tokens_total', 'LLM tokens', ['role', 'model'])
LATENCY = Histogram('llm_latency_seconds', 'End-to-end latency', buckets=[.1, .2, .5, 1, 2, 5])
HITS = Counter('rag_retrieval_hits_total', 'Docs retrieved', ['index_version'])

# In code around your LLM call (prompt_tokens, completion_tokens, and top_k
# come from the provider response and your retriever)
start = time.time()
# ... make call ...
LATENCY.observe(time.time() - start)
TOKENS.labels('prompt', 'gpt-4o-2024-06-18').inc(prompt_tokens)
TOKENS.labels('completion', 'gpt-4o-2024-06-18').inc(completion_tokens)
HITS.labels('fees_corpus@commit:abc123').inc(top_k)
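If nothing scrapes the service yet, prometheus_client can expose these metrics as an ASGI app mounted next to your routes; the path is your call:

# metrics mount — assumes the same FastAPI `app` as above
from prometheus_client import make_asgi_app

app.mount("/metrics", make_asgi_app())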

Implement it in a week (no heroics required)

  1. Stand up the lineage backbone:
    • Deploy Marquez via helm or use DataHub if your org standardizes there.
    • Enable openlineage-airflow on your schedulers; turn on lineage for dbt/Spark jobs.
    • Tag datasets with lakeFS commits or DVC tags so you’re not chasing “latest”.
  2. Thread a trace through the stack:
    • Add OpenTelemetry SDK to the inference service; propagate traceparent headers.
    • Emit OpenLineage events using the trace_id as runId in both training and inference paths.
    • Capture request metadata: tenant_id, request_id, authz_subject (watch privacy!).
  3. Stamp everything with versions:
    • MLflow: add params/tags for data_version, code_sha, training_env (there's a sketch right after this list).
    • Prompt templates: store in Git; expose a template_version in responses and logs.
    • Vector index: version your corpus; e.g., fees_corpus@commit:abc123.
  4. Wire SRE-grade observability:
    • Export Prometheus metrics for latency histograms, error rate, token usage, retrieval hit-rate.
    • Create Grafana dashboards that pivot from metrics → trace → lineage (via links to Jaeger/Tempo and Marquez).
    • Define SLOs: p95 latency, max hallucination rate, drift thresholds.
  5. Add guardrails that fail closed:
    • Use pydantic/jsonschema or Guardrails to force function-call/JSON shapes.
    • Content/PII filters before and after LLM calls; quarantine violations.
    • Canary with Argo Rollouts; auto-rollback on SLO burn or regression metrics.
  6. Close the loop with evaluation:
    • Batch evals nightly with promptfoo/Giskard/Deepchecks using golden sets.
    • Log eval scores as lineage facets on the model and prompt template.
    • Require a green eval and passing guardrails before promoting to Prod (GitOps check in ArgoCD).
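To make step 3 concrete, stamping an MLflow training run takes a few lines; the tag names and values below are our convention, not anything MLflow mandates:

# train.py — stamp the run so lineage can join registry, data, and code
import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        "data_version": "lakefs://fees-corpus/commits/abc123",  # immutable dataset ref
        "code_sha": "9f2c1ab",                                  # git commit of training code
        "training_env": "py3.11-cuda12",                        # image / environment label
    })
    # ... train, log metrics, and log the model as usual ...
    mlflow.log_metric("eval_accuracy", 0.93)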

Example canary policy tied to SLOs:

# argo-rollouts canary snippet
strategy:
  canary:
    steps:
    - setWeight: 10
    - pause: {duration: 300}
    - analysis:
        templates:
        - templateName: llm-slo-check
          args:
          - name: p95
            value: "<250ms"
          - name: hallucination_rate
            value: "<2%"

Guardrails for real failure modes: hallucination, drift, latency spikes

  • Hallucination: You’ll see confident nonsense when retrieval recall drops or templates change.
    • Mitigation: enforce structured outputs (pydantic; see the validator sketch after this list), run post-response validators (e.g., math/URL resolvers), and attach provenance in responses.
    • Measurement: model-graded evals on golden questions; target hallucination rate < 2%. Fail canary if exceeded.
    • Lineage hook: log retrieval_docs IDs and template_version so you can reproduce the context that hallucinated.
  • Drift: Data shifts break both retrieval and model behavior.
    • Mitigation: compute PSI/KL on feature distributions (a small PSI sketch follows this list); monitor embedding centroid shifts for RAG corpora.
    • Measurement: alert when drift z-score > threshold; trigger shadow reindexing.
    • Lineage hook: tie drift alerts to specific data_version/index_commit and pause promotions.
  • Latency spikes: Providers throttle, vector DBs GC, or you added a silent N+1 in tools.
    • Mitigation: Istio circuit breakers, retry+backoff, batch embedding fetches, cap top_k.
    • Measurement: Prometheus histograms and per-hop spans; watch p95 and tail (p99.9).
    • Lineage hook: every response carries trace_id → jump to slow span → find culprit (provider, vector store, tool) and associated versions.
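The structured-output check is the cheapest hallucination guardrail: validate the model's JSON against a contract and fail closed. A minimal pydantic v2 sketch, with field names invented for the fee-assistant example:

# output_guard.py — fail closed when the response doesn't match the contract
from pydantic import BaseModel, ValidationError, field_validator

class FeeAnswer(BaseModel):
    fee_tier: str
    amount_usd: float
    source_doc_ids: list[str]          # provenance: which retrieved docs back the answer

    @field_validator("source_doc_ids")
    @classmethod
    def must_cite_sources(cls, v):
        if not v:
            raise ValueError("answer must cite at least one retrieved document")
        return v

def validate_or_block(raw_json: str):
    try:
        return FeeAnswer.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller returns a safe fallback and increments a guardrail counter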
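For drift, PSI is just a binned comparison of a reference sample against live traffic. A small numpy sketch; the 0.2 alert threshold is a common rule of thumb, not a law:

# drift_psi.py — population stability index between reference and live samples
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference distribution so both sides are comparable.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Floor empty buckets to avoid log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Alert (and pause promotions) when PSI crosses the usual 0.2 "significant shift" line.
if psi(np.random.normal(0, 1, 5000), np.random.normal(0.4, 1, 5000)) > 0.2:
    print("drift detected: tie the alert to data_version/index_commit and investigate")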

Quick OPA policy that blocks promotion if evals regress:

package llm.promotion

import rego.v1

default allow := false

allow if {
  input.target == "prod"
  input.metrics.hallucination_rate < 0.02
  input.metrics.p95_latency_ms < 250
  input.metrics.schema_violations == 0
}
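Wiring that policy into a pipeline step is one call to OPA's data API; the endpoint path mirrors the package name, and the metrics payload is whatever your eval job emits (assumed here):

# promotion_gate.py — ask OPA whether this candidate may promote to prod
import sys

import requests

decision = requests.post(
    "http://opa:8181/v1/data/llm/promotion/allow",
    json={"input": {
        "target": "prod",
        "metrics": {
            "hallucination_rate": 0.014,
            "p95_latency_ms": 212,
            "schema_violations": 0,
        },
    }},
    timeout=5,
).json()

if not decision.get("result", False):
    sys.exit("promotion blocked: eval or SLO regression")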

Make RCAs boring: dashboards and runbooks

Build dashboards with hyperlinks between systems. When p95 fires:

  • Click from Grafana panel to Jaeger trace using trace_id label.

  • From trace, open Marquez run with the same runId to see inputs/outputs and facets.

  • Jump to MLflow run via mlflow_run_id facet for exact model_version and data_version.

Your runbook should say:

  1. Confirm SLO burn (latency/hallucination/drift).

  2. Identify the last successful canary step and roll back via kubectl argo rollouts undo <name> (or abort mid-canary).

  3. Use trace → lineage to capture template_version, index_commit, model_version.

  4. File a PR to pin or revert the offending artifact (template, index, model).

  5. Add a regression test/eval so it never ships again.

This is how you cut MTTR from hours to minutes and stop guessing.

A real-world save: what changed after we instrumented

At that fintech, we rolled out the stack above:

  • OpenLineage events from Airflow, dbt, and the FastAPI inference service.

  • lakeFS commits for the RAG corpus; MLflow for model versions.

  • Prometheus tokens/latency metrics and an eval job with promptfoo.

Results in 30 days:

  • p95 latency stabilized at < 220ms; p99 < 600ms after fixing a vector DB GC pause found via trace spans.

  • Hallucination rate on the fee Q&A fell from ~9% to 1.4%; canary blocked two bad template merges automatically.

  • MTTR on incidents dropped from 4h median to 18m. Engineering regained their weekends.

We didn’t invent anything. We instrumented ruthlessly and wired the lineage so the system could defend itself.

Do this next (small, repeatable, defensible)

  • Pick one high-traffic endpoint and one training pipeline. Instrument both end-to-end.

  • Emit OpenLineage with the trace_id as runId; store template_version and index_commit in facets.

  • Add two SLOs (latency + hallucination) and gate releases with Argo Rollouts.

  • Put a golden set in promptfoo and fail the pipeline if accuracy regresses.

  • Socialize the dashboard and runbook; drill once.

If you want a hand, this is our bread and butter at GitPlumbers. We’ve cleaned up enough AI messes to know where bodies are buried and which knobs actually matter.

Key takeaways

  • Lineage is not a spreadsheet—emit events at every hop (ingest→transform→train→register→deploy→serve→retrieve→respond) and link them with a shared trace ID.
  • Use a backbone like `OpenLineage` + `Marquez` or `DataHub` to capture dataset, model, and prompt/template versions—especially for inference context and RAG artifacts.
  • Make lineage queryable across traces with `OpenTelemetry` so you can jump from an incident in `Prometheus` to the exact training run and dataset snapshot in `MLflow`/`lakeFS`.
  • Turn lineage into safety: define SLOs for hallucination, drift, latency; gate rollouts with `Argo Rollouts` and enforce policies with `OPA`/guardrails.
  • Start small: instrument one training pipeline, one inference endpoint, and a canary path. Prove MTTR drops before you go wall-to-wall.

Implementation checklist

  • Adopt `OpenLineage` events and stand up `Marquez` (or use `DataHub` if you already run it).
  • Propagate `trace_id` via `OpenTelemetry` from request to model call to vector store to output.
  • Stamp every artifact (dataset, feature set, prompt template, model) with `git_sha`, `data_version`, `template_version`, and `model_version`.
  • Ship `Prometheus` metrics for tokens, latency histograms, retrieval hit-rate, and moderation/guardrail outcomes.
  • Set SLOs and wire `Argo Rollouts` for canaries; trigger rollbacks on SLO burns or drift alerts.
  • Add safety checks: schema validators, PII detectors, and model-graded evals with human review for high-risk paths.
  • Automate dashboards and runbooks that pivot from alert → trace → lineage → remediation PR.

Questions we hear from teams

Do we need OpenLineage if we already run DataHub?
No. If you’re already invested in DataHub or OpenMetadata, use them. The key is to emit machine-readable lineage events and ensure you can join them with traces and your model registry. We often integrate OpenTelemetry traces with DataHub’s lineage graph just fine.
How do we track lineage for hosted LLMs (OpenAI, Anthropic) we don’t control?
Wrap provider calls. Emit spans and lineage facets with `model_name`, `model_version`, parameters, token counts, and the exact prompt template version. Log retrieval inputs, not raw PII. You won’t get provider internals, but your side of the chain is enough for RCA and rollback.
Isn’t all this expensive?
Cheaper than an outage. Start small: one service, one pipeline, one canary. The infra footprint is modest: Marquez (or DataHub), Prometheus, a trace backend, and MLflow—likely stuff you already run. The payoff is MTTR reduction and avoided incidents.
What about privacy and compliance?
Mask or hash sensitive fields in lineage facets, and store full context only in restricted stores. Use `OPA` to block promotions without a privacy review, and use dataset-level tags (PII, PCI) to prevent cross-domain joins in training.
How do we measure hallucination reliably?
Use golden sets for critical tasks and supplement with model-graded evaluations. Track schema violations, tool execution failures, and factuality checks where possible. The goal is a leading indicator you can gate on during canaries—not perfect truth.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a lineage + observability architecture review
Download the AI Lineage Implementation Checklist
