The RCA That Ate Our Weekend: Data Lineage for AI Training and Inference That Actually Works
If you can’t answer “which data, which model, which prompt, which version, which trace” in 60 seconds, you don’t have AI in production—you have risk. Here’s the lineage stack we deploy so leaders sleep at night.
Lineage isn’t a diagram—it’s the receipt for every decision your AI makes, linked to a trace you can click in the middle of an incident.
The incident you can’t RCA without lineage
We had a fintech client roll out a slick RAG assistant to explain fees. Friday afternoon, p95 latency spikes to 3s and the bot starts inventing fee tiers that don’t exist. We couldn’t answer basic questions:
- Which `model_version` served the bad responses?
- Which `vector_index` and `corpus_version` fueled retrieval?
- Which `prompt_template` revision went live?
No lineage. No trace IDs across components. We spent the weekend diffing S3 folders and Slack archeology. The root cause was a silent index rebuild from the staging corpus and a prompt tweak merged without a template version bump. Classic.
If you can’t get from a single incident to the exact data, prompt, and model that produced it in under a minute, you’re flying blind.
This is the playbook we now deploy at GitPlumbers. It’s boring on purpose: strong instrumentation, queryable lineage, and guardrails that trip before customers do.
What “lineage” actually means for AI (training and inference)
Lineage isn’t a PDF diagram. It’s machine-emitted events and immutable IDs that let you reconstruct any run. For AI, think two planes:
- Training plane: `raw_data` → `feature_engineering` → `train_run` → `evaluation` → `model_registry` → `artifact_store`
- Inference plane: `request` → `retrieval`/`features` → `prompt_template` + `model_version` → `guardrails` → `response`
What you must capture (a minimal record sketch follows this list):
- Datasets and features: `dataset_uri`, `data_version` (e.g., a `lakeFS` commit or `DVC` tag), `schema_hash`, `owner`.
- Transform runs: `job_id`, `code_sha`, `dbt`/`Spark` lineage, input/output datasets.
- Training runs: `mlflow_run_id`, `code_sha`, hyperparams, `model_uri`, evaluation metrics.
- Registry events: `model_name`, `model_version`, stage (`Staging`/`Prod`), signer/attestation (use `sigstore` if you’re fancy).
- Inference context: `trace_id`, `request_id`, `prompt_template_id` + `template_version`, `retrieval_corpus_version`, top-K doc IDs, `temperature`, `provider`, token counts, guardrail outcomes.
- Operational metadata: region, instance type, `Istio` revision, `docker_image_sha`, `ArgoCD` app revision.
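To make the inference-context capture concrete, here’s a minimal sketch of the per-request record we attach to spans and lineage facets. The field names mirror the list above and are conventions we use, not a schema any tool mandates:

# inference_context.py: a sketch of the per-request lineage record (field names are conventions)
from dataclasses import dataclass, field

@dataclass
class InferenceContext:
    trace_id: str                     # OpenTelemetry trace ID, hex-encoded
    request_id: str
    model_version: str                # e.g. "gpt-4o-2024-06-18"
    prompt_template_id: str
    template_version: str             # e.g. "fees_v7"
    retrieval_corpus_version: str     # e.g. "fees_corpus@commit:abc123"
    retrieved_doc_ids: list[str] = field(default_factory=list)
    temperature: float = 0.0
    provider: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    guardrail_outcome: str = "pass"   # pass / blocked / flagged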
The trick is to stitch all of this together with a shared trace: `OpenTelemetry` spans that reference `OpenLineage` run IDs and your model registry. That’s the difference between a slide and an RCA.
The stack that actually works (and why)
I’ve seen teams glue together spreadsheets and Confluence pages. Don’t. Use tools that emit lineage for you:
- Backbone: `OpenLineage` + `Marquez` (or `DataHub`/`OpenMetadata` if you’re already in that ecosystem). They understand datasets, jobs, and runs out of the box.
- Pipelines: `Airflow` + `openlineage-airflow`, `dbt` + `openlineage-dbt`, `Spark` + `openlineage-spark`. They push lineage automatically.
- Model registry: `MLflow` (self-hosted or Databricks). Tag runs with `git_sha`, `data_version`, and `training_code_sha`.
- Data versioning: `lakeFS` or `DVC` so your “dataset” is an immutable commit instead of “s3://bucket/latest”.
- Tracing and metrics: `OpenTelemetry` for distributed traces; `Prometheus` + `Grafana` for SRE-grade metrics.
- Deployment controls: `ArgoCD` + `Argo Rollouts` for GitOps and canaries.
- Service mesh: `Istio` for mTLS and out-of-the-box telemetry.
- Safety rails: `Guardrails`/`jsonschema`/`pydantic` for schema adherence, `OPA` for policy, plus content safety (e.g., `Azure AI Content Safety`) where needed.
Here’s the minimum viable wiring in an inference service to connect traces and lineage:
# fastapi_middleware.py
import uuid
from datetime import datetime, timezone

from fastapi import Request
from opentelemetry import trace
from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

ol = OpenLineageClient(url="http://marquez:5000")
tracer = trace.get_tracer(__name__)

async def with_lineage(request: Request, call_next):
    with tracer.start_as_current_span("inference_request") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        request_id = request.headers.get("x-request-id", trace_id[:16])

        # Put the serving context on the span so the trace shows it next to latency;
        # the same fields can also be attached as custom OpenLineage facets if you want them in Marquez.
        span.set_attribute("model_version", "gpt-4o-2024-06-18")
        span.set_attribute("prompt_template_version", "fees_v7")
        span.set_attribute("temperature", 0.2)
        span.set_attribute("request_id", request_id)

        response = await call_next(request)

        # Emit a minimal OpenLineage run linking prompt/model/retrieval; reuse the
        # trace_id (reformatted as a UUID) as the runId so traces and lineage join.
        ol.emit(RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(runId=str(uuid.UUID(trace_id))),
            job=Job(namespace="inference", name="fee-assistant"),
            producer="service://fee-assistant/1.4.2",
            inputs=[Dataset(namespace="vector", name="fees_corpus@commit:abc123")],
            outputs=[Dataset(namespace="responses", name=f"fee_assistant_out/{request_id}")],
        ))
        return response
And Prometheus metrics to watch what matters:
import time

from prometheus_client import Counter, Histogram

TOKENS = Counter('llm_tokens_total', 'LLM tokens', ['role', 'model'])
LATENCY = Histogram('llm_latency_seconds', 'End-to-end latency', buckets=[.1, .2, .5, 1, 2, 5])
HITS = Counter('rag_retrieval_hits_total', 'Docs retrieved', ['index_version'])

# In code around your LLM call
start = time.time()
# ... make the call; pull prompt_tokens, completion_tokens, and top_k from the response ...
LATENCY.observe(time.time() - start)
TOKENS.labels('prompt', 'gpt-4o-2024-06-18').inc(prompt_tokens)
TOKENS.labels('completion', 'gpt-4o-2024-06-18').inc(completion_tokens)
HITS.labels('fees_corpus@commit:abc123').inc(top_k)
Implement it in a week (no heroics required)
- Stand up the lineage backbone:
  - Deploy `Marquez` via `helm`, or use `DataHub` if your org standardizes there.
  - Enable `openlineage-airflow` on your schedulers; turn on lineage for `dbt`/`Spark` jobs.
  - Tag datasets with `lakeFS` commits or `DVC` tags so you’re not chasing “latest”.
- Thread a trace through the stack:
  - Add the `OpenTelemetry` SDK to the inference service; propagate `traceparent` headers (see the propagation sketch after this list).
  - Emit `OpenLineage` events using the `trace_id` as `runId` in both training and inference paths.
  - Capture request metadata: `tenant_id`, `request_id`, `authz_subject` (watch privacy!).
- Stamp everything with versions:
  - `MLflow`: add `params`/`tags` for `data_version`, `code_sha`, `training_env` (see the tagging sketch after this list).
  - Prompt templates: store them in Git; expose a `template_version` in responses and logs.
  - Vector index: version your corpus, e.g., `fees_corpus@commit:abc123`.
- Wire SRE-grade observability:
  - Export `Prometheus` metrics for latency histograms, error rate, token usage, and retrieval hit-rate.
  - Create `Grafana` dashboards that pivot from metrics → trace → lineage (via links to `Jaeger`/`Tempo` and `Marquez`).
  - Define SLOs: p95 latency, max hallucination rate, drift thresholds.
- Add guardrails that fail closed:
  - Use `pydantic`/`jsonschema` or `Guardrails` to force function-call/JSON shapes.
  - Run content/PII filters before and after LLM calls; quarantine violations.
  - Canary with `Argo Rollouts`; auto-rollback on SLO burn or regression metrics.
- Close the loop with evaluation:
  - Batch evals nightly with `promptfoo`/`Giskard`/`Deepchecks` using golden sets.
  - Log eval scores as lineage facets on the model and prompt template.
  - Require a green eval and passing guardrails before promoting to `Prod` (a GitOps check in `ArgoCD`).
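Two of the steps above trip teams up most often, so here are minimal sketches. First, propagating the W3C `traceparent` header on outbound calls (LLM provider, vector store, tools) so every hop lands in the same trace; the gateway URL and the `call_provider` helper are placeholders, not a specific provider SDK:

# trace_propagation.py: a sketch of traceparent propagation on outbound calls (URL is a placeholder)
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

async def call_provider(payload: dict) -> dict:
    with tracer.start_as_current_span("llm_provider_call"):
        headers: dict[str, str] = {}
        inject(headers)  # injects traceparent/tracestate from the current span context
        async with httpx.AsyncClient() as client:
            resp = await client.post("http://llm-gateway/v1/chat", json=payload, headers=headers)
            return resp.json()

Second, stamping an `MLflow` training run with the versions you’ll need mid-incident; the tag keys mirror the capture list above and are our conventions, not MLflow built-ins, and the values are illustrative:

# train_tagging.py: a sketch of version-stamping an MLflow run (tag keys are our convention)
import mlflow

with mlflow.start_run(run_name="fee-classifier-train") as run:
    mlflow.set_tags({
        "data_version": "lakefs://fees-corpus/commits/abc123",  # immutable data commit
        "code_sha": "9f1c2ab",                                  # git SHA of the training code
        "training_env": "py3.11-cuda12",                        # image/environment identifier
    })
    mlflow.log_params({"learning_rate": 3e-4, "epochs": 5})
    # ... train, evaluate ...
    mlflow.log_metric("eval_accuracy", 0.94)
    print(f"mlflow_run_id={run.info.run_id}")  # store this ID in your lineage facets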
Example canary policy tied to SLOs:
# argo-rollouts canary snippet
strategy:
  canary:
    steps:
      - setWeight: 10
      - pause: {duration: 300}
      - analysis:
          templates:
            - templateName: llm-slo-check
          args:
            - name: p95
              value: "<250ms"
            - name: hallucination_rate
              value: "<2%"
Guardrails for real failure modes: hallucination, drift, latency spikes
- Hallucination: you’ll see confident nonsense when retrieval recall drops or templates change.
  - Mitigation: enforce structured outputs (`pydantic`), run post-response validators (e.g., math/URL resolvers), and attach provenance in responses.
  - Measurement: model-graded evals on golden questions; target a hallucination rate < 2%. Fail the canary if exceeded.
  - Lineage hook: log `retrieval_docs` IDs and `template_version` so you can reproduce the context that hallucinated.
- Drift: data shifts break both retrieval and model behavior.
  - Mitigation: compute PSI/KL on feature distributions; monitor embedding centroid shifts for RAG corpora (a PSI sketch follows this list).
  - Measurement: alert when the drift z-score exceeds your threshold; trigger shadow reindexing.
  - Lineage hook: tie drift alerts to a specific `data_version`/`index_commit` and pause promotions.
- Latency spikes: providers throttle, vector DBs GC, or you added a silent N+1 in tools.
  - Mitigation: `Istio` circuit breakers, retries with backoff, batched embedding fetches, and a cap on `top_k`.
  - Measurement: `Prometheus` histograms and per-hop spans; watch p95 and the tail (p99.9).
  - Lineage hook: every response carries a `trace_id` → jump to the slow span → find the culprit (provider, vector store, tool) and its associated versions.
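To make the drift check concrete, here’s a minimal PSI (population stability index) sketch for a single numeric feature. It assumes `numpy`, equal-width bins fit on the reference window, and an illustrative alert threshold (PSI above roughly 0.25 is a common rule of thumb for significant shift):

# psi_check.py: a minimal PSI sketch for one feature (bin count and threshold are illustrative)
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index: sum((p - q) * ln(p / q)) over shared bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    q, _ = np.histogram(reference, bins=edges)
    p, _ = np.histogram(current, bins=edges)
    q = q / max(q.sum(), 1) + eps   # reference proportions, smoothed to avoid log(0)
    p = p / max(p.sum(), 1) + eps   # current proportions
    return float(np.sum((p - q) * np.log(p / q)))

# Usage: alert and pause promotions when drift exceeds the threshold
ref = np.random.normal(0, 1, 10_000)     # stand-in for last week's feature values
cur = np.random.normal(0.4, 1, 10_000)   # stand-in for today's feature values
if psi(ref, cur) > 0.25:
    print("drift alert: pin data_version/index_commit and pause promotions")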
Quick OPA policy that blocks promotion if evals regress:
package llm.promotion

default allow = false

allow {
  input.target == "prod"
  input.metrics.hallucination_rate < 0.02
  input.metrics.p95_latency_ms < 250
  input.metrics.schema_violations == 0
}
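Wiring that policy into CI is one HTTP call to OPA’s data API. This sketch assumes OPA is reachable at `http://opa:8181` and that your eval and canary jobs produce the metrics shown; the address and values are placeholders:

# promotion_gate.py: a sketch of a CI step that asks OPA whether promotion is allowed
import sys
import requests

metrics = {
    "target": "prod",
    "metrics": {
        "hallucination_rate": 0.014,   # from the nightly eval job
        "p95_latency_ms": 218,         # from the canary analysis
        "schema_violations": 0,        # from guardrail counters
    },
}

# POST /v1/data/<package path> evaluates the policy with the given input document
resp = requests.post("http://opa:8181/v1/data/llm/promotion/allow", json={"input": metrics}, timeout=5)
resp.raise_for_status()
if not resp.json().get("result", False):
    sys.exit("promotion blocked: SLO or eval regression (see llm.promotion policy)")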
Make RCAs boring: dashboards and runbooks
Build dashboards with hyperlinks between systems. When p95 fires:
- Click from the `Grafana` panel to the `Jaeger` trace using the `trace_id` label.
- From the trace, open the `Marquez` run with the same `runId` to see `inputs`/`outputs` and facets.
- Jump to the `MLflow` run via the `mlflow_run_id` facet for the exact `model_version` and `data_version`.
Your runbook should say:
- Confirm the SLO burn (latency/hallucination/drift).
- Identify the last successful canary step and roll back via `kubectl argo rollouts undo <name>`.
- Use trace → lineage to capture `template_version`, `index_commit`, `model_version`.
- File a PR to pin or revert the offending artifact (template, index, model).
- Add a regression test/eval so it never ships again.
This is how you cut MTTR from hours to minutes and stop guessing.
A real-world save: what changed after we instrumented
At that fintech, we rolled out the stack above:
- `OpenLineage` events from `Airflow`, `dbt`, and the FastAPI inference service.
- `lakeFS` commits for the RAG corpus; `MLflow` for model versions.
- `Prometheus` token/latency metrics and an eval job with `promptfoo`.
Results in 30 days:
- p95 latency stabilized at < 220ms; p99 < 600ms after fixing a vector DB GC pause found via trace spans.
- Hallucination rate on the fee Q&A fell from ~9% to 1.4%; the canary blocked two bad template merges automatically.
- MTTR on incidents dropped from a 4h median to 18m. Engineering regained their weekends.
We didn’t invent anything. We instrumented ruthlessly and wired the lineage so the system could defend itself.
Do this next (small, repeatable, defensible)
- Pick one high-traffic endpoint and one training pipeline. Instrument both end-to-end.
- Emit `OpenLineage` with the `trace_id` as `runId`; store `template_version` and `index_commit` in facets.
- Add two SLOs (latency + hallucination) and gate releases with `Argo Rollouts`.
- Put a golden set in `promptfoo` and fail the pipeline if accuracy regresses.
- Socialize the dashboard and runbook; drill once.
If you want a hand, this is our bread and butter at GitPlumbers. We’ve cleaned up enough AI messes to know where bodies are buried and which knobs actually matter.
Key takeaways
- Lineage is not a spreadsheet—emit events at every hop (ingest→transform→train→register→deploy→serve→retrieve→respond) and link them with a shared trace ID.
- Use a backbone like `OpenLineage` + `Marquez` or `DataHub` to capture dataset, model, and prompt/template versions—especially for inference context and RAG artifacts.
- Make lineage queryable across traces with `OpenTelemetry` so you can jump from an incident in `Prometheus` to the exact training run and dataset snapshot in `MLflow`/`lakeFS`.
- Turn lineage into safety: define SLOs for hallucination, drift, latency; gate rollouts with `Argo Rollouts` and enforce policies with `OPA`/guardrails.
- Start small: instrument one training pipeline, one inference endpoint, and a canary path. Prove MTTR drops before you go wall-to-wall.
Implementation checklist
- Adopt `OpenLineage` events and stand up `Marquez` (or use `DataHub` if you already run it).
- Propagate `trace_id` via `OpenTelemetry` from request to model call to vector store to output.
- Stamp every artifact (dataset, feature set, prompt template, model) with `git_sha`, `data_version`, `template_version`, and `model_version`.
- Ship `Prometheus` metrics for tokens, latency histograms, retrieval hit-rate, and moderation/guardrail outcomes.
- Set SLOs and wire `Argo Rollouts` for canaries; trigger rollbacks on SLO burns or drift alerts.
- Add safety checks: schema validators, PII detectors, and model-graded evals with human review for high-risk paths.
- Automate dashboards and runbooks that pivot from alert → trace → lineage → remediation PR.
Questions we hear from teams
- Do we need OpenLineage if we already run DataHub?
- No. If you’re already invested in DataHub or OpenMetadata, use them. The key is to emit machine-readable lineage events and ensure you can join them with traces and your model registry. We often integrate OpenTelemetry traces with DataHub’s lineage graph just fine.
- How do we track lineage for hosted LLMs (OpenAI, Anthropic) we don’t control?
- Wrap provider calls. Emit spans and lineage facets with `model_name`, `model_version`, parameters, token counts, and the exact prompt template version. Log retrieval inputs, not raw PII. You won’t get provider internals, but your side of the chain is enough for RCA and rollback.
- Isn’t all this expensive?
- Cheaper than an outage. Start small: one service, one pipeline, one canary. The infra footprint is modest: Marquez (or DataHub), Prometheus, a trace backend, and MLflow—likely stuff you already run. The payoff is MTTR reduction and avoided incidents.
- What about privacy and compliance?
- Mask or hash sensitive fields in lineage facets, and store full context only in restricted stores. Use `OPA` to block promotions without a privacy review, and use dataset-level tags (PII, PCI) to prevent cross-domain joins in training.
- How do we measure hallucination reliably?
- Use golden sets for critical tasks and supplement with model-graded evaluations. Track schema violations, tool execution failures, and factuality checks where possible. The goal is a leading indicator you can gate on during canaries—not perfect truth.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.