The RCA That Ate Our Weekend: Data Lineage for AI Training and Inference That Actually Works

If you can’t answer “which data, which model, which prompt, which version, which trace” in 60 seconds, you don’t have AI in production—you have risk. Here’s the lineage stack we deploy so leaders sleep at night.

Lineage isn’t a diagram—it’s the receipt for every decision your AI makes, linked to a trace you can click in the middle of an incident.

The incident you can’t RCA without lineage

We had a fintech client roll out a slick RAG assistant to explain fees. Friday afternoon, p95 latency spikes to 3s and the bot starts inventing fee tiers that don’t exist. We couldn’t answer basic questions:

  • Which model_version served the bad responses?

  • Which vector_index and corpus_version fueled retrieval?

  • Which prompt_template revision went live?

No lineage. No trace IDs across components. We spent the weekend diffing S3 folders and Slack archeology. The root cause was a silent index rebuild from the staging corpus and a prompt tweak merged without a template version bump. Classic.

If you can’t get from a single incident to the exact data, prompt, and model that produced it in under a minute, you’re flying blind.

This is the playbook we now deploy at GitPlumbers. It’s boring on purpose: strong instrumentation, queryable lineage, and guardrails that trip before customers do.

What “lineage” actually means for AI (training and inference)

Lineage isn’t a PDF diagram. It’s machine-emitted events and immutable IDs that let you reconstruct any run. For AI, think two planes:

  • Training plane: raw_data → feature_engineering → train_run → evaluation → model_registry → artifact_store

  • Inference plane: request → retrieval/features → prompt_template + model_version → guardrails → response

What you must capture:

  • Datasets and features: dataset_uri, data_version (e.g., lakeFS commit, DVC tag), schema_hash, owner.

  • Transform runs: job_id, code_sha, dbt/Spark lineage, input/output datasets.

  • Training runs: mlflow_run_id, code_sha, hyperparams, model_uri, evaluation metrics.

  • Registry events: model_name, model_version, stage (Staging/Prod), signer/attestation (use sigstore if you’re fancy).

  • Inference context: trace_id, request_id, prompt_template_id + template_version, retrieval_corpus_version, top-K doc IDs, temperature, provider, token counts, guardrail outcomes.

  • Operational metadata: region, instance type, Istio revision, docker_image_sha, ArgoCD app revision.

The trick is to stitch all of this with a shared trace—OpenTelemetry spans that reference OpenLineage run IDs and your model registry. That’s the difference between a slide and an RCA.
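If you want a concrete shape for the inference side, here is a minimal sketch of the per-request record; the fields mirror the list above, and the class name and defaults are illustrative, not a standard schema.

# lineage_context.py — illustrative per-request inference context record
from dataclasses import dataclass, field

@dataclass(frozen=True)
class InferenceLineageContext:
    trace_id: str                      # OpenTelemetry trace, reused as the OpenLineage runId
    request_id: str
    model_version: str                 # registry version or provider model string
    prompt_template_id: str
    template_version: str
    retrieval_corpus_version: str      # e.g., "fees_corpus@commit:abc123"
    retrieved_doc_ids: list[str] = field(default_factory=list)
    temperature: float = 0.0
    provider: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    guardrail_outcome: str = "pass"    # pass / blocked / degraded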

The stack that actually works (and why)

I’ve seen teams glue together spreadsheets and Confluence pages. Don’t. Use tools that emit lineage for you:

  • Backbone: OpenLineage + Marquez (or DataHub/OpenMetadata if you’re already in that ecosystem). They understand datasets, jobs, and runs out of the box.

  • Pipelines: Airflow + openlineage-airflow, dbt + openlineage-dbt, Spark + openlineage-spark. They push lineage automatically.

  • Model registry: MLflow (self-hosted or Databricks). Tag runs with git_sha, data_version, and training_code_sha.

  • Data versioning: lakeFS or DVC so your “dataset” is an immutable commit instead of “s3://bucket/latest”.

  • Tracing and metrics: OpenTelemetry for distributed traces; Prometheus + Grafana for SRE-grade metrics.

  • Deployment controls: ArgoCD + Argo Rollouts for GitOps and canaries.

  • Service mesh: Istio for mTLS and out-of-the-box telemetry.

  • Safety rails: Guardrails/jsonschema/pydantic for schema adherence, OPA for policy, plus content safety (e.g., Azure AI Content Safety) where needed.

Here’s the minimum viable wiring in an inference service to connect traces and lineage:

# fastapi_middleware.py
import uuid
from datetime import datetime, timezone

from fastapi import Request
from opentelemetry import trace
from openlineage.client import OpenLineageClient
# Event classes live in openlineage.client.run on recent Python clients;
# adjust the import if your client version has moved them.
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

ol = OpenLineageClient(url="http://marquez:5000")
tracer = trace.get_tracer(__name__)

async def with_lineage(request: Request, call_next):
    with tracer.start_as_current_span("inference_request") as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        request_id = request.headers.get("x-request-id", trace_id[:16])
        response = await call_next(request)

        # Emit a minimal OpenLineage run linking prompt, model, and retrieval.
        # OpenLineage runIds are UUIDs; the 128-bit trace_id maps onto one cleanly.
        ol.emit(RunEvent(
            eventType=RunState.COMPLETE,
            eventTime=datetime.now(timezone.utc).isoformat(),
            run=Run(
                runId=str(uuid.UUID(trace_id)),
                facets={
                    # A typed custom facet is better for production; a plain dict
                    # shows what we stamp on every request.
                    "inferenceContext": {
                        "model_version": "gpt-4o-2024-06-18",
                        "prompt_template_version": "fees_v7",
                        "temperature": 0.2,
                        "request_id": request_id,
                    }
                },
            ),
            job=Job(namespace="inference", name="fee-assistant"),
            inputs=[Dataset(namespace="vector", name="fees_corpus@commit:abc123")],
            outputs=[Dataset(namespace="responses", name=f"fee_assistant_out/{request_id}")],
            producer="service://fee-assistant/1.4.2",
        ))
        return response
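Registering the middleware is one line on the app (module names here are assumed):

# app.py — assumed entrypoint; wire the lineage middleware onto the FastAPI app
from fastapi import FastAPI
from fastapi_middleware import with_lineage

app = FastAPI()
app.middleware("http")(with_lineage)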

And Prometheus metrics to watch what matters:

import time

from prometheus_client import Counter, Histogram

TOKENS = Counter('llm_tokens_total', 'LLM tokens', ['role', 'model'])
LATENCY = Histogram('llm_latency_seconds', 'End-to-end latency', buckets=[.1, .2, .5, 1, 2, 5])
HITS = Counter('rag_retrieval_hits_total', 'Docs retrieved', ['index_version'])

# In code around your LLM call (prompt_tokens, completion_tokens, and top_k
# come from the provider response and your retriever)
start = time.time()
# ... make call ...
LATENCY.observe(time.time() - start)
TOKENS.labels('prompt', 'gpt-4o-2024-06-18').inc(prompt_tokens)
TOKENS.labels('completion', 'gpt-4o-2024-06-18').inc(completion_tokens)
HITS.labels('fees_corpus@commit:abc123').inc(top_k)
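If nothing scrapes the service yet, prometheus_client can expose these metrics as an ASGI app mounted next to your routes; the path is your call:

# metrics mount — assumes the same FastAPI `app` as above
from prometheus_client import make_asgi_app

app.mount("/metrics", make_asgi_app())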

Implement it in a week (no heroics required)

  1. Stand up the lineage backbone:
    • Deploy Marquez via helm or use DataHub if your org standardizes there.
    • Enable openlineage-airflow on your schedulers; turn on lineage for dbt/Spark jobs.
    • Tag datasets with lakeFS commits or DVC tags so you’re not chasing “latest”.
  2. Thread a trace through the stack:
    • Add OpenTelemetry SDK to the inference service; propagate traceparent headers.
    • Emit OpenLineage events using the trace_id as runId in both training and inference paths.
    • Capture request metadata: tenant_id, request_id, authz_subject (watch privacy!).
  3. Stamp everything with versions:
    • MLflow: add params/tags for data_version, code_sha, training_env (there's a sketch right after this list).
    • Prompt templates: store in Git; expose a template_version in responses and logs.
    • Vector index: version your corpus; e.g., fees_corpus@commit:abc123.
  4. Wire SRE-grade observability:
    • Export Prometheus metrics for latency histograms, error rate, token usage, retrieval hit-rate.
    • Create Grafana dashboards that pivot from metrics → trace → lineage (via links to Jaeger/Tempo and Marquez).
    • Define SLOs: p95 latency, max hallucination rate, drift thresholds.
  5. Add guardrails that fail closed:
    • Use pydantic/jsonschema or Guardrails to force function-call/JSON shapes.
    • Content/PII filters before and after LLM calls; quarantine violations.
    • Canary with Argo Rollouts; auto-rollback on SLO burn or regression metrics.
  6. Close the loop with evaluation:
    • Batch evals nightly with promptfoo/Giskard/Deepchecks using golden sets.
    • Log eval scores as lineage facets on the model and prompt template.
    • Require a green eval and passing guardrails before promoting to Prod (GitOps check in ArgoCD).
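To make step 3 concrete, stamping an MLflow training run takes a few lines; the tag names and values below are our convention, not anything MLflow mandates:

# train.py — stamp the run so lineage can join registry, data, and code
import mlflow

with mlflow.start_run():
    mlflow.set_tags({
        "data_version": "lakefs://fees-corpus/commits/abc123",  # immutable dataset ref
        "code_sha": "9f2c1ab",                                  # git commit of training code
        "training_env": "py3.11-cuda12",                        # image / environment label
    })
    # ... train, log metrics, and log the model as usual ...
    mlflow.log_metric("eval_accuracy", 0.93)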

Example canary policy tied to SLOs:

# argo-rollouts canary snippet
strategy:
  canary:
    steps:
    - setWeight: 10
    - pause: {duration: 300}
    - analysis:
        templates:
        - templateName: llm-slo-check
          args:
          - name: p95
            value: "<250ms"
          - name: hallucination_rate
            value: "<2%"

Guardrails for real failure modes: hallucination, drift, latency spikes

  • Hallucination: You’ll see confident nonsense when retrieval recall drops or templates change.
    • Mitigation: enforce structured outputs (pydantic; see the validator sketch after this list), run post-response validators (e.g., math/URL resolvers), and attach provenance in responses.
    • Measurement: model-graded evals on golden questions; target hallucination rate < 2%. Fail canary if exceeded.
    • Lineage hook: log retrieval_docs IDs and template_version so you can reproduce the context that hallucinated.
  • Drift: Data shifts break both retrieval and model behavior.
    • Mitigation: compute PSI/KL on feature distributions (a small PSI sketch follows this list); monitor embedding centroid shifts for RAG corpora.
    • Measurement: alert when drift z-score > threshold; trigger shadow reindexing.
    • Lineage hook: tie drift alerts to specific data_version/index_commit and pause promotions.
  • Latency spikes: Providers throttle, vector DBs GC, or you added a silent N+1 in tools.
    • Mitigation: Istio circuit breakers, retry+backoff, batch embedding fetches, cap top_k.
    • Measurement: Prometheus histograms and per-hop spans; watch p95 and tail (p99.9).
    • Lineage hook: every response carries trace_id → jump to slow span → find culprit (provider, vector store, tool) and associated versions.
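The structured-output check is the cheapest hallucination guardrail: validate the model's JSON against a contract and fail closed. A minimal pydantic v2 sketch, with field names invented for the fee-assistant example:

# output_guard.py — fail closed when the response doesn't match the contract
from pydantic import BaseModel, ValidationError, field_validator

class FeeAnswer(BaseModel):
    fee_tier: str
    amount_usd: float
    source_doc_ids: list[str]          # provenance: which retrieved docs back the answer

    @field_validator("source_doc_ids")
    @classmethod
    def must_cite_sources(cls, v):
        if not v:
            raise ValueError("answer must cite at least one retrieved document")
        return v

def validate_or_block(raw_json: str):
    try:
        return FeeAnswer.model_validate_json(raw_json)
    except ValidationError:
        return None  # caller returns a safe fallback and increments a guardrail counter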
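For drift, PSI is just a binned comparison of a reference sample against live traffic. A small numpy sketch; the 0.2 alert threshold is a common rule of thumb, not a law:

# drift_psi.py — population stability index between reference and live samples
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference distribution so both sides are comparable.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Floor empty buckets to avoid log(0).
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Alert (and pause promotions) when PSI crosses the usual 0.2 "significant shift" line.
if psi(np.random.normal(0, 1, 5000), np.random.normal(0.4, 1, 5000)) > 0.2:
    print("drift detected: tie the alert to data_version/index_commit and investigate")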

Quick OPA policy that blocks promotion if evals regress:

package llm.promotion

import rego.v1

default allow := false

allow if {
  input.target == "prod"
  input.metrics.hallucination_rate < 0.02
  input.metrics.p95_latency_ms < 250
  input.metrics.schema_violations == 0
}
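Wiring that policy into a pipeline step is one call to OPA's data API; the endpoint path mirrors the package name, and the metrics payload is whatever your eval job emits (assumed here):

# promotion_gate.py — ask OPA whether this candidate may promote to prod
import sys

import requests

decision = requests.post(
    "http://opa:8181/v1/data/llm/promotion/allow",
    json={"input": {
        "target": "prod",
        "metrics": {
            "hallucination_rate": 0.014,
            "p95_latency_ms": 212,
            "schema_violations": 0,
        },
    }},
    timeout=5,
).json()

if not decision.get("result", False):
    sys.exit("promotion blocked: eval or SLO regression")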

Make RCAs boring: dashboards and runbooks

Build dashboards with hyperlinks between systems. When p95 fires:

  • Click from Grafana panel to Jaeger trace using trace_id label.

  • From trace, open Marquez run with the same runId to see inputs/outputs and facets.

  • Jump to MLflow run via mlflow_run_id facet for exact model_version and data_version.

Your runbook should say:

  1. Confirm SLO burn (latency/hallucination/drift).

  2. Identify the last successful canary step and roll back via kubectl argo rollouts undo <name> (or abort mid-canary).

  3. Use trace → lineage to capture template_version, index_commit, model_version.

  4. File a PR to pin or revert the offending artifact (template, index, model).

  5. Add a regression test/eval so it never ships again.

This is how you cut MTTR from hours to minutes and stop guessing.

A real-world save: what changed after we instrumented

At that fintech, we rolled out the stack above:

  • OpenLineage events from Airflow, dbt, and the FastAPI inference service.

  • lakeFS commits for the RAG corpus; MLflow for model versions.

  • Prometheus tokens/latency metrics and an eval job with promptfoo.

Results in 30 days:

  • p95 latency stabilized at < 220ms; p99 < 600ms after fixing a vector DB GC pause found via trace spans.

  • Hallucination rate on the fee Q&A fell from ~9% to 1.4%; canary blocked two bad template merges automatically.

  • MTTR on incidents dropped from 4h median to 18m. Engineering regained their weekends.

We didn’t invent anything. We instrumented ruthlessly and wired the lineage so the system could defend itself.

Do this next (small, repeatable, defensible)

  • Pick one high-traffic endpoint and one training pipeline. Instrument both end-to-end.

  • Emit OpenLineage with the trace_id as runId; store template_version and index_commit in facets.

  • Add two SLOs (latency + hallucination) and gate releases with Argo Rollouts.

  • Put a golden set in promptfoo and fail the pipeline if accuracy regresses.

  • Socialize the dashboard and runbook; drill once.

If you want a hand, this is our bread and butter at GitPlumbers. We’ve cleaned up enough AI messes to know where bodies are buried and which knobs actually matter.

Key takeaways

  • Lineage is not a spreadsheet—emit events at every hop (ingest→transform→train→register→deploy→serve→retrieve→respond) and link them with a shared trace ID.
  • Use a backbone like `OpenLineage` + `Marquez` or `DataHub` to capture dataset, model, and prompt/template versions—especially for inference context and RAG artifacts.
  • Make lineage queryable across traces with `OpenTelemetry` so you can jump from an incident in `Prometheus` to the exact training run and dataset snapshot in `MLflow`/`lakeFS`.
  • Turn lineage into safety: define SLOs for hallucination, drift, latency; gate rollouts with `Argo Rollouts` and enforce policies with `OPA`/guardrails.
  • Start small: instrument one training pipeline, one inference endpoint, and a canary path. Prove MTTR drops before you go wall-to-wall.

Implementation checklist

  • Adopt `OpenLineage` events and stand up `Marquez` (or use `DataHub` if you already run it).
  • Propagate `trace_id` via `OpenTelemetry` from request to model call to vector store to output.
  • Stamp every artifact (dataset, feature set, prompt template, model) with `git_sha`, `data_version`, `template_version`, and `model_version`.
  • Ship `Prometheus` metrics for tokens, latency histograms, retrieval hit-rate, and moderation/guardrail outcomes.
  • Set SLOs and wire `Argo Rollouts` for canaries; trigger rollbacks on SLO burns or drift alerts.
  • Add safety checks: schema validators, PII detectors, and model-graded evals with human review for high-risk paths.
  • Automate dashboards and runbooks that pivot from alert → trace → lineage → remediation PR.

Questions we hear from teams

Do we need OpenLineage if we already run DataHub?
No. If you’re already invested in DataHub or OpenMetadata, use them. The key is to emit machine-readable lineage events and ensure you can join them with traces and your model registry. We often integrate OpenTelemetry traces with DataHub’s lineage graph just fine.
How do we track lineage for hosted LLMs (OpenAI, Anthropic) we don’t control?
Wrap provider calls. Emit spans and lineage facets with `model_name`, `model_version`, parameters, token counts, and the exact prompt template version. Log retrieval inputs, not raw PII. You won’t get provider internals, but your side of the chain is enough for RCA and rollback.
Isn’t all this expensive?
Cheaper than an outage. Start small: one service, one pipeline, one canary. The infra footprint is modest: Marquez (or DataHub), Prometheus, a trace backend, and MLflow—likely stuff you already run. The payoff is MTTR reduction and avoided incidents.
What about privacy and compliance?
Mask or hash sensitive fields in lineage facets, and store full context only in restricted stores. Use `OPA` to block promotions without a privacy review, and use dataset-level tags (PII, PCI) to prevent cross-domain joins in training.
How do we measure hallucination reliably?
Use golden sets for critical tasks and supplement with model-graded evaluations. Track schema violations, tool execution failures, and factuality checks where possible. The goal is a leading indicator you can gate on during canaries—not perfect truth.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a lineage + observability architecture review
Download the AI Lineage Implementation Checklist
