The One Time a “Harmless” SQL Change Made Our LLM Lie in Production
Data lineage for AI isn’t compliance theater. It’s the only reliable way to answer: “What data trained this model?” and “Why did this user get that output?”—without a week-long war room.
The incident pattern I keep seeing: the model gets blamed, the data did it
A few years back, I watched a team roll out an “easy” change: a LEFT JOIN became an INNER JOIN in a dbt model feeding training data. Looked fine in CI. No tests broke. The dashboard even got faster.
Two weeks later, the LLM-powered support assistant started confidently hallucinating refund policies. Leadership was convinced “the model degraded.” The vendor got blamed. Engineers started tweaking prompts like it was a Ouija board.
The root cause was boring: the join change dropped a slice of examples that contained edge-case policy language. The model didn’t “get worse.” We trained it on different reality, and nobody could prove it quickly because there was no lineage.
If you’re shipping AI-enabled flows, you need to answer—fast:
- What exact data trained this model? (tables, partitions, files, transforms)
- What exact version served this output? (weights, code SHA, prompt, retrieval corpus)
- What changed between “good” and “bad”? (upstream schema, feature logic, index refresh, vendor model revision)
Without lineage + observability, your MTTR is basically “whenever the smartest person gets time to spelunk.”
What “data lineage for AI” actually means (training + inference)
Most orgs stop at warehouse lineage (“table A feeds table B”). AI production needs a wider graph that includes:
Training lineage
- Raw sources: s3://…, Kafka topics, vendor exports
- Transforms: dbt, Spark, Flink, Beam
- Feature definitions: Feast FeatureViews, SQL features, windowing logic
- Training run: code SHA, Docker image, hyperparameters, random seeds
- Model artifact: registry version + approval state
Inference lineage (this is where teams fall down)
- Request metadata: tenant, cohort, app version
- Model version actually loaded (not “latest” in your head)
- Prompt template version + system prompt
- RAG inputs: retrieval corpus version, embedding model version, top-k docs + doc IDs
- Post-processing: safety filters, business rules, caching layers
If you can’t tie an output to model + prompt + retrieval corpus + upstream data versions, hallucinations and drift become un-debuggable folklore.
Minimum viable lineage: the 5 fields that save you in a war room
You can build a cathedral later. Start with metadata you can consistently propagate.
At GitPlumbers, we usually standardize these fields first:
- model_version (from MLflow / your registry)
- training_dataset_fingerprint (hash of partitions/files + row counts)
- prompt_version (Git SHA or semantic version)
- retrieval_corpus_version (index build ID + source snapshot)
- feature_view_version (for structured features / Feast)
Then we do two things with them:
- Attach to every run (training, indexing, batch scoring)
- Attach to every request (online inference)
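On the run side, the cheapest implementation is tagging the registry entry at training or indexing time. Here is a minimal sketch with MLflow; the tag values are placeholders you'd compute in your pipeline, not real identifiers:

```python
import mlflow

# Hypothetical values -- in a real pipeline these come from your dbt/Airflow run,
# the fingerprint job, and your Git metadata.
lineage_tags = {
    "training_dataset_fingerprint": "sha256:3f9a...",  # hash of partitions/files + row counts
    "feature_view_version": "support_features_v7",
    "prompt_version": "prompts/support@a1b2c3d",
    "retrieval_corpus_version": "corpus-2024-05-01-build-42",
    "code_sha": "a1b2c3d",
}

with mlflow.start_run(run_name="support_assistant_finetune") as run:
    mlflow.set_tags(lineage_tags)                 # queryable later in the registry/UI
    # ... training happens here ...
    mlflow.log_param("base_model", "llama-3-8b")  # example hyperparameter
```

Indexing and batch-scoring jobs get the same treatment, so every artifact in the registry carries the fields above.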
Here’s a concrete pattern for inference: propagate lineage fields as OpenTelemetry baggage so every downstream span sees it.
import { context, propagation, trace } from "@opentelemetry/api";

export function withLineage<T>(lineage: Record<string, string>, fn: () => Promise<T>) {
  // createBaggage expects a Record<string, BaggageEntry>, so wrap each value
  const baggage = propagation.createBaggage(
    Object.fromEntries(
      Object.entries(lineage).map(([k, v]) => [k, { value: v }] as const)
    )
  );
  return context.with(propagation.setBaggage(context.active(), baggage), fn);
}

// Usage in an API handler
await withLineage(
  {
    model_version: process.env.MODEL_VERSION!,
    prompt_version: process.env.PROMPT_VERSION!,
    retrieval_corpus_version: process.env.CORPUS_VERSION!,
  },
  async () => {
    const span = trace.getTracer("ai").startSpan("llm.generate");
    try {
      // call LLM / vector DB / etc
    } finally {
      span.end();
    }
  }
);

That one move turns “why did this happen?” into a queryable trace instead of archaeology.
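Baggage only pays off if downstream services copy it onto their spans, where it becomes searchable. If the retrieval or generation service is Python, a minimal read-side sketch (assuming the OTel SDK is already configured and the baggage arrived via standard propagation headers) looks like this:

```python
from opentelemetry import baggage, context, trace

LINEAGE_KEYS = ("model_version", "prompt_version", "retrieval_corpus_version")

def stamp_lineage_on_span(span: trace.Span) -> None:
    """Copy lineage baggage entries onto a span so they're queryable in Tempo/Jaeger."""
    ctx = context.get_current()
    for key in LINEAGE_KEYS:
        value = baggage.get_baggage(key, ctx)
        if value is not None:
            span.set_attribute(key, str(value))

tracer = trace.get_tracer("ai")

with tracer.start_as_current_span("vector.search") as span:
    stamp_lineage_on_span(span)
    # ... query the vector DB here ...
```

In practice most teams wrap this in a span processor so every span picks up the lineage attributes automatically instead of relying on each handler to remember.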
Lineage instrumentation that works in the real world (OpenLineage + OTel)
I’ve seen lineage projects die because someone tried to boil the ocean: pick a fancy graph tool, mandate perfect metadata, and spend six months wiring it up—only to learn nobody trusts it.
The pragmatic route is:
- Use OpenLineage for pipeline/job-level lineage (Airflow, Dagster, dbt)
- Use OpenTelemetry for request-level lineage (online inference)
- Store artifacts + metadata in a model registry (MLflow, SageMaker Model Registry, Vertex AI Model Registry)
OpenLineage with Airflow (example)
If you’re on Airflow, the quickest win is emitting OpenLineage events to Marquez.
pip install openlineage-airflow

# airflow.cfg (or env vars)
OPENLINEAGE_URL: http://marquez:5000
OPENLINEAGE_NAMESPACE: prod
OPENLINEAGE_EXTRACTORS: "dbt,postgres,bigquery,snowflake"

With this, your DAG runs produce lineage events: inputs, outputs, job facets. The key is to enrich events with dataset versions (partition IDs, snapshots) and code versions (Git SHA).
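To make those dataset versions cheap to produce, a fingerprint over metadata is usually enough; you never have to copy or re-read the data. A minimal sketch, where the partition IDs, row counts, and file manifest are placeholders for whatever your warehouse or object store reports:

```python
import hashlib
import json

def dataset_fingerprint(partitions: list[str], row_counts: dict[str, int],
                        file_manifest: list[str], code_sha: str) -> str:
    """Stable hash over dataset metadata -- no need to copy or re-read the data."""
    payload = json.dumps(
        {
            "partitions": sorted(partitions),
            "row_counts": dict(sorted(row_counts.items())),
            "files": sorted(file_manifest),
            "code_sha": code_sha,
        },
        sort_keys=True,
    )
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()

# Hypothetical usage
fp = dataset_fingerprint(
    partitions=["dt=2024-05-01", "dt=2024-05-02"],
    row_counts={"dt=2024-05-01": 120_431, "dt=2024-05-02": 118_902},
    file_manifest=["s3://bucket/exports/part-000.parquet"],
    code_sha="a1b2c3d",
)
```

Attach that string to the lineage event and to the model registry entry, and “what exact data trained this model?” becomes a lookup instead of forensics.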
dbt exposures for AI datasets
A lot of “training sets” are just dbt models someone exported. Put a name on them.
version: 2

models:
  - name: training_customer_support
    description: "Curated dataset for LLM fine-tuning and evals"
    config:
      tags: ["ai_training", "pii_reviewed"]

exposures:
  - name: support_assistant_model
    type: application
    owner:
      name: ai-platform
    depends_on:
      - ref('training_customer_support')

Now lineage tools (and humans) can see: “this model depends on that dataset.” Boring, but it prevents the classic “someone refactored a model that nobody knew fed training.”
Guardrails that stop hallucinations, drift, and latency spikes from turning into incidents
Lineage tells you what happened. Guardrails stop it from happening (or at least limit blast radius).
1) Data quality gates before training/indexing
If your training set or retrieval corpus shifts silently, the model will “drift” in ways that look like hallucination.
Put hard checks in the pipeline using Great Expectations (or Soda, Deequ). Example: null-rate bounds + distribution checks.
import great_expectations as gx

context = gx.get_context()
# Fluent pandas datasource: read_csv hands back a Validator directly
validator = context.sources.pandas_default.read_csv("/data/training.csv")

validator.expect_column_values_to_not_be_null("policy_text")
validator.expect_column_values_to_be_between("refund_days", 0, 60)

# Example reference distribution -- take this from a known-good training snapshot
baseline_issue_types = {
    "values": ["billing", "refund", "shipping", "other"],
    "weights": [0.35, 0.25, 0.25, 0.15],
}
validator.expect_column_kl_divergence_to_be_less_than(
    column_name="issue_type",
    partition_object=baseline_issue_types,
    threshold=0.1,
)

results = validator.validate()
if not results.success:
    raise SystemExit("Data quality gate failed")

Tie the validation run ID into lineage metadata so you can prove “this model trained on a dataset that passed checks.”
2) Drift detection you can actually operate
Everyone loves drift dashboards until they page nobody.
What works:
- Define 2–3 drift signals that correlate with user pain:
  - Feature distribution PSI/KL for key features (see the sketch below)
  - Retrieval hit-rate / “no relevant docs found” rate
  - Eval regression on a fixed golden set
- Page on rate of change + sustained breaches, not noise
- Record drift metrics with Prometheus and alert in Grafana
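For the PSI signal, you don’t need a platform. A small job that compares the current feature distribution to a training-time baseline and exposes the score to Prometheus is enough to alert on. A minimal sketch; the bucket edges, baseline snapshot, and file names are illustrative:

```python
import numpy as np
from prometheus_client import Gauge, start_http_server

PSI = Gauge("feature_psi", "Population Stability Index vs. training baseline", ["feature"])

def psi(baseline: np.ndarray, current: np.ndarray, bins: np.ndarray) -> float:
    """PSI between two samples over fixed bucket edges captured at training time."""
    eps = 1e-6  # avoid log(0) on empty buckets
    b_frac = np.histogram(baseline, bins=bins)[0] / max(len(baseline), 1) + eps
    c_frac = np.histogram(current, bins=bins)[0] / max(len(current), 1) + eps
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    bins = np.linspace(0, 60, 13)                    # e.g. refund_days buckets fixed at training time
    baseline = np.load("baseline_refund_days.npy")   # hypothetical training-time snapshot
    current = np.load("current_refund_days.npy")     # hypothetical recent window
    PSI.labels(feature="refund_days").set(psi(baseline, current, bins))
```

A common rule of thumb treats PSI above roughly 0.2 as a meaningful shift; alert on sustained breaches rather than a single noisy sample, per the list above.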
3) Latency SLOs + circuit breakers around LLM calls
Latency spikes usually come from:
- Vector DB compaction/replication (hello, pgvector vacuum storms)
- Token explosion from prompt growth
- Provider-side throttling (429s) + retries
Guardrails:
- Enforce timeout_ms per dependency
- Cap tokens (max_output_tokens) and prompt size
- Use a circuit breaker + fallback response path
# Example policy you can enforce at the gateway or service layer
llm:
  timeout_ms: 2500
  max_prompt_tokens: 6000
  max_output_tokens: 512
  retries:
    max_attempts: 2
    backoff_ms: 200
  circuit_breaker:
    open_after_failures: 10
    reset_after_ms: 30000
  fallback:
    mode: "retrieve_only" # return citations/snippets without generation

When the breaker opens, your traces should still include model_version and retrieval_corpus_version so you can see exactly which deploy caused the spike.
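If your gateway can’t enforce that policy for you, the breaker semantics are small enough to implement in the service itself. A minimal async sketch in Python; call_llm and retrieve_only_fallback are hypothetical stand-ins for your own client and fallback code:

```python
import asyncio
import time

class CircuitBreaker:
    """Opens after N consecutive failures, half-opens again after a cooldown."""
    def __init__(self, open_after_failures: int = 10, reset_after_ms: int = 30_000):
        self.open_after_failures = open_after_failures
        self.reset_after_ms = reset_after_ms
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if (time.monotonic() - self.opened_at) * 1000 >= self.reset_after_ms:
            self.opened_at = None  # half-open: let one request probe the provider
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1
        if self.failures >= self.open_after_failures:
            self.opened_at = time.monotonic()

breaker = CircuitBreaker()

async def generate(prompt: str, docs: list[str]) -> dict:
    if not breaker.allow():
        return await retrieve_only_fallback(docs)  # placeholder: citations/snippets only
    try:
        answer = await asyncio.wait_for(call_llm(prompt, docs), timeout=2.5)  # timeout_ms: 2500
        breaker.record(ok=True)
        return answer
    except Exception:  # includes timeouts; count the failure and fall back
        breaker.record(ok=False)
        return await retrieve_only_fallback(docs)
```

The numbers mirror the YAML above; tune them against your SLOs rather than copying them blindly.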
A concrete end-to-end pattern: from training run to a single bad user answer
Here’s the flow that has saved teams real weekends:
Training pipeline emits lineage
- OpenLineage event includes: source datasets + partitions, dbt model version, Git SHA
- Training run registers the model in MLflow with tags: training_dataset_fingerprint, code_sha, feature_view_version
Indexing pipeline (RAG) emits lineage
- Corpus snapshot ID + embedding model version
- Vector index build ID (your retrieval_corpus_version)
Inference service emits OTel traces
- Root span: ai.request
- Child spans: vector.search, llm.generate, policy.filter
- Baggage/tags: model_version, prompt_version, retrieval_corpus_version
When an incident happens (hallucination report)
- Grab the request_id from logs or the UI
- Pull the trace in Grafana Tempo / Jaeger
- Read the versions right off the span attributes
- Jump to the lineage graph to see which upstream datasets changed
This turns “LLM is lying” into an actionable diff:
- Prompt changed? Roll back prompt_version.
- Corpus snapshot changed? Rebuild the index or roll back retrieval_corpus_version.
- Training data changed? Trace to the dbt job + commit SHA; revert or patch.
I’ve watched teams go from 3–5 day incident cycles to same-day fixes once this plumbing is in place.
What we do at GitPlumbers when a team asks for lineage (without a 6-month platform rebuild)
The fastest path is usually a staged rollout:
Instrument inference first (because that’s where the pages come from)
- Add OTel tracing + consistent lineage tags
- Add latency/error SLOs for AI endpoints
Add lineage to the top 3 pipelines feeding training/corpus
- dbt + Airflow/Dagster emitting OpenLineage
- Dataset fingerprints + validation gates
Lock down change control
- Model/prompt/corpus versions are deploy artifacts, not tribal knowledge
- Canary releases + automatic rollback on SLO burn
Run a lineage fire drill
- Pick a bad output and trace it end-to-end in under 30 minutes
If this sounds like “basic engineering hygiene,” that’s because it is. AI just punishes you faster when you don’t do it.
AI incidents are rarely novel. They’re the same old data + ops failures—just with a chatbot narrating the blast radius.
If you want a second set of eyes, GitPlumbers does this kind of AI in Production rescue work: instrument the system you already have, make it observable, and put guardrails in the places that actually fail.
Key takeaways
- If you can’t trace an output back to exact datasets/features/prompts/model weights, you don’t have an AI system—you have a slot machine with on-call rotation.
- Lineage has to cover both training and inference: datasets, transforms, feature definitions, model versions, prompts, retrieval corpora, and runtime dependencies.
- Instrument once, reuse everywhere: OpenTelemetry traces + consistent IDs (dataset/model/prompt/feature) make incidents solvable in hours, not days.
- Guardrails belong in the pipeline: data quality gates, drift monitors, canary releases, and runtime circuit breakers for hallucinations and latency.
- Start with the minimum viable lineage: 3–5 metadata fields attached to every run and request. Expand after you’ve survived your first incident.
Implementation checklist
- Define a stable set of IDs: `dataset_version`, `feature_view_version`, `model_version`, `prompt_version`, `retrieval_corpus_version`.
- Emit OpenLineage events for every ETL/ELT job (Airflow/Dagster/dbt).
- Register every trained model with MLflow (or equivalent) including training data fingerprints and code SHA.
- Propagate lineage metadata into inference requests and OpenTelemetry traces.
- Add pre-deploy gates: schema checks, null-rate bounds, distribution checks, and embedding/retrieval index freshness.
- Add runtime guardrails: timeouts, fallbacks, rate limits, prompt/response validators, and circuit breakers around LLM calls.
- Set SLOs for AI endpoints (latency, error rate, “unsafe output” rate) and page on violations.
- Run at least one “lineage fire drill”: pick a bad output and prove you can trace it end-to-end in under 30 minutes.
Questions we hear from teams
- Do we really need OpenLineage, or can we rely on warehouse lineage tools?
- Warehouse lineage (BigQuery, Snowflake, Monte Carlo, etc.) is a good start, but it typically won’t capture model registry versions, prompt versions, RAG index builds, or request-level context. For AI production you need both: job-level lineage (OpenLineage or equivalent) and request-level tracing (OpenTelemetry).
- What’s the first thing to implement if we’re already in production and getting paged?
- Instrument inference with OpenTelemetry and propagate `model_version`, `prompt_version`, and `retrieval_corpus_version` on every request. It’s the fastest way to cut MTTR because you can immediately correlate bad outputs with deploy artifacts and dependency latency.
- How do we fingerprint a training dataset without copying it?
- Use metadata: table/view name + partition IDs + row counts + min/max event timestamps + a hash of file manifests (for S3/GCS). Store that fingerprint as a tag in your model registry entry so it’s queryable later.
- How do we keep prompt changes from becoming untraceable “prompt drift”?
- Treat prompts as deploy artifacts: version them in Git, inject `prompt_version` at runtime, and log the rendered prompt template ID (not necessarily the full prompt if it contains sensitive data). Canary prompt changes and roll back on SLO or eval regression.
- What’s a practical fallback when the LLM path is unhealthy?
- For RAG systems, a common fallback is `retrieve_only`: return citations/snippets and skip generation when latency/error budgets are blown. It’s not pretty, but it keeps the user experience honest and prevents confident hallucinations under duress.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
