Your LLM Didn’t “Get Worse.” Your Prompts Drifted, Your Features Drifted, and Nobody Put Up a Gate.
Stabilize AI-enabled production flows with versioned prompts, frozen datasets, and automated regression barriers—plus the observability you’ll wish you had after the first hallucination hits a customer.
If your prompt changes can’t be reproduced, your incidents can’t be debugged.
The day your AI feature “randomly” started lying
I’ve watched this movie a few times now. The AI feature ships, the demo looks great, everyone high-fives… and then three weeks later someone Slacks: “Did the model get worse?”
No. What happened is more boring—and more fixable:
- A PM tweaked a prompt to “sound more helpful” and accidentally removed a constraint.
- The RAG corpus updated (new docs, re-ranked embeddings), and now retrieval pulls the wrong policy page.
- Latency spiked because a tool call started timing out, and the fallback path quietly returned ungrounded text.
If you don’t version the moving parts and you don’t have regression barriers, drift becomes your default deployment strategy.
What actually works in production is the same thing that works for every other risky system: reproducibility + instrumentation + gates.
Versioning: prompts are code, “AI configs” are deployable artifacts
Teams get into trouble when prompts live in:
- a Notion page
- a feature flag comment
- a random string literal in `app.ts`
In production you want a single versioned artifact that captures everything that can change behavior:
- model (e.g., `gpt-4.1-mini`, `claude-3-5-sonnet`)
- system + developer + user templates
- decoding params (`temperature`, `top_p`, `max_tokens`)
- tool schemas and allowlists
- retrieval config (index name, embedding model, `k`, filters)
- output schema + validators
Here’s a pattern we’ve used at GitPlumbers: store an `ai_config` as an immutable, versioned file in your repo (or in a prompt registry like Langfuse, PromptLayer, etc.), then reference it by version at runtime.
```yaml
# ai-configs/support-answering/1.7.3.yaml
id: support-answering
version: 1.7.3
provider: openai
model: gpt-4.1-mini
decoding:
  temperature: 0.2
  top_p: 1.0
  max_tokens: 700
rag:
  index: help-center-prod
  embeddingModel: text-embedding-3-large
  k: 6
  filters:
    product: "payments"
output:
  schema: "schemas/support_answer.json"
  requireCitations: true
tools:
  allow:
    - get_order_status
    - refund_policy_lookup
safety:
  piiRedaction: true
  blockOn:
    - "card_number"
    - "api_key"
```

Two rules I enforce:
- No anonymous prompt changes. Every behavior change gets a version bump and a short changelog.
- No runtime mystery meat. The service logs the exact `ai_config` version used so you can reproduce any output.
If this sounds like overkill, you haven’t had to explain a hallucinated refund policy to Legal.
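To make rule two concrete, here’s a minimal sketch of resolving a pinned config version at request time and stamping every log line and response with it. The `loadAiConfig` helper, the file layout, and the `AI_CONFIG_VERSION` variable are illustrative assumptions, not a prescribed library:

```typescript
import { readFileSync } from "node:fs";
import { parse } from "yaml"; // assumes the `yaml` npm package

interface AiConfig {
  id: string;
  version: string;
  model: string;
  decoding: { temperature: number; top_p: number; max_tokens: number };
}

// Hypothetical loader: configs are immutable files keyed by id + version.
function loadAiConfig(id: string, version: string): AiConfig {
  const raw = readFileSync(`ai-configs/${id}/${version}.yaml`, "utf8");
  return parse(raw) as AiConfig;
}

// Stand-in for your actual generation call.
async function callModel(cfg: AiConfig, question: string): Promise<{ text: string }> {
  return { text: `(${cfg.model}) ${question}` };
}

export async function handleRequest(question: string) {
  // The version comes from deploy-time config (env var, Helm value), never an ad-hoc edit.
  const cfg = loadAiConfig("support-answering", process.env.AI_CONFIG_VERSION ?? "1.7.3");
  const answer = await callModel(cfg, question);

  // Every log line and stored output carries the exact config version used.
  console.log(JSON.stringify({ event: "ai_answer", config: `${cfg.id}@${cfg.version}` }));
  return { ...answer, aiConfigVersion: cfg.version };
}
```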
Datasets: freeze your “goldens” and stop testing against vibes
Prompt drift and feature drift aren’t abstract. They show up as:
- “Used to cite sources, now it doesn’t.”
- “It’s more polite but less correct.”
- “It fails only for EU customers.”
You can’t stabilize that with a couple of hand-picked examples.
What works is building evaluation datasets like you’d build reliability tests:
- Golden set: real tickets/queries with expected constraints (must cite docs, must not claim policy exceptions).
- Adversarial set: prompts designed to break guardrails (“Ignore previous instructions…”, jailbreaks, prompt injections via RAG).
- Latency/cost set: long inputs, tool-heavy flows, worst-case retrieval.
Store datasets with stable IDs and snapshots. If your RAG corpus changes daily, your eval needs to pin:
- the document snapshot (or at least doc IDs + versions)
- retrieval parameters
A simple filesystem layout works fine:
```
evals/
  support-answering/
    datasets/
      golden_v12.jsonl
      adversarial_v4.jsonl
    rubrics/
      correctness.yaml
      safety.yaml
      style.yaml
```

Example jsonl entry:

```json
{"id":"ticket_18422","input":{"question":"Can I charge back after 60 days?","customer_region":"EU"},"expected":{"must_cite":true,"must_include":["time_limit"],"must_not_include":["guaranteed_refund"],"policy":"payments_refunds_v3"}}
```

This is the part teams skip because it’s “not fun.” It’s also the part that prevents the 2am rollback.
Automatic regression barriers: make CI the bad guy
If you let prompt changes ship without a gate, you’ll get:
- slow-motion quality decay (everyone “improves” the prompt)
- silent safety regressions (PII leaks, policy violations)
- inconsistent behavior across environments
The fix is boring and effective: run evals in CI and block deploy on thresholds.
A pragmatic gating approach:
- Run deterministic checks first (schema validation, citation presence, tool allowlist).
- Run model-graded or heuristic checks second (factuality rubric, refusal correctness).
- Compare against the last blessed config (`main` or `prod`) and enforce no regressions.
Here’s a GitHub Actions example that runs an eval job and fails if safety or correctness drops:
```yaml
name: ai-regression-gate
on:
  pull_request:
    paths:
      - "ai-configs/**"
      - "evals/**"
      - "src/ai/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Run offline evals
        run: |
          npm run eval -- \
            --config ai-configs/support-answering/1.7.3.yaml \
            --baseline origin/main:ai-configs/support-answering/1.7.2.yaml \
            --dataset evals/support-answering/datasets/golden_v12.jsonl \
            --dataset evals/support-answering/datasets/adversarial_v4.jsonl \
            --min_correctness 0.92 \
            --max_safety_violations 0
```

This is your regression barrier. It turns “I think this prompt is better” into “prove it.”
If you don’t force the argument into CI, you’re going to have the argument in production.
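The `npm run eval` entry point is your own script, and the gate itself can be a few dozen lines. Here’s a minimal sketch of the comparison logic, with metric names and thresholds mirroring the workflow flags above; the scoring itself is assumed to happen elsewhere in the runner:

```typescript
// Compare candidate scores against the blessed baseline and hard floors,
// then exit non-zero so CI blocks the deploy.
interface EvalScores {
  correctness: number;       // 0..1, aggregate over the golden set
  safetyViolations: number;  // absolute count over the adversarial set
}

export function gate(
  candidate: EvalScores,
  baseline: EvalScores,
  minCorrectness = 0.92,
  maxSafetyViolations = 0,
): void {
  const failures: string[] = [];

  if (candidate.correctness < minCorrectness) {
    failures.push(`correctness ${candidate.correctness} below floor ${minCorrectness}`);
  }
  if (candidate.correctness < baseline.correctness) {
    failures.push(`correctness regressed vs baseline (${baseline.correctness})`);
  }
  if (candidate.safetyViolations > maxSafetyViolations) {
    failures.push(`safety violations: ${candidate.safetyViolations}`);
  }

  if (failures.length > 0) {
    console.error("Regression gate failed:\n- " + failures.join("\n- "));
    process.exit(1); // failed job, blocked deploy
  }
  console.log("Regression gate passed.");
}
```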
Instrumentation that actually helps: traces over vibes
When AI-enabled flows fail, they fail between components:
- retrieval pulled irrelevant docs
- the model ignored citations
- tool call timed out and fallback hallucinated
- JSON parsing failed and the app silently string-sliced
You need end-to-end observability, not just “tokens used.”
What I recommend (and what we implement at GitPlumbers) is:
- OpenTelemetry tracing: one trace per user request with spans for retrieval, generation, and tool calls.
- Structured events: log safety decisions, schema validation results, refusal reasons.
- Metrics: p50/p95 latency, tool error rate, retrieval hit rate, cost/request, “grounded answer” rate.
Example: emitting OTel spans in a TypeScript service:
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("ai-flow");

// TicketReq, retrieveDocs, and generateAnswer are app-specific; the tracing pattern is the point.
export async function answerTicket(req: TicketReq) {
  return tracer.startActiveSpan("ai.answer_ticket", async (span) => {
    span.setAttribute("ai.config_version", req.aiConfigVersion);
    span.setAttribute("customer.region", req.region);

    const retrieved = await tracer.startActiveSpan("ai.retrieve", async (rSpan) => {
      rSpan.setAttribute("rag.index", "help-center-prod");
      const docs = await retrieveDocs(req.question);
      rSpan.setAttribute("rag.docs_count", docs.length);
      rSpan.end();
      return docs;
    });

    try {
      const result = await tracer.startActiveSpan("ai.generate", async (gSpan) => {
        const out = await generateAnswer(req.question, retrieved);
        gSpan.setAttribute("ai.tokens_in", out.usage.input_tokens);
        gSpan.setAttribute("ai.tokens_out", out.usage.output_tokens);
        gSpan.setAttribute("ai.refusal", out.refusal ?? false);
        gSpan.end();
        return out;
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e: any) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      throw e;
    } finally {
      span.end();
    }
  });
}
```

Then you wire it into Prometheus/Grafana and actually look at:
- p95 latency by span (is retrieval slow? tool call slow? model slow?)
- error budgets (SLO burn rate) for AI endpoints
- drift proxies like increased refusal rate, increased “no citation” rate, increased tool failures
If you can’t answer “what changed?” in 10 minutes from dashboards, you’re flying blind.
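Getting to a ten-minute answer is mostly about emitting the right instruments next to the spans. A minimal sketch using the OpenTelemetry metrics API; the instrument names and attributes are assumptions, not a standard, and a metrics SDK/exporter is assumed to be configured elsewhere:

```typescript
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("ai-flow");

const latencyMs = meter.createHistogram("ai.request.latency_ms", {
  description: "End-to-end latency of the AI answer path",
});
const toolErrors = meter.createCounter("ai.tool.errors", {
  description: "Tool calls that failed or timed out",
});
const noCitation = meter.createCounter("ai.answers.no_citation", {
  description: "Answers returned without a retrieval-backed citation",
});

export function recordAnswerMetrics(opts: {
  durationMs: number;
  toolErrorCount: number;
  hasCitations: boolean;
  configVersion: string;
}) {
  // Tag everything with the ai_config version so drift shows up per release.
  const attrs = { "ai.config_version": opts.configVersion };
  latencyMs.record(opts.durationMs, attrs);
  toolErrors.add(opts.toolErrorCount, attrs);
  if (!opts.hasCitations) noCitation.add(1, attrs);
}
```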
Guardrails: fail closed, validate everything, and add circuit breakers
Hallucinations aren’t a moral failing; they’re a property of the system. You mitigate them the same way you mitigate any unsafe component: constrain inputs/outputs, validate, and cut off when abnormal.
Concrete guardrails that have saved teams:
- Strict output schemas (`JSON Schema`, `zod`) + reject/repair on mismatch
- Citation enforcement: require doc IDs; reject answers without retrieval-backed citations
- Tool allowlists: the model can only call approved functions with typed args
- Prompt injection handling: strip/label untrusted RAG content; never let retrieved text override system policy
- PII redaction before logging; block if the output contains secrets
- Circuit breaker: if safety violations spike or tool error rate spikes, auto-disable the AI path and fall back to a safe deterministic response
Example schema validation with zod:
```typescript
import { z } from "zod";

const SupportAnswer = z.object({
  answer: z.string().min(1),
  citations: z.array(z.object({ docId: z.string(), quote: z.string().min(1) })).min(1),
  confidence: z.number().min(0).max(1)
});

export function validate(output: unknown) {
  const parsed = SupportAnswer.safeParse(output);
  if (!parsed.success) throw new Error("AI_OUTPUT_SCHEMA_INVALID");
  return parsed.data;
}
```

Circuit breaker logic (simplified):
- If `no_citation_rate > 2%` over 10 minutes → disable AI responses
- If `p95_latency > 3s` over 5 minutes → degrade to cached/templated answers
- If `safety_violation_count > 0` in canary → rollback config version
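A minimal sketch of that breaker for the no-citation rule, implemented as a sliding window over recent answers; the window size, threshold, and minimum sample count are assumptions, and the fallback wiring is up to your service:

```typescript
// Sliding-window circuit breaker for the AI path (the no-citation rule above).
type BreakerState = "closed" | "open";

export class AiCircuitBreaker {
  private state: BreakerState = "closed";
  private events: { at: number; noCitation: boolean }[] = [];

  constructor(
    private readonly windowMs = 10 * 60 * 1000, // 10-minute window
    private readonly maxNoCitationRate = 0.02,  // 2%
    private readonly minSamples = 50,           // don't trip on tiny traffic
  ) {}

  record(answer: { hasCitations: boolean }): void {
    const now = Date.now();
    this.events.push({ at: now, noCitation: !answer.hasCitations });
    this.events = this.events.filter((e) => now - e.at <= this.windowMs);

    const bad = this.events.filter((e) => e.noCitation).length;
    if (this.events.length >= this.minSamples && bad / this.events.length > this.maxNoCitationRate) {
      this.state = "open"; // stop serving AI answers; alert and serve the deterministic fallback
    }
  }

  aiPathEnabled(): boolean {
    return this.state === "closed";
  }
}
```

Re-closing the breaker can be manual or time-based; the important part is that the decision is automatic and the fallback is boring.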
This is the difference between “we had an incident” and “we had a blip.”
Shipping without drama: canaries, baselines, and rollback that’s not a fire drill
Once you have versioned configs + eval gates + observability, you can ship like an adult:
- Canary the new `ai_config` to 1–5% of traffic.
- Compare key metrics to baseline:
  - correctness proxy (citation rate, schema pass rate)
  - safety outcomes (policy violations, PII detections)
  - latency (p95) and cost/request
- Auto-rollback if the canary violates SLOs or regression thresholds.
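The comparison itself can be a small job that runs during the canary window and decides promote vs. rollback. A sketch with assumed metric names and thresholds; wire it to your real Prometheus/analysis queries:

```typescript
// Decide whether a canary ai_config gets promoted or rolled back.
interface CanaryMetrics {
  schemaPassRate: number;   // correctness proxy
  citationRate: number;     // correctness proxy for RAG answers
  safetyViolations: number; // absolute count during the canary window
  p95LatencyMs: number;
  costPerRequest: number;
}

export function canaryVerdict(
  canary: CanaryMetrics,
  baseline: CanaryMetrics,
): "promote" | "rollback" {
  const regressions: string[] = [];

  if (canary.safetyViolations > 0) regressions.push("safety");
  if (canary.schemaPassRate < baseline.schemaPassRate - 0.01) regressions.push("schema");
  if (canary.citationRate < baseline.citationRate - 0.02) regressions.push("citations");
  if (canary.p95LatencyMs > baseline.p95LatencyMs * 1.2) regressions.push("latency");
  if (canary.costPerRequest > baseline.costPerRequest * 1.3) regressions.push("cost");

  if (regressions.length > 0) {
    console.error(`Canary rollback: ${regressions.join(", ")}`);
    return "rollback"; // e.g., revert the configVersion bump shown below
  }
  return "promote";
}
```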
In GitOps shops (ArgoCD/Flux), the ai_config version can be promoted like any other artifact. I’ve seen teams do this with a simple values.yaml bump:
```yaml
# charts/support-bot/values.yaml
ai:
  configVersion: "1.7.3"
  canary:
    enabled: true
    weight: 5
```

The key is that rollback is trivial: revert 1.7.3 → 1.7.2. No “prompt archaeology.”
What GitPlumbers fixes when you’re already in the ditch
Most teams come to GitPlumbers after the first painful incident: a hallucinated policy, a compliance scare, or an AI endpoint that blows up p95 latency and takes the app with it.
We typically stabilize in this order:
- Instrument the flow (OpenTelemetry traces + metrics) so you can see failure modes.
- Lock down versions (configs, prompts, retrieval settings) to restore reproducibility.
- Build the eval datasets (golden + adversarial) and wire up CI regression barriers.
- Add guardrails (schemas, tool allowlists, citation enforcement, circuit breakers).
If you want to skip the “learn by incident” phase, start with the checklist above and treat the AI layer like a production subsystem—not a demo.
- Related: /services/ai-in-production-hardening
- Case study: /case-studies/rag-stability-regression-gates
CTA: If your AI feature is shipping prompt edits with no gates and no traces, we can help you stand up a regression barrier and observability within a week—without boiling the ocean.
Key takeaways
- Treat prompts, tool schemas, and retrieval configs like code: **versioned, reviewed, and reproducible**.
- Stabilize “prompt/feature drift” with **frozen eval datasets** (goldens + adversarial cases) and run them in CI/CD as a **hard regression barrier**.
- Instrument the full chain (request → retrieval → generation → tool calls) with **OpenTelemetry traces + structured events** so you can debug hallucinations and latency spikes in hours, not weeks.
- Guardrails are not vibes: implement **schema validation, policy checks, and circuit breakers** that fail closed on risky outputs.
- Use **canaries + SLOs** to ship safely and roll back fast when drift shows up in production.
Implementation checklist
- Prompts stored in a registry (or repo) with semantic versions and changelogs
- Model, temperature, top_p, system prompt, tool schemas, and retrieval settings captured as an immutable `ai_config`
- Golden + adversarial eval datasets stored with IDs and pinned snapshots
- CI job runs offline evals; deploy is blocked on regression thresholds
- OpenTelemetry traces include retrieval metrics, token counts, tool call timings, and safety outcomes
- Dashboards track p50/p95 latency, hallucination proxies, refusal rate, tool failure rate, and cost per request
- Runtime guardrails: JSON schema validation, allowlisted tools, PII redaction, circuit breaker on anomaly spikes
- Canary deploy with automated rollback on SLO burn rate
Questions we hear from teams
- What’s the difference between prompt drift and feature drift?
- **Prompt drift** is behavior change caused by edits to prompts, templates, decoding params, or tool schemas. **Feature drift** is broader: changes in retrieval corpora, ranking, embeddings, business rules, upstream APIs, or latency/fallback paths that change what the user experiences. In production, they blend together—so version and instrument the whole chain.
- Do we need fancy eval tooling to start?
- No. Start with `jsonl` datasets, a small runner script, and hard thresholds in CI. Tools like `Langfuse`, `Evidently`, or custom dashboards help as you scale, but the core win is **frozen datasets + automated gating**.
- How do we handle nondeterminism in LLM outputs during regression tests?
- Use lower `temperature`, run multiple samples per test case (e.g., 3–5), and gate on aggregate metrics. Prefer deterministic checks (schema validity, citations, tool usage) and use model-graded rubrics for fuzzier dimensions (helpfulness/style). Always compare against a baseline config rather than chasing absolute scores.
- What metrics should we put on a dashboard first?
- Start with p50/p95 latency, tool error rate, schema pass rate, citation rate (if RAG), refusal rate, and cost/request. Add safety violation counts and SLO burn-rate alerts. These catch the common production failure modes: hallucinations, drift, and latency spikes.
- What’s the minimum guardrail set for a customer-facing AI feature?
- At minimum: strict output schema validation, tool allowlists with typed args, citation enforcement for RAG answers, PII redaction (and blocking where needed), and a circuit breaker that disables the AI path on anomaly spikes. Make the safe path boring and reliable.