A/B Testing LLMs in Production Without Burning Customers
The hard-won blueprint for shipping LLM experiments with real observability, safety guardrails, and metrics that actually move the business.
Speed with a seatbelt: ship LLM experiments fast without betting the company on a hunch.
The night the “smart” chatbot went dumb
We shipped a new LLM variant to 10% of customer support traffic at a fintech. Five minutes later, p95 latency doubled, the bot hallucinated ACH cutoff times, and escalation tickets spiked. We had traces, guardrails, and a kill switch. We rolled back in 90 seconds and learned more in 24 hours than two weeks of offline evals. That’s what a real A/B framework buys you: speed with a seatbelt.
If you’ve been burned by consultants waving an eval spreadsheet, this one’s for you. Here’s the production-grade playbook GitPlumbers uses when we wire A/B testing for AI flows that cannot fail in front of customers.
What to measure (and what actually matters)
Your model doesn’t live in a notebook; it lives in a business flow. Measure both.
- Business metrics: containment rate (no human handoff), deflection from support, conversion/activation, NPS/CSAT proxy, refund rate, average handle time.
- Safety metrics: guardrail violation rate, policy block rate, PII leakage rate, jailbreak detection rate.
- Performance metrics: p50/p95 latency, provider error rate, token usage, cost per resolution, cache hit rate.
- Quality metrics: factuality score (LLM-as-judge with spot human audit), citation coverage for RAG, structured output validity.
Don’t ship a variant that wins an offline “judge score” but loses on containment or doubles your cost per resolved ticket.
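As a sketch of how we roll those two up per variant, here's a minimal aggregation; the `ConversationOutcome` shape and `rollup` name are illustrative, not from any experimentation SDK:

```typescript
// Hypothetical per-variant rollup. The key point: compute cost per *resolved*
// conversation, not per request; a variant can be cheaper per call yet pricier
// per resolution if it contains less.
interface ConversationOutcome {
  variant: 'A' | 'B';
  resolvedWithoutHuman: boolean; // containment
  costUsd: number;               // summed token cost for the conversation
}

function rollup(outcomes: ConversationOutcome[], variant: 'A' | 'B') {
  const rows = outcomes.filter((o) => o.variant === variant);
  const contained = rows.filter((o) => o.resolvedWithoutHuman);
  const totalCost = rows.reduce((s, o) => s + o.costUsd, 0);
  return {
    n: rows.length,
    containmentRate: rows.length ? contained.length / rows.length : 0,
    costPerResolution: contained.length ? totalCost / contained.length : Infinity,
  };
}
```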
For LLMs, I like these baseline SLOs:
- p95 latency < 2s for autocomplete, < 5s for RAG answers
- guardrail violations < 0.1% of responses
- 7-day error budget for provider timeouts < 0.5%
- schema-invalid response rate < 1%
Put those into your alerts before you run your first A/B.
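One way to keep those numbers honest is to encode them as config and gate ramps on them in CI or a pre-ramp check; a sketch, with illustrative names:

```typescript
// The SLOs above as a config object plus a gate. Names are illustrative; wire
// the WindowStats values from whatever metrics backend you query.
const SLO = {
  p95LatencyMsRag: 5000,
  guardrailViolationRate: 0.001, // 0.1%
  providerTimeoutRate: 0.005,    // 0.5% over 7 days
  schemaInvalidRate: 0.01,       // 1%
};

interface WindowStats {
  p95LatencyMs: number;
  guardrailViolationRate: number;
  providerTimeoutRate: number;
  schemaInvalidRate: number;
}

// Returns the list of breached SLOs; an empty list means the ramp may proceed.
function sloBreaches(stats: WindowStats): string[] {
  const breaches: string[] = [];
  if (stats.p95LatencyMs > SLO.p95LatencyMsRag) breaches.push('p95_latency');
  if (stats.guardrailViolationRate > SLO.guardrailViolationRate) breaches.push('guardrail_violations');
  if (stats.providerTimeoutRate > SLO.providerTimeoutRate) breaches.push('provider_timeouts');
  if (stats.schemaInvalidRate > SLO.schemaInvalidRate) breaches.push('schema_invalid');
  return breaches;
}
```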
An A/B architecture that doesn’t melt in prod
I’ve seen folks bolt Optimizely onto a prompt template and call it a day. That’s how you end up with users swapping variants mid-conversation. Use a simple, boring architecture:
- Traffic shaping at the edge: route to an llm-router service via API Gateway/Envoy. Attach experiment and user/session IDs at the edge.
- Deterministic assignment: use GrowthBook, Statsig, Eppo, or a simple consistent hash of (experiment_key, user_id) to pick A/B. Make it sticky for the entire session/conversation.
- Guardrail middleware: input redaction, policy checks, token budget enforcement before you hit the provider.
- Model provider abstraction: one interface for openai, anthropic, vertexai, bedrock, so you can swap models without a deploy.
- Observability baked in: OpenTelemetry spans across the whole flow; Prometheus metrics; logs with trace IDs.
- Kill switch + rollouts: feature flag for hard off; Argo Rollouts or LaunchDarkly for controlled ramps and auto-rollback.
Here’s a minimal TypeScript slice of the router with instrumentation and guardrails:
import express from 'express';
import crypto from 'crypto';
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { z } from 'zod';
import OpenAI from 'openai';

const tracer = trace.getTracer('llm-router');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

const OutputSchema = z.object({
  answer: z.string(),
  citations: z.array(z.string()).optional(),
});

// Deterministic, sticky assignment: the same user always lands in the same bucket.
function assignVariant(experimentKey: string, userId: string) {
  const h = crypto.createHash('sha256').update(`${experimentKey}:${userId}`).digest('hex');
  const n = parseInt(h.slice(0, 8), 16) / 0xffffffff; // 0..1
  return n < 0.5 ? 'A' : 'B';
}

const app = express();
app.use(express.json());

app.post('/v1/answer', async (req, res) => {
  const { userId, question, contextDocs } = req.body;
  const experimentKey = 'rag_answer_v3';
  const variant = assignVariant(experimentKey, userId);
  // Span attributes are write-only; keep the model in a local so we can reuse it.
  const model = variant === 'A' ? 'gpt-4o-mini' : 'gpt-4.1-mini';
  await tracer.startActiveSpan('llm.flow', async (span) => {
    try {
      span.setAttributes({
        'exp.key': experimentKey,
        'exp.variant': variant,
        'user.id': userId,
        'llm.provider': 'openai',
        'llm.model': model,
      });
      // Input guardrails
      if (!question || question.length > 2000) throw new Error('invalid_input');
      const prompt = `Use the supplied docs to answer. Cite sources if used.\n\nQ: ${question}\nDocs: ${JSON.stringify(contextDocs).slice(0, 8000)}`;
      const start = Date.now();
      const resp = await openai.chat.completions.create({
        model,
        messages: [
          { role: 'system', content: 'You are a factual assistant. If unsure, say you do not know.' },
          { role: 'user', content: prompt },
        ],
        temperature: 0.2,
        response_format: { type: 'json_object' },
      });
      const raw = resp.choices[0]?.message?.content ?? '{}';
      const parsed = OutputSchema.safeParse(JSON.parse(raw));
      if (!parsed.success) throw new Error('schema_invalid');
      const durationMs = Date.now() - start;
      span.setAttributes({ 'llm.latency_ms': durationMs, 'llm.tokens.total': resp.usage?.total_tokens ?? 0 });
      // Cheap circuit-breaker: punt to a human if too slow or unsure
      if (durationMs > 5000 || /do not know/i.test(parsed.data.answer)) {
        span.setAttribute('fallback', true);
        res.status(202).json({ handoff: true });
        span.setStatus({ code: SpanStatusCode.OK });
        return; // the finally block ends the span exactly once
      }
      res.json({ variant, ...parsed.data });
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err: any) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      res.status(503).json({ error: 'fallback' });
    } finally {
      span.end();
    }
  });
});

app.listen(8080, () => console.log('llm-router listening on 8080'));
Yes, it’s simplified. The point: variant assignment is sticky, guardrails run pre/post, and you emit the attributes you’ll need when the pager goes off.
Instrumentation that pays for itself
If it’s not in traces, it didn’t happen. Use OpenTelemetry everywhere and export to whatever you actually look at (Grafana Tempo, Honeycomb, Datadog, New Relic). Minimal useful attributes per span:
- exp.key, exp.variant, user.id (hashed), conversation.id
- llm.provider, llm.model, llm.version, llm.endpoint
- llm.tokens.prompt, llm.tokens.completion, llm.tokens.total, llm.cost_usd
- llm.cache.hit (if using a Redis/LLM cache), llm.retries, llm.timeout
- guard.violations (count), schema.valid (bool)
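A small helper keeps that attribute map in one place; a sketch with illustrative names (pass the result to span.setAttributes), hashing the user id before it leaves the process:

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: builds the span attribute map in one place so every
// service emits the same field names. Only stdlib crypto is used here.
interface LlmCallInfo {
  expKey: string;
  variant: string;
  userId: string; // raw id; hashed below so traces never carry it
  provider: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
  cacheHit: boolean;
}

function llmSpanAttributes(info: LlmCallInfo): Record<string, string | number | boolean> {
  return {
    'exp.key': info.expKey,
    'exp.variant': info.variant,
    'user.id': createHash('sha256').update(info.userId).digest('hex').slice(0, 16),
    'llm.provider': info.provider,
    'llm.model': info.model,
    'llm.tokens.prompt': info.promptTokens,
    'llm.tokens.completion': info.completionTokens,
    'llm.tokens.total': info.promptTokens + info.completionTokens,
    'llm.cost_usd': info.costUsd,
    'llm.cache.hit': info.cacheHit,
  };
}
```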
An otel-collector that fans out traces/metrics is table stakes:
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  batch: {}
  memory_limiter: {}
  resource:
    attributes:
      - key: service.name
        value: llm-router
        action: upsert
exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  prometheus:
    endpoint: 0.0.0.0:9464
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlphttp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
Prometheus alert for p95 latency spikes per model:
- alert: LLM_P95_Latency_Spike
  expr: |
    histogram_quantile(
      0.95,
      sum(rate(llm_inference_duration_seconds_bucket[5m])) by (le, model)
    ) > 5
  for: 10m
  labels:
    severity: page
  annotations:
    summary: p95 latency above 5s for {{ $labels.model }}
    runbook: https://internal.wiki.ai/runbooks/llm-latency
If your team already knows Honeycomb, ship traces there. Tools don’t matter; consistent signals do.
Guardrails before you measure
A/B tests assume you’re comparing valid outputs. With LLMs, you first need to make “invalid” rare.
- Input guardrails: PII redaction (regex + PII detectors), profanity filters, domain policy checks. Reject/transform pre-inference.
- Output guardrails: JSON schema validation (zod, pydantic), constrained decoding, tool-call whitelists, “I don’t know” policy.
- Safety services: Azure AI Content Safety, Google Vertex Safety, OpenAI moderation, Llama Guard. Log violations with trace IDs.
- Circuit breakers: resilience4j / opossum for provider timeouts and retry budgets. Hard handoff to a human agent after N seconds or on a policy hit.
- RAG grounding: require citations; reject answers that cite nothing when docs exist. Cache by (prompt_hash, doc_version).
Your A/B is not meaningful until schema-invalid and policy-violating responses are under control. Do that first.
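To make the input-guardrail piece concrete, here’s a minimal redaction sketch. The patterns and the redactPII name are illustrative; a real deployment layers a dedicated PII detector on top of regexes like these:

```typescript
// Minimal regex-based redaction sketch. Not exhaustive: these three patterns
// only illustrate the shape of a pre-inference transform step.
const PII_PATTERNS: Array<[RegExp, string]> = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]'],          // US SSN shape
  [/\b(?:\d[ -]?){13,16}\b/g, '[CARD]'],         // loose card-number shape
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, '[EMAIL]'],  // email address
];

function redactPII(input: string): { text: string; redactions: number } {
  let text = input;
  let redactions = 0;
  for (const [pattern, replacement] of PII_PATTERNS) {
    text = text.replace(pattern, () => {
      redactions += 1;
      return replacement;
    });
  }
  // Log `redactions` alongside the trace ID so guardrail hits are queryable.
  return { text, redactions };
}
```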
Running experiments without lying to yourself
This is where most teams get cute and end up with bad decisions.
- Sticky assignment: once a user/session is in B, keep them in B across the conversation and across services. Use a shared exp.variant header.
- Sequential testing: don’t wait 4 weeks for fixed-horizon p-values. Use sequential tests (Statsig’s CUPED/Bayesian, Eppo’s sequential) or pre-commit to a sample size/stop date.
- Primary metric first: Pick one north star (e.g., containment). Everything else is secondary.
- Powering: If your baseline containment is 40% and you want +3pp, you’ll need thousands of conversations. Simulate power before launching.
- Guardrail-aware win criteria: “B must not increase guardrail violation rate beyond 0.1% and must keep cost/resolution within +10%.”
- Holdouts: Keep a small non-LLM or “classic” path as a sanity check when all models regress together (provider outage, prompt bug).
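To make the powering bullet concrete, here’s the standard normal-approximation sample-size formula for comparing two proportions as a sketch; the hardcoded constants assume a two-sided alpha of 0.05 and 80% power, so swap them if you pre-commit to different error rates:

```typescript
// Sample size per arm for a two-proportion z-test (normal approximation).
function sampleSizePerArm(p1: number, p2: number): number {
  const zAlpha = 1.96;  // two-sided 5%
  const zBeta = 0.8416; // 80% power
  const pBar = (p1 + p2) / 2;
  const numerator =
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) +
    zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2));
  return Math.ceil((numerator * numerator) / ((p1 - p2) ** 2));
}

// Baseline containment 40%, hoping for +3pp: on the order of 4,000+ conversations per arm.
const nPerArm = sampleSizePerArm(0.40, 0.43);
```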
And yes, do offline evals (Promptfoo, HumanLoop, W&B Prompts) to filter bad variants. But only online tells you if it moves the business.
Drift, regressions, and rollback in under two minutes
Two things will hurt you: model/provider drift and prompt/config regressions.
- Data drift: track embedding distribution shift on incoming queries; alert when KL divergence against last week exceeds a threshold.
- Model drift: the same gpt-4 isn’t the same next month. Pin model + version when you can; add a smoke test suite that runs hourly.
- Prompt drift: version prompts and templates in Git. Tie prompt_version into traces.
- Rollouts: use Argo Rollouts to ramp traffic with baked-in checks. Auto-rollback when error budgets burn.
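A minimal sketch of the data-drift check above, assuming you’ve already bucketed incoming-query embeddings into histograms (the bucketing itself, e.g. nearest-centroid counts, is pipeline-specific):

```typescript
// Discrete KL divergence between two histograms, with additive smoothing so
// empty buckets don't blow the result up to Infinity.
function klDivergence(current: number[], baseline: number[], eps = 1e-6): number {
  const norm = (xs: number[]) => {
    const smoothed = xs.map((x) => x + eps);
    const total = smoothed.reduce((a, b) => a + b, 0);
    return smoothed.map((x) => x / total);
  };
  const p = norm(current);
  const q = norm(baseline);
  return p.reduce((sum, pi, i) => sum + pi * Math.log(pi / q[i]), 0);
}

// Alert when this week's query distribution drifts past a threshold you tuned:
// if (klDivergence(thisWeek, lastWeek) > 0.15) page('query drift');
```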
Argo Rollouts Experiment example to split 50/50 by service selector:
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: rag-answer-v3
spec:
  duration: 2h
  templates:
    - name: variant-a
      service:
        name: llm-router-a
    - name: variant-b
      service:
        name: llm-router-b
  analyses:
    - name: latency-and-violations
      templateName: prometheus
      args:
        - name: query
          value: >-
            histogram_quantile(0.95, sum(rate(llm_inference_duration_seconds_bucket[5m])) by (le)) < 5
        - name: violations
          value: sum(rate(guardrail_violations_total[5m])) < 0.001
Keep a one-click rollback in your runbook. Practice it. We do game days; they pay for themselves the first time a provider rolls a silent update.
A concrete pattern: safer RAG summarizer
Here’s the minimal pattern that has saved us at two different clients (SaaS and healthcare):
- Index: nightly refresh of source docs; embed with text-embedding-3-large; stamp doc_version.
- Retrieve: top-k BM25 + vector hybrid search; log retrieval_recall@k.
- Guard inputs: redact PII and enforce a token budget (max_prompt_tokens).
- Route: A = GPT-4o-mini with temperature=0.2, B = Claude 3 Haiku with temperature=0.2.
- Constrain outputs: JSON schema with required citations pointing to doc_ids.
- Observe: OTel spans with doc_version, retrieval.k, citation_count, grounded=true/false.
- Decide: primary metric = containment; guardrail = citation coverage >= 1 when docs exist; p95 latency < 5s; cost/resolution within +10%.
Once this is in place, we’ve consistently turned 2–4 week “model bake-offs” into 48-hour, low-risk experiments with clear decisions. That means faster iteration without waking up the on-call every night.
Key takeaways
- Instrument LLM flows end-to-end with OpenTelemetry; propagate trace IDs through every hop.
- Define hard guardrails (schema validation, moderation, circuit breakers) before measuring anything.
- Use deterministic assignment and sticky bucketing; never flip users mid-journey.
- Track business metrics (containment, CSAT proxy), not just model metrics (BLEU, “judge” scores).
- Detect drift and latency spikes with targeted SLOs and alerts; rehearse rollbacks.
- Mix offline evals with online A/B; gate launches behind feature flags and Argo Rollouts.
Implementation checklist
- Add OTel spans and attributes to every LLM call: provider, model, version, prompt hash, token counts, cost, cache hit.
- Enforce input/output guardrails: PII redaction, content moderation, JSON schema validation, max token budget.
- Implement deterministic assignment via experiment keys and user/session IDs; make it sticky.
- Track key outcomes: containment rate, deflection, escalation rate, guardrail violation rate, p95 latency, cost per resolution.
- Set SLOs and alerts for hallucination rate, drift, and latency spikes; wire to PagerDuty.
- Use Argo Rollouts (or LaunchDarkly/Flags) to ramp traffic and auto-rollback on error budgets.
Questions we hear from teams
- How do I prevent users from switching A/B variants mid-conversation?
- Use deterministic assignment based on an experiment key and a stable user/session ID. Propagate the assigned variant in headers or conversation state across services. Do not reassign on each request.
- What metrics should be primary for a customer support bot?
- Containment rate (no human handoff) should be primary. Track guardrail violation rate and cost per resolved conversation as constraints. Latency is a separate SLO with paging alerts.
- How do I test new models without risking production?
- Run offline evals to screen out bad candidates, then start a 1–5% canary using feature flags or Argo Rollouts. Add strict guardrails and a kill switch. Ramp only if primary metrics improve and constraints hold.
- How do I detect model drift from providers like OpenAI/Anthropic?
- Pin model versions when available. Maintain a smoke test suite of canonical prompts and expected patterns; run hourly and alert on deltas in latency, token usage, or output schema/length. Track response distribution changes (e.g., average tokens) over time.
- What’s the fastest way to add observability to my LLM stack?
- Wrap your LLM client with OpenTelemetry spans. Emit provider, model, version, token counts, and cost. Export to Prometheus (metrics) and Tempo/Honeycomb (traces). Add a p95 latency alert and a guardrail violation alert on day one.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
