The Circuit Breakers Your LLM Stack Should’ve Had Before Last Friday’s Pager Storm
Stop letting model hiccups domino into customer outages. Build circuit breakers, fallbacks, and guardrails into every AI call—instrumented, observable, and testable.
AI won’t page you with a 500—it’ll smile and give you a confident, wrong answer. Break it safely before it breaks you.
The Friday Night We Learned LLMs Need Circuit Breakers
A payments client called us after a Friday deploy: their “smart dispute helper” started inventing merchant policies. CSAT tanked, refunds spiked, and the on-call engineer watched p95 latency jump from 400ms to 6s while the LLM vendor throttled them into oblivion. No 5xxs—just slow, wrong, and expensive.
I’ve seen this movie. We treat AI like a black box that “usually works,” wire it directly into user flows, and forget it’s just another flaky upstream. The fix wasn’t magic. We put circuit breakers around AI calls, defined fallback ladders, and instrumented the hell out of it. The bleeding stopped in a day.
Why This Matters (And How It Fails in the Real World)
AI systems fail differently than your typical REST service:
- Hallucination: syntactically valid, semantically wrong. Think model inventing compliance steps.
- Drift: model updates or data shifts quietly degrade accuracy over weeks.
- Latency spikes: vendor throttling, token bloat, or hot shards push you past your SLOs.
- Cost blowups: prompt creep turns a lookup into a $0.40 call.
- Malformed outputs: JSON that almost parses, invalid enum values, or tool calls that never return.
Treat these as first-class failure modes. Your SLA to users is about outcomes, not tokens generated. That means timeouts, breakers, fallbacks, and guardrails before responses touch business logic.
Design the Breaker: Budgets, Thresholds, Bulkheads
Start with constraints, not vibes:
- Set SLOs per call path: success rate (valid + safe), p95 latency, and cost per 1K requests.
- Timeouts: hard client timeouts at 2–3x median, never unlimited.
- Retries: exponential backoff with jitter; cap attempts; retry only safe codes (429, 5xx, ECONNRESET).
- Circuit breaker: open on rolling failure/timeout rate; half-open probes; bulkhead by use case to isolate blast radius.
- Budgets: max tokens and max cost per request; treat overruns as failures.
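The budget bullet deserves teeth. A minimal sketch, assuming a hypothetical blended price per 1K tokens (the names `enforce_budget` and `BudgetExceeded` are ours, not a library API):

```python
# Minimal budget guard: treat token or cost overruns as failures.
# The limits and per-token price below are illustrative, not real pricing.
MAX_TOKENS = 1500
MAX_COST_USD = 0.02
PRICE_PER_1K_TOKENS = 0.005

class BudgetExceeded(Exception):
    """Raised so retries and breakers see an overrun as an ordinary failure."""

def enforce_budget(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the request cost, or raise if it blows the token/cost budget."""
    total = prompt_tokens + completion_tokens
    cost = total / 1000 * PRICE_PER_1K_TOKENS
    if total > MAX_TOKENS or cost > MAX_COST_USD:
        raise BudgetExceeded(f"tokens={total} cost=${cost:.4f}")
    return cost
```

Wire this check after every completion, and count `BudgetExceeded` in the same failure metrics that feed your breaker.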
Here’s a typed Node/TypeScript sketch using cockatiel’s retry and circuit-breaker policies (v3-style API) around an OpenAI call:
import fetch from 'node-fetch';
import {
  circuitBreaker,
  ConsecutiveBreaker,
  ExponentialBackoff,
  handleAll,
  handleWhen,
  retry,
  wrap,
} from 'cockatiel';

// Open after 5 consecutive failures; probe again (half-open) after 30s.
const breakerPolicy = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000,
  breaker: new ConsecutiveBreaker(5),
});

// Retry only transient statuses; cockatiel's backoff applies jitter by default.
const retryPolicy = retry(
  handleWhen((err: any) => [429, 500, 502, 503, 504].includes(err?.status)),
  { maxAttempts: 3, backoff: new ExponentialBackoff({ initialDelay: 200, maxDelay: 2000 }) },
);

const withPolicies = wrap(retryPolicy, breakerPolicy); // retries wrap the breaker

async function callLLM(prompt: string) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 2500); // hard overall deadline
  try {
    return await withPolicies.execute(async () => {
      const res = await fetch('https://api.openai.com/v1/chat/completions', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
        body: JSON.stringify({
          model: 'gpt-4o-mini',
          temperature: 0,
          response_format: { type: 'json_object' },
          messages: [
            { role: 'system', content: 'You are a JSON-only service. No prose.' },
            { role: 'user', content: prompt },
          ],
        }),
        signal: controller.signal,
      });
      if (!res.ok) {
        const err: any = new Error(`Upstream ${res.status}`);
        err.status = res.status;
        throw err;
      }
      return res.json();
    });
  } finally {
    clearTimeout(timeout);
  }
}

Set thresholds from real traffic. If your p95 is 600ms, a 2.5s timeout is generous. Open the circuit for 30–60 seconds when the failure rate exceeds 30% over a 20–50 call window. Tune with production data, not gut feel.
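That rolling-window rule can be made concrete. A toy sketch, not a library API: a breaker that opens when more than 30% of the last 50 calls fail, and lets a probe through after 45 seconds (the minimum sample of 20 calls is our assumption):

```python
import time
from collections import deque

class RollingWindowBreaker:
    """Toy breaker: open when failure rate exceeds a threshold over a sliding window."""

    def __init__(self, window=50, threshold=0.3, open_seconds=45, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = failure
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.opened_at = None
        self.clock = clock  # injectable for testing

    def failure_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, failed: bool) -> None:
        self.results.append(failed)
        # Require a minimum sample so one early failure doesn't trip the circuit.
        if len(self.results) >= 20 and self.failure_rate() > self.threshold:
            self.opened_at = self.clock()

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.open_seconds:
            self.opened_at = None  # half-open: let one probe through
            return True
        return False
```

Production code should also cap concurrent probes in the half-open state; libraries like cockatiel and pybreaker handle that for you.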
Build a Fallback Ladder You Can Explain to Legal
When the breaker opens (or validation fails), you need tiered fallbacks that degrade gracefully. For a typical RAG-backed Q&A:
- RAG primary: model + retrieval. If retrieval returns low confidence, skip to tier 2.
- Model-only: smaller/faster model with strict output constraints.
- Template/FAQ: deterministic response from curated content or regex-based extractors.
- Human-in-the-loop: queue for human review; tell the user we’ll email them.
Here’s a Python sketch with validation and fallbacks using pydantic and pybreaker:
# pip install openai pydantic pybreaker tenacity
import json

from openai import OpenAI
from pybreaker import CircuitBreaker
from pydantic import BaseModel, Field
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

client = OpenAI()

class Answer(BaseModel):
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    source_ids: list[str]

breaker = CircuitBreaker(fail_max=5, reset_timeout=45)

@retry(stop=stop_after_attempt(2), wait=wait_exponential_jitter(0.2, 2.0))
@breaker
def llm_call(prompt: str) -> Answer:
    res = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "Return JSON matching the schema."},
            {"role": "user", "content": prompt},
        ],
        timeout=2.5,
    )
    obj = json.loads(res.choices[0].message.content)
    return Answer.model_validate(obj)

def faq_fallback(question: str) -> Answer:
    # Deterministic lookup from curated content
    return Answer(answer="We’ll follow up within 24h.", confidence=0.3, source_ids=[])

def answer_question(question: str, retrieved_docs: list[dict]) -> Answer:
    # Tier 1: RAG, but only if retrieval confidence is high enough
    if retrieved_docs and retrieved_docs[0].get("score", 0) > 0.7:
        try:
            prompt = f"Using these sources: {retrieved_docs[:3]}\nQuestion: {question}\nReturn JSON."
            return llm_call(prompt)
        except Exception:
            pass  # validation failure, open breaker, or timeout: fall through to tier 2
    # Tier 2: model-only with short context
    try:
        prompt = f"Question: {question}. Answer briefly as JSON with fields answer, confidence, source_ids."
        ans = llm_call(prompt)
        if ans.confidence < 0.5:
            raise ValueError("low confidence")
        return ans
    except Exception:
        # Tier 3: deterministic FAQ
        return faq_fallback(question)

The important part is not the library choice—it’s that the fallback policy is explicit, testable, and auditable. Legal and Risk will ask you to prove why a user saw a particular message. Log the tier chosen, the confidence, and the validation outcome.
Guardrails and Observability: Instrument Everything
You can’t manage what you don’t measure. At a minimum:
- Tracing: use OpenTelemetry around every AI call; include attributes for model, tokens in/out, retry count, breaker state, and fallback tier.
- Metrics: export valid_json_rate, hallucination_rate (validator fails), drift_score, p95_latency, cost_per_req, and breaker_open gauges to Prometheus.
- Logging: store prompts and responses with PII redaction and input hashing; retain enough for audits.
- Validation: Enforce schema checks before data touches business logic; block or degrade on fail.
- Safety filters: Run content classification (toxicity, PII) pre- and post-model; route to human when risky.
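As a floor for those pre- and post-model filters, even a regex redaction pass beats logging raw PII. A sketch with illustrative patterns only; a real deployment needs a proper classifier:

```python
import re

# Illustrative patterns only; real deployments need a dedicated PII/toxicity classifier.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace PII with tags; return the redacted text and which categories fired."""
    hits = []
    for name, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{name.upper()}]", text)
        if n:
            hits.append(name)
    return text, hits

clean, hits = redact("Contact jane@example.com about card 4111 1111 1111 1111")
print(clean)  # Contact [EMAIL] about card [CARD]
print(hits)   # ['email', 'card']
```

Run it before prompts are logged and again on model output, and count every hit as a guardrail metric.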
Example OTel pseudo-instrumentation in Node:
import { SpanStatusCode, trace } from '@opentelemetry/api';

async function tracedLLMCall(opName: string, fn: () => Promise<any>) {
  const tracer = trace.getTracer('ai-gateway');
  return await tracer.startActiveSpan(opName, async (span) => {
    try {
      const start = Date.now();
      const res = await fn();
      span.setAttributes({
        'ai.model': res.model,
        'ai.tokens.in': res.usage?.prompt_tokens ?? 0,
        'ai.tokens.out': res.usage?.completion_tokens ?? 0,
        'ai.fallback.tier': res.meta?.tier ?? 'primary',
        'ai.breaker.open': res.meta?.breakerOpen ?? false,
      });
      span.setAttribute('latency.ms', Date.now() - start);
      span.setStatus({ code: SpanStatusCode.OK });
      return res;
    } catch (e: any) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e?.message });
      throw e;
    } finally {
      span.end();
    }
  });
}

For drift monitoring, track embedding distributions over time. A simple guard: compute the Population Stability Index (PSI) between last week’s and this week’s embedding buckets, and alert when PSI > 0.25. It’s not perfect science, but it’ll catch silent shifts.
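The PSI guard fits in a dozen lines of NumPy. A sketch, assuming you bucket a single embedding dimension (or a projection) on last week's quantiles:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples, bucketed on expected's quantiles."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip both samples into the edge range so outliers land in the end buckets.
    e_counts, _ = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)
    a_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
last_week = rng.normal(0.0, 1.0, 10_000)
this_week = rng.normal(0.8, 1.0, 10_000)   # a drifted distribution
assert psi(last_week, last_week.copy()) < 0.01  # identical samples: near zero
assert psi(last_week, this_week) > 0.25         # shifted mean: alert
```

Run it per dimension (or on a PCA projection) as a nightly job and emit the max PSI as a gauge next to your validator-fail rate.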
Platform Guardrails: Put Policies in the Mesh
Don’t rely solely on application code. Push guardrails into the network where possible. With Istio/Envoy you get retries, timeouts, connection pools, and outlier detection:
# istio DestinationRule + VirtualService for an LLM vendor
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-vendor
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: ROUND_ROBIN
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-vendor
spec:
  hosts: ["api.openai.com"]
  http:
    - timeout: 3s
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: connect-failure,refused-stream,5xx,reset,gateway-error,retriable-status-codes
        retriableStatusCodes: [429]
      route:
        - destination:
            host: api.openai.com

Add bulkheads at the queue level: separate workers for “nice-to-have” AI features versus critical checkout flows. If the fancy summarizer gets rate-limited, orders still ship.
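In application code, the same bulkhead idea can be sketched with per-use-case concurrency caps. The pool names, sizes, and the `with_bulkhead` helper are illustrative, not a library API:

```python
import asyncio

# Separate concurrency pools so nice-to-have features can't starve critical ones.
BULKHEADS = {
    "checkout": asyncio.Semaphore(20),   # critical path gets headroom
    "summarizer": asyncio.Semaphore(5),  # nice-to-have is capped
}

class BulkheadFull(Exception):
    """Raised when a pool is saturated; fail fast instead of queueing forever."""

async def with_bulkhead(pool: str, coro_factory, acquire_timeout: float = 0.05):
    sem = BULKHEADS[pool]
    try:
        await asyncio.wait_for(sem.acquire(), acquire_timeout)
    except asyncio.TimeoutError:
        raise BulkheadFull(pool) from None
    try:
        return await coro_factory()
    finally:
        sem.release()

async def demo():
    # asyncio.sleep(0, result="ok") stands in for the real LLM call
    return await with_bulkhead("summarizer", lambda: asyncio.sleep(0, result="ok"))
```

When `BulkheadFull` fires on a nice-to-have pool, serve the deterministic fallback; only the critical pool gets to wait.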
For releases: use Argo Rollouts to canary new model versions or prompt templates. Wire a metric gate so rollout proceeds only if valid_json_rate > 99% and p95_latency < 1s over N requests. Control exposure with feature flags (LaunchDarkly, Flagsmith) to instantly freeze bad variants.
Test Like It Will Fail (Because It Will)
If you haven’t chaos-tested your AI paths, you’re one incident away from learning on prod:
- Latency injection: add 2s to upstream calls and watch your timeouts/fallbacks engage.
- Malformed outputs: feed invalid JSON and ensure validators block and degrade.
- Throttling: simulate
429storms and verify breakers open and back off. - Shadow traffic: mirror requests to a new model/prompt and compare offline metrics before promoting.
- Data drift drills: swap a subset of your index with new content and watch drift detectors.
Simple k6 test to spike latency:
k6 run --vus 50 --duration 2m -e LLM_DELAY_MS=2000 load.js

And in CI, run contract tests with captured prompts to guarantee schema stability. If your prompt edit breaks Answer validation, block the merge. Yes, treat prompts like code—because they are.
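A contract test can be as small as replaying captured model outputs through the schema. A sketch reusing the Answer model from earlier; the captured fixtures are hypothetical:

```python
import json

from pydantic import BaseModel, Field, ValidationError

class Answer(BaseModel):  # same schema the service enforces
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    source_ids: list[str]

# Captured model outputs checked into the repo (illustrative fixtures).
CAPTURED = [
    '{"answer": "Refund within 30 days.", "confidence": 0.9, "source_ids": ["policy-12"]}',
    '{"answer": "See docs.", "confidence": 0.4, "source_ids": []}',
]

def run_contract_tests() -> int:
    """Fail CI if any captured output no longer matches the schema."""
    for raw in CAPTURED:
        Answer.model_validate(json.loads(raw))  # raises ValidationError on drift
    # Negative case: out-of-range values must still be rejected.
    try:
        Answer.model_validate({"answer": "x", "confidence": 1.5, "source_ids": []})
    except ValidationError:
        pass
    else:
        raise AssertionError("schema accepted invalid confidence")
    return len(CAPTURED)

print(run_contract_tests())  # 2
```

Refresh the fixtures whenever you intentionally change the schema, in the same PR as the prompt change.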
What “Good” Looks Like After the Fix
At the payments client, once we shipped breakers, fallbacks, and observability:
- Incident rate on the AI path dropped 78% month-over-month.
- MTTR fell from 62m to 14m because we could see which tier failed.
- p95 latency stabilized at 850ms under vendor throttling due to fast fallback.
- Refund leakage decreased 41% because validators stopped hallucinated policies at the gate.
- Cost per 1K requests dropped 23% after we enforced token budgets and downgraded on slow paths.
The system didn’t get smarter overnight. It got safer. And that’s what your customers feel.
Blocklists won’t save you. Guardrails, breakers, and observability will.
What I’d Do Tomorrow If I Were You
- Put a breaker on every AI call. If you can’t answer “what opens it?” and “for how long?”, you don’t have one.
- Add a two-tier fallback minimum: smaller model -> deterministic template.
- Validate outputs with a real schema, not regexes. Count validator fails as incidents.
- Trace prompts and fallbacks with OpenTelemetry. Alert on
breaker_openandvalid_json_rate. - Canary model/prompt changes behind flags with metric gates. Roll back automatically.
- Schedule a chaos day for AI: timeouts, 429s, invalid JSON. Invite Risk and Support. Document the runbook.
If you need a second pair of hands, GitPlumbers has done this at fintech, healthtech, and old-school SaaS with more cron than anyone wants to admit. We’ll wire the breakers, build the fallbacks, and leave you with dashboards and runbooks your team owns.
Key takeaways
- AI failure modes aren’t just 5xx—they’re hallucinations, drift, and timeouts. Treat the model like a flaky upstream service.
- Implement circuit breakers at the client and mesh layers; set strict timeouts and backoff with jitter.
- Design a clear fallback ladder: RAG > smaller model > template/FAQ > human-in-the-loop.
- Instrument prompts, responses, and guardrail outcomes with OpenTelemetry; emit Prometheus metrics tied to SLOs.
- Validate and constrain outputs with JSON schemas, Pydantic models, and content classifiers before they touch business logic.
- Test with chaos: inject latency, force invalid JSON, and canary model updates behind flags.
Implementation checklist
- Define SLOs for AI endpoints (success rate, p95 latency, valid JSON rate).
- Add client-side circuit breakers with timeouts and exponential backoff + jitter.
- Configure mesh-level outlier detection and connection pools (Istio/Envoy).
- Implement a fallback ladder including non-ML paths and human escalation.
- Validate outputs against JSON schema/Pydantic; block or degrade on validation failure.
- Capture OTel traces for prompts, tokens, retries, and fallback path taken.
- Alert on drift (PSI/KL on embeddings) and hallucination rate (validator fails).
- Canary model/version changes with Argo Rollouts and feature flags.
- Run chaos drills for timeouts, 429/5xx, and malformed responses.
- Document runbooks: when to trip breakers, how to override, and rollback steps.
Questions we hear from teams
- Will circuit breakers hide real issues with the model or vendor?
- No—if you instrument correctly. Open breakers should raise alerts and attach context in traces (failure rate, codes, latency). Breakers buy you time and reduce blast radius; observability ensures you still see the root cause and act.
- How do I detect hallucinations automatically?
- Validate outputs against strict schemas, cross-check with retrieval citations, and run lightweight classifiers on claims (e.g., regexes for policy IDs, whitelist of allowed enums). Track validator-fail rate as a first-class metric and trigger fallback or human review on fail.
- What’s a pragmatic way to monitor drift?
- Bucket embeddings and compute PSI/KL divergence week over week; correlate with offline eval scores and production validator-fail rate. Alert when PSI > 0.25 or your eval SLOs drop. Canary new model/prompt variants behind flags before full rollout.
- When should I escalate to a human?
- On repeated validator failures, low confidence, or risky content categories (compliance, medical, financial advice). Make the threshold explicit (e.g., 2 consecutive fails or confidence < 0.5) and log the decision with request IDs for auditability.
- Aren’t retries enough?
- Retries help transient errors, but they make latency spikes worse and can amplify vendor throttling. Use retries sparingly, with backoff and jitter, and always combined with timeouts and a breaker to avoid retry storms.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
