Circuit Breakers for LLMs: The Day the Model Latched Up and What Saved Us
Your AI won’t fail like an API. It’ll hallucinate, drift, and stall at the worst possible moment. Here’s how to wire circuit breakers, fallbacks, and guardrails so you degrade gracefully instead of detonating.
AI fails sideways. Wire it like a volatile upstream and give yourself an escape hatch you can flip at 3 a.m.
The outage that taught us to stop trusting happy-path LLMs
We had an LLM summarization service in prod, shipping daily digests to 2M users. One morning latency spiked from 400ms p95 to 7s p95, then hallucinations started: product names invented out of thin air. Root cause: upstream model provider silently rolled a safety filter, throttling certain prompts and changing outputs. Without proper circuit breakers and fallbacks, the digest pipeline kept waiting, then ingesting garbage. We got lucky—postmortem lucky. The second time, we had the breakers, budgets, and guardrails. Users saw a basic deterministic summary instead of word salad, MTTR dropped from hours to minutes, and we didn’t burn another incident review.
This is the playbook we install at GitPlumbers when we wire AI into customer-facing flows.
What fails in AI systems (and how it shows up)
Let’s be blunt: AI doesn’t just return 500s.
- Latency spikes: model queues, token bloat, provider brownouts. Watch p95/p99 and queue depth.
- Hallucination: syntactically perfect nonsense. Watch schema validation failures, moderation flags, and “no evidence” scores.
- Drift: your embeddings age, data changes, prompts creep. Watch accuracy KPIs, feature distributions, and retrieval hit rates.
- Rate limiting: soft throttles disguised as slow responses. Watch 429s, jittery latencies.
- Cost runaways: token explosions from prompt expansions or bad few-shots.
If you don’t instrument, you won’t know which one hit you, and you’ll guess wrong during the incident.
Circuit breakers that actually work in AI pipelines
You need breakers in three places: client, network, and job orchestration.
- Client-side breaker (sync paths)
  - Use `resilience4j` (JVM) or `opossum` (Node) with timeouts and retry budgets. Retries must be capped and jittered.
```java
// Spring Boot + Resilience4j example
@Bean
public Customizer<Resilience4JCircuitBreakerFactory> defaultCustomizer() {
  return factory -> factory.configure(builder -> builder
      .timeLimiterConfig(TimeLimiterConfig.custom()
          .timeoutDuration(Duration.ofMillis(800))
          .build())
      .circuitBreakerConfig(CircuitBreakerConfig.custom()
          .failureRateThreshold(20)
          .slowCallRateThreshold(30)
          .slowCallDurationThreshold(Duration.ofMillis(600))
          .waitDurationInOpenState(Duration.ofSeconds(30))
          .permittedNumberOfCallsInHalfOpenState(5)
          .slidingWindowSize(50)
          .build()), "llmClient");
}
```
- Service mesh breaker (Envoy/Istio/Linkerd)
- Enforce timeouts, outlier detection, and retry budgets close to the wire.
```yaml
# Istio DestinationRule + VirtualService with outlier detection and retry budget
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-api
spec:
  host: llm.vendor.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-api
spec:
  hosts: ["llm.vendor.svc.cluster.local"]
  http:
  - timeout: 800ms
    retries:
      attempts: 2
      perTryTimeout: 300ms
      retryOn: 5xx,connect-failure,reset
    route:
    - destination:
        host: llm.vendor.svc.cluster.local
```
- Batch/async breaker (jobs and queues)
- Cap concurrency and implement dead-letter queues. If a job exceeds its token/time budget, mark it failed and route to fallback.
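The batch-side pattern can be sketched in a few lines. This is a hedged sketch assuming an asyncio worker pool (Python 3.10+); `run_job`, `send_to_dlq`, and the budget constants are illustrative, not a specific queue library’s API:

```python
# Batch-side breaker sketch: cap concurrency, enforce per-job latency and
# token budgets, dead-letter anything that blows them, and let the caller
# route to a fallback. All names here are illustrative.
import asyncio

MAX_CONCURRENCY = 8
JOB_BUDGET_S = 0.8      # per-job latency budget
TOKEN_BUDGET = 400      # per-job token budget

sem = asyncio.Semaphore(MAX_CONCURRENCY)
dead_letters: list[dict] = []

async def send_to_dlq(job: dict, reason: str) -> None:
    # in production this would publish to a real dead-letter queue/topic
    dead_letters.append({**job, "reason": reason})

async def run_job(job: dict, worker):
    async with sem:  # concurrency cap
        try:
            result = await asyncio.wait_for(worker(job), timeout=JOB_BUDGET_S)
            if result.get("tokens_used", 0) > TOKEN_BUDGET:
                raise ValueError("token budget exceeded")
            return result
        except (asyncio.TimeoutError, ValueError) as e:
            await send_to_dlq(job, str(e))
            return None  # caller routes to the fallback path
```

The important property is that a blown budget is a *failure*, not a slow success: it lands in the DLQ for forensics and the caller degrades deterministically.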
If your breaker only trips on HTTP 500s, you’ll miss slow calls and schema violations that matter more in LLM land.
Fallbacks that won’t embarrass you
Pre-decide how you degrade. “We’ll figure it out live” is how you ship nonsense to customers.
- Cached answers: for common prompts, store a curated response in Redis with TTLs. Great for FAQ, product metadata, and templated summaries.
- Deterministic templates: rule-based or `jinja2`/`handlebars` with extracted facts. Boring but safe.
- Smaller/local model: a cheaper hosted model (e.g., `gpt-3.5-turbo`) or a quantized on-box `Llama-3-8B` for when the big model is slow. Know your quality delta.
- Disable gracefully: feature flags to hide AI blurbs, surface a “basic mode,” or remove non-critical sections.
```typescript
// TypeScript wrapper: breaker + fallback + budgets
import CircuitBreaker from 'opossum';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);

// Create the breaker once per route; a breaker constructed per call never
// accumulates enough history to open.
const breaker = new CircuitBreaker(llmInvoke, {
  timeout: 800, // per-call latency budget (ms)
  errorThresholdPercentage: 20,
  resetTimeout: 30000,
  rollingCountBuckets: 10,
});
breaker.fallback((prompt: string) => ({
  source: 'template',
  text: basicTemplate(prompt), // deterministic fallback
}));

async function callLLM(prompt: string, maxTokens = 400) {
  const cached = await redis.get(`faq:${hash(prompt)}`);
  if (cached) return { source: 'cache', text: cached };
  return breaker.fire(prompt, { maxTokens });
}

async function llmInvoke(prompt: string, opts: { maxTokens: number }) {
  // call provider with per-request token/latency budgets
  const res = await fetch(process.env.LLM_URL!, { /* ... */ });
  const out = await res.json();
  // validate below; throw to trigger the breaker on violation
  validateSchemaOrThrow(out);
  return { source: 'llm', text: out.text };
}
```
Guardrails: validate, moderate, and cap budgets
Most AI incidents we fix at GitPlumbers weren’t infra bugs; they were missing guardrails.
- Schema validation: if you expect JSON, enforce a JSON Schema or `Pydantic` model and fail closed.
- Content moderation: use provider moderation or your own classifier to block unsafe content.
- Safety budgets: cap tokens, context size, and latency per route. Exceed budget => fail and fallback.
- Prompt fingerprints: hash prompts and model version into traces for forensics.
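A minimal fingerprint helper might look like this. Assumption: you strip volatile fields (timestamps, user IDs) from the prompt core before hashing so repeated prompts collide; the 16-hex truncation is an arbitrary choice for span-attribute brevity:

```python
# Prompt fingerprint: stable hash of (prompt core, model, version) so you
# can group traces by logical prompt across an incident window.
import hashlib
import json

def prompt_fingerprint(prompt_core: str, model: str, model_version: str) -> str:
    # sort_keys makes the serialization deterministic across runs
    payload = json.dumps(
        {"prompt": prompt_core, "model": model, "version": model_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Bumping the model version changes the fingerprint, which is exactly what you want when a provider silently rolls a model under you.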
```python
# Python: Pydantic schema + moderation + budget checks
from pydantic import BaseModel, ValidationError, constr

class Summary(BaseModel):
    title: constr(strip_whitespace=True, min_length=3, max_length=180)
    bullets: list[constr(max_length=200)]
    confidence: float

def validate_or_raise(payload, tokens_used, ms):
    if tokens_used > 400 or ms > 800:
        raise TimeoutError("Budget exceeded")
    if payload.get("moderation", {}).get("flagged"):
        raise ValueError("Moderation failed")
    try:
        Summary(**payload["data"])  # throws if invalid
    except ValidationError as e:
        raise ValueError(f"Schema violation: {e}")
```
If validation fails, that’s a good thing—it trips your breaker and forces a deterministic fallback rather than shipping trash.
Observability that catches hallucinations and drift
If you can’t see it, you can’t fix it. Make the AI path first-class in traces.
- OpenTelemetry: trace the entire flow; tag spans with `model`, `model_version`, `prompt_fingerprint`, `token_in`/`token_out`, `safety_flags`, and `fallback_path`.
- Prometheus: export counters/gauges/histograms for latency, breaker states, validation failures, moderation rejects, and drift detectors.
- Logs: store redacted prompts and outputs with IDs; sample aggressively to control PII.
```typescript
// OpenTelemetry span attributes on the AI call
const span = tracer.startSpan('llm.call');
span.setAttributes({
  'ai.model': 'gpt-4o-mini',
  'ai.model_version': '2025-06-01',
  'ai.prompt_fingerprint': sha256(promptCore),
  'ai.tokens.in': inTokens,
  'ai.tokens.out': outTokens,
  'ai.fallback': fallbackSource, // none|cache|template|small-model
  'ai.moderation.flagged': flagged,
});
```
PromQL alert ideas:
```promql
# Latency spike
histogram_quantile(0.99, sum(rate(llm_latency_ms_bucket[5m])) by (le, route)) > 1500
# Hallucination proxy: schema validation failures
rate(llm_schema_validation_failures_total[5m]) > 5
# Drift proxy: retrieval miss rate rising
(rate(rag_retrieval_hits_total[5m]) / ignoring(type) rate(rag_retrieval_attempts_total[5m])) < 0.7
# Breaker open events
increase(llm_breaker_open_total[10m]) > 10
```
For RAG, emit `retrieval_hits`, `topk_scores`, and `index_version`. If `index_version` is old relative to the data, you’ll catch drift before users do.
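If you also want the drift proxy computed in-process, a rolling hit-rate window is enough. A sketch with made-up thresholds; `RetrievalHitRate` is not a library class:

```python
# Rolling retrieval hit-rate: a retrieval "hits" when its top similarity
# score clears min_score; drift shows up as the rate sagging below threshold.
from collections import deque

class RetrievalHitRate:
    def __init__(self, window: int = 200, threshold: float = 0.7):
        self.window = deque(maxlen=window)  # last N hit/miss booleans
        self.threshold = threshold

    def record(self, top_score: float, min_score: float = 0.75) -> None:
        self.window.append(top_score >= min_score)

    @property
    def hit_rate(self) -> float:
        # optimistic default before any data arrives
        return sum(self.window) / len(self.window) if self.window else 1.0

    def drifting(self) -> bool:
        return self.hit_rate < self.threshold
```

Wire `drifting()` into the same alert path as the PromQL rule so dashboards and in-process fallback decisions agree.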
Shipping safely: flags, canaries, and rollback that actually work
Don’t big-bang a model change. Your users aren’t your QA team.
- Feature flags: gate AI output sections. Use LaunchDarkly/Unleash to flip off specific components.
- Argo Rollouts for canary: control traffic weight to new model or prompt template.
- GitOps: treat prompts and safety configs as versioned artifacts.
```yaml
# Argo Rollouts: canary for new model route
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-service
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 300}
      - analysis:
          templates:
          - templateName: p99-latency-check
      - setWeight: 50
      - pause: {duration: 600}
      - setWeight: 100
```
Runbooks should specify: which flag to flip, which rollout to abort, and how to force the fallback path for critical routes.
Proving it: chaos drills and real SLOs
I’ve seen teams write beautiful breakers… and never test them. Run a weekly 30-minute chaos drill:
- Kill the upstream (HTTP 503) and confirm breaker opens in <2 minutes.
- Inject latency (tc/netem) to 2s and ensure fallbacks trigger.
- Return malformed JSON; confirm schema guardrails fail closed.
- Increase RAG miss rate; verify alert fires and small-model fallback kicks in.
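Before the first drill it helps to reason about the breaker state machine offline. Here is a toy closed/open/half-open breaker for tabletop testing only; production paths should keep using `resilience4j`/`opossum`:

```python
# Toy circuit breaker state machine: closed -> open after N consecutive
# failures, open -> half-open after a cooldown, half-open -> closed on success.
import time

class Breaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            return "half-open"  # allow probe traffic
        return "open"

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    def allow(self) -> bool:
        return self.state != "open"
```

The drill then becomes concrete: inject failures, watch the transition to open inside your alert window, and confirm the half-open probe recloses it.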
Define SLOs per route:
- p95 latency ≤ 800ms
- valid output rate ≥ 99.5% (post-validation)
- hallucination proxy (validation/moderation fail) ≤ 0.5%
- cost/token budget per 1k requests
When you miss SLOs, your error budget policy should throttle traffic to the fancy path, not the user.
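The “throttle the fancy path” policy can be as simple as scaling LLM traffic by remaining error budget. A sketch, assuming `observed_good` is the valid-output rate over the current window:

```python
# Error-budget throttle: full LLM traffic while within SLO, linearly less
# as the budget burns, zero once twice the budget is consumed.
def fancy_path_fraction(slo_target: float, observed_good: float) -> float:
    budget = 1.0 - slo_target                 # allowed failure rate
    if budget <= 0:
        return 0.0                            # a 100% SLO leaves no budget
    burned = max(0.0, slo_target - observed_good)
    return max(0.0, 1.0 - burned / budget)    # remaining share for LLM path
```

Everything the throttle sheds goes to the deterministic fallback, so the user-facing SLO holds while the LLM path recovers.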
If you need help wiring this in for real—breakers, fallbacks, guardrails, and the dashboards that prove it—GitPlumbers does this weekly. We fix vibe-coded AI glue, put rails around it, and leave you with runbooks your SREs trust.
Key takeaways
- Treat LLMs like flaky upstreams with weird failure modes: add circuit breakers, timeouts, and retry budgets at multiple layers.
- Plan your fallback tree before go-live: cached responses, deterministic templates, smaller local models, and “disable gracefully.”
- Instrument everything: trace prompts, model/version, token counts, latency, and safety scores with OpenTelemetry and Prometheus.
- Guardrails are non-negotiable: validate output schemas, moderate content, and cap token/time budgets per request path.
- Prove it in practice: chaos drills, canary rollouts, SLOs, and runbooks that tell on-call exactly which switch to flip.
Implementation checklist
- Define request-level budgets: max latency, max tokens, max cost.
- Implement client-side and mesh-level circuit breakers with timeouts and retry budgets.
- Precompute safe fallbacks and cache them (Redis) with TTLs and invalidation hooks.
- Validate model outputs via JSON Schema/Pydantic; treat violations as failures and trigger fallback.
- Emit OpenTelemetry spans for prompts, model/version, token usage, safety scores, and fallback path chosen.
- Set Prometheus alerts for latency spikes, hallucination rate, drift indicators, and breaker open events.
- Ship behind feature flags and canary with Argo Rollouts; practice kill-switch drills monthly.
- Document runbooks: when to open the breaker, force fallback, or roll back the model.
Questions we hear from teams
- What’s the minimum viable setup for AI circuit breakers?
- Client-side timeout + breaker, mesh-level timeout with limited retries, schema validation that throws on violation, and a deterministic fallback (template or cache). Add OTel spans and Prometheus counters on day one.
- Should we retry on timeouts?
- Only with a strict retry budget and jitter, and only if the provider’s SLO says it helps. Otherwise you amplify a brownout. Prefer immediate fallback after one quick retry.
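“One quick retry with jitter, then fallback” translates to a small wrapper. A hedged sketch; `call` and `fallback` stand in for your provider call and deterministic path:

```python
# One original attempt plus at most one jittered retry, all inside a hard
# latency budget; anything else goes straight to the fallback.
import random
import time

def call_with_retry(call, fallback, budget_ms: int = 800, base_delay_ms: int = 100):
    deadline = time.monotonic() + budget_ms / 1000
    for attempt in range(2):  # original try + one retry
        try:
            return call()
        except TimeoutError:
            if attempt == 1 or time.monotonic() >= deadline:
                break
            # full jitter: sleep a random slice before the single retry
            time.sleep(random.uniform(0, base_delay_ms) / 1000)
    return fallback()
```

The deadline check matters: a retry that starts after the budget is spent only amplifies a brownout.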
- How do we detect hallucinations automatically?
- Use output schema checks, consistency checks against retrieved facts, moderation flags, and a simple verifier model on critical flows. Track validation failure rates as a proxy for hallucination rate.
- How do we plan fallbacks without torpedoing UX?
- Design per-route fallbacks: cached answers for FAQs, templates for summaries, a smaller model for chat, or hide the AI section behind a feature flag. Measure the UX delta and make it visible in dashboards.
- What about cost blowups from token bloat?
- Set token budgets per route, monitor tokens_in/out, cap context size, and evict low-signal few-shot examples. Breaker should treat budget breaches as failures and trigger fallback.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
