Circuit Breakers and Fallbacks for AI: The Guardrails That Save You When Models Misbehave
LLMs will fail in production—hallucinations, drift, latency spikes, provider outages. Here’s how we wire breakers, fallbacks, and observability so those failures don’t take your business down.
AI is powerful, but the novelty wears off fast when you’re paging at 2 a.m. Treat it like any flaky dependency and build for failure from day one.
The Friday-night faceplant that sold me on breakers
I’ve watched a team ship an LLM into checkout without a kill switch. Friday night, provider latency spiked from 300ms p95 to 6s p95, retries piled up, threads blocked, and a seemingly harmless “AI assist” dragged the whole request path down. Cart conversions cratered. The fix wasn’t a smarter prompt. It was old-school resilience: circuit breakers, hard timeouts, and business-aware fallbacks.
If you’re letting an AI call sit inline on a user flow, you need the same guardrails we used in the Hystrix days—plus a few new ones for hallucination and drift. Here’s what actually works in production.
What fails in AI flows (and how it burns you)
AI brings the usual distributed-systems pain and then some:
- Latency spikes: model load, provider brownouts, cold starts, or outrageous context windows. If your p95 > 1.5s in a user path, you’ll feel it.
- Hard failures: 5xx, rate limits, quota exhaustion, DNS issues, TLS handshake fails. Seen them all.
- Hallucination: confident wrongness—bad product data, unsafe text, invented fields breaking downstream JSON.
- Drift: model updates or data skew changing outputs subtly—your rules match yesterday’s shape, not today’s.
- Cost blowouts: unbounded tokens or retries; a single hot prompt can incinerate your monthly budget.
- Dependency cascades: one slow LLM call blocks threads, saturates connection pools, and tanks the rest of your stack.
If you don’t instrument this path end-to-end, your first signal will be customers tweeting screenshots. Don’t do that to yourself.
Put a breaker on every AI hop
This is table stakes: timeouts, retries (careful), and circuit breakers at both the code and mesh layers.
- Timeouts: set per-hop hard timeouts aligned with SLOs. If checkout has 500ms budget for “AI recommendations,” the LLM call gets 250–300ms including retries.
- Retries: only on idempotent calls; cap attempts; use jittered backoff. Never retry on `429` without pacing.
- Circuit breakers: open on high error rate/timeouts; half-open to probe recovery; limit concurrent calls.
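Jittered backoff fits in a couple of helpers. A minimal sketch of the full-jitter variant with `Retry-After` pacing for 429s (`backoff_delay` and `retry_after_seconds` are illustrative names, not from any SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: sleep a random amount in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)

def retry_after_seconds(headers: dict, default: float = 1.0) -> float:
    """On 429, honor the provider's Retry-After header instead of hammering it."""
    try:
        return float(headers.get('Retry-After', default))
    except (TypeError, ValueError):
        return default
```

The cap matters as much as the jitter: without it, attempt five of an exponential schedule can blow straight through your per-hop timeout budget.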
TypeScript example with opossum (Node):

```typescript
import CircuitBreaker from 'opossum';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function llmCall(prompt: string) {
  const resp = await openai.chat.completions.create(
    {
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: prompt }],
    },
    { timeout: 250 } // ms hard cap at the SDK layer, per request
  );
  return resp.choices[0].message?.content ?? '';
}

const breaker = new CircuitBreaker(llmCall, {
  timeout: 300,                 // ms per attempt
  errorThresholdPercentage: 50, // open if >=50% of recent calls fail
  resetTimeout: 30000,          // half-open after 30s
  rollingCountBuckets: 10,
  rollingCountTimeout: 10000,   // error window: 10s
});

breaker.fallback(async (prompt: string) => {
  // cheap fallback: cached result or rules engine (app-specific helpers)
  return cachedRecommendation(prompt) ?? rulesBasedRecommendation(prompt);
});

export async function getRecommendation(prompt: string) {
  return breaker.fire(prompt);
}
```

Mesh-level protection (Istio/Envoy):
```yaml
# DestinationRule: outlier detection & connection pool limits
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ai-provider
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
---
# VirtualService: timeouts & limited retries
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ai-provider
spec:
  hosts: ["api.openai.com"]
  http:
  - timeout: 0.3s
    retries:
      attempts: 1
      perTryTimeout: 0.15s
      retryOn: 5xx,connect-failure,refused-stream
```

Java? Use resilience4j. .NET? Polly. Ruby? Semian. Same story: set bounds, fail fast, probe recovery.
Fallbacks that don’t embarrass you
You don’t need one fallback—you need a ladder. Order them from fastest/cheapest to richest:
- Cache: serve last-known-good with TTL. Great for product Q&A or recommendations.
- Rules engine: deterministic defaults when you can’t call the model.
- Smaller/alternate model: drop from `gpt-4.x` to `gpt-3.5`, or switch to a self-hosted `Llama` with a narrower context.
- RAG-only: return retrieved snippets or top-N answers without generation.
- Human-in-the-loop: queue for review when quality really matters.
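The ladder above can be expressed as an ordered chain where the first rung that produces an answer wins. A sketch with illustrative rung names (none of these helpers are from a library):

```python
from typing import Callable, Optional

Rung = Callable[[str], Optional[str]]

def fallback_ladder(question: str, rungs: list[Rung]) -> str:
    """Try each rung fastest/cheapest first; a rung opts out by returning None or raising."""
    for rung in rungs:
        try:
            result = rung(question)
            if result is not None:
                return result
        except Exception:
            continue  # breaker open, timeout, validation failure: try the next rung
    return 'Sorry, please check back shortly.'  # last-resort static copy

# e.g. fallback_ladder(q, [cache_lookup, rules_engine, small_model, human_queue])
```

Keeping the ordering in one place makes it reviewable: product can argue about the ladder without reading breaker internals.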
Python example with pybreaker and a fallback path:

```python
import os
import requests
from pybreaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(fail_max=5, reset_timeout=30)

def rules_based_answer(question: str) -> str:
    # trivial example; your domain rules go here
    if 'refund' in question.lower():
        return 'Visit /account/refunds for policy details.'
    return "We're getting that info. Check the Help Center."

@breaker
def call_llm(question: str) -> str:
    r = requests.post(
        'https://api.openai.com/v1/chat/completions',
        headers={'Authorization': f'Bearer {os.environ["OPENAI_API_KEY"]}'},
        json={
            'model': 'gpt-4o-mini',
            'messages': [{'role': 'user', 'content': question}],
            'max_tokens': 128,
        },
        timeout=0.3,
    )
    r.raise_for_status()
    return r.json()['choices'][0]['message']['content']

def answer(question: str) -> str:
    try:
        return call_llm(question)
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError,
            CircuitBreakerError):
        # fallback ladder: cached -> rules (cache_get is your app's cache lookup)
        return cache_get(question) or rules_based_answer(question)
```

Design fallbacks with product in mind. If “AI product copy” fails, serve the last published copy. If “fraud scoring” fails, switch to a conservative deterministic model and raise flags to manual review. Fallbacks should trade quality for safety, not surprise users.
Guardrails and validation: trust, but verify
The model is not your data contract. You need validation and safety checks before anything touches a user-facing surface or a downstream system.
- Schema validation: require strict JSON; reject/repair on shape mismatch.
- Policy filters: profanity/toxicity checks, PII redaction, prompt-injection screening.
- Cost breakers: cap tokens per request/user/tenant; refuse or degrade when exceeded.
- Prompt hygiene: render prompts from templates with versioning and input length limits.
Pydantic example enforcing output schema:

```python
from pydantic import BaseModel, ValidationError

class ProductSummary(BaseModel):
    title: str
    bullets: list[str]
    sentiment: float  # -1..1

def validate_ai_output(raw: str) -> ProductSummary | None:
    try:
        return ProductSummary.model_validate_json(raw)
    except ValidationError:
        return None

raw = llm_call(prompt)  # llm_call / fallback_summary: your app's helpers
parsed = validate_ai_output(raw)
if not parsed:
    # Re-prompt with a function-calling schema or fall back
    parsed = fallback_summary(product)
```

Add a cost breaker:
- Track tokens via provider headers or your tokenizer.
- Maintain per-user/tenant counters in Redis.
- If `tokens_this_minute > quota`, short-circuit to the fallback.
And always redact: scrub secrets and PII from prompts/outputs in logs and traces. The one time you don’t is the time you get a subpoena.
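Redaction before logging can start as a pattern table. A hedged sketch (these regexes are illustrative and incomplete; production redaction deserves a vetted DLP library):

```python
import re

# Illustrative patterns only; tune and extend for your data.
PATTERNS = {
    'email': re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'),
    'api_key': re.compile(r'sk-[A-Za-z0-9]{16,}'),
    'card': re.compile(r'\b(?:\d[ -]?){13,16}\b'),
}

def redact(text: str) -> str:
    """Swap matches for a typed placeholder before text reaches logs or traces."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f'[REDACTED:{name}]', text)
    return text
```

Run it on both prompts and outputs, in the logging layer itself, so no code path can forget to call it.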
Observability you can actually operate
If you can’t see it, you can’t save it. Instrument the full AI path with OpenTelemetry tracing and Prometheus metrics, attach model metadata, and expose breaker/fallback state.
- Traces: a span around the AI call with attributes: `model`, `provider`, `prompt_version`, `input_tokens`, `output_tokens`, `timeout_ms`, `retry_count`, `fallback_used`, `circuit_state`.
- Metrics: `ai_llm_request_duration_seconds` (histogram), `ai_llm_requests_total{status}` (counter; status=ok,timeout,5xx,validation_fail), `ai_fallback_total{type}` (counter), `ai_circuit_open` (gauge), `ai_cost_tokens_total{direction}` (counter).
- Logs: sample prompts/responses with redaction; link to traces via `trace_id`.
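With `prometheus_client` in Python, the metric set above looks roughly like this (bucket boundaries and label values are illustrative defaults, not required values):

```python
from prometheus_client import Counter, Gauge, Histogram, generate_latest

LLM_LATENCY = Histogram('ai_llm_request_duration_seconds', 'LLM call latency',
                        buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 1.5, 3.0))
LLM_REQUESTS = Counter('ai_llm_requests_total', 'LLM calls by outcome', ['status'])
FALLBACKS = Counter('ai_fallback_total', 'Fallbacks served', ['type'])
CIRCUIT_OPEN = Gauge('ai_circuit_open', '1 while the breaker is open')
TOKENS = Counter('ai_cost_tokens_total', 'Tokens consumed', ['direction'])

def observe_call(status: str, seconds: float) -> None:
    """Record one AI call; wrap your breaker's fire() with this."""
    LLM_REQUESTS.labels(status=status).inc()
    LLM_LATENCY.observe(seconds)

observe_call('ok', 0.21)
scrape = generate_latest()  # bytes your /metrics endpoint would serve
```

Put bucket edges around your SLO (here 1.5s) so the latency alert can actually resolve the quantile it pages on.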
Node OpenTelemetry snippet:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

async function tracedLLMCall(prompt: string) {
  const tracer = trace.getTracer('ai');
  return await tracer.startActiveSpan('llm.call', async (span) => {
    span.setAttribute('model', 'gpt-4o-mini');
    span.setAttribute('prompt_version', 'v23');
    try {
      const start = Date.now();
      const result = await getRecommendation(prompt);
      span.setAttribute('latency_ms', Date.now() - start);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e: any) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      throw e;
    } finally {
      span.end();
    }
  });
}
```

Prometheus alerts that actually catch pain early:
```yaml
# 5m error rate >5%
- alert: AILLMHighErrorRate
  expr: |
    sum(rate(ai_llm_requests_total{status=~"timeout|5xx|validation_fail"}[5m]))
      / sum(rate(ai_llm_requests_total[5m])) > 0.05
  for: 10m
  labels: { severity: page }
  annotations:
    summary: "AI path error rate >5%"

# p99 latency > 1.5s
- alert: AILLMHighLatency
  expr: histogram_quantile(0.99, sum(rate(ai_llm_request_duration_seconds_bucket[5m])) by (le)) > 1.5
  for: 10m
  labels: { severity: page }

# breaker open or fallback surge
- alert: AICircuitOpen
  expr: ai_circuit_open > 0
  for: 5m
  labels: { severity: ticket }
```

Add drift signals: monitor shifts in content-embedding distributions or output field distributions. At minimum, track `validation_fail` rate and fallback-usage trends by prompt version.
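One cheap drift signal for output field distributions is the population stability index; a sketch (the PSI formula is standard, the 0.1/0.25 thresholds are a common rule of thumb, not a guarantee):

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions (fractions summing to 1).
    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

Bin a field like `sentiment` weekly, compare against the week the prompt version shipped, and alert when PSI crosses your threshold.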
Ship safely: flags, canaries, chaos
- Feature flags: wrap AI features with a kill switch. Turn off only the AI path, not the entire endpoint.
- Canary & shadow: use `Argo Rollouts` or similar to gate new prompts/models on metrics; shadow-traffic to compare outputs before users see them.
- Chaos drills: inject latency/5xx/rate limits monthly. If no one knows the runbook, it doesn’t exist.
LaunchDarkly kill switch example:

```typescript
import LDClient from 'launchdarkly-node-server-sdk';

const ld = LDClient.init(process.env.LD_SDK_KEY);

export async function guardedAI(req) {
  // default false: if LaunchDarkly is unreachable, the AI path fails closed
  const enabled = await ld.variation('ai-recos-enabled', { key: req.user.id }, false);
  if (!enabled) return rulesBasedRecommendation(req);
  return tracedLLMCall(req.prompt);
}
```

Latency injection for drills (dev/staging):
```bash
# using toxiproxy to add 400ms latency (±50ms jitter) to the provider host
toxiproxy-cli create openai --listen 127.0.0.1:9999 --upstream api.openai.com:443
toxiproxy-cli toxic add --type latency --toxicName slow --toxicity 1.0 \
  --attribute latency=400 --attribute jitter=50 openai
```

Rollouts: start at 1% with strict error/latency/cost gates. If your guardrails don’t trip in the first hour of a canary, they probably aren’t wired to anything.
If I had to do it again tomorrow
- Keep the LLM off the critical path unless you can fully degrade without user pain.
- Breakers at code and mesh; timeouts that match your SLOs; retries that don’t DDoS the provider.
- Fallbacks that are boring but reliable.
- Schema validation and safety filters before any output touches prod.
- Deep, linked observability with breaker and fallback state.
- A kill switch you’ve actually used in a drill.
AI is powerful, but the novelty wears off fast when you’re paging at 2 a.m. The teams that win treat AI like any flaky dependency—and build for failure from day one.
Key takeaways
- Put a circuit breaker and hard timeout on every AI hop—provider SDK timeouts are not a strategy.
- Fallbacks must be explicit and business-aware: cached, rules-based, smaller model, or human-in-the-loop.
- Instrument the AI path end-to-end with OpenTelemetry and Prometheus; alert on failure rate, latency, fallback rate, cost, and drift signals.
- Validate outputs with schemas and guardrails; never ship raw LLM text to production surfaces.
- Test breakers and fallbacks with chaos drills, not on paying users.
Implementation checklist
- Define per-hop SLOs and hard timeouts for AI calls.
- Add circuit breakers in code and at the mesh/edge (Envoy/Istio).
- Implement at least two fallbacks: cached/rules and smaller/alternate model.
- Validate outputs against a strict schema; re-prompt or fall back on validation failure.
- Instrument with OpenTelemetry; export Prometheus metrics for latency, errors, fallback rate, and costs.
- Add a cost breaker (token/requests) and a big red kill switch via feature flags.
- Canary and shadow-test any new prompt/model; gate rollout on metrics.
- Run quarterly chaos drills: inject latency, 5xx, schema failures, and provider rate limits.
Questions we hear from teams
- Do I still need a circuit breaker if the provider SDK has retries and timeouts?
- Yes. Provider SDKs protect their edge, not your SLOs. You need application-level timeouts and breakers tuned to your budgets, plus mesh-level protections to stop cascades within your cluster.
- What’s a sane timeout for inline user flows?
- Work backward from your SLO. If the request has 600ms total, give the AI hop ~250–300ms including one retry. Anything slower must fall back or be async.
- How do I detect hallucinations automatically?
- Use schema validation to catch structural issues, retrieval-grounding checks (does the answer cite retrieved docs?), and heuristic classifiers for toxicity/PII. Track a “validation_fail” metric and fall back or re-prompt when it trips.
- What’s the fastest fallback that still looks good?
- Cached last-known-good content with a short TTL, or a deterministic rules-based template. Users prefer slightly stale but correct over slow or wrong.
- How do I prevent cost blowouts?
- Enforce token quotas per request/user/tenant, cap context size, and monitor `ai_cost_tokens_total`. Add a cost breaker that downgrades models or refuses requests when budgets are exhausted.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
