The Day GPT Went Dark: Circuit Breakers and Fallbacks That Saved Our AI (and Our Weekend)
If your AI calls don’t have circuit breakers, fallbacks, and real observability, you’re one provider outage away from a pager storm. Here’s how we actually ship this safely.
A circuit breaker on your LLM isn’t optional: it’s the difference between a blip and a full-blown incident.
Key takeaways
- Instrument AI calls first: model, parameters, token counts, latency, and safety flags must be visible in traces and metrics (see the instrumentation sketch after this list).
- Put circuit breakers at multiple layers: client, mesh (Envoy/Istio), and queue/worker to prevent cascading failures.
- Design layered fallbacks: cross-provider, downgraded modes (RAG-only), and semantic/cache responses with clear UX signals.
- Add safety guardrails before and after model calls: moderation, PII redaction, schema validation, and hallucination checks.
- Drill outages with fault injection; wire SLOs and error-budget burn alerts so you discover issues before customers do.
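Here is roughly what that first takeaway looks like in code: a minimal TypeScript sketch using `@opentelemetry/api`, where `callProvider` is a stand-in for your own provider SDK wrapper and the `llm.*` attribute names are illustrative, not a fixed convention.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('ai-gateway');

// Shape of the response we expect back from the provider client.
// `callProvider` is a placeholder for your own SDK wrapper.
interface CompletionResult {
  text: string;
  promptTokens: number;
  completionTokens: number;
  flagged: boolean; // set by your moderation / safety post-check
}
declare function callProvider(prompt: string, model: string): Promise<CompletionResult>;

export async function instrumentedCompletion(prompt: string, model = 'gpt-4o'): Promise<CompletionResult> {
  return tracer.startActiveSpan('llm.completion', async (span) => {
    // Request parameters worth recording on every call.
    span.setAttribute('llm.provider', 'openai');
    span.setAttribute('llm.model', model);
    try {
      const result = await callProvider(prompt, model);
      // Token counts and safety flags; latency falls out of the span duration.
      span.setAttribute('llm.tokens.prompt', result.promptTokens);
      span.setAttribute('llm.tokens.completion', result.completionTokens);
      span.setAttribute('llm.safety.flagged', result.flagged);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```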
Implementation checklist
- Define SLOs for AI calls (availability, p95 latency, correctness proxy).
- Add OpenTelemetry spans for every AI call with model, tokens, provider, and safety attributes.
- Enable circuit breaking and outlier detection in your mesh (Istio/Envoy) for your AI egress.
- Wrap AI calls with a client-side circuit breaker and exponential backoff (opossum/Resilience4j/Polly); see the breaker-and-fallback sketch after this checklist.
- Implement at least two fallbacks: cross-provider and degraded mode (RAG-only or cached).
- Introduce pre/post safety checks (moderation, schema validation, hallucination gate); see the safety-gate sketch after this checklist.
- Create runbooks and fault-inject regularly (Istio fault injection or k6).
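A minimal breaker-and-fallback sketch, assuming Node with `opossum`; `callOpenAI`, `callAzureOpenAI`, and `cachedOrTemplateAnswer` are hypothetical wrappers around your own clients, and the thresholds echo the guidance in the FAQ below rather than universal defaults.

```typescript
import CircuitBreaker from 'opossum';

// Placeholder provider clients -- swap in your real SDK wrappers.
declare function callOpenAI(prompt: string): Promise<string>;
declare function callAzureOpenAI(prompt: string): Promise<string>;
declare function cachedOrTemplateAnswer(prompt: string): Promise<string>;

// Timeout well under the user-facing SLA, open at ~50% failures over a 30s
// window, probe again after 30s. Tune these in staging chaos drills.
const breakerOptions = {
  timeout: 1500,                 // ms; roughly 60-70% of an interactive SLA
  errorThresholdPercentage: 50,  // open the circuit at 50% failures
  rollingCountTimeout: 30_000,   // failure-rate window
  resetTimeout: 30_000,          // half-open probe after 30s
  volumeThreshold: 10,           // don't trip on a handful of requests
};

const primary = new CircuitBreaker(callOpenAI, breakerOptions);
const secondary = new CircuitBreaker(callAzureOpenAI, breakerOptions);

// Layered fallback: primary provider -> secondary provider -> degraded mode.
primary.fallback((prompt: string) => secondary.fire(prompt));
secondary.fallback((prompt: string) => cachedOrTemplateAnswer(prompt));

primary.on('open', () => console.warn('primary LLM breaker open; failing over'));

export async function resilientCompletion(prompt: string): Promise<string> {
  return primary.fire(prompt);
}
```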
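And a sketch of the pre/post safety gate, assuming `zod` for schema validation; `moderatePrompt`, `redactPII`, and `completeJSON` are placeholders for whatever moderation, redaction, and structured-output calls you already run.

```typescript
import { z } from 'zod';

// Expected structured output from the model; reject anything that doesn't parse.
const AnswerSchema = z.object({
  answer: z.string().min(1),
  citations: z.array(z.string().url()).min(1), // cheap hallucination gate: require sources
});
type Answer = z.infer<typeof AnswerSchema>;

// Placeholders for your own moderation / redaction / model-call services.
declare function moderatePrompt(prompt: string): Promise<{ allowed: boolean }>;
declare function redactPII(text: string): Promise<string>;
declare function completeJSON(prompt: string): Promise<string>;

export async function guardedCompletion(rawPrompt: string): Promise<Answer> {
  // Pre-call checks: moderation and PII redaction before anything leaves your network.
  const verdict = await moderatePrompt(rawPrompt);
  if (!verdict.allowed) throw new Error('prompt rejected by moderation');
  const prompt = await redactPII(rawPrompt);

  // Model call (wrap this with the circuit breaker from the previous sketch).
  const raw = await completeJSON(prompt);

  // Post-call checks: strict schema validation catches truncated or made-up output.
  const parsed = AnswerSchema.safeParse(JSON.parse(raw));
  if (!parsed.success) throw new Error('model output failed schema validation');
  return parsed.data;
}
```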
Questions we hear from teams
- How do I choose circuit breaker thresholds for AI calls?
- Start with your user-facing SLA and provider behavior. Set timeouts to 60–70% of your SLA (e.g., 1–2s for interactive). Open the breaker when 40–60% of requests fail in a 10–30s window; half-open after 30s. Watch false-open rates in staging chaos drills and adjust.
- What’s a sane fallback order for an enterprise app?
- Cross-provider first (e.g., OpenAI → Azure OpenAI), then self-hosted vLLM (Llama 3.x) for continuity, then degrade to deterministic retrieval/template. Gate outputs with schema/moderation at every step and surface a clear “safe mode” indicator to users.
- How do I detect drift without labels?
- Track embedding distance distributions for queries vs. your corpus, acceptance rates, and guardrail hit rates. Sudden shifts are smoke. Periodically A/B a small slice of traffic to a canary model and compare these proxy metrics before full rollout; a small drift-tracking sketch follows this list.
- Is Hystrix still a thing?
- Hystrix is effectively in maintenance. Use Resilience4j for JVM, Polly for .NET, and `opossum` or `cockatiel` for Node. At the mesh edge, Envoy’s outlier detection and connection pools do the heavy lifting.
- How do I avoid retriable storms during provider outages?
- Budget retries: small max (1–2), exponential backoff with jitter, and only on idempotent paths. Combine with app-level and mesh-level circuit breaking, plus rate limiting (token bucket in Redis/Envoy) to cap concurrency under stress. A budgeted-retry sketch also follows this list.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
