The Day GPT Went Dark: Circuit Breakers and Fallbacks That Saved Our AI (and Our Weekend)
If your AI calls don’t have circuit breakers, fallbacks, and real observability, you’re one provider outage away from a pager storm. Here’s how we actually ship this safely.
A circuit breaker on your LLM isn’t optional: it’s the difference between a blip and a full-blown incident.
Key takeaways
- Instrument AI calls first: model, parameters, token counts, latency, and safety flags must be visible in traces and metrics (see the instrumentation sketch after this list).
- Put circuit breakers at multiple layers: client, mesh (Envoy/Istio), and queue/worker to prevent cascading failures.
- Design layered fallbacks: cross-provider, downgraded modes (RAG-only), and semantic/cache responses with clear UX signals.
- Add safety guardrails before and after model calls: moderation, PII redaction, schema validation, and hallucination checks.
- Drill outages with fault injection; wire SLOs and error-budget burn alerts so you discover issues before customers do.
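Here is roughly what that first takeaway looks like in code: a minimal TypeScript sketch using `@opentelemetry/api`, where `callProvider` is a stand-in for your own provider SDK wrapper and the `llm.*` attribute names are illustrative, not a fixed convention.

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('ai-gateway');

// Shape of the response we expect back from the provider client.
// `callProvider` is a placeholder for your own SDK wrapper.
interface CompletionResult {
  text: string;
  promptTokens: number;
  completionTokens: number;
  flagged: boolean; // set by your moderation / safety post-check
}
declare function callProvider(prompt: string, model: string): Promise<CompletionResult>;

export async function instrumentedCompletion(prompt: string, model = 'gpt-4o'): Promise<CompletionResult> {
  return tracer.startActiveSpan('llm.completion', async (span) => {
    // Request parameters worth recording on every call.
    span.setAttribute('llm.provider', 'openai');
    span.setAttribute('llm.model', model);
    try {
      const result = await callProvider(prompt, model);
      // Token counts and safety flags; latency falls out of the span duration.
      span.setAttribute('llm.tokens.prompt', result.promptTokens);
      span.setAttribute('llm.tokens.completion', result.completionTokens);
      span.setAttribute('llm.safety.flagged', result.flagged);
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```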
Implementation checklist
- Define SLOs for AI calls (availability, p95 latency, correctness proxy).
- Add OpenTelemetry spans for every AI call with model, tokens, provider, and safety attributes.
- Enable circuit breaking and outlier detection in your mesh (Istio/Envoy) for your AI egress.
- Wrap AI calls with a client-side circuit breaker and exponential backoff (opossum/Resilience4j/Polly); see the breaker-and-fallback sketch after this checklist.
- Implement at least two fallbacks: cross-provider and degraded mode (RAG-only or cached).
- Introduce pre/post safety checks (moderation, schema validation, hallucination gate); see the safety-gate sketch after this checklist.
- Create runbooks and fault-inject regularly (Istio fault injection or k6).
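A minimal breaker-and-fallback sketch, assuming Node with `opossum`; `callOpenAI`, `callAzureOpenAI`, and `cachedOrTemplateAnswer` are hypothetical wrappers around your own clients, and the thresholds echo the guidance in the FAQ below rather than universal defaults.

```typescript
import CircuitBreaker from 'opossum';

// Placeholder provider clients -- swap in your real SDK wrappers.
declare function callOpenAI(prompt: string): Promise<string>;
declare function callAzureOpenAI(prompt: string): Promise<string>;
declare function cachedOrTemplateAnswer(prompt: string): Promise<string>;

// Timeout well under the user-facing SLA, open at ~50% failures over a 30s
// window, probe again after 30s. Tune these in staging chaos drills.
const breakerOptions = {
  timeout: 1500,                 // ms; roughly 60-70% of an interactive SLA
  errorThresholdPercentage: 50,  // open the circuit at 50% failures
  rollingCountTimeout: 30_000,   // failure-rate window
  resetTimeout: 30_000,          // half-open probe after 30s
  volumeThreshold: 10,           // don't trip on a handful of requests
};

const primary = new CircuitBreaker(callOpenAI, breakerOptions);
const secondary = new CircuitBreaker(callAzureOpenAI, breakerOptions);

// Layered fallback: primary provider -> secondary provider -> degraded mode.
primary.fallback((prompt: string) => secondary.fire(prompt));
secondary.fallback((prompt: string) => cachedOrTemplateAnswer(prompt));

primary.on('open', () => console.warn('primary LLM breaker open; failing over'));

export async function resilientCompletion(prompt: string): Promise<string> {
  return primary.fire(prompt);
}
```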
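And a sketch of the pre/post safety gate, assuming `zod` for schema validation; `moderatePrompt`, `redactPII`, and `completeJSON` are placeholders for whatever moderation, redaction, and structured-output calls you already run.

```typescript
import { z } from 'zod';

// Expected structured output from the model; reject anything that doesn't parse.
const AnswerSchema = z.object({
  answer: z.string().min(1),
  citations: z.array(z.string().url()).min(1), // cheap hallucination gate: require sources
});
type Answer = z.infer<typeof AnswerSchema>;

// Placeholders for your own moderation / redaction / model-call services.
declare function moderatePrompt(prompt: string): Promise<{ allowed: boolean }>;
declare function redactPII(text: string): Promise<string>;
declare function completeJSON(prompt: string): Promise<string>;

export async function guardedCompletion(rawPrompt: string): Promise<Answer> {
  // Pre-call checks: moderation and PII redaction before anything leaves your network.
  const verdict = await moderatePrompt(rawPrompt);
  if (!verdict.allowed) throw new Error('prompt rejected by moderation');
  const prompt = await redactPII(rawPrompt);

  // Model call (wrap this with the circuit breaker from the previous sketch).
  const raw = await completeJSON(prompt);

  // Post-call checks: strict schema validation catches truncated or made-up output.
  const parsed = AnswerSchema.safeParse(JSON.parse(raw));
  if (!parsed.success) throw new Error('model output failed schema validation');
  return parsed.data;
}
```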
Questions we hear from teams
- How do I choose circuit breaker thresholds for AI calls?
- Start with your user-facing SLA and provider behavior. Set timeouts to 60–70% of your SLA (e.g., 1–2s for interactive). Open the breaker when 40–60% of requests fail in a 10–30s window; half-open after 30s. Watch false-open rates in staging chaos drills and adjust.
- What’s a sane fallback order for an enterprise app?
- Cross-provider first (e.g., OpenAI → Azure OpenAI), then self-hosted vLLM (Llama 3.x) for continuity, then degrade to deterministic retrieval/template. Gate outputs with schema/moderation at every step and surface a clear “safe mode” indicator to users.
- How do I detect drift without labels?
- Track embedding distance distributions for queries vs. your corpus, acceptance rates, and guardrail hit rates. Sudden shifts are smoke. Periodically A/B a small slice of traffic to a canary model and compare these proxy metrics before full rollout; a small drift-tracking sketch follows this list.
- Is Hystrix still a thing?
- Hystrix is effectively in maintenance. Use Resilience4j for JVM, Polly for .NET, and `opossum` or `cockatiel` for Node. At the mesh edge, Envoy’s outlier detection and connection pools do the heavy lifting.
- How do I avoid retriable storms during provider outages?
- Budget retries: small max (1–2), exponential backoff with jitter, and only on idempotent paths. Combine with app-level and mesh-level circuit breaking, plus rate limiting (token bucket in Redis/Envoy) to cap concurrency under stress. A budgeted-retry sketch also follows this list.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
