Circuit Breakers and Fallbacks for AI: The Guardrails That Save You When Models Misbehave

LLMs will fail in production—hallucinations, drift, latency spikes, provider outages. Here’s how we wire breakers, fallbacks, and observability so those failures don’t take your business down.


The Friday-night faceplant that sold me on breakers

I’ve watched a team ship an LLM into checkout without a kill switch. Friday night, provider latency spiked from 300ms p95 to 6s p95, retries piled up, threads blocked, and a seemingly harmless “AI assist” dragged the whole request path down. Cart conversions cratered. The fix wasn’t a smarter prompt. It was old-school resilience: circuit breakers, hard timeouts, and business-aware fallbacks.

If you’re letting an AI call sit inline on a user flow, you need the same guardrails we used in the Hystrix days—plus a few new ones for hallucination and drift. Here’s what actually works in production.

What fails in AI flows (and how it burns you)

AI brings the usual distributed-systems pain and then some:

  • Latency spikes: model load, provider brownouts, cold starts, or outrageous context windows. If your p95 > 1.5s in a user path, you’ll feel it.
  • Hard failures: 5xx, rate limits, quota exhaustion, DNS issues, TLS handshake fails. Seen them all.
  • Hallucination: confident wrongness—bad product data, unsafe text, invented fields breaking downstream JSON.
  • Drift: model updates or data skew changing outputs subtly—your rules match yesterday’s shape, not today’s.
  • Cost blowouts: unbounded tokens or retries; a single hot prompt can incinerate your monthly budget.
  • Dependency cascades: one slow LLM call blocks threads, saturates connection pools, and tanks the rest of your stack.

If you don’t instrument this path end-to-end, your first signal will be customers tweeting screenshots. Don’t do that to yourself.

Put a breaker on every AI hop

This is table stakes: timeouts, retries (careful), and circuit breakers at both the code and mesh layers.

  • Timeouts: set per-hop hard timeouts aligned with SLOs. If checkout has 500ms budget for “AI recommendations,” the LLM call gets 250–300ms including retries.
  • Retries: only on idempotent calls; cap attempts; use jittered backoff. Never retry on 429 without pacing.
  • Circuit breakers: open on high error rate/timeouts; half-open to probe recovery; limit concurrent calls.
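Those retry rules can be sketched in a few lines (Python; a minimal illustration, with attempt caps and delays as placeholder values to tune against your SLOs):

```python
import random
import time

def retry_with_jitter(call, max_attempts=2, base_delay=0.1, max_delay=1.0):
    """Retry an idempotent call with a hard attempt cap and jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: fail fast so the breaker sees it
            # "full jitter": sleep a random fraction of the exponential delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

On a 429, you would widen the delay to honor the provider's Retry-After header instead of retrying immediately.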

TypeScript example with opossum (Node):

import CircuitBreaker from 'opossum';
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function llmCall(prompt: string) {
  const resp = await openai.chat.completions.create(
    {
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: prompt }],
    },
    { timeout: 250 } // ms hard cap; per-request options go in the second argument
  );
  return resp.choices[0].message?.content ?? '';
}

const breaker = new CircuitBreaker(llmCall, {
  timeout: 300,                 // ms per attempt
  errorThresholdPercentage: 50, // open if >=50% of recent calls fail
  resetTimeout: 30000,          // half-open after 30s
  rollingCountBuckets: 10,
  rollingCountTimeout: 10000,   // error window: 10s
});

breaker.fallback(async (prompt: string) => {
  // cheap fallback: cached result or rules engine
  return cachedRecommendation(prompt) ?? rulesBasedRecommendation(prompt);
});

export async function getRecommendation(prompt: string) {
  return breaker.fire(prompt);
}

Mesh-level protection (Istio/Envoy):

# DestinationRule: outlier detection & connection pool limits
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ai-provider
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
---
# VirtualService: timeouts & limited retries
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ai-provider
spec:
  hosts: ["api.openai.com"]
  http:
    - timeout: 0.3s
      retries:
        attempts: 1
        perTryTimeout: 0.15s
        retryOn: 5xx,connect-failure,refused-stream

Java? Use resilience4j. .NET? Polly. Ruby? Semian. Same story: set bounds, fail fast, probe recovery.

Fallbacks that don’t embarrass you

You don’t need one fallback—you need a ladder. Order them from fastest/cheapest to richest:

  1. Cache: serve last-known-good with TTL. Great for product Q&A or recommendations.
  2. Rules engine: deterministic defaults when you can’t call the model.
  3. Smaller/alternate model: drop from gpt-4.x to gpt-3.5 or switch to a self-hosted Llama with narrower context.
  4. RAG-only: return retrieved snippets or top-N answers without generation.
  5. Human-in-the-loop: queue for review when quality really matters.
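One way to wire the ladder is a single chain that tries each rung in order, model first, cheapest last (a sketch; the rung callables are hypothetical app code):

```python
def run_ladder(question, rungs):
    """Try each rung in order; the first non-None answer wins.

    `rungs` is a list of callables ordered model -> cache -> rules -> ...
    A rung signals "can't help" by raising or returning None.
    """
    for rung in rungs:
        try:
            answer = rung(question)
        except Exception:
            continue  # this rung failed; drop down the ladder
        if answer is not None:
            return answer
    # last resort: static copy that never surprises the user
    return "We're looking into that. Please check the Help Center."
```

Called as, say, `run_ladder(q, [call_llm, cache_get, rules_based_answer])`, so adding or reordering rungs is a one-line change.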

Python example with pybreaker and a fallback path:

import requests
from pybreaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(fail_max=5, reset_timeout=30)

def rules_based_answer(question: str) -> str:
    # trivial example; your domain rules go here
    if 'refund' in question.lower():
        return 'Visit /account/refunds for policy details.'
    return 'We’re getting that info. Check the Help Center.'

@breaker
def call_llm(question: str) -> str:
    r = requests.post(
        'https://api.openai.com/v1/chat/completions',
        headers={'Authorization': 'Bearer YOUR_API_KEY'},  # load from env/secrets in real code
        json={
            'model': 'gpt-4o-mini',
            'messages': [{'role': 'user', 'content': question}],
            'max_tokens': 128
        },
        timeout=0.3
    )
    r.raise_for_status()
    return r.json()['choices'][0]['message']['content']

def answer(question: str) -> str:
    try:
        return call_llm(question)
    except (requests.Timeout, requests.HTTPError, CircuitBreakerError):
        # fallback: cached → rules → smaller model
        return cache_get(question) or rules_based_answer(question)

Design fallbacks with product in mind. If “AI product copy” fails, serve the last published copy. If “fraud scoring” fails, switch to a conservative deterministic model and raise flags to manual review. Fallbacks should trade quality for safety, not surprise users.

Guardrails and validation: trust, but verify

The model is not your data contract. You need validation and safety checks before anything touches a user-facing surface or a downstream system.

  • Schema validation: require strict JSON; reject/repair on shape mismatch.
  • Policy filters: profanity/toxicity checks, PII redaction, prompt-injection screening.
  • Cost breakers: cap tokens per request/user/tenant; refuse or degrade when exceeded.
  • Prompt hygiene: render prompts from templates with versioning and input length limits.

Pydantic example enforcing output schema:

from pydantic import BaseModel, ValidationError

class ProductSummary(BaseModel):
    title: str
    bullets: list[str]
    sentiment: float  # -1..1

def validate_ai_output(raw: str) -> ProductSummary | None:
    try:
        return ProductSummary.model_validate_json(raw)
    except ValidationError:
        return None

raw = llm_call(prompt)
parsed = validate_ai_output(raw)
if not parsed:
    # Re-prompt with a function-calling schema or fall back
    parsed = fallback_summary(product)

Add a cost breaker:

  • Track tokens via provider headers or your tokenizer.
  • Maintain per-user/tenant counters in Redis.
  • If tokens_this_minute > quota, short-circuit to fallback.
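A minimal sketch of that cost breaker, using an in-memory counter in place of Redis (production would use a Redis INCRBY with a 60s TTL, and the quota number here is made up):

```python
import time
from collections import defaultdict

_usage = defaultdict(int)  # (tenant, minute) -> tokens spent; Redis in prod

def allow_tokens(tenant: str, tokens: int, quota_per_minute: int = 10_000) -> bool:
    """Return True if the request fits the tenant's per-minute token budget."""
    bucket = (tenant, int(time.time() // 60))
    if _usage[bucket] + tokens > quota_per_minute:
        return False  # over budget: short-circuit to the fallback path
    _usage[bucket] += tokens
    return True
```

Gate every LLM call with it, and emit a metric when it denies so finance surprises become alerts instead.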

And always redact: scrub secrets and PII from prompts/outputs in logs and traces. The one time you don’t is the time you get a subpoena.

Observability you can actually operate

If you can’t see it, you can’t save it. Instrument the full AI path with OpenTelemetry tracing and Prometheus metrics, attach model metadata, and expose breaker/fallback state.

  • Traces: a span around the AI call with attributes: model, provider, prompt_version, input_tokens, output_tokens, timeout_ms, retry_count, fallback_used, circuit_state.
  • Metrics:
    • ai_llm_request_duration_seconds (histogram)
    • ai_llm_requests_total{status} (counter; status=ok,timeout,5xx,validation_fail)
    • ai_fallback_total{type} (counter)
    • ai_circuit_open (gauge)
    • ai_cost_tokens_total{direction} (counter)
  • Logs: sample prompts/responses with redaction; link to traces via trace_id.

Node OpenTelemetry snippet:

import { trace } from '@opentelemetry/api';

async function tracedLLMCall(prompt: string) {
  const tracer = trace.getTracer('ai');
  return await tracer.startActiveSpan('llm.call', async (span) => {
    span.setAttribute('model', 'gpt-4o-mini');
    span.setAttribute('prompt_version', 'v23');
    try {
      const start = Date.now();
      const result = await getRecommendation(prompt);
      span.setAttribute('latency_ms', Date.now() - start);
      span.setStatus({ code: 1 }); // 1 = SpanStatusCode.OK
      return result;
    } catch (e: any) {
      span.recordException(e);
      span.setStatus({ code: 2, message: e.message }); // 2 = SpanStatusCode.ERROR
      throw e;
    } finally {
      span.end();
    }
  });
}
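On the metrics side, here is a toy stand-in for the counters and histogram listed above (illustrative only; real code would use prometheus_client Counters and a Histogram, but the status-label mapping is the same):

```python
import time
from collections import Counter

metrics = Counter()  # toy stand-in for a Prometheus registry

def record_llm_call(call, prompt):
    """Run an LLM call and emit status-labelled counters plus duration sums."""
    start = time.monotonic()
    try:
        result = call(prompt)
        metrics[("ai_llm_requests_total", "ok")] += 1
        return result
    except TimeoutError:
        metrics[("ai_llm_requests_total", "timeout")] += 1
        raise
    except Exception:
        # provider/transport errors collapsed into the 5xx label here
        metrics[("ai_llm_requests_total", "5xx")] += 1
        raise
    finally:
        # mirrors Prometheus histogram semantics: running sum plus count
        metrics[("ai_llm_request_duration_seconds", "sum")] += time.monotonic() - start
        metrics[("ai_llm_request_duration_seconds", "count")] += 1
```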

Prometheus alerts that actually catch pain early:

# 5m error rate >5%
- alert: AILLMHighErrorRate
  expr: sum(rate(ai_llm_requests_total{status=~"timeout|5xx|validation_fail"}[5m]))
        / sum(rate(ai_llm_requests_total[5m])) > 0.05
  for: 10m
  labels: { severity: page }
  annotations:
    summary: "AI path error rate >5%"

# p99 latency > 1.5s
- alert: AILLMHighLatency
  expr: histogram_quantile(0.99, sum(rate(ai_llm_request_duration_seconds_bucket[5m])) by (le)) > 1.5
  for: 10m
  labels: { severity: page }

# breaker open or fallback surge
- alert: AICircuitOpen
  expr: ai_circuit_open > 0
  for: 5m
  labels: { severity: ticket }

Add drift signals: monitor content embeddings distribution shift or output field distributions. At minimum, track “validation_fail rate” and “fallback usage” trends by prompt version.
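For the distribution-shift part, a population stability index over any discrete output signal (status labels, bucketed answer lengths, top intents) is cheap to compute; the 0.1/0.25 thresholds below are the common rule of thumb, not a standard:

```python
import math

def population_stability_index(expected: dict, actual: dict) -> float:
    """PSI between two discrete distributions (category -> probability).

    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate.
    """
    eps = 1e-6  # floor for categories unseen on one side
    psi = 0.0
    for cat in set(expected) | set(actual):
        e = max(expected.get(cat, 0.0), eps)
        a = max(actual.get(cat, 0.0), eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Compare today's distribution per prompt version against a frozen baseline and alert when the index crosses your threshold.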

Ship safely: flags, canaries, chaos

  • Feature flags: wrap AI features with a kill switch. Turn off only the AI path, not the entire endpoint.
  • Canary & shadow: use Argo Rollouts or similar to gate new prompts/models on metrics; shadow-traffic to compare outputs before users see them.
  • Chaos drills: inject latency/5xx/rate limits monthly. If no one knows the runbook, it doesn’t exist.

LaunchDarkly kill switch example:

import * as LaunchDarkly from 'launchdarkly-node-server-sdk';

const ld = LaunchDarkly.init(process.env.LD_SDK_KEY);
const ready = ld.waitForInitialization(); // await once at startup in real code

export async function guardedAI(req) {
  await ready;
  const enabled = await ld.variation('ai-recos-enabled', { key: req.user.id }, false);
  if (!enabled) return rulesBasedRecommendation(req);
  return tracedLLMCall(req.prompt);
}

Latency injection for drills (dev/staging):

# using toxiproxy to add 400ms latency to provider host
toxiproxy-cli create openai --listen 127.0.0.1:9999 --upstream api.openai.com:443
toxiproxy-cli toxic add --type latency --toxicName slow --toxicity 1.0 --attribute latency=400 --attribute jitter=50 openai

Rollouts: start at 1% with strict error/latency/cost gates. If your guardrails don’t trip in the first hour of a canary, they probably aren’t wired to anything.

If I had to do it again tomorrow

  • Keep the LLM off the critical path unless you can fully degrade without user pain.
  • Breakers at code and mesh; timeouts that match your SLOs; retries that don’t DDoS the provider.
  • Fallbacks that are boring but reliable.
  • Schema validation and safety filters before any output touches prod.
  • Deep, linked observability with breaker and fallback state.
  • A kill switch you’ve actually used in a drill.

AI is powerful, but the novelty wears off fast when you’re paging at 2 a.m. The teams that win treat AI like any flaky dependency—and build for failure from day one.


Key takeaways

  • Put a circuit breaker and hard timeout on every AI hop—provider SDK timeouts are not a strategy.
  • Fallbacks must be explicit and business-aware: cached, rules-based, smaller model, or human-in-the-loop.
  • Instrument the AI path end-to-end with OpenTelemetry and Prometheus; alert on failure rate, latency, fallback rate, cost, and drift signals.
  • Validate outputs with schemas and guardrails; never ship raw LLM text to production surfaces.
  • Test breakers and fallbacks with chaos drills, not on paying users.

Implementation checklist

  • Define per-hop SLOs and hard timeouts for AI calls.
  • Add circuit breakers in code and at the mesh/edge (Envoy/Istio).
  • Implement at least two fallbacks: cached/rules and smaller/alternate model.
  • Validate outputs against a strict schema; re-prompt or fall back on validation failure.
  • Instrument with OpenTelemetry; export Prometheus metrics for latency, errors, fallback rate, and costs.
  • Add a cost breaker (token/requests) and a big red kill switch via feature flags.
  • Canary and shadow-test any new prompt/model; gate rollout on metrics.
  • Run quarterly chaos drills: inject latency, 5xx, schema failures, and provider rate limits.

Questions we hear from teams

Do I still need a circuit breaker if the provider SDK has retries and timeouts?
Yes. Provider SDKs protect their edge, not your SLOs. You need application-level timeouts and breakers tuned to your budgets, plus mesh-level protections to stop cascades within your cluster.
What’s a sane timeout for inline user flows?
Work backward from your SLO. If the request has 600ms total, give the AI hop ~250–300ms including one retry. Anything slower must fall back or be async.
How do I detect hallucinations automatically?
Use schema validation to catch structural issues, retrieval-grounding checks (does the answer cite retrieved docs?), and heuristic classifiers for toxicity/PII. Track a “validation_fail” metric and fall back or re-prompt when it trips.
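The retrieval-grounding check can start as crude token overlap between the answer and the retrieved context (a heuristic sketch; whatever threshold you pick, say 0.5, is a starting point to tune, not a standard):

```python
def grounding_score(answer: str, retrieved_docs: list) -> float:
    """Fraction of answer tokens that also appear in the retrieved docs.

    Low overlap is a hallucination smell: count it as validation_fail
    and fall back or re-prompt.
    """
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    context_tokens = set()
    for doc in retrieved_docs:
        context_tokens.update(doc.lower().split())
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```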
What’s the fastest fallback that still looks good?
Cached last-known-good content with a short TTL, or a deterministic rules-based template. Users prefer slightly stale but correct over slow or wrong.
How do I prevent cost blowouts?
Enforce token quotas per request/user/tenant, cap context size, and monitor `ai_cost_tokens_total`. Add a cost breaker that downgrades models or refuses requests when budgets are exhausted.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

