The Circuit Breaker That Saved Our LLM: Fallbacks, Guardrails, and Observability That Actually Work

Your LLM won’t take down prod if you treat it like the flaky dependency it is. Breakers, fallbacks, and hard telemetry turned our worst AI failure modes into manageable blips.

AI shouldn’t be a special snowflake in your architecture. Treat it like a flaky dependency and it’ll behave like one.

The day a “smart” feature DoS’d our own product

We shipped a shiny LLM-powered help experience behind a feature flag. Looked great in staging. Then a minor provider hiccup turned into a feedback loop: latency spiked, user requests piled up, threads blocked, pods autoscaled, DB connections saturated, and the incident channel lit up. The LLM didn’t just fail—it amplified load everywhere.

The fix wasn’t a bigger node group. It was putting a circuit breaker with hard timeouts in front of the model, adding layered fallbacks, and giving SREs real telemetry on the AI path. Since then, we treat LLMs like the flaky third-party services they are.

If your AI feature can’t fail gracefully, it’s a feature flag away from a self-inflicted outage.

What actually fails in AI systems (and why breakers help)

You’ve seen these in the wild:

  • Latency spikes: cold starts, provider congestion, or longer prompts push p95 from 1.8s to 12s.
  • Rate limits / 429s: traffic bursts or oversized prompts blow through token quotas.
  • Hallucinations: confident nonsense that passes schema but fails reality.
  • Drift: a model update changes behavior; your prompt no longer gates correctly.
  • Embedding staleness: outdated vectors lead to wrong context and worse answers.

Why circuit breakers matter:

  • Protect upstreams: cap concurrent calls and shed load when the model is struggling.
  • Fail fast: a fast fallback beats a slow “smart” answer every time.
  • Bound blast radius: isolate timeouts to the AI segment instead of dragging down the entire request.

We’ve implemented this at fintechs, marketplace platforms, and B2B SaaS. It’s boring reliability engineering applied to AI—and it works.

Build the breaker: timeouts, budgets, retries, fallbacks

Principles that haven’t failed me:

  1. Timeouts first, retries second, and retry budgets always. No unbounded retries.
  2. Circuit break on consecutive failures or elevated error rates; auto half-open with a small probe.
  3. Cap concurrency to the provider; use a queue if you must, but don’t block critical threads.
  4. Layered fallbacks that degrade gracefully.

Node/TypeScript with opossum

import CircuitBreaker from 'opossum';
import { OpenAI } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

async function llmCall(prompt: string) {
  const res = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.2,
  });
  return res.choices[0]?.message?.content || '';
}

const breaker = new CircuitBreaker(llmCall, {
  timeout: 5000,                 // ms per call
  errorThresholdPercentage: 50,  // open when >=50% of calls in the rolling window fail
  volumeThreshold: 20,           // minimum calls in the window before the breaker can trip
  resetTimeout: 30000,           // half-open after 30s
});

// Fast, safe fallback chain (redis client, hash(), searchFAQ(), localSmallModel() assumed elsewhere)
breaker.fallback(async (prompt: string) => {
  // 1. cached answer (Redis/HTTP cache) with TTL
  const cached = await redis.get(`ans:${hash(prompt)}`);
  if (cached) return cached + ' (cached)';

  // 2. retrieval-only from your docs without generation
  const faq = await searchFAQ(prompt);
  if (faq.confidence > 0.8) return faq.answer + ' (retrieval)';

  // 3. secondary provider or smaller local model
  try { return await localSmallModel(prompt); } catch {}

  // 4. rules-based template or graceful degradation
  return 'I can’t answer confidently right now. We’re on it.';
});

// usage
const answer = await breaker.fire(userPrompt);

Python with pybreaker

import pybreaker
from provider import llm_call

breaker = pybreaker.CircuitBreaker(fail_max=5, reset_timeout=30)

@breaker
def safe_llm(prompt: str) -> str:
    return llm_call(prompt, timeout=5.0)

try:
    ans = safe_llm(user_prompt)
except pybreaker.CircuitBreakerError:
    ans = retrieval_only(user_prompt) or 'Degraded mode: try again later.'

A note on retries: jitter and backoff are fine, but cap by a retry budget (e.g., 10% of traffic) so you don’t DDoS the provider during incidents.
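That budget can be sketched as a simple token bucket: real requests earn fractional retry tokens, each retry spends a whole one, and when the bucket is empty during an incident you stop retrying instead of piling on. The names (`RetryBudget`, `call_with_retries`) and the 10% ratio are illustrative, not a library API:

```python
import random
import time

class RetryBudget:
    """Token-bucket retry budget: retries may consume at most
    `ratio` of recent request volume (10% here, illustrative)."""
    def __init__(self, ratio: float = 0.1, min_tokens: float = 1.0):
        self.ratio = ratio
        self.tokens = min_tokens

    def record_request(self) -> None:
        # Each real request earns a fraction of a retry token.
        self.tokens = min(self.tokens + self.ratio, 100.0)

    def try_retry(self) -> bool:
        # A retry spends a whole token; deny when the budget is drained.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def call_with_retries(fn, budget: RetryBudget, max_attempts: int = 3):
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1 or not budget.try_retry():
                raise  # out of attempts or out of budget: fail fast
            # Exponential backoff with full jitter, capped at 2s.
            time.sleep(min(2.0, random.uniform(0, 0.2 * 2 ** attempt)))
```

During a sustained outage the budget drains quickly, so retries stop amplifying load long before the breaker even opens.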

Instrument everything: traces, metrics, logs you can act on

You can’t operate what you can’t see. Instrument the AI path like any other critical dependency.

  • Traces: one span per LLM call; record model, tokens, route, and breaker state.
  • Metrics: requests, errors, p95 latency, TTFB, fallback rate, guardrail block rate, eval pass rate.
  • Logs: redacted prompts/responses with correlation IDs.

OpenTelemetry trace + Prometheus metrics (TS)

import { trace, SpanStatusCode } from '@opentelemetry/api';
import client from 'prom-client';

const tracer = trace.getTracer('ai');
const aiRequests = new client.Counter({ name: 'ai_requests_total', help: 'LLM calls' });
const aiFallback = new client.Counter({ name: 'ai_fallback_total', help: 'Fallbacks triggered' });
const aiLatency = new client.Histogram({ name: 'ai_latency_ms', help: 'LLM latency', buckets: [50,100,250,500,1000,2000,5000,10000] });

// count fallbacks via opossum's 'fallback' event (fired whenever the chain runs)
breaker.on('fallback', () => aiFallback.inc());

async function tracedLLM(prompt: string) {
  aiRequests.inc();
  const start = Date.now();
  return tracer.startActiveSpan('llm.completion', async (span) => {
    span.setAttributes({ 'ai.model': 'gpt-4o-mini', 'ai.route': 'primary' });
    try {
      return await breaker.fire(prompt);
    } catch (e: any) {
      span.recordException(e); span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      aiLatency.observe(Date.now() - start);
      span.end();
    }
  });
}

Prometheus alerting rules

groups:
- name: ai-slo
  rules:
  - alert: LLMHighFallbackRate
    expr: sum(rate(ai_fallback_total[5m])) / sum(rate(ai_requests_total[5m])) > 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: 'Fallback rate >15% for 10m'
  - alert: LLMP95LatencyHigh
    expr: histogram_quantile(0.95, sum(rate(ai_latency_ms_bucket[5m])) by (le)) > 3000
    for: 15m
    labels:
      severity: page
    annotations:
      summary: 'p95 LLM latency >3s for 15m'

Tooling we’ve seen work well: Honeycomb for traces, Prometheus/Grafana for SLOs, Langfuse or Arize Phoenix for LLM-specific analytics, and BigQuery/Snowflake for offline eval logs.

Guardrails around the model: schema, moderation, retrieval confidence

We don’t trust free-form text. We validate, moderate, and gate answers before they touch users.

  • Schema validation: require JSON output; reject/repair when invalid.
  • Content moderation: classify unsafe content; block or sanitize.
  • Retrieval confidence: if RAG confidence is low, don’t generate—or watermark as low confidence.
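The confidence gate in that last bullet is a few lines of plain logic. A minimal sketch, assuming your retriever hands back (answer, confidence) pairs best-first; `gate_answer` and both thresholds are illustrative and should be tuned against your golden set:

```python
LOW_CONFIDENCE = 0.5       # below this: decline entirely
GENERATE_THRESHOLD = 0.8   # above this: safe to feed the LLM

def gate_answer(hits: list[tuple[str, float]]) -> dict:
    """Decide what to do with retrieval results before any generation runs."""
    if not hits:
        return {"action": "decline", "answer": None}
    answer, score = hits[0]
    if score >= GENERATE_THRESHOLD:
        return {"action": "generate", "answer": answer}
    if score >= LOW_CONFIDENCE:
        # Serve retrieval-only, watermarked so users know it is unverified.
        return {"action": "retrieval_only", "answer": f"{answer} (low confidence)"}
    return {"action": "decline", "answer": None}
```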

Pydantic schema + moderation fallback (Python)

from pydantic import BaseModel, conlist, constr, ValidationError

class Answer(BaseModel):
    answer: constr(max_length=600)
    citations: conlist(str, min_length=0, max_length=5)

raw = call_llm_json(prompt, schema=Answer.model_json_schema())

try:
    parsed = Answer.model_validate_json(raw)
    if is_flagged(parsed.answer):  # your moderation classifier/rules
        raise ValueError('Moderation block')
    deliver(parsed)
except (ValidationError, ValueError):
    # downgrade path
    faq = retrieval_only(prompt)
    deliver({
      'answer': faq.answer if faq else 'I can’t answer confidently. Escalating to support.',
      'citations': faq.citations if faq else []
    })

Guardrails should be fast and observable. Track ai_guardrail_block_total and ai_eval_fail_total to spot drift and prompt regressions.
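The offline half of that drift signal is a golden-set eval. A minimal sketch, assuming a set of (prompt, expected) pairs and an exact-match scorer (real pipelines use semantic similarity or an LLM judge; `eval_pass_rate` and the 97% threshold are assumptions, matching a <3% eval-failure budget):

```python
def eval_pass_rate(golden: list[tuple[str, str]], answer_fn) -> float:
    """Run the model (or a stub) over a golden set and return the
    fraction of normalized exact matches."""
    if not golden:
        return 1.0
    passed = sum(
        1 for prompt, expected in golden
        if answer_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return passed / len(golden)

def check_drift(rate: float, threshold: float = 0.97) -> bool:
    """True when the pass rate has dropped below the eval SLO."""
    return rate < threshold
```

Run it nightly and on every prompt or model change; a sudden drop in pass rate is usually the first sign of a silent model update.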

Route and degrade at the mesh/edge

App-level breakers are table stakes. You also want infrastructure-level safety nets.

Istio circuit breaking and outlier detection

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-proxy
spec:
  host: llm-proxy.ai.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Pair this with a VirtualService to do weighted routing between providers (primary/canary) and a kill switch via feature flags.
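A minimal VirtualService for that weighted split might look like this (host names and weights are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-route
spec:
  hosts:
  - llm-proxy.ai.svc.cluster.local
  http:
  - route:
    - destination:
        host: llm-primary.ai.svc.cluster.local
      weight: 80
    - destination:
        host: llm-secondary.ai.svc.cluster.local
      weight: 20
```

Adjusting the weights is a config change, not a deploy, which is exactly what you want mid-incident.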

Flag-controlled fallback (OpenFeature-style pseudocode)

const safeMode = flags.getBooleanValue('ai.safe_mode', false);
if (safeMode) return retrievalOnly(prompt);

const provider = flags.getStringValue('ai.provider', 'openai');
return routeTo(provider, prompt);

We’ve used this to shift 20–30% of traffic to a secondary provider during price spikes or partial outages, with zero deploys—just a flag flip.

Prove it works: chaos drills, SLOs, and runbooks

If you don’t rehearse, production will rehearse you.

  • SLOs that matter
    • p95 time-to-first-token < 1.5s; p95 end-to-end < 4s
    • Error rate < 2%; Fallback rate < 10%
    • Hallucination/eval failure rate < 3% on your golden set
  • Chaos scenarios to run monthly
    1. Provider returns 500/429 for 5 minutes.
    2. Latency spike to 8–10s p95.
    3. Schema drift: model starts emitting malformed JSON.
    4. RAG index half-stale: confidence drops by 20%.
  • Runbook must include
    • Flag names and kill switches (ai.safe_mode, ai.provider).
    • How to force all traffic to fallback.
    • Dashboards: traces, guardrail blocks, fallback rate, token spend.
    • Rollback plan for prompts, templates, and retrieval pipelines (GitOps, versioned).
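Scenarios 1 and 2 can be rehearsed without touching the real provider by injecting faults at the mesh. A sketch using Istio fault injection (names and percentages illustrative; scope it to a staging namespace first):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-chaos
spec:
  hosts:
  - llm-proxy.ai.svc.cluster.local
  http:
  - fault:
      abort:
        percentage:
          value: 50
        httpStatus: 503
      delay:
        percentage:
          value: 50
        fixedDelay: 8s
    route:
    - destination:
        host: llm-proxy.ai.svc.cluster.local
```

Watch the fallback-rate and p95 dashboards while this is applied; if the alerts from earlier don't fire, fix the alerts before the next real incident does it for you.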

Results we’ve seen after putting this in place at a retail client: p95 dropped from 8.4s → 3.1s, fallback rate stabilized under 6%, and MTTR on AI incidents went from 47m → 9m.

What I’d do on day one (and what GitPlumbers can help with)

  • Put a breaker + timeout in front of every LLM call. Today.
  • Establish layered fallbacks and wire them to feature flags.
  • Add OTel spans and Prom metrics for requests, latency, fallback, and guardrail blocks.
  • Gate outputs with schema validation and moderation.
  • Add mesh-level outlier detection and provider routing.
  • Write an AI runbook and schedule a monthly chaos drill.

If you want a sober partner who’s implemented this under fire, GitPlumbers will help you wire in breakers, guardrails, and observability without boiling the ocean—and we’ll leave you with dashboards and runbooks your team actually uses.


Key takeaways

  • Treat LLMs like an external dependency: isolate with a circuit breaker and strict timeouts.
  • Build layered fallbacks: cache, retrieval-only, smaller local model, rules-based, then human-in-the-loop.
  • Instrument everything: traces for each request, domain metrics for guardrail blocks and fallback rate, and alerts tied to SLOs.
  • Use guardrails at the edges: schema validation, moderation, and retrieval confidence gating to catch hallucinations.
  • Routable architecture beats heroics: canary providers, mesh-level outlier detection, and feature-flag kill switches.
  • Chaos test your AI flows regularly; rehearse failover so on-call doesn’t learn in production.

Implementation checklist

  • Define SLOs for AI routes (p95 latency, error rate, fallback rate, hallucination threshold).
  • Wrap LLM calls with a circuit breaker + timeout; cap concurrency.
  • Add layered fallbacks (cache → retrieval-only → secondary provider → rules) with feature-flag toggles.
  • Instrument traces and metrics: ai_requests_total, ai_fallback_total, ai_guardrail_block_total, p50/p95/TTFB.
  • Deploy mesh/edge protections (Istio outlier detection, retry budgets).
  • Add validation and moderation guardrails; reject or downgrade low-confidence answers.
  • Write a runbook and drill chaos scenarios (provider outage, drift, latency spike).

Questions we hear from teams

Do I still need retries if I have a circuit breaker?
Yes, but keep retries bounded by a budget and backoff with jitter. The breaker protects you during sustained failure; retries handle transient blips. Never retry on timeouts forever or you’ll amplify load.
What’s a good first fallback for customer-facing answers?
A retrieval-only answer from your documentation with a confidence watermark. It’s fast, safe, and often good enough. Follow with a smaller/cheaper model or a templated response if retrieval confidence is low.
How do I detect hallucinations automatically?
Use a combination of schema checks, retrieval confidence thresholds, and offline evaluations over golden datasets. Track an eval failure rate metric and alert on spikes. True hallucination detection is probabilistic—design for mitigation, not perfect detection.
Should I run multiple LLM providers?
If uptime or price volatility matter, yes. Use mesh-level routing with outlier detection and feature-flag switches. Keep prompts/templates provider-agnostic and versioned so you can shift traffic without a redeploy.
Where do breakers live—in app code or the mesh?
Both. App-level breakers handle business-aware fallbacks and timeouts. Mesh-level policies protect the fleet and let you eject bad upstreams quickly. They’re complementary, not substitutes.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

