Circuit Breakers for LLMs: How We Stop Hallucinations, Drift, and Latency Spikes From Taking Production Down

Your AI chain is only as strong as its weakest upstream. Here’s how we keep LLM-driven systems from faceplanting—and what to instrument so you actually know when they’re failing.

Fast failure beats slow chaos. Circuit breakers and fallbacks turn AI “maybe” into a dependable service.

The Thursday 3 p.m. Latency Storm

At a fintech I worked with, p95 response time for our RAG-backed support assistant jumped from 450 ms to 4.8 s at exactly 3:07 p.m. The culprit wasn’t the LLM itself—it was the vector store getting I/O throttled after a noisy neighbor incident. The app had optimistic retries with exponential backoff and no circuit breaker. Traffic piled up, threads starved, and our queue workers started timing out. The LLM kept returning, but downstream consumers were already on fire.

That afternoon we shipped two things that saved us repeatedly afterward: circuit breakers where they mattered and fallbacks that degraded gracefully. We also got serious about observability—traces for every hop, metrics for every failure mode, and safety guardrails so hallucinations couldn’t leak to customers.

What Actually Fails in AI Chains (And How It Bites You)

If you’ve shipped AI to prod, you’ve seen some flavor of these:

  • Latency spikes: vector DB cold caches, cross-region calls, rate limits, or a vendor brownout. Symptoms: thread/connection exhaustion, cascading timeouts.
  • Hallucinations: model returns fluent nonsense. Symptoms: incorrect actions, policy violations, refunds.
  • Drift: prompt or data changes shift behavior. Symptoms: gradual accuracy decay, rising escalation rates.
  • Schema breakage: model returns malformed JSON; downstream parsers crash.
  • Vendor instability: throttling or 5xx from OpenAI, Bedrock, Vertex AI, or self-hosted models.

Mitigations that actually work:

  • Bounded retries and short timeouts at each hop.
  • Circuit breakers with fast failure and ejection.
  • Layered fallbacks that keep UX acceptable.
  • Output validation and guardrails for content and structure.
  • End-to-end instrumentation and SLOs so you can see and react in time.
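The first two bullets above can be sketched in a few lines. This is a hypothetical stdlib-only helper (the name `call_with_budget` and its defaults are ours, not from any library) that enforces a hard per-attempt timeout and a bounded retry count, so a slow hop fails fast instead of piling up work:

```python
import concurrent.futures
import time

def call_with_budget(fn, *, attempts=2, per_try_timeout_s=0.6, backoff_s=0.1):
    """Bounded retries with a hard per-attempt timeout (illustrative sketch)."""
    last_err = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(attempts):
            future = pool.submit(fn)
            try:
                return future.result(timeout=per_try_timeout_s)
            except Exception as e:  # includes concurrent.futures.TimeoutError
                last_err = e
                time.sleep(backoff_s * (2 ** i))  # bounded exponential backoff
    raise last_err
```

Note the timed-out worker thread keeps running in this sketch; a real client should also cancel the in-flight request or close the connection.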

Put Circuit Breakers Where They Belong (Mesh + App)

Service meshes are great at protecting the network; apps still need their own fuses. Do both.

  • Mesh-level: shed load before it hits your pods; eject bad upstreams; cap concurrency.
  • App-level: trip fast based on real business signals (schema errors, hallucination risk), not just HTTP codes.

Istio example: timeouts, retries, outlier detection

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ai-upstream-vs
spec:
  hosts: ["llm.api.svc.cluster.local"]
  http:
    - route:
        - destination:
            host: llm.api.svc.cluster.local
            subset: primary
      timeout: 1500ms
      retries:
        attempts: 2
        perTryTimeout: 600ms
        retryOn: 5xx,connect-failure,reset
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-upstream-dr
spec:
  host: llm.api.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

App-level breaker in Node using opossum

import CircuitBreaker from 'opossum';
import { trace } from '@opentelemetry/api';
import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function callLLM(prompt: string) {
  const span = trace.getTracer('ai').startSpan('llm.invoke');
  span.setAttribute('ai.model', 'gpt-4o');
  const start = Date.now();
  try {
    const res = await client.chat.completions.create(
      {
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
      },
      { timeout: 1200 }, // ms hard cap, passed as a request option
    );
    span.setAttribute('ai.latency_ms', Date.now() - start);
    span.setAttribute('ai.total_tokens', res.usage?.total_tokens ?? 0);
    return res.choices[0].message.content;
  } finally {
    span.end();
  }
}

const breaker = new CircuitBreaker(callLLM, {
  timeout: 1500, // fail fast
  errorThresholdPercentage: 50,
  resetTimeout: 10000, // half-open after 10s
  capacity: 50, // bulkhead-ish
});

breaker.on('open', () => console.warn('llm breaker OPEN'));
breaker.on('halfOpen', () => console.warn('llm breaker HALF_OPEN'));
breaker.on('close', () => console.warn('llm breaker CLOSED'));

export async function safeLLM(prompt: string) {
  return breaker.fire(prompt);
}

Notes:

  • Keep timeouts under your SLO budget; don’t let slow AI starve the rest of the app.
  • Set breaker error conditions to include non-HTTP signals (schema invalid, high hallucination risk).
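To make the second bullet concrete, here is a minimal hand-rolled breaker sketch (not opossum, and not production-grade) that counts semantic failures like schema-invalid output the same as transport errors, which is exactly what a mesh-level breaker cannot do:

```python
import time

class SemanticBreaker:
    """Toy circuit breaker: trips on semantic failures, not just HTTP codes."""
    def __init__(self, threshold=3, reset_after_s=10.0):
        self.threshold = threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            return "half_open"
        return "open"

    def call(self, fn, validate):
        if self.state == "open":
            raise RuntimeError("breaker_open")  # fail fast; caller falls back
        try:
            result = fn()
            if not validate(result):  # semantic signal the mesh can't see
                raise ValueError("schema_invalid")
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None  # a success in half_open closes the breaker
        return result
```

The `validate` hook is where you plug in schema checks or a grounding gate.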

Fallbacks That Degrade, Not Betray

A fallback shouldn’t lie to the user. It should gracefully reduce capability while keeping trust.

Order of fallbacks we’ve seen work in production:

  1. Smaller/faster model with a compact prompt (gpt-4o-mini, Mistral-small).
  2. Retrieval-only summary with citations (skip generation when the model is the bottleneck).
  3. Cached known-good response from Redis keyed by normalized query.
  4. Human-in-the-loop or ticket creation with tight SLAs.

Python example: layered fallback with validation

import os, json
from pydantic import BaseModel, ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

class Answer(BaseModel):
    answer: str
    citations: list[str]
    confidence: float

@retry(stop=stop_after_attempt(2), wait=wait_exponential(multiplier=0.1, max=0.5))
def llm_call(model, prompt, timeout=1.2):
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        timeout=timeout,
    )

def retrieval_only(query):
    # pretend we hit vector DB and return top-3 docs
    docs = [
        {"title":"FAQ Billing","url":"/docs/billing","snippet":"…"},
    ]
    return Answer(answer="Based on our docs…", citations=[d['url'] for d in docs], confidence=0.6)

def validate_json(text:str)->Answer:
    try:
        obj = json.loads(text)
        return Answer(**obj)
    except (json.JSONDecodeError, ValidationError):
        raise ValueError('schema_invalid')

def answer(query:str)->Answer:
    prompt = f"Return JSON: {{answer, citations, confidence}}. Q: {query}"
    try:
        res = llm_call('gpt-4o', prompt)
        return validate_json(res.choices[0].message.content)
    except Exception:
        # fallback 1: smaller model
        try:
            res = llm_call('gpt-4o-mini', prompt, timeout=0.8)
            return validate_json(res.choices[0].message.content)
        except Exception:
            # fallback 2: retrieval-only
            ans = retrieval_only(query)
            if ans.confidence < 0.5:
                # fallback 3: handoff
                raise RuntimeError('handoff_required')
            return ans

  • Store a fallback_reason field in logs for every downgrade.
  • Never fabricate citations in a fallback—link to real sources or don’t claim sources at all.

Guardrails: Validate Structure, Evidence, and Policy

Guardrails should be automated and cheap. They decide when to accept, retry, or fall back.

  • Schema validation: accept only strict JSON. Reject on schema_invalid and trip breaker.
  • Evidence overlap for RAG: ensure the answer is grounded in retrieved docs.
  • Content policy: run moderation filters on input and output.
  • Safety budget: cap tokens, temperature, and tool calls; abort when exceeding budget.

Quick evidence-overlap gate

def grounded(enriched_answer: Answer, retrieved_chunks: list[str]) -> bool:
    # toy overlap metric: fraction of answer content words found in retrieved docs
    ans_terms = set(t for t in enriched_answer.answer.lower().split() if len(t) > 4)
    doc_terms = set()
    for ch in retrieved_chunks:
        doc_terms.update(t for t in ch.lower().split() if len(t) > 4)
    overlap = len(ans_terms & doc_terms) / (len(ans_terms) + 1)
    return overlap >= 0.2

Moderate and constrain

  • Use vendor moderation APIs or fast local classifiers (e.g., OpenAI moderations, AWS Comprehend toxicity detection, a HateXplain-style model).
  • Enforce max_tokens, temperature<=0.5, and a per-request token budget.
  • If grounded() fails or moderation flags content, fall back or handoff—don’t “retry until it passes.”
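The safety-budget bullet can be a tiny pre-dispatch guard. A sketch, with constants and the `enforce_budget` name being our own conventions:

```python
MAX_TEMPERATURE = 0.5
PER_REQUEST_TOKEN_BUDGET = 4000

def enforce_budget(params: dict, tokens_spent: int) -> dict:
    """Clamp generation params and abort once the per-request budget is gone.
    Call before every model or tool invocation in the chain."""
    if tokens_spent >= PER_REQUEST_TOKEN_BUDGET:
        raise RuntimeError("token_budget_exceeded")  # fall back, don't retry
    remaining = PER_REQUEST_TOKEN_BUDGET - tokens_spent
    return {
        **params,
        "temperature": min(params.get("temperature", 0.0), MAX_TEMPERATURE),
        "max_tokens": min(params.get("max_tokens", remaining), remaining),
    }
```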

Instrumentation That Actually Catches Problems

If you can’t see it, you can’t fix it. Wire tracing and metrics through the entire AI flow.

  • OpenTelemetry for traces across: web → retrieval → LLM → postprocessing. Propagate correlation_id.
  • Prometheus for RED metrics: request rate, error rate (including schema/hallucination failures), duration.
  • Logs must be structured with ai.request_id, ai.model, ai.latency_ms, ai.fallback_reason.

OpenTelemetry span around LLM call (TypeScript)

import { context, trace, SpanStatusCode } from '@opentelemetry/api';

async function llmWithTrace(fnName: string, invoke: () => Promise<any>) {
  const tracer = trace.getTracer('ai');
  return await tracer.startActiveSpan(`llm.${fnName}`, async (span) => {
    try {
      const res = await invoke();
      span.setAttribute('ai.vendor', 'openai');
      span.setAttribute('ai.model', res.model);
      span.setAttribute('ai.prompt_tokens', res.usage?.prompt_tokens ?? 0);
      span.setAttribute('ai.completion_tokens', res.usage?.completion_tokens ?? 0);
      return res;
    } catch (e:any) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      throw e;
    } finally {
      span.end();
    }
  });
}

Prometheus metrics and alerting

Expose histograms and counters from your gateway or app:

// Go pseudo
var Latency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
  Name: "ai_inference_latency_ms", Buckets: prometheus.LinearBuckets(100, 100, 30),
}, []string{"model","route"})
var Errors = prometheus.NewCounterVec(prometheus.CounterOpts{
  Name: "ai_failures_total",
}, []string{"reason","model"})

Alert when p95 crosses budget or breaker opens too often:

- alert: AILatencySpike
  expr: histogram_quantile(0.95, sum(rate(ai_inference_latency_ms_bucket[5m])) by (le)) > 2000
  for: 10m
  labels: { severity: critical }
  annotations:
    summary: "AI p95 latency > 2s"

- alert: AICircuitBreakerOpenRate
  expr: sum(rate(ai_breaker_open_total[5m])) by (service) > 0.1
  for: 5m
  labels: { severity: warning }

SLO targets we’ve used:

  • p95 < 1.5 s for user-facing endpoints.
  • Error rate (including schema invalid) < 2% over 24h.
  • Hallucination/grounding failure < 1% based on sampled evals.

Release AI Like You Release Infra: Flags and Canaries

Ship risky model/prompt changes behind flags and rollouts.

  • Feature flags: LaunchDarkly, Unleash, or Flagr for per-segment enablement and instant kill switches.
  • Canaries: Argo Rollouts/Flagger to ramp traffic to new model endpoints.

LaunchDarkly guard

const enabled = await ldClient.variation('ai-answer-enabled', user, false);
if (!enabled) return legacyAnswer();

Argo Rollouts for an LLM gateway

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-gateway
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 300}
        - analysis:
            templates:
            - templateName: ai-latency-analysis
        - setWeight: 50
        - pause: {duration: 600}
  selector:
    matchLabels:
      app: llm-gateway

Roll back automatically if your analysis template sees p95 or error-rate regression.

Drift: Detect It Before Your Users Do

Drift rarely throws a 500. It shows up as more escalations, lower CSAT, or subtle accuracy loss.

What we implement:

  • Golden set evals daily: fixed prompts/inputs with expected outputs; block deploys on regression.
  • Embedding distribution shift: track centroid/covariance of embeddings; alert when KL divergence changes beyond threshold.
  • Prompt checksum in traces: when a prompt file changes, you’ll see it in your spans immediately.
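The prompt-checksum bullet is a one-liner in practice. A sketch (the `ai.prompt_checksum` attribute name is our convention, not an OpenTelemetry standard):

```python
import hashlib

def prompt_checksum(prompt_text: str) -> str:
    """Stable short checksum of the prompt; attach it to every LLM span so a
    prompt change shows up in traces immediately."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

# Usage sketch:
# span.set_attribute("ai.prompt_checksum", prompt_checksum(prompt_text))
```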

Cheap drift check (offline batch):

  • Sample 200 recent queries, run through current and previous prompt/model.
  • Score with automatic metrics (BLEU/ROUGE for summaries, entailment classifiers for Q&A) plus a human pass on top 20.
  • If delta crosses guardrail, flip the feature flag back and open a ticket.
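The final step above reduces to a small gate. A sketch comparing mean eval scores between runs (the 0.05 threshold is an assumption to tune per product):

```python
def drift_gate(current_scores: list[float], previous_scores: list[float],
               max_drop: float = 0.05) -> bool:
    """Return True if the new prompt/model may ship: mean quality must not
    drop more than max_drop versus the previous baseline."""
    if not current_scores or not previous_scores:
        raise ValueError("need scores from both runs")
    cur = sum(current_scores) / len(current_scores)
    prev = sum(previous_scores) / len(previous_scores)
    return (prev - cur) <= max_drop
```

Wire the False branch to the feature-flag flip and ticket creation described above.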

Drift detection is where we usually cut MTTR by 20–30%—because you actually know when behavior changed and can revert instantly.


Key takeaways

  • Put circuit breakers at both the mesh and application layers; don’t rely on one or the other.
  • Favor graceful degradation: smaller models, cached answers, retrieval-only summaries, then human handoff.
  • Validate AI outputs against schemas and retrieval evidence; abort if confidence or overlap is too low.
  • Instrument every hop with OpenTelemetry; expose RED metrics to Prometheus and set real SLOs.
  • Use feature flags and canaries to roll AI changes the same way you’d roll infra.
  • Monitor for drift with regression suites, embedding distribution checks, and periodic rebaselining.

Implementation checklist

  • Define SLOs for latency, error rate, and hallucination rate before shipping.
  • Set mesh-level timeouts, retries (bounded), and outlier detection for AI upstreams.
  • Wrap AI calls in an app-level circuit breaker with fast timeouts and bulkheads.
  • Implement layered fallbacks: fast model → compact prompt → retrieval-only → cached → human.
  • Validate outputs with a strict schema and retrieval-overlap threshold; log fallbacks.
  • Instrument with OpenTelemetry and publish ai_* metrics to Prometheus; alert on p95 spikes and breaker open rate.
  • Gate risky prompts/models behind feature flags; canary with Argo Rollouts or Flagger.
  • Schedule drift checks and offline evals; rotate prompts and embeddings on findings.

Questions we hear from teams

Where should I put the circuit breaker: mesh or app?
Both. Use the mesh (Istio/Linkerd/Envoy) to shed load and eject bad upstreams. Use an app-level breaker to trip on semantic failures (schema invalid, low grounding) the mesh can’t see.
How do I detect hallucinations automatically?
Use a retrieval-overlap gate for RAG, JSON schema validation for structured outputs, and a lightweight verifier (NLI/entailment) for critical answers. Log and sample for human review.
What’s a sane fallback order?
Smaller/faster model; compact prompt; retrieval-only with citations; cached answer; then human handoff. Don’t invent data in fallbacks.
How do I pick SLOs for AI endpoints?
Start with user experience: p95 under 1–1.5s for interactive flows, <2% error including schema failures, and <1% grounding failures on sampled evals. Adjust based on margin of error and business tolerance.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a resilience assessment
Download the AI Resilience Checklist
