Stop Letting LLMs 500 Your App: Circuit Breakers, Fallbacks, and Guardrails That Actually Work

The production playbook for when your AI services spike, drift, or flat-out hallucinate—and how to keep your app alive anyway.

Boring fallbacks beat magical failures. Ship the breaker before you ship the prompt.

The Friday outage that sold the team on circuit breakers

Two quarters ago, a retail client launched an AI-assisted product search. Friday evening, traffic doubled, their LLM provider had a latency wobble, and the node pool started thrashing. p95 jumped from 800ms to 12s, threads piled up behind retry storms, and downstream Redis went red. To add spice, the model drifted after a silent provider update—suddenly calling “toasters” “smart ovens,” which hammered their analytics. Classic.

We added circuit breakers, killed blind retries, and shipped boring fallbacks: semantic cache, template answers when the model timed out, and provider failover. MTTR dropped from hours to minutes. More importantly, the app stopped 500’ing anytime the model sneezed. This is the playbook we now install on day one at GitPlumbers.

Design the breaker before the model

Stop wiring prompts to production without guardrails. Define failure up front; a minimal code sketch of these budgets and the kill switch follows the list.

  • SLOs: e.g., p95 <= 2.5s, error_rate_5m < 1%, validation_pass_rate >= 98%, hallucination_rate < 2% on canary evals.
  • Budgets: per-request timeout (e.g., 2500ms), token ceilings (max 1,500 output tokens), and cost caps.
  • Admission control: rate limit QPS and tokens per tenant; shed low-priority traffic first.
  • Kill switches: feature-flag the AI path; be able to degrade to rules-based or cached responses instantly.
  • Versioning: prompts, tools, model IDs, and safety policies must be versioned and auditable.
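
Here's what a minimal version of those budgets and the kill switch can look like in code. The numbers come from the list above; the flags client is a stand-in for whatever feature-flag provider you use.

// typescript budgets and kill switch (illustrative names)
export const AI_BUDGETS = {
  requestTimeoutMs: 2500,            // per-request hard timeout
  maxOutputTokens: 1500,             // output token ceiling
  maxTokensPerTenantPerMin: 50_000,  // admission-control budget; shed low-priority traffic past this
  maxCostUsdPerTenantPerDay: 25,     // cost cap
};

export const AI_SLOS = {
  p95LatencyMs: 2500,
  errorRate5m: 0.01,
  validationPassRate: 0.98,
};

// Kill switch: check the flag before entering the AI path; degrade to rules/cache instantly.
export async function aiPathEnabled(flags: { isEnabled(key: string): Promise<boolean> }) {
  try {
    return await flags.isEnabled('ai_search'); // the same flag the runbook toggles
  } catch {
    return false; // if the flag service is down, fail closed to the boring path
  }
}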

If you don’t write the breaker’s thresholds first, you’re not building a system—you’re running a demo in production.

App-layer circuit breaker with safe fallbacks

You need the breaker close to where time and money are spent. In Node, opossum works. In Java, resilience4j. Python has pybreaker.

// typescript
import CircuitBreaker from 'opossum';
import pTimeout from 'p-timeout';
import fetch from 'node-fetch';

const LLM_ENDPOINT = process.env.LLM_ENDPOINT!;
// buildPayload, semanticCacheGet, callBackupProvider, templateAnswer, validateOrThrow,
// and validateOrTemplate are app-specific helpers (the cache and template pieces are sketched later).

async function callLLM(payload: any) {
  const res = await pTimeout(
    fetch(`${LLM_ENDPOINT}/v1/chat/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${process.env.LLM_KEY}` },
      body: JSON.stringify(payload),
    }),
    2500, // hard timeout in ms (newer p-timeout takes an options object; note this rejects without aborting the underlying request)
    'LLMTimeout'
  );

  if (!res.ok) throw new Error(`LLM-${res.status}`);
  const json = await res.json();
  return json;
}

// Circuit breaker: trip when >=50% failures over a volume of 20, cool down 10s
const breaker = new CircuitBreaker(callLLM, {
  timeout: 2600, // a bit above the 2500ms hard timeout so pTimeout fires first
  errorThresholdPercentage: 50,
  volumeThreshold: 20,
  resetTimeout: 10000,
});

// Fallback: semantic cache -> provider failover -> template
async function fallbackFn(payload: any) {
  const cached = await semanticCacheGet(payload);
  if (cached) return cached;

  try {
    return await callBackupProvider(payload); // e.g., switch to Anthropic/Ollama
  } catch (e) {
    return templateAnswer(payload); // safe, boring answer
  }
}

breaker.fallback(fallbackFn);

export async function answerQuery(input: string) {
  const payload = buildPayload(input);
  try {
    const result = await breaker.fire(payload);
    return validateOrThrow(result);
  } catch (e) {
    // Validation failed or the fallback itself threw: still return something safe
    return validateOrTemplate(e, payload);
  }
}

Notes:

  • Always validate the model’s output schema before returning. If validation fails, treat it as an error that can trigger the breaker.
  • Do not layer retries inside the breaker without jittered backoff and a global timeout.
  • Record breaker state changes as metrics and logs with correlation IDs (see the sketch below).
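
For that last note, here's a sketch of wiring opossum's state-change events into Prometheus counters and structured logs. The ai_breaker_open_total name matches the alert rules later in the post; the service label and the prom-client choice are ours, not opossum's.

// typescript breaker state metrics (illustrative)
import client from 'prom-client';

// breaker is the opossum instance from the snippet above; opossum emits events on state changes.
const breakerOpens = new client.Counter({
  name: 'ai_breaker_open_total', // the metric the alert rules below key on
  help: 'Circuit breaker open events',
  labelNames: ['service'],
});

breaker.on('open', () => {
  breakerOpens.inc({ service: 'product-search' });
  console.log(JSON.stringify({ event: 'breaker.open', service: 'product-search' })); // attach your correlation ID here
});
breaker.on('halfOpen', () => console.log(JSON.stringify({ event: 'breaker.halfOpen' })));
breaker.on('close', () => console.log(JSON.stringify({ event: 'breaker.close' })));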

Mesh-level protection: timeouts, retries, and outlier detection

Don’t trust app code alone. Put a second safety net in the mesh (Istio/Envoy, NGINX, Linkerd).

# istio DestinationRule: trip outlier detection on 5xx bursts and eject for 30s (external hosts like this also need a ServiceEntry)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-endpoint
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      http:
        http2MaxRequests: 1000
      tcp:
        maxConnections: 200
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    tls:
      mode: SIMPLE
# istio VirtualService: enforce timeouts and retries with backoff
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-route
spec:
  hosts: ["api.openai.com"]
  http:
    - timeout: 3s
      retries:
        attempts: 2
        perTryTimeout: 1s
        retryOn: 5xx,reset,gateway-error,connect-failure
      route:
        - destination: { host: api.openai.com }

Mesh-level timeouts stop zombie requests. Outlier detection ejects bad upstreams. Keep retries conservative; LLM endpoints are not idempotent in cost or latency.

Fallbacks that won’t embarrass you in front of customers

When the breaker opens, return something safe and useful.

  • Semantic cache: hash prompt + retrieved context + user ID; store answers and metadata (tokens, model ID). Good for FAQ and similar queries (sketch after this list).
  • Provider failover: primary -> secondary (gpt-4o -> claude-3-5-sonnet), or to an on-prem model (Ollama, Llama-3.1-70B-instruct) for continuity. Normalize interfaces.
  • Heuristic/templates: for critical paths (checkout, policy responses), serve a rules-based template with links to human support.
  • Last-known good: if RAG retrieval fails, fall back to the last indexed fact or a “we couldn’t find this; here’s how to proceed” message.
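
Here's a sketch of the semantic-cache lookup the fallback path calls, assuming ioredis and the key scheme from the list: hash of prompt + retrieved context + user ID. An exact-match hash covers repeat queries; swap the hash for an embedding lookup if you want fuzzy matches.

// typescript semantic cache (illustrative payload shape)
import Redis from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

// Key = hash of prompt + retrieved context + user ID, so "same question, same context" hits.
function cacheKey(payload: { prompt: string; context?: string; userId?: string }) {
  return 'ai:answer:' + createHash('sha256')
    .update(payload.prompt)
    .update(payload.context ?? '')
    .update(payload.userId ?? '')
    .digest('hex');
}

export async function semanticCacheGet(payload: { prompt: string; context?: string; userId?: string }) {
  const hit = await redis.get(cacheKey(payload));
  return hit ? JSON.parse(hit) : null;
}

export async function semanticCacheSet(payload: { prompt: string; context?: string; userId?: string }, answer: unknown) {
  // Store the validated answer plus metadata (tokens, model ID); expire after 24h so stale facts age out.
  await redis.set(cacheKey(payload), JSON.stringify(answer), 'EX', 60 * 60 * 24);
}
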
// zod validation ensures the model returns what we can safely render
import { z } from 'zod';

const Answer = z.object({
  answer: z.string().max(2000),
  sources: z.array(z.string().url()).min(1),
});

function validateOrTemplate(err: unknown, payload: any) {
  try {
    const attempt = templateAnswer(payload);
    const parsed = Answer.parse(attempt);
    return parsed;
  } catch (_) {
    return { answer: "We’re experiencing high load. Here’s a safe summary and a link to support.", sources: ["https://status.example.com"] };
  }
}

Boring beats wrong. Nobody remembers a safe template; everyone screenshots a hallucination.

Instrumentation and observability: you can’t fix what you can’t see

If you’re not tracing prompts and tokens, you’re flying blind. We instrument with OpenTelemetry and expose Prometheus metrics.

// typescript OTel instrumentation
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('ai-pipeline');
// breaker below is the opossum instance from the earlier snippet, so fallbacks still apply.

export async function tracedLLMCall(model: string, provider: string, payload: any) {
  return await tracer.startActiveSpan('llm.call', async (span) => {
    span.setAttribute('ai.model', model);
    span.setAttribute('ai.provider', provider);
    span.setAttribute('ai.prompt_version', payload.meta?.promptVersion ?? 'v1');

    const start = Date.now();
    try {
      const res = await breaker.fire(payload);
      span.setAttribute('ai.output_tokens', res.usage?.completion_tokens ?? 0);
      span.setAttribute('ai.input_tokens', res.usage?.prompt_tokens ?? 0);
      span.setAttribute('ai.cost_usd', res.usage?.cost_usd ?? 0); // providers return tokens, not dollars; attach cost yourself upstream
      span.setStatus({ code: SpanStatusCode.OK });
      return res;
    } catch (e: any) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      throw e;
    } finally {
      span.setAttribute('ai.latency_ms', Date.now() - start);
      span.end();
    }
  });
}

Prometheus rules worth having:

# prometheus alert rules
- alert: AIServiceBreakerOpen
  expr: sum(rate(ai_breaker_open_total[5m])) by (service) > 0.1
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Circuit breaker opened >10% over 5m"

- alert: AIValidationFailures
  expr: rate(ai_output_validation_failures_total[5m]) / rate(ai_requests_total[5m]) > 0.02  # assumes the app also exports an ai_requests_total counter
  for: 5m
  labels: { severity: warn }
  annotations:
    summary: "Output validation failures >2%"

- alert: AILatencySLOBreached
  expr: histogram_quantile(0.95, sum(rate(ai_request_duration_seconds_bucket[5m])) by (le)) > 2.5
  for: 10m
  labels: { severity: page }
  annotations:
    summary: "p95 > 2.5s for 10m"

Also log ai.request_id, ai.prompt_hash, ai.model, ai.provider, breaker.state, validation.error, and fallback.type. Tools like Langfuse or Helicone make LLM observability less painful.
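
A minimal sketch of that log line; field names follow the list above, and the plain console call stands in for pino, winston, or whatever logger you already run.

// typescript structured AI log line (illustrative)
function logAIRequest(fields: {
  requestId: string;
  promptHash: string;
  model: string;
  provider: string;
  breakerState: 'closed' | 'open' | 'halfOpen';
  validationError?: string;
  fallbackType?: 'cache' | 'backup_provider' | 'template';
}) {
  console.log(JSON.stringify({
    'ai.request_id': fields.requestId,
    'ai.prompt_hash': fields.promptHash,
    'ai.model': fields.model,
    'ai.provider': fields.provider,
    'breaker.state': fields.breakerState,
    'validation.error': fields.validationError,
    'fallback.type': fields.fallbackType,
  }));
}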

Guardrails for hallucination, drift, and cost spikes

I’ve seen teams rely on vibes for quality. Don’t. Build guardrails like you would for payments.

  • Schema validation: zod/pydantic enforce JSON outputs. Reject and retry with constrained prompts; never render invalid JSON to users.
  • Policy filters: profanity/toxicity checkers and PII redaction before display. Use provider moderation or NVIDIA NeMo Guardrails/Guardrails.ai.
  • Evaluator pipelines: run promptfoo, DeepEval, or ragas suites in CI and in shadow mode. Gate deploys on accuracy/toxicity.
  • Drift detection: treat model or embedding updates like a DB schema change. Canary 1-5% of traffic with shadow evals; compare precision/latency/cost.
  • Cost ceilings: circuit-break when per-tenant token burn exceeds budget (sketch after this list).
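
Here's a sketch of that cost ceiling, assuming per-tenant daily token counters in Redis; the budget number and key scheme are illustrative.

// typescript per-tenant token budget (illustrative)
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
const DAILY_TOKEN_BUDGET = 500_000; // per tenant; tune to your margins

// Returns false once a tenant has burned its daily budget; callers skip the model and go straight to fallback.
export async function withinTokenBudget(tenantId: string, estimatedTokens: number) {
  const key = 'ai:tokens:' + tenantId + ':' + new Date().toISOString().slice(0, 10); // one counter per day
  const used = await redis.incrby(key, estimatedTokens);
  await redis.expire(key, 60 * 60 * 48); // keep counters two days for reporting
  return used <= DAILY_TOKEN_BUDGET;
}

Call it before breaker.fire; when it returns false, skip the model and serve the cached or template answer.
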
# promptfoo example for CI evals
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet
prompts:
  - file: prompts/answer.md
assertions:
  - type: contains
    value: "Please contact support" # safety phrase present in fallback
  - type: toxicity
    threshold: 0.1
  - type: json_schema
    value:
      type: object
      properties: { answer: { type: string }, sources: { type: array, items: { type: string } } }
      required: [answer, sources]

If your RAG retriever returns an empty context, don’t let the model invent. Fail fast and return a template with a link to escalate.
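
A sketch of that fail-fast guard. The retrieve helper and score threshold are illustrative; breaker and buildPayload are the ones from the earlier snippet.

// typescript RAG low-signal guard (illustrative)
const MIN_SCORE = 0.35; // below this, the retrieved context is noise

export async function answerWithRAG(query: string) {
  const chunks = await retrieve(query); // [{ text, score }, ...] from your vector store
  const usable = chunks.filter((c: { text: string; score: number }) => c.score >= MIN_SCORE);

  // Empty or low-signal context: don't let the model invent. Return the escalation template instead.
  if (usable.length === 0) {
    return {
      answer: 'We couldn’t find this in our docs. Here’s how to reach support.',
      sources: ['https://status.example.com'],
    };
  }

  return breaker.fire(buildPayload(query, usable)); // normal guarded path; the extra context arg is illustrative
}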

Test failure like you mean it: chaos, load, and runbooks

A breaker you’ve never tripped is a breaker that won’t work when you need it.

  1. Load test the AI path with production-ish prompts.
  2. Fault inject 5xx/latency into your LLM upstream.
  3. Game day: pull provider creds, throttle network, watch the app degrade gracefully.

// k6 load test for AI endpoint
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = { vus: 50, duration: '5m' };

export default function () {
  const res = http.post(`${__ENV.API}/ai/answer`, JSON.stringify({ query: 'find me a red toaster under $50' }), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, {
    'status is 200 or fallback 204': (r) => [200, 204].includes(r.status),
    'latency < 1500ms': (r) => r.timings.duration < 1500,
  });
  sleep(1);
}

# Envoy fault injection via Istio for chaos testing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-faults
spec:
  hosts: ["api.openai.com"]
  http:
    - fault:
        delay:
          percentage: { value: 20 }
          fixedDelay: 2s
        abort:
          percentage: { value: 5 }
          httpStatus: 503
      route:
        - destination: { host: api.openai.com }

And yes, write the runbook: breaker thresholds, who gets paged, how to toggle the kill switch (feature_flag: ai_search=false), and how to roll back prompt/model versions with ArgoCD.

What changes after you ship breakers

  • The app keeps working when the LLM hiccups. Users see a safe template, not a spinner.
  • p95 drops by design because timeouts protect your thread pools.
  • On-call regains weekends because alerts trigger on breaker openings and validation failures, not on vague “CPU high.”
  • Product stops pushing silent model updates straight to prod; you gate with evals.

We’ve installed this pattern at fintechs, marketplaces, and B2B SaaS. It’s not fancy. It’s just what works. If you want a pair of hands that’s done it before, GitPlumbers can help you wire the breakers, guardrails, and observability without turning your stack inside out.


Key takeaways

  • Design the breaker before the model: set SLOs, timeouts, and kill switches upfront.
  • Implement circuit breakers in both app code and the mesh. Retries without timeouts will melt your pool.
  • Fallbacks must be boring and safe: caches, alternative providers, or heuristic templates—not magic.
  • Instrument everything: traces, tokens, prompts, validation errors, breaker open events.
  • Continuously evaluate for hallucination and drift with automated offline/online checks.
  • Chaos test AI failure modes the same way you test databases and queues.

Implementation checklist

  • Define AI SLOs (p95 latency, non-200 error rate, validation pass rate).
  • Add timeouts, budgets, and rate limits per request/token.
  • Implement app-level breaker with safe fallbacks and a global kill switch.
  • Configure mesh-level outlier detection, retries with backoff, and timeouts.
  • Instrument OTel spans with ai.* attributes; emit breaker/validation metrics.
  • Set up canary + shadow evals for model changes; version prompts and policies.
  • Create runbooks for breaker-open scenarios and hallucination spikes.
  • Run chaos drills: inject 5xx, timeouts, and slow responses into your AI path.

Questions we hear from teams

What thresholds should I use for my first circuit breaker?
Start conservative: request timeout 2–3s, volume threshold 20, error threshold 50%, reset 10s. Tune based on your provider’s p95 and your user experience. Don’t forget per-try timeouts if you enable retries.
How do I prevent hallucinations in a RAG system?
Validate outputs against a JSON schema, include confidence scores and citations, and short-circuit when retrieval returns low-signal context. Use evaluator suites (promptfoo/ragas) on canary traffic to catch drift before a full rollout.
Isn’t mesh-level retry enough?
No. Retries without application timeouts and a breaker will amplify latency spikes and cost. You need app-level control for validation, logging prompt versions, and triggering safe fallbacks.
How do I do provider failover safely?
Normalize the interface, keep prompts semantically equivalent across providers, and record per-provider metrics. Start with read-only shadow traffic to the backup. Only promote after passing SLOs and eval gates.
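
A minimal sketch of what normalizing the interface can look like; the shared type and failover loop are illustrative, not any SDK's API.

// typescript provider normalization (illustrative)
interface ChatProvider {
  name: string; // e.g. 'openai:gpt-4o', 'anthropic:claude-3-5-sonnet', 'ollama:llama3.1'
  complete(prompt: string, opts: { maxTokens: number; timeoutMs: number }): Promise<{ text: string; outputTokens: number }>;
}

export async function completeWithFailover(providers: ChatProvider[], prompt: string) {
  for (const provider of providers) {
    try {
      // record per-provider latency/token metrics here so the SLO and eval gates have data
      return await provider.complete(prompt, { maxTokens: 1500, timeoutMs: 2500 });
    } catch {
      // log the failure and fall through to the next provider
    }
  }
  throw new Error('all providers failed'); // let the breaker's template fallback take over
}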

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a Reliability Review
Download the AI Observability Playbook
