The AI Copilot That Faceplanted at 9:05 AM: How We Got It Stable Under Real Customer Load
A production case study in cleaning up AI-assisted code, fixing LLM-induced latency spikes, and turning “works in staging” into boring reliability.
Most “AI outages” are just distributed systems outages with a new failure mode: expensive calls, hard rate limits, and retry storms that multiply pain.
The week the AI copilot met real customers
The product was a familiar story: a mid-market B2B SaaS platform with a new in-app “copilot” that answered user questions and drafted workflows. The team had done what most teams do in 2024–2025: shipped fast using AI-assisted code generation. TypeScript service, Kubernetes, a pgvector table for retrieval, and an LLM call out to a hosted provider.
It looked fantastic in demos. In staging, with three people clicking around, it was “basically instant.” Then Monday hit.
By 9:05 AM local time, support tickets started rolling in:
- “Copilot hangs forever.”
- “Answers are blank.”
- “App is slow even outside copilot.”
And the dashboard was the nightmare combo: latency up, errors up, spend up.
GitPlumbers got pulled in mid-incident. The ask was blunt: “Make it stable without rewriting the product or turning the feature off.”
What was actually breaking (it wasn’t ‘AI magic’)
Under load, the copilot backend was doing three expensive things per request:
- Generate an embedding for the user question
- Query `pgvector` for context
- Call the LLM with the prompt + retrieved context
All reasonable—until you look at the implementation details that came out of AI-generated scaffolding:
- Unbounded concurrency: the service would happily spin up hundreds of in-flight LLM calls per pod.
- Retry storms: a “helpful” retry wrapper retried `429` and `5xx` immediately, with no jitter and no global cap.
- Missing timeouts: several outbound calls had no hard timeout; sockets could hang until the Node process choked.
- No backpressure: requests piled up until Kubernetes killed pods, which made retries worse.
- Token blowups: prompts occasionally exceeded expected size when retrieval returned large chunks, spiking latency and cost.
The root cause wasn’t the model “hallucinating.” It was classic distributed systems behavior:
- Thundering herd on retries
- Queue collapse (everything is a queue; you just didn’t pick where)
- Saturation of downstream dependencies (LLM provider + Postgres)
We’ve seen this movie before—just with different villains. In 2015 it was Elasticsearch timeouts. In 2019 it was gRPC retry policies gone wild. Now it’s LLM APIs.
Constraints we had to respect (aka: no heroics)
This wasn’t a greenfield reliability project. We had real constraints:
- Zero downtime tolerance: the copilot was already marketed and embedded in the UI.
- Hard vendor limits: the LLM provider enforced per-minute request/token caps; bursting caused `429`s.
- Kubernetes already running hot: “add more pods” would increase cost and still hit provider limits.
- No rewrite window: two-week sprint cadence, with releases behind feature flags.
- Compliance: tenant isolation mattered; caching couldn’t leak data across customers.
So we took the boring path that works: guardrails + backpressure + observability, shipped incrementally via canary.
The interventions that made it boring (in the good way)
We made changes in three layers: application code, platform config, and telemetry.
1) Put the LLM behind a concurrency bulkhead
First, we stopped each pod from becoming a denial-of-service machine.
In Node/TypeScript, we introduced a semaphore (or p-limit) so each pod could only run N in-flight LLM calls. Excess requests either queued briefly or failed fast with a user-friendly fallback.
```typescript
import pLimit from "p-limit";

// Bulkhead: at most N in-flight LLM calls per pod.
const limit = pLimit(Number(process.env.LLM_CONCURRENCY ?? 8));

export async function callLLM(req: LLMRequest) {
  return limit(() =>
    withTimeout(llmClient.chat.completions.create(req), 8000) // hard timeout
  );
}

async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`LLM call timed out after ${ms}ms`)), ms);
  });
  try {
    // If your client supports AbortController, also pass a signal so the
    // underlying socket gets torn down instead of left hanging.
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

This one change cut the “retry tornado” at the knees.
2) Fix retries: cap them, add jitter, and stop retrying the wrong things
The original code retried almost everything, immediately. We replaced it with:
- Max 2 retries
- Exponential backoff with jitter
- No retries on validation errors or prompt-too-large
- Respect
Retry-Afteron429
```typescript
import { setTimeout as sleep } from "node:timers/promises";

async function safeLLMCall<T>(fn: () => Promise<T>): Promise<T> {
  let attempt = 0;
  const max = 2;
  while (true) {
    try {
      return await fn();
    } catch (err: any) {
      const status = err?.status ?? err?.response?.status;
      const retryAfter = Number(err?.response?.headers?.["retry-after"]);
      // Don't retry on client mistakes (bad request, prompt too large).
      if (status === 400 || status === 413) throw err;
      if (attempt >= max) throw err;
      attempt++;
      // Exponential backoff with jitter; honor Retry-After on 429.
      const base = 200 * 2 ** attempt;
      const jitter = Math.floor(Math.random() * 150);
      const delay = Number.isFinite(retryAfter) ? retryAfter * 1000 : base + jitter;
      await sleep(delay);
    }
  }
}
```

Retry policy is one of those things AI-generated code gets almost right—which is worse than wrong.
3) Add caching where it actually pays off
Two caches mattered:
- Embeddings cache keyed by `(tenantId, normalizedQuestion)` with a TTL
- RAG context cache keyed by `(tenantId, docSetVersion, embeddingHash)`
We used Redis, with tenant-safe keys. No shared cache blobs without tenant prefixing.
```typescript
// Tenant-prefixed key: no cross-tenant cache leaks.
async function getEmbedding(tenantId: string, q: string) {
  const key = `t:${tenantId}:emb:${sha256(normalize(q))}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const emb = await embeddings.create({ input: q, model: "text-embedding-3-large" });
  await redis.set(key, JSON.stringify(emb), { EX: 60 * 60 * 24 }); // 24h TTL
  return emb;
}
```

This reduced both Postgres vector queries and LLM prompt sizes (because we could reuse stabilized retrieval results).
4) Introduce load shedding + a “good enough” fallback
We added a simple rule: if we’re saturated (queue depth too high, or circuit breaker open), we respond with:
- A short “copilot is busy” message
- A link to docs/search
- A request id for support correlation
Engineering leaders hate this until they see the alternative: cascading failure that takes out the whole app.
We also added a kill switch feature flag. When the LLM provider had a regional wobble, the copilot degraded gracefully instead of taking the rest of the API down.
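A minimal sketch of the saturation gate (the names `MAX_QUEUE_DEPTH`, `killSwitchOn`, and the thresholds are illustrative, not the production values):

```typescript
// Load-shedding gate: fail fast with a friendly fallback instead of queuing forever.
let inFlight = 0;
const MAX_QUEUE_DEPTH = 32; // illustrative threshold

type CopilotReply =
  | { kind: "answer"; text: string }
  | { kind: "busy"; message: string; requestId: string };

function killSwitchOn(): boolean {
  // In production this reads a feature flag service; env var here for the sketch.
  return process.env.COPILOT_KILL_SWITCH === "1";
}

function shedIfSaturated(requestId: string): CopilotReply | null {
  // Kill switch flipped or too much queued work: shed load, keep the rest of the app alive.
  if (killSwitchOn() || inFlight >= MAX_QUEUE_DEPTH) {
    return {
      kind: "busy",
      message: "Copilot is busy right now — try docs or search.",
      requestId, // for support correlation
    };
  }
  return null; // proceed with the real LLM call
}
```

The handler calls `shedIfSaturated` before touching the LLM; a non-null result is returned to the client immediately.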
5) Make Kubernetes scale on the right signals
CPU-based HPA didn’t help. The real bottleneck was outbound dependency capacity and in-flight work.
We added metrics for:
- in-flight LLM calls
- request queue depth
- p95 latency
Then tuned HPA to avoid runaway scaling that only amplifies 429.
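The HPA keys off a custom `llm_inflight` pods metric. Tracking and exposing that gauge is little code; in production prom-client plus a metrics adapter does this, but the sketch below (names illustrative) shows the shape:

```typescript
// Sketch: track in-flight LLM calls and expose them in Prometheus text format.
// In production you'd use prom-client and surface this as a Pods metric via an
// adapter; this hand-rolled version is only for illustration.
let llmInflight = 0;

function trackInflight<T>(p: Promise<T>): Promise<T> {
  llmInflight++;
  return p.finally(() => {
    llmInflight--;
  });
}

function renderMetrics(): string {
  return [
    "# HELP llm_inflight In-flight LLM calls in this pod",
    "# TYPE llm_inflight gauge",
    `llm_inflight ${llmInflight}`,
  ].join("\n");
}
```

Every `callLLM` invocation is wrapped in `trackInflight`, and `renderMetrics` backs the pod's `/metrics` endpoint.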
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: copilot-api
spec:
  scaleTargetRef: # target workload (name assumed to match the service)
    apiVersion: apps/v1
    kind: Deployment
    name: copilot-api
  minReplicas: 4
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          name: llm_inflight
        target:
          type: AverageValue
          averageValue: "8"
```

Observability that didn’t lie (LLM spans or it didn’t happen)
The team had logs, but not answers. We instrumented with OpenTelemetry and shipped traces to Grafana Tempo (Datadog works too; the point is consistent tracing).
We added spans with:
- `model` and `provider`
- prompt size (chars and estimated tokens)
- `cache.hit` boolean
- `retrieval.count` and chunk sizes
- response latency
- rate-limit responses (`429`)
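In practice the attribute set looks roughly like this builder (the real code attaches them via `span.setAttribute` from `@opentelemetry/api`; the helper, attribute names, and the chars/4 token estimate are illustrative):

```typescript
// Sketch: the attributes we attach to each LLM-boundary span.
interface LLMSpanAttrs {
  "llm.model": string;
  "llm.provider": string;
  "llm.prompt_chars": number;
  "llm.prompt_tokens_est": number;
  "cache.hit": boolean;
  "retrieval.count": number;
  "llm.rate_limited": boolean;
}

function llmSpanAttrs(opts: {
  model: string;
  provider: string;
  prompt: string;
  cacheHit: boolean;
  retrievalCount: number;
  status?: number;
}): LLMSpanAttrs {
  return {
    "llm.model": opts.model,
    "llm.provider": opts.provider,
    "llm.prompt_chars": opts.prompt.length,
    // Rough heuristic: ~4 characters per token for English text.
    "llm.prompt_tokens_est": Math.ceil(opts.prompt.length / 4),
    "cache.hit": opts.cacheHit,
    "retrieval.count": opts.retrievalCount,
    "llm.rate_limited": opts.status === 429,
  };
}
```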
That let us answer real questions fast:
- “Are we slow because Postgres is slow, or because the LLM is slow?”
- “Did caching help this tenant?”
- “Which endpoints are causing the biggest token usage?”
This is where a lot of AI products fall down: they observe the web API, not the AI boundary. We’ve seen teams spend weeks tuning prompts when their real problem is a missing timeout.
Results after three weeks (measured, not vibes)
We ran a load test that matched production patterns (bursts after standup, quiet mid-day, heavy end-of-quarter). k6 was good enough, with scenarios for “ask question,” “refine answer,” and “open copilot panel.”
```shell
k6 run --vus 80 --duration 15m ./load/copilot.js
```

After shipping changes via canary and rolling out behind a feature flag, the numbers moved materially:
- p95 latency: 4.8s → 1.2s (copilot endpoints)
- p99 latency: 11.4s → 2.9s
- 5xx rate: 3.5% → 0.2%
- LLM `429` rate: 6–9% during bursts → <0.5%
- MTTR (copilot incidents): ~90 minutes → 15 minutes (clear signals + kill switch)
- Monthly infra spend (copilot stack): down ~28% (less thrash, fewer retries, fewer wasted tokens)
- Cache hit rate: embeddings ~62%, retrieval context ~41% (varies by tenant behavior)
The “quiet win” was that the rest of the app stopped slowing down. Once we stopped retry storms and bounded concurrency, the shared Postgres and node pools stabilized.
What we tell teams building AI features now
If you’re shipping AI-assisted functionality into a real SaaS product, here’s what actually works (and what we’ve seen fail):
- Treat the LLM like a flaky downstream dependency, not a magic function call.
- Bound concurrency per pod. Unbounded async is just “DDOS myself” with nicer syntax.
- Retries must be globally budgeted. If you retry without jitter and caps, you’re manufacturing incidents.
- Cache aggressively but safely (tenant-aware keys, short TTLs, and clear invalidation rules).
- Design a degraded mode. A polite fallback is better than a brownout.
- Instrument tokens and prompts. If you don’t measure it, you’ll pay for it.
I’ve seen teams burn a quarter rewriting “the AI platform” when the fix was a semaphore, a timeout, and an HPA that scaled on the right metric.
At GitPlumbers, this is the work: stabilizing AI-assisted and legacy systems so you can ship without gambling your uptime on demo-grade code.
Related Resources
Key takeaways
- Most “AI app outages” are classic distributed-systems failures wearing an LLM costume: retry storms, unbounded concurrency, missing timeouts, and no backpressure.
- Stability came from **guardrails** (timeouts, circuit breakers, fallbacks), **backpressure** (queues + load shedding), and **observability** (LLM-specific spans + SLOs).
- Caching embeddings and responses (with sane TTLs and tenant scoping) is the fastest win for both latency and cost.
- Measure what matters: p95/p99 latency, 5xx rate, saturation, token usage, and downstream dependency health—not vibes from a staging demo.
- You don’t need “AI platform rewrites.” You need boring reliability engineering applied to the LLM boundary.
Implementation checklist
- Define 2-3 user-facing SLOs (latency + availability) and wire alerting to error budget burn.
- Put every LLM call behind `timeout`, `maxRetries`, and a circuit breaker with a fallback response.
- Add backpressure: queue or semaphore around concurrent LLM + vector DB calls.
- Cache embeddings and response fragments with tenant-safe keys and TTLs.
- Instrument with OpenTelemetry spans: prompt size, token counts, model, latency, and cache hit/miss.
- Run load tests that mimic real usage (burst + think time), then tune HPA based on saturation, not CPU alone.
- Ship changes via canary and feature flags; keep a kill switch for the AI feature.
Questions we hear from teams
- What was the single highest-leverage change?
- Bounding concurrency on LLM calls per pod (a bulkhead) plus hard timeouts. It immediately stopped runaway in-flight work and reduced cascading failures into Postgres and the rest of the API.
- Why not just scale Kubernetes pods until it works?
- Because the LLM provider had hard rate limits. Scaling pods increased parallelism, which increased 429s, which triggered retries, which made the system less stable and more expensive.
- How do you cache safely in a multi-tenant AI app?
- Use tenant-prefixed cache keys, avoid cross-tenant shared blobs, and include versioning (like `docSetVersion`) in keys so you can invalidate safely when content changes.
- What telemetry matters for LLM-backed endpoints?
- p95/p99 latency, error rate, in-flight concurrency, queue depth, cache hit rate, and LLM-specific attributes like token counts, model/provider, and rate-limit responses.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
