The AI Assistant That Melted at 2k RPS (And How We Got It Boring Again in 10 Days)
A Series B SaaS shipped an AI-assisted workflow that demoed like magic—and faceplanted under real customer load. We were called on a Friday; by the next week their p95 fell 83%, error rate dropped 10x, and cloud spend per request halved.
“We didn’t change the model. We changed the plumbing.”
The Friday Night Page: AI Magic Meets Real Traffic
They’d just GA’d an AI-assisted workflow for customer support agents—summarization, suggested replies, and auto-tagging—built with LangChain 0.2, OpenAI gpt-4o via Azure, retrieval from Pinecone, and glue code in Node 18/Express plus a Python RAG worker. Demos crushed. The first real customer (3k agents) turned it on at noon and everything went sideways: p95 3.8s, 7.2% 5xx on peaks, and the Kubernetes cluster’s CPU was yawning while the queue was on fire.
We got the call at 7:12 PM. “We can turn it off or fix it. Monday is our renewal date.” I’ve seen this movie since the first “bot” craze in 2016. The tech changes, the failure modes rhyme.
“The system didn’t crash. It melted.”
Constraints were brutal:
- Real customers on Monday; no maintenance window.
- Azure OpenAI rate limits; Pinecone with per-index QPS caps.
- EKS + Istio 1.20 + ArgoCD already in place (GitOps or bust).
- Tight budget: no vendor shopping spree.
We said yes because the failure was familiar: no backpressure, wrong autoscaling signal, and a pile of AI-generated glue code stitched by “vibe coding.”
What We Walked Into (The Usual Suspects)
After 4 hours of traces and dashboards, here’s what actually broke:
- Concurrency Explosion: Express was fanning out 3–5 downstream calls per request (embeddings, retrieval, model, tools) with no concurrency caps. Under 2k RPS, that’s 10k+ concurrent I/O operations.
- CPU-based HPA: HPA looked at CPU, but the bottleneck was external I/O and upstream rate limits. Pods scaled slowly and pointlessly.
- No Backpressure: bullmq pushed unbounded jobs. When Pinecone throttled, the queue doubled, then tripled: classic thundering herd.
- Missing Circuit Breakers: Istio had generous timeouts; the app retried on 429 without jitter. We DDoS’d our own vendors.
- Prompt Bloat & No Cache: 2–3KB system prompts plus oversized conversation windows. No prompt or response caching; every miss hit the model.
- Streaming Off: Users waited for the full response while we built it server-side. Perceived latency was awful.
- Observability Gap: Traces stopped at the service boundary. No span linking from HTTP to Pinecone/OpenAI; SLOs were vibes, not math.
I’ve seen this fail at three unicorns. Different logos, same smell: AI features added like a plugin, not a system.
The Stabilization Plan (10 Days, No Heroics)
We sequenced changes behind feature flags and shipped via canaries. The goal: restore boring.
- Add backpressure and concurrency caps.
- Put circuit breakers at the mesh and the client.
- Scale on the right signals (concurrency/tokens).
- Stream responses and cache prompts/results.
- Define SLOs, wire tracing, and automate rollback.
1) Concurrency and Backpressure
- We kept bullmq but added a strict maxStalledCount, sane concurrency, and a bounded producer. For HTTP, we introduced per-request concurrency with p-limit and cancellations.
// Node 18 – cap concurrent downstreams and add timeouts
import pLimit from 'p-limit';
import pRetry from 'p-retry';
const limit = pLimit(parseInt(process.env.DOWNSTREAM_CONCURRENCY || '8', 10));
// Run a downstream call with an abortable timeout
const withTimeout = async <T>(fn: (signal: AbortSignal) => Promise<T>, ms = 3000) => {
  const ac = new AbortController();
  const t = setTimeout(() => ac.abort(), ms);
  try {
    return await fn(ac.signal);
  } finally {
    clearTimeout(t);
  }
};
async function callModel(input: ModelInput) {
  return limit(() => pRetry(() => withTimeout((signal) => openAI.complete(input, { signal })), {
    retries: 2,
    factor: 2,
    minTimeout: 100,
    maxTimeout: 800,
    randomize: true, // jittered backoff so retries don't synchronize
    onFailedAttempt: () => meter.rate('openai.retry').mark(),
  }));
}
- BullMQ queue settings went from “infinite optimism” to bounded reality.
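On the worker side, here is a minimal sketch of what “bounded reality” looked like, assuming BullMQ v5; the queue name matches the producer below, but the handler module and the exact numbers are illustrative.
// RAG worker – bounded concurrency and stall handling (numbers illustrative)
import { Worker } from 'bullmq';
import { processRagJob } from './rag-pipeline'; // hypothetical handler module
const worker = new Worker('rag-job', async (job) => processRagJob(job.data), {
  connection: { host: process.env.REDIS_HOST ?? 'localhost', port: 6379 },
  concurrency: 8,       // cap in-flight jobs per worker pod
  maxStalledCount: 1,   // one recovery attempt for stalled jobs, then fail them
  lockDuration: 30000,  // fail fast instead of holding job locks forever
});
worker.on('failed', (job, err) => console.error(`rag-job ${job?.id} failed: ${err.message}`));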
// Queue producer – refuse work when pressure is high
const pending = await queue.getWaitingCount();
if (pending > 2000) throw new Error('Backpressure: queue saturated');
await queue.add('rag-job', payload, { removeOnComplete: true, attempts: 2 });
2) Circuit Breakers, Timeouts, Retries (in the Mesh)
Put the big levers outside the app. Istio can protect you from yourself.
# Istio DestinationRule with outlier detection (Azure OpenAI)
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: openai-dr
spec:
  host: openai.azure.com
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Retries belong on the route, not the DestinationRule, so we moved them into the VirtualService along with real timeouts.
# Istio VirtualService with tight timeouts and bounded retries
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app-vs
spec:
  hosts: ["app.svc.cluster.local"]
  http:
  - route:
    - destination: { host: app }
    timeout: 6s
    retries:
      attempts: 2
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure,refused-stream
3) Scale on the Right Signals (KEDA + tokens/sec)
CPU didn’t move; tokens and in-flight requests did. We exposed custom metrics via OpenTelemetry and scaled with KEDA.
# KEDA ScaledObject – scale web on in-flight requests
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-scaled
spec:
  scaleTargetRef:
    name: web
  minReplicaCount: 3
  maxReplicaCount: 60
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server
      metricName: http_inflight
      threshold: "600"
      query: sum(http_server_active_requests{job="web"})
We did similar for the RAG worker with tokens_emitted_per_sec to avoid token tail-latency spikes.
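Neither http_server_active_requests nor a tokens-per-second series exists until the app exports it. Here is a minimal sketch of exposing both with prom-client; the team wired theirs through OpenTelemetry, so treat the exporter choice and the counter name as assumptions (the gauge name matches the KEDA query above).
// Expose the scaling signals: in-flight HTTP requests and tokens emitted
import express from 'express';
import { Counter, Gauge, register } from 'prom-client';
const inflight = new Gauge({ name: 'http_server_active_requests', help: 'Requests currently in flight' });
const tokensEmitted = new Counter({ name: 'tokens_emitted_total', help: 'LLM tokens streamed to clients' });
const app = express();
// Track in-flight requests around every handler
app.use((req, res, next) => {
  inflight.inc();
  res.on('finish', () => inflight.dec());
  next();
});
// Call from the streaming path for each chunk of tokens
export const recordTokens = (count: number) => tokensEmitted.inc(count);
// Prometheus scrape endpoint
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
A worker-side ScaledObject can then trigger on something like rate(tokens_emitted_total[1m]) instead of CPU.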
4) Stream Early, Cache Aggressively
- Enabled SSE streaming from the model through to the browser so agents saw tokens within ~200ms (sketch after this list).
- Implemented a prompt/answer cache keyed on hash(prompt+retrieval_snapshot+model_version) with 5–15 minute TTLs.
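The streaming half, as a minimal sketch assuming Express and the OpenAI Node SDK’s streaming chat API; the route, model name, and payload shape are illustrative, and Azure OpenAI needs its own client configuration.
// SSE endpoint – forward model tokens to the browser as they arrive
import express from 'express';
import OpenAI from 'openai';
const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY; swap in the Azure client config as needed
app.post('/ai/suggest-reply', express.json(), async (req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: req.body.messages,
    stream: true,
  });
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});
Cache hits replay through the same SSE path (the streamToClient call below), so the browser never needs to know whether an answer was cached.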
// Simple Redis prompt cache with versioned key
const key = `ai:v2:${hash(prompt)}:${retrievalHash}:${model}`;
const cached = await redis.get(key);
if (cached) return streamToClient(JSON.parse(cached));
const result = await streamFromModel(...);
redis.set(key, JSON.stringify(result), 'EX', 600);
5) SLOs, Traces, and Automated Rollback
- Defined SLOs: 99.5% <= 1s for perceived latency (first token) and <1% 5xx for AI endpoints.
- Instrumented spans from HTTP -> Pinecone -> OpenAI with traceparent propagation.
- Canary + auto-rollback with Argo Rollouts on SLO burn.
# Argo Rollout – canary with metric guardrail
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 5m }
      - setWeight: 25
      - pause: { duration: 10m }
      analysis:
        templates:
        - templateName: slo-check
        startingStep: 0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-check
spec:
  metrics:
  - name: error-rate
    interval: 2m
    failureLimit: 2
    failureCondition: result[0] > 0.01
    provider:
      prometheus:
        address: http://prometheus-server
        query: |
          sum(rate(http_requests_total{app="web",status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{app="web"}[2m]))
Shipping Behind a Safety Net (Observability That Matters)
We consolidated dashboards to the two questions that kill you in production:
- “Are we burning our error budget?”
- “Where is the tail?”
Prometheus alerts cut the noise; only page on symptoms users feel.
# PrometheusRule – page on SLO burn and tail latency
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-slo
spec:
  groups:
  - name: ai
    rules:
    - alert: AILatencyP95Breaching
      expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/ai"}[5m])) by (le)) > 1
      for: 10m
      labels: { severity: page }
      annotations:
        description: p95 > 1s for 10m on /ai
    - alert: AIErrorBudgetBurn
      expr: (sum(rate(http_requests_total{route="/ai",status=~"5.."}[30m])) / sum(rate(http_requests_total{route="/ai"}[30m]))) > 0.01
      for: 15m
      labels: { severity: page }
Traces told the story: 52% of time in OpenAI, 18% in Pinecone, 7% JSON parsing, 3% GC. That’s where we fought.
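That attribution only works when spans cross the process boundary. A minimal sketch of wrapping a downstream call with the OpenTelemetry API and propagating traceparent, assuming the Node SDK is initialized elsewhere; the span and attribute names are illustrative.
// Wrap a downstream POST in a span and propagate W3C trace context
import { trace, context, propagation, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('ai-gateway');
export async function tracedPost(url: string, body: unknown) {
  return tracer.startActiveSpan(`POST ${new URL(url).host}`, async (span) => {
    // Inject traceparent/tracestate into the outgoing headers
    const headers: Record<string, string> = { 'content-type': 'application/json' };
    propagation.inject(context.active(), headers);
    try {
      const res = await fetch(url, { method: 'POST', headers, body: JSON.stringify(body) });
      span.setAttribute('http.status_code', res.status);
      return res;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}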
We also killed “vibe coding” drift:
- Pinned LangChain to a known-good minor.
- Froze prompt templates behind flags and recorded prompt/version in spans.
- Built a regression suite with 200 canned conversations and golden responses. Changes that moved semantic accuracy beyond a threshold failed CI.
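A minimal sketch of that CI gate, comparing candidate answers to golden answers by embedding cosine similarity; the embed helper and the 0.9 threshold are illustrative, not the team’s actual implementation.
// CI gate – fail the build if answers drift too far from golden responses
type GoldenCase = { conversationId: string; golden: string; candidate: string };
// Hypothetical helper: returns an embedding vector for a string
// (in practice, call your embeddings endpoint and cache the golden vectors)
declare function embed(text: string): Promise<number[]>;
const cosine = (a: number[], b: number[]) => {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
};
export async function semanticRegressionGate(cases: GoldenCase[], threshold = 0.9) {
  const failures: string[] = [];
  for (const c of cases) {
    const [g, cand] = await Promise.all([embed(c.golden), embed(c.candidate)]);
    if (cosine(g, cand) < threshold) failures.push(c.conversationId);
  }
  if (failures.length > 0) {
    throw new Error(`Semantic regression in ${failures.length} conversations: ${failures.join(', ')}`);
  }
}
Wired into CI, a prompt or retrieval change that shifts answers past the threshold fails the build before a canary ever sees it.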
All shipped via GitOps with ArgoCD so every tweak was auditable and revertible. Boring on purpose.
The Numbers (Before/After, Real Dollars Included)
We didn’t change models. We changed the system around them.
- p95 latency: 3.8s -> 650ms (−83%).
- Perceived latency (time-to-first-token): 1.9s -> 220ms (−88%).
- 5xx rate on AI endpoints: 7.2% -> 0.6% (roughly 10x lower).
- MTTR: 72m -> 9m, thanks to SLO-driven alerts and one-click rollback.
- Cost per 1k requests (compute + model + vector): $0.41 -> $0.19 (−54%).
- Throughput sustained: 2.3k RPS steady for 2 hours with no SLO burn (load sim plus real customer pilot).
Business impact:
- Monday renewal closed. Support leaders reported a 14% decrease in handle time on AI-assisted tickets within two weeks.
- Cloud bill came in 18% under forecast despite higher usage.
I’ve seen teams chase model swaps for months to get these gains. They’re in your plumbing, not your prompt.
What We’d Do Differently (And What You Can Steal Now)
If we had another week:
- Swap bullmq for a proper broker (e.g., RabbitMQ with quorum queues) for stricter backpressure semantics.
- Add adaptive concurrency limits (AIMD) per customer to prevent one whale from starving everyone else (sketch after this list).
- Introduce a shadow traffic lane to evaluate model/prompt changes with real inputs.
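AIMD is the same additive-increase/multiplicative-decrease loop TCP uses, applied to a per-customer concurrency cap: grow the cap slowly on success, halve it on 429s or timeouts. A minimal sketch, with all numbers illustrative.
// Per-customer AIMD concurrency limiter (additive increase, multiplicative decrease)
class AimdLimiter {
  private limit: number;
  private inFlight = 0;
  constructor(private readonly minLimit = 1, private readonly maxLimit = 32) {
    this.limit = minLimit;
  }
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false; // shed or queue the request
    this.inFlight++;
    return true;
  }
  release(outcome: 'ok' | 'throttled'): void {
    this.inFlight--;
    if (outcome === 'ok') {
      this.limit = Math.min(this.maxLimit, this.limit + 1); // additive increase
    } else {
      this.limit = Math.max(this.minLimit, Math.floor(this.limit / 2)); // multiplicative decrease
    }
  }
}
// One limiter per customer so a single tenant can't starve the rest
const limiters = new Map<string, AimdLimiter>();
export const limiterFor = (customerId: string) => {
  if (!limiters.has(customerId)) limiters.set(customerId, new AimdLimiter());
  return limiters.get(customerId)!;
};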
You can steal this today:
- Put a DestinationRule with outlier detection on every external AI dependency.
- Cache prompts; version the key on embeddings + prompt + model.
- Stream to the user; optimize for time-to-first-token.
- Scale on in-flight requests or tokens/sec, not CPU.
- Canary everything touching prompts, retrieval, or models. Treat them like schema changes.
And for the love of uptime, ban “vibe code” in hot paths. AI-generated code is fine, but it needs a grown-up review. We call it a “vibe code cleanup.” GitPlumbers does this weekly.
The Playbook (Condensed)
- Define SLOs up front: p95 <= 1s, 5xx < 1%, error budget per 30d.
- Add backpressure: queues and per-request concurrency caps.
- Enforce timeouts, jittered retries, and circuit breakers in the mesh and clients.
- Scale with KEDA on concurrency/tokens; ditch CPU-only HPA.
- Stream responses; cache prompts/answers with versioned keys.
- Instrument traces through Pinecone/OpenAI with propagated context.
- Ship via Argo Rollouts canaries; auto-rollback on SLO burn.
- Lock deps, pin prompts, run a golden-convo regression suite.
If your AI feature melts at 2k RPS, this is how you make it boring again in a week. Call us before Monday.
Key takeaways
- AI features fail under load when concurrency, backpressure, and rate limits aren’t first-class citizens.
- Autoscaling on CPU for LLM-heavy traffic is a trap; scale on concurrent requests or tokens/sec.
- Canary every change that touches prompts, models, or retrieval—behavioral regressions don’t show up in unit tests.
- Add circuit breakers and timeouts outside your app (service mesh) and inside your client libraries.
- Prompt caching and response streaming cut perceived latency and cost more than most model swaps.
Implementation checklist
- Define SLOs for latency and failure budgets before changing code.
- Add backpressure: queue or concurrency caps; never let callers pile on.
- Implement circuit breakers, retries with jitter, and short timeouts for LLM calls.
- Scale on a relevant signal (concurrency/tokens), not CPU.
- Introduce prompt caching keyed by prompt+retrieval snapshot.
- Ship via canaries with automated rollback on SLO burn.
- Instrument everything: traces from HTTP ingress to model call; label by customer and feature flag.
Questions we hear from teams
- Why not just switch to a faster/cheaper model?
- Model swaps hide systemic issues. We cut p95 by 83% without changing the model. Fix concurrency, backpressure, caching, and streaming first. Then benchmark models under realistic load with shadow traffic.
- Is CPU-based HPA ever OK for AI features?
- Rarely. LLM-heavy workloads bottleneck on external I/O and rate limits. Scale on in-flight requests or tokens/sec using KEDA or custom metrics. Keep CPU for background jobs or pure compute paths.
- How do you keep AI-generated code from rotting?
- Require human review for hot paths, pin dependencies, add golden conversation tests, and gate prompt/model changes behind flags. We call the initial pass a “vibe code cleanup,” then schedule quarterly refactors.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
