The AI Feature That Buckled at 4 p.m.—And How We Kept It Standing
A B2B SaaS rolled out an AI assistant, then real users showed up. Rate limits, token burn, and 12s p95s. GitPlumbers stabilized it under live fire—no replatform, no pause button.
> “We were hemorrhaging tokens and user trust. GitPlumbers didn’t sell magic—they put in guardrails. Two weeks later, demos stopped being scary.” — CTO, mid-market B2B SaaS
The 4 p.m. Faceplant You’ve Lived Through
They ran a webinar, 1,800 signups, and the new AI drafting assistant was the star. Four minutes into the demo, the logs lit up: 429 storms from the LLM vendor, Node.js event loops pinned, vector queries queueing for seconds, and a token bill that looked like a Vegas weekend. p95 latency spiked to 12.4s, error rate to 8.7%. The sales team slacked “can we roll it back?” There was no rollback. The feature had quietly become core to the workflow.
I’ve seen this movie. AI features run great in staging, then die under real users and bursty traffic. The root cause is always the same: unbounded concurrency and wishful thinking about external dependencies.
GitPlumbers was called in “without turning anything off.” We didn’t replatform, didn’t change the frontend, and didn’t ask for a feature freeze. We stabilized it live.
What We Walked Into (Constraints Included)
Architecture, warts and all:
- `next.js` frontend calling a Node/TypeScript API (`express`) that orchestrates prompts via `langchain` to Azure OpenAI (`gpt-4o` for generation, `text-embedding-3-small` for embeddings)
- RAG on `PostgreSQL 14` with `pgvector 0.5`, naive `ivfflat` index, uneven tenant data creating hot partitions
- Redis-backed job system existed for email, not used for AI calls; AI path did sync fanout from web tier
- No rate limiting per tenant, no backpressure; retries on 429 were infinite with linear backoff (yep)
- Observability: basic `winston` logs and an aged Grafana pointing at CPU and memory only; no token metrics, no tracing
- Constraints: SOC2 in flight (no new vendors without review), multi-tenant SLAs, and a hard requirement of zero downtime. Timeline: 4 weeks to get to “reliably demo-able” and 8 weeks to hit SLOs
Success targets we agreed on:
- AI endpoint p95 < 3s, p99 < 6s under 250 sustained RPS
- Error rate < 1% excluding 4xx
- Reduce LLM spend by 50% without degrading output quality beyond a 5% hit in human evals
- MTTR < 10 minutes for AI incidents
The First 72 Hours: Put the Fire Doors In
We didn’t “optimize prompts” first. We installed guardrails so the system could breathe.
- Queue and cap concurrency
  - Moved AI calls off the request thread into `bullmq` with Redis. If the queue passed a threshold, we returned `429` with a friendly message rather than DDoSing the LLM.
```typescript
// api/ai.ts
import express from 'express'
import { Queue, Worker, QueueScheduler } from 'bullmq'
import CircuitBreaker from 'opossum'
import { aiCall } from './llm'

const router = express.Router()
const connection = { host: process.env.REDIS_HOST, port: 6379 }

const queue = new Queue('ai-requests', {
  connection,
  defaultJobOptions: {
    attempts: 2,
    backoff: { type: 'exponential', delay: 250 },
    removeOnComplete: true,
  },
})

// Needed for delayed retries on older BullMQ releases; newer versions fold this into the Worker
new QueueScheduler('ai-requests', { connection })

const breaker = new CircuitBreaker(aiCall, {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 10000,
})

// Backpressure: don’t accept more than 1k waiting jobs
router.post('/v1/ai', async (req, res) => {
  const waiting = await queue.getWaitingCount()
  if (waiting > 1000) return res.status(429).send('Busy. Try again in a moment.')
  const job = await queue.add('chat', { tenant: req.tenantId, payload: req.body })
  res.status(202).json({ id: job.id })
})

// Workers (separate deployment) run the actual LLM call
new Worker('ai-requests', async (job) => {
  return breaker.fire(job.data)
}, { connection, concurrency: Number(process.env.AI_WORKER_CONCURRENCY || 8) })

export default router
```
- Scale on demand, not hope
  - Hooked queue depth to autoscaling via KEDA’s Prometheus scaler. We exported a metric `bull_queue_waiting{queue="ai-requests"}` and scaled workers accordingly.
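Exporting that gauge was a small poller on the worker deployment. A minimal sketch, assuming a hypothetical `queue-metrics.ts`, the `bullWaiting` gauge and `registry` from the `metrics.ts` shown later, and a Prometheus scrape of `/metrics`:

```typescript
// queue-metrics.ts (hypothetical): refresh queue depth so KEDA's Prometheus scaler can see it
import express from 'express'
import { Queue } from 'bullmq'
import { bullWaiting, registry } from './metrics' // gauge + registry defined in metrics.ts later in this post

const queue = new Queue('ai-requests', { connection: { host: process.env.REDIS_HOST, port: 6379 } })

// Poll BullMQ every few seconds and set the gauge that Prometheus scrapes
setInterval(async () => {
  bullWaiting.set({ queue: 'ai-requests' }, await queue.getWaitingCount())
}, 5000)

const app = express()
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType)
  res.send(await registry.metrics())
})
app.listen(9464) // any port your Prometheus scrape config points at
```

The ScaledObject that turns that number into worker replicas: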
```yaml
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker
spec:
  scaleTargetRef:
    name: ai-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:80
        query: sum(bull_queue_waiting{queue="ai-requests"})
        threshold: "200" # one worker per ~200 waiting jobs
```
- Timeouts, retry budgets, and circuit breaking
  - Hard cap on LLM call duration (5s) and a per-tenant retry budget (2 tries, jittered backoff). No infinite retries. Bad tenants can’t starve good ones.
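The 5s cap lives in `llm.ts` just below. The per-tenant retry budget was a thin wrapper around the breaker call, roughly this sketch (the `withRetryBudget` helper, Redis key scheme, and budget numbers are illustrative, not the client's exact values):

```typescript
// retry-budget.ts (sketch): two tries per request, capped retries per tenant per minute
import Redis from 'ioredis'

const redis = new Redis({ host: process.env.REDIS_HOST })
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

export async function withRetryBudget<T>(tenant: string, fn: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Illustrative budget: at most 20 retries per tenant per minute, and 2 tries per request
      const key = `retry-budget:${tenant}:${Math.floor(Date.now() / 60_000)}`
      const used = await redis.incr(key)
      await redis.expire(key, 120)
      if (attempt >= 1 || used > 20) throw err // budget spent: fail fast instead of starving other tenants
      // Jittered exponential backoff so synchronized retries don't stampede the vendor
      await sleep(250 * 2 ** attempt + Math.random() * 250)
    }
  }
}
```

In the worker this wraps `breaker.fire(...)`, so a noisy tenant burns its own retry headroom, not everyone else's.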
```typescript
// llm.ts
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

export async function aiCall({ tenant, payload }: any) {
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 5000)
  try {
    return await openai.chat.completions.create(
      {
        model: 'gpt-4o-mini',
        messages: payload.messages,
        max_tokens: 600,
        temperature: 0.2,
      },
      // Per-request options: the SDK timeout plus an abort signal as a hard 5s cap
      { timeout: 5000, signal: controller.signal },
    )
  } finally { clearTimeout(timeout) }
}
```
- Fast load test, realistic traffic
  - Reproduced the exact meltdown with `k6` + recorded user journeys. This gave us a baseline and caught regressions fast.
```javascript
// load.js
import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 250 },
    { duration: '5m', target: 0 },
  ],
}

export default function () {
  const payload = JSON.stringify({ messages: [{ role: 'user', content: 'Draft a proposal for ACME' }] })
  const res = http.post('https://api.example.com/v1/ai', payload, { headers: { 'Content-Type': 'application/json' } })
  check(res, { 'status is 202 or 200': (r) => r.status === 202 || r.status === 200 })
  sleep(1)
}
```

Result after 3 days: 429s down 63%, median latency down 58%, web-tier CPU back to normal. Users noticed the system felt “responsive again,” even when it returned a polite “busy” instead of silently spinning.
Make the AI Layer Deterministic (Cut Hallucinations and Cost)
The original path streamed free-form text and regex-parsed “steps.” Under load, small formatting errors cascaded. We made outputs machine-checkable and shorter.
- Switched to `response_format: json_schema` with Zod validation
- Shortened the `system` prompt and cached it; added an anti-prompt-injection preface
- Tiered models: default to `gpt-4o-mini` for routine drafting; escalate to `gpt-4o` or an Anthropic fallback only when needed (documented by tags; routing sketched below)
- Capped `max_tokens` by task type; cut context by chunking and filters
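Escalation stayed boring: a routing function rather than a framework. A sketch under assumed heuristics (the tags and token threshold are examples, not the client's actual rules):

```typescript
// model-tiering.ts (sketch): cheap model by default, premium only for tagged or oversized tasks
type DraftTask = { tags: string[]; estimatedTokens: number }

export function pickModel(task: DraftTask): 'gpt-4o' | 'gpt-4o-mini' {
  // Example heuristics: explicit escalation tags or unusually large context earn the premium model
  const needsPremium = task.tags.includes('escalate') || task.estimatedTokens > 2500
  return needsPremium ? 'gpt-4o' : 'gpt-4o-mini'
}
```

The schema-constrained call itself: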
```typescript
import { z } from 'zod'
import OpenAI from 'openai'

const PlanSchema = z.object({
  title: z.string().max(120),
  steps: z.array(z.string()).min(3).max(7),
  risk: z.array(z.string()).max(5).optional(),
})

// Hand-built JSON Schema mirroring PlanSchema (or generate it with a zod-to-json-schema helper)
const jsonSchema = {
  name: 'Plan',
  schema: {
    type: 'object',
    properties: {
      title: { type: 'string', maxLength: 120 },
      steps: { type: 'array', items: { type: 'string' }, minItems: 3, maxItems: 7 },
      risk: { type: 'array', items: { type: 'string' }, maxItems: 5 },
    },
    required: ['title', 'steps'],
    additionalProperties: false,
  },
}

async function draftPlan(messages) {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
  const resp = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages,
    temperature: 0.2,
    response_format: { type: 'json_schema', json_schema: jsonSchema },
    max_tokens: 500,
  })
  const content = resp.choices[0].message.content || '{}'
  // Validate the model output against the same rules; a parse failure surfaces as a retryable error
  return PlanSchema.parse(JSON.parse(content))
}
```

RAG tuning (don’t overcomplicate):
- Moved from `ivfflat` to an HNSW index, added tenant and document-type filters
- Reduced chunk size to 600–800 tokens with overlap 80 for better recall without bloating context
- Precomputed embeddings asynchronously; cached top-k results per query key for 10 minutes in Redis
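The 10-minute cache is a dozen lines. A sketch, where `searchPgvector` stands in for the real embedding-plus-SQL lookup and the key scheme is an assumption:

```typescript
// rag-cache.ts (sketch): 10-minute Redis cache in front of pgvector top-k retrieval
import { createHash } from 'crypto'
import Redis from 'ioredis'

const redis = new Redis({ host: process.env.REDIS_HOST })

export async function cachedTopK(tenant: string, query: string, k = 8) {
  // Key on tenant + normalized query so cache hits never cross tenant boundaries
  const key = `rag:${tenant}:${createHash('sha1').update(query.trim().toLowerCase()).digest('hex')}:${k}`
  const hit = await redis.get(key)
  if (hit) return JSON.parse(hit)

  const results = await searchPgvector(tenant, query, k) // stand-in for the real embedding + SQL lookup
  await redis.set(key, JSON.stringify(results), 'EX', 600) // 10-minute TTL
  return results
}

// Hypothetical signature for the underlying vector search
declare function searchPgvector(tenant: string, query: string, k: number): Promise<unknown[]>
```

And the index change behind the uncached path was a single statement: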
```sql
-- pgvector HNSW (PostgreSQL 14+, pgvector >= 0.5)
CREATE INDEX CONCURRENTLY IF NOT EXISTS docs_embedding_hnsw
  ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
```

Outcome in 2 weeks:
- Token usage per request down 40%
- Human eval “on-target” scores improved +9% (deterministic JSON helps UX a lot)
- Hallucination incidents (support tickets tagged) down ~70%
Observe What Matters (Tokens, Latency, Queues)
We wired three lenses: metrics, tracing, and LLM analytics.
- Prometheus for Golden Signals + tokens and cache hits
- OpenTelemetry traces to Tempo/Jaeger with LLM spans annotated (model, tokens, retry count)
- `langfuse` for prompt-level analytics and experiment flags
```typescript
// metrics.ts
import client from 'prom-client'

export const registry = new client.Registry()

export const llmDuration = new client.Histogram({
  name: 'llm_request_duration_seconds',
  help: 'LLM latency',
  buckets: [0.2, 0.5, 1, 2, 3, 5, 8],
  labelNames: ['model', 'tenant'],
})

export const llmTokens = new client.Counter({
  name: 'llm_tokens_total',
  help: 'Total tokens by model',
  labelNames: ['model', 'type'],
})

export const bullWaiting = new client.Gauge({
  name: 'bull_queue_waiting',
  help: 'Waiting jobs per queue',
  labelNames: ['queue'],
})

registry.registerMetric(llmDuration)
registry.registerMetric(llmTokens)
registry.registerMetric(bullWaiting)
```

On-call stopped guessing. Grafana showed p95s per tenant and per model, and we set alerts that mapped to SLOs, not CPU graphs.
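Recording happens where the call happens. A sketch of the worker-side wrapper, assuming the counters above and the OpenAI `usage` field on the completion response:

```typescript
// Worker-side instrumentation (sketch): time each LLM call and count tokens per model and tenant
import { llmDuration, llmTokens } from './metrics'

export async function instrumentedCall(tenant: string, model: string, fn: () => Promise<any>) {
  const stopTimer = llmDuration.startTimer({ model, tenant })
  try {
    const resp = await fn()
    llmTokens.inc({ model, type: 'prompt' }, resp.usage?.prompt_tokens ?? 0)
    llmTokens.inc({ model, type: 'completion' }, resp.usage?.completion_tokens ?? 0)
    return resp
  } finally {
    stopTimer() // observes elapsed seconds into the histogram
  }
}
```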
Roll Out Without Russian Roulette
We stopped doing “all-or-nothing” deploys. We used Argo Rollouts for canaries tied to Prometheus queries.
```yaml
# rollout.yaml (snippet)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-worker
spec:
  replicas: 6
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: p95-latency
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: p95-latency
      trafficRouting:
        istio: { virtualService: { name: ai-worker, routes: [primary] } }
```

If p95 > 3s or error rate > 1%, Rollouts auto-rolled back. We also practiced failure: chaos testing with blocked egress to the LLM vendor to prove circuit breakers actually did their job.
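“Practiced failure” was literal: block egress to the vendor and prove the breaker opens instead of queueing retries. A minimal sketch of that drill using the same `opossum` configuration as `api/ai.ts` (the failing stub stands in for `aiCall` with egress blocked):

```typescript
// breaker-drill.ts (sketch): with the vendor unreachable, the breaker must open and fail fast
import CircuitBreaker from 'opossum'

const blockedVendor = async () => { throw new Error('egress blocked') } // stand-in for aiCall during the chaos test

const breaker = new CircuitBreaker(blockedVendor, {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 10000,
})

async function drill() {
  for (let i = 0; i < 20; i++) {
    await breaker.fire().catch(() => undefined) // each failure counts toward the error threshold
  }
  if (!breaker.opened) throw new Error('breaker never opened; retries would hammer the vendor')
  console.log('breaker open: calls fail fast until resetTimeout elapses')
}

drill()
```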
Results You Can Take to a CFO (and an SRE)
Four weeks in, under a customer event with real traffic:
- Sustained throughput: 30 RPS → 280 RPS (9.3x) without user-visible degradation
- p95 latency: 12.4s → 2.1s; p99: >20s → 4.8s
- Error rate: 8.7% → 0.7% (429s down 93%)
- LLM spend: -58% month-over-month; token burn down 40% per request; monthly dropped from ~$92k to ~$38k
- Cache hit ratio on RAG results: 62% within 10 min TTL windows
- MTTR on AI incidents: 46 min → 9 min (runbooks + alerts + traces)
Business impact: they kept the feature on during their biggest quarter; churn risk on two key accounts evaporated after performance stabilized.
What We’d Reuse Tomorrow (Do This First)
- Move AI work off the web thread. Use a queue; reject early when backpressure exceeds SLOs.
- Wrap LLM calls in circuit breakers with hard timeouts and per-tenant retry budgets.
- Make outputs deterministic with JSON schema and validate with Zod.
- Tier models. Default to an efficient model; promote to premium only when needed.
- Tune RAG with HNSW, smaller chunks, and cached top-k results.
- Scale workers based on queue depth, not CPU.
- Instrument tokens, queue depth, p95, and error budget burn. Gate deployments with canaries.
If any of this sounds like the fire you’re fighting, GitPlumbers can slot into your stack without replatforming. We’ve stabilized AI features at fintechs, logistics platforms, and healthcare orgs under stricter constraints than this one.
Key takeaways
- Stability isn’t magic—it’s backpressure, timeouts, and observability wired end-to-end.
- Queue the AI calls; don’t let the web tier do unbounded synchronous LLM fanout.
- Treat the LLM as an unreliable, rate-limited dependency; use circuit breakers and retry budgets.
- Make the AI layer deterministic: JSON schema responses, short prompts, and result caching.
- Scale on queue depth, not CPU; tie autoscaling to demand with KEDA or HPA custom metrics.
- Instrument tokens, latency, and error budgets; make promotion decisions data-driven with canaries.
Implementation checklist
- Cap concurrency with a queue (e.g., `bullmq`) and reject when backpressure exceeds SLOs.
- Wrap LLM calls in a circuit breaker with strict timeouts; enforce retry budgets per tenant.
- Cache prompts and RAG results; promote a cheap default model with smart fallbacks.
- Tune vector search (HNSW, filters) and chunking; precompute embeddings asynchronously.
- Export Prometheus metrics for latency, tokens, and queue depth; wire Grafana and alerts.
- Scale workers on queue length using KEDA Prometheus scaler; define PDBs and liveness probes.
- Roll out with Argo Rollouts canaries gated by error rate and p95; bake in auto-rollback.
Questions we hear from teams
- Can you stabilize our AI feature without pausing development?
- Yes. In this engagement we shipped guardrails (queues, circuit breakers, autoscaling) in parallel with feature work. We isolate the stabilization changes behind flags and roll them out with canaries.
- Do we need to replatform our LLM stack?
- Usually not. We prefer least-change approaches: cap concurrency, add backpressure, and make outputs deterministic. If vendor limits are the bottleneck, we add smart fallbacks rather than wholesale migrations.
- Which vendors and tools do you work with?
- Azure OpenAI, OpenAI, Anthropic (via Bedrock), Vertex AI; Node/TypeScript, Python; Redis/BullMQ, Kafka; Prometheus/Grafana; OpenTelemetry/Tempo/Jaeger; ArgoCD/Rollouts; KEDA; Postgres/pgvector; Elastic/OpenSearch; Langfuse; LaunchDarkly.
- How fast can we see results?
- We target tangible improvements in 72 hours (fewer 429s, lower median latency) and SLO-grade stability in 2–6 weeks depending on constraints and traffic patterns.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
