The AI Copilot That Melted at P95: Stabilized Under Real Customer Load in 21 Days
A Series C SaaS shipped an AI sidecar that cratered under real users. We cut P95 from 2.8s to 650ms, slashed token spend 64%, and stopped the pager from ruining weekends.
“We went from apologizing to upselling. The AI features stopped being a liability.”
The incident you’ve lived through
A Series C SaaS in HR tech shipped an AI “copilot” that summarizes candidate pipelines and drafts outreach. Looks great in a demo. Under real customer load—2.2k RPS peak during Monday mornings—the whole thing face-planted. P95 spiked to 2.8s, OpenAI/Azure started throwing 429 and 5xx, and support tickets rolled in like a DDoS.
They asked GitPlumbers to stop the bleeding without a rewrite. Constraints were non-negotiable: SOC 2 Type II, EU data residency for EEA tenants, no prompt logging outside their VPC, and a runway that didn’t tolerate another quarter of "AI is slow today".
What we walked into:
- `LangChain` orchestration with unbounded parallel chains
- `pgvector` on RDS serving both transactional and vector workloads
- No cache for LLM responses; identical prompts hammered upstream
- No SLOs; dashboards were vibes
- Token spend ran ~$120k/month and growing 15% WoW
I’ve seen this movie. Here’s how we stabilized it in 21 days.
What was actually broken (and why it matters)
Four failure modes you probably recognize:
- Unbounded fan-out: UI clicks triggered 3–6 parallel LLM calls with no backpressure. At peak, a single tenant could saturate all workers.
- Vector search contention: `pgvector` queries competed with OLTP. Seq scans under load, bad planner choices, and no HNSW/IVFFlat indexing.
- Provider rate limits = cascading failure: thundering herds on `gpt-4o` via Azure EU. Without timeouts and jitter, retries amplified the blast radius.
- No cost or token guardrails: tenants could burn 10x their plan in minutes; ops had no per-tenant visibility.
Why this matters to leaders: you can’t forecast margin or reliability if your AI workloads act like a slot machine. You need hard controls, not vibes and dashboards.
Week 1 triage: make it boring to be on-call
The objective: drop errors and cap tail latency without new infra. We pulled three levers immediately.
- Clamp concurrency, add circuit breakers
  - Per-tenant concurrency via a distributed semaphore in `Redis`, plus a global breaker on the LLM client.
  - Hard timeouts at 1.5s for non-critical chains; retries with jitter and idempotency keys.
```python
# python 3.11
import asyncio
import json
import os
from contextlib import asynccontextmanager
from datetime import timedelta

import aiobreaker
import httpx
import redis.asyncio as redis
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

REDIS = redis.from_url(os.getenv("REDIS_URL"))
SEMAPHORE = "tenant:{tenant_id}:llm_sema"
MAX_INFLIGHT = 8  # per-tenant concurrency cap

breaker = aiobreaker.CircuitBreaker(
    fail_max=20,  # trip quickly under upstream flakiness
    timeout_duration=timedelta(seconds=30),
)

@asynccontextmanager
async def with_semaphore(tenant_id: str):
    # Distributed concurrency cap via a Redis in-flight counter
    key = SEMAPHORE.format(tenant_id=tenant_id)
    inflight = await REDIS.incr(key)
    await REDIS.expire(key, 2)  # auto-release if a worker dies
    if inflight > MAX_INFLIGHT:
        await REDIS.decr(key)  # revert
        raise RuntimeError("429: tenant concurrency limit")
    try:
        yield
    finally:
        await REDIS.decr(key)

@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(initial=0.1, max=0.8))
@breaker
async def call_llm(prompt: str, model: str = "gpt-4o-mini") -> dict:
    async with httpx.AsyncClient(timeout=1.5) as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.getenv('OPENAI_KEY')}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        )
        resp.raise_for_status()
        return resp.json()
```
- Cache and coalesce identical prompts
  - Server-side `Redis` cache with normalized prompts and a 60–300s TTL.
  - Request coalescing (single-flight) so N identical in-flight requests share one upstream call.
```python
import hashlib

async def cached_completion(prompt: str) -> dict:
    key = "llm:resp:" + hashlib.sha256(prompt.encode()).hexdigest()
    if val := await REDIS.get(key):
        return json.loads(val)
    # naive single-flight using a lock key
    lock_key = key + ":lock"
    if not await REDIS.set(lock_key, "1", nx=True, ex=5):
        # someone else is fetching; poll briefly, then re-check the cache
        await asyncio.sleep(0.05)
        return await cached_completion(prompt)
    try:
        data = await call_llm(prompt)
        await REDIS.set(key, json.dumps(data), ex=120)
        return data
    finally:
        await REDIS.delete(lock_key)
```
- Fail gracefully
  - Fall back to `gpt-4o-mini` or a local `vLLM` deployment for non-critical tasks when the breaker is open.
  - Return partial results with a banner instead of 500s.
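The graceful-degradation path can be sketched as follows — a minimal example where the model ladder, response shape, and `llm_call` parameter are illustrative; in practice `llm_call` would be a breaker-wrapped client like the triage code's `call_llm`:

```python
# Hedged sketch: degrade through a ladder of models instead of returning 500.
# Model names and the response shape are illustrative, not the client's API.
import asyncio

FALLBACK_LADDER = ["gpt-4o", "gpt-4o-mini", "local-vllm"]

async def complete_with_fallback(prompt: str, llm_call) -> dict:
    """Try each model in order; anything past the first is marked degraded."""
    last_err = None
    for i, model in enumerate(FALLBACK_LADDER):
        try:
            data = await llm_call(prompt, model=model)
            return {"data": data, "model": model, "degraded": i > 0}
        except Exception as err:  # breaker open, timeout, 429/5xx, ...
            last_err = err
    # every tier failed: return a partial payload the UI can banner, not a 500
    return {"data": None, "model": None, "degraded": True, "error": str(last_err)}
```

The UI renders a "results may be incomplete" banner whenever `degraded` is true, which is how partial results replace 500s.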
Result after 7 days:
- Error rate: 8% -> 2.1%
- P95 latency: 2.8s -> 1.1s
- Pager: 19 incidents/week -> 6
Weeks 2–3: fix the architecture, not just the symptoms
With fire under control, we took on the hotspots.
- Separate vector from OLTP and index correctly
  - Upgraded `pgvector` to 0.5.x and created HNSW indexes for common queries; long term, moved hot tenants to a managed `Qdrant` cluster.
  - Reduced query fan-out (top-200 -> top-40) and applied server-side MMR rerank.
```sql
-- Postgres: ensure vector extension and HNSW
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE embeddings ADD COLUMN IF NOT EXISTS v vector(1536);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_embeddings_hnsw ON embeddings
USING hnsw (v vector_cosine_ops) WITH (m = 16, ef_construction = 200);

-- typical query now uses the HNSW index; note `<=>` (cosine distance)
-- matches the vector_cosine_ops opclass, whereas `<->` (L2) would skip it
SELECT id, content FROM embeddings
ORDER BY v <=> $1
LIMIT 40;
```
- Move heavy work off the hot path
  - Extracted embedding generation to an `SQS`-backed worker with `KEDA` autoscaling. The UI path now reads precomputed vectors.
```yaml
# keda scaledobject for embeddings worker
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: embeddings-worker
spec:
  scaleTargetRef:
    name: embeddings-deployment
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/embeddings
        queueLength: "200"
```
- Rate limits and budgets as product features
- Per-tenant token budgets in Redis. Soft limit warns; hard limit downgrades to cheaper models and reduces max context.
```bash
# nightly job computes per-tenant token budgets
redis-cli HSET tenant:1234:budget monthly_tokens 5000000 warn_pct 0.8 hard_pct 1.0
```
- Network resilience at the edge
  - Istio egress `ServiceEntry` + `DestinationRule` to enforce outlier detection and short timeouts toward LLM providers.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: openai-egress
spec:
  hosts:
    - api.openai.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: openai-dr
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
# note: the retry policy (attempts: 2, perTryTimeout: 800ms,
# retryOn: 5xx,reset,connect-failure) lives on the VirtualService
# route, not the DestinationRule
```
- Harden prompts and validate outputs
  - Canonicalized prompts with stable templates; enforced a JSON schema with `pydantic` to avoid downstream parse failures.
Net effect by end of week 3:
- P95: 1.1s -> 720ms
- Cache hit rate (LLM responses): 0% -> 62%
- Token spend run-rate: $120k -> ~$58k/month
Observability and control: measure what actually costs you
You can’t manage what you can’t see. We instrumented everything with OpenTelemetry and lit up SLOs in Prometheus.
- Trace across UI -> API -> LLM
```python
# opentelemetry for python fastapi
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "ai-api"}))
trace.set_tracer_provider(provider)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
```
- Token and cost metrics
  - Custom metrics: `tokens_in_total`, `tokens_out_total`, `llm_cost_usd` labeled by `tenant`, `model`, `prompt_version`.
  - Dashboards show cost-per-tenant and burn rate against plan.
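A minimal sketch of those counters with `prometheus_client` — the per-1k-token price table is illustrative, not the team's actual rate card:

```python
# Hedged sketch: per-tenant token/cost counters. Prices are illustrative.
from prometheus_client import Counter

LABELS = ["tenant", "model", "prompt_version"]
TOKENS_IN = Counter("tokens_in_total", "Prompt tokens consumed", LABELS)
TOKENS_OUT = Counter("tokens_out_total", "Completion tokens produced", LABELS)
LLM_COST = Counter("llm_cost_usd", "Estimated LLM spend in USD", LABELS)

# illustrative ($ per 1k input tokens, $ per 1k output tokens)
PRICES = {"gpt-4o": (0.0025, 0.01), "gpt-4o-mini": (0.00015, 0.0006)}

def record_usage(tenant: str, model: str, version: str, tin: int, tout: int) -> float:
    """Increment token/cost counters for one call; return the estimated cost."""
    pin, pout = PRICES.get(model, (0.0, 0.0))
    cost = tin / 1000 * pin + tout / 1000 * pout
    labels = {"tenant": tenant, "model": model, "prompt_version": version}
    TOKENS_IN.labels(**labels).inc(tin)
    TOKENS_OUT.labels(**labels).inc(tout)
    LLM_COST.labels(**labels).inc(cost)
    return cost
```

With these three series, cost-per-tenant and burn-rate-against-plan are single PromQL queries rather than spreadsheet archaeology.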
- SLOs and burn-rate alerts
```yaml
# prometheus alert: 14m/1h multi-window burn for 99% availability SLO
- alert: AICopilotHighErrorBurn
  expr: |
    (sum(rate(http_requests_total{job="ai-api",status=~"5..|429"}[14m]))
      /
     sum(rate(http_requests_total{job="ai-api"}[14m]))) > (14.4 * (1 - 0.99))
    and
    (sum(rate(http_requests_total{job="ai-api",status=~"5..|429"}[1h]))
      /
     sum(rate(http_requests_total{job="ai-api"}[1h]))) > (6 * (1 - 0.99))
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "AI Copilot error budget burn is too high"
```
- Privacy and compliance
  - Redacted PII from spans; prompts stored hashed with per-tenant salts and only when `consent=true`.
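Per-tenant salted hashing can be sketched as follows — `get_tenant_salt` is a hypothetical lookup (in production it would come from a KMS-backed secret store, not a derived constant):

```python
# Hedged sketch: fingerprint prompts so identical ones dedupe for caching and
# metrics without raw text ever leaving the VPC. `get_tenant_salt` is a
# hypothetical helper; the salt derivation here is for illustration only.
import hashlib
import hmac

def get_tenant_salt(tenant_id: str) -> bytes:
    # hypothetical: fetch from a secrets manager; derived inline for the sketch
    return b"per-tenant-secret-" + tenant_id.encode()

def prompt_fingerprint(tenant_id: str, prompt: str) -> str:
    """Stable, non-reversible ID for a (tenant, prompt) pair."""
    normalized = " ".join(prompt.split()).lower()  # canonicalize whitespace/case
    return hmac.new(
        get_tenant_salt(tenant_id), normalized.encode(), hashlib.sha256
    ).hexdigest()
```

Because the salt differs per tenant, identical prompts from different tenants produce different fingerprints, so cross-tenant correlation is impossible even if the hashes leak.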
Now ops could answer: which tenants are noisy, which prompts regress, and which model changes hurt margin.
Ship safely: canaries, flags, and real traffic replay
We refused to YOLO changes during business hours.
- Argo Rollouts with canary + auto-rollback on latency/error SLOs
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
      trafficRouting:
        istio:
          virtualService: {name: ai-api-vs, routes: [primary]}
      analysis:
        templates:
          - templateName: ai-slo-check
        startingStep: 1
```
- Feature flags via `OpenFeature` to flip model versions per tenant.
- Traffic replay with `GoReplay` and `k6` to validate under production-like load before promoting.
```javascript
// k6 smoke for AI endpoint
import http from 'k6/http';
import { sleep, check } from 'k6';

export let options = { vus: 50, duration: '3m' };

export default function () {
  const res = http.post(
    'https://api.example.com/ai/summarize',
    JSON.stringify({ text: '...' }),
    { headers: { 'Content-Type': 'application/json' } },
  );
  check(res, {
    'status was 200': (r) => r.status === 200,
    // per-request duration check; the true p95 gate lives in `options.thresholds`
    'p95<800ms': (r) => r.timings.duration < 800,
  });
  sleep(1);
}
```
This is the boring, repeatable way to ship AI changes when revenue depends on it.
Results: from firefight to flywheel
By day 21, we had numbers we’d share with a board:
- P95 latency: 2.8s -> 650ms (4.3x faster)
- P99 latency: 7.4s -> 1.9s
- Error rate: 8.0% -> 0.4%
- Token spend: ~$120k/month -> ~$43k/month (64% reduction)
- Cache hit rate (LLM responses): 62% sustained during peak
- Queue backlog: 1.2M -> <10k steady-state in 24h
- MTTR: 96m -> 18m (fewer incidents, faster rollbacks)
- SLO: 99% availability met for 30 consecutive days
Business impact:
- Support tickets down 71%
- NPS for AI features +14 points in a month
- Finance finally had a unit-economics dashboard the CRO trusted
“We went from apologizing to upselling. The AI features stopped being a liability.” — Acting CTO
What we’d do again (and what to steal from this)
Steal these patterns verbatim:
- Put a budget on everything: tokens, concurrency, context length.
- Cache aggressively and coalesce identical prompts; most UI triggers are repetitive within minutes.
- Treat LLMs like external databases: timeouts, retries with jitter, and circuit breakers.
- Don’t mix OLTP with vector search under load; either index properly (HNSW/IVFFLAT) or split the workload.
- Instrument token/cost per tenant and per prompt version; you can’t optimize blind.
- Ship with canaries and real traffic replay. “Works on staging” is comedy.
If your AI sidecar is wobbling today, we’ll help you make it boring. That’s our happy place at GitPlumbers.
Key takeaways
- Cache and coalesce identical LLM requests—most UI-triggered prompts are low-cardinality under burst.
- Treat LLMs like flaky dependencies: timeouts, retries with jitter, circuit breakers, and budgets per tenant.
- Move heavy work off the hot path: precompute embeddings and debounce writes with queues and autoscaling consumers.
- Measure what matters: token spend per tenant, prompt versions, P95/P99, and burn-rate against SLOs.
- Ship safely under fire: canaries, feature flags, and replayed production traffic before flipping defaults.
Implementation checklist
- Establish SLOs for AI endpoints before optimization.
- Add per-tenant concurrency caps and token budgets in Redis.
- Implement request coalescing and layered caching for LLM responses.
- Introduce circuit breakers/timeouts for external LLM providers.
- Precompute embeddings with a queue + KEDA autoscaling.
- Optimize vector search (HNSW/IVFFLAT) and reduce RPS to the DB.
- Instrument with OpenTelemetry and create token/cost dashboards.
- Use Argo Rollouts canary with automated rollback conditions.
Questions we hear from teams
- What if we can’t log prompts due to compliance?
- Hash prompts with per-tenant salts to dedupe for caching/metrics without storing raw text. Use on-by-default redaction in OpenTelemetry and store raw prompts only with explicit tenant consent.
- We’re on `pgvector` and can’t introduce another DB. Is stabilization still possible?
- Yes. Upgrade the extension, add HNSW/IVFFLAT indexes, reduce top-k, and move embedding writes to a queue. You can buy 2–3x latency improvement without leaving Postgres.
- Do we have to switch LLM providers to fix cost/latency?
- Usually no. Multi-model fallbacks and right-sizing context length get you most of the win. We often keep Azure OpenAI for regulated tenants and add a cheaper model for non-critical paths.
- How long does this kind of stabilization take?
- We aim for 2–4 weeks. Week 1 is triage (circuit breakers, caching, caps). Weeks 2–3 tackle data paths and observability. Week 4 is polish and handoff with dashboards and runbooks.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
