The AI Copilot That Melted at P95: Stabilized Under Real Customer Load in 21 Days

A Series C SaaS shipped an AI sidecar that cratered under real users. We cut P95 from 2.8s to 650ms, slashed token spend 64%, and stopped the pager from ruining weekends.

“We went from apologizing to upselling. The AI features stopped being a liability.”

The incident you’ve lived through

A Series C SaaS in HR tech shipped an AI “copilot” that summarizes candidate pipelines and drafts outreach. Looks great in a demo. Under real customer load—2.2k RPS peak during Monday mornings—the whole thing face-planted. P95 spiked to 2.8s, OpenAI/Azure started throwing 429 and 5xx, and support tickets rolled in like a DDoS.

They asked GitPlumbers to stop the bleeding without a rewrite. Constraints were non-negotiable: SOC 2 Type II, EU data residency for EEA tenants, no prompt logging outside their VPC, and a runway that didn’t tolerate another quarter of "AI is slow today".

What we walked into:

  • LangChain orchestration with unbounded parallel chains
  • pgvector on RDS serving both transactional and vector workloads
  • No cache for LLM responses; identical prompts hammered upstream
  • No SLOs; dashboards were vibes
  • Token spend ran ~$120k/month and growing 15% WoW

I’ve seen this movie. Here’s how we stabilized it in 21 days.

What was actually broken (and why it matters)

Four failure modes you probably recognize:

  • Unbounded fan-out: UI clicks triggered 3–6 parallel LLM calls with no backpressure. At peak, a single tenant could saturate all workers.
  • Vector search contention: pgvector queries competed with OLTP. Seq scans under load, bad planner choices, and no HNSW/IVFFLAT indexing.
  • Provider rate limits = cascading failure: Thundering herds on gpt-4o via Azure EU. Without timeouts/jitter, retries amplified the blast radius.
  • No cost or token guardrails: Tenants could burn 10x their plan in minutes; ops had no per-tenant visibility.

Why this matters to leaders: you can’t forecast margin or reliability if your AI workloads act like a slot machine. You need hard controls, not vibes and dashboards.

Week 1 triage: make it boring to be on-call

The objective: drop errors and cap tail latency without new infra. We cut three levers immediately.

  1. Clamp concurrency, add circuit breakers
  • Per-tenant concurrency via a distributed semaphore in Redis and a global breaker on the LLM client.
  • Hard timeouts at 1.5s for non-critical chains; retries with jitter and idempotency keys.
# python 3.11
import asyncio
import json
import os
from contextlib import asynccontextmanager
from datetime import timedelta

import aiobreaker
import httpx
import redis.asyncio as redis
from tenacity import retry, stop_after_attempt, wait_exponential_jitter

REDIS = redis.from_url(os.getenv("REDIS_URL"))
SEMAPHORE = "tenant:{tenant_id}:llm_sema"

breaker = aiobreaker.CircuitBreaker(
    fail_max=20,  # trip quickly under upstream flakiness
    timeout_duration=timedelta(seconds=30),
)

@asynccontextmanager
async def with_semaphore(tenant_id: str, limit: int = 8):
    # Count in-flight calls per tenant; the short TTL auto-releases
    # slots if a worker dies mid-request.
    key = SEMAPHORE.format(tenant_id=tenant_id)
    active = await REDIS.incrby(key, 1)
    await REDIS.expire(key, 2)
    if active > limit:
        await REDIS.decrby(key, 1)  # revert
        raise RuntimeError("429: tenant concurrency limit")
    try:
        yield
    finally:
        await REDIS.decrby(key, 1)

# retry wraps the breaker so every attempt counts toward tripping it
@retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(initial=0.1, max=0.8))
@breaker
async def call_llm(prompt: str, model: str = "gpt-4o-mini"):
    async with httpx.AsyncClient(timeout=1.5) as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.getenv('OPENAI_KEY')}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        )
        resp.raise_for_status()
        return resp.json()
  2. Cache and coalesce identical prompts
  • Server-side Redis cache with normalized prompts and 60–300s TTL.
  • Request coalescing (single-flight) so N identical in-flight requests share one upstream call.
import hashlib

async def cached_completion(prompt: str, max_wait: float = 2.0) -> dict:
    key = "llm:resp:" + hashlib.sha256(prompt.encode()).hexdigest()
    lock_key = key + ":lock"
    deadline = asyncio.get_running_loop().time() + max_wait
    while True:
        if val := await REDIS.get(key):
            return json.loads(val)
        # single-flight: first caller takes the lock and fetches upstream
        if await REDIS.set(lock_key, "1", nx=True, ex=5):
            break
        if asyncio.get_running_loop().time() > deadline:
            return await call_llm(prompt)  # don't stall forever behind a dead fetcher
        await asyncio.sleep(0.05)  # someone else is fetching; poll the cache
    try:
        data = await call_llm(prompt)
        await REDIS.set(key, json.dumps(data), ex=120)
        return data
    finally:
        await REDIS.delete(lock_key)
  3. Fail gracefully
  • Fallback to gpt-4o-mini or local vLLM for non-critical tasks when the breaker is open.
  • Return partial results with a banner instead of 500s.
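The breaker-open fallback can be sketched like this; `BreakerOpen` stands in for aiobreaker's `CircuitBreakerError`, and the model names and `degraded` flag are illustrative, not the production wiring:

```python
import asyncio

class BreakerOpen(Exception):
    """Stand-in for aiobreaker's CircuitBreakerError."""

async def complete_with_fallback(prompt: str, call_llm, fallback_model: str = "gpt-4o-mini") -> dict:
    # call_llm is the breaker-wrapped client from the triage section.
    try:
        return await call_llm(prompt, model="gpt-4o")
    except BreakerOpen:
        # Breaker is open: degrade to the cheaper model instead of 500ing.
        result = await call_llm(prompt, model=fallback_model)
        result["degraded"] = True  # UI renders the partial-results banner
        return result
```

The point is that "breaker open" becomes a product decision (cheaper answer, visible banner), not an error page.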

Result after 7 days:

  • Error rate: 8% -> 2.1%
  • P95 latency: 2.8s -> 1.1s
  • Pager: 19 incidents/week -> 6

Weeks 2–3: fix the architecture, not just the symptoms

With fire under control, we took on the hotspots.

  • Separate vector from OLTP and index correctly
    • Upgraded pgvector to 0.5.x and created HNSW indexes for common queries; long-term, moved hot tenants to a managed Qdrant cluster.
    • Reduced query fan-out (top-200 -> top-40) and applied server-side MMR rerank.
-- Postgres: ensure vector extension and HNSW
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE embeddings ADD COLUMN IF NOT EXISTS v vector(1536);
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_embeddings_hnsw ON embeddings
USING hnsw (v vector_cosine_ops) WITH (m = 16, ef_construction = 200);

-- typical query now hits the HNSW index; <=> is cosine distance,
-- matching the vector_cosine_ops opclass above (<-> is L2 and would
-- bypass this index)
SELECT id, content FROM embeddings
ORDER BY v <=> $1
LIMIT 40;
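The server-side MMR rerank mentioned above is roughly this shape. A sketch only: the λ weight and the pure-NumPy cosine helper are illustrative, not the production implementation.

```python
import numpy as np

def mmr_rerank(query_vec, doc_vecs, k=10, lam=0.7):
    """Maximal Marginal Relevance: trade query relevance against
    redundancy with documents already selected."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ closer to 0 the rerank favors diversity, which is why near-duplicate candidate summaries stop crowding the top of the list.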
  • Move heavy work off the hot path
    • Extracted embedding generation to an SQS-backed worker with KEDA autoscaling. UI path now reads precomputed vectors.
# keda scaledobject for embeddings worker
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: embeddings-worker
spec:
  scaleTargetRef:
    name: embeddings-deployment
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123/embeddings
      queueLength: "200"
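The worker that the `ScaledObject` scales is, in outline, a receive/embed/store/delete loop. A sketch with the SQS client injected (in production it would be a `boto3` client); the queue URL matches the trigger above, and `embed`/`store` are placeholders for the embedding call and vector write:

```python
import json

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123/embeddings"

def drain_once(sqs, embed, store) -> int:
    """Process one receive batch: embed each doc, persist the vector,
    then delete the message so SQS redelivers on failure."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    handled = 0
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        store(body["doc_id"], embed(body["text"]))
        # Delete only after a successful write; a crash before this line
        # means the message reappears after the visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled
```

Deleting after the write (not before) is what makes the pipeline at-least-once rather than lossy.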
  • Rate limits and budgets as product features
    • Per-tenant token budgets in Redis. Soft limit warns; hard limit downgrades to cheaper models and reduces max context.
# nightly job computes per-tenant token budgets
redis-cli HSET tenant:1234:budget monthly_tokens 5000000 warn_pct 0.8 hard_pct 1.0
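At request time, the budget check maps month-to-date usage onto a model tier. A sketch using the hash fields from the nightly job; the thresholds and model names are illustrative:

```python
def pick_model(usage: int, budget: dict) -> tuple[str, bool]:
    """Return (model, warn) for a tenant given token usage and the
    budget hash written by the nightly job (values arrive as strings
    from Redis)."""
    monthly = int(budget["monthly_tokens"])
    warn_at = monthly * float(budget["warn_pct"])
    hard_at = monthly * float(budget["hard_pct"])
    if usage >= hard_at:
        return "gpt-4o-mini", True  # hard limit: downgrade + shrink context
    if usage >= warn_at:
        return "gpt-4o", True       # soft limit: keep model, surface a warning
    return "gpt-4o", False
```

The hard limit degrades quality instead of refusing service, which keeps the feature usable while protecting margin.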
  • Network resilience at the edge
    • Istio egress ServiceEntry + DestinationRule to enforce outlier detection and short timeouts toward LLM providers.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: openai-egress
spec:
  hosts:
  - api.openai.com
  ports:
  - number: 443
    name: https
    protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: openai-dr
spec:
  host: api.openai.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    outlierDetection:
      consecutive5xx: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    # note: `retries` is a VirtualService route field, not part of a
    # DestinationRule trafficPolicy — and with TLS-passthrough egress the
    # mesh can't retry at the HTTP layer anyway, so retries stay in the
    # client (tenacity, above)
  • Harden prompts and validate outputs
    • Canonicalized prompts with stable templates; enforced JSON schema using pydantic to avoid downstream parse failures.

Net effect by end of week 3:

  • P95: 1.1s -> 720ms
  • Cache hit rate (LLM responses): 0% -> 62%
  • Token spend run-rate: $120k -> ~$58k/month

Observability and control: measure what actually costs you

You can’t manage what you can’t see. We instrumented everything with OpenTelemetry and lit up SLOs in Prometheus.

  • Trace across UI -> API -> LLM
# opentelemetry for python fastapi
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = FastAPI()

provider = TracerProvider(resource=Resource.create({"service.name": "ai-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
FastAPIInstrumentor.instrument_app(app)
HTTPXClientInstrumentor().instrument()
  • Token and cost metrics

    • Custom metrics: tokens_in_total, tokens_out_total, llm_cost_usd labeled by tenant, model, prompt_version.
    • Dashboards show cost-per-tenant and burn-rate against plan.
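Instrumenting those counters with `prometheus_client` might look like this sketch. The per-1k-token prices are placeholders (real numbers come from the provider's price sheet), and the client appends `_total` to counter sample names:

```python
from prometheus_client import Counter

LABELS = ["tenant", "model", "prompt_version"]
TOKENS_IN = Counter("tokens_in", "Prompt tokens consumed", LABELS)       # exposed as tokens_in_total
TOKENS_OUT = Counter("tokens_out", "Completion tokens produced", LABELS)  # exposed as tokens_out_total
LLM_COST = Counter("llm_cost_usd", "Estimated LLM spend in USD", LABELS)

# Illustrative (input, output) USD prices per 1k tokens.
PRICE_PER_1K = {"gpt-4o": (0.005, 0.015), "gpt-4o-mini": (0.00015, 0.0006)}

def record_usage(tenant: str, model: str, version: str, tokens_in: int, tokens_out: int) -> None:
    """Increment per-tenant token and cost counters after each LLM call."""
    in_price, out_price = PRICE_PER_1K[model]
    TOKENS_IN.labels(tenant, model, version).inc(tokens_in)
    TOKENS_OUT.labels(tenant, model, version).inc(tokens_out)
    LLM_COST.labels(tenant, model, version).inc(tokens_in / 1000 * in_price + tokens_out / 1000 * out_price)
```

Keeping the label set small (tenant, model, prompt_version) matters; free-text labels would blow up Prometheus cardinality.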
  • SLOs and burn-rate alerts

# prometheus alert: 14m/1h multi-window burn for 99% availability SLO
- alert: AICopilotHighErrorBurn
  expr: |
    (sum(rate(http_requests_total{job="ai-api",status=~"5..|429"}[14m]))
      /
    sum(rate(http_requests_total{job="ai-api"}[14m]))) > (14.4 * (1 - 0.99))
    and
    (sum(rate(http_requests_total{job="ai-api",status=~"5..|429"}[1h]))
      /
    sum(rate(http_requests_total{job="ai-api"}[1h]))) > (6 * (1 - 0.99))
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "AI Copilot error budget burn is too high"
  • Privacy and compliance
    • Redacted PII from spans; prompts stored hashed with per-tenant salts and only when consent=true.

Now ops could answer: which tenants are noisy, which prompts regress, and which model changes hurt margin.

Ship safely: canaries, flags, and real traffic replay

We refused to YOLO changes during business hours.

  • Argo Rollouts with canary + auto-rollback on latency/error SLOs
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 10m}
      trafficRouting:
        istio:
          virtualService: {name: ai-api-vs, routes: [primary]}
      analysis:
        templates:
        - templateName: ai-slo-check
        startingStep: 1
  • Feature flags via OpenFeature to flip model versions per-tenant.
  • Traffic replay with GoReplay and k6 to validate under production-like load before promoting.
// k6 smoke for AI endpoint
import http from 'k6/http';
import { sleep, check } from 'k6';
export let options = {
  vus: 50,
  duration: '3m',
  thresholds: { http_req_duration: ['p(95)<800'] }, // gate on real p95, not per-request checks
};
export default function () {
  const res = http.post('https://api.example.com/ai/summarize', JSON.stringify({ text: '...' }), { headers: { 'Content-Type': 'application/json' } });
  check(res, { 'status was 200': (r) => r.status === 200 });
  sleep(1);
}

This is the boring, repeatable way to ship AI changes when revenue depends on it.

Results: from firefight to flywheel

By day 21, we had numbers we’d share with a board:

  • P95 latency: 2.8s -> 650ms (4.3x faster)
  • P99 latency: 7.4s -> 1.9s
  • Error rate: 8.0% -> 0.4%
  • Token spend: ~$120k/month -> ~$43k/month (64% reduction)
  • Cache hit rate (LLM responses): 62% sustained during peak
  • Queue backlog: 1.2M -> <10k steady-state in 24h
  • MTTR: 96m -> 18m (fewer incidents, faster rollbacks)
  • SLO: 99% availability met for 30 consecutive days

Business impact:

  • Support tickets down 71%
  • NPS for AI features +14 points in a month
  • Finance finally had a unit-economics dashboard the CRO trusted

“We went from apologizing to upselling. The AI features stopped being a liability.” — Acting CTO

What we’d do again (and what to steal from this)

Steal these patterns verbatim:

  • Put a budget on everything: tokens, concurrency, context length.
  • Cache aggressively and coalesce identical prompts; most UI triggers are repetitive within minutes.
  • Treat LLMs like external databases: timeouts, retries with jitter, and circuit breakers.
  • Don’t mix OLTP with vector search under load; either index properly (HNSW/IVFFLAT) or split the workload.
  • Instrument token/cost per tenant and per prompt version; you can’t optimize blind.
  • Ship with canaries and real traffic replay. “Works on staging” is comedy.

If your AI sidecar is wobbling today, we’ll help you make it boring. That’s our happy place at GitPlumbers.

Key takeaways

  • Cache and coalesce identical LLM requests—most UI-triggered prompts are low-cardinality under burst.
  • Treat LLMs like flaky dependencies: timeouts, retries with jitter, circuit breakers, and budgets per tenant.
  • Move heavy work off the hot path: precompute embeddings and debounce writes with queues and autoscaling consumers.
  • Measure what matters: token spend per tenant, prompt versions, P95/P99, and burn-rate against SLOs.
  • Ship safely under fire: canaries, feature flags, and replayed production traffic before flipping defaults.

Implementation checklist

  • Establish SLOs for AI endpoints before optimization.
  • Add per-tenant concurrency caps and token budgets in Redis.
  • Implement request coalescing and layered caching for LLM responses.
  • Introduce circuit breakers/timeouts for external LLM providers.
  • Precompute embeddings with a queue + KEDA autoscaling.
  • Optimize vector search (HNSW/IVFFLAT) and reduce RPS to the DB.
  • Instrument with OpenTelemetry and create token/cost dashboards.
  • Use Argo Rollouts canary with automated rollback conditions.

Questions we hear from teams

What if we can’t log prompts due to compliance?
Hash prompts with per-tenant salts to dedupe for caching/metrics without storing raw text. Use on-by-default redaction in OpenTelemetry and store raw prompts only with explicit tenant consent.
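A minimal sketch of that salted fingerprint: a keyed hash (HMAC-SHA256) lets you dedupe and count prompts without storing raw text, and the per-tenant salt prevents cross-tenant correlation of identical prompts.

```python
import hashlib
import hmac

def prompt_fingerprint(tenant_salt: bytes, prompt: str) -> str:
    """Keyed hash of a prompt for cache keys and metrics labels;
    the raw prompt never leaves the process."""
    return hmac.new(tenant_salt, prompt.encode(), hashlib.sha256).hexdigest()
```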
We’re on `pgvector` and can’t introduce another DB. Is stabilization still possible?
Yes. Upgrade the extension, add HNSW/IVFFLAT indexes, reduce top-k, and move embedding writes to a queue. You can buy 2–3x latency improvement without leaving Postgres.
Do we have to switch LLM providers to fix cost/latency?
Usually no. Multi-model fallbacks and right-sizing context length get you most of the win. We often keep Azure OpenAI for regulated tenants and add a cheaper model for non-critical paths.
How long does this kind of stabilization take?
We aim for 2–4 weeks. Week 1 is triage (circuit breakers, caching, caps). Weeks 2–3 tackle data paths and observability. Week 4 is polish and handoff with dashboards and runbooks.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

