The AI Feature That Buckled at 4 p.m.—And How We Kept It Standing

A B2B SaaS rolled out an AI assistant, then real users showed up. Rate limits, token burn, and 12s p95s. GitPlumbers stabilized it under live fire—no replatform, no pause button.

> “We were hemorrhaging tokens and user trust. GitPlumbers didn’t sell magic—they put in guardrails. Two weeks later, demos stopped being scary.” — CTO, mid-market B2B SaaS

The 4 p.m. Faceplant You’ve Lived Through

They ran a webinar, 1,800 signups, and the new AI drafting assistant was the star. Four minutes into the demo, the logs lit up: 429 storms from the LLM vendor, Node.js event loops pinned, vector queries queueing for seconds, and a token bill that looked like a Vegas weekend. p95 latency spiked to 12.4s, error rate to 8.7%. The sales team slacked “can we roll it back?” There was no rollback. The feature had quietly become core to the workflow.

I’ve seen this movie. AI features run great in staging, then die under real users and bursty traffic. The root cause is always the same: unbounded concurrency and wishful thinking about external dependencies.

GitPlumbers was called in “without turning anything off.” We didn’t replatform, didn’t change the frontend, and didn’t ask for a feature freeze. We stabilized it live.

What We Walked Into (Constraints Included)

Architecture, warts and all:

  • Next.js frontend calling a Node/TypeScript API (Express) that orchestrates prompts via LangChain to Azure OpenAI (gpt-4o for generation, text-embedding-3-small for embeddings)
  • RAG on PostgreSQL 14 with pgvector 0.5, naive ivfflat index, uneven tenant data creating hot partitions
  • Redis-backed job system existed for email, not used for AI calls; AI path did sync fanout from web tier
  • No rate limiting per tenant, no backpressure; retries on 429 were infinite with linear backoff (yep)
  • Observability: basic Winston logs and an aged Grafana dashboard tracking CPU and memory only; no token metrics, no tracing
  • Constraints: SOC2 in flight (no new vendors without review), multi-tenant SLAs, and a hard requirement of zero downtime. Timeline: 4 weeks to get to “reliably demo-able” and 8 weeks to hit SLOs

Success targets we agreed on:

  • AI endpoint p95 < 3s, p99 < 6s under 250 sustained RPS
  • Error rate < 1% excluding 4xx
  • Reduce LLM spend by 50% without degrading output quality beyond a 5% hit in human evals
  • MTTR < 10 minutes for AI incidents

The First 72 Hours: Put the Fire Doors In

We didn’t “optimize prompts” first. We installed guardrails so the system could breathe.

  1. Queue and cap concurrency
  • Moved AI calls off the request thread into bullmq with Redis. If the queue passed a threshold, we returned 429 with a friendly message rather than DDoSing the LLM.
// api/ai.ts
import express from 'express'
import { Queue, Worker, QueueScheduler } from 'bullmq'
import CircuitBreaker from 'opossum'
import { aiCall } from './llm'

const router = express.Router()
const connection = { host: process.env.REDIS_HOST, port: 6379 }

const queue = new Queue('ai-requests', {
  connection,
  defaultJobOptions: {
    attempts: 2,
    backoff: { type: 'exponential', delay: 250 },
    removeOnComplete: true,
  },
})
new QueueScheduler('ai-requests', { connection }) // required for delayed retries on BullMQ v1.x; folded into Workers in v2+

const breaker = new CircuitBreaker(aiCall, {
  timeout: 5000,
  errorThresholdPercentage: 50,
  resetTimeout: 10000,
})

// Backpressure: don’t accept more than 1k waiting jobs
router.post('/v1/ai', async (req, res) => {
  const waiting = await queue.getWaitingCount()
  if (waiting > 1000) return res.status(429).send('Busy. Try again in a moment.')
  const job = await queue.add('chat', { tenant: req.tenantId, payload: req.body }) // tenantId assumed set by upstream auth middleware
  res.status(202).json({ id: job.id })
})

// Workers (separate deployment) run the actual LLM call
new Worker('ai-requests', async (job) => {
  return breaker.fire(job.data)
}, { connection, concurrency: Number(process.env.AI_WORKER_CONCURRENCY || 8) })

export default router
  2. Scale on demand, not hope
  • Hooked queue depth to autoscaling via KEDA’s Prometheus scaler. We exported a metric bull_queue_waiting{queue="ai-requests"} and scaled workers accordingly.
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-worker
spec:
  scaleTargetRef:
    name: ai-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:80
        query: sum(bull_queue_waiting{queue="ai-requests"})
        threshold: "200"  # one worker per ~200 waiting jobs
  3. Timeouts, retry budgets, and circuit breaking
  • Hard cap on LLM call duration (5s) and a per-tenant retry budget (2 tries, jittered backoff; a budget sketch follows the call wrapper below). No infinite retries. Bad tenants can’t starve good ones.
// llm.ts
import OpenAI from 'openai'
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

export async function aiCall({ tenant, payload }: { tenant: string; payload: any }) {
  // Hard 5s cap: abort rather than hang past the SLO
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 5000)
  try {
    return await openai.chat.completions.create(
      {
        model: 'gpt-4o-mini',
        messages: payload.messages,
        max_tokens: 600,
        temperature: 0.2,
      },
      // Request options (timeout, abort signal) belong in the second argument, not the body
      { timeout: 5000, signal: controller.signal },
    )
  } finally {
    clearTimeout(timeout)
  }
}
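The per-tenant retry budget sat in the worker rather than the SDK. A minimal sketch of the idea, assuming a shared ioredis client; the key naming, per-minute window, and limits are illustrative, not the production values:
// retry-budget.ts (illustrative sketch)
import Redis from 'ioredis'

const redis = new Redis({ host: process.env.REDIS_HOST })

// Allow at most `limit` retries per tenant per minute; beyond that, fail fast
export async function consumeRetryBudget(tenant: string, limit = 20): Promise<boolean> {
  const windowKey = `retry-budget:${tenant}:${Math.floor(Date.now() / 60_000)}`
  const used = await redis.incr(windowKey)
  if (used === 1) await redis.expire(windowKey, 120) // let old windows expire on their own
  return used <= limit
}

// Jittered exponential backoff: 250ms base, doubled per attempt, ±50% jitter
export function backoffMs(attempt: number, base = 250): number {
  return Math.round(base * 2 ** attempt * (0.5 + Math.random()))
}
A failed call only re-enqueues if consumeRetryBudget(tenant) says yes; otherwise the job fails fast and the user gets the same polite “busy” as the backpressure path.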
  4. Fast load test, realistic traffic
  • Reproduced the exact meltdown with k6 + recorded user journeys. This gave us a baseline and caught regressions fast.
// load.js
import http from 'k6/http'
import { check, sleep } from 'k6'

export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 250 },
    { duration: '5m', target: 0 },
  ],
}

export default function () {
  const payload = JSON.stringify({ messages: [{ role: 'user', content: 'Draft a proposal for ACME' }] })
  const res = http.post('https://api.example.com/v1/ai', payload, { headers: { 'Content-Type': 'application/json' } })
  check(res, { 'status is 202 or 200': (r) => r.status === 202 || r.status === 200 })
  sleep(1)
}

Result after 3 days: 429s down 63%, median latency down 58%, web tier CPU back to normal. Users noticed the system felt “responsive again,” even when it returned a polite “busy” instead of silently spinning.

Make the AI Layer Deterministic (Cut Hallucinations and Cost)

The original path streamed free-form text and regex-parsed “steps.” Under load, small formatting errors cascaded. We made outputs machine-checkable and shorter.

  • Switched to response_format: json_schema with Zod validation
  • Shortened the system prompt and cached it; added an anti-prompt-injection preface
  • Tiered models: default to gpt-4o-mini for routine drafting; escalate to gpt-4o or an Anthropic fallback only for tasks tagged as complex (a thin router, sketched after the schema example below)
  • Capped max_tokens by task type; trimmed context with tighter chunking and metadata filters
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'
import OpenAI from 'openai'

const PlanSchema = z.object({
  title: z.string().max(120),
  steps: z.array(z.string()).min(3).max(7),
  risk: z.array(z.string()).max(5).optional(),
})

// Zod stays the single source of truth; convert it for the model
const jsonSchema = {
  name: 'Plan',
  schema: zodToJsonSchema(PlanSchema), // via the zod-to-json-schema package, or hand-write the JSON Schema
}

async function draftPlan(messages: any[]) {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
  const resp = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages,
    temperature: 0.2,
    response_format: { type: 'json_schema', json_schema: jsonSchema as any },
    max_tokens: 500,
  })
  const content = resp.choices[0].message.content || '{}'
  // Validate before anything downstream touches it; throws loudly on schema drift
  return PlanSchema.parse(JSON.parse(content))
}
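Model tiering was deliberately boring: a lookup from task tag to model and token cap, with the premium model reserved for work tagged as complex. A sketch under those assumptions; the tag names and caps are illustrative:
// model-tiering.ts (illustrative sketch)
type TaskTag = 'routine_draft' | 'long_form' | 'complex_reasoning'

interface ModelChoice {
  model: string
  maxTokens: number
}

// Cheap by default; escalate only on tags that earned it in evals
const TIERS: Record<TaskTag, ModelChoice> = {
  routine_draft: { model: 'gpt-4o-mini', maxTokens: 500 },
  long_form: { model: 'gpt-4o-mini', maxTokens: 900 },
  complex_reasoning: { model: 'gpt-4o', maxTokens: 800 },
}

export function pickModel(tag?: TaskTag): ModelChoice {
  return (tag && TIERS[tag]) || TIERS.routine_draft
}
The Anthropic fallback fits most naturally behind the circuit breaker’s fallback handler, so a vendor outage swaps providers instead of dropping requests.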

RAG tuning (don’t overcomplicate):

  • Moved from ivfflat to an HNSW index and added tenant and document-type filters
  • Reduced chunk size to 600–800 tokens with an 80-token overlap for better recall without bloating context
  • Precomputed embeddings asynchronously; cached top-k results per query key in Redis for 10 minutes (sketched after the index DDL below)
-- pgvector HNSW (PostgreSQL 14+, pgvector >= 0.5)
CREATE INDEX CONCURRENTLY IF NOT EXISTS docs_embedding_hnsw
ON documents USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);
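
The top-k cache is a thin wrapper around the retrieval call. A minimal sketch, assuming ioredis and a searchTopK retrieval function (both names are illustrative):
// rag-cache.ts (illustrative sketch)
import { createHash } from 'crypto'
import Redis from 'ioredis'

const redis = new Redis({ host: process.env.REDIS_HOST })
const TTL_SECONDS = 600 // the 10-minute window mentioned above

// Key on tenant + normalized query so tenants never share results
function cacheKey(tenant: string, query: string): string {
  const digest = createHash('sha256').update(query.trim().toLowerCase()).digest('hex')
  return `rag:topk:${tenant}:${digest}`
}

export async function cachedTopK(
  tenant: string,
  query: string,
  searchTopK: (tenant: string, query: string) => Promise<unknown[]>,
) {
  const key = cacheKey(tenant, query)
  const hit = await redis.get(key)
  if (hit) return JSON.parse(hit)

  const results = await searchTopK(tenant, query)
  await redis.set(key, JSON.stringify(results), 'EX', TTL_SECONDS)
  return results
}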

Outcome in 2 weeks:

  • Token usage per request down 40%
  • Human eval “on-target” scores improved +9% (deterministic JSON helps UX a lot)
  • Hallucination incidents (support tickets tagged) down ~70%

Observe What Matters (Tokens, Latency, Queues)

We wired three lenses: metrics, tracing, and LLM analytics.

  • Prometheus for Golden Signals + tokens and cache hits
  • OpenTelemetry traces to Tempo/Jaeger with LLM spans annotated (model, tokens, retry count)
  • Langfuse for prompt-level analytics and experiment flags
// metrics.ts
import client from 'prom-client'
export const registry = new client.Registry()

export const llmDuration = new client.Histogram({
  name: 'llm_request_duration_seconds',
  help: 'LLM latency',
  buckets: [0.2, 0.5, 1, 2, 3, 5, 8],
  labelNames: ['model', 'tenant'],
})
export const llmTokens = new client.Counter({
  name: 'llm_tokens_total',
  help: 'Total tokens by model',
  labelNames: ['model', 'type'],
})
export const bullWaiting = new client.Gauge({
  name: 'bull_queue_waiting',
  help: 'Waiting jobs per queue',
  labelNames: ['queue'],
})
registry.registerMetric(llmDuration)
registry.registerMetric(llmTokens)
registry.registerMetric(bullWaiting)
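
The KEDA scaler earlier only works if something populates bull_queue_waiting. A minimal sketch of that wiring, assuming the ai-requests queue from the first code block and a small Express app serving /metrics; the port and poll interval are arbitrary:
// metrics-wiring.ts (illustrative sketch)
import express from 'express'
import { Queue } from 'bullmq'
import { registry, bullWaiting } from './metrics'

const queue = new Queue('ai-requests', { connection: { host: process.env.REDIS_HOST, port: 6379 } })

// Poll queue depth every few seconds and expose it as a gauge for Prometheus/KEDA
setInterval(async () => {
  bullWaiting.set({ queue: 'ai-requests' }, await queue.getWaitingCount())
}, 5000)

const app = express()
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', registry.contentType)
  res.end(await registry.metrics())
})
app.listen(9464)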

On-call stopped guessing. Grafana showed p95s per tenant and per model, and we set alerts that mapped to SLOs, not CPU graphs.
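
The paging alerts were few and mapped straight to the SLO targets above. A sketch of the two that mattered; llm_request_duration_seconds comes from the histogram exported earlier, while http_requests_total{route="/v1/ai"} assumes an HTTP metrics middleware you may have under a different name:
# slo-alerts.yaml (illustrative sketch)
groups:
  - name: ai-slo
    rules:
      - alert: AIEndpointP95High
        expr: >
          histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le)) > 3
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "AI p95 latency above 3s for 10 minutes"
      - alert: AIErrorRateHigh
        # http_requests_total is assumed from a request-metrics middleware; swap in your own
        expr: >
          sum(rate(http_requests_total{route="/v1/ai", status=~"5.."}[5m]))
          / sum(rate(http_requests_total{route="/v1/ai"}[5m])) > 0.01
        for: 10m
        labels: { severity: page }
        annotations:
          summary: "AI 5xx error rate above 1% for 10 minutes"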

Roll Out Without Russian Roulette

We stopped doing “all-or-nothing” deploys. We used Argo Rollouts for canaries tied to Prometheus queries.

# rollout.yaml (snippet)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-worker
spec:
  replicas: 6
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: p95-latency
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: error-rate
              - templateName: p95-latency
      trafficRouting:
        istio: { virtualService: { name: ai-worker, routes: [primary] } }

If p95 > 3s or error rate > 1%, Rollouts auto-rolled back. We also practiced failure: chaos testing with blocked egress to the LLM vendor to prove circuit breakers actually did their job.
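
The error-rate and p95-latency templates referenced in the Rollout are just Prometheus queries with a pass/fail condition. A sketch of the error-rate one, reusing the assumed http_requests_total metric from the alerting example:
# analysistemplate-error-rate.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1
      # Fail the canary if 5xx rate on the AI route climbs above 1%
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus-server.monitoring.svc:80
          query: >
            sum(rate(http_requests_total{route="/v1/ai", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{route="/v1/ai"}[5m]))
The p95-latency template is the same shape with the histogram_quantile query and successCondition: result[0] < 3.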

Results You Can Take to a CFO (and an SRE)

Four weeks in, under a customer event with real traffic:

  • Sustained throughput: 30 RPS → 280 RPS (9.3x) without user-visible degradation
  • p95 latency: 12.4s → 2.1s; p99: >20s → 4.8s
  • Error rate: 8.7% → 0.7% (429s down 93%)
  • LLM spend: -58% month-over-month; token burn down 40% per request; monthly dropped from ~$92k to ~$38k
  • Cache hit ratio on RAG results: 62% within 10 min TTL windows
  • MTTR on AI incidents: 46 min → 9 min (runbooks + alerts + traces)

Business impact: they kept the feature on during their biggest quarter; churn risk on two key accounts evaporated after performance stabilized.

What We’d Reuse Tomorrow (Do This First)

  • Move AI work off the web thread. Use a queue; reject early when backpressure exceeds SLOs.
  • Wrap LLM calls in circuit breakers with hard timeouts and per-tenant retry budgets.
  • Make outputs deterministic with JSON schema and validate with Zod.
  • Tier models. Default to an efficient model; promote to premium only when needed.
  • Tune RAG with HNSW, smaller chunks, and cached top-k results.
  • Scale workers based on queue depth, not CPU.
  • Instrument tokens, queue depth, p95, and error budget burn. Gate deployments with canaries.

If any of this sounds like the fire you’re fighting, GitPlumbers can slot into your stack without replatforming. We’ve stabilized AI features at fintechs, logistics platforms, and healthcare orgs under stricter constraints than this one.


Key takeaways

  • Stability isn’t magic—it’s backpressure, timeouts, and observability wired end-to-end.
  • Queue the AI calls; don’t let the web tier do unbounded synchronous LLM fanout.
  • Treat the LLM as an unreliable, rate-limited dependency; use circuit breakers and retry budgets.
  • Make the AI layer deterministic: JSON schema responses, short prompts, and result caching.
  • Scale on queue depth, not CPU; tie autoscaling to demand with KEDA or HPA custom metrics.
  • Instrument tokens, latency, and error budgets; make promotion decisions data-driven with canaries.

Implementation checklist

  • Cap concurrency with a queue (e.g., `bullmq`) and reject when backpressure exceeds SLOs.
  • Wrap LLM calls in a circuit breaker with strict timeouts; enforce retry budgets per tenant.
  • Cache prompts and RAG results; promote a cheap default model with smart fallbacks.
  • Tune vector search (HNSW, filters) and chunking; precompute embeddings asynchronously.
  • Export Prometheus metrics for latency, tokens, and queue depth; wire Grafana and alerts.
  • Scale workers on queue length using KEDA Prometheus scaler; define PDBs and liveness probes.
  • Roll out with Argo Rollouts canaries gated by error rate and p95; bake in auto-rollback.

Questions we hear from teams

Can you stabilize our AI feature without pausing development?
Yes. In this engagement we shipped guardrails (queues, circuit breakers, autoscaling) in parallel with feature work. We isolate the stabilization changes behind flags and roll them out with canaries.
Do we need to replatform our LLM stack?
Usually not. We prefer least-change approaches: cap concurrency, add backpressure, and make outputs deterministic. If vendor limits are the bottleneck, we add smart fallbacks rather than wholesale migrations.
Which vendors and tools do you work with?
Azure OpenAI, OpenAI, Anthropic (via Bedrock), Vertex AI; Node/TypeScript, Python; Redis/BullMQ, Kafka; Prometheus/Grafana; OpenTelemetry/Tempo/Jaeger; ArgoCD/Rollouts; KEDA; Postgres/pgvector; Elastic/OpenSearch; Langfuse; LaunchDarkly.
How fast can we see results?
We target tangible improvements in 72 hours (fewer 429s, lower median latency) and SLO-grade stability in 2–6 weeks depending on constraints and traffic patterns.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Stabilize your AI feature · Talk to an engineer
