Your Load Test Passed. Production Still Melted. Here’s the Strategy That Actually Predicts Pain.
Load testing isn’t about bragging rights on RPS. It’s about protecting p95 latency, checkout success, and revenue when your marketing team “surprises” you with traffic.
A load test that doesn’t measure p95/p99, error rate, and saturation is just a stress-themed demo.
The load test that “passed” right up until checkout died
I’ve watched teams celebrate a green load test—“We hit 5k RPS!”—and then get absolutely wrecked the first time a real campaign lands. The postmortem always reads the same:
- Synthetic traffic was too clean (no cache misses, no weird payloads, no mobile clients)
- The success criterion was throughput, not user-perceived latency
- The test environment had different DB parameters, smaller datasets, or “helpful” feature flags turned off
- Nobody measured saturation signals (connection pools, queue depth, lock time, GC)
If you’re an engineering leader, what you actually care about is simple:
- p95/p99 latency on the flows that make money
- error rate during peak
- conversion impact (or churn impact) when the system gets slow
That’s the difference between “we ran a load test” and “we validated system behavior under stress.”
Start with business outcomes, then map to SLIs that don’t lie
Load testing gets political fast when it’s framed as “performance work.” The way out is to tie it to business outcomes and translate those into hard metrics.
Typical mapping we use at GitPlumbers:
- Checkout / payment
  - SLIs: `p95` and `p99` end-to-end latency, 5xx rate, timeouts, payment provider error rate
  - Business: conversion rate, revenue/hour, cart abandonment
- Search / browse
  - SLIs: `p95` latency, 429/503 rate, cache hit ratio
  - Business: pages/session, add-to-cart rate
- Auth / session
  - SLIs: login success %, token issuance latency, dependency timeouts
  - Business: activation rate, support tickets
Define pass/fail gates like you mean it. Example gate set:
- Checkout
  - `p95 < 800ms`, `p99 < 1500ms` under expected peak
  - 5xx < 0.2% and timeouts < 0.1%
  - No sustained saturation: DB CPU < 80%, connection pool < 90% used, queue depth stable
If you can’t decide what “good” looks like, the load test will devolve into vibes and graphs.
Build a workload model from reality (not a wish)
I’ve seen this fail most often with AI-assisted codebases: someone generates a “quick k6 script,” hits one endpoint, and calls it done. Meanwhile production traffic is 30 endpoints with ugly long tails, retries, and payloads that blow up JSON parsing.
What actually works:
- Pull production request mix from access logs/APM (even a week is enough).
- Model:
- endpoint distribution (top N routes)
- concurrency (not just RPS)
- payload sizes
- think time (humans don’t click at 0ms intervals)
- cache behavior (warm vs cold)
- Include the “bad weather” patterns:
- auth token refresh bursts
- thundering herd on popular items
- dependency slowness (payment, email, fraud checks)
If you have OpenTelemetry traces, you can sanity check the model by comparing span breakdowns under load.
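One way to operationalize the endpoint-distribution piece is to turn aggregated access-log counts into a weighted mix and sample from it the way a load script would. A minimal sketch; the routes and counts are hypothetical placeholders, not from any real system:

```javascript
// Build a weighted endpoint mix from aggregated access-log counts,
// then sample from it the way a load-test script would.
function buildMix(routeCounts) {
  const total = Object.values(routeCounts).reduce((a, b) => a + b, 0);
  let cum = 0;
  return Object.entries(routeCounts).map(([route, count]) => {
    cum += count / total;
    return { route, weight: count / total, cum }; // cum = cumulative weight
  });
}

// Pick a route according to the measured traffic mix.
function sampleRoute(mix, r = Math.random()) {
  return mix.find((entry) => r <= entry.cum).route;
}

const mix = buildMix({ '/api/search': 55000, '/api/product': 30000, '/api/checkout': 15000 });
console.log(mix.map((m) => `${m.route}: ${(m.weight * 100).toFixed(0)}%`).join(', '));
```

Feed the same mix table to your k6 script instead of hardcoding a 70/30 split, and the model stays honest as production traffic drifts.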
Here’s a concrete k6 scenario that mixes browse/search/checkout and produces latency percentiles that matter:
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    peak: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 2000,
      stages: [
        { target: 200, duration: '5m' },  // warmup
        { target: 800, duration: '10m' }, // peak
        { target: 200, duration: '5m' },  // cool down
      ],
    },
  },
  thresholds: {
    http_req_failed: ['rate<0.002'],
    http_req_duration: ['p(95)<800', 'p(99)<1500'],
  },
};

export default function () {
  // 70% browse/search
  if (Math.random() < 0.7) {
    const r1 = http.get(`${__ENV.BASE_URL}/api/search?q=headphones`);
    check(r1, { 'search 200': (r) => r.status === 200 });
    sleep(Math.random() * 2);
    const r2 = http.get(`${__ENV.BASE_URL}/api/product/sku-123`);
    check(r2, { 'product 200': (r) => r.status === 200 });
  } else {
    // 30% checkout
    const cart = http.post(`${__ENV.BASE_URL}/api/cart/add`, JSON.stringify({ sku: 'sku-123', qty: 1 }), {
      headers: { 'Content-Type': 'application/json' },
    });
    check(cart, { 'cart add 200': (r) => r.status === 200 });
    const checkout = http.post(`${__ENV.BASE_URL}/api/checkout`, JSON.stringify({ paymentMethod: 'card' }), {
      headers: { 'Content-Type': 'application/json' },
    });
    check(checkout, { 'checkout 200/202': (r) => r.status === 200 || r.status === 202 });
  }
  // user think time
  sleep(0.5 + Math.random() * 1.5);
}
```

Notice what’s missing: raw “RPS goals.” We’re driving arrival rate + concurrency because that’s closer to how systems fail in the real world.
Test shapes that catch different production failures
One load test shape won’t find all failure modes. You need a small set of intentional tests that each prove something.
Baseline test (10–20 min):
- Goal: validate no regressions; catch obvious pool sizing/config mistakes
- Pass: SLO gates pass; dashboards show stable saturation
Stress test (push past expected peak):
- Goal: find the cliff (where queues explode, retries storm, DB locks spike)
- Pass: graceful degradation (429s, backpressure), not random 500s
Spike test (step function):
- Goal: validate autoscaling lag, cache stampedes, cold starts
- Pass: recover within an SLO recovery window (e.g., 2–5 minutes)
Soak test (2–8 hours):
- Goal: memory leaks, GC thrash, connection leaks, log volume surprises
- Pass: stable memory/latency trend; no creeping error rate
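The soak-test “creeping trend” check can be automated with a least-squares slope over periodic samples instead of eyeballing a dashboard. A generic sketch; the sample values and the 10-minute interval are made up for illustration:

```javascript
// Flag a creeping trend in soak-test samples (memory MB, p95 ms, error rate)
// using a least-squares slope. This is a generic helper, not a k6 API.
function trendSlope(samples) {
  const n = samples.length;
  const xMean = (n - 1) / 2;
  const yMean = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  samples.forEach((y, x) => {
    num += (x - xMean) * (y - yMean);
    den += (x - xMean) ** 2;
  });
  return num / den; // units per sample interval
}

// Hypothetical heap samples taken every 10 minutes during a soak run (MB).
const flat = [512, 514, 511, 515, 513, 512];
const leaking = [512, 540, 571, 602, 633, 661];
console.log('flat slope:', trendSlope(flat).toFixed(2), 'leaking slope:', trendSlope(leaking).toFixed(2));
```

A slope threshold (e.g., “more than a few MB per interval, sustained”) makes the soak pass/fail criterion mechanical instead of a judgment call at 2 a.m.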
I’ve seen Kubernetes teams get burned specifically on spike tests because the HPA reacts too slowly and the service falls into a retry storm before it scales.
Here’s a real-world HPA config that’s less “demo-friendly” and more “survives Black Friday”:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 10
  maxReplicas: 120
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

CPU isn’t perfect, but it’s better than pretending scaling doesn’t matter. If you can, scale on queue depth or in-flight requests—just don’t do it without solid telemetry.
Instrumentation that lets you answer “why is it slow?” in 10 minutes
If you can’t correlate load with saturation, you’ll end up in the worst performance loop: “We tried bigger instances.” (Yes, sometimes that helps. No, it’s not a strategy.)
Minimum viable observability for load tests:
- RED metrics (Request rate, Error rate, Duration) per endpoint
- USE metrics (Utilization, Saturation, Errors) for CPU, memory, disk, network
- Dependency metrics:
- DB: connections, slow queries, lock time
- Cache: hit ratio, evictions
- Queues: depth, age, consumer lag
- Traces for top flows (checkout/search)
Example PromQL queries we commonly pin to a “load test” Grafana dashboard:
```promql
histogram_quantile(0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket{route="/api/checkout"}[5m]))
)
```

```promql
sum(rate(http_server_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_server_requests_total[5m]))
```

During a test, you’re looking for the classic signatures:
- Latency rises while CPU is flat → often IO wait, locks, or downstream slowness
- Error rate rises after latency rises → often timeouts and retry amplification
- p99 explodes but p50 is fine → usually queueing or a few hot shards/rows
Optimizations that reliably move the needle (with outcomes you can defend)
Here are fixes I’ve seen produce measurable outcomes repeatedly—especially on legacy or AI-generated services where “it works” but the edges are sharp.
Database indexing + query shape (PostgreSQL)
- Symptom: p95 climbs with concurrency; DB CPU hits 80–90%; slow query log lights up
- Fix: add missing composite index, remove `SELECT *`, fix N+1 queries
- Outcome we commonly see: p95 -30% to -70%, DB CPU -20%, fewer timeouts
Connection pool right-sizing (`HikariCP`, `pgBouncer`)
- Symptom: app threads blocked; DB max connections exhausted; “works in test” until peak
- Fix: cap the app pool, use `pgBouncer` in transaction mode, match pool size to DB cores
- Outcome: fewer brownouts; error rate drops from ~1% to <0.2% at peak
Cache stampede prevention (`Redis`)
- Symptom: spike test melts the DB after deploy/cache flush
- Fix: request coalescing, TTL jitter, stale-while-revalidate
- Outcome: p99 stabilized; DB read QPS reduced 40–80% during spikes
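The coalescing + jitter fixes above can be sketched in a few lines. This is an in-memory stand-in, assuming a `loader` function that hits your database; in production the cache would be `Redis` and the single-flight logic would live in your cache client:

```javascript
// Single-flight request coalescing with TTL jitter.
const cache = new Map();    // key -> { value, expiresAt }
const inFlight = new Map(); // key -> Promise (one loader call per key at a time)

function jitteredTtlMs(baseMs, jitterRatio = 0.2) {
  // Spread expirations so a population of keys doesn't expire in the same second.
  return baseMs * (1 - jitterRatio + Math.random() * 2 * jitterRatio);
}

async function getCoalesced(key, loader, baseTtlMs = 60_000) {
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value;
  if (inFlight.has(key)) return inFlight.get(key); // coalesce concurrent misses
  const p = loader(key)
    .then((value) => {
      cache.set(key, { value, expiresAt: Date.now() + jitteredTtlMs(baseTtlMs) });
      return value;
    })
    .finally(() => inFlight.delete(key)); // clean up on success or failure
  inFlight.set(key, p);
  return p;
}
```

Under a spike, a thousand concurrent misses on a hot key become one database read instead of a thousand; the jitter keeps the refill from synchronizing into the next stampede.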
Backpressure + rate limiting (Envoy/Nginx, app-level)
- Symptom: under stress everything returns 500 and recovery is slow
- Fix: return 429/503 fast with clear retry headers; shed non-critical work
- Outcome: faster recovery, smaller blast radius, improved MTTR
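The “return 429 fast” fix can be as simple as a concurrency cap in front of the handler. A sketch; the Express-style wiring is hypothetical, and `maxInFlight` is a knob you’d derive from your stress tests, not a magic number:

```javascript
// Concurrency-cap load shedding: reject early instead of queueing forever.
function makeShedder(maxInFlight) {
  let inFlight = 0;
  return {
    tryAcquire() {
      if (inFlight >= maxInFlight) return false; // shed instead of queueing
      inFlight++;
      return true;
    },
    release() {
      inFlight--;
    },
  };
}

// Express-style middleware wiring (illustrative).
function shedMiddleware(shedder) {
  return (req, res, next) => {
    if (!shedder.tryAcquire()) {
      res.statusCode = 429;
      res.setHeader('Retry-After', '1'); // tell well-behaved clients when to retry
      return res.end();
    }
    res.on('finish', () => shedder.release());
    next();
  };
}
```

The point is that a fast 429 costs microseconds; a request that queues for 30 seconds and then times out costs a thread, a DB connection, and usually a retry on top.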
Async boundaries for slow dependencies (payments, fraud checks)
- Symptom: checkout blocked on a downstream with unpredictable latency
- Fix: queue + worker, outbox pattern, idempotency keys
- Outcome: higher checkout success under peak; tail latency improves dramatically
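The idempotency-key part of that fix is the piece teams most often skip. A minimal sketch; the store and queue here are in-memory stand-ins for Redis/SQS-style infrastructure, and the key/order shapes are invented:

```javascript
// Idempotency-key dedupe for an async checkout enqueue: a client retry
// with the same key must not enqueue (or charge) twice.
const processed = new Map(); // idempotencyKey -> recorded result
const queue = [];            // stand-in for a real message queue

function enqueueCheckout(idempotencyKey, order) {
  if (processed.has(idempotencyKey)) {
    // Retry or duplicate: return the original result, do no new work.
    return { status: 'duplicate', result: processed.get(idempotencyKey) };
  }
  const result = { orderId: order.id, state: 'queued' };
  processed.set(idempotencyKey, result);
  queue.push({ idempotencyKey, order });
  return { status: 'accepted', result };
}
```

In a real system the `processed` check-and-set must be atomic (e.g., `SET NX` in Redis or a unique constraint in the outbox table), otherwise two concurrent retries can both pass the check.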
When we run these with clients, we always tie results back to business metrics. Example from a recent engagement: moving checkout fraud scoring async plus fixing a hot DB index took two sprints, improved checkout p95 from ~1.4s to ~650ms, and reduced peak-time abandonment (measured via funnel) by ~3–5%. That’s not “performance work.” That’s revenue.
A repeatable load testing loop that doesn’t rot after the consultant leaves
The strategy only sticks if it becomes part of your delivery system.
- Codify the tests in-repo (`k6`, `Locust`, or `Gatling`) and version them like application code.
- Make environments honest:
- same DB engine/version
- production-like dataset scale (or at least realistic cardinality)
- same autoscaling and resource limits
- Gate releases on performance budgets:
- allow small regressions (e.g., +5% p95) but fail big ones
- track budget like you track error budget
- Write down what you learn:
- what bottleneck you hit
- what you changed
- what metric moved
- what it cost (time + infra)
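The “gate releases on performance budgets” step can be a tiny CI script that diffs the current run against a stored baseline. A sketch; the +5%/+10% budgets echo the rule above but are illustrative, not a standard:

```javascript
// CI gate: fail the build when the current run regresses past budget
// relative to a stored baseline (values in ms, from load-test summaries).
function checkBudget(baseline, current, budgets = { p95: 0.05, p99: 0.1 }) {
  const failures = [];
  for (const [metric, allowed] of Object.entries(budgets)) {
    const regression = (current[metric] - baseline[metric]) / baseline[metric];
    if (regression > allowed) {
      failures.push(`${metric} +${(regression * 100).toFixed(1)}% exceeds +${allowed * 100}% budget`);
    }
  }
  return { pass: failures.length === 0, failures };
}

// Example: last release's baseline vs today's load-test run.
console.log(checkBudget({ p95: 800, p99: 1500 }, { p95: 820, p99: 1520 }));
```

k6 can emit these percentiles as JSON (`--summary-export`), so the baseline file can live in the repo next to the test scripts and get updated deliberately, like a lockfile.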
If you’re doing AI-assisted refactors or “vibe coding” your way through a backlog, this loop is non-negotiable. AI can generate a lot of code quickly; it’s also fantastic at generating accidental N+1s, unbounded concurrency, and “helpful” retries.
At GitPlumbers we call this “performance hygiene”: you don’t do it once, you institutionalize it—otherwise the system slowly returns to its natural state of chaos.
If you want a second set of eyes, GitPlumbers can help you build a load test suite that actually predicts production incidents, not just produces pretty graphs.
Key takeaways
- Treat load tests as a business-risk exercise: validate p95/p99 latency, error rate, and conversion-critical flows—not just throughput.
- Build workload models from real production traces and logs; otherwise you’ll optimize the wrong bottleneck.
- Define “pass/fail” using SLO-style gates (latency + availability + saturation) and wire it into CI/CD.
- Use multiple test shapes (baseline, stress, spike, soak) to catch different failure modes: autoscaling lag, pool exhaustion, memory leaks, and cache stampedes.
- Optimization wins that matter are usually boring: DB indexes, pool sizing, caching, backpressure, and removing synchronous fan-out.
- Always correlate user-facing metrics with system saturation (CPU, IO, DB locks, queue depth) to find the real limiter.
Implementation checklist
- Define top 3 user journeys (e.g., login, search, checkout) and assign latency + error SLOs.
- Capture production traffic shape: request mix, concurrency, payload size, cache hit rate, peak-to-average.
- Create test types: baseline, stress, spike, soak; document what each is meant to prove.
- Instrument before testing: `OpenTelemetry` traces + `Prometheus` RED/USE metrics + dashboards.
- Establish pass/fail gates: p95/p99 latency, error rate, saturation thresholds, and regression budget.
- Run tests against production-like infra (same DB engine/version, same autoscaling, same limits).
- Record bottlenecks and fixes in a “perf ledger” (what changed, why, and measured outcome).
- Re-run the same scenarios after every meaningful change (framework upgrades, AI-generated refactors, config drift).
Questions we hear from teams
- What’s the biggest mistake teams make with load testing?
- Optimizing for throughput (RPS) instead of **user-facing latency percentiles** and failure modes. Production incidents are usually p99 + retries + saturation, not “we didn’t hit enough RPS.”
- Do I need a perfect production clone to get value?
- No, but you need production-like **constraints**: same DB engine/version, realistic dataset cardinality, the same autoscaling/resource limits, and the same critical dependencies (or controlled simulators). Otherwise you’ll find the wrong bottleneck.
- k6 vs Locust vs Gatling—does the tool matter?
- Less than you want it to. Pick one your team will maintain. k6 is great for CI-friendly scripting and thresholds; Locust is great if your org is Python-heavy; Gatling shines in JVM shops. The strategy (workload model + SLIs + gates) matters more.
- How do you connect performance improvements to business impact?
- Measure funnel metrics alongside SLIs: checkout success rate, abandonment, conversion, revenue/hour. Then correlate changes in p95/p99 and error rate during peak tests (and real peaks) to those funnel shifts.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
