The Load Test That Caught a $3M Outage Before Marketing Did
Stop chasing CPU graphs. Validate user-facing behavior under real stress, tie it to revenue, and ship with a margin of safety.
The promo that almost killed checkout
Black Friday, 07:58. Marketing flips a feature flag, drops a 20% promo, and traffic climbs 4x in six minutes. I’ve watched this movie. At a retailer I worked with, the p95 checkout latency went from 700ms to 2.9s, error rate hit 6%, and conversion cratered. We’d done “load testing,” but it was a closed-loop JMeter script hammering a single endpoint with warm caches and no third-party calls. Useless.
Here’s what actually works: build a load testing strategy that validates the end-to-end user journey under realistic stress, measures what customers feel (p95, Apdex, LCP), and ties it to what the business pays for (conversion, revenue at risk). That’s how we caught a similar promo-induced failure for another client a week before launch—and avoided a seven-figure incident.
If your load test doesn’t break something now and then, it’s telling you bedtime stories.
Measure what users feel, not what servers do
Stop starting with CPU. Start with the journeys that print money.
- User-facing SLOs (owned by product + SRE):
  - Search -> PDP -> Add to cart -> Checkout: p95 < 800ms, error rate < 0.5%
  - Web: LCP < 2.5s p75 on mobile, CLS < 0.1, TTFB < 200ms from key regions
  - Auth: p95 < 300ms, 99.9% success
- Business KPIs (observed during tests):
  - Checkout completion rate, drop-off at payment step
  - Queue length vs. abandonment when the virtual waiting room is on
  - Revenue proxy: `requests_to_payment_gateway * AOV * success_rate`
- Technical signals (supporting, not leading):
  - RED: Rate, Errors, Duration per service in Grafana
  - USE: Utilization, Saturation, Errors on infra (Prometheus, Node Exporter, cAdvisor)
  - Saturation indicators: DB `active_connections`, `pg_locks`, GC pause time, nginx 499/5xx
Tie each journey to a clear budget and a consequence. If checkout p95 exceeds 1s, you aren’t just slow—you’re losing carts at scale.
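One way to keep the business signal next to the latency signal is to emit it from the test itself as a custom metric. A minimal k6 sketch (the endpoint and threshold values are illustrative, not your SLOs):

import http from 'k6/http';
import { Rate, Trend } from 'k6/metrics';

// Business KPI proxies, reported alongside the built-in latency metrics.
const checkoutSuccess = new Rate('checkout_success');       // conversion proxy
const paymentDuration = new Trend('payment_step_duration'); // drop-off driver

export const options = {
  thresholds: {
    http_req_duration: ['p(95)<800'],  // user-facing SLO
    checkout_success: ['rate>0.995'],  // budget: <0.5% failed checkouts
  },
};

export default function () {
  // Placeholder endpoint; point this at your real checkout call.
  const res = http.post('https://api.example.com/checkout', '{}', {
    headers: { 'Content-Type': 'application/json' },
  });
  checkoutSuccess.add(res.status === 200 || res.status === 201);
  paymentDuration.add(res.timings.duration);
}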
Model real traffic: open workloads and ugly edge cases
Most teams get this part wrong.
- Use an open workload model: users arrive regardless of your response time. Closed models (think: one VU waits for response, then sends next) hide tail latency via coordinated omission.
- Tools that support open/constant arrival: k6 (`constant-arrival-rate`), wrk2, Vegeta. Gatling can be scripted similarly.
- Shape the load like your business:
- Baseline: current steady-state RPS (e.g., 800 req/s checkout)
- Step: +20% every 5 minutes until you hit 2x expected peak
- Spike: instant 3x for 60s (promo drop)
- Soak: 2 hours at expected peak to catch leaks and GC churn
- Stress: push until SLO breaks; record the breakpoint and failure mode
- Test the whole journey: auth -> browse -> cart -> checkout -> payment. Include redirects, image/CDN fetches, and third-party calls.
- Include failure modes: 2% payment gateway timeouts, 1% DNS resolution errors, 150ms jitter from a key ISP; rate limits from Stripe or Adyen.
Example k6 constant-arrival scenario for checkout:
import http from 'k6/http';
import { check } from 'k6';

export const options = {
  scenarios: {
    steady_checkout: {
      // Open model: iterations start at a fixed rate regardless of response time.
      executor: 'constant-arrival-rate',
      rate: 1000, // requests per second
      timeUnit: '1s',
      duration: '30m',
      preAllocatedVUs: 2000,
      maxVUs: 5000,
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<800'],  // journey SLO
    http_req_failed: ['rate<0.005'],   // <0.5% errors
  },
};

export default function () {
  const res = http.post('https://api.example.com/checkout', JSON.stringify({
    cartId: 'k6-' + __VU + '-' + __ITER,
    paymentMethod: 'visa',
  }), { headers: { 'Content-Type': 'application/json' } });
  check(res, { 'status is 200/201': (r) => r.status === 200 || r.status === 201 });
}
This setup finds tail latency and backpressure problems you won’t see with closed loops.
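The step and spike shapes above map onto k6's ramping-arrival-rate executor. A sketch of a step-then-spike profile (the stage targets are illustrative; derive them from your own baseline and forecast, and reuse the default function from the previous example):

export const options = {
  scenarios: {
    step_then_spike: {
      executor: 'ramping-arrival-rate',
      startRate: 800,          // baseline checkout RPS
      timeUnit: '1s',
      preAllocatedVUs: 3000,
      maxVUs: 8000,
      stages: [
        { target: 960, duration: '5m' },   // +20% step
        { target: 1150, duration: '5m' },  // +20% again
        { target: 1150, duration: '5m' },  // hold
        { target: 2400, duration: '0s' },  // instant 3x spike (promo drop)...
        { target: 2400, duration: '1m' },  // ...held for 60s
        { target: 800, duration: '5m' },   // recover to baseline
      ],
    },
  },
};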
Make the environment honest: data, caches, and third parties
I’ve seen “successful” tests that were just exercising warmed caches with toy data. Then production fell over on the first cold start.
- Data realism:
- Use production-like cardinality: products, prices, promotions, users. Synthetic generation > sanitizing prod when PII risk is high.
- Vary payload sizes: 5th percentile to pathological 99th (e.g., carts with 40 line items).
- Seed test accounts, API keys, and payment tokens at scale.
- Cache behavior:
- Warm caches to expected hit rate before a run; then deliberately invalidate during the test.
- Implement stampede protection: `singleflight`/mutex or request coalescing in nginx/Envoy (see the sketch at the end of this section).
- Third-party dependencies:
- Don’t mock away latency. Use service virtualization with WireMock or Mountebank to inject timeouts, 429s, and jitter.
- Cap outbound concurrency; validate that circuit breakers (Resilience4j, Envoy `circuit_breakers`) actually trip.
- Network conditions:
- Test from regions that matter. Use k6 Cloud or Flood.io for distributed load close to users.
- Shape latency/jitter with `tc` in lower envs if you can’t test in prod.
The goal: when your test passes, you trust it in your bones because it smelled like production.
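For the stampede protection mentioned under cache behavior, the simplest version lives in the application process: concurrent requests for the same key share a single origin call instead of piling onto the database when a hot entry expires. A minimal Node.js sketch (the loader function is a placeholder):

// In-flight map: callers for the same key await the same promise.
const inFlight = new Map();

async function coalesced(key, fetchFn) {
  if (inFlight.has(key)) return inFlight.get(key);
  const promise = fetchFn().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}

// Usage: concurrent requests for the same product share one DB round trip.
// const product = await coalesced('product:' + id, () => loadProductFromDb(id));

At the edge, nginx's `proxy_cache_lock` gives you the same effect for proxied responses.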
Run, observe, decide: a repeatable test playbook
You don’t need a platform team the size of Meta. You need discipline and the right guardrails.
- Instrument: dashboards per journey with RED + business KPIs. Pin p50/p95/p99, error rate, saturation, and conversion proxies side by side. Include “SLO burn rate” panels.
- Baseline: run at current peak for 20 minutes. Capture p95, CPU, DB connections, GC pauses, cache hit ratio. Save as `baseline-YYYYMMDD`.
- Step and spike: increase arrival rate; record when p95 or error rate crosses thresholds. Note the first failure domain (DB waits? thread pool exhaustion? 502 from edge?).
- Soak: hold for 2 hours. Watch for memory climbs, file descriptor leaks, and growing GC pause percent.
- Decide: either you have headroom (≥30% above forecast) or you have work. Open issues with exact breakpoints and owners.
- Automate: bake thresholds into CI (GitHub Actions, GitLab CI). Fail builds on regression.
Example GitHub Actions step for k6 threshold gating:
- name: Run k6
  uses: grafana/k6-action@v0.3.1
  with:
    filename: ./tests/checkout.js
  env:
    K6_CLOUD_TOKEN: ${{ secrets.K6_CLOUD_TOKEN }}
A load test without an automated threshold is a dashboard tour, not a gate.
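The thresholds in the script are what make that step fail the build. k6's long-form threshold syntax can also abort a run early instead of burning 30 minutes on a build that is already over budget (the abort timing here is an assumption; the budgets mirror the checkout SLO):

export const options = {
  thresholds: {
    http_req_failed: ['rate<0.005'],
    http_req_duration: [
      // Non-zero exit fails the CI job; abort early if p95 is still
      // over budget after the first minute of load.
      { threshold: 'p(95)<800', abortOnFail: true, delayAbortEval: '1m' },
    ],
  },
};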
Fix the bottlenecks that actually move p95 (with numbers)
Here are optimizations that routinely pay for themselves, with the deltas we’ve seen in the field.
- Database:
  - Add missing composite indexes (e.g., `orders(user_id, created_at DESC) WHERE status='PAID'`). Result: p95 query time 120ms -> 9ms; checkout p95 900ms -> 640ms.
  - Use HikariCP sane defaults: `maximumPoolSize` = 2 x CPU, `minimumIdle=10`, `connectionTimeout=250ms`. Result: timeouts disappear; thread pool no longer stalls under burst.
  - Put PgBouncer in transaction pooling for spiky traffic. Result: DB connection spikes flatten; error rate -1.5pp at peak.
- Caching:
  - Cache product/pricing responses in Redis with TTL 30–120s. Use `SETNX`/`singleflight` to prevent dogpile (see the sketch after this list). Result: origin RPS -60%, API p95 780ms -> 410ms.
  - Evict by key on price change events; measure cache hit > 85% at peak.
- Async + batching:
  - Move non-critical writes (email, analytics) off the checkout path via Kafka with `acks=all`, `linger.ms=5`, batches up to 64KB. Result: request handler work -30%, p95 620ms -> 520ms.
  - Use the Outbox pattern to keep consistency.
- Backpressure & resiliency:
  - Envoy `local_rate_limit` token bucket per route; `circuit_breakers` for upstreams. Result: protects the DB during spikes; error rate stays <0.5% instead of cascading 5xx.
  - Graceful degradation via feature flags (LaunchDarkly) to disable heavy recommendations at high load.
- JVM and runtime tuning:
  - Switch to G1GC with `-XX:MaxGCPauseMillis=200` and `-XX:MaxRAMPercentage=75`. Result: GC pauses p95 180ms -> 60ms, request p95 down 10–15%.
  - For ultra-low-latency services, test Shenandoah (JDK 17+) where pauses dominate the tail.
- HTTP and edge:
  - Enable Brotli for text; tune nginx `keepalive_requests=1000`, `keepalive_timeout=30s`, `worker_connections=8192`. Result: edge CPU -20%, TTFB -40ms.
  - Move static/media to a CDN with `Cache-Control: max-age=31536000, immutable` and `stale-while-revalidate=60`. Result: origin egress -70%, LCP improves 200–400ms.
- Frontend hygiene:
  - Code-split, kill moment.js for date-fns, lazy-load non-critical widgets, preconnect to critical origins. Result: JS bundle -250KB, LCP -300ms on mobile.
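For the caching item above, a dogpile-protected read path can look like this. A sketch using the node-redis v4 client (key names, TTLs, and the `loadFromDb` loader are placeholders):

import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

// Cache-aside with a short NX lock: only one caller rebuilds an expired
// entry; the rest wait briefly and re-check instead of hammering the DB.
async function getPricing(productId, loadFromDb) {
  const key = `pricing:${productId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const gotLock = await redis.set(`${key}:lock`, '1', { NX: true, PX: 2000 });
  if (!gotLock) {
    await new Promise((resolve) => setTimeout(resolve, 50));
    return getPricing(productId, loadFromDb); // lock holder will have filled the cache
  }

  const fresh = await loadFromDb(productId);
  await redis.set(key, JSON.stringify(fresh), { EX: 60 }); // TTL in the 30-120s range
  return fresh;
}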
Each change must be verified by rerunning the exact scenario that exposed the problem. No “feels faster.”
Put dollars on it: translating latency into business impact
Engineering leaders don’t get budget for p95. They get budget for revenue and risk.
- Conversion math: If checkout p95 drops from 1.2s to 800ms and your historical elasticity shows +0.6% conversion per 100ms at that range, a 400ms improvement on 500k weekly sessions is meaningful revenue.
- Incident avoidance: If stress tests show the system fails at 2.6x baseline and marketing expects 2.2x, you have 18% headroom. Without it, you’re one promo away from an SEV-1. Assign a $/minute based on past incidents.
- Cost-performance trade-off: After caching + DB tuning, you might scale down nodes 25% while meeting the same SLOs. That’s direct cloud savings.
- Capacity planning: Document the breakpoint and revalidate quarterly. Tie infra reservations/commitments to tested headroom, not guesswork.
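The conversion bullet is simple enough to keep as code next to the test report. A back-of-envelope sketch (elasticity, AOV, and session counts are placeholders; pull real numbers from your analytics, and check whether your elasticity is relative or absolute):

// Assumes +0.6% relative conversion uplift per 100ms recovered in the 0.8-1.2s range.
const weeklySessions = 500_000;     // placeholder
const baselineConversion = 0.03;    // placeholder
const averageOrderValue = 90;       // placeholder AOV, $
const latencyRecoveredMs = 400;     // 1.2s -> 800ms
const upliftPer100ms = 0.006;

const newConversion =
  baselineConversion * (1 + upliftPer100ms * (latencyRecoveredMs / 100));
const weeklyRevenueDelta =
  weeklySessions * (newConversion - baselineConversion) * averageOrderValue;

console.log(`~$${Math.round(weeklyRevenueDelta)} extra weekly revenue`);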
Put these numbers in the same slide as your SLO report and watch prioritization get easier.
Make it part of delivery: tests that run when you’re not watching
One heroic test won’t save you next quarter. Institutionalize it.
- GitOps integration: Store test scripts (k6, Locust) alongside services. Changes to infra (Terraform) trigger representative load tests via Argo Workflows or CI.
- Thresholds as code: Keep SLO-aligned thresholds in version control. Example: `http_req_duration: p(95)<800` blocks merges.
- Weekly soak: Run a 2-hour soak Sunday morning. Alert on drift: more GC pauses, slower p95, lower cache hit. This catches memory leaks and “just one more if statement” regressions.
- Prod-on-purpose: For read-heavy paths, run small canary load in production with the k6 operator and strict rate limits. Observe from user regions.
- War-room drills: Quarterly chaos + load day: partial dependency outages with Toxiproxy, then spike load. Validate failover, rate limits, and feature flag degradation.
This is the difference between hoping and knowing. GitPlumbers helps teams wire this into their delivery so it sticks, not squeaks. See how we approach this in our services page and case studies below.
What I’d do differently, every time
- Start with journeys and SLOs; keep infra graphs in a supporting role.
- Use open models to avoid lying to yourself about latency.
- Make third parties real; if they’re flaky in prod, they’re flaky in tests.
- Bake thresholds into CI, not someone’s calendar.
- Treat every optimization as a hypothesis; re-measure the same scenario.
- Keep 30% headroom above forecast. The forecast is always wrong.
If you want a second set of eyes on your load strategy, we’ve probably seen your failure mode before—and the political fight you’ll need to fix it.
Key takeaways
- Design load tests around user journeys and business-critical SLOs, not infrastructure metrics.
- Use an open workload model to avoid coordinated omission; test at and beyond expected arrival rates.
- Measure p50/p95/p99, error rate, and saturation alongside conversion and revenue proxies.
- Build realistic data and dependencies: warm caches, simulate third parties, and model failures.
- Automate thresholds in CI and run weekly soak tests to catch regressions before incidents.
- Apply targeted optimizations (DB, cache, GC, HTTP, backpressure) and re-measure outcomes.
Implementation checklist
- Define user-facing SLOs for each critical journey (e.g., checkout p95 < 800ms, <0.5% errors).
- Choose an open-load tool (`k6`, `wrk2`, `Vegeta`) and model arrival rates realistically.
- Create synthetic-but-realistic data; warm caches; virtualize flaky third parties.
- Instrument RED/USE metrics and business KPIs in Grafana dashboards.
- Run baseline, step, spike, stress, and soak tests; record breakpoints and headroom.
- Bake thresholds into CI; block merges on p95/error regression.
- Tune bottlenecks (DB, caching, GC, backpressure) and validate improvements with the same test.
Questions we hear from teams
- Do we need a production-scale environment to get meaningful results?
- You need production-like behavior more than identical scale. Use realistic data cardinality, warm caches to expected hit rates, and simulate third-party latency/failures. For read-heavy paths, run small, controlled tests in production with rate limits. For write-heavy paths, use service virtualization and shadow traffic to avoid data corruption.
- Open vs. closed workload models—why should I care?
- Closed models tie request rate to response time, masking tail latency (coordinated omission). Open models send requests at a fixed arrival rate regardless of response. Real users are open load. Use tools like k6’s constant-arrival-rate, wrk2, or Vegeta to generate open load and expose backpressure and queueing effects.
- How do I include third parties without risking billable calls?
- Virtualize them with WireMock/Mountebank/Toxiproxy. Record real responses, inject jitter/timeouts/429s, and rate-limit outbound calls. Validate that your timeouts, retries, and circuit breakers behave under load. Then run a small canary against the real provider off-peak to confirm assumptions.
- What metrics should gate a release?
- User-facing ones aligned to SLOs: p95/p99 latency per journey, error rate, and a business proxy (e.g., checkout success). Support with saturation (DB connections, thread pool queue depth) for root cause. Bake thresholds into CI so regressions block merges automatically.
- How often should we re-run load tests?
- At minimum before major promos and quarterly to refresh capacity headroom. Mature teams run weekly soaks and trigger targeted load tests on significant infra or dependency changes (database version upgrades, feature flags that change data access patterns).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.