Stop Chasing RPS: Load Tests That Protect p95, Revenue, and Sleep
Design load tests around user journeys and SLOs, not vanity throughput. Validate behavior under stress, prove business impact, and ship safer.
“If your load test doesn’t move a business metric, it’s theater.”
The promo code that broke checkout
We had a retail client whose promo code went viral at 9:12 PM on a Sunday. Traffic doubled in three minutes, p95 on the checkout API jumped from 650ms to 2.8s, and error rate crept past 5%. The thing is, they’d “passed” load testing the week before at 5k RPS. The tests were perfect—just not representative. They hammered one endpoint with steady traffic and zero failures. Real users don’t.
I’ve seen this fail at unicorns and banks. The fix isn’t more RPS. It’s designing load tests that validate real user behavior under stress—tied to SLOs and business metrics—so you catch cracks before production does.
Start with user-facing metrics, not RPS
If your test plan doesn’t mention p95 and conversion, it’s theater.
- Define SLOs from the user’s point of view:
  - Web/API: p95 latency and error rate per key journey (login, search, add-to-cart, checkout)
  - Mobile: TTI/FCP/LCP budgets (Core Web Vitals) and offline behavior
  - Business: checkout conversion, drop-off rate, retries per order, support tickets
- Set explicit budgets by journey. Example:
  - Search p95 < 400ms, 99th < 800ms, error rate < 0.5%
  - Checkout p95 < 800ms, 99th < 1.5s, error rate < 1%
  - Apdex ≥ 0.95 with T = 200ms
- Tie infra metrics to user pain:
  - Thread pool saturation correlates with 99th-percentile spikes
  - GC pauses, DB lock waits, and cache misses predict conversion drops
Write these SLOs on the first slide of the test plan and again on the last. Everything else is implementation.
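For reference, the Apdex score in those budgets is (satisfied + tolerating/2) / total, where satisfied requests finish within T and tolerating within 4T. A minimal sketch with illustrative latencies:

```python
def apdex(latencies_ms, T=200):
    """Apdex = (satisfied + tolerating/2) / total.
    satisfied: latency <= T; tolerating: T < latency <= 4*T."""
    satisfied = sum(1 for l in latencies_ms if l <= T)
    tolerating = sum(1 for l in latencies_ms if T < l <= 4 * T)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Illustrative sample: two requests blow past 4T entirely
samples = [120, 180, 250, 900, 150, 300, 95, 1200, 210, 170]
print(round(apdex(samples, T=200), 2))  # → 0.65
```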
Model realistic load: arrival rates, journeys, and data
Successful tests simulate how users behave, not how load generators like to send traffic.
- Use arrival rate (requests per second arriving independently), not just VUs. Real traffic clusters and bursts.
- Include think time and user journeys: SSR + API calls + static assets.
- Mix cache hits and misses. Test the cold start path.
- Vary payload sizes and data cardinality (e.g., carts with 1 item vs 30).
- Tools that don’t fight you: k6, Locust, Gatling, Artillery. JMeter works, but keep it under control.
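The difference between a fixed VU loop and an open arrival model is easy to see in a sketch: independent arrivals are Poisson, so inter-arrival gaps are exponential and naturally cluster and burst. Illustrative only (rate and duration are made up):

```python
import random

def poisson_arrivals(rate_per_s, duration_s, seed=42):
    """Open-model arrivals: exponential inter-arrival gaps with mean
    1/rate, so traffic clusters and bursts instead of marching in
    lockstep like a fixed pool of looping VUs."""
    random.seed(seed)
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# 50 req/s for 10s: roughly 500 arrivals, unevenly spaced
arrivals = poisson_arrivals(rate_per_s=50, duration_s=10)
print(len(arrivals))
```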
Example k6 script with arrival-rate, thresholds, and tagging:
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  scenarios: {
    checkout_journey: {
      executor: 'ramping-arrival-rate',
      startRate: 50,
      timeUnit: '1s',
      preAllocatedVUs: 100,
      stages: [
        { duration: '3m', target: 200 }, // ramp to steady
        { duration: '5m', target: 200 }, // steady
        { duration: '1m', target: 500 }, // spike
        { duration: '10m', target: 200 }, // recovery/soak
      ],
      tags: { journey: 'checkout', env: 'staging', source: 'loadtest' },
    },
  },
  thresholds: {
    'http_req_duration{journey:checkout}': ['p(95)<800', 'p(99)<1500'],
    'http_req_failed{journey:checkout}': ['rate<0.01'],
  },
};

const BASE = __ENV.BASE_URL;

export default function () {
  const params = { headers: { 'X-LoadTest': 'true' } };

  const hp = http.get(`${BASE}/`, params);
  check(hp, { 'home 200': (r) => r.status === 200 });
  sleep(1);

  const cart = http.post(
    `${BASE}/api/cart`,
    JSON.stringify({ sku: 'SKU123', qty: 1 }),
    { headers: { ...params.headers, 'Content-Type': 'application/json' } }
  );
  check(cart, { 'cart 201': (r) => r.status === 201 });
  sleep(0.5);

  const checkout = http.post(
    `${BASE}/api/checkout`,
    JSON.stringify({ method: 'card' }),
    { headers: { ...params.headers, 'Content-Type': 'application/json' } }
  );
  check(checkout, { 'checkout 200': (r) => r.status === 200 });
  sleep(Math.random() * 2);
}
```

Turn up the heat: spikes, soaks, and failure injection
Steady load hides bugs. Real systems fail at the edges.
- Spike tests: 3–5x traffic for short bursts to validate autoscaling and backpressure.
- Soak tests: 2–4 hours steady to catch leaks, jitter, and scheduler drift.
- Breakpoint tests: increase arrival rate until SLOs break; note the cliff.
- Failure injection: introduce latency, drop packets, throttle dependencies.
Examples:
- Simulate network issues locally with `tc`:

```bash
sudo tc qdisc add dev eth0 root netem delay 120ms 40ms distribution normal loss 0.5% rate 20mbit
# ... run load ...
sudo tc qdisc del dev eth0 root
```

- Use `toxiproxy` to degrade a dependency:

```bash
docker run --rm -p 8474:8474 -p 8666:8666 shopify/toxiproxy
# Create a proxy from localhost:8666 to redis:6379 and add 200ms latency
curl -s -XPOST localhost:8474/proxies -d '{"name":"redis","listen":"0.0.0.0:8666","upstream":"redis:6379"}'
curl -s -XPOST localhost:8474/proxies/redis/toxics -d '{"name":"lag","type":"latency","attributes":{"latency":200,"jitter":50}}'
```

- Stress critical dependencies, not just the edge. If your auth provider or payment gateway rate limits, model that. I’ve watched OAuth token refresh storms take down otherwise healthy clusters.
See what users would see: observability that matters
If you can’t correlate test phases to user metrics, you’re flying blind.
- Tag load-test traffic with headers (`X-LoadTest: true`) and propagate via `traceparent`. Sample at 100% for those traces.
- Build dashboards by journey with p50/p95/p99, error rate, and request volume.
- Add business KPIs: orders placed/min, auth failures/min, retries/order, drop-off at each funnel stage.
- Use Prometheus and tracing (Jaeger/Tempo) to follow a request across services and DB.
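Keeping synthetic traffic out of business KPIs is mostly a labeling problem. A tiny sketch (the header matches the tag above; `metric_labels` is a hypothetical helper you would call wherever metrics are recorded):

```python
def metric_labels(headers):
    """Split metrics by traffic source so load-test requests can be
    excluded from business dashboards with a PromQL label matcher,
    e.g. {source!="loadtest"}."""
    source = "loadtest" if headers.get("X-LoadTest") == "true" else "user"
    return {"source": source}

print(metric_labels({"X-LoadTest": "true"}))    # {'source': 'loadtest'}
print(metric_labels({"User-Agent": "Mozilla"}))  # {'source': 'user'}
```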
PromQL you’ll actually use:
```promql
# p95 latency for checkout route over 5m
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])))

# Error rate for 5xx on checkout
sum(rate(http_requests_total{route="/checkout",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{route="/checkout"}[5m]))

# Saturation: DB connection usage
max(pg_stat_activity_count{db="app"}) / max(pg_settings_max_connections{db="app"})
```

Pro tip: annotate dashboards with test phases (ramp, spike, soak) and changes (deploys, feature flags). When p95 jumps on spike but not on soak, that’s a scaling/queueing problem, not a leak.
Fix the bottlenecks with measurable wins
This is where most teams flail. Here’s what actually moves the needle and how to prove it.
- Database
  - Use `pg_stat_statements` to find hot queries; index for WHERE/ORDER BY patterns.
  - Add `pgbouncer` in transaction mode to cap connection storms.

```sql
-- Find top queries by total time
SELECT query, total_exec_time, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Example index that saved 400ms p95 on search
CREATE INDEX CONCURRENTLY idx_products_lower_name ON products (lower(name));
```

- Caching
  - Cache read-heavy endpoints at CDN/edge for 30–300s with `stale-while-revalidate`.
  - Use Redis for query result caching with strict TTLs and key versioning to avoid stampedes.

```nginx
# Nginx/Ingress snippet: cache product pages
proxy_cache_valid 200 301 302 300s;
add_header Cache-Control "public, s-maxage=300, stale-while-revalidate=60";
```

- Backpressure and timeouts
- Set per-service timeouts; fail fast before thread pools exhaust.
- Add circuit breakers and outlier detection (Istio/Envoy) to quarantine bad pods or dependencies.
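Outside a mesh, the same fail-fast behavior can live in application code. A minimal, illustrative breaker (thresholds and the `call` wrapper are assumptions, not a production library; Envoy/Istio give you this plus outlier detection for free):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; reject calls
    (fail fast) until `reset_after` seconds pass, then allow one probe."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrap each dependency call in `call` and pair it with a hard client timeout, so threads are released long before the pool exhausts.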
```yaml
# Istio DestinationRule with connection pool + outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout-api
spec:
  host: checkout.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
        idleTimeout: 5s
        maxRetries: 2
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_REQUEST
```

- Autoscaling and queueing
- Scale on arrival rate or queue depth, not just CPU. Use HPA v2 with custom metrics.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 5
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "20"
```

- Network and payloads
- Turn on HTTP/2, gzip/brotli, and shrink payloads. I’ve dropped p95 by 200ms by killing a chatty JSON field.
Measure before/after on the same workload. If checkout p95 drops from 1.8s to 650ms and error rate from 3% to 0.4%, run the A/B in production at 10% via canary to confirm conversion moves accordingly.
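Confirming that conversion moved takes more than eyeballing a dashboard; a two-proportion z-test on control vs canary sessions is enough for a first pass. A stdlib-only sketch (session and conversion counts are illustrative):

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates
    between control (a) and canary (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled std error
    z = (p_b - p_a) / se
    # Normal CDF via erf; p-value for a two-sided test
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control: 600/20k sessions convert (3.0%); canary: 670/20k (3.35%)
z, p = two_proportion_z(600, 20000, 670, 20000)
print(f"z={z:.2f} p={p:.4f}")
```

At these illustrative volumes the lift clears p < 0.05; with thinner traffic it would not, which is exactly why the check matters.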
Bake it into delivery: gates, canaries, and budgets
If performance is a one-off exercise, it will regress. Automate it.
- CI gate with k6 thresholds. Fail fast when budgets break.
- Canary with automated analysis (Argo Rollouts + Prometheus/Kayenta) using your SLOs.
- Synthetic probes run 24/7 to detect drift.
GitHub Actions example running k6 and failing on thresholds:
```yaml
name: perf-check
on: [push]
jobs:
  k6:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: k6 run
        uses: grafana/k6-action@v0.3.1
        with:
          filename: tests/checkout.k6.js
        env:
          BASE_URL: https://staging.example.com
```

Argo Rollouts canary with Prometheus checks:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: p95-checkout-slo
            args:
              - name: slo
                value: "0.8" # 800ms
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
```

Define failure actions: if p95 > budget or error rate > 1%, abort rollout and page the on-call. I’ve watched teams save hours of MTTR by letting automation do the boring (and fast) decision-making.
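The abort rule itself fits in a few lines and can back an analysis template or a paging script. A sketch (budgets mirror the checkout SLO above; fetching live values via PromQL is left out, so the inputs here are plain arguments):

```python
P95_BUDGET_S = 0.8   # checkout p95 budget, matching the SLO above
ERROR_BUDGET = 0.01  # 1% error-rate budget

def should_abort(p95_s, error_rate):
    """Return (abort, reasons) for a canary window. In real use the
    inputs come from PromQL queries like the ones earlier in this post."""
    reasons = []
    if p95_s > P95_BUDGET_S:
        reasons.append(f"p95 {p95_s:.2f}s > {P95_BUDGET_S}s")
    if error_rate > ERROR_BUDGET:
        reasons.append(f"errors {error_rate:.1%} > {ERROR_BUDGET:.0%}")
    return bool(reasons), reasons

abort, why = should_abort(p95_s=1.2, error_rate=0.004)
print(abort, why)
```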
What good looks like: outcomes you can take to the CFO
This is what “done” looks like after a three-week sprint with a client who thought they needed to double their Kubernetes nodes:
- Checkout p95: 1.9s → 620ms (67% faster)
- Error rate: 2.7% → 0.5%
- Conversion: +3.2% (A/B canary at 10% traffic, p<0.05)
- Infra cost: −18% (CPU steady-state dropped after fixing N+1 and adding edge caching)
- MTTR during peak: 45m → 12m (automated rollback on SLO breach)
If your load test plan doesn’t produce a delta like this, it’s not aligned with the business.
- Pick journeys and SLOs.
- Model realistic arrival + failure modes.
- Observe like a user (p95, errors, conversion), not a kernel engineer.
- Fix the bottlenecks that move those numbers.
- Automate the checks in CI and canary.
We do this for teams that have already been burned by “just add more pods.” GitPlumbers shows up, measures, fixes, and leaves your team with a playbook you can run before every peak. You sleep; your revenue doesn’t dip.
Key takeaways
- Tie load tests to SLOs and conversion metrics; RPS without user context is noise.
- Model arrival rates and real user journeys; validate under spike, soak, and failure modes.
- Instrument observability to correlate test phases with p95, error rate, and business KPIs.
- Use concrete fixes (indexes, caching, circuit breakers, autoscaling) and prove impact with before/after metrics.
- Automate performance gates in CI/CD and canary; don’t rely on hope during peak traffic.
Implementation checklist
- Define user-centric SLOs (e.g., checkout p95 < 800ms, error rate < 1%).
- Model realistic traffic: arrival rate, think time, cache hit/miss mix, data cardinality.
- Include spike, ramp, and soak tests; inject failures (latency, packet loss, dependency timeouts).
- Tag and trace load-test traffic; correlate with Prometheus and tracing (Jaeger/Tempo).
- Set thresholds and budgets; gate releases with k6 thresholds and canary analysis.
- Prioritize fixes with measured wins (DB, cache, backpressure, autoscaling).
- Re-run with the same workload; compare apples-to-apples dashboards.
- Document a runbook: when to abort, who to page, rollback triggers.
Questions we hear from teams
- What’s the simplest way to start if we’ve never done proper load testing?
- Pick one journey (checkout), define p95/error SLOs, write a k6 arrival-rate test, and build one Grafana dashboard that shows p95, error rate, and orders/minute. Run a 15-minute ramp + 10-minute soak. Fix one bottleneck. Repeat.
- How do we avoid polluting production metrics with load test traffic?
- Tag requests with headers (X-LoadTest), trace IDs, or a dedicated tenant ID. Filter in PromQL via labels and send load-test spans to a separate trace index. Always annotate dashboards and time windows.
- Do we test in prod or staging?
- Both. Staging for breakpoints and failure injection. Prod for low-traffic canary and synthetic checks to validate real dependencies (CDN, Auth, Payments). Never exceed agreed budgets in prod; use allowlists and off-peak windows.
- What if our bottleneck is a third-party API?
- Rate-limit upstream calls, batch when possible, cache responses with TTLs, and add circuit breakers/outlier detection. Negotiate higher SLAs and concurrency limits informed by your measurements. Model their fail modes in tests.
- How do we quantify business impact from performance work?
- Run A/B canaries with the same workload and compare conversion, revenue/order, and drop-off. Use statistical tests where possible. Tie infra changes to $ by multiplying conversion lift by traffic and AOV. Show cost reductions from right-sized infra.
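That lift-times-traffic-times-AOV arithmetic, as a sketch (all numbers are placeholders):

```python
def monthly_revenue_impact(sessions, lift_pp, aov):
    """Extra monthly revenue from a conversion lift:
    sessions x lift x average order value.
    lift_pp is the absolute lift in percentage points (0.1 = +0.1pp)."""
    extra_orders = sessions * (lift_pp / 100)
    return extra_orders * aov

# 2M sessions/mo, 3.0% -> 3.1% conversion (+0.1pp), $85 average order
print(monthly_revenue_impact(2_000_000, 0.1, 85))  # ≈ $170k/month
```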
