The Day Your Checkout Hit 800ms: Capacity Planning That Predicts Scale Before Customers Feel It

A pragmatic model for forecasting scaling needs using user-facing SLOs, real utilization curves, and business KPIs—without betting the company on hand-wavy “requests per second.”

Capacity planning that ignores p95 latency is just budgeting for outages.

The failure mode: “We scaled CPU and still melted”

I’ve watched this exact movie at least a dozen times: leadership sees a traffic forecast, someone multiplies “requests per second” by “pods,” HPA scales on cpu: 70%, and the site still crawls. Checkout p95 goes from 220ms to 800ms, support tickets spike, and the incident channel fills with people arguing whether it’s Istio, GC, or “AWS being AWS.”

Here’s the uncomfortable truth: most capacity planning models predict infrastructure consumption, not user pain. Customers don’t care that you had 35% CPU. They care that the “Place Order” button spins.

At GitPlumbers, the models that actually work start with user-facing SLOs and work backward to capacity—because the business impact is tied to latency and errors, not pod counts.

Start with the only metrics that matter: user journeys + money

If your model can’t answer “how close are we to hurting conversion?” it’s not a capacity plan—it’s a hardware wishlist.

Pick 3–5 critical user journeys and define SLOs:

  • Checkout: p95 < 300ms, p99 < 700ms, 5xx < 0.1%
  • Search: p95 < 250ms, timeout < 0.2%
  • Login: p95 < 200ms, 4xx (auth failures) < 0.5%

Then tie them to business KPIs:

  • Conversion rate and AOV (average order value)
  • Revenue per minute during peak
  • Churn / session abandonment

A simple (and surprisingly effective) translation layer:

If checkout p95 increases by 100ms, conversion drops 0.3–1.0% depending on your funnel.

You can estimate this with your own data:

-- Example: correlate latency with conversion for checkout sessions
-- (adapt to your warehouse schema)
WITH sessions AS (
  SELECT
    session_id,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY checkout_latency_ms) AS p95_latency,
    MAX(CASE WHEN event = 'purchase_completed' THEN 1 ELSE 0 END) AS converted
  FROM analytics.events
  WHERE event IN ('checkout_request','purchase_completed')
    AND ts >= NOW() - INTERVAL '30 days'
  GROUP BY session_id
)
SELECT
  WIDTH_BUCKET(p95_latency, 0, 2000, 20) AS latency_bucket,
  COUNT(*) AS sessions,
  AVG(converted) AS conversion_rate
FROM sessions
GROUP BY latency_bucket
ORDER BY latency_bucket;

Once you have a curve (even a rough one), you can express impact as $ per 100ms at peak. That’s how you get budget for the work.
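
As a quick sketch of that translation (every number below is hypothetical; plug in your own funnel data and latency curve):

// Hypothetical inputs: replace with your own funnel numbers.
const peakOrdersPerMinute = 120;        // completed checkouts per minute at peak
const averageOrderValue = 85;           // dollars
const conversionDropPer100ms = 0.005;   // 0.5% relative drop in completed orders per +100ms of p95

// Revenue at risk for every 100ms of added checkout p95 latency, per minute of peak.
const revenuePerMinute = peakOrdersPerMinute * averageOrderValue;           // $10,200/min
const dollarsPer100msPerMinute = revenuePerMinute * conversionDropPer100ms;

console.log(`~$${dollarsPer100msPerMinute.toFixed(0)} per minute at peak for each +100ms of p95`);
// => ~$51/minute, roughly $3k/hour if the regression persists through peak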

Build service curves from reality (not vibes)

Capacity planning gets accurate when you stop assuming linear scaling.

For each critical service/tier, build a service curve:

  • X-axis: offered load (RPS, jobs/sec, messages/sec)
  • Y-axis: p95 latency, plus saturation signals (CPU, memory, DB waits, queue depth)

You can generate this from:

  • Controlled load tests (k6, Locust, vegeta) against a staging environment with production-like data
  • Production telemetry during known peaks (Black Friday, product drops, month-end batch)

A quick k6 skeleton that ramps predictably:

import http from 'k6/http';

export const options = {
  scenarios: {
    ramp: {
      // Arrival-rate executor holds the offered load steady even as latency grows,
      // which is what you want when mapping load to a service curve.
      executor: 'ramping-arrival-rate',
      startRate: 50,        // iterations per timeUnit at the start of the test
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 1000,
      stages: [
        { target: 100, duration: '2m' },
        { target: 200, duration: '2m' },
        { target: 400, duration: '2m' },
        { target: 600, duration: '2m' },
      ],
    },
  },
};

export default function () {
  // No sleep(): with an arrival-rate executor, k6 controls pacing, not the VU loop.
  http.post('https://api.example.com/checkout', JSON.stringify({ cartId: '123' }), {
    headers: { 'Content-Type': 'application/json' },
  });
}
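
Run the script with k6 run while you capture server-side p95, CPU, DB waits, and queue depth for the same window; client-side latency alone won't tell you which tier is saturating.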

Then plot:

  • checkout_api p95 latency vs RPS
  • DB p95 query latency vs QPS
  • Redis hit rate vs latency
  • Queue depth vs processing time

If you use Prometheus, you can pull p95 and saturation in one dashboard-friendly query set:

# p95 latency for checkout handler
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)

# error rate
sum(rate(http_requests_total{route="/checkout",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{route="/checkout"}[5m]))

# queue depth (example: RabbitMQ)
rabbitmq_queue_messages_ready{queue="checkout"}

What you’re looking for is the knee in the curve: the point where adding load causes latency to go nonlinear. That knee—plus headroom—is your real capacity.
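
If you want to flag the knee programmatically, a rough sketch over (RPS, p95) samples from a ramped test looks like this; the data points and the slope-jump threshold are illustrative:

// Offered load vs observed p95 from a ramped load test (illustrative data).
const samples = [
  { rps: 100, p95Ms: 180 },
  { rps: 200, p95Ms: 190 },
  { rps: 300, p95Ms: 210 },
  { rps: 400, p95Ms: 260 },
  { rps: 500, p95Ms: 420 },
  { rps: 600, p95Ms: 900 },
];

// Return the last point before the marginal latency cost (ms per extra RPS) jumps sharply.
function findKnee(points, slopeJumpFactor = 3) {
  for (let i = 2; i < points.length; i++) {
    const prevSlope = (points[i - 1].p95Ms - points[i - 2].p95Ms) / (points[i - 1].rps - points[i - 2].rps);
    const slope = (points[i].p95Ms - points[i - 1].p95Ms) / (points[i].rps - points[i - 1].rps);
    if (prevSlope > 0 && slope > prevSlope * slopeJumpFactor) return points[i - 1];
  }
  return null; // no knee found in the tested range
}

console.log(findKnee(samples)); // => { rps: 400, p95Ms: 260 } with this data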

Translate demand to capacity using concurrency (Little’s Law saves lives)

RPS is a trap because it ignores response time. The thing that blows up systems is usually concurrency: inflight requests, open DB connections, threads, queue consumers.

Use Little’s Law:

  • L = λ * W
  • L = average concurrency (inflight)
  • λ = throughput (requests/sec)
  • W = average time in system (sec)

Example:

  • Forecast peak: 500 RPS on checkout
  • Target p95: 300ms (0.3s) under load

Minimum concurrency just for steady state (using the p95 target as a conservative stand-in for average time in system):

  • L = 500 * 0.3 = 150 inflight requests

Now add reality:

  • retries during partial failures
  • GC pauses
  • cold caches after deploy
  • “thundering herd” after a slow downstream

We typically model 1.5–2.5x concurrency headroom depending on how retry-happy the stack is (looking at you, default axios + circuit-breaker-free microservices).

So you plan for:

  • 225–375 inflight at peak

That number maps directly to:

  • node/java worker pools
  • nginx upstream connections
  • DB connection pools (HikariCP, pgBouncer)
  • queue consumer counts

This is where AI-generated “vibe-coded autoscaling” fails in practice: it scales pods, but your maxPoolSize=20 stays fixed and the DB falls over.
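
Here is that arithmetic as a tiny sketch you can adapt; the worker-thread and pool numbers are placeholders, not recommendations:

// Little's Law: L = λ * W (concurrency = throughput * time in system).
const peakRps = 500;            // forecast peak throughput
const targetLatencySec = 0.3;   // p95 target, used as a conservative stand-in for W
const headroomFactor = 2.0;     // retries, GC pauses, cold caches, thundering herds

const steadyStateInflight = peakRps * targetLatencySec;                   // 150
const plannedInflight = Math.ceil(steadyStateInflight * headroomFactor);  // 300

// Map planned concurrency onto the knobs that actually enforce it (placeholder values).
const workerThreadsPerPod = 50;
const podsForConcurrency = Math.ceil(plannedInflight / workerThreadsPerPod);  // 6
const dbHoldShare = 0.4;  // fraction of in-flight requests holding a DB connection at any instant
const dbConnectionsNeeded = Math.ceil(plannedInflight * dbHoldShare);         // 120 across all pods

console.log({ steadyStateInflight, plannedInflight, podsForConcurrency, dbConnectionsNeeded });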

A concrete model: from forecast to pods, DB, and dollars

Here’s a simplified model that engineering leaders can actually sanity-check.

  1. Demand forecast (from product/marketing):

    • Peak sessions: 120k/hour
    • Checkout attempts: 18% of sessions
    • Completion: 60% of attempts
  2. Convert to throughput:

    • Checkout attempts/hour = 120k * 0.18 = 21.6k/hour
    • Attempts/sec ≈ 21,600 / 3600 = 6 RPS

That’s deceptively small, right? Now add burstiness (campaigns aren’t smooth), and dependency fan-out:

  • Peak burst factor (observed): 8x
  • Each checkout hits:
    • checkout-api: 1 request
    • pricing: 2 calls
    • inventory: 1 call
    • payments: 1 call
    • DB: ~12 queries

So effective tier load:

  • checkout-api: 6 * 8 = 48 RPS
  • DB QPS: 48 * 12 = 576 QPS
  3. Service curve constraint (from your tests/telemetry):

    • checkout-api pod (2 vCPU) stays under p95 < 300ms up to 25 RPS/pod
    • DB (RDS db.r6g.large) stays under p95 query latency < 20ms up to 450 QPS before IO waits spike
  4. Capacity calculation:

    • App pods needed: 48 / 25 ≈ 1.9 → 2 pods (plan for 3 with N+1 headroom)
    • DB is the bottleneck: 576 / 450 = 1.28 → need a bigger instance or fewer queries per checkout
  5. Optimization options with measurable outcomes:

    • Add a Redis cache for pricing lookups (observed 40% repeat reads)
      • Expected DB QPS reduction: pricing queries * hit rate
      • If pricing is 4 of 12 queries and you hit a 70% cache rate:
        • QPS saved ≈ 48 RPS * 4 * 0.7 ≈ 134 QPS
        • New DB QPS ≈ 576 - 134 ≈ 442 QPS (back under the knee)
    • Batch inventory reads (one query per checkout group instead of one per checkout)
    • Add pgBouncer to absorb connection storms

This is the punchline: capacity planning often turns into performance work, and the model tells you where you’ll get the biggest business win.
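
To make that arithmetic easy to sanity-check (or argue with), here is the same model as a short script; every coefficient is just the illustrative number from above, so swap in your own measurements:

// Demand forecast (from product/marketing)
const peakSessionsPerHour = 120_000;
const checkoutAttemptRate = 0.18;
const burstFactor = 8;                 // observed peak-over-average multiplier

// Dependency fan-out and service curves (from load tests / telemetry)
const dbQueriesPerCheckout = 12;
const podRpsAtSlo = 25;                // checkout-api RPS per 2-vCPU pod while p95 < 300ms
const dbQpsKnee = 450;                 // DB QPS before IO waits spike

// Forecast -> throughput
const attemptsPerSecond = (peakSessionsPerHour * checkoutAttemptRate) / 3600;  // 6 RPS
const peakApiRps = attemptsPerSecond * burstFactor;                            // 48 RPS
const peakDbQps = peakApiRps * dbQueriesPerCheckout;                           // 576 QPS

// Capacity and bottleneck check
const podsNeeded = Math.ceil(peakApiRps / podRpsAtSlo);  // 2 (plan 3 for N+1)
const dbLoadVsKnee = peakDbQps / dbQpsKnee;              // 1.28 -> over the knee

// Option: cache pricing lookups (4 of 12 queries, 70% hit rate)
const cachedQps = peakApiRps * 4 * 0.7;                  // ~134 QPS
const dbQpsWithCache = peakDbQps - cachedQps;            // ~442 QPS, back under the knee

console.log({ peakApiRps, peakDbQps, podsNeeded, dbLoadVsKnee, dbQpsWithCache });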

In real engagements, we’ve seen:

  • p95 checkout latency: 480ms → 240ms after caching + query plan fixes
  • Peak incident rate during campaigns: down 60–80%
  • Infra cost: flat despite traffic growth (because you stop brute-force scaling)

Autoscaling that doesn’t DDoS your own database

CPU-based autoscaling is fine until it isn’t. When you scale stateless pods without guarding stateful dependencies, you create a self-inflicted outage.

What actually works:

  • Scale on saturation signals:

    • request queue depth / backlog
    • upstream latency (p95)
    • threadpool/worker utilization
    • DB waits (pg_stat_activity, Innodb_row_lock_time, RDS ReadLatency)
  • Put hard caps in place (connection pool sizes, pgBouncer limits, per-route concurrency limits) so newly scaled pods can't stampede the DB

A Kubernetes HPA example that scales on latency via a custom metric (this assumes an adapter such as prometheus-adapter is exposing http_p95_latency_ms through the custom metrics API) and includes sane limits:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_p95_latency_ms
        target:
          type: AverageValue
          averageValue: "250"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

And if you’re using Karpenter/Cluster Autoscaler: pre-scale nodes for known events (campaigns, batch windows) so you’re not waiting for EC2 provisioning while customers rage-refresh.

Keep the model honest: forecast vs actual, every week

The difference between a “capacity plan” and a living system is whether you revisit it.

The cadence we push at GitPlumbers:

  1. Weekly 30-minute review:
    • forecasted peak vs actual peak
    • predicted p95 vs actual p95
    • incidents / near-misses (error budget burn)
  2. Update coefficients:
    • burst factor
    • cache hit rates
    • per-request DB query counts
  3. Re-run the next 4–8 weeks of projections

A simple drift check in bash (pulling from Prometheus) is enough to keep people honest:

# Example: compare last week's peak RPS to model assumptions
curl -G "https://prom.example.com/api/v1/query" \
  --data-urlencode 'query=max_over_time(sum(rate(http_requests_total{job="checkout-api"}[1m]))[7d:1m])'
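
If you want that comparison automated, a minimal Node 18+ sketch along these lines works; the Prometheus URL, job label, assumed model peak, and 20% drift threshold are all placeholders:

// Compare last week's observed peak RPS against the model's assumed peak.
const PROM_URL = 'https://prom.example.com/api/v1/query';
const MODEL_PEAK_RPS = 48; // whatever your capacity model assumed

const query =
  'max_over_time(sum(rate(http_requests_total{job="checkout-api"}[1m]))[7d:1m])';

const res = await fetch(`${PROM_URL}?query=${encodeURIComponent(query)}`);
const body = await res.json();
const observedPeak = Number(body.data.result[0]?.value[1] ?? 0);

const drift = (observedPeak - MODEL_PEAK_RPS) / MODEL_PEAK_RPS;
if (Math.abs(drift) > 0.2) {
  console.warn(`Model drift ${(drift * 100).toFixed(0)}%: observed ${observedPeak.toFixed(0)} RPS vs assumed ${MODEL_PEAK_RPS}`);
} else {
  console.log(`Model within tolerance: observed peak ${observedPeak.toFixed(0)} RPS`);
}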

If the model is consistently wrong, that’s not a math failure—it’s a signal that:

  • product behavior changed
  • a dependency got slower
  • a “small” feature added hidden fan-out (classic microservices tax)

Either way, you found it before customers did.

When you want this done fast (and without heroics)

Capacity planning is one of those problems that looks easy until it isn’t—because the hard part is teasing out the real bottleneck and the real business driver.

If you want a second set of eyes, GitPlumbers does performance + scalability rescues where we:

  • baseline user-journey SLOs
  • build service curves from your telemetry and targeted load tests
  • deliver a capacity model your team can run without us
  • implement the top 2–3 optimizations with measurable impact (latency, error budget, cost)

You don’t need a 6-month “platform initiative.” You need a model that predicts the next failure mode and the fixes that buy you time.


Key takeaways

  • Capacity planning is an SLO problem first, a hardware problem second—model p95/p99 latency and errors against load.
  • Use empirically derived service curves (load vs latency/utilization) from production + load tests; don’t assume linear scaling.
  • Forecast demand with business drivers (traffic, conversions, batch jobs, partner spikes) and translate it to per-service concurrency using Little’s Law.
  • Scale triggers should be based on saturation signals (queue depth, threadpool exhaustion, DB waits), not just CPU.
  • Bake headroom into the model (N+1, deploy overhead, cache warmups) and validate weekly with “forecast vs actual” drift checks.

Implementation checklist

  • Define user-facing SLOs per critical journey (login, search, checkout) with p95/p99 latency and error rate targets.
  • Identify the true bottleneck tier(s) by correlating latency with saturation signals (DB waits, queue depth, connection pool, GC).
  • Generate service curves from controlled load tests and production telemetry (throughput → latency/CPU/memory/IO).
  • Convert forecasted traffic into concurrency using Little’s Law and validate against observed concurrency.
  • Model headroom: failure tolerance (N+1), deploy overhead, cache warmup, retry storms, and batch contention.
  • Set autoscaling guardrails: scale on saturation metrics; cap scaling to avoid DB collapse; pre-scale for known events.
  • Review weekly: compare predicted vs actual, update coefficients, and retire stale assumptions.

Questions we hear from teams

Why isn’t CPU-based autoscaling enough for capacity planning?
Because user pain shows up first in saturation signals—queue depth, DB waits, connection pool exhaustion, and tail latency. CPU can sit at 40% while p95 blows up due to lock contention, downstream throttling, or retries.
How much headroom should we plan for?
Commonly 30–50% for steady growth, plus explicit buffers for N+1 failures, deploy overhead, cache warmups, and retry storms. In systems with aggressive retries or fragile downstreams, 2x concurrency headroom is not unusual.
Do we need a full-blown queuing theory model?
Not at first. Start with empirically measured service curves and Little’s Law to convert throughput to concurrency. Add more sophistication only where the curve shows nonlinear behavior or where stateful tiers (DB/queues) dominate.
What’s the fastest way to improve capacity without adding hardware?
Reduce fan-out and stateful load: cache high-repeat reads (pricing/catalog), eliminate N+1 queries, add pagination limits, batch writes, and fix slow queries. These usually improve both p95 latency and cost per request.

