Capacity Planning That Doesn’t Lie: Predict Scale With Leading Indicators, Not Dashboards
If your model needs a slide deck to explain, it won’t save your on-call. Here’s the lean blueprint we use to predict capacity needs and wire telemetry into autoscaling, triage, and rollouts that don’t blow up at 2 a.m.
Capacity planning that can’t predict burn rate or queue growth is just a dashboard with a résumé.
The Friday deploy that finally forced honest capacity planning
I watched a payments API on GKE faceplant at 5:12 p.m. on a Friday. Dashboards were green. `CPU%` hovered at 55. The autoscaler was “healthy.” Then p99 jumped 3x, Kafka lag spiked, thread pools starved, and the on-call got paged into a goose chase. The root cause wasn’t a mystery: a rising `consumer_lag` slope + `container_cpu_cfs_throttled_seconds_total` + a slow bake of new code that increased service time. None of that shows up in a vanity `CPU%` widget.
That weekend we ripped out our “capacity model” spreadsheet and built a real one tied to leading indicators and automated rollouts. The next peak? Zero sev-1s, predictable spend bump, and no heroics. Here’s the playbook we use at GitPlumbers when a team is done being surprised by scale.
Track signals that actually predict incidents
Stop staring at `CPU%` and request counts in isolation. The predictors that have saved my bacon repeatedly are saturation, backpressure, tail drift, and burn rate.
Web/API services:

- `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))` – tail-latency drift, not instantaneous spikes.
- `sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)` – CPU throttling = silent performance killer on bursty workloads.
- `sum(rate(process_runtime_gc_pause_seconds_sum[5m]))` or `jvm_gc_pause_seconds_sum` – GC pauses predict tail blowups.
- `sum(rate(threadpool_queue_length[1m])) by (pool)` or blocked `go_goroutines` – thread pool starvation precedes 5xx.
Data pipelines / Kafka:

- `kafka_consumergroup_lag{group="payments"}` and `deriv(kafka_consumergroup_lag[10m])` – lag slope is the early warning.
- `sum(rate(kafka_network_requestmetrics_requests_total{request="Produce"}[5m]))` vs. broker disk `iowait` – broker saturation predicts backlogs.
- `job_queue_depth` and `deriv(job_queue_depth[5m])` – queue growth under a steady arrival rate means service time just got slower.
Databases:

- `db_connections_in_use` vs. `max_connections` and `lock_waits_total` – connection pool saturation and lock contention.
- `pg_stat_activity` long transactions and `checkpoint_write_time` – write amplification predicts nasty latency cliffs.
Inference/GPU services:

- `gpu_mem_used_bytes / gpu_mem_total_bytes` – model swaps hurt latency.
- `request_concurrency` vs. `max_batch_size` – service-time step functions when batchers saturate.
Platform signals:

- `node_load1 / cpu_count` and `run_queue` per core – CPU queueing before utilization looks scary.
- `tcp_retransmits_total` and `packet_loss` – network drops make everything look “slow.”
- Evictions: `kube_pod_evict_events_total` and `oom_kills_total` – bad binpacking and noisy neighbors.
If a metric can’t forecast pain 10–30 minutes ahead, it doesn’t belong in your capacity model.
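To make those signals actionable outside a dashboard, here’s a minimal sketch that polls two of them (lag slope and CPU throttling) over the Prometheus HTTP API and flags when to pre-scale. The Prometheus URL, the thresholds, and the `should_prescale` helper are illustrative assumptions, not part of any standard tooling.

```python
# Minimal sketch: poll Prometheus for leading indicators and flag pre-scaling.
# Assumptions: Prometheus is reachable at PROM_URL, the metrics are named as in the
# lists above, and the thresholds are illustrative, not tuned values.
import requests

PROM_URL = "http://prometheus:9090"  # assumed endpoint

def instant_query(promql: str) -> float:
    """Run an instant query and return the first sample's value (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def should_prescale() -> bool:
    # Lag slope: a positive 10m derivative means consumers are falling behind.
    lag_slope = instant_query('deriv(kafka_consumergroup_lag{group="payments"}[10m])')
    # Throttling: seconds of CPU throttling per second across API containers.
    throttled = instant_query(
        'sum(rate(container_cpu_cfs_throttled_seconds_total{container="api"}[5m]))'
    )
    return lag_slope > 0 or throttled > 0.5  # illustrative thresholds

if __name__ == "__main__":
    print("pre-scale" if should_prescale() else "holding steady")
```

Wire something like this into a cron job or your ChatOps bot and it becomes the trigger for the pre-scaling steps later in this playbook.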
From signals to a model you can explain under pressure
Skip the black-box ML unless you’ve nailed the basics. The model we ship fits on a whiteboard and survives postmortems.
Quantify demand and service time

- Use Little’s Law: `L = λ * W` (a sizing sketch follows this list).
  - `λ` (arrival rate): `sum(rate(http_requests_total[5m]))` or `events/s`.
  - `W` (service time): median or p90 request duration, or `time_in_handler`.
  - `L` (concurrency): expected in-flight requests; helps size thread pools and pods.
- Cross-check with real concurrency: `sum(http_server_active_requests)` or `work_in_progress` gauges.
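A minimal sketch of that arithmetic, assuming a per-pod concurrency budget from your own measurements; `required_pods` and the numbers below are illustrative, not production coefficients.

```python
# Little's Law sizing sketch: L = λ * W.
# Assumptions: λ and W come from your telemetry; the per-pod concurrency budget,
# headroom, and example numbers are illustrative.

def required_pods(arrival_rate_rps: float, service_time_s: float,
                  concurrency_per_pod: float, headroom: float = 0.35) -> int:
    """Translate demand into a pod count with headroom on top."""
    in_flight = arrival_rate_rps * service_time_s          # L = λ * W
    pods = in_flight / concurrency_per_pod                 # spread concurrency across pods
    return max(1, int(pods * (1 + headroom) + 0.999))      # round up, keep headroom

# Example: 1,200 rps at a p90 service time of 80 ms, 50 in-flight requests per pod.
print(required_pods(1200, 0.080, 50))  # -> 3 (ceil of 1.92 pods plus 35% headroom)
```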
Fit resource usage vs. work

- For each service, fit a simple linear model: `CPU_cores = α + β * rps_effective` and `mem = α + β * concurrency` (see the fit sketch after this list).
- PromQL quick-and-dirty slope:

```
# CPU cores consumed per 100 RPS (5m windows; graph it over the last hour to see the trend)
sum(rate(container_cpu_usage_seconds_total{container="api"}[5m]))
/ scalar(sum(rate(http_requests_total{service="api"}[5m]))) * 100
```

- Validate with a controlled load test (10–20 minutes) in staging using `k6` or `vegeta`. Lock versions and configs.
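If you want the fit itself in code rather than a spreadsheet, here’s a sketch using `numpy.polyfit` on paired RPS/CPU samples exported from the queries above; the sample values are made up for illustration.

```python
# Fit CPU_cores = alpha + beta * rps from paired telemetry samples.
# Assumptions: rps and cpu_cores are aligned samples exported from Prometheus;
# the numbers below are made up for illustration.
import numpy as np

rps = np.array([200, 400, 600, 800, 1000], dtype=float)
cpu_cores = np.array([1.1, 1.9, 2.8, 3.7, 4.6], dtype=float)

beta, alpha = np.polyfit(rps, cpu_cores, deg=1)   # slope, intercept
residuals = cpu_cores - (alpha + beta * rps)

print(f"cores ≈ {alpha:.2f} + {beta:.4f} * rps")
print(f"max residual: {np.abs(residuals).max():.2f} cores")

def cores_for(rps_forecast: float) -> float:
    """Predicted cores for a forecast arrival rate (valid below saturation cliffs)."""
    return alpha + beta * rps_forecast
```

Watch the residuals: if they grow at the high end of the load range, you are approaching a saturation cliff, which is exactly what the next step handles.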
Incorporate saturation breakpoints

- Identify cliffs: GC pauses > 100ms, thread pool queue > 50, `kafka_consumergroup_lag` derivative > 0 for 10 minutes.
- These are where linear fits stop working; your model must clamp or switch regimes (see the sketch after this list).
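Here’s what “clamp or switch regimes” can look like in code, reusing the hypothetical `cores_for` fit from the previous sketch; `SATURATION_RPS` and the penalty factor are assumptions you’d replace with numbers from your own load tests.

```python
# Regime-aware estimate: trust the linear fit below the cliff, penalize above it.
# Assumptions: cores_for() is the linear fit sketched earlier; SATURATION_RPS and the
# 1.5x penalty are placeholders for values measured in load tests.
SATURATION_RPS = 1500.0

def cores_for_with_regimes(rps_forecast: float) -> float:
    if rps_forecast <= SATURATION_RPS:
        return cores_for(rps_forecast)
    # Past the breakpoint, per-request cost rises (GC, queueing), so the marginal
    # cores per RPS get a penalty instead of extrapolating the linear fit.
    marginal = cores_for(rps_forecast) - cores_for(SATURATION_RPS)
    return cores_for(SATURATION_RPS) + 1.5 * marginal
```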
Forecast short-term demand

- Keep it boring: Holt-Winters or Prophet for weekly seasonality works.

```python
from prophet import Prophet
import pandas as pd

m = Prophet(weekly_seasonality=True, daily_seasonality=True)
m.fit(df)  # df has columns ds (timestamp) and y (rps)
future = m.make_future_dataframe(periods=72, freq='H')
forecast = m.predict(future)
```

- Feed predicted `λ` into the resource fit to get pod/node counts. Stick to 48–72 hours; beyond that, you’re speculating.
Translate to headroom and pre-scaling

- Pick a headroom policy: 30–40% over predicted p95 demand or “one node spare per AZ.”
- Pre-scale before known spikes (marketing emails, region cutovers) and when `burn_rate` rises during “normal” periods (a forecast-to-replicas sketch follows this list).
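Putting the forecast, the fit, and the headroom policy together, here’s a sketch of the pre-scaling math; `cores_for_with_regimes` is the hypothetical fit from earlier, and the pod size and headroom values are assumptions to replace with your own.

```python
# Convert the demand forecast into a pre-scale replica count with headroom.
# Assumptions: `forecast` is the Prophet dataframe from above (hourly, 72 future rows),
# cores_for_with_regimes() is the fitted model sketched earlier, and each pod gets 2 cores.
CORES_PER_POD = 2.0
HEADROOM = 0.35  # 35% over predicted p95 demand

def prescale_replicas(forecast_df, horizon_hours: int = 48) -> int:
    future = forecast_df.tail(72).head(horizon_hours)   # first 48 of the 72 forecast hours
    p95_rps = future["yhat_upper"].quantile(0.95)       # plan against the upper band, not the mean
    cores = cores_for_with_regimes(p95_rps) * (1 + HEADROOM)
    return max(1, int(cores / CORES_PER_POD + 0.999))   # round up

print(f"pre-scale to {prescale_replicas(forecast)} replicas for the next 48h")
```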
Autoscaling that tracks reality, not dashboards
Kubernetes `HPA v2` and `KEDA` can scale on the metrics that matter if you wire the adapters correctly.
- Use `k8s-prometheus-adapter` to surface custom metrics like `queue_length_per_pod` or `active_requests`.
- Prefer per-pod or object metrics over cluster-averaged `CPU%`.
- Keep scaling behavior conservative: stabilization windows, max surge, and min replicas tuned to avoid flapping.
Example: scale API pods on in-flight requests, not CPU:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_active_requests
        target:
          type: AverageValue
          averageValue: "50"  # target 50 active requests per pod
```
Example: scale consumers on Kafka lag with `KEDA`:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-consumer
spec:
  minReplicaCount: 3
  maxReplicaCount: 100
  scaleTargetRef:
    name: payments-consumer
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: payments
        topic: payments
        lagThreshold: "5000"
        activationLagThreshold: "500"
```
Don’t forget the boring bits: `Cluster Autoscaler` limits, node pool SKUs, and `PodDisruptionBudget`/`PriorityClass` so the cluster can actually add capacity without evicting the hot path.
Tie telemetry to triage and rollout automation
A capacity model isn’t done until it drives decisions automatically. We gate rollouts and slash MTTR using the same leading indicators.
SLO-aware alerts

- Multi-window `error_budget_burn_rate` is the north star.

```yaml
# Multi-window burn for a 99.9% SLO: page only when both the 1h and 6h windows run hot.
# Thresholds follow the common 14.4x/6x budget-burn policy; tune to your own targets.
expr: |
  (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))) / (1 - 0.999) > 14.4
  and
  (sum(rate(http_requests_total{status=~"5.."}[6h])) / sum(rate(http_requests_total[6h]))) / (1 - 0.999) > 6
```

Pair with tail-latency slope and queue growth to catch regressions early.
Gate canaries with `Argo Rollouts`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-slo-check
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.250  # 250ms
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="api",version="canary"}[1m])) by (le))
    - name: burn-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 2  # 2x budget
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            (
              sum(rate(http_requests_total{service="api",version="canary",status=~"5.."}[1m]))
            ) /
            (
              sum(rate(http_requests_total{service="api",version="canary"}[1m]))
            ) / (1 - 0.995)
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 25
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 50
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 100
```
Runbooks wired to alerts

- Every alert links to a runbook with `kubectl`, `kafka-consumer-groups`, and Grafana panels.
- First triage step is always “check saturation/lag slope,” not “restart things.”

ChatOps for human-in-the-loop

- A `/promote` command only appears when SLO checks pass; `/rollback` posts the failing metrics inline.
A 48-hour Black Friday prediction that held under fire
A retailer (K8s on EKS, `MSK` for Kafka, `Aurora Postgres`, `Redis` cache) asked us to sanity-check their scale plan 48 hours before Black Friday. Their dashboards said 60% CPU, “we’re fine.” Our model said otherwise:
Leading indicators

- `container_cpu_cfs_throttled_seconds_total` rising 10–15%/h on the API tier.
- `kafka_consumergroup_lag` derivative > 0 despite steady `rps`.
- `db_connections_in_use` brushing the limit during traffic spikes; `lock_waits_total` inching up.
Actions in 24 hours

- Switched the HPA target to `http_server_active_requests` and raised `minReplicas` from 8 to 20.
- Added a larger `c5.4xlarge` node group for the API with `cpuManagerPolicy: static` to kill throttling.
- Pre-warmed the Redis cluster and doubled `maxmemory` with an eviction policy tune.
- KEDA policy on consumer lag with `activationLagThreshold` to wake up earlier.
- Gated the canary with burn-rate and p99 checks via Argo Rollouts.
Results
- Black Friday: p95 latency held at 120ms (down from 210ms the previous year), 0 sev-1s, 1 auto-rollback caught by burn-rate.
- Spend +12% for the weekend, but cost per order down 18%.
- On-call slept. Business shipped promos without rollback fear.
What I’d do differently next time
- Turn “one-off” changes into code. We landed Terraform modules for nodegroups and KEDA, Sloth for SLOs, and versioned them under `platform/`.
- Tighten labels. Cardinality explosions in OTel traces mask the real signals; sample intelligently at the tail.
- Make saturation tests part of CI. We now block merges if a 10-minute `k6` run shows p99 drift > 10% at baseline load (a gate sketch follows this list).
- Review headroom weekly. Marketing doesn’t tell engineering; the model should catch early demand shifts.
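As a rough illustration of that CI gate, here’s a sketch that compares a `k6` summary export against a stored baseline and fails the build on p99 drift; the file paths, the `--summary-export` workflow, and the 10% threshold are assumptions about how you run k6, not a prescribed setup.

```python
# CI gate sketch: fail the build if p99 drifts >10% versus a stored baseline.
# Assumptions: k6 was run with `--summary-export=current.json` and a p(99) trend stat
# (e.g. --summary-trend-stats="avg,p(95),p(99)"), and baseline.json was captured the same way.
import json
import sys

DRIFT_LIMIT = 0.10  # 10%

def p99_ms(summary_path: str) -> float:
    """Read the http_req_duration p(99) value from a k6 summary export."""
    with open(summary_path) as f:
        summary = json.load(f)
    return summary["metrics"]["http_req_duration"]["p(99)"]

baseline = p99_ms("baseline.json")
current = p99_ms("current.json")
drift = (current - baseline) / baseline

print(f"p99 baseline={baseline:.1f}ms current={current:.1f}ms drift={drift:+.1%}")
sys.exit(1 if drift > DRIFT_LIMIT else 0)
```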
Field checklist you can copy-paste
- Define SLIs/SLOs that the business actually cares about: availability, p95/p99 latency, and freshness for async pipelines.
- Instrument leading indicators: queue depths and slopes, CPU throttling, GC pauses, thread-pool backlog, consumer lag, connection pool usage.
- Build the model: Little’s Law for concurrency + linear fit of resource vs. load; annotate known saturation cliffs.
- Validate with synthetic load; update coefficients each deploy or significant config change.
- Forecast 48–72 hours out with simple seasonality. Pre-scale for spikes and when burn-rate rises under “normal” load.
- Wire autoscaling to custom metrics using `HPA v2` and `KEDA`; set sensible stabilization windows.
- Gate rollouts with `Argo Rollouts` AnalysisTemplates tied to SLOs; auto-rollback on regressions.
- Keep runbooks and alerts glued together; first triage step checks leading indicators.
- Postmortem misses and bake changes back into code within 24 hours. Don’t let tribal knowledge rot.
Key takeaways
- Track leading indicators that predict incidents: saturation, queue growth rate, tail latency drift, error-budget burn, and resource throttling.
- Use simple, explainable models first: Little’s Law + linear fits from telemetry beat black-box ML for capacity planning.
- Wire metrics to autoscaling through `HPA v2` and `KEDA` using custom metrics like `queue_length_per_pod` and `kafka_consumer_lag`.
- Gate canary promotions with SLO-aligned signals using `Argo Rollouts` or `Flagger` to automate rollbacks.
- Create headroom and warm paths proactively; predict and pre-scale before peak windows.
- Keep dashboards honest: prefer multi-window burn-rate and slope-based alerts over instantaneous thresholds.
Implementation checklist
- Define SLIs/SLOs that matter: availability, tail latency, and freshness (for pipelines).
- Instrument saturation and backpressure: CPU throttling, run queue, GC pauses, thread-pool queue length, and queue depth slope.
- Model capacity with Little’s Law and linear fits of load vs. resource usage; validate with synthetic load.
- Forecast near-term demand (48–72 hours) with weekly seasonality; pre-scale when error-budget burn rises under normal load.
- Deploy `HPA v2`/`KEDA` for autoscaling on custom metrics tied to SLOs, not CPU%.
- Gate rollouts with `Argo Rollouts` AnalysisTemplates that watch burn-rate and p99 drift; auto-rollback on regressions.
- Document triage runbooks linked in alerts; include commands and dashboards for each metric.
- Run postmortems on capacity misses and update models/alerts within 24 hours.
Questions we hear from teams
- Why not just use CPU or memory for autoscaling?
- Because CPU% and RSS are lagging and often misleading. They don’t capture contention (throttling, run-queue) or backpressure (queue growth, lag slope). Scale on signals tied to service time and concurrency like active requests, queue depth per pod, or consumer lag.
- Do we need ML to forecast demand?
- No. A simple model (Little’s Law + linear fit + Holt-Winters seasonality) usually beats black-box ML because it’s explainable and debuggable in incident reviews. Add complexity only when residuals demand it.
- How do we avoid flapping when scaling on custom metrics?
- Use stabilization windows, minimum replicas, and conservative policy steps. Smooth with 1–5 minute rates, and clamp on known saturation cliffs. Validate in staging with a small load test.
- What about databases? Autoscaling won’t save us there.
- Correct. DBs are capacity-planned, not autoscaled. Watch connection pool saturation, lock waits, and storage IOPS headroom. Pre-scale replicas, partition hot tables, and warm caches before peak windows. Tie app-level backpressure to protect the DB.