Predictive Capacity Planning That Doesn’t Lie: Leading Indicators, Not Vanity Dashboards
Build models that call the surge 30 minutes early and auto‑gate your rollouts. No more praying to CPU%.
"Capacity planning isn’t a spreadsheet; it’s a feedback loop with teeth."Back to all posts
The page I didn’t get (and why)
Two Black Fridays ago, a retailer’s checkout stack looked “healthy” at 58% CPU and 65% memory. Grafana was green. Meanwhile, Kafka consumer lag started climbing, container_cpu_cfs_throttled_seconds_total spiked, DB connection pools pinned at 95%, and p95 latency curved up before it cliffed. Thirty minutes later, carts timed out.
We fixed it in three weeks. Not by buying bigger nodes, but by modeling the system with leading indicators and wiring those signals to autoscaling and rollout gates. The next sale, the model called the surge 30 minutes early, auto‑scaled workers off Kafka lag, and blocked a canary that would’ve blown the error budget. No pages.
I’ve seen CPU% lull teams into outages since the blade server days. If your “capacity plan” is a spreadsheet and a hope, this is for you.
What actually predicts incidents
Most dashboards are vanity. They describe the past and page when customers already feel pain. What you need are saturation and backlog signals that move first.
Leading indicators that consistently predict trouble:
- Backlog growth: kafka_consumergroup_lag, sqs_approximate_number_of_messages_visible, sidekiq_queue_length. The derivative matters more than the absolute.
- CPU throttling: rate(container_cpu_cfs_throttled_seconds_total[2m]) / rate(container_cpu_cfs_periods_total[2m]). 2–5% sustained throttling is a silent killer even at “low” CPU%.
- DB connection pool saturation: db_pool_in_use / db_pool_size (from app metrics, PgBouncer, or RDS proxy). Anything > 0.8 with rising latency is pre-failure.
- Thread/goroutine run-queue pressure: Go: go_sched_goroutines_goroutines and runnable queue; JVM: active threads vs max.
- GC pressure: heap usage vs limit and stop-the-world pauses (JVM jvm_gc_pause_seconds_bucket, Go go_gc_duration_seconds_bucket).
- Multi-window SLO burn: short + long windows to catch both fast regressions and slow rolls.
- Latency curvature: the slope change of p95 vs QPS before it cliffs (a PromQL sketch follows this list).
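A minimal way to watch that curvature, assuming the job:latency_p95:rate1m recording rule defined later in this post:

# p95 slope over the last 10 minutes, converted to milliseconds per minute.
# A sustained positive slope while QPS stays flat is the early-warning signal.
deriv(job:latency_p95:rate1m[10m]) * 60 * 1000

Alert on the slope staying positive for several minutes, not on any single spike.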
Things that don’t predict well by themselves:
- Raw CPU% and memory% without throttling/eviction context.
- Averages (p50 latency) without tail and slope.
- Request count detached from queue length or service time.
Build a simple, ruthless capacity model
You don’t need an ML PhD. You need a model that forecasts “when do we run out of headroom?” and acts before users notice.
Principles:
- Use Little’s Law: L = λW. Effective concurrency L ≈ QPS × latency. When L approaches your real concurrency limit (threads, DB conns, cores), risk skyrockets (worked example after this list).
- Track headroom: how far you are from hard limits (threads, conns, IOPS) and soft limits (throttling onset, GC knee).
- Watch slopes and burn rates, not just levels.
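To make this concrete: at 1,200 QPS with a 250 ms p95, L ≈ 1,200 × 0.25 = 300 requests in flight; with 40 pods running 10 worker threads each, you’re already at 75% of the hard concurrency limit while CPU% still looks boring. A hedged sketch of a headroom rule, assuming your app exports its concurrency limit as a gauge (app_max_concurrency is a hypothetical name) and reusing the job:effective_concurrency rule defined just below:

# Hypothetical headroom rule: fraction of the real concurrency limit currently in use.
# app_max_concurrency is an assumed app-exported gauge (threads, worker slots, or pool size).
- record: job:concurrency_utilization
  expr: job:effective_concurrency / sum(app_max_concurrency{job="checkout"}) by (job)

Alert when this crosses ~0.8 and is still rising.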
Prometheus recording rules that form the backbone:
# prom-rules.yaml
groups:
  - name: capacity-model
    rules:
      - record: job:qps:rate1m
        expr: sum(rate(http_requests_total{job="checkout"}[1m])) by (job)
      - record: job:latency_p95:rate1m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout"}[1m])) by (le, job))
      # Effective concurrency (Little's Law): L ≈ λ * W
      - record: job:effective_concurrency
        expr: (job:qps:rate1m * job:latency_p95:rate1m)
      # Kafka backlog and its growth
      - record: job:kafka_lag_total
        expr: sum(kafka_consumergroup_lag{group="checkout-workers"})
      - record: job:kafka_lag_growth_per_min
        # deriv() gives the per-second slope of the lag series; * 60 converts it to per-minute
        expr: deriv(job:kafka_lag_total[5m]) * 60
      # CPU throttling ratio
      - record: job:cpu_throttling_ratio
        expr: |
          sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="prod",pod=~"checkout-.*"}[2m]))
          /
          sum(rate(container_cpu_cfs_periods_total{namespace="prod",pod=~"checkout-.*"}[2m]))
      # DB pool saturation (from app-exported gauges)
      - record: job:db_pool_saturation
        expr: avg(db_pool_in_use{job="checkout"} / db_pool_size{job="checkout"}) by (job)
  - name: slo
    rules:
      - record: job:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total{job="checkout"}[5m])) by (job)
      - record: job:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h])) by (job)
          /
          sum(rate(http_requests_total{job="checkout"}[1h])) by (job)

Now add early-warning alerts on slopes and saturation, not just absolute thresholds:
- alert: BacklogGrowing
  expr: job:kafka_lag_growth_per_min > 200
  for: 10m
  labels:
    severity: warn
    service: checkout
    runbook: checkout-lag
  annotations:
    summary: 'Kafka lag growing: {{ $value }} msgs/min'
- alert: ThrottlingOnset
  expr: job:cpu_throttling_ratio > 0.02
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: throttling
  annotations:
    summary: 'CPU throttling >2% — performance cliff imminent'
- alert: SLOBurnRateHigh
  expr: (job:error_ratio:5m > (0.001 * 14.4)) and (job:error_ratio:1h > (0.001 * 6))
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: slo-burn
  annotations:
    summary: '99.9 SLO burning fast (5m and 1h)'

This is the model: capacity = limits minus the distance you’re traveling toward them. When backlog growth and throttling rise together, you don’t argue—you act.
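That confluence is easy to encode as its own guard. A sketch reusing the recording rules above (the thresholds and runbook slug are illustrative):

- alert: SaturationConfluence
  # Backlog growing while CPU is already being throttled: scale out or shed load now.
  expr: (job:kafka_lag_growth_per_min > 200) and (job:cpu_throttling_ratio > 0.02)
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: checkout-saturation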
Instrumentation that earns its keep
Skip the yak-shaving and standardize early:
- OpenTelemetry for traces and metrics; export to Prometheus and a trace backend (Tempo/Jaeger).
- Standard labels: service, version, env, team. You’ll use them to gate rollouts.
- A single, versioned repo for recording rules; ship via ArgoCD or Flux.
Collector example routing app metrics and traces:
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
processors:
  batch: {}
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Make sure every service exports in_flight requests or derive concurrency with Little’s Law. Also expose queue length and pool sizes from the app—don’t rely solely on black-box exporters.
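If a service can’t export an in-flight gauge yet, you can still recover average concurrency from the latency histogram you already have; the rate of the _sum series is request-seconds accrued per second, which is exactly L = λW:

# Average in-flight requests for checkout, derived purely from the existing histogram.
# rate(..._sum) = request-seconds per second = average concurrency over the window.
sum(rate(http_request_duration_seconds_sum{job="checkout"}[1m]))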
Wire telemetry to autoscaling and rollout automation
Here’s where most teams stop. Don’t. If your model doesn’t change the system automatically, it’s just a pretty graph.
- Autoscale on real bottlenecks (not CPU%)
- Kafka/SQS workers: scale on lag.
- API pods with throttling: scale on throttling or in-flight requests.
KEDA example scaling checkout workers from Kafka lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-workers
  namespace: prod
spec:
  scaleTargetRef:
    name: checkout-worker-deployment
  cooldownPeriod: 120
  maxReplicaCount: 80
  minReplicaCount: 4
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-1:9092,kafka-2:9092
        consumerGroup: checkout-workers
        topic: orders
        lagThreshold: '1000'

If you prefer native HPA via the Prometheus Adapter (custom pod metrics):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 6
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          # job:effective_concurrency surfaced through the Prometheus Adapter under a colon-free name
          name: job_effective_concurrency
        target:
          type: AverageValue
          averageValue: '80'

- Gate rollouts on SLO and latency slope
Use Argo Rollouts with an AnalysisTemplate that fails fast when error ratio or latency slope breaches.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-prevent-regression
spec:
  args:
    - name: newVersion
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.002
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="checkout",status=~"5..",version="{{args.newVersion}}"}[1m]))
            /
            sum(rate(http_requests_total{job="checkout",version="{{args.newVersion}}"}[1m]))
    - name: latency-slope
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.15
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            stddev_over_time(
              (histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout",version="{{args.newVersion}}"}[1m])) by (le)))[5m:1m]
            )

Reference this template in your Rollout steps with a small canary weight. If the analysis fails, Rollouts auto-aborts. Tie this to your error budget policy: freeze deploys when burn exceeds threshold.
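For reference, the canary side might look like the sketch below. How you feed newVersion depends on how you stamp the version label onto your metrics; the fieldRef here assumes an app.kubernetes.io/version label on the Rollout and is one option, not the only one:

# Hedged sketch of the Rollout strategy that consumes the AnalysisTemplate above.
strategy:
  canary:
    steps:
      - setWeight: 10
      - analysis:
          templates:
            - templateName: checkout-prevent-regression
          args:
            - name: newVersion
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['app.kubernetes.io/version']
      - setWeight: 50
      - pause: { duration: 5m }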
Triage without heroics
Your on-call shouldn’t guess. Alerts should say “what broke, why, what’s next.” Use Alertmanager labels and route to a triage channel with runbooks.
route:
  receiver: 'slack-triage'
  group_by: ['cluster', 'service']
  routes:
    - matchers: ['severity="page"']
      receiver: 'slack-pager'
      continue: true
receivers:
  - name: 'slack-triage'
    slack_configs:
      - channel: '#prod-triage'
        title: '{{ .CommonLabels.alertname }} — {{ .CommonLabels.service }}'
        text: 'Runbook: https://gitplumbers.dev/runbooks/{{ .CommonLabels.runbook }}\nSLO: https://grafana.example.com/d/checkout-slo'
  - name: 'slack-pager'
    slack_configs:
      - channel: '#oncall'
        title: 'PAGE: {{ .CommonLabels.alertname }} — {{ .CommonLabels.service }}'
        text: 'Immediate action required.'

Add triage automations:
- Auto-open a Jira incident with labels service, env, slo.
- Attach the on-call’s “first 5 minutes” checklist in the runbook.
- Include a rollback link for the last rollout if Argo is in play.
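One low-effort way to wire those automations: fan the same alerts into a small internal webhook that opens the Jira incident and pins the checklist. A sketch (the receiver name, bot service, and URL are assumptions):

# Additional Alertmanager receiver; add it to the route alongside slack-triage (with continue: true).
- name: 'incident-automation'
  webhook_configs:
    - url: 'http://incident-bot.tooling.svc:8080/alertmanager'
      send_resolved: true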
This turns telemetry into a playbook, not a scavenger hunt.
Prove the model in practice
You don’t trust models—you verify them. Here’s what we do with clients:
- Replay production traffic in staging (TCPCopy/GoReplay, or a synthetic load in Locust/k6) while shadowing canaries.
- Chaos the likely bottlenecks: lower CPU quotas to force throttling (see the sketch after this list), inject DB latency with Toxiproxy, pause Kafka partitions.
- Watch the leading indicators fire first; tune thresholds until they predict before SLO burn.
- Validate autoscaling reacts within your MTTA target (e.g., scale within 2 minutes of lag growth > 200 msgs/min).
- Verify rollout gates abort bad versions within 2–3 minutes and auto-rollback.
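For the throttling experiment, the cheapest lever is a deliberately tight CPU limit on one canary deployment. A sketch with illustrative numbers:

# Illustrative only: shrink the CPU limit on a single canary so CFS throttling kicks in
# under load, then confirm job:cpu_throttling_ratio fires before the SLO burn alert does.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 600m        # deliberately tight relative to expected load
    memory: 1Gi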
Metrics that matter for sign-off:
- Mean warning lead time before incident (target: >15 minutes).
- MTTA/MTTR improvement (target: >30% reduction).
- Cost efficiency band (e.g., 45–65% resource utilization with zero throttling at p95 load).
Common traps (and what I’d do differently)
- CPU% worship: Throttling kills before CPU%. Watch cfs_throttled_seconds.
- Per-service heroics without shared rules: Centralize recording rules; version them.
- Cardinality bombs: Don’t label by user_id or request_id. Stick to service, version, env, team.
- Chasing ML: We’ve tried Prophet, Holt-Winters, even Kayenta-style scores. Useful for dashboards, but your first wins come from slope + headroom and automations.
- Ignoring cost: Add alerts for under‑utilization (see the sketch after this list). If HPA/KEDA never scales in, you’re burning money.
- One window to rule them all: Use multi-window burn (5m + 1h). Short catches fast failures, long prevents flapping.
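And the under-utilization guard from the cost bullet, sketched with the same recording rules (assumes kube-state-metrics for the pod count; the 80-per-pod target mirrors the HPA example, so tune it to your own band):

- alert: OverProvisioned
  # Sustained low use of the concurrency we pay for; scale in or cut replicas.
  expr: job:effective_concurrency / scalar(count(kube_pod_info{namespace="prod",pod=~"checkout-.*"}) * 80) < 0.3
  for: 6h
  labels:
    severity: warn
    service: checkout
    runbook: cost-review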
If you only do one thing this quarter: wire backlog growth and throttling to autoscaling and rollout gates. You’ll buy yourself hours of sleep and weeks of roadmap time.
Where GitPlumbers fits
We’ve implemented this stack at fintechs throttled by RDS, marketplaces drowning in Kafka lag, and SaaS orgs tripping on their own canaries. We’ll help you:
- Map bottlenecks and model headroom per service.
- Ship standardized OTel + Prometheus rules via GitOps.
- Autoscale on external metrics and gate rollouts with Argo.
- Prove the model with load + chaos, then hand you the runbooks.
If you’re tired of vendors selling magic, let’s make your telemetry bite back.
Key takeaways
- Track leading indicators like backlog growth, CPU throttling, connection pool saturation, and multi-window SLO burn—these predict incidents before customers feel them.
- Model capacity with Little’s Law (L = λW) and headroom thresholds; detect the curve before it breaks by watching slope changes, not averages.
- Turn telemetry into action: autoscale on external metrics (Kafka lag, SQS depth), gate rollouts with Argo AnalysisTemplates, and page on error-budget burn.
- Standardize recording rules and labels once; reuse everywhere (dashboards, alerts, autoscaling, canaries).
- Load-test the model and verify with chaos; don’t trust spreadsheets or “CPU%” alone.
- Keep cost in the loop—target a utilization band with alerting on over- and under-provisioning.
Implementation checklist
- Define your SLOs and error budgets per service.
- Instrument RED and USE signals with OpenTelemetry and Prometheus.
- Create recording rules for backlog growth, throttling rate, pool saturation, and burn rate.
- Autoscale from external metrics (Kafka/SQS lag, queue depth) via HPA/KEDA.
- Gate canaries with Argo Rollouts using your recorded metrics.
- Route Alertmanager to triage with runbooks and labels.
- Load-test and chaos-test the model; validate predictions vs. incidents.
- Set weekly review to tune thresholds and cost bands.
Questions we hear from teams
- We don’t have good telemetry yet. Where do we start without boiling the ocean?
- Start with the four signals that pay rent fast: error ratio (for SLO), p95 latency, backlog length (Kafka/SQS/Sidekiq), and CPU throttling. Add simple Prometheus rules and one KEDA/HPA scaler off backlog. Ship via GitOps so it sticks. Expand to pool saturation and concurrency next.
- Should we use ML/forecasting for capacity?
- Use ML for business forecasting (expected traffic) if you have seasonality; for incident prediction, slopes + headroom + burn rate consistently outperform fancy models in real-time. We’ve seen Prophet/Holt-Winters be fine for staffing and scheduling; production protection still hinges on leading indicators and automations.
- How do we keep costs under control while autoscaling more aggressively?
- Define a utilization band (e.g., 45–65%), add scale-in policies (cooldowns, smaller step-downs), and alert on under-utilization. Prefer bin-packing nodes with the right pod requests/limits and use cluster-autoscaler with spot capacity for bursty workers.
- What about multi-tenant platforms (shared DBs, shared Kafka)?
- Model bottlenecks per tenant where possible (labels by `tenant`), but keep recording rules at the service level to avoid cardinality blowups. Apply quotas (pool sizes, partitions per tenant), and autoscale workers on the aggregate backlog while alerting on top-N tenants driving saturation.
- We’re not on Kubernetes. Does this still apply?
- Yes. Use ASGs scaling on SQS depth, Kafka lag (CloudWatch or Prometheus `cloudwatch_exporter`), and gate deploys in Spinnaker or GitHub Actions with PromQL checks. The model (leading indicators -> automation) is infra-agnostic.
