Predictive Capacity Planning That Doesn’t Lie: Leading Indicators, Not Vanity Dashboards
Build models that call the surge 30 minutes early and auto‑gate your rollouts. No more praying to CPU%.
"Capacity planning isn’t a spreadsheet; it’s a feedback loop with teeth."Back to all posts
The page I didn’t get (and why)
Two Black Fridays ago, a retailer’s checkout stack looked “healthy” at 58% CPU and 65% memory. Grafana was green. Meanwhile, Kafka consumer lag started climbing, container_cpu_cfs_throttled_seconds_total spiked, DB connection pools pinned at 95%, and p95 latency curved up before it cliffed. Thirty minutes later, carts timed out.
We fixed it in three weeks. Not by buying bigger nodes, but by modeling the system with leading indicators and wiring those signals to autoscaling and rollout gates. The next sale, the model called the surge 30 minutes early, auto‑scaled workers off Kafka lag, and blocked a canary that would’ve blown the error budget. No pages.
I’ve seen CPU% lull teams into outages since the blade server days. If your “capacity plan” is a spreadsheet and a hope, this is for you.
What actually predicts incidents
Most dashboards are vanity. They describe the past and page when customers already feel pain. What you need are saturation and backlog signals that move first.
Leading indicators that consistently predict trouble:
- Backlog growth: kafka_consumergroup_lag, sqs_approximate_number_of_messages_visible, sidekiq_queue_length. The derivative matters more than the absolute.
- CPU throttling: rate(container_cpu_cfs_throttled_seconds_total[2m]) / rate(container_cpu_cfs_periods_total[2m]). 2–5% sustained throttling is a silent killer even at “low” CPU%.
- DB connection pool saturation: db_pool_in_use / db_pool_size (from app metrics, PgBouncer, or RDS proxy). Anything > 0.8 with rising latency is pre-failure.
- Thread/goroutine run-queue pressure: Go: go_sched_goroutines_goroutines and runnable queue; JVM: active threads vs max.
- GC pressure: heap usage vs limit and stop-the-world pauses (JVM jvm_gc_pause_seconds_bucket, Go go_gc_duration_seconds_bucket).
- Multi-window SLO burn: short + long windows to catch both fast regressions and slow rolls.
- Latency curvature: the slope change of p95 vs QPS before it cliffs (a PromQL sketch follows this list).
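A minimal way to watch that curvature, assuming the job:latency_p95:rate1m recording rule defined later in this post:

# p95 slope over the last 10 minutes, converted to milliseconds per minute.
# A sustained positive slope while QPS stays flat is the early-warning signal.
deriv(job:latency_p95:rate1m[10m]) * 60 * 1000

Alert on the slope staying positive for several minutes, not on any single spike.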
Things that don’t predict well by themselves:
- Raw CPU% and memory% without throttling/eviction context.
- Averages (p50 latency) without tail and slope.
- Request count detached from queue length or service time.
Build a simple, ruthless capacity model
You don’t need an ML PhD. You need a model that forecasts “when do we run out of headroom?” and acts before users notice.
Principles:
- Use Little’s Law: L = λW. Effective concurrency L ≈ QPS × latency. When L approaches your real concurrency limit (threads, DB conns, cores), risk skyrockets (worked example after this list).
- Track headroom: how far you are from hard limits (threads, conns, IOPS) and soft limits (throttling onset, GC knee).
- Watch slopes and burn rates, not just levels.
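To make this concrete: at 1,200 QPS with a 250 ms p95, L ≈ 1,200 × 0.25 = 300 requests in flight; with 40 pods running 10 worker threads each, you’re already at 75% of the hard concurrency limit while CPU% still looks boring. A hedged sketch of a headroom rule, assuming your app exports its concurrency limit as a gauge (app_max_concurrency is a hypothetical name) and reusing the job:effective_concurrency rule defined just below:

# Hypothetical headroom rule: fraction of the real concurrency limit currently in use.
# app_max_concurrency is an assumed app-exported gauge (threads, worker slots, or pool size).
- record: job:concurrency_utilization
  expr: job:effective_concurrency / sum(app_max_concurrency{job="checkout"}) by (job)

Alert when this crosses ~0.8 and is still rising.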
Prometheus recording rules that form the backbone:
# prom-rules.yaml
groups:
  - name: capacity-model
    rules:
      - record: job:qps:rate1m
        expr: sum(rate(http_requests_total{job="checkout"}[1m])) by (job)
      - record: job:latency_p95:rate1m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout"}[1m])) by (le, job))
      # Effective concurrency (Little's Law): L ≈ λ * W
      - record: job:effective_concurrency
        expr: (job:qps:rate1m * job:latency_p95:rate1m)
      # Kafka backlog and its growth
      - record: job:kafka_lag_total
        expr: sum(kafka_consumergroup_lag{group="checkout-workers"})
      - record: job:kafka_lag_growth_per_min
        # deriv() gives the per-second slope of the lag series; * 60 converts it to per-minute
        expr: deriv(job:kafka_lag_total[5m]) * 60
      # CPU throttling ratio
      - record: job:cpu_throttling_ratio
        expr: |
          sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="prod",pod=~"checkout-.*"}[2m]))
          /
          sum(rate(container_cpu_cfs_periods_total{namespace="prod",pod=~"checkout-.*"}[2m]))
      # DB pool saturation (from app-exported gauges)
      - record: job:db_pool_saturation
        expr: avg(db_pool_in_use{job="checkout"} / db_pool_size{job="checkout"}) by (job)
  - name: slo
    rules:
      - record: job:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total{job="checkout"}[5m])) by (job)
      - record: job:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h])) by (job)
          /
          sum(rate(http_requests_total{job="checkout"}[1h])) by (job)

Now add early-warning alerts on slopes and saturation, not just absolute thresholds:
- alert: BacklogGrowing
  expr: job:kafka_lag_growth_per_min > 200
  for: 10m
  labels:
    severity: warn
    service: checkout
    runbook: checkout-lag
  annotations:
    summary: 'Kafka lag growing: {{ $value }} msgs/min'
- alert: ThrottlingOnset
  expr: job:cpu_throttling_ratio > 0.02
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: throttling
  annotations:
    summary: 'CPU throttling >2% — performance cliff imminent'
- alert: SLOBurnRateHigh
  expr: (job:error_ratio:5m > (0.001 * 14.4)) and (job:error_ratio:1h > (0.001 * 6))
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: slo-burn
  annotations:
    summary: '99.9 SLO burning fast (5m and 1h)'

This is the model: capacity = limits minus the distance you’re traveling toward them. When backlog growth and throttling rise together, you don’t argue—you act.
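That confluence is easy to encode as its own guard. A sketch reusing the recording rules above (the thresholds and runbook slug are illustrative):

- alert: SaturationConfluence
  # Backlog growing while CPU is already being throttled: scale out or shed load now.
  expr: (job:kafka_lag_growth_per_min > 200) and (job:cpu_throttling_ratio > 0.02)
  for: 5m
  labels:
    severity: page
    service: checkout
    runbook: checkout-saturation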
Instrumentation that earns its keep
Skip the yak-shaving and standardize early:
- OpenTelemetry for traces and metrics; export to Prometheus and a trace backend (Tempo/Jaeger).
- Standard labels: service, version, env, team. You’ll use them to gate rollouts.
- A single, versioned repo for recording rules; ship via ArgoCD or Flux.
Collector example routing app metrics and traces:
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
processors:
  batch: {}
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Make sure every service exports in_flight requests or derive concurrency with Little’s Law. Also expose queue length and pool sizes from the app—don’t rely solely on black-box exporters.
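If a service can’t export an in-flight gauge yet, you can still recover average concurrency from the latency histogram you already have; the rate of the _sum series is request-seconds accrued per second, which is exactly L = λW:

# Average in-flight requests for checkout, derived purely from the existing histogram.
# rate(..._sum) = request-seconds per second = average concurrency over the window.
sum(rate(http_request_duration_seconds_sum{job="checkout"}[1m]))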
Wire telemetry to autoscaling and rollout automation
Here’s where most teams stop. Don’t. If your model doesn’t change the system automatically, it’s just a pretty graph.
- Autoscale on real bottlenecks (not CPU%)
- Kafka/SQS workers: scale on lag.
- API pods with throttling: scale on throttling or in-flight requests.
KEDA example scaling checkout workers from Kafka lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-workers
  namespace: prod
spec:
  scaleTargetRef:
    name: checkout-worker-deployment
  cooldownPeriod: 120
  maxReplicaCount: 80
  minReplicaCount: 4
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-1:9092,kafka-2:9092
        consumerGroup: checkout-workers
        topic: orders
        lagThreshold: '1000'

If you prefer native HPA via the Prometheus Adapter (custom pod metrics):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 6
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          # job:effective_concurrency surfaced through the Prometheus Adapter under a colon-free name
          name: job_effective_concurrency
        target:
          type: AverageValue
          averageValue: '80'

- Gate rollouts on SLO and latency slope
Use Argo Rollouts with an AnalysisTemplate that fails fast when error ratio or latency slope breaches.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-prevent-regression
spec:
  args:
    - name: newVersion
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.002
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="checkout",status=~"5..",version="{{args.newVersion}}"}[1m]))
            /
            sum(rate(http_requests_total{job="checkout",version="{{args.newVersion}}"}[1m]))
    - name: latency-slope
      interval: 1m
      failureLimit: 2
      successCondition: result[0] < 0.15
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            stddev_over_time(
              (histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout",version="{{args.newVersion}}"}[1m])) by (le)))[5m:1m]
            )

Reference this template in your Rollout steps with a small canary weight. If the analysis fails, Rollouts auto-aborts. Tie this to your error budget policy: freeze deploys when burn exceeds threshold.
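For reference, the canary side might look like the sketch below. How you feed newVersion depends on how you stamp the version label onto your metrics; the fieldRef here assumes an app.kubernetes.io/version label on the Rollout and is one option, not the only one:

# Hedged sketch of the Rollout strategy that consumes the AnalysisTemplate above.
strategy:
  canary:
    steps:
      - setWeight: 10
      - analysis:
          templates:
            - templateName: checkout-prevent-regression
          args:
            - name: newVersion
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['app.kubernetes.io/version']
      - setWeight: 50
      - pause: { duration: 5m }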
Triage without heroics
Your on-call shouldn’t guess. Alerts should say “what broke, why, what’s next.” Use Alertmanager labels and route to a triage channel with runbooks.
route:
  receiver: 'slack-triage'
  group_by: ['cluster', 'service']
  routes:
    - matchers: ['severity="page"']
      receiver: 'slack-pager'
      continue: true
receivers:
  - name: 'slack-triage'
    slack_configs:
      - channel: '#prod-triage'
        title: '{{ .CommonLabels.alertname }} — {{ .CommonLabels.service }}'
        text: 'Runbook: https://gitplumbers.dev/runbooks/{{ .CommonLabels.runbook }}\nSLO: https://grafana.example.com/d/checkout-slo'
  - name: 'slack-pager'
    slack_configs:
      - channel: '#oncall'
        title: 'PAGE: {{ .CommonLabels.alertname }} — {{ .CommonLabels.service }}'
        text: 'Immediate action required.'

Add triage automations:
- Auto-open a Jira incident with labels service, env, slo.
- Attach the on-call’s “first 5 minutes” checklist in the runbook.
- Include a rollback link for the last rollout if Argo is in play.
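One low-effort way to wire those automations: fan the same alerts into a small internal webhook that opens the Jira incident and pins the checklist. A sketch (the receiver name, bot service, and URL are assumptions):

# Additional Alertmanager receiver; add it to the route alongside slack-triage (with continue: true).
- name: 'incident-automation'
  webhook_configs:
    - url: 'http://incident-bot.tooling.svc:8080/alertmanager'
      send_resolved: true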
This turns telemetry into a playbook, not a scavenger hunt.
Prove the model in practice
You don’t trust models—you verify them. Here’s what we do with clients:
- Replay production traffic in staging (TCPCopy/GoReplay, or a synthetic load in Locust/k6) while shadowing canaries.
- Chaos the likely bottlenecks: lower CPU quotas to force throttling (see the sketch after this list), inject DB latency with Toxiproxy, pause Kafka partitions.
- Watch the leading indicators fire first; tune thresholds until they predict before SLO burn.
- Validate autoscaling reacts within your MTTA target (e.g., scale within 2 minutes of lag growth > 200 msgs/min).
- Verify rollout gates abort bad versions within 2–3 minutes and auto-rollback.
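For the throttling experiment, the cheapest lever is a deliberately tight CPU limit on one canary deployment. A sketch with illustrative numbers:

# Illustrative only: shrink the CPU limit on a single canary so CFS throttling kicks in
# under load, then confirm job:cpu_throttling_ratio fires before the SLO burn alert does.
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 600m        # deliberately tight relative to expected load
    memory: 1Gi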
Metrics that matter for sign-off:
- Mean warning lead time before incident (target: >15 minutes).
- MTTA/MTTR improvement (target: >30% reduction).
- Cost efficiency band (e.g., 45–65% resource utilization with zero throttling at p95 load).
Common traps (and what I’d do differently)
- CPU% worship: Throttling kills before CPU%. Watch cfs_throttled_seconds.
- Per-service heroics without shared rules: Centralize recording rules; version them.
- Cardinality bombs: Don’t label by user_id or request_id. Stick to service, version, env, team.
- Chasing ML: We’ve tried Prophet, Holt-Winters, even Kayenta-style scores. Useful for dashboards, but your first wins come from slope + headroom and automations.
- Ignoring cost: Add alerts for under‑utilization (see the sketch after this list). If HPA/KEDA never scales in, you’re burning money.
- One window to rule them all: Use multi-window burn (5m + 1h). Short catches fast failures, long prevents flapping.
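And the under-utilization guard from the cost bullet, sketched with the same recording rules (assumes kube-state-metrics for the pod count; the 80-per-pod target mirrors the HPA example, so tune it to your own band):

- alert: OverProvisioned
  # Sustained low use of the concurrency we pay for; scale in or cut replicas.
  expr: job:effective_concurrency / scalar(count(kube_pod_info{namespace="prod",pod=~"checkout-.*"}) * 80) < 0.3
  for: 6h
  labels:
    severity: warn
    service: checkout
    runbook: cost-review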
If you only do one thing this quarter: wire backlog growth and throttling to autoscaling and rollout gates. You’ll buy yourself hours of sleep and weeks of roadmap time.
Where GitPlumbers fits
We’ve implemented this stack at fintechs throttled by RDS, marketplaces drowning in Kafka lag, and SaaS orgs tripping on their own canaries. We’ll help you:
- Map bottlenecks and model headroom per service.
- Ship standardized OTel + Prometheus rules via GitOps.
- Autoscale on external metrics and gate rollouts with Argo.
- Prove the model with load + chaos, then hand you the runbooks.
If you’re tired of vendors selling magic, let’s make your telemetry bite back.
Key takeaways
- Track leading indicators like backlog growth, CPU throttling, connection pool saturation, and multi-window SLO burn—these predict incidents before customers feel them.
- Model capacity with Little’s Law (L = λW) and headroom thresholds; detect the curve before it breaks by watching slope changes, not averages.
- Turn telemetry into action: autoscale on external metrics (Kafka lag, SQS depth), gate rollouts with Argo AnalysisTemplates, and page on error-budget burn.
- Standardize recording rules and labels once; reuse everywhere (dashboards, alerts, autoscaling, canaries).
- Load-test the model and verify with chaos; don’t trust spreadsheets or “CPU%” alone.
- Keep cost in the loop—target a utilization band with alerting on over- and under-provisioning.
Implementation checklist
- Define your SLOs and error budgets per service.
- Instrument RED and USE signals with OpenTelemetry and Prometheus.
- Create recording rules for backlog growth, throttling rate, pool saturation, and burn rate.
- Autoscale from external metrics (Kafka/SQS lag, queue depth) via HPA/KEDA.
- Gate canaries with Argo Rollouts using your recorded metrics.
- Route Alertmanager to triage with runbooks and labels.
- Load-test and chaos-test the model; validate predictions vs. incidents.
- Set weekly review to tune thresholds and cost bands.
Questions we hear from teams
- We don’t have good telemetry yet. Where do we start without boiling the ocean?
- Start with the four signals that pay rent fast: error ratio (for SLO), p95 latency, backlog length (Kafka/SQS/Sidekiq), and CPU throttling. Add simple Prometheus rules and one KEDA/HPA scaler off backlog. Ship via GitOps so it sticks. Expand to pool saturation and concurrency next.
- Should we use ML/forecasting for capacity?
- Use ML for business forecasting (expected traffic) if you have seasonality; for incident prediction, slopes + headroom + burn rate consistently outperform fancy models in real-time. We’ve seen Prophet/Holt-Winters be fine for staffing and scheduling; production protection still hinges on leading indicators and automations.
- How do we keep costs under control while autoscaling more aggressively?
- Define a utilization band (e.g., 45–65%), add scale-in policies (cooldowns, smaller step-downs), and alert on under-utilization. Prefer bin-packing nodes with the right pod requests/limits and use cluster-autoscaler with spot capacity for bursty workers.
- What about multi-tenant platforms (shared DBs, shared Kafka)?
- Model bottlenecks per tenant where possible (labels by `tenant`), but keep recording rules at the service level to avoid cardinality blowups. Apply quotas (pool sizes, partitions per tenant), and autoscale workers on the aggregate backlog while alerting on top-N tenants driving saturation.
- We’re not on Kubernetes. Does this still apply?
- Yes. Use ASGs scaling on SQS depth, Kafka lag (CloudWatch or Prometheus `cloudwatch_exporter`), and gate deploys in Spinnaker or GitHub Actions with PromQL checks. The model (leading indicators -> automation) is infra-agnostic.
