Capacity Planning That Doesn’t Lie: Predict Scale With Leading Indicators, Not Dashboards

If your model needs a slide deck to explain, it won’t save your on-call. Here’s the lean blueprint we use to predict capacity needs and wire telemetry into autoscaling, triage, and rollouts that don’t blow up at 2 a.m.

Capacity planning that can’t predict burn rate or queue growth is just a dashboard with a résumé.

The Friday deploy that finally forced honest capacity planning

I watched a payments API on GKE faceplant at 5:12 p.m. on a Friday. Dashboards were green. CPU% hovered at 55. Autoscaler was “healthy.” Then p99 jumped 3x, Kafka lag spiked, thread pools starved, and the on-call got paged into a goose chase. The root cause wasn’t a mystery: a rising consumer_lag slope, climbing container_cpu_cfs_throttled_seconds_total, and a slow bake of new code that increased service time. None of that shows up in a vanity CPU% widget.

That weekend we ripped out our “capacity model” spreadsheet and built a real one tied to leading indicators and automated rollouts. The next peak? Zero sev-1s, predictable spend bump, and no heroics. Here’s the playbook we use at GitPlumbers when a team is done being surprised by scale.

Track signals that actually predict incidents

Stop staring at CPU% and request counts in isolation. The predictors that have saved my bacon repeatedly are saturation, backpressure, tail drift, and burn rate.

  • Web/API services:

    • histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) – tail-latency drift, not instantaneous spikes.
    • sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod) – CPU throttling = silent performance killer on bursty workloads.
    • sum(rate(process_runtime_gc_pause_seconds_sum[5m])) or jvm_gc_pause_seconds_sum – GC pauses predict tail blowups.
    • threadpool_queue_length by pool (a gauge – watch its level and deriv(), not rate()) or blocked go_goroutines – thread-pool starvation precedes 5xx.
  • Data pipelines / Kafka:

    • kafka_consumergroup_lag{group="payments"} and deriv(kafka_consumergroup_lag[10m]) – lag slope is the early warning.
    • sum(rate(kafka_network_requestmetrics_requests_total{request="Produce"}[5m])) vs. broker disk iowait – broker saturation predicts backlogs.
    • job_queue_depth and deriv(job_queue_depth[5m]) – queue growth under a steady arrival rate means service time just got slower.
  • Databases:

    • db_connections_in_use vs. max_connections and lock_waits_total – connection pool saturation and lock contention.
    • pg_stat_activity long transactions and checkpoint_write_time – write amplification predicts nasty latency cliffs.
  • Inference/GPU services:

    • gpu_mem_used_bytes / gpu_mem_total_bytes – model swaps hurt latency.
    • request_concurrency vs. max_batch_size – service-time step functions when batchers saturate.
  • Platform signals:

    • node_load1 / cpu_count and run_queue per core – CPU queueing before utilization looks scary.
    • tcp_retransmits_total and packet_loss – network drops make everything look “slow.”
    • Evictions: kube_pod_evict_events_total and oom_kills_total – bad binpacking and noisy neighbors.

If a metric can’t forecast pain 10–30 minutes ahead, it doesn’t belong in your capacity model.
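
One way to hold an indicator to that standard is to fit a short-window slope and project minutes-to-threshold. Here’s a minimal Python sketch against the Prometheus range-query API; the Prometheus address, the kafka_consumergroup_lag selector, and the 500k lag threshold are assumptions you’d swap for your own environment.

import time
import numpy as np
import requests

PROM_URL = "http://prometheus:9090"   # assumption: adjust to your Prometheus endpoint
QUERY = 'sum(kafka_consumergroup_lag{group="payments"})'
PAIN_THRESHOLD = 500_000              # assumption: lag level where SLOs start to suffer

def fetch_series(query, minutes=10, step="15s"):
    """Pull the last N minutes of a series from Prometheus' range-query API."""
    end = time.time()
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": end - minutes * 60, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return np.array([]), np.array([])
    ts, vals = zip(*result[0]["values"])   # values come back as [[timestamp, "value"], ...]
    return np.array(ts, dtype=float), np.array(vals, dtype=float)

def minutes_until(ts, vals, threshold):
    """Linear fit over the window; returns projected minutes to threshold, or None."""
    if len(vals) < 2:
        return None
    if vals[-1] >= threshold:
        return 0.0                         # already past the pain line
    slope, _ = np.polyfit(ts, vals, 1)     # units per second
    if slope <= 0:
        return None                        # flat or draining
    return (threshold - vals[-1]) / slope / 60

ts, vals = fetch_series(QUERY)
eta = minutes_until(ts, vals, PAIN_THRESHOLD)
print("no upward slope" if eta is None else f"~{eta:.0f} min until lag crosses {PAIN_THRESHOLD}")

If the projection routinely gives you 10+ minutes of warning before the pager goes off, the metric earns a spot in the model; if it only moves once users are already hurting, cut it.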

From signals to a model you can explain under pressure

Skip the black-box ML unless you’ve nailed the basics. The model we ship fits on a whiteboard and survives postmortems.

  1. Quantify demand and service time

    • Use Little’s Law: L = λ * W.
      • λ (arrival rate): sum(rate(http_requests_total[5m])) or events/s.
      • W (service time): median or p90 request duration, or time_in_handler.
      • L (concurrency): expected in-flight requests; helps size thread pools and pods.
    • Cross-check with real concurrency: sum(http_server_active_requests) or work_in_progress gauges.
  2. Fit resource usage vs. work

    • For each service, fit a simple linear model: CPU_cores = α + β * rps_effective and mem = α + β * concurrency.
    • PromQL quick-and-dirty slope:
# CPU cores consumed per 100 RPS (5m rate windows; widen to 1h for a smoother fit)
sum(rate(container_cpu_usage_seconds_total{container="api"}[5m]))
/ scalar(sum(rate(http_requests_total{service="api"}[5m]))) * 100
    • Validate with a controlled load test (10–20 minutes) in staging using k6 or vegeta. Lock versions and configs.
  3. Incorporate saturation breakpoints

    • Identify cliffs: GC pauses > 100ms, thread pool queue > 50, kafka_consumergroup_lag derivative > 0 for 10 minutes.
    • These are where linear fits stop working; your model must clamp or switch regimes.
  4. Forecast short-term demand

    • Keep it boring: Holt-Winters or Prophet for weekly seasonality works.
from prophet import Prophet
import pandas as pd

# df must have Prophet's expected columns: 'ds' (timestamp) and 'y' (here, RPS)
m = Prophet(weekly_seasonality=True, daily_seasonality=True)
m.fit(df)
future = m.make_future_dataframe(periods=72, freq='H')  # 72 hourly steps ≈ 3 days out
forecast = m.predict(future)
    • Feed predicted λ into the resource fit to get pod/node counts. Stick to 48–72 hours; beyond that, you’re speculating.
  5. Translate to headroom and pre-scaling

    • Pick a headroom policy: 30–40% over predicted p95 demand or “one node spare per AZ.”
    • Pre-scale before known spikes (marketing emails, region cutovers) and when burn_rate rises during “normal” periods. A worked sketch of the whole model follows this list.
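
Put together, the whole model is a handful of multiplications and a max, which is exactly what you want to be able to defend in a postmortem. Here’s a sketch with made-up coefficients; cpu_beta, w_seconds, and max_rps_per_pod are placeholders you’d refit from your own telemetry and load tests.

import math
from dataclasses import dataclass

@dataclass
class ServiceFit:
    cpu_alpha: float        # intercept of the CPU-vs-RPS fit (cores at ~zero load)
    cpu_beta: float         # cores per RPS from the linear fit
    w_seconds: float        # p90 service time; the W in Little's Law
    pod_cpu_limit: float    # cores one pod can actually use before throttling
    max_rps_per_pod: float  # measured cliff where the linear fit stops holding

def required_pods(fit, predicted_rps, headroom=0.35):
    demand = predicted_rps * (1 + headroom)        # headroom policy applied to the forecast
    concurrency = demand * fit.w_seconds           # Little's Law: L = lambda * W
    cores = fit.cpu_alpha + fit.cpu_beta * demand  # linear resource fit
    pods_by_cpu = math.ceil(cores / fit.pod_cpu_limit)
    pods_by_cliff = math.ceil(demand / fit.max_rps_per_pod)   # clamp at the saturation regime
    return {"expected_concurrency": round(concurrency),
            "pods": max(pods_by_cpu, pods_by_cliff)}

# Placeholder coefficients: 0.004 cores/RPS, 120ms p90, 2-core pods, cliff at 400 RPS/pod.
api = ServiceFit(cpu_alpha=0.5, cpu_beta=0.004, w_seconds=0.12,
                 pod_cpu_limit=2.0, max_rps_per_pod=400)
print(required_pods(api, predicted_rps=9_000))   # feed the 48–72h forecast peak in here

The concurrency estimate sizes thread pools; the larger of the two pod counts is what minReplicas and your pre-scale job should respect.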

Autoscaling that tracks reality, not dashboards

Kubernetes HPA v2 and KEDA can scale on the metrics that matter if you wire the adapters correctly.

  • Use k8s-prometheus-adapter to surface custom metrics like queue_length_per_pod or active_requests.
  • Prefer per-pod or object metrics over cluster-averaged CPU%.
  • Keep scaling behavior conservative: stabilization windows, max surge, and min replicas tuned to avoid flapping.

Example: scale API pods on in-flight requests, not CPU:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_server_active_requests
        target:
          type: AverageValue
          averageValue: "50"  # target 50 active requests per pod

Example: scale consumers on Kafka lag with KEDA:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payments-consumer
spec:
  minReplicaCount: 3
  maxReplicaCount: 100
  scaleTargetRef:
    name: payments-consumer
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: payments
        topic: payments
        lagThreshold: "5000"
        activationLagThreshold: "500"

Don’t forget the boring bits: Cluster Autoscaler limits, node pool SKUs, and PodDisruptionBudget/PriorityClass so the cluster can actually add capacity without evicting the hot path.

Tie telemetry to triage and rollout automation

A capacity model isn’t done until it drives decisions automatically. We gate rollouts and slash MTTR using the same leading indicators.

  • SLO-aware alerts
    • Multi-window error_budget_burn_rate is the north star.
# Burn rate = error ratio / error budget (here a 99.5% availability SLO).
# Page only when both the fast (1h) and slow (6h) windows are burning hot.
expr: |
  (
    sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
  ) / (1 - 0.995) > 14.4
  and
  (
    sum(rate(http_requests_total{status=~"5.."}[6h]))
    / sum(rate(http_requests_total[6h]))
  ) / (1 - 0.995) > 6
    • Pair with tail-latency slope and queue growth to catch regressions early.

  • Gate canaries with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-slo-check
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.250  # 250ms
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="api",version="canary"}[1m])) by (le))
    - name: burn-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 2  # 2x budget
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            (
              sum(rate(http_requests_total{service="api",version="canary",status=~"5.."}[1m]))
            ) /
            (
              sum(rate(http_requests_total{service="api",version="canary"}[1m]))
            ) / (1 - 0.995)
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 25
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 50
        - analysis:
            templates:
              - templateName: api-slo-check
        - setWeight: 100
  • Runbooks wired to alerts

    • Every alert links to a runbook with kubectl, kafka-consumer-groups, and Grafana panels.
    • First triage step is always “check saturation/lag slope,” not “restart things.”
  • ChatOps for human-in-the-loop

    • A /promote command only appears when SLO checks pass; /rollback posts the failing metrics inline. A gate sketch follows below.
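
Under the hood, the /promote gate is just the same two Prometheus checks the canary analysis runs, evaluated before the bot renders the button. A minimal sketch, with the Prometheus address and the 250ms/2x limits carried over from the templates above as assumptions:

import requests

PROM = "http://prometheus:9090"   # assumption: adjust to your Prometheus endpoint
CHECKS = {
    "p99_latency_s": (
        'histogram_quantile(0.99, sum(rate('
        'http_request_duration_seconds_bucket{service="api",version="canary"}[5m])) by (le))',
        0.250,   # fail at or above 250ms
    ),
    "burn_rate": (
        '(sum(rate(http_requests_total{service="api",version="canary",status=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{service="api",version="canary"}[5m]))) / (1 - 0.995)',
        2.0,     # fail at or above 2x budget burn
    ),
}

def instant(query: str) -> float:
    """Run an instant query and return the first sample value (0.0 if no data)."""
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def promotion_gate():
    failures = []
    for name, (query, limit) in CHECKS.items():
        value = instant(query)
        if value >= limit:
            failures.append(f"{name}={value:.3f} (limit {limit})")
    return (not failures), failures

ok, failing = promotion_gate()
print("/promote available" if ok else f"/rollback suggested: {', '.join(failing)}")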

A 48-hour Black Friday prediction that held under fire

A retailer (K8s on EKS, MSK for Kafka, Aurora Postgres, Redis cache) asked us to sanity-check their scale plan 48 hours before Black Friday. Their dashboards said 60% CPU, “we’re fine.” Our model said otherwise:

  • Leading indicators

    • container_cpu_cfs_throttled_seconds_total rising 10–15%/h on the API tier.
    • kafka_consumergroup_lag derivative > 0 despite steady rps.
    • db_connections_in_use brushing the limit during traffic spikes; lock_waits_total inching up.
  • Actions in 24 hours

    • Switched HPA target to http_server_active_requests and raised minReplicas from 8 to 20.
    • Added a larger c5.4xlarge node group for API with cpuManagerPolicy: static to kill throttling.
    • Pre-warmed the Redis cluster, doubled maxmemory, and tuned the eviction policy.
    • KEDA policy on consumer lag with activationLagThreshold to wake up earlier.
    • Gated canary with burn-rate and p99 checks via Argo Rollouts.
  • Results

    • Black Friday: p95 latency held at 120ms (was 210ms previous year), 0 sev-1, 1 auto-rollback caught by burn-rate.
    • Spend +12% for the weekend, but cost per order down 18%.
    • On-call slept. Business shipped promos without rollback fear.

What I’d do differently next time

  • Turn “one-off” changes into code. We landed Terraform modules for nodegroups and KEDA, Sloth for SLOs, and versioned them under platform/.
  • Tighten labels. Cardinality explosions in OTel traces mask the real signals; sample intelligently at tail.
  • Make saturation tests part of CI. We now block merges if a 10-minute k6 run shows p99 drift > 10% at baseline load (sketch after this list).
  • Review headroom weekly. Marketing doesn’t tell engineering; the model should catch early demand shifts.
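
The CI saturation gate is deliberately dumb: compare p99 from the current run against a stored baseline and fail on drift. The sketch below assumes both runs were exported with k6’s --summary-export plus --summary-trend-stats so the summary JSON carries a p(99) field; the file names are placeholders.

import json
import sys

MAX_DRIFT = 0.10  # 10% p99 regression budget at baseline load

def p99_ms(path: str) -> float:
    """Read p99 of http_req_duration from a k6 summary export (assumed structure)."""
    with open(path) as f:
        summary = json.load(f)
    return summary["metrics"]["http_req_duration"]["p(99)"]

baseline, current = p99_ms("k6-baseline.json"), p99_ms("k6-current.json")
drift = (current - baseline) / baseline
print(f"p99 baseline={baseline:.1f}ms current={current:.1f}ms drift={drift:+.1%}")
if drift > MAX_DRIFT:
    sys.exit(1)  # block the merge; the capacity model's coefficients just changed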

Field checklist you can copy-paste

  1. Define SLIs/SLOs that the business actually cares about: availability, p95/p99 latency, and freshness for async pipelines.
  2. Instrument leading indicators: queue depths and slopes, CPU throttling, GC pauses, thread-pool backlog, consumer lag, connection pool usage.
  3. Build the model: Little’s Law for concurrency + linear fit of resource vs. load; annotate known saturation cliffs.
  4. Validate with synthetic load; update coefficients each deploy or significant config change.
  5. Forecast 48–72 hours out with simple seasonality. Pre-scale for spikes and when burn-rate rises under “normal” load.
  6. Wire autoscaling to custom metrics using HPA v2 and KEDA; set sensible stabilization windows.
  7. Gate rollouts with Argo Rollouts AnalysisTemplates tied to SLOs; auto-rollback on regressions.
  8. Keep runbooks and alerts glued together; first triage step checks leading indicators.
  9. Postmortem misses and bake changes back into code within 24 hours. Don’t let tribal knowledge rot.


Key takeaways

  • Track leading indicators that predict incidents: saturation, queue growth rate, tail latency drift, error-budget burn, and resource throttling.
  • Use simple, explainable models first: Little’s Law + linear fits from telemetry beat black-box ML for capacity planning.
  • Wire metrics to autoscaling through `HPA v2` and `KEDA` using custom metrics like `queue_length_per_pod` and `kafka_consumer_lag`.
  • Gate canary promotions with SLO-aligned signals using `Argo Rollouts` or `Flagger` to automate rollbacks.
  • Create headroom and warm paths proactively; predict and pre-scale before peak windows.
  • Keep dashboards honest: prefer multi-window burn-rate and slope-based alerts over instantaneous thresholds.

Implementation checklist

  • Define SLIs/SLOs that matter: availability, tail latency, and freshness (for pipelines).
  • Instrument saturation and backpressure: CPU throttling, run queue, GC pauses, thread-pool queue length, and queue depth slope.
  • Model capacity with Little’s Law and linear fits of load vs. resource usage; validate with synthetic load.
  • Forecast near-term demand (48–72 hours) with weekly seasonality; pre-scale when error-budget burn rises under normal load.
  • Deploy `HPA v2`/`KEDA` for autoscaling on custom metrics tied to SLOs, not CPU%.
  • Gate rollouts with `Argo Rollouts` AnalysisTemplates that watch burn-rate and p99 drift; auto-rollback on regressions.
  • Document triage runbooks linked in alerts; include commands and dashboards for each metric.
  • Run postmortems on capacity misses and update models/alerts within 24 hours.

Questions we hear from teams

Why not just use CPU or memory for autoscaling?
Because CPU% and RSS are lagging and often misleading. They don’t capture contention (throttling, run-queue) or backpressure (queue growth, lag slope). Scale on signals tied to service time and concurrency like active requests, queue depth per pod, or consumer lag.
Do we need ML to forecast demand?
No. A simple model (Little’s Law + linear fit + Holt-Winters seasonality) usually beats black-box ML because it’s explainable and debuggable in incident reviews. Add complexity only when residuals demand it.
How do we avoid flapping when scaling on custom metrics?
Use stabilization windows, minimum replicas, and conservative policy steps. Smooth with 1–5 minute rates, and clamp on known saturation cliffs. Validate in staging with a small load test.
What about databases? Autoscaling won’t save us there.
Correct. DBs are capacity-planned, not autoscaled. Watch connection pool saturation, lock waits, and storage IOPS headroom. Pre-scale replicas, partition hot tables, and warm caches before peak windows. Tie app-level backpressure to protect the DB.
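
In practice, “tie app-level backpressure to protect the DB” means bounding in-flight database work in the service itself and failing fast when the bound is hit. A minimal asyncio sketch, with the 40-slot cap and 50ms acquire timeout as placeholder values:

import asyncio

DB_CONCURRENCY_LIMIT = 40      # keep below this service's share of max_connections
ACQUIRE_TIMEOUT_S = 0.05       # fail fast; a 503 now beats a lock pileup later

db_slots = asyncio.Semaphore(DB_CONCURRENCY_LIMIT)

class DatabaseOverloaded(Exception):
    """Raised so the HTTP layer can return 503 + Retry-After instead of piling on."""

async def with_db_backpressure(run_query):
    """Run a DB call only if an in-flight slot is available within the timeout."""
    try:
        await asyncio.wait_for(db_slots.acquire(), timeout=ACQUIRE_TIMEOUT_S)
    except asyncio.TimeoutError:
        raise DatabaseOverloaded("db concurrency limit reached")
    try:
        return await run_query()
    finally:
        db_slots.release()

# Usage inside a handler (pool is your driver's connection pool, shown as a placeholder):
#   rows = await with_db_backpressure(lambda: pool.fetch("SELECT ..."))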

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about a capacity and SLO tune-up
Download our SLO-driven autoscaling templates