Capacity Planning That Actually Predicts Outages (Not Just Makes Grafana Pretty)

Build models from leading indicators—saturation, queues, and SLO burn—then wire them into triage and progressive delivery so scaling and rollbacks happen before customers feel it.

Capacity planning isn’t a spreadsheet. It’s a feedback loop that ends in an automated decision.

The dashboard said we were fine—right up until we weren’t

I’ve watched teams “capacity plan” by staring at average CPU and declaring victory. Then the incident hits: p95 latency goes vertical, queues explode, customers start rage-refreshing, and the on-call learns (again) that CPU ≠ capacity.

What actually predicts scaling needs is boring and brutally consistent:

  • Saturation of the real bottleneck (CPU or DB connections or thread pools or GC or I/O)
  • Queue growth (work arriving faster than you can drain it)
  • SLO burn rate (your error budget evaporating faster than you can react)

If you model those and wire them into automation, you stop “reactive scaling” and start preventing the page.

Stop planning with vanity metrics (and start with failure modes)

Vanity metrics are the ones that look good in a weekly email and tell you nothing about imminent pain:

  • Average CPU across the cluster
  • Total requests per minute (without latency/error context)
  • Node count, pod count, “pods running”

Leading indicators are tied to how systems actually fail:

  • Utilization at the constraint (e.g., container_cpu_usage_seconds_total, DB pool utilization, in-flight HTTP request concurrency)
  • Queue depth / lag (Kafka consumer lag, SQS visible messages, internal work queues)
  • Tail latency slope (p95/p99 trend, not just current value)
  • Error budget burn (multi-window burn rate beats static thresholds)

Concrete examples I’ve seen predict incidents reliably:

  • Kafka: consumer lag increasing for 10–20 minutes even while consumer CPU looks “fine”
  • Postgres: connection pool at 95% → request concurrency piles up → p95 goes nuts
  • JVM services: GC pause time trending up under load → latency spikes → retries amplify load

Capacity planning starts by writing down the failure mode: “We fall over when DB connections saturate and queue latency crosses X.” Then you instrument and model that.

A capacity model that doesn’t require a PhD

Here’s what actually works in practice: a simple model you can explain in five minutes during an incident review.

  1. Pick the constraint metric (the real bottleneck)
  2. Link load → utilization (how fast you approach saturation)
  3. Link utilization → latency/queue growth (where things get nonlinear)

A workhorse mental model is Little’s Law:

L = λ * W (items in system = arrival rate * average time in system)

You don’t need perfect accuracy—you need a model that answers: “If traffic grows 30%, when do we hit the knee of the curve?”

For an HTTP service behind Kubernetes, you can often approximate:

  • Arrival rate: RPS
  • Concurrency: in-flight requests
  • Service time: latency (use the mean for the math; watch p95 for the pain)

Then:

  • If concurrency grows faster than pod count, queueing has already started
  • If p95 latency starts rising while RPS is steady, you’re saturating a hidden constraint (DB, locks, GC, downstream)
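You can sanity-check the concurrency side directly in PromQL. A minimal sketch using the metric names from later in this post (the _sum/_count series are the standard companions of the duration histogram; Little’s Law wants mean latency here, not p95):

# Little's Law estimate: in-flight requests ≈ arrival rate (λ) × mean time in system (W)
sum(rate(http_requests_total{job="api"}[5m]))
*
(
  sum(rate(http_server_request_duration_seconds_sum{job="api"}[5m]))
  /
  sum(rate(http_server_request_duration_seconds_count{job="api"}[5m]))
)

Multiply the arrival rate by 1.3 and you have the concurrency you’d need to absorb 30% growth; if that doesn’t fit inside pod count × per-pod concurrency limits, you’ve found the knee before production does.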

A practical output is a “time-to-pain” forecast:

  • “At current trend, Kafka lag hits 1M messages in ~42 minutes.”
  • “At current burn rate, we exhaust the 30-day error budget in ~6 hours.”

That’s actionable for scaling and rollouts.

Prometheus queries for leading indicators (with teeth)

These are the kinds of PromQL queries we use at GitPlumbers when we’re cleaning up a system that’s been “observed” but not actually operated.

1) SLO burn rate (multi-window)

Example: availability SLO based on 5xx ratio.

# Error ratio over 5m
(
  sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="api"}[5m]))
)

Burn rate alerting uses multiple windows to avoid flapping and catch fast burns.

# prometheus-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-burn
spec:
  groups:
  - name: api-slo
    rules:
    - alert: ApiFastBurn
      expr: |
        (
          sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
        ) > 0.01
        and
        (
          sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="api"}[1h]))
        ) > 0.005
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "API error budget burning fast"
        runbook_url: "https://gitplumbers.com/runbooks/api-fast-burn"

This is not “pretty.” This is predictive: burn starts before customer support tickets pile up.
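To turn that burn rate into the “time-to-pain” number from earlier (“~6 hours to exhaust the budget”), divide the budget window by the burn rate. A minimal sketch, assuming a 99.9% availability SLO (0.001 allowed error ratio), a 30-day window, and a full remaining budget; a proper version would subtract the budget you’ve already spent:

# Hours until the 30-day budget is gone at the current 1h burn rate
# burn rate = observed error ratio / allowed error ratio (0.001 for a 99.9% SLO)
# (returns nothing when there are zero errors in the window)
720
/
(
  (
    sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
    /
    sum(rate(http_requests_total{job="api"}[1h]))
  )
  / 0.001
)

If that number drops below your time-to-mitigate, page; if it’s days out, a ticket is enough.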

2) Queue growth (trend, not snapshot)

Kafka lag is a classic leading indicator.

# Current lag
sum(kafka_consumergroup_lag{consumergroup="billing"})

Trend-based early warning (sustained positive slope):

# Forecast 30 minutes ahead from the last 15m trend (note the subquery syntax)
predict_linear(
  sum(kafka_consumergroup_lag{consumergroup="billing"})[15m:1m],
  30*60
) > 1000000

That query answers: “Will lag exceed 1,000,000 in the next 30 minutes?” That’s capacity planning you can act on.

3) Saturation at the real choke point

If your DB pool is the limiter:

# Example: HikariCP pool utilization
max(hikaricp_connections_active{pool="main"})
/
max(hikaricp_connections_max{pool="main"})

Alert on sustained saturation and correlate it with p95 latency:

histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)

When pool utilization and p95 rise together, you’re not “under-provisioned CPU.” You’re under-provisioned concurrency at the DB boundary.
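If you’d rather encode that correlation than eyeball two panels, combine them in one expression. A sketch with illustrative thresholds (90% pool utilization, 400ms p95); pair it with a `for:` clause so it only fires when sustained:

# Pool nearly saturated AND p95 already degrading
(
  max(hikaricp_connections_active{pool="main"})
  /
  max(hikaricp_connections_max{pool="main"})
) > 0.9
and
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
) > 0.4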

Tie capacity signals to triage (so on-call isn’t guessing)

I’ve seen too many orgs where alerts fire, everyone joins a Zoom, and the first 20 minutes are archaeology.

Make the telemetry do the routing and the first steps:

  • Alert labels encode ownership (service, team, tier) so routing can key on them; see the sketch after this list. Keep runbook_url in annotations, as in the example below
  • Runbooks start with the leading indicator: “Is queue growing? Is saturation rising? Is burn rate fast?”
  • Dashboards are linked directly in alert annotations
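Routing on those labels is mechanical once they exist. A minimal Alertmanager fragment, assuming every alert carries a team label; the receiver names are placeholders:

# alertmanager.yaml (fragment); receiver names are placeholders
route:
  receiver: default-tickets
  routes:
  - matchers:
    - team="billing"
    - severity="page"
    receiver: billing-oncall-pager
  - matchers:
    - team="billing"
    receiver: billing-slack
receivers:
- name: default-tickets
- name: billing-oncall-pager
- name: billing-slack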

Example alert annotations that reduce MTTR:

annotations:
  summary: "Billing consumer lag forecast to breach in 30m"
  description: "predict_linear indicates lag > 1,000,000 in 30m"
  dashboard: "https://grafana.example.com/d/abc123/billing-consumer"
  runbook_url: "https://gitplumbers.com/runbooks/billing-lag"
  triage_hint: "Check consumer throughput vs partition count; scale replicas or increase max.poll.records"

Operationally, the win is that your incident starts at minute 0 with:

  • the probable bottleneck
  • the time horizon
  • the known good actions

That’s how you turn capacity planning into incident prevention.

Wire it into rollout automation (pause/abort before the blast radius grows)

Capacity regressions often arrive via deploys: a new N+1 query, a retry loop, a “harmless” logging change that melts disks. You can catch that by gating rollouts on the same leading indicators.

With Argo Rollouts, you can automate analysis using Prometheus queries.

# argo-rollouts-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-capacity-guardrails
spec:
  metrics:
  - name: error-rate
    interval: 30s
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          (
            sum(rate(http_requests_total{job="api",status=~"5..",pod=~"{{args.podRegex}}"}[2m]))
            or vector(0)
          )
          /
          sum(rate(http_requests_total{job="api",pod=~"{{args.podRegex}}"}[2m]))
  - name: p95-latency
    interval: 30s
    successCondition: result[0] < 0.4
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          histogram_quantile(0.95,
            sum(rate(http_server_request_duration_seconds_bucket{job="api",pod=~"{{args.podRegex}}"}[2m])) by (le)
          )
  - name: db-pool-saturation
    interval: 30s
    successCondition: result[0] < 0.85
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          max(hikaricp_connections_active{pool="main",pod=~"{{args.podRegex}}"})
          /
          max(hikaricp_connections_max{pool="main",pod=~"{{args.podRegex}}"})

Now your rollout can pause when saturation creeps up (before customers notice), and your SRE team doesn’t have to be the human circuit breaker.
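For reference, this is roughly how that template hooks into a canary. A sketch, not a drop-in: the Rollout name, step weights, and the pod regex are placeholders (Argo Rollouts can also pass the canary pod-template hash into args if you prefer that over a name-based regex):

# rollout.yaml (fragment)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - analysis:
          templates:
          - templateName: api-capacity-guardrails
          args:
          - name: podRegex
            value: "api-.*"   # placeholder; scope this to canary pods in your environment
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100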

This is the bridge most orgs miss: telemetry → automated decision → controlled blast radius.

What we do differently at GitPlumbers (and what to steal)

I’ve seen this fail when teams try to boil the ocean: “Let’s build a global capacity model for the whole platform.” Six months later: abandoned spreadsheet, same pages.

Here’s what actually works:

  1. Pick your top 3 customer-facing services and define one constraint each.
  2. Add two leading indicator alerts per service:
    • one for SLO burn
    • one for queue/saturation trend
  3. Add one automation hook:
    • rollout analysis gate, or
    • autoscaling policy based on queue trend (not CPU)
  4. Run one game day and tune thresholds.

If you’re running Kubernetes, you’ll often combine:

  • HPA for fast reaction (but feed it better signals than CPU; see the sketch after this list)
  • Cluster Autoscaler for node capacity
  • VPA for right-sizing after you understand utilization
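Here’s what “better signals than CPU” can look like for the HPA bullet above: scale the billing consumer on Kafka lag exposed as an external metric through Prometheus Adapter. A sketch; the external metric name and label selector depend entirely on your adapter rules, and the target lag per replica should come from your measured drain rate:

# hpa-billing-consumer.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-consumer
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-consumer
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumergroup_lag
        selector:
          matchLabels:
            consumergroup: billing
      target:
        type: AverageValue
        averageValue: "10000"   # target lag per replica; tune from measured drain rate

Cap maxReplicas at something your partition count and downstreams can actually absorb; scaling consumers past the partition count buys you nothing.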

And if you’re dealing with AI-generated changes (we’re seeing a lot of this), treat them like any other risky deploy: gate on burn + saturation. “Vibe-coded” retry logic can torch your downstreams in minutes.

If you want a second set of eyes, GitPlumbers typically comes in, identifies the real constraints, fixes the telemetry so it matches reality, and wires the whole thing into ArgoCD/Argo Rollouts so it stays fixed.

Capacity planning isn’t a spreadsheet. It’s a feedback loop that ends in an automated decision.

Next steps

  • Audit one service this week: identify the constraint, add a queue/saturation trend query, and link it to a runbook.
  • Add an SLO burn alert that pages only when it matters.
  • Gate your next risky rollout on those indicators.

If that sounds like the kind of work your team never has time to do while also shipping features, that’s exactly where GitPlumbers helps: we’ll turn your existing Prometheus/Grafana data into predictable scaling and fewer 3 a.m. surprises.

Key takeaways

  • Capacity planning works when it’s tied to **saturation + queues + SLO burn**, not CPU averages.
  • Use simple models (Little’s Law + utilization curves) and validate them against incident timelines.
  • Treat **queue growth** and **error-budget burn** as early-warning signals; they predict incidents hours before dashboards turn red.
  • Wire telemetry into action: alerts route to the right runbook, and rollouts can automatically pause/abort on bad trends.
  • Forecast “time-to-pain” (e.g., time until consumer lag breaches) and scale on that, not on request rate alone.

Implementation checklist

  • Define 2–4 **leading indicators** per service: saturation, queue depth, SLO burn rate, and a domain-specific limiter (DB connections, thread pools, Kafka lag).
  • Create an SLO with an **error budget** and at least one **burn-rate** alert.
  • Write PromQL queries that detect **trend + persistence** (not spikes) for queues and saturation.
  • Build a capacity model: map load → utilization → latency/queue growth; validate against real incidents.
  • Connect alerts to triage: link to runbooks, auto-attach dashboards, and page the owning team only.
  • Add rollout gates (Argo Rollouts/Flagger) that auto-pause/rollback on SLO burn or queue growth.
  • Review quarterly: update model coefficients after infra/app changes and post-incident learnings.

Questions we hear from teams

Why isn’t CPU enough for capacity planning?
Because most incidents are constraint-driven elsewhere: DB connection pools, downstream rate limits, lock contention, GC, disk I/O, or queues. CPU can look fine while latency and backlog climb. Plan around the real bottleneck and you’ll predict incidents earlier.
What’s the single best leading indicator across services?
If you can only pick one, pick **SLO burn rate** because it measures user impact directly. If you can pick two, add a **queue/lag trend** metric because it forecasts approaching failure before errors spike.
How do we prevent alert fatigue while adding these signals?
Page on multi-window burn and sustained trends (with `for:` and slope-based queries), not spikes. Route noisier signals to ticket or Slack severities, and reserve paging for conditions that threaten the error budget or forecast an imminent breach.
How does this connect to autoscaling?
Scale on predictors like queue depth/lag and concurrency, not CPU averages. Use Kubernetes `HPA` with custom metrics (Prometheus Adapter) and validate scaling behavior during load tests so you don’t just scale yourself into a downstream bottleneck.
Can rollout automation really prevent incidents?
Yes—when your canary analysis watches leading indicators (burn, p95 slope, saturation) and automatically pauses/aborts. This catches capacity regressions introduced by deploys before you shift 100% of traffic.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Talk to GitPlumbers about capacity forecasting that prevents pages
  • See GitPlumbers’ Reliability & Observability work
