Your Incidents Are Predictable: Build Playbooks That Route, Triage, and Roll Back Themselves

Stop paging on vanity graphs. Use leading indicators, wire telemetry to triage, and let rollouts auto-protect you. Here’s how to make incident playbooks that actually scale across teams.

Incidents aren’t random. If you watch saturation and tail latency—and connect them to rollout automation—you’ll see them coming and stop them before customers do.

The outage that wasn’t an outage

If you’ve ever watched a team wake up to a wall of 500s only to discover the root cause was a slow dependency or a saturated thread pool, you know the pain. I’ve seen teams at unicorns and 40-year-old insurers page on average CPU while the real problem was p95 latency creep and connection pool exhaustion. By the time the 500s arrive, you’re already negotiating with the CFO about the SLA credits.

What actually scales across multiple teams isn’t a PDF runbook and a hope. It’s playbooks that are:

  • Driven by leading indicators that give you 10–30 minutes of warning
  • Executable, not just readable
  • Self-routing to the right on-call with context
  • Integrated with rollout automation that can stop the bleeding without heroics

This is the pattern we’ve implemented at GitPlumbers in shops using Prometheus, Alertmanager, Grafana, OpenTelemetry, Argo Rollouts or Flagger, and ArgoCD under a GitOps model.

Stop measuring vibes: choose leading indicators that predict incidents

Vanity metrics page late and often. The leading indicators that actually forecast pain:

  • Saturation: thread/worker pool queue depth, DB connection pool utilization, CPU throttling (container_cpu_cfs_throttled_periods_total) against kube_pod_container_resource_limits
  • Tail latency: p95/p99 on the hot path creeping toward the SLO budget
  • Backlog: Kafka consumer lag, SQS message age, job queue length derivative
  • Memory & GC pressure: GC pause time, rising container_oom_events_total
  • Dependency health: upstream error rate/latency trends, circuit breaker open rate

Anchor alerts to SLO math instead of thresholds chosen by feel. Multi-window, multi-burn-rate SLO alerts give you a fast and a slow detector that won’t flap. Example Prometheus rules (99% SLO):

# prometheus-rule-slo.yaml
# Error budget burn: fast (5m and 1h windows) and slow (30m and 6h windows)
groups:
- name: slo-burn
  rules:
  - alert: SLOErrorBudgetBurnFast
    expr: |
      (
        sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="api"}[5m]))
      ) > (1 - 0.99) * 14.4
      and
      (
        sum(rate(http_requests_total{job="api",status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total{job="api"}[1h]))
      ) > (1 - 0.99) * 14.4
    for: 5m
    labels:
      severity: critical
      team: payments
      service: api
    annotations:
      summary: "High error budget burn (fast) for api"
      runbook_url: "https://git.example.com/org/api/blob/main/ops/runbooks/slo-burn.md"
      dashboard: "https://grafana.example.com/d/abc123/api-slo"

  - alert: SLOErrorBudgetBurnSlow
    expr: |
      (
        sum(rate(http_requests_total{job="api",status=~"5.."}[30m]))
        /
        sum(rate(http_requests_total{job="api"}[30m]))
      ) > (1 - 0.99) * 6
      and
      (
        sum(rate(http_requests_total{job="api",status=~"5.."}[6h]))
        /
        sum(rate(http_requests_total{job="api"}[6h]))
      ) > (1 - 0.99) * 6
    for: 30m
    labels:
      severity: warning
      team: payments
      service: api
    annotations:
      summary: "Elevated error budget burn (slow) for api"
      runbook_url: "https://git.example.com/org/api/blob/main/ops/runbooks/slo-burn.md"
      dashboard: "https://grafana.example.com/d/abc123/api-slo"

And some saturation/backlog predictors:

# Queue depth trending up faster than drain rate
- alert: KafkaConsumerLagTrend
  expr: deriv(kafka_consumergroup_lag{group="payments"}[10m]) > 0
        and avg_over_time(kafka_consumergroup_lag{group="payments"}[10m]) > 5e4
  for: 10m
  labels:
    severity: critical
    team: payments
  annotations:
    summary: "Consumer lag rising (payments)"
    runbook_url: "https://git.example.com/org/payments/ops/runbooks/kafka-lag.md"

# CPU throttling indicates impending latency blowups on K8s
- alert: KubernetesCPUThrottlingHigh
  expr: rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])
        /
        rate(container_cpu_cfs_periods_total{container!=""}[5m]) > 0.2
  for: 10m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High CPU throttling (possible pod starvation)"
    runbook_url: "https://git.example.com/org/platform/ops/runbooks/cpu-throttle.md"

Notice the team, service, and runbook_url baked into the alerts. That’s the backbone of scale.

Make playbooks executable, not PDFs

Playbooks should run. At minimum, they should:

  • Declare a trigger (alertname/labels)
  • Provide a 1-page triage path with commands
  • Offer automated rollback/toggle steps
  • Link to dashboards, logs, and recent deploys

A simple, repo-hosted runbook pattern:

# ops/runbooks/slo-burn.yaml
apiVersion: ops.gitplumbers.io/v1
kind: Runbook
metadata:
  name: api-slo-burn
spec:
  trigger:
    alertname: SLOErrorBudgetBurnFast
    matchLabels:
      service: api
  triage:
    - name: Inspect recent deploys
      run: bash
      script: |
        echo "Last 3 deploys:" && \
        kubectl -n prod rollout history deploy/api | tail -n 10
    - name: Check tail latency & errors
      run: bash
      script: |
        echo "Check dashboard: https://grafana.example.com/d/abc123/api-slo?var-service=api"
    - name: Query top offenders
      run: bash
      script: |
        honeycomb-query --dataset api --query-file ops/queries/top-endpoints.json
  mitigate:
    - name: Reduce traffic by 20% (Istio circuit breaker)
      run: bash
      script: |
        kubectl -n prod apply -f ops/manifests/dst-rule-degrade.yaml
    - name: Rollback last rollout
      run: bash
      script: |
        kubectl -n prod rollout undo deploy/api

Wire it up with a tiny dispatcher all teams share:

#!/usr/bin/env bash
# gp-runbook (installed in on-call container images)
set -euo pipefail
ALERT="$1"  # path to the alert JSON handed to us by Alertmanager
RUNBOOK=$(jq -r '.annotations.runbook_url' "$ALERT")
YAML=$(curl -sL "$RUNBOOK")
# Execute each triage step as a whole script (steps can span multiple lines); stop at the first failure
COUNT=$(yq '.spec.triage | length' <<<"$YAML")
for i in $(seq 0 $((COUNT - 1))); do
  echo "[triage] $(yq -r ".spec.triage[$i].name" <<<"$YAML")"
  bash -lc "$(yq -r ".spec.triage[$i].script" <<<"$YAML")" || break
done

Yes, we’ve cleaned up a lot of AI-generated “vibe runbooks” that looked plausible but weren’t executable. Treat runbooks like code: code review, CI lint, and periodic game days.

Wire telemetry to triage: enrich and route automatically

Pages without context burn minutes. Add the context at the source:

  • Route by team/service in Alertmanager
  • Enrich with links to Grafana, runbook_url, OpenTelemetry traces (e.g., Honeycomb), and the last ArgoCD sync/commit
  • Auto-open incidents in your chosen tool with labels that match

Alertmanager snippet that scales across orgs:

# alertmanager.yaml
route:
  group_by: ['alertname','service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: default
  routes:
    - matchers:
        - team="payments"
      receiver: pagerduty-payments
    - matchers:
        - team="platform"
      receiver: pagerduty-platform

receivers:
  - name: default  # catch-all so the root route resolves; point it at Slack or a low-urgency queue
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: PD_ROUTING_KEY_PAYMENTS
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard }}'
  - name: pagerduty-platform
    pagerduty_configs:
      - routing_key: PD_ROUTING_KEY_PLATFORM

We also tack on deploy context with ArgoCD webhook enrichment: a small webhook service reads the alert, fetches app.kubernetes.io/version and the argocd app history, and attaches a deploy_sha link. In my experience that cuts MTTA by 30–50%, because the first question is always, “What changed?”
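
A minimal sketch of that enrichment as a bash hook: the argocd and jq calls are real, the alert JSON shape matches the dispatcher above, and the Slack webhook URL and message format are assumptions.

#!/usr/bin/env bash
# enrich-alert.sh: answer "what changed?" before a human even opens the page (sketch)
ALERT="$1"                               # path to the alert JSON, same shape the dispatcher uses
APP=$(jq -r '.labels.service' "$ALERT")  # assumes the ArgoCD app is named after the service
SHA=$(argocd app get "$APP" -o json | jq -r '.status.sync.revision')
LAST_DEPLOY=$(argocd app history "$APP" | tail -n 1)
# Post the deploy context wherever the incident lives; SLACK_WEBHOOK_URL is an assumed env var
curl -s -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
  -d "{\"text\": \"Alert on ${APP}: last deploy ${SHA} (${LAST_DEPLOY})\"}"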

Automate rollouts with real guardrails (Argo Rollouts/Flagger)

Don’t wait for a human to decide whether a canary is bad. Let your rollout controller query the same signals that page you.

Argo Rollouts with Prometheus-backed analysis:

# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-slo
spec:
  metrics:
  - name: error-rate
    interval: 1m
    count: 5
    successCondition: result[0] < 0.01  # 1% error budget for 99% SLO
    failureCondition: result[0] >= 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{job="api",status=~"5.."}[1m]))
          /
          sum(rate(http_requests_total{job="api"}[1m]))
  - name: p95-latency
    interval: 1m
    count: 5
    successCondition: result[0] < 0.250  # 250ms
    failureCondition: result[0] >= 0.250
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[1m])) by (le))

Tie it into a canary with auto-abort:

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 60}
      - analysis:
          templates:
          - templateName: api-slo
          args: []
      - setWeight: 50
      - pause: {duration: 120}
      - analysis:
          templates:
          - templateName: api-slo
      trafficRouting:
        istio:
          virtualService:
            name: api-vs
            routes:
            - primary
      # no extra flag needed: a failed analysis run aborts the rollout and shifts traffic back to stable

Prefer Flagger? Same idea: reference Prometheus metrics, configure maxWeight/stepWeight, and let it shift traffic based on SLO adherence. Either way, the point is: use the exact same leading indicators for canary decisions and paging.
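
A minimal Flagger sketch, assuming Istio for traffic shifting and Flagger’s built-in Prometheus checks; the name, namespace, and thresholds are placeholders:

# canary.yaml (Flagger equivalent; tune thresholds to your SLOs)
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5                 # abort and roll back after 5 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate # built-in check: percentage of non-5xx requests
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration     # built-in check: p99 latency in milliseconds
      thresholdRange:
        max: 250
      interval: 1m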

Scale across teams with templates and contracts

You don’t get scale from copy/paste. You get it from contracts:

  • Labels and annotations: team, service, severity, runbook_url, dashboard
  • SLO policy: one doc, per-tier defaults (e.g., Tier1 99.9%, Tier2 99%)
  • Golden queries: a shared PromQL library for latency, errors, and backlog (sketched below)
  • Runbook template: YAML contract with trigger, triage, mitigate
  • GitOps structure: ops/alerts/, ops/runbooks/, ops/rollouts/
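
The golden-query library can be as simple as shared Prometheus recording rules that every dashboard and alert references; the rule and metric names below are illustrative:

# ops/alerts/recording-rules.yaml (shared "golden" signals every service inherits)
groups:
- name: golden-queries
  rules:
  - record: service:http_error_ratio:rate5m
    expr: |
      sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (service) (rate(http_requests_total[5m]))
  - record: service:http_request_duration_seconds:p95_5m
    expr: |
      histogram_quantile(0.95,
        sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))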

Example Terraform module that every service consumes:

# modules/slo/main.tf
variable "service" {}
variable "team" {}
variable "slo" { default = 0.99 }

module "prometheus_rules" {
  source = "./prometheus-rules"
  service = var.service
  team    = var.team
  slo     = var.slo
}

module "alert_routing" {
  source = "./alertmanager-routing"
  team   = var.team
}

Teams get a paved road. They pass service, team, and slo; they inherit burn-rate alerts, routing, and runbook links.
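
For example, a consuming service’s config might look like this (the path and values are illustrative):

# services/checkout/slo.tf (hypothetical consumer of the shared module)
module "checkout_slo" {
  source  = "../../modules/slo"
  service = "checkout"
  team    = "payments"
  slo     = 0.999  # Tier1 default per the SLO policy doc
}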

If a new team adds a service and you have to touch Alertmanager by hand, you don’t have a platform—you have a spreadsheet.

Prove it works: drills, metrics, and paying down debt

Playbooks rot unless you exercise them. Bake the following into your operating rhythm:

  • Game days: monthly, rotate ownership, include rollback drills
  • MTTA and MTTR: track per team and per incident class; publish a weekly scorecard
  • False-page rate: anything over 2% is a fix-it; improve the signal or adjust the threshold
  • Runbook coverage: every critical alert must carry a runbook_url that passes a linter (a CI sketch follows this list)
  • Dependency SLIs: if a third-party API drives your SLO, monitor their SLI on your side too
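
A minimal linter sketch, assuming GitHub Actions, a yq that supports --exit-status (-e), and the ops/ layout above; swap in your CI of choice:

# .github/workflows/runbook-lint.yaml (a sketch; paths and checks are assumptions)
name: runbook-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Runbooks declare trigger, triage, and mitigate
        run: |
          for f in ops/runbooks/*.yaml; do
            for key in trigger triage mitigate; do
              yq -e ".spec.${key}" "$f" > /dev/null \
                || { echo "$f is missing spec.${key}"; exit 1; }
            done
          done
      - name: Runbook links in alert rules resolve
        run: |
          grep -rhoE 'runbook_url: "[^"]+"' ops/alerts/ | cut -d'"' -f2 | sort -u |
          while read -r url; do
            curl -sfIL "$url" > /dev/null || { echo "Dead runbook link: $url"; exit 1; }
          done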

We’ve seen teams cut MTTR by 40–60% in two quarters just by making runbooks executable and wiring rollout decisions to the same Prometheus signals that page them. The surprise for many execs: incidents went down while deploy frequency went up, because the guardrails were real, not vibes.

If you’re inheriting AI-generated code and “vibe runbooks,” clean them up the same way you’d refactor a flaky test suite: codify the contract, add validation, and make failure obvious. GitPlumbers has done this in legacy monoliths and hyperactive microservices shops alike.

What I’d do tomorrow if I owned your pager

  1. Pick one Tier1 service. Define a 99% SLO and implement the burn-rate rules above.
  2. Add team, service, runbook_url, and dashboard to every alert.
  3. Put a one-page, executable runbook in ops/runbooks/. Test it with a dry-run dispatcher.
  4. Add an AnalysisTemplate and canary steps to your Argo Rollouts (or Flagger).
  5. Drill next week. Measure MTTA/MTTR. Fix the rough edges.
  6. Template it in Terraform/Helm. Roll it to the next 5 services.

This is boring, methodical work. It’s also what keeps your team off the 3 a.m. roulette wheel and lets you ship faster without praying. If you want a partner who’s done the boring bits before, we’re here.


Key takeaways

  • Use leading indicators (saturation, tail latency, queue depth, throttling) instead of vanity metrics.
  • Make playbooks executable: alerts carry team, runbook_url, dashboards, and recent deploy links.
  • Tie the same signals to progressive delivery so rollouts abort automatically.
  • Standardize templates, labels, and SLO math so every team scales without bespoke glue.
  • Continuously drill, measure MTTA/MTTR, and retire flaky alerts and brittle steps.

Implementation checklist

  • Define SLOs and leading indicators per service; agree on thresholds with product.
  • Codify multi-window burn-rate alerts; add runbook_url and team labels.
  • Enrich pages with links to Grafana, recent deploy SHA, and feature flags.
  • Adopt progressive delivery (Argo Rollouts/Flagger) with Prometheus-based AnalysisTemplates.
  • Create a reusable runbook template with diagnostics and rollback commands.
  • Route alerts in Alertmanager by `team` and `service`; auto-open incidents with tags.
  • Drill monthly; track MTTA/MTTR and false-page rate; prune or fix noisy steps.

Questions we hear from teams

What’s a good starting set of leading indicators?
For web APIs: p95 latency on the hot path, error rate, worker/thread pool queue depth, DB connection pool utilization, Kafka/SQS backlog, CPU throttling ratio, and OOM/restart spikes. Tie them to SLOs instead of static thresholds.
How do we keep playbooks from going stale?
Treat them like code: store in Git, lint in CI (validate schema and external links), and run monthly game days. Expire or fix any playbook step that fails during drills.
Can we do this with Datadog/New Relic instead of Prometheus?
Yes. Use Datadog monitors with multi-window burn-rate formulas and tags for team/service. The same concepts apply—just wire the dashboards, runbook URLs, and rollout metadata into annotations and PagerDuty/Slack.
We tried AI to draft runbooks. Worth it?
Sure—as a first draft. But review for executability and accuracy. We’ve done a lot of vibe code cleanup where AI hallucinated commands or paths. Add validation, dry-run support, and drill before trusting it on-call.
How does this play with feature flags?
Great. Put kill switches and traffic shapers in the mitigate section and expose them as commands. Tie flags to telemetry so you can degrade gracefully during incidents while the canary auto-aborts if metrics fail.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about scaling your incident playbooks
See how we fix vibe code and broken runbooks
