The Runbook-Driven Game Day That Cut MTTR From 72 Minutes to 14

Stop worshiping dashboards. Wire leading indicators into runbooks and progressive delivery so incidents fix themselves—or at least page the right human with the right commands.


You don’t need more dashboards. You need leading indicators.

I’ve sat through too many postmortems where someone points at a beautiful Grafana board and says, “We didn’t see it coming.” Of course you didn’t: those widgets are lagging indicators. If you want MTTR to drop, you need to page on signals that predict user pain before the 500 storm hits.

The winners we see in the wild:

  • Error budget burn rate: multi-window alerts that catch rapid degradation long before the SLO is blown.
  • Saturation: DB connection pool usage, JVM thread pool queue length, Node.js event loop lag, CPU steal, eBPF-derived run queue length.
  • Queueing/lag: Kafka consumer group lag, Redis blocked_clients, SQS AgeOfOldestMessage.
  • Latency shape: p95/p99 growth rate (not just absolute), tail amplification under load.

The losers:

  • Pure CPU% or memory% charts without context.
  • “Requests per second” without error/latency mix.
  • Synthetic pings that don’t hit real dependencies.

Pick three leading indicators per critical service. Those become your guardrails and your alert sources. Everything else is nice-to-have, not page-worthy.
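
The “growth rate, not absolute” point is the one teams miss most often. A minimal PromQL sketch, assuming the `http_server_request_duration_seconds_bucket` histogram used later in this post (swap in your own metric and labels):

# Page on tail growth: p95 now vs. p95 half an hour ago.
# A 1.5x jump is a leading signal even while the absolute value still looks "fine".
(
  histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m])) by (le))
  /
  histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m] offset 30m)) by (le))
) > 1.5

Feed expressions like this into alert rules the same way as the burn-rate rules below.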

Design runbooks like you’re paging a future you at 3 a.m.

Most runbooks read like a wiki novel. At 3 a.m., no one’s reading paragraphs. Good runbooks are single-screen, copy/paste-friendly, and opinionated. They answer three questions fast:

  1. What just paged and why (with the exact query/graph)?
  2. What’s the fastest safe mitigation?
  3. What evidence do I capture for the postmortem?

Use a consistent structure and embed real commands, not vibes.

# Runbook: checkout service – fast burn on error budget

Trigger
- Alert: `ErrorBudgetBurnFast` (5m and 1h windows)
- SLO: 98% success / 30d

Immediate actions (pick highest confidence, lowest risk)
1. Pause rollout if in progress
   - `kubectl argo rollouts pause checkout -n shop`
2. Scale canary down if error rate correlates with new version
   - `kubectl scale deploy/checkout-canary -n shop --replicas=0`
3. Feature flag suspected path (idempotent)
   - Toggle `CHECKOUT_V2=false` in LaunchDarkly

Triage (copy/paste)
- Error spike diff new vs stable:
  - `kubectl logs -n shop deploy/checkout-canary --since=5m | grep -E "ERROR|Exception" | tail -n 100`
  - `kubectl logs -n shop deploy/checkout-stable --since=5m | grep -E "ERROR|Exception" | tail -n 100`
- DB saturation check:
  - `kubectl exec -n shop deploy/checkout-stable -- curl -s localhost:9090/metrics | grep db_pool_active`
- Kafka lag:
  - `kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group checkout-events` (or Grafana consumer-lag panel link)

Rollback
- `kubectl argo rollouts abort checkout -n shop` (abandon an in-progress canary) or `kubectl argo rollouts undo checkout -n shop` (revert to the previous revision)

Owner
- Service: checkout
- On-call: #oncall-checkout (PagerDuty: service=checkout)

Evidence
- Paste Grafana panel and `kubectl get events -n shop --sort-by=.metadata.creationTimestamp | tail`

Put these in runbooks/ in the repo next to the service code, reviewed via PRs. If it’s not versioned with the service, it will rot.
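
One cheap way to enforce that in PRs is a CI step that fails when a runbook drops a required section. A minimal sketch, assuming the section names from the example above and a flat runbooks/ directory (a hypothetical `check-runbooks.sh`, not a standard tool):

#!/usr/bin/env bash
# check-runbooks.sh - fail CI if any runbook is missing a required section
set -euo pipefail
required=("Trigger" "Immediate actions" "Triage" "Rollback" "Owner" "Evidence")
status=0
for rb in runbooks/*.md; do
  for section in "${required[@]}"; do
    if ! grep -q "^${section}" "$rb"; then
      echo "FAIL: $rb is missing section: $section"
      status=1
    fi
  done
done
exit $status

Run it in the same pipeline that builds the service so a runbook edit is as visible as a code change.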

Wire telemetry directly into triage and paging

Alerts should carry context and a link to the exact runbook section. If your alert says “High error rate” and nothing else, you’re paying an on-call tax you don’t need to.

Prometheus rules for burn rate and saturation with runbook annotations:

# prometheus-rules.yaml
groups:
- name: checkout-slo
  rules:
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{job="checkout"}[5m]))
      ) > (0.02 * 14.4)
      and
      (
        sum(rate(http_requests_total{job="checkout",status=~"5.."}[1h]))
        /
        sum(rate(http_requests_total{job="checkout"}[1h]))
      ) > (0.02 * 14.4)
    for: 5m
    labels:
      severity: page
      service: checkout
    annotations:
      summary: "Checkout SLO fast burn"
      description: "5xx ratio > fast-burn threshold over 5m. Investigate canary and DB saturation."
      runbook_url: "https://git.example.com/shop/checkout/runbooks/checkout.md#runbook-checkout-service-–-fast-burn-on-error-budget"

  - alert: DBPoolSaturation
    expr: max(db_pool_active{service="checkout"}) / max(db_pool_size{service="checkout"}) > 0.9
    for: 2m
    labels:
      severity: page
      service: checkout
    annotations:
      summary: "DB pool > 90% used"
      description: "Connection pool saturation predicts timeouts and 500s."
      runbook_url: "https://git.example.com/shop/checkout/runbooks/checkout.md#triage"

Alertmanager to Slack with runbook link and a one-click triage slash command:

# alertmanager.yaml (snippet)
receivers:
- name: slack-oncall
  slack_configs:
  - channel: '#oncall-checkout'
    title: '{{ .CommonAnnotations.summary }}'
    text: >-
      {{ .CommonAnnotations.description }}
      Runbook: {{ .CommonAnnotations.runbook_url }}
      Triage: /triage checkout {{ .CommonLabels.alertname }}
    send_resolved: true

Back it up with an actual /triage command that runs the first five checks automatically:

#!/usr/bin/env bash
# triage-checkout.sh
set -euo pipefail
ns=shop
app=checkout
rollout=checkout

say() { printf "\n### %s\n" "$1"; }

say "Current rollout status"
kubectl argo rollouts get rollout $rollout -n $ns

say "Pod health"
kubectl get pods -n $ns -l app=$app -o wide

say "Error logs (5m)"
kubectl logs -n $ns deploy/${app}-canary --since=5m | grep -E "ERROR|Exception" | tail -n 50 || true

say "DB pool metrics"
kubectl exec -n $ns deploy/${app}-stable -- curl -s localhost:9090/metrics | grep db_pool_ || true

say "Resource pressure"
kubectl top pods -n $ns -l app=$app || true

Glue this into Slack via a small bot or use PagerDuty’s event orchestration to trigger the script and paste results.
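
The lowest-effort version of that glue is an incoming webhook: run the triage script, trim the output, and post it into the alert channel. A minimal sketch, assuming a `SLACK_WEBHOOK_URL` env var and `jq` on the box that runs it; a real `/triage` slash command needs a small bot (or PagerDuty event orchestration) in front of this:

#!/usr/bin/env bash
# post-triage.sh - run the triage checks and dump the output into Slack
set -euo pipefail

# Keep only the tail of the output so the message stays well under Slack's size limits.
output="$(./triage-checkout.sh 2>&1 | tail -c 3500)"

text="Triage output for checkout:
\`\`\`
${output}
\`\`\`"

# jq handles the JSON escaping (newlines, quotes) for us.
payload="$(jq -n --arg text "$text" '{text: $text}')"

curl -sf -X POST -H 'Content-Type: application/json' \
  -d "$payload" "$SLACK_WEBHOOK_URL"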

Progressive delivery that rolls back before users notice

Runbooks are great; auto-mitigation is better. If your telemetry says the canary is hurting the error budget, the system should pause or roll back without waiting for a human.

Argo Rollouts AnalysisTemplate gating on Prometheus:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-health
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    successCondition: result[0] < 0.35 # 350ms; the histogram is in seconds
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[2m])) by (le))
  - name: error-budget-burn
    interval: 1m
    failureLimit: 2
    successCondition: result[0] < 0.02 # 5xx ratio under the 2% error budget
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          sum(rate(http_requests_total{job="checkout",status=~"5.."}[1m])) / sum(rate(http_requests_total{job="checkout"}[1m]))

Attach it to your rollout:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 2m}
      - analysis:
          templates:
          - templateName: checkout-health
      - setWeight: 50
      - pause: {duration: 3m}
      - analysis:
          templates:
          - templateName: checkout-health

If p95 or burn rate fails, Argo pauses and can auto-rollback. You still page, but the blast radius is small and your MTTR starts at “already mitigated.” Flagger offers a similar UX if you prefer Helm/Ingress route-based canaries.
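
For reference, roughly the same gates sketched in Flagger, assuming you run a mesh or ingress Flagger supports; `request-success-rate` (percent) and `request-duration` (p99, milliseconds) are Flagger's built-in checks backed by the mesh metrics:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
  namespace: shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 3          # roll back after 3 failed checks
    stepWeight: 10
    maxWeight: 50
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 98           # percent, mirrors the 98% SLO
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 350          # milliseconds (p99 from the mesh)
      interval: 1m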

Game days that aren’t tabletop theater

The best way to make runbooks real is to rehearse them. Not a whiteboard. Real commands against production-like traffic with timers running.

What to simulate first:

  • DB goes read-only: triggers pool saturation and timeouts.
  • Kafka lag spike: backpressure and stale data.
  • Regional DNS outage: path-level health checks, failover.
  • TLS cert near-expiry: check monitors and issuance pipeline.
  • CPU throttling: noisy neighbor or bad limits.

How to run a 60-minute game day that shrinks MTTR:

  1. Pick one service and one failure mode. Announce scope and rollback criteria.
  2. Use traffic replay or a load gen (e.g., vegeta, k6) to approximate prod (see the load-gen sketch after this list).
  3. Trigger the failure with reversible tools:
    • DB latency via toxiproxy:
      docker run -d --name toxiproxy -p 8474:8474 -p 5433:5433 shopify/toxiproxy
      toxiproxy-cli create checkout-db --listen 0.0.0.0:5433 --upstream db:5432
      toxiproxy-cli toxic add -t latency -a latency=250 -a jitter=100 checkout-db
    • CPU throttle on a pod (k8s limits lowered or stress-ng).
  4. Start the clock. Page the on-call like it’s real. No hints.
  5. Require the runbook. Track time-to-detection (MTTD) and time-to-mitigation (TTM).
  6. Debrief. Update runbooks and alerts immediately. PR the changes.
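
For step 2, a minimal vegeta sketch (the endpoint is a stand-in; point it at whatever mirrors prod, and tune rate and duration to your real baseline):

# Steady, reversible load for the game day; stop with Ctrl-C or let the duration expire.
echo "GET https://checkout.example.com/api/checkout" | \
  vegeta attack -rate=200/1s -duration=5m | \
  tee results.bin | \
  vegeta report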

Do this monthly. Rotate facilitators. Ban “hero moves.” If a step requires console clicks, automate it or at least document a CLI equivalent.

Measure what matters (and nothing else)

You can’t improve MTTR if you can’t measure it. We track:

  • MTTD: from incident start to first signal detected.
  • MTTR/TTM: to “impact mitigated,” not “root cause fixed.”
  • Change fail rate: % of deploys that cause an incident.
  • Error budget burn: % consumed per week.
  • Page volume: actionable pages per on-call shift.

Hook incident tooling into your metrics. With PagerDuty, we export incidents and compute MTTR weekly.

-- Snowflake: MTTR by service over last 4 weeks
select service, avg(datediff('minute', triggered_at, resolved_at)) as mttr_minutes
from pagerduty.incidents
where triggered_at >= dateadd(week, -4, current_timestamp())
group by service
order by mttr_minutes asc;

For SLOs, use Sloth to generate Prometheus rules so burn alerts are correct-by-construction:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-slo
spec:
  service: checkout
  slos:
  - name: requests-availability
    objective: 98
    description: Requests success ratio
    sli:
      events:
        errorQuery: sum(rate(http_requests_total{job="checkout",status=~"5.."}[{{.window}}]))
        totalQuery: sum(rate(http_requests_total{job="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutSLOBurn
      labels:
        severity: page
      annotations:
        runbook: https://git.example.com/shop/checkout/runbooks/checkout.md
      pageAlert:
        labels:
          team: checkout

Kill vanity metrics. If a graph never changes an action, delete it.

What changed when we did this for real

A fintech client on EKS had 60+ dashboards, zero runbooks, and a 72-minute median MTTR. We:

  • Moved alerts to burn rate + saturation and added runbook_url to every page.
  • Wrote 11 one-page runbooks with copy/paste triage.
  • Added Argo Rollouts AnalysisTemplates for p95 and burn gating.
  • Ran three game days (DB read-only, Kafka lag, canary regression) in four weeks.

Results in 45 days:

  • MTTR: 72m → 14m (p50), p90 from 180m to 38m.
  • Change fail rate: 12% → 3.5%.
  • Pages per on-call shift: 9 → 3.
  • Two canaries auto-rolled back with near-zero user impact.

We didn’t add a single dashboard.

Start small this week

  • Pick one service. Define three leading indicators (burn, saturation, queue/lag).
  • Write a single-screen runbook with real commands.
  • Add runbook_url to alerts and a /triage command for top five checks.
  • Gate your canary at 10% with p95 + burn checks; enable auto-pause.
  • Schedule a 60-minute game day that hits one failure mode.
  • Measure MTTD/MTTR next Monday; review what actually moved the needle.

If your runbook starts with “Log into the console,” you’ve already lost. Make the computer do the boring parts.


Key takeaways

  • Dashboards don’t shrink MTTR—runbooks that execute on leading indicators do.
  • Alert on burn rate, saturation, and queueing, not just raw 500s and CPU% charts.
  • Embed `runbook_url` and one-click triage commands directly in alerts and ChatOps.
  • Use progressive delivery (Argo Rollouts/Flagger) to auto-pause/rollback on bad signals.
  • Game days should rehearse real failure modes with production-like traffic and timers.
  • Measure MTTD, MTTR, change fail rate, and error budget burn. Kill vanity metrics.

Implementation checklist

  • Identify 3 leading indicators per critical service (burn rate, saturation, queue depth).
  • Create a 1-page runbook per indicator with copy/paste-safe triage commands.
  • Add `runbook_url` and structured annotations to every page-worthy alert.
  • Wire Argo Rollouts/Flagger to Prometheus queries for auto-pause/rollback.
  • Schedule a 60-minute game day that hits one failure mode end-to-end.
  • Instrument incidents to capture start/stop and remediation steps for MTTR trends.

Questions we hear from teams

What’s the difference between a runbook and a playbook?
Playbooks describe the process (roles, comms, timelines). Runbooks are technical, single-issue, and executable—focused on specific alerts and the exact mitigation/triage steps. You need both, but runbooks shrink MTTR.
How do we keep runbooks from rotting?
Version them with the service, review via PR, and validate them in monthly game days. Any step that breaks in rehearsal gets fixed that day. Tie alerts to specific runbook anchors so edits are obvious when links break.
Which tools do you recommend to start?
Prometheus + Alertmanager for signal; OpenTelemetry for traces; Argo Rollouts or Flagger for progressive delivery; PagerDuty or Opsgenie for paging; Grafana for viz; Sloth to generate SLO/burn rate rules; a simple Slack bot for /triage. Keep it boring and standard.
Do we need chaos tooling to run game days?
Nice to have, not required. Start with `toxiproxy`, `stress-ng`, and traffic generators. Graduate to LitmusChaos or Gremlin when you’re ready for orchestrated scenarios and blast-radius controls.
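
For the CPU-pressure scenario, a minimal sketch, assuming `stress-ng` exists in the target image (otherwise run it from an ephemeral debug container); it stops on its own when the timeout expires:

# Two workers at ~80% load for two minutes against the stable pods.
kubectl exec -n shop deploy/checkout-stable -- \
  stress-ng --cpu 2 --cpu-load 80 --timeout 120s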
