Runbooks and Game Days That Actually Shrink MTTR
If your alerts point to dashboards instead of decisions, you’re paying on-call tax. Here’s how we wire telemetry to triage and rollouts so incidents resolve themselves (or nearly).
Runbooks are only useful if they run; otherwise they’re just books.
The incident that flipped our playbook
At a fintech client, a Thursday release turned their api-gateway into a retry factory. Dashboards were green until p99 latency hit a cliff. By the time on-call got past the Grafana scavenger hunt, Kafka lag was 1.2M and the payments backlog took hours to drain. Classic: trailing indicators, runbooks that read like a wiki from 2019, and rollbacks gated by human nerves.
We swapped the “pretty charts” for leading signals and wired alerts to actions: runbook links, owners, and one-click rollbacks. Canary analysis made bad builds self-revert before customers noticed. MTTR dropped from 71 minutes to 18 in six weeks. Not magic—just plumbing.
Stop measuring the wrong things
If your top alerts are average CPU and “requests per minute,” you’re watching the rearview mirror. We’ve had better luck with indicators that predict pain:
- Error budget burn rate: tells you how quickly you’re consuming your SLO—hours before customers churn.
- Saturation: queue depth, connection pool utilization, CPU throttling ratio, thread pool queue length.
- Work backlog growth: Kafka consumer lag, pending jobs, durable queue age.
- Control plane stress: circuit breaker open rate, retry storms, DNS/mesh timeouts.
A few PromQL alert rules we actually ship:

```yaml
# 1. Dual-window burn-rate alert for a 99% availability SLO.
# The fast window (5m at 4x burn) catches spikes; the slow window (30m at 1x) catches slow burns.
- alert: APIHighErrorBudgetBurn
  expr: |
    (sum(rate(http_requests_total{job="api",status=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="api"}[5m]))) > (0.01 * 4)
    or
    (sum(rate(http_requests_total{job="api",status=~"5.."}[30m]))
      / sum(rate(http_requests_total{job="api"}[30m]))) > 0.01
  for: 10m
  labels:
    severity: page
    service: api
  annotations:
    summary: "API error budget burning hot"
    runbook_url: "https://git.company.local/runbooks/api-5xx-spike"
    dashboard: "https://grafana.local/d/api-overview?var-service=api&panelId=42"

# 2. Kafka backlog growth (predictive: slope > threshold)
- alert: KafkaLagGrowing
  expr: deriv(kafka_consumergroup_lag{consumergroup="payments"}[5m]) > 200
  for: 5m
  labels:
    severity: warn
    service: payments
  annotations:
    summary: "Payments backlog growing; check consumer health"
    runbook_url: "https://git.company.local/runbooks/payments-lag"

# 3. CPU throttling ratio by pod (Kubernetes)
- alert: PodCpuThrottlingHigh
  expr: |
    sum(rate(container_cpu_cfs_throttled_periods_total{container!="",pod!=""}[5m])) by (pod)
      / sum(rate(container_cpu_cfs_periods_total{container!="",pod!=""}[5m])) by (pod) > 0.2
  for: 10m
  labels:
    severity: warn
  annotations:
    summary: "Pod experiencing sustained CPU throttling; expect latency regression"
```

If an alert can’t tell me where to look and what to do in 60 seconds, it’s noise.
Make alerts clickable to action
I want my 2 a.m. page to have a button. That means putting the runbook, escalation, and automation right inside the alert. Alertmanager supports rich annotations; PagerDuty/Incident.io handle custom fields just fine.
```yaml
# alertmanager.yaml (excerpt)
route:
  receiver: pagerduty-high
  routes:
    - matchers:
        - severity=~"page|critical"
      receiver: pagerduty-high
receivers:
  - name: pagerduty-high
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_ROUTING_KEY}
        severity: critical
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard }}'
          service: '{{ .CommonLabels.service }}'
          automation_rollback: 'https://rundeck.local/project/ops/job/rollback?service={{ .CommonLabels.service }}'
          slack: '#oncall-api'
```

Runbooks shouldn’t be novels. They should be living, testable docs with executable snippets and verification steps. We keep them in `ops/runbooks/$service.md`, validated in CI so commands don’t rot.
````markdown
---
service: api
severity: sev1
owner: team-api
links:
  dashboard: https://grafana.local/d/api-overview?var-service=api
  logs: https://kibana.local/app/discover#/?_a=(query:(language:kuery,query:'service:api'))
automation:
  rollback: https://rundeck.local/project/ops/job/rollback-api
  feature_flag: api_canary_enabled
---
# API 5xx Spike Runbook
1. Confirm burn: open dashboard panel 42; if error rate > 1% for 10m, proceed.
2. Check canary:
   ```bash
   kubectl -n prod get rollout api -o wide
   kubectl argo rollouts get rollout api -n prod
   ```
3. Rollback (safe, idempotent):
   ```bash
   curl -s -X POST "$RUNDECK_ROLLBACK_URL" -H "X-Auth-Token: $TOKEN"
   ```
4. Verify:
   ```bash
   kubectl -n prod rollout status deploy/api --timeout=5m
   ```
5. If still burning, toggle the feature flag off (partial mitigation).
6. Post-restore: create ticket `INC-{{incident_id}}` with root causes and attach a Grafana snapshot.
````
Wire telemetry into rollouts, not just dashboards
If your deployment pipeline can’t stop itself when the error budget catches fire, you’ll keep paging humans for machine-speed mistakes. We prefer canaries with **Argo Rollouts** (or Flagger) and analysis gates that hit Prometheus.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: api-canary-analysis
spec:
  metrics:
    - name: error-rate
      initialDelay: 2m
      interval: 1m
      count: 5
      failureLimit: 1
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="api",status=~"5.."}[1m]))
            / sum(rate(http_requests_total{job="api"}[1m]))
    - name: latency-p99
      initialDelay: 2m
      interval: 1m
      count: 5
      successCondition: result[0] < 0.9
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="api"}[1m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: api-canary-analysis
        - pause: {duration: 60}
        - setWeight: 50
        - analysis:
            templates:
              - templateName: api-canary-analysis
        - pause: {duration: 120}
        - setWeight: 100
```

Tie feature flags to the same gates. If `api_canary_enabled` is on and error-rate trips, auto-toggle it off via a webhook. Don’t philosophize during an incident: let the automation revert first, then investigate at human speed.
If you’re more into service meshes, Istio’s VirtualService with match/route weights plus Flagger can do this with out-of-the-box SLO checks.
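That webhook glue doesn’t need to be clever. A minimal sketch in shell, assuming a hypothetical flag-service endpoint (`flags.local`) and an illustrative 1% threshold:

```shell
#!/usr/bin/env bash
# Sketch: disable the canary feature flag when the canary error rate trips the gate.
# The flag-service URL below is hypothetical; swap in your flag provider's real API.
threshold=0.01

# Returns success (exit 0) when 5xx_count / total_count exceeds the threshold.
should_disable() {
  awk -v e="$1" -v t="$2" -v th="$threshold" 'BEGIN { exit !(t > 0 && e / t > th) }'
}

# Example: 30 errors out of 1000 requests is a 3% error rate, above the 1% gate.
if should_disable 30 1000; then
  echo "disabling api_canary_enabled"
  # curl -s -X POST "https://flags.local/api/flags/api_canary_enabled/disable" \
  #   -H "Authorization: Bearer $TOKEN"
fi
```

Run it from the analysis failure hook so mitigation happens before a human has even read the page.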
Codify runbooks as code (and test them)
Docs-as-code beats tribal knowledge. We ship runbooks with CI checks that execute non-destructive commands against a staging cluster and lint the YAML frontmatter. That prevents the classic “dead command” problem you only discover while on call.
- Store under `ops/runbooks/` and require PR reviews from SRE + the service owner.
- Include a “pre-check” section to verify credentials and cluster context.
- Embed ready-to-copy commands for triage, mitigation, and verification.
- Link to golden dashboards with deep links: include `panelId` and variables pre-filled.
CI validation example:
```bash
#!/usr/bin/env bash
set -euo pipefail
for rb in ops/runbooks/*.md; do
  # Lint the frontmatter: every runbook must declare its service.
  yq --front-matter=extract '.service' "$rb" >/dev/null
  grep -q 'kubectl' "$rb" || { echo "No kubectl in $rb"; exit 1; }
  # Dry-run commands in staging where possible
  if grep -q 'kubectl -n staging rollout status' "$rb"; then
    echo "Validating rollout status command in $rb"
    kubectl -n staging rollout status deploy/placeholder --timeout=1s || true
  fi
done
```

We also pin environment assumptions at the top (K8s version, mesh version, DB endpoints). When the platform shifts, runbooks fail CI before prod fails you.
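The pinned assumptions can live in the same frontmatter the CI script already lints. A sketch; the field names and version ranges below are illustrative:

```yaml
# Environment assumptions; fail the runbook in CI when these drift.
environment:
  kubernetes: ">=1.27,<1.30"
  istio: "1.20.x"
  postgres_endpoint: db.prod.internal:5432
```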
Game days: build incident muscle, not theater
I don’t care how many incident retros you’ve written—if you don’t practice, you won’t execute. We run compact, brutal game days that mimic real pagers and measure the right slices of MTTR.
- Cadence: monthly per service, quarterly cross-cutting (e.g., auth outage).
- Scope: one hypothesis per drill (“What if Kafka broker 2 goes down during a canary?”).
- Tooling: `chaos-mesh` for k8s faults, `toxiproxy` for latency/packet loss, plain `kubectl` for pod failures.
Injecting realistic faults:
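For dependencies that live outside the cluster, `toxiproxy` covers the same fault class from the CLI. A dry-run sketch (the proxy name and ports are made up; set `DRY_RUN=0` on a host with toxiproxy running):

```shell
#!/usr/bin/env bash
# Print (or, with DRY_RUN=0, execute) the toxiproxy-cli calls for a DB latency drill.
run() { echo "+ $*"; [ "${DRY_RUN:-1}" = "1" ] || "$@"; }

# Route api -> postgres traffic through the proxy, then add a 400ms +/- 50ms latency toxic.
run toxiproxy-cli create -l localhost:25432 -u db.internal:5432 postgres_proxy
run toxiproxy-cli toxic add -t latency -a latency=400 -a jitter=50 postgres_proxy
```

Keeping the commands behind a dry-run wrapper means the drill script itself is reviewable and testable in CI, same as the runbooks.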
```yaml
# chaos-mesh: inject 400ms latency to the DB for api pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-db-latency
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - prod
    labelSelectors:
      app: api
  delay:
    latency: '400ms'
    jitter: '50ms'
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - prod
      labelSelectors:
        app: postgres
```

Scoring what matters:
- MTTA (time to acknowledge): target < 2m.
- Time-to-triage (first correct hypothesis): target < 7m.
- Time-to-mitigate (rollback/flag/route): target < 10m.
- Verification (SLO restored, alarms cleared): target < 5m.
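The scorecard is just subtraction over the phase timestamps your paging tool already records. A sketch with made-up epoch-second values:

```shell
#!/usr/bin/env bash
# Compute drill phase durations (in minutes) from epoch-second timestamps.
# All timestamps below are illustrative, not from a real incident.
page_at=1700000000; ack_at=1700000090
triage_at=1700000400; mitigate_at=1700000900; verified_at=1700001150

echo "MTTA:             $(( (ack_at - page_at) / 60 ))m (target < 2m)"
echo "time-to-triage:   $(( (triage_at - ack_at) / 60 ))m (target < 7m)"
echo "time-to-mitigate: $(( (mitigate_at - triage_at) / 60 ))m (target < 10m)"
echo "verification:     $(( (verified_at - mitigate_at) / 60 ))m (target < 5m)"
```

Print this at the end of every drill and paste it into the retro; trend lines across drills matter more than any single number.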
Drill flow we use:
1. Trigger the chaos. Page the on-call via the real path (Alertmanager -> PagerDuty).
2. Force use of the runbook. No hunting through Confluence.
3. Time each phase. Capture what was missing or slow.
4. Update the runbook and automation the same day. Open a PR; block the next release if the fix is critical.
We’ve done this at startups and at a FAANG-adjacent org; the difference isn’t size, it’s discipline.
Results you can actually feel
Teams that adopt this pattern consistently see:
- MTTR cut by 50–75% in the first two months.
- Noise reduced by 30–40% because non-actionable alerts get culled.
- 60%+ of bad canaries self-rollback without paging a human.
- On-call stress down (we measure it with a quarterly on-call NPS).
A client running Istio + Argo Rollouts went from “every release is a cliff dive” to 6 weeks without a customer-facing incident. The win wasn’t a new tool; it was wiring the tools together with intent.
What to implement this week
- Define one SLO per critical user journey. Add dual-window burn alerts.
- Add `runbook_url`, `dashboard`, and `automation_rollback` annotations to your top three alerts.
- Convert the noisiest incident’s wiki page into a validated runbook in git.
- Gate your primary service’s rollout with a Prometheus-backed `AnalysisTemplate`.
- Schedule a 60-minute game day and inject a fault you actually fear.
If you want a second set of eyes, GitPlumbers has shipped this at startups and Fortune 500s. We’ll help you pick the right signals, wire the automation, and run the first game day without theater.
Key takeaways
- Measure leading indicators (burn rate, saturation, queue growth) instead of aggregated CPU or request counts.
- Make alerts actionable: include owner, severity, runbook link, and one-click automation.
- Wire telemetry to rollouts: canary analysis with Prometheus gates to auto-halt or rollback.
- Codify runbooks as code with pre-validated commands, known-good configs, and verification steps.
- Run game days that test the runbook and automation, score MTTR components, and iterate weekly.
Implementation checklist
- Define SLOs and implement burn-rate alerts with Prometheus.
- Add Alertmanager annotations: owner, runbook URL, dashboard deep links, automation buttons.
- Turn runbooks into docs-as-code with executable snippets validated in CI.
- Adopt canary rollouts with AnalysisTemplates (Argo Rollouts or Flagger).
- Instrument leading indicators: queue lag, throttling, connection pool saturation, GC stalls.
- Schedule monthly game days; inject real failures with chaos-mesh or toxiproxy.
- Track MTTA, time-to-triage, time-to-mitigate, and post-restore verification times.
Questions we hear from teams
- What leading indicators should I start with if I have limited bandwidth?
- Start with error budget burn for your top SLO, Kafka or queue backlog growth, and CPU throttling ratio. Those three catch most failure modes early: correctness, throughput, and saturation.
- Do I need Argo Rollouts to do this?
- No. Flagger, Spinnaker, or even bespoke scripts can gate deploys on Prometheus/Grafana Cloud metrics. The key is automated analysis and an automatic halt/rollback when metrics regress.
- How do I avoid alert fatigue while adding more signals?
- Use multi-window burn and slope-based alerts with a minimum ‘for’ duration. Every page must include a runbook link and an automation action. Anything else goes to a ticket or is deleted.
- We’re not on Kubernetes—does this still apply?
- Yes. The same ideas work with EC2/ASGs, Nomad, or on-prem: feature flags for fast mitigation, canaries via weighted load balancer config, and Prometheus or Datadog metrics to gate rollouts.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
