Your Pager Is Loud Because Your Runbooks Are Quiet: Game Days That Actually Shrink MTTR
Runbooks don’t fail because they’re missing. They fail because they’re not wired into telemetry, triage, and rollouts. Here’s the pattern we use to make MTTR drop for real—without the vanity metrics theater.
> “Runbooks don’t reduce MTTR. Runbooks that are wired into alerts, validated by game days, and backed by safe automation reduce MTTR.”
Key takeaways
- If your alerts don’t include a **runbook URL**, **owner**, and **next command to run**, you don’t have runbooks—you have docs.
- Optimize for **leading indicators** (burn rate, saturation, retries, queue depth, deploy regression signals), not uptime dashboards and request counts (a burn-rate alert sketch follows this list).
- Make runbooks **executable**: one-click or one-command actions with guardrails and audit trails.
- Game days should test the whole chain: telemetry → alert → triage context → mitigation → rollout/rollback automation.
- Tie rollouts to observability using automated analysis (e.g., `Argo Rollouts` + `Prometheus` queries) so you stop shipping incidents.
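To make the first two takeaways concrete, here’s a minimal sketch of a Prometheus alerting rule for a hypothetical `checkout` service with a 99.9% availability SLO. The metric name, thresholds, team, and runbook URL are placeholders; the point is that the page carries the burn-rate signal *and* the runbook link, owner, and next command. A production setup would use the multiwindow burn-rate pattern (long plus short window) rather than this single window.

```yaml
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorBudgetBurnFast
        # Burn rate 14.4 on a 99.9% SLO ~= 2% of the 30-day error budget gone in one hour.
        expr: |
          (
            sum(rate(http_requests_total{service="checkout",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          service: checkout
          owner: team-payments
        annotations:
          summary: "checkout is burning error budget ~14x faster than sustainable"
          runbook_url: "https://runbooks.example.com/checkout/error-budget-burn"
          # Free-form annotation our responders lean on: the first command to run.
          next_command: "kubectl argo rollouts get rollout checkout -n payments"
```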
Implementation checklist
- Replace “CPU > 80%” alerts with **saturation and burn-rate** alerts tied to SLOs.
- Add `runbook_url`, `service`, `owner`, and `severity` labels to every paging alert.
- Make every runbook start with: **What changed?** (deploy/flag/config) and **What’s the customer impact?**
- Create a “first 5 minutes” triage block: `kubectl`, logs, traces, last deploy, feature flags.
- Add a rollback/mitigation automation step (Argo rollback, flag kill-switch, rate-limit, circuit breaker); a canary-gate sketch follows this checklist.
- Run at least one game day per service per quarter that proves: alert fires, runbook is correct, automation works, and MTTR improves.
- Track leading indicators for runbook quality: % alerts with runbook links, % incidents with first action < 5 minutes, rollback time, and false-page rate.
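The rollback/mitigation step above (and the `Argo Rollouts` + `Prometheus` takeaway) can look like the sketch below: a canary gated by an analysis template, so a 5xx regression aborts the rollout and traffic returns to the stable version without anyone paging through dashboards. Service names, the in-cluster Prometheus address, and the 1% error threshold are assumptions, and the Rollout’s selector and pod template are omitted for brevity.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
    - name: service
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1                      # one failed sample aborts the rollout
      successCondition: result[0] < 0.01   # keep the 5xx ratio under 1% during the canary
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090   # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{service="{{args.service}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service}}"}[2m]))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  # selector and pod template omitted for brevity
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:                        # gate: failure here auto-aborts back to stable
            templates:
              - templateName: error-rate-check
            args:
              - name: service
                value: checkout
        - setWeight: 50
        - pause: {duration: 5m}
```

Aborting a canary is exactly the kind of “safest mitigation first” automation worth building: the worst case is a delayed release, not a 2 a.m. page.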
Questions we hear from teams
- What’s the fastest way to identify leading indicators for my service?
- Start from your last 5–10 real incidents. For each one, ask: what signal showed up first (queue depth, retries, p95, error budget burn, dependency errors)? Alert on that signal with a short `for:` window, then validate it in a game day (a minimal example follows this Q&A list).
- How many runbooks should we write?
- Fewer than you think. Write runbooks for the failure modes that page you and cost you money: deploy regressions, dependency timeouts, saturation, and data-store issues. A small set of correct, executable runbooks beats a wiki full of outdated docs.
- Should we automate mitigations like rollback? Isn’t that risky?
- It’s riskier to rely on manual heroics at 2 a.m. Automate the safest mitigations first (abort canary, rollback one revision, disable a flag) with guardrails and audit logs. Then validate repeatedly in game days.
- What tools do you recommend for game days?
- On Kubernetes: `Chaos Mesh` or `LitmusChaos` for fault injection; `Prometheus`/`Grafana`/`Loki`/`Tempo` for signals; `Argo Rollouts` for automated rollout gates. The tools matter less than proving the end-to-end loop works (see the fault-injection sketch below).
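To make the “alert on the first signal” advice concrete, here’s a minimal sketch assuming a hypothetical `checkout_work_queue_depth` metric and a threshold pulled from past incidents; the short `for:` window pages while the backlog is still a leading indicator rather than customer-visible latency.

```yaml
groups:
  - name: checkout-leading-indicators
    rules:
      - alert: CheckoutQueueBacklogGrowing
        # Queue depth was the first signal in recent incidents; the threshold is illustrative.
        expr: sum(checkout_work_queue_depth) > 5000
        for: 3m
        labels:
          severity: page
          service: checkout
          owner: team-payments
        annotations:
          summary: "checkout work queue is backing up ahead of user-visible latency"
          runbook_url: "https://runbooks.example.com/checkout/queue-backlog"
```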
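And as a starting point for the game day itself, here’s a sketch of a Chaos Mesh experiment that injects latency into a hypothetical `checkout` workload’s network traffic for ten minutes, which is long enough to prove (or disprove) that the dependency-timeout alert fires, the runbook’s first command is correct, and the mitigation works. Namespace, labels, and delay values are placeholders.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-dependency-latency
  namespace: payments
spec:
  action: delay          # inject network latency
  mode: all              # affect every matching pod
  selector:
    namespaces:
      - payments
    labelSelectors:
      app: checkout
  delay:
    latency: "400ms"
    jitter: "100ms"
  duration: "10m"        # the experiment cleans itself up after 10 minutes
```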
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
