The Runbooks and Game Days That Turned 2‑Hour Outages into 12‑Minute Blips
If your “runbook” is a wiki page nobody opens and your game days are improv, you’re paying an MTTR tax. Here’s how to wire telemetry to triage and rollouts so incidents shrink themselves.
> If the first click after a page isn’t a runbook with a rollback button, you’re paying an MTTR luxury tax.
The deploy that looked green… until the queue exploded
You’ve lived this: canary looked fine, graphs were green, then 30 minutes later Kafka lag shot to the moon, retry storms melted the DB, PagerDuty lit up, and your “runbook” was a Confluence doc last updated during the monolith era. I’ve seen that movie at retailers, fintechs, and a unicorn I won’t name. The pattern is always the same: teams stare at vanity charts and improvise their way through the first 15 minutes.
Let’s fix the boring parts that actually shrink MTTR: pick leading indicators, make runbooks executable, and wire telemetry to your rollout tooling so bad deploys abort themselves.
Key takeaways
- Track leading indicators like queue depth, retry rate, and p99 latency, not vanity dashboards.
- Runbooks should be executable checklists with one-click links and commands, not prose.
- Wire alerts to runbooks and rollouts: telemetry should pause/abort bad deploys automatically (see the alert-rule sketch after this list).
- Game days must rehearse the first 15 minutes with real tools, not tabletop hypotheticals.
- Measure the loop: detection time, decision time, rollback time, and restoration verification.
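Here’s a minimal sketch of that third takeaway in Prometheus alerting-rule form, assuming two hypothetical SLO recording rules (`slo:error_ratio:rate5m`, `slo:error_ratio:rate1h`) for a 99.9% availability target; the service name, URLs, and thresholds are placeholders, not your actual setup.
```yaml
# Sketch of a multi-window burn-rate alert. Recording-rule names and URLs are
# assumptions -- substitute your own SLO rules and runbook locations.
groups:
  - name: checkout-slo-burn
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Fires only when BOTH the 5m and 1h windows burn at 14.4x budget,
        # which filters short blips while still paging within minutes of a real burn.
        expr: |
          slo:error_ratio:rate5m{service="checkout"} > (14.4 * 0.001)
          and
          slo:error_ratio:rate1h{service="checkout"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "checkout is burning error budget 14.4x too fast"
          runbook_url: "https://runbooks.internal/checkout/error-budget-burn"  # must resolve -- verify during game days
          dashboard: "https://grafana.internal/d/checkout-red"
```
The 14.4x factor is the standard fast-burn threshold for a 99.9% SLO; pair it with a slower 6h/3d variant that opens a ticket instead of paging, and you get both fast detection and protection against slow burns.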
Implementation checklist
- Define 3-5 leading indicators per service (latency, saturation, errors, retries, queue depth).
- Add `runbook_url` to every alert and verify it resolves during game days.
- Codify rollback paths (Argo Rollouts/LaunchDarkly kill-switch) and rehearse them; a canary-abort sketch follows this checklist.
- Implement multi-window SLO burn-rate alerts to avoid alert floods and slow burns.
- Automate “first actions” in runbooks: links to dashboards, `kubectl` commands, rollback buttons.
- Schedule monthly game days with rotating on-calls and realistic failure injection.
- Track MTTR as: first page -> mitigation -> full restore; improve each slice.
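One way the automated pause/abort piece can look with Argo Rollouts and its Prometheus metric provider is sketched below. The service name, query labels, and thresholds are placeholders, the `rollouts_pod_template_hash` label assumes your scrape config maps pod labels into metrics, and the Rollout is trimmed to its strategy section.
```yaml
# Sketch: canary gated by a Prometheus error-rate query (names/thresholds illustrative).
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-health
spec:
  args:
    - name: canary-hash
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1              # one failed measurement aborts the rollout
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="checkout",rollouts_pod_template_hash="{{args.canary-hash}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="checkout",rollouts_pod_template_hash="{{args.canary-hash}}"}[2m]))
      successCondition: result[0] < 0.01   # <1% 5xx or the canary rolls back
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  # selector and pod template omitted -- only the strategy portion is shown here
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }   # background analysis runs here; a failure aborts
        - setWeight: 50
        - pause: { duration: 10m }
      analysis:
        templates:
          - templateName: checkout-canary-health
        args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
```
Flagger or Spinnaker canary analysis can express the same pattern; the design point is that the abort decision lives in config next to the deploy, not in a responder’s head at 3 a.m.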
Questions we hear from teams
- How many leading indicators per service is ideal?
- Three to five. One for latency tail, one for saturation (queue depth or pool usage), one for errors/retries, optionally one cost or dependency signal. More than that and responders will ignore them. (See the recording-rule sketch after these questions.)
- Do we need Argo Rollouts or can we do this with LaunchDarkly/Spinnaker?
- Use what you have. LaunchDarkly can be your kill-switch; Spinnaker, Argo Rollouts, or Flagger can gate canaries with Prometheus. The key is that telemetry drives automated pause/abort and the runbook documents the commands.
- What if most of our code was AI-generated and observability is inconsistent?
- Start with a vibe code cleanup pass: standardize OpenTelemetry middleware, adopt a common metrics library per language, and ship a service bootstrap that emits the same RED + USE metrics and trace attributes. You can retrofit this in a sprint per service with a small platform team.
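To make the “three to five indicators” answer concrete, here’s a sketch of Prometheus recording rules for a single service. The source metric names follow common client-library and kafka-exporter conventions (`http_request_duration_seconds_bucket`, `kafka_consumergroup_lag`), and `checkout_client_retries_total` is a made-up counter name; adjust all of them to your stack.
```yaml
# Illustrative recording rules for one service's leading indicators.
groups:
  - name: checkout-leading-indicators
    rules:
      # Latency tail: p99 over 5 minutes
      - record: service:request_latency_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m])))
      # Saturation: consumer lag on the main work queue
      - record: service:queue_depth:current
        expr: sum(kafka_consumergroup_lag{consumergroup="checkout-workers"})
      # Errors/retries: retries as a share of total requests
      - record: service:retry_ratio:rate5m
        expr: |
          sum(rate(checkout_client_retries_total[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
```
Recording rules like these also give the burn-rate alerts and canary analyses above a stable, cheap-to-evaluate vocabulary, which is most of what “standardize observability” buys you.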
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
