The Runbooks and Game Days That Turned 2‑Hour Outages into 12‑Minute Blips

If your “runbook” is a wiki page nobody opens and your game days are improv, you’re paying an MTTR tax. Here’s how to wire telemetry to triage and rollouts so incidents shrink themselves.

> If the first click after a page isn’t a runbook with a rollback button, you’re paying an MTTR luxury tax.

The deploy that looked green… until the queue exploded

You’ve lived this: the canary looked fine, the graphs were green, then 30 minutes later Kafka lag shot to the moon, retry storms melted the DB, PagerDuty lit up, and your “runbook” was a Confluence doc last updated during the monolith era. I’ve seen that movie at retailers, fintechs, and a unicorn I won’t name. The pattern is always the same: teams stare at vanity charts and improv their way through the first 15 minutes.

Let’s fix the boring parts that actually shrink MTTR: pick leading indicators, make runbooks executable, and wire telemetry to your rollout tooling so bad deploys abort themselves.


Key takeaways

  • Track leading indicators like queue depth, retry rate, and p99 latency, not vanity dashboards.
  • Runbooks should be executable checklists with one-click links and commands, not prose.
  • Wire alerts to runbooks and rollouts: telemetry should pause/abort bad deploys automatically.
  • Game days must rehearse the first 15 minutes with real tools, not tabletop hypotheticals.
  • Measure the loop: detection time, decision time, rollback time, and restoration verification.

Implementation checklist

  • Define 3-5 leading indicators per service (latency, saturation, errors, retries, queue depth).
  • Add `runbook_url` to every alert and verify it resolves during game days (see the alert-rule sketch after this list).
  • Codify rollback paths (Argo Rollouts/LaunchDarkly kill-switch) and rehearse them.
  • Implement multi-window SLO burn-rate alerts to avoid alert floods and slow burns.
  • Automate “first actions” in runbooks: links to dashboards, `kubectl` commands, rollback buttons.
  • Schedule monthly game days with rotating on-calls and realistic failure injection.
  • Track MTTR as: first page -> mitigation -> full restore; improve each slice.
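
Here’s a minimal sketch of what that looks like if you run the Prometheus Operator: one leading-indicator alert that carries its runbook (plus a dashboard link and a first command), and one multi-window burn-rate alert. The metric and label names, the 99.9% SLO, the `slo:*` recording rules, and every URL are placeholders for your own setup.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-leading-indicators
spec:
  groups:
    - name: checkout.leading-indicators
      rules:
        # Leading indicator: work piling up, not work completed.
        # Metric/label names depend on your exporter (kafka_exporter shown here).
        - alert: CheckoutConsumerLagGrowing
          expr: sum(kafka_consumergroup_lag{consumergroup="checkout"}) > 50000
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Checkout consumer lag is climbing — likely a bad deploy or a slow dependency"
            runbook_url: https://runbooks.example.com/checkout/kafka-lag   # the first click after the page
            dashboard: https://grafana.example.com/d/checkout-queues
            first_action: "kubectl argo rollouts abort checkout -n payments"
    - name: checkout.slo-burn
      rules:
        # Multi-window, multi-burn-rate alert for a 99.9% SLO (0.001 error budget).
        # 14.4x burn over both 1h and 5m ≈ a 30-day budget gone in ~2 days.
        # Assumes slo:checkout_errors:ratio_rate1h / _rate5m recording rules exist.
        - alert: CheckoutErrorBudgetFastBurn
          expr: |
            slo:checkout_errors:ratio_rate1h > (14.4 * 0.001)
            and
            slo:checkout_errors:ratio_rate5m > (14.4 * 0.001)
          labels:
            severity: page
          annotations:
            summary: "Burning the 30-day error budget in roughly two days at this rate"
            runbook_url: https://runbooks.example.com/checkout/slo-burn
```

The short window keeps the alert quiet during blips; the long window keeps it from firing on noise; the `runbook_url` and `first_action` annotations are what make the page actionable in the first minute.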

Questions we hear from teams

How many leading indicators per service is ideal?
Three to five. One for latency tail, one for saturation (queue depth or pool usage), one for errors/retries, optionally one cost or dependency signal. More than that and responders will ignore them.
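
For concreteness, here’s one way those three to five signals might look as Prometheus recording rules for a single checkout service. The metric names are placeholders, and the retry counter assumes your HTTP client library exposes one.

```yaml
groups:
  - name: checkout.indicators
    rules:
      # Latency tail: p99 over 5m from histogram buckets.
      - record: service:checkout_latency_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout"}[5m])))
      # Saturation: work waiting, not work done.
      - record: service:checkout_queue_depth:sum
        expr: sum(kafka_consumergroup_lag{consumergroup="checkout"})
      # Errors + retries: a rising retry ratio usually leads the 5xx graph.
      - record: service:checkout_retry:ratio_5m
        expr: |
          sum(rate(http_client_request_retries_total{job="checkout"}[5m]))
          /
          sum(rate(http_client_requests_total{job="checkout"}[5m]))
```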
Do we need Argo Rollouts or can we do this with LaunchDarkly/Spinnaker?
Use what you have. LaunchDarkly can be your kill-switch; Spinnaker, Argo Rollouts, or Flagger can gate canaries with Prometheus. The key is that telemetry drives automated pause/abort and the runbook documents the commands.
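
If you’re on Argo Rollouts, the wiring looks roughly like this: an `AnalysisTemplate` backed by Prometheus plus a canary step that runs it, so a failing error-rate query aborts the rollout and returns traffic to stable. The addresses, queries, and thresholds below are illustrative, not prescriptive.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-health
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5                  # five measurements, one per minute
      failureLimit: 1           # tolerate one bad sample; a second fails the analysis and aborts
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{job="checkout-canary",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="checkout-canary"}[2m]))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  # ...selector and pod template omitted for brevity...
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: checkout-canary-health
        - setWeight: 50
        - pause: {duration: 10m}
```

The runbook then only has to document two human actions: how to confirm the abort happened and what to do if it didn’t (`kubectl argo rollouts abort checkout`).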
What if most of our code was AI-generated and observability is inconsistent?
Start with a vibe code cleanup pass: standardize OpenTelemetry middleware, adopt a common metrics library per language, and ship a service bootstrap that emits the same RED + USE metrics and trace attributes. You can retrofit this in a sprint per service with a small platform team.
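
The middleware itself is per-language, but if you already run an OpenTelemetry Collector you can enforce one pipeline shape and one set of resource attributes centrally while the per-service cleanup lands. A sketch, with exporter endpoints as placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
  # Stamp the resource attributes every service must carry so dashboards
  # and runbook queries look the same across teams.
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
exporters:
  prometheusremotewrite:
    endpoint: https://metrics.example.com/api/v1/push   # placeholder
  otlp/traces:
    endpoint: traces.example.com:4317                   # placeholder
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/traces]
```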

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a reliability assessment
Download the runbook template
