The Night Our Observability Went Dark in Peak Traffic—and How We Turned It Into a Real MTTR Playbook
When the telemetry goes quiet, you don’t just lose dashboards—you lose trust. This is how we built runbooks that actually shrink MTTR.
Observability lied to us this week; runbooks and game days gave us back control and real MTTR shrinkage.
Reliability today hinges on signals that predict failures, not dashboards that celebrate uptime. We built a framework where each service has a minimal set of leading indicators tailored to its fault modes, so triage is guided by data, not rumor. This is how you stop firefighting with more dashboards and start firefighting with signal.
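To make that concrete, here's the shape of a leading-indicator alert in Prometheus. The `checkout_*` metric names and thresholds are illustrative; yours come from your service's actual fault modes.

```yaml
# prometheus-rules/checkout-leading-indicators.yaml
groups:
  - name: checkout-leading-indicators
    rules:
      - alert: CheckoutQueueDepthClimbing
        # Sustained queue growth predicts saturation before error rates move.
        expr: avg_over_time(checkout_queue_depth[10m]) > 500
        for: 10m
        labels:
          severity: warning
          runbook: runbooks/checkout/queue-saturation
        annotations:
          summary: "Checkout queue depth climbing; saturation likely if unaddressed"
      - alert: CheckoutP95LatencyDegrading
        # Degrading p95 is a leading signal; customers feel it before 5xx spikes.
        expr: histogram_quantile(0.95, sum(rate(checkout_request_duration_seconds_bucket[5m])) by (le)) > 0.8
        for: 5m
        labels:
          severity: warning
          runbook: runbooks/checkout/latency-degradation
```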
The next move is codifying runbooks as code. Versioned YAML playbooks sit alongside your Kubernetes manifests, with triggers, owners, escalation paths, and auto-verification checks. If a dashboard misleads or a dependency degrades, the runbook snaps into action, regardless of who is on-call.
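There's no standard schema for this yet, so here's a minimal sketch of the idea (field names are hypothetical; the point is that triggers, owners, escalation, and verification live in Git next to the manifests they protect):

```yaml
# runbooks/checkout/queue-saturation.yaml
# Hypothetical runbook-as-code schema; adapt field names to your tooling.
apiVersion: runbooks.example.com/v1
kind: Runbook
metadata:
  name: checkout-queue-saturation
  owner: team-payments
trigger:
  alert: CheckoutQueueDepthClimbing
escalation:
  - after: 10m
    notify: "#payments-oncall"
  - after: 30m
    notify: payments-engineering-manager
steps:
  - name: scale-consumers
    action: kubectl scale deployment/checkout-worker --replicas=10
    verify:
      expr: avg_over_time(checkout_queue_depth[5m]) < 200
      within: 10m
  - name: shed-noncritical-load
    action: kubectl apply -f manifests/checkout-degraded-mode.yaml
    requires: human-approval  # explicit human-in-the-loop gate
```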
Telemetry is not a collection of independent panels; it’s a decision engine. We connect OpenTelemetry traces, Prometheus metrics, and logs to a triage algorithm that can trigger canaries, scale-downs, or a controlled rollback when a signal crosses the threshold. The result: faster containment and fewer manual hops.
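One way to wire signals to action, assuming a hypothetical in-house triage service reachable at `triage-bot.platform.svc`, is an Alertmanager route that fires a webhook for any alert carrying a `runbook` label while still paging a human in parallel:

```yaml
# alertmanager.yaml (excerpt)
route:
  receiver: pagerduty-default
  routes:
    # Alerts tagged with a runbook go to automation first...
    - matchers:
        - runbook=~".+"
      receiver: triage-automation
      continue: true  # ...and still page a human in parallel
receivers:
  - name: triage-automation
    webhook_configs:
      # The triage service resolves the runbook label and executes its steps.
      - url: http://triage-bot.platform.svc/api/v1/execute
        send_resolved: true
  - name: pagerduty-default
    pagerduty_configs:
      - routing_key: REPLACE_WITH_SECRET
```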
Game days are not a stunt; they're a disciplined test of the entire incident lifecycle. We run them with guardrails, documented objectives, and a postgame blueprint that closes the loop from root cause to preventive change. The metric that matters is MTTR, not the number of pages you can survive during a 2 a.m. incident.
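We don't prescribe a fault-injection tool here, but declarative ones give you guardrails for free. A Chaos Mesh experiment, for example, is time-boxed by `duration` and aborts the moment you delete the resource:

```yaml
# gamedays/checkout-dependency-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-dependency-latency
  namespace: checkout
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - checkout
    labelSelectors:
      app: payments-gateway
  delay:
    latency: "300ms"
    jitter: "50ms"
  duration: "10m"  # hard stop: the fault self-removes after ten minutes
```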
We learned that the business impact of reliability is not a number in a dashboard; it's your customers' ability to complete a checkout, a loan application, or a refund request without friction. Our approach aligns engineering, platform, and product teams around a single goal: predictable delivery under pressure.
Key takeaways
- Leading indicators beat vanity metrics every time; measure signal quality, not surface area.
- Runbooks must be code, versioned, and tested; treat incidents like deployments.
- Game days should be rehearsals with guardrails, not scary live drills; automate triage steps.
- Telemetry must drive automation for triage, rollback, and progressive delivery to shrink MTTR.
Implementation checklist
- Inventory critical services, map SLOs and error budgets to leading indicators
- Create versioned runbooks in a Git repo with clear owners and verification checks
- Instrument telemetry with OpenTelemetry, Prometheus, and structured logs to support automated triage
- Design monthly game days with tracked MTTR targets and postmortems
- Implement auto-rollback and canary deployment hooks via Argo Rollouts (see the sketch after this checklist)
- Publish blameless postmortems and feed learnings back into runbooks
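Here's what that last Argo Rollouts hook can look like: a canary gated by a Prometheus analysis, where a single failed measurement aborts the rollout and reverts to stable. Metric names, the image, and thresholds are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 5
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:stable  # illustrative image
  strategy:
    canary:
      steps:
        - setWeight: 10           # shift 10% of traffic to the canary
        - pause: { duration: 5m }
        - analysis:               # gate promotion on live metrics
            templates:
              - templateName: checkout-error-rate
        - setWeight: 50
        - pause: { duration: 10m }
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1  # one failed measurement aborts and auto-rolls back
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(checkout_requests_total{code=~"5.."}[5m]))
            / sum(rate(checkout_requests_total[5m]))
      successCondition: result[0] < 0.01  # keep the 5xx rate under 1%
```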
Questions we hear from teams
- What exactly is a leading indicator in this context?
- Leading indicators are metrics that correlate with impending incidents and guide action before customers are affected, such as rising queue depth, accelerating error budget burn, or degrading p95 latency.
- How often should teams run game days?
- Start monthly for the first quarter, then adjust to quarterly or semi-annual as you scale, ensuring safety gates, synthetic traffic, and postmortems are baked in.
- Can automation replace human triage entirely?
- No. Automation handles repetitive, high-signal steps, but humans stay in control for exception handling, policy checks, and complex judgment calls; design runbooks with explicit human-in-the-loop sections.