The Night Our Observability Went Dark in Peak Traffic—and How We Turned It Into a Real MTTR Playbook
When the telemetry goes quiet, you don’t just lose dashboards—you lose trust. This is how we built runbooks that actually shrink MTTR.
Observability lied to us this week; runbooks and game days gave us back control and real MTTR shrinkage.
Reliability today hinges on signals that predict failures, not dashboards that celebrate uptime. We built a framework where each service has a minimal set of leading indicators tailored to its fault modes, so triage is guided by data, not rumor. This is how you stop firefighting with more dashboards and start firefighting with signal.
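To make that concrete, here's the shape of a leading-indicator alert in Prometheus. The `checkout_*` metric names and thresholds are illustrative; yours come from your service's actual fault modes.

```yaml
# prometheus-rules/checkout-leading-indicators.yaml
groups:
  - name: checkout-leading-indicators
    rules:
      - alert: CheckoutQueueDepthClimbing
        # Sustained queue growth predicts saturation before error rates move.
        expr: avg_over_time(checkout_queue_depth[10m]) > 500
        for: 10m
        labels:
          severity: warning
          runbook: runbooks/checkout/queue-saturation
        annotations:
          summary: "Checkout queue depth climbing; saturation likely if unaddressed"
      - alert: CheckoutP95LatencyDegrading
        # Degrading p95 is a leading signal; customers feel it before 5xx spikes.
        expr: histogram_quantile(0.95, sum(rate(checkout_request_duration_seconds_bucket[5m])) by (le)) > 0.8
        for: 5m
        labels:
          severity: warning
          runbook: runbooks/checkout/latency-degradation
```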
The next move is codifying runbooks as code. Versioned YAML playbooks sit alongside your Kubernetes manifests, with triggers, owners, escalation paths, and auto-verification checks. If a dashboard misleads or a dependency degrades, the runbook snaps into action, regardless of who is on-call.
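There's no standard schema for this yet, so here's a minimal sketch of the idea (field names are hypothetical; the point is that triggers, owners, escalation, and verification live in Git next to the manifests they protect):

```yaml
# runbooks/checkout/queue-saturation.yaml
# Hypothetical runbook-as-code schema; adapt field names to your tooling.
apiVersion: runbooks.example.com/v1
kind: Runbook
metadata:
  name: checkout-queue-saturation
  owner: team-payments
trigger:
  alert: CheckoutQueueDepthClimbing
escalation:
  - after: 10m
    notify: "#payments-oncall"
  - after: 30m
    notify: payments-engineering-manager
steps:
  - name: scale-consumers
    action: kubectl scale deployment/checkout-worker --replicas=10
    verify:
      expr: avg_over_time(checkout_queue_depth[5m]) < 200
      within: 10m
  - name: shed-noncritical-load
    action: kubectl apply -f manifests/checkout-degraded-mode.yaml
    requires: human-approval  # explicit human-in-the-loop gate
```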
Telemetry is not a collection of independent panels; it’s a decision engine. We connect OpenTelemetry traces, Prometheus metrics, and logs to a triage algorithm that can trigger canaries, scale-downs, or a controlled rollback when a signal crosses the threshold. The result: faster containment and fewer manual hops.
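One way to wire signals to action, assuming a hypothetical in-house triage service reachable at `triage-bot.platform.svc`, is an Alertmanager route that fires a webhook for any alert carrying a `runbook` label while still paging a human in parallel:

```yaml
# alertmanager.yaml (excerpt)
route:
  receiver: pagerduty-default
  routes:
    # Alerts tagged with a runbook go to automation first...
    - matchers:
        - runbook=~".+"
      receiver: triage-automation
      continue: true  # ...and still page a human in parallel
receivers:
  - name: triage-automation
    webhook_configs:
      # The triage service resolves the runbook label and executes its steps.
      - url: http://triage-bot.platform.svc/api/v1/execute
        send_resolved: true
  - name: pagerduty-default
    pagerduty_configs:
      - routing_key: REPLACE_WITH_SECRET
```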
Game days are not a stunt; they're a disciplined test of the entire incident lifecycle. We run them with guardrails, documented objectives, and a postgame blueprint that closes the loop from root cause to preventive change. The metric that matters is MTTR, not the number of pages you can survive during a 2 a.m. incident.
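We don't prescribe a fault-injection tool here, but declarative ones give you guardrails for free. A Chaos Mesh experiment, for example, is time-boxed by `duration` and aborts the moment you delete the resource:

```yaml
# gamedays/checkout-dependency-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: checkout-dependency-latency
  namespace: checkout
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - checkout
    labelSelectors:
      app: payments-gateway
  delay:
    latency: "300ms"
    jitter: "50ms"
  duration: "10m"  # hard stop: the fault self-removes after ten minutes
```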
We learned that the business impact of reliability is not a number in a dashboard; it's your customers' ability to complete a checkout, a loan application, or a refund request without friction. Our approach aligns engineering, platform, and product teams around a single goal: predictable delivery under pressure.
Key takeaways
- Leading indicators beat vanity metrics every time; measure signal quality, not surface area.
- Runbooks must be code, versioned, and tested; treat incidents like deployments.
- Game days should be rehearsals with guardrails, not scary live drills; automate triage steps.
- Telemetry must drive automation for triage, rollback, and progressive delivery to shrink MTTR.
Implementation checklist
- Inventory critical services, map SLOs and error budgets to leading indicators
- Create versioned runbooks in a Git repo with clear owners and verification checks
- Instrument telemetry with OpenTelemetry, Prometheus, and structured logs to support automated triage
- Design monthly game days with tracked MTTR targets and postmortems
- Implement auto-rollback and canary deployment hooks via Argo Rollouts (see the sketch after this checklist)
- Publish blameless postmortems and feed learnings back into runbooks
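Here's what that last Argo Rollouts hook can look like: a canary gated by a Prometheus analysis, where a single failed measurement aborts the rollout and reverts to stable. Metric names, the image, and thresholds are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 5
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:stable  # illustrative image
  strategy:
    canary:
      steps:
        - setWeight: 10           # shift 10% of traffic to the canary
        - pause: { duration: 5m }
        - analysis:               # gate promotion on live metrics
            templates:
              - templateName: checkout-error-rate
        - setWeight: 50
        - pause: { duration: 10m }
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1  # one failed measurement aborts and auto-rolls back
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(checkout_requests_total{code=~"5.."}[5m]))
            / sum(rate(checkout_requests_total[5m]))
      successCondition: result[0] < 0.01  # keep the 5xx rate under 1%
```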
Questions we hear from teams
- What exactly is a leading indicator in this context?
- Leading indicators are metrics that correlate with impending incidents and guide action before customers are affected, such as rising queue depth, accelerating error budget burn, or degrading p95 latency.
- How often should teams run game days?
- Start monthly for the first quarter, then adjust to quarterly or semi-annual as you scale, ensuring safety gates, synthetic traffic, and postmortems are baked in.
- Can automation replace human triage entirely?
- No. Automation handles repetitive, high-signal steps, but humans stay in control for exception handling, policy checks, and complex judgment calls; design runbooks with explicit human-in-the-loop sections.