The SLOs That Actually Changed On-Call (and Cut Incident Volume by 30%)

Stop tracking “99.9% uptime” and start measuring the signals that predict the 2am page—then wire them into triage and rollout gates.

If an alert doesn’t tell on-call what to do next, it’s not observability—it’s noise with a graph.

Key takeaways

  • If your SLO doesn’t change what the on-call engineer does in the first 10 minutes, it’s a vanity metric.
  • Leading indicators beat lagging indicators: measure saturation, queue growth, retry/circuit-breaker behavior, and error-budget burn—before customers scream (burn rate is sketched just after these takeaways).
  • Alerts should point to a decision, not a dashboard safari: bake in ownership, suspected failure mode, and the next command to run.
  • Wire SLO burn into deploy automation (canary analysis + auto-rollback) to prevent incidents instead of post-mortems.
  • Keep the first SLO set small (2–4 per service) and align it with real user journeys and dependency boundaries.
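
To make “error-budget burn” concrete, here is a minimal recording-rule sketch, assuming Prometheus and illustrative metric and label names (`http_requests_total` with `service` and `status` labels). Burn rate is the observed error ratio divided by the budget the SLO allows, so for a 99.9%, 30-day SLO a burn rate of 14.4x sustained for one hour spends roughly 2% of the month’s budget.

```yaml
# Recording rules for an availability SLI and its burn rate.
# Metric and label names are illustrative -- substitute your own instrumentation.
groups:
  - name: checkout-slo
    rules:
      # Error ratio over a short window (fast signal) and a longer window
      # (confirms the burn is sustained, not a blip).
      - record: slo:request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="checkout"}[5m]))
      - record: slo:request_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="checkout"}[1h]))
      # Burn rate = observed error ratio / error budget (0.001 for a 99.9% SLO).
      - record: slo:error_budget_burn:rate1h
        expr: slo:request_error_ratio:rate1h / 0.001
```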

Implementation checklist

  • Pick 1–2 user journeys per service and define SLIs that map to “did the user succeed?”
  • Add 2–3 leading indicators per journey: saturation, queue depth, dependency error rate, retry rate, circuit-breaker opens.
  • Define SLOs with explicit windows and burn-rate thresholds (fast + slow) tied to paging severity.
  • Add Alertmanager annotations: runbook, owner, dashboard, trace search, and a single recommended next step (see the alert rule sketched after this checklist).
  • Create recording rules for burn-rate and leading indicators; avoid raw-metric alerts.
  • Gate production rollouts with canary analysis on the same SLIs/SLOs; auto-rollback on sustained burn (see the canary analysis sketch after this checklist).
  • Review weekly: top pages, top near-misses, and which SLOs produced actions vs noise.
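
Building on the recording rules above, a fast-burn paging rule can look like the sketch below: both windows must agree before anyone is paged, and the annotations carry ownership, the runbook, and the next command to run so the first 10 minutes are not a dashboard safari. The URLs and the `kubectl` hint are placeholders.

```yaml
groups:
  - name: checkout-slo-alerts
    rules:
      # Fast burn: ~2% of a 30-day budget gone in one hour. Requiring the
      # short window to agree keeps a brief blip from paging anyone.
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          slo:error_budget_burn:rate1h > 14.4
          and
          (slo:request_error_ratio:rate5m / 0.001) > 14.4
        for: 2m
        labels:
          severity: page
          team: checkout          # ownership label used for Alertmanager routing
        annotations:
          summary: "Checkout is burning error budget at {{ $value | humanize }}x the sustainable rate"
          runbook: "https://runbooks.example.com/checkout/error-budget-burn"
          dashboard: "https://grafana.example.com/d/checkout-slo"
          traces: "https://tracing.example.com/search?service=checkout&error=true"
          next_step: "Check the most recent deploy first: kubectl rollout history deploy/checkout"
```

A companion slow-burn rule (for example 6x over 6 hours) catches the quieter leaks; many teams route that one as a ticket rather than a page.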

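For the rollout gate, one common shape, assuming Argo Rollouts with its Prometheus provider (other progressive-delivery tools have equivalents), is an analysis template that evaluates the same error-ratio SLI against only the canary pods and aborts the rollout when it fails. Everything here follows the assumed names from the earlier sketches, and the hash label depends on your relabeling.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-error-ratio
spec:
  args:
    - name: canary-hash            # supplied by the Rollout for the canary ReplicaSet
  metrics:
    - name: error-ratio
      interval: 1m
      count: 10                    # evaluate for ~10 minutes before full promotion
      failureLimit: 2              # two failing measurements abort and roll back
      successCondition: result[0] < 0.001   # same 99.9% budget the paging SLO uses
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="checkout",rollouts_pod_template_hash="{{args.canary-hash}}",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="checkout",rollouts_pod_template_hash="{{args.canary-hash}}"}[2m]))
```

Because the gate reuses the SLI that drives paging, a rollout that would have burned budget is rolled back before the fast-burn alert ever fires.
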
Questions we hear from teams

How many SLOs should we start with per service?
For a Tier-1 service: 2 outcome SLIs (key user journeys) and 2–3 leading indicators tied to known failure modes. More than that and you’ll spend your time arguing about math instead of reducing pages.
Should we page on leading indicators like CPU or queue depth?
Usually not directly. Use leading indicators to speed triage and to gate rollouts. Page on fast error-budget burn (customer impact) and let leading indicators tell you where to look and what action to take.
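
As a sketch of that split, again with assumed metric names, a leading indicator can live as a low-severity rule that shows up next to the burn-rate page during triage instead of waking anyone on its own:

```yaml
groups:
  - name: checkout-leading-indicators
    rules:
      # Queue depth is high and still climbing: no page, but the signal is
      # visible during triage and can open a ticket via Alertmanager routing.
      - alert: CheckoutQueueBacklogGrowing
        expr: checkout_queue_depth > 1000 and deriv(checkout_queue_depth[15m]) > 0
        for: 15m
        labels:
          severity: ticket
          team: checkout
        annotations:
          summary: "Checkout queue backlog rising for 15m (depth {{ $value | humanize }})"
          next_step: "Check consumer lag and downstream dependency latency before scaling workers"
```
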
What if our instrumentation is messy or AI-generated “vibe code” made metrics inconsistent?
Normalize first: consistent metric names, stable labels (`service`, `route`, `status`, `deployment.version`), and basic golden signals per critical path. We often do a short “vibe code cleanup” pass to remove high-cardinality labels, fix broken counters, and make metrics usable for SLO math.
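
A typical first pass, assuming Prometheus scrape configs and illustrative names, is to normalize routes and drop the per-request labels that explode cardinality before they ever reach storage:

```yaml
scrape_configs:
  - job_name: checkout
    static_configs:
      - targets: ["checkout:9090"]
    metric_relabel_configs:
      # Drop identifiers that make every series unique.
      - action: labeldrop
        regex: "request_id|session_id|user_id"
      # Collapse raw paths into the route template the SLO is defined on,
      # then drop the raw path label entirely.
      - action: replace
        source_labels: [path]
        regex: '/checkout/\d+.*'
        target_label: route
        replacement: '/checkout/:id'
      - action: labeldrop
        regex: path
```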

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about SLOs that reduce pages
See our Reliability & Observability playbooks
