The SLOs That Actually Changed On-Call (and Cut Incident Volume by 30%)
Stop tracking “99.9% uptime” and start measuring the signals that predict the 2am page—then wire them into triage and rollout gates.
If an alert doesn’t tell on-call what to do next, it’s not observability—it’s noise with a graph.
Key takeaways
- If your SLO doesn’t change what the on-call engineer does in the first 10 minutes, it’s a vanity metric.
- Leading indicators beat lagging indicators: measure saturation, queue growth, retry/circuit-breaker behavior, and error-budget burn—before customers scream (burn-rate math sketched below).
- Alerts should point to a decision, not a dashboard safari: bake in ownership, suspected failure mode, and the next command to run.
- Wire SLO burn into deploy automation (canary analysis + auto-rollback) so you prevent incidents instead of just writing post-mortems.
- Keep the first SLO set small (2–4 per service) and align it with real user journeys and dependency boundaries.
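The “error-budget burn” in those takeaways is precomputed math, not a dashboard. Here is a minimal sketch of the recording rules, assuming Prometheus and a request counter named `http_requests_total` with `service` and `status` labels; the `checkout` service and the retry counter are illustrative stand-ins for your own instrumentation:

```yaml
groups:
  - name: checkout-slo-recording
    rules:
      # SLI: fraction of checkout requests that failed, over a short and a long window.
      - record: service:request_error_ratio:rate5m
        expr: |
          sum by (service) (rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total{service="checkout"}[5m]))
      - record: service:request_error_ratio:rate1h
        expr: |
          sum by (service) (rate(http_requests_total{service="checkout",status=~"5.."}[1h]))
          /
          sum by (service) (rate(http_requests_total{service="checkout"}[1h]))
      # Leading indicator: retries as a fraction of requests (assumes a retry counter exists).
      - record: service:retry_ratio:rate5m
        expr: |
          sum by (service) (rate(http_client_retries_total{service="checkout"}[5m]))
          /
          sum by (service) (rate(http_requests_total{service="checkout"}[5m]))
```

With a 99.9% SLO the error budget is 0.1%, so a sustained burn rate of 14.4x exhausts a 30-day budget in roughly two days; that pair of windows and that multiplier is the classic fast-burn paging threshold.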
Implementation checklist
- Pick 1–2 user journeys per service and define SLIs that map to “did the user succeed?”
- Add 2–3 leading indicators per journey: saturation, queue depth, dependency error rate, retry rate, circuit-breaker opens.
- Define SLOs with explicit windows and burn-rate thresholds (fast + slow) tied to paging severity.
- Add Alertmanager annotations: runbook, owner, dashboard, trace search, and a single recommended next step (see the alerting-rule sketch after this checklist).
- Create recording rules for burn rate and leading indicators; avoid raw-metric alerts.
- Gate production rollouts with canary analysis on the same SLIs/SLOs; auto-rollback on sustained burn.
- Review weekly: top pages, top near-misses, and which SLOs produced actions vs noise.
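Here is what that paging rule can look like: a hedged sketch assuming the recorded series above, a 99.9% SLO over 30 days, and the standard fast-burn thresholds (14.4x on a 5-minute and a 1-hour window). The runbook, dashboard, and trace URLs are placeholders for your own:

```yaml
groups:
  - name: checkout-slo-alerts
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Page only when both the short and the long window burn at >14.4x the budget,
        # i.e. the 30-day budget would be gone in about two days.
        expr: |
          service:request_error_ratio:rate5m{service="checkout"} > (14.4 * 0.001)
          and
          service:request_error_ratio:rate1h{service="checkout"} > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          team: payments  # ownership travels with the page
        annotations:
          summary: "checkout is burning its 30-day error budget at >14x"
          runbook: "https://runbooks.example.com/checkout/error-budget-burn"       # placeholder
          dashboard: "https://grafana.example.com/d/checkout-slo"                  # placeholder
          traces: "https://tracing.example.com/search?service=checkout&error=true" # placeholder
          next_step: "Check the latest rollout first: kubectl rollout history deployment/checkout"
```

A slower pair on longer windows (for example 6x over 30-minute and 6-hour windows) at a lower severity turns gradual burn into a ticket instead of a 2am page; that is the “fast + slow” split from the checklist.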
Questions we hear from teams
- How many SLOs should we start with per service?
- For a Tier-1 service: 2 outcome SLIs (key user journeys) and 2–3 leading indicators tied to known failure modes. More than that and you’ll spend your time arguing about math instead of reducing pages.
- Should we page on leading indicators like CPU or queue depth?
- Usually not directly. Use leading indicators to speed triage and to gate rollouts (see the canary-gate sketch after these questions). Page on fast error-budget burn (customer impact) and let leading indicators tell you where to look and what action to take.
- What if our instrumentation is messy or AI-generated “vibe code” made metrics inconsistent?
- Normalize first: consistent metric names, stable labels (`service`, `route`, `status`, `deployment.version`), and basic golden signals per critical path. We often do a short “vibe code cleanup” pass to remove high-cardinality labels, fix broken counters, and make metrics usable for SLO math.
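For the rollout gate, here is a minimal sketch assuming Argo Rollouts as the canary controller and the same Prometheus SLI; Flagger or a plain pipeline check against the same query works the same way. The Prometheus address, the `deployment_version` label, and the threshold are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
    - name: canary-error-ratio
      interval: 1m
      count: 10         # watch the canary for roughly ten minutes
      failureLimit: 1   # tolerate at most one failed check before aborting and rolling back
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090  # illustrative
          query: |
            sum(rate(http_requests_total{service="checkout",deployment_version="canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout",deployment_version="canary"}[5m]))
      successCondition: result[0] < 0.001  # same 99.9% threshold as the paging rule
```

Because the gate evaluates the same SLI as the paging rule, a rollout that would have burned budget gets rolled back before anyone is paged.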
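As for the cleanup pass: the worst of the inconsistent-metrics damage is usually cardinality, and you can contain it at scrape time while the underlying code gets fixed. A sketch assuming Prometheus scrape configs; the label names and regexes are hypothetical:

```yaml
scrape_configs:
  - job_name: checkout
    static_configs:
      - targets: ["checkout:9102"]  # illustrative target
    metric_relabel_configs:
      # Drop a per-request ID label that explodes cardinality (hypothetical label name).
      - action: labeldrop
        regex: request_id
      # Drop series whose route label leaked raw numeric IDs instead of templated paths.
      - source_labels: [route]
        action: drop
        regex: .*/[0-9]{6,}(/.*)?
```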
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
