SLOs That Actually Change On‑Call Behavior (and Cut Incident Volume)

Stop paging on vanity metrics. Start paging on error‑budget burn, saturation, and deploy health—and wire the same signals into triage and rollout automation.

If an SLI doesn’t change what on-call does at 2am, it’s not an SLI. It’s a dashboard artifact.

The “We Have SLOs” Lie You Tell Yourself on a Calm Tuesday

I’ve walked into a lot of orgs that proudly point to an SLO doc… while on-call is getting paged for CPU > 80% and PodRestartCount > 0. That’s not an SLO program. That’s a monitoring museum.

Here’s the tell: when a page fires, the responder’s first move is to open Grafana, then click around for 10 minutes trying to guess what’s broken. If that’s you, your SLIs are not driving behavior—they’re just trivia with alerts attached.

The goal isn’t “more observability.” The goal is fewer incidents and faster recovery. That happens when:

  • SLIs are user-impacting and predictive
  • Alerts fire on error-budget burn, not raw noise
  • Telemetry is wired to triage (what to do next)
  • The same signals gate rollouts (so failures don’t ship)

Let’s talk about SLIs/SLOs that actually change what humans do—and reduce incident volume.

Pick SLIs That Predict Pain (Not Vanity Metrics)

If you can’t answer “what user action is failing?” you’re probably tracking the wrong thing. I’ve seen teams obsess over node-level CPU while checkout is failing due to a downstream timeout. CPU was a spectator sport.

Start with user journeys and service contracts (a couple of these are sketched as queries after the list):

  • Availability / Success rate: good / total for the request that matters (e.g., POST /checkout, not all HTTP)
  • Latency: p95 or p99 for the same journey (and don’t average it into uselessness)
  • Correctness: mismatched totals, failed invariants, duplicate events, stale reads
  • Freshness (for pipelines): “time from ingest to queryable” (staleness is a leading indicator of missed SLAs)
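
To make that concrete, here's roughly what the availability and latency SLIs above look like as Prometheus recording rules. A minimal sketch: the http_requests_total counter matches the queries used later in this post, but the http_request_duration_seconds histogram (and its labels) is an assumption — swap in whatever your instrumentation actually exposes.

# sli.rules.yaml (sketch)
groups:
  - name: checkout-slis
    rules:
      # Availability SLI: good / total for the request that matters
      - record: sli:checkout_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST",status=~"5..|429"}[5m]))
            /
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST"}[5m]))
          )
      # Latency SLI: p99 for the same journey, not an average
      - record: sli:checkout_latency:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout",route="/checkout",method="POST"}[5m]))
          )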

A good SLI has three properties:

  • Actionable: you can mitigate it (rollback, shed load, fail over)
  • Attributable: you can slice by region, tenant, release, dependency
  • Hard to game: it reflects real user experience, not internal optimism

If your SLI can be improved by changing a dashboard query, it’s not an SLI. It’s a mood.

Leading Indicators: The Stuff That Breaks Before the Outage

Most incidents have a “pre-incident phase” that looks boring until you’ve been burned enough times to recognize it. These are the leading indicators that predict pages tomorrow:

  • Error-budget burn rate: the single best “are we dying?” metric
  • Saturation: queue depth, consumer lag, connection pool exhaustion, thread pool starvation
  • Retry/timeout rate: the early warning for cascading failures and thundering herds
  • Dependency health: external API latency/5xx, DB lock wait time, cache hit rate collapse
  • Deploy health: canary error deltas, rollout duration, crashloop after a new SHA

Concrete examples I like because they’re predictive:

  • Kafka: consumer_lag and rebalances/sec (lag + churn predicts incident-grade backlog)
  • Postgres: deadlocks, lock_wait_seconds, connections_used / max_connections
  • JVM: gc_pause_seconds spikes plus rising p99 (the slow-burn latency incident)
  • Node/Go services: retry storms (timeouts + retries) before total failure

These aren’t vanity metrics because they point to mitigation:

  • Queue depth rising? Scale consumers or shed load.
  • Retries spiking? Trip a circuit breaker, reduce retry budgets, fail fast.
  • Deploy health degrading? Auto-rollback.
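
Here's what one of these looks like in practice: a deliberately non-paging Prometheus rule for the Kafka case. This is a sketch that assumes kafka-exporter's kafka_consumergroup_lag metric and an illustrative consumer group name; adjust both to your setup.

# leading-indicators.rules.yaml (sketch)
groups:
  - name: checkout-leading-indicators
    rules:
      - alert: CheckoutConsumerLagGrowing
        # Lag is already large AND still trending up: tomorrow's backlog incident.
        expr: |
          sum(kafka_consumergroup_lag{consumergroup="checkout-workers"}) > 10000
          and
          sum(deriv(kafka_consumergroup_lag{consumergroup="checkout-workers"}[15m])) > 0
        for: 15m
        labels:
          severity: ticket   # leading indicator: file a ticket, don't wake anyone
          team: payments
        annotations:
          summary: "Checkout consumer lag is high and still growing"
          runbook_url: https://runbooks.example.com/checkout/consumer-lag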

SLOs That Drive Pages: Multi-Window Burn, Not Static Thresholds

I’ve seen this fail: teams set an SLO (say 99.9%), then page on “error rate > 1% for 5m.” That’s not SLO-based alerting. That’s a random threshold wearing an SRE hat.

What actually works is paging on burn rate: “How fast are we spending the error budget?” You want two kinds of detection:

  • Fast burn (you’re on fire): catch major outages quickly
  • Slow burn (you’re bleeding): catch regressions before they become week-long incident factories

A classic pattern is multi-window, multi-burn-rate alerting (popularized by the Google SRE Workbook):

  • Page if burn rate is very high over a short window and elevated over a longer window
  • Ticket if burn rate is moderate over longer windows

Here’s a concrete example using Sloth, which generates Prometheus recording and alerting rules from an SLO spec. This is the kind of config GitPlumbers drops into repos so it stays versioned with the service.

# slo.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-slo
spec:
  service: "checkout"
  labels:
    owner: "payments"
  slos:
    - name: "checkout-availability"
      objective: 99.9
      description: "Successful checkouts over 30d"
      sli:
        events:
          errorQuery: |
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST",status=~"5..|429"}[{{.window}}]))
          totalQuery: |
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST"}[{{.window}}]))
      alerting:
        name: "CheckoutAvailability"
        labels:
          team: payments
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket

Sloth will generate burn-rate alerts (fast + slow) that map to your objective. The behavioral change is immediate: on-call stops arguing about “is 1% error bad?” and starts answering “are we burning budget fast enough to wake someone up?”
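
If you'd rather hand-roll the rules (or sanity-check what Sloth emits), the fast-burn page for a 99.9% objective looks roughly like this. The 14.4x multiplier isn't magic: burning budget 14.4x faster than sustainable means roughly 2% of a 30-day budget gone in a single hour. A slow-burn companion (e.g., 6x over 6h and 30m, routed to a ticket) catches the quiet bleeds. Labels and thresholds below are illustrative.

# burn-rate.rules.yaml (sketch) — error budget for 99.9% is 1 - 0.999 = 0.001
groups:
  - name: checkout-burn-rate
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        # Page only when burn is >14.4x sustainable over BOTH 1h and 5m;
        # the short window keeps the alert from firing long after recovery.
        expr: |
          (
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST",status=~"5..|429"}[1h]))
            / sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST",status=~"5..|429"}[5m]))
            / sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST"}[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
          team: payments
        annotations:
          summary: "checkout is burning error budget ~14x faster than sustainable"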

Make Alerts Self-Triaging: Every Page Needs a Next Click

On-call burnout isn’t just pages—it’s pages that don’t contain enough context to act.

If the alert doesn’t include:

  • what user journey is impacted
  • what changed recently (deploy SHA, feature flag)
  • where to look next (trace exemplar, logs, dashboard)

…then responders will do the same ritual: open Grafana, open logs, guess, ping someone, and only then mitigate. That’s where MTTR goes to die.

Two practical moves that work across stacks:

  1. Propagate deploy metadata into metrics/logs/traces: git_sha, release_id, env, region.
  2. Use exemplars + tracing so the page can jump directly to a failing trace.
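
For (1), the lowest-friction version is to have CI template the SHA into the pod spec so every process can read it from the environment (the tracing example below picks it up from GIT_SHA). A minimal Kubernetes sketch — the ${GIT_SHA} substitution and registry name are placeholders for whatever your pipeline already does with image tags:

# deployment.yaml (snippet) — CI substitutes ${GIT_SHA} at render time
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/version: "${GIT_SHA}"
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:${GIT_SHA}
          env:
            - name: GIT_SHA          # read by the app and stamped onto telemetry
              value: "${GIT_SHA}"
            - name: ENV
              value: "prod"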

Example: add OpenTelemetry tracing and stamp the deploy metadata (service.version, deployment.environment) onto every span. The idea is language-agnostic; this sketch assumes an Express-style TypeScript handler:

import { trace, SpanStatusCode } from "@opentelemetry/api";
// Assumes an Express-style handler; adapt the req/res types to your framework.
import type { Request, Response } from "express";

const tracer = trace.getTracer("checkout");

export async function checkout(req: Request, res: Response) {
  return tracer.startActiveSpan("POST /checkout", async (span) => {
    // Deploy metadata on every span: a page can link straight to "what changed".
    // (service.name / service.version are normally resource attributes set at
    // SDK init; duplicating them on the span keeps them visible in trace search.)
    span.setAttribute("service.name", "checkout");
    span.setAttribute("deployment.environment", process.env.ENV || "prod");
    span.setAttribute("service.version", process.env.GIT_SHA || "unknown");

    try {
      // ... handle checkout ...
      res.status(200).send({ ok: true });
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      res.status(500).send({ ok: false });
    } finally {
      span.end();
    }
  });
}

Then in Alertmanager, include links that “just work”:

# alertmanager.yaml (snippet)
receivers:
  - name: payments-oncall
    slack_configs:
      - channel: "#oncall-payments"
        title: "{{ .CommonLabels.alertname }} ({{ .CommonLabels.service }})"
        text: |
          *SLO Burn*: {{ .CommonAnnotations.summary }}
          *Runbook*: https://runbooks.example.com/{{ .CommonLabels.service }}/{{ .CommonLabels.alertname }}
          *Dashboard*: https://grafana.example.com/d/slo-{{ .CommonLabels.service }}
          *Traces*: https://tempo.example.com/search?tag=service.name%3D{{ .CommonLabels.service }}
          *Recent deploys*: https://github.com/acme/{{ .CommonLabels.service }}/deployments

This is where incident volume drops: not because alerts are “smarter,” but because responders can confirm impact and mitigate in minutes, instead of escalating everything.

Tie SLO Telemetry to Rollout Automation (So You Don’t Ship Incidents)

The best incident is the one that never makes it past canary.

I’ve seen teams do “canaries” that were basically vibes: 5% traffic, stare at a dashboard, then promote. That’s how you end up shipping a slow burn that becomes tomorrow’s paging storm.

Instead, use the same SLO/burn queries as automated gates. If SLO burn spikes during a rollout, rollback automatically.

Here’s an Argo Rollouts example that blocks promotion based on an SLI query in Prometheus:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
    - name: checkout-5xx-rate
      interval: 30s
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST",status=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{job="checkout",route="/checkout",method="POST"}[2m]))
      successCondition: result[0] < 0.002  # <0.2% 5xx during canary

And reference it from the rollout:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: checkout-slo-gate
        - setWeight: 50
        - analysis:
            templates:
              - templateName: checkout-slo-gate

Now your on-call behavior changes in a very real way:

  • Fewer “new deploy broke prod” incidents
  • Clear ownership: the rollout system rolls back; humans investigate during business hours
  • Postmortems become about why the gate didn’t catch it (rare) rather than why humans missed it (common)

What This Looks Like in Practice (and the Metrics That Matter)

When we do this with clients at GitPlumbers, the first wins usually show up in 2–4 weeks:

  • Page volume drops 30–60% by deleting threshold alerts and paging only on burn + leading indicators
  • MTTR improves 20–40% because every page includes the “next click” (traces/logs/deploy)
  • Canary gates prevent the top offender: regressions that ship quietly and explode later

The cultural shift is the real payoff:

  • On-call stops being a punishment and starts being a control loop.
  • Product conversations become concrete: “we’re out of budget, freeze feature work” is stronger than “latency feels worse.”
  • Reliability becomes schedulable work because error budgets create decision pressure.

If you want a simple operating rhythm that sticks:

  1. Weekly 30-minute SLO review: top burn sources, top noisy alerts, top prevented rollbacks.
  2. Monthly SLO reset: adjust objectives based on business reality and system maturity.
  3. Every incident: add one leading indicator that would have predicted it earlier.

That’s the loop that reduces incident volume over quarters, not just weeks.

Key takeaways

  • If an SLI doesn’t change what on-call does at 2am, it’s a vanity metric.
  • Lead with **error-budget burn** and **saturation signals** (queues, retries, timeouts), not “CPU is high.”
  • Alert on **multi-window burn rates** to catch fast outages and slow burns without paging constantly.
  • Attach **triage context** (trace exemplars, log links, deploy SHA) to every page so responders skip the scavenger hunt.
  • Use the same SLO/burn queries as **canary gates** to prevent incidents from shipping in the first place.

Implementation checklist

  • Pick 1–3 user-journey SLIs per service (not 20 infra graphs).
  • Define SLOs with explicit windows and error budgets (e.g., 99.9% over 30d).
  • Implement **burn-rate alerting** (fast + slow) with actionable thresholds.
  • Add deploy metadata to telemetry (`git_sha`, `release_id`) and alert payloads.
  • Wire alerts to traces/logs/dashboards and a short runbook (first 5 minutes).
  • Gate canary/rollouts on the same SLO queries; enable auto-rollback.
  • Review pages weekly: delete noisy alerts, tighten missing ones, and adjust SLOs based on reality.

Questions we hear from teams

Should we page directly on SLO burn rate or on raw error rate/latency?
Page on **burn rate** for user-impacting SLOs. Keep raw error/latency alerts as **tickets** or as additional signals only when they’re clearly predictive (e.g., queue depth + retries). Burn rate aligns paging with actual budget risk and reduces noisy debates about thresholds.
How many SLOs should a service have?
Usually **1–3**. One for the primary user journey (availability), one for latency if it matters, and one domain-specific SLI (freshness/correctness) if that’s what causes incidents. More than that and you’ll drown in upkeep.
What’s the quickest way to reduce incident volume without boiling the ocean?
Replace threshold paging with **multi-window burn-rate paging** for one critical journey, and attach triage links (traces/logs/deploy SHA) to the alert. That alone typically cuts noise and shortens MTTR within weeks.
Can we do this without Prometheus?
Yes. The pattern is vendor-agnostic: define SLIs, compute burn, page on multi-window burn, and reuse the same query for canary gates. Tools differ (`Datadog SLOs`, `New Relic`, `Nobl9`, `Grafana SLO`), but the behavioral mechanics are the same.
