The SLO Rollout That Stopped the Pager Storm: Cutting MTTR 77% in 90 Days
Turning noisy alerts into decisive action with Prometheus, Sloth, and error budgets.
We didn’t fix incidents by adding more dashboards. We fixed them by agreeing on what ‘good’ is and letting error budgets drive the pager.
The Pager Was Always On Fire
I walked into a mid-market B2B SaaS (think ~120 engineers, AWS EKS, Istio, ArgoCD) where on-call looked like a slot machine at 2 a.m. The incident channel read like a weather alert. CPU spikes. GC pauses. Disk IO. All “critical.” None tied to what users actually felt.
The numbers were ugly:
- 62 pages/month across platform teams
- MTTR ~140 minutes
- 45% false-positive alerts
- SLA credits paid three quarters in a row
They’d done what many of us did in the pre-SRE era: instrument everything, alert on everything, and call it “observability.” The team had Grafana dashboards for days, but zero shared truth about what “good” looked like. Leadership was asking for fewer incidents; engineers were begging for fewer alerts. I’ve seen this movie. The fix wasn’t more dashboards. It was SLOs.
Why SLOs, Not More Dashboards
Dashboards help you look; SLOs help you decide. SLOs turn reliability into a budget you can spend intentionally. If you haven’t done pragmatic SRE before, here’s the quick refresher:
- SLI: The thing we measure (e.g., http 5xx rate, p95 latency).
- SLO: The target we promise ourselves (e.g., 99.5% monthly availability).
- Error budget: 100% − SLO (your allowable failure).
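To make the budget concrete, here’s the arithmetic for the 99.5% monthly objective we ended up using (30-day month):
30 days × 24 h × 60 min = 43,200 minutes
43,200 minutes × (100% − 99.5%) ≈ 216 minutes of tolerated full outage per month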
Industry context: Google’s SRE book made this mainstream. Teams at Slack and Shopify talk publicly about using SLOs to control change velocity. The delta between theory and reality is wiring it into your alerting and operations without hiring a dozen SREs. Our constraints here:
- Mixed estate: a Ruby monolith + 14 Go microservices
- Regulated customers (SOC 2), but no 24/7 NOC
- One overworked platform team; no appetite for big-bang migrations
- Already on Prometheus, Alertmanager, Grafana, ArgoCD, and Istio – use the stack they had
Define Signals That Actually Matter
We started with two revenue-critical journeys:
- Checkout API (/v1/checkout): availability and latency
- Workspace load (/workspaces/:id): latency perceived by logged-in users
We set conservative monthly SLOs to earn trust:
- Availability: 99.5% (≈216 minutes error budget/month)
- Latency: p95 < 300ms
We used Sloth (a simple SLO generator for Prometheus) to create recording rules and burn-rate alerts. Here’s a simplified Sloth SLO for the Checkout API availability using istio_requests_total:
# slo-checkout-availability.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: checkout-availability
namespace: sre
spec:
service: checkout-api
labels:
team: payments
slos:
- name: availability
objective: 99.5
description: Availability of /v1/checkout from edge
sli:
events:
errorQuery: |
sum(rate(istio_requests_total{reporter="destination",destination_workload="checkout",response_code=~"5.."}[5m]))
totalQuery: |
sum(rate(istio_requests_total{reporter="destination",destination_workload="checkout"}[5m]))
alerting:
name: checkout-availability
labels:
severity: page
annotations:
summary: "Checkout availability budget burn"
# Multi-window, multi-burn-rate
burnrates:
          - alert: PageQuick
            for: 2m
            factor: 14.4 # burns the 30-day budget in ~2 days (≈2% per hour)
            window: 5m
          - alert: PageSlow
            for: 15m
            factor: 6 # burns the budget in ~5 days (≈5% per 6h)
            window: 30m
          - alert: Ticket
            for: 2h
            factor: 2 # heads-up, file a ticket
            window: 6h
For latency, we used histogram quantiles:
# slo-checkout-latency.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: checkout-latency
namespace: sre
spec:
service: checkout-api
labels:
team: payments
slos:
- name: p95-latency
objective: 99.0 # percent of requests below threshold
      description: Share of checkout requests served under 300ms (tracks the p95 < 300ms target)
sli:
raw:
errorRatioQuery: |
1 - (
sum(rate(http_request_duration_seconds_bucket{le="0.3",job="checkout"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout"}[5m]))
)
alerting:
name: checkout-latency
labels:
severity: page
burnrates:
- alert: PageQuick
for: 5m
factor: 14.4
window: 5m
- alert: PageSlow
for: 30m
factor: 6
            window: 30m
Apply with GitOps, not click-ops:
kubectl apply -f slo-checkout-availability.yaml
kubectl apply -f slo-checkout-latency.yaml
Wire Alerts to Error Budgets (Not Hosts)
We killed 27 host-level alerts the first week. If the error budget is healthy, I don’t care that node 7 is at 82% CPU. When the budget is burning fast, I care a lot.
The Sloth CRDs generate the Prometheus recording and alerting rules. A burn-rate factor is just “how many times faster than the budget allows” you’re burning: sustained 14.4× empties a 30-day budget in about two days, 6× in about five. For teams not using Sloth, here’s the gist of a manual burn-rate alert using PromQL:
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: checkout-slo-alerts
namespace: sre
spec:
groups:
- name: checkout-slo
rules:
- record: job:checkout_error_ratio:5m
expr: |
sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
- alert: CheckoutErrorBudgetBurn
expr: |
(job:checkout_error_ratio:5m > (1-0.995) * 14.4) or
(avg_over_time(job:checkout_error_ratio:5m[30m]) > (1-0.995) * 6)
for: 10m
labels:
severity: page
annotations:
summary: "Checkout error budget burning fast"
            runbook_url: "https://runbooks.internal/checkout-slo"
Then route in Alertmanager by severity and team:
# alertmanager.yaml (fragment)
route:
receiver: default
routes:
- matchers:
- severity="page"
- team="payments"
receiver: payments-pager
group_by: [alertname, service]
group_wait: 30s
group_interval: 5m
repeat_interval: 2h
receivers:
- name: payments-pager
pagerduty_configs:
      - routing_key: ${PD_PAYMENTS_KEY}
Make It Operational (Runbooks, CI Policy, and Canaries)
Tech alone doesn’t change behavior. We made SLOs the contract every service had to ship with.
- Git template: service-template includes an sre/slo/*.yaml folder with Sloth specs.
- CI policy: PRs that change deploy/ must include SLOs or bump an existing one. We enforced it with a simple bash check.
#!/usr/bin/env bash
# .ci/check-slo.sh: fail the build when deploy/ changes ship without an SLO spec
set -euo pipefail
changed=$(git diff --name-only origin/main...HEAD)
if echo "$changed" | grep -q "deploy/"; then
if ! echo "$changed" | grep -q "sre/slo/"; then
echo "SLO missing: changes to deploy/ require an SLO spec" >&2
exit 1
fi
fi
- Runbooks: Every page routes to a wiki with the SLO, SLI queries, and a rollback command.
- Dashboards: Grafana shows error budget remaining front and center. If you can’t see the budget, you can’t spend it.
- Change policy tied to budget (gated on the budget-remaining query sketched after this list):
  - >50% budget remaining: free to deploy
  - 20–50%: canary-only
  - <20%: incident commander approval
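Sloth emits its own recording rules for the budget math; if you’d rather drive the Grafana stat and the change-policy gate from one explicit series, a minimal sketch looks like this (the rule name is illustrative, not one Sloth generates):
# error-budget-remaining.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-error-budget
  namespace: sre
spec:
  groups:
    - name: checkout-error-budget
      rules:
        # Fraction of the 30-day error budget still unspent (1.0 = untouched, 0 = exhausted).
        # 0.005 is the allowance for a 99.5% objective.
        - record: job:checkout_error_budget_remaining:ratio
          expr: |
            1 - (
              (
                sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[30d]))
                /
                sum(increase(istio_requests_total{destination_workload="checkout"}[30d]))
              )
              / 0.005
            )
A CI or deploy gate can then compare that series against the 0.5 and 0.2 thresholds above before promoting.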
For canaries, we used Argo Rollouts with a simple analysis template that checks the error ratio during a rollout:
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: checkout-slo-analysis
spec:
metrics:
- name: error-ratio
interval: 1m
count: 10
    successCondition: result[0] < 0.005
failureLimit: 1
provider:
prometheus:
address: http://prometheus.monitoring.svc:9090
query: |
sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[5m]))
/
          sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
Hook the template into the rollout:
# rollout.yaml (fragment)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout
spec:
strategy:
canary:
steps:
- setWeight: 20
- pause: {duration: 60}
- analysis:
templates:
- templateName: checkout-slo-analysis
- setWeight: 50
- pause: {duration: 120}
- analysis:
templates:
- templateName: checkout-slo-analysis
      - setWeight: 100
Results After 90 Days
By week two, the team felt the difference. By day 90, leadership had numbers they could take to the board:
- Pages/month: 62 → 14 (−77%)
- MTTR: 140 mins → 32 mins (−77%)
- False positives: 45% → 8%
- Change failure rate: 26% → 11%
- Deploy frequency: +38% (canaries + confidence)
- SLA credits: zero for the first quarter in a year
Qualitatively, on-call stopped being a hazing ritual. People slept. The CFO stopped asking for “reliability dashboards that look green.” Product started negotiating tradeoffs with real numbers: “We have 40% of the budget left—do we ship the risky refactor this week or next?”
We did hit a snag around latency SLOs on the monolith. GC pauses made the p95 swingy. We split SLOs by endpoint and introduced a p99.9 debug panel for capacity planning, not paging. Don’t page on p99.9 unless you like being angry.
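If you hit the same wall, the per-endpoint split is just the latency SLI grouped by route. A sketch of the recording rule, assuming the monolith’s histograms carry a route label and a job="monolith" scrape label (both are assumptions; adjust to your instrumentation):
# Per-route share of requests slower than 300ms; label names are assumptions.
- record: route:monolith_latency_error_ratio:5m
  expr: |
    1 - (
      sum by (route) (rate(http_request_duration_seconds_bucket{le="0.3",job="monolith"}[5m]))
      /
      sum by (route) (rate(http_request_duration_seconds_count{job="monolith"}[5m]))
    )
Each revenue-critical route then gets its own objective instead of sharing one p95 that GC noise can swing.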
What We Learned (And What You Can Steal)
I’ve seen SLO programs die as slideware. Here’s what actually worked:
- Start with two journeys. Prove value, then expand.
- Use multi-window burn-rate alerts. Google’s recipe exists for a reason (the standard numbers are listed after this list).
- Remove a page for every SLO page you add. Net page count must go down.
- Make SLOs part of the PR template and incident review. Culture follows tooling.
- Tie change policy to budget. Don’t rely on vibes to decide if you can deploy.
- Keep SLOs boring. 99.5% beats a flashy 99.99% you can’t keep.
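For reference, the canonical multi-window, multi-burn-rate recipe from the SRE Workbook (assuming a 30-day budget), which we adapted above:
- 2% of budget burned in 1h (5m short window) → burn rate 14.4 → page
- 5% burned in 6h (30m short window) → burn rate 6 → page
- 10% burned in 3d (6h short window) → burn rate 1 → ticket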
If you’re starting tomorrow:
- Implement Sloth or Pyrra with Prometheus; don’t roll your own unless you love yak-shaving.
- Pick SLIs that map to real user pain: 5xx rate, p95 < threshold, availability from edge.
- Define an error budget policy that product agrees with.
- Gate canaries on SLO queries in Argo Rollouts or Flagger.
- Measure results in the metrics leadership understands: MTTR, pages/month, change failure rate.
GitPlumbers came in to glue this together using the stack they already had. No rip-and-replace, just aligning signals to outcomes. That’s the job.
Key takeaways
- Tie alerts to error budgets, not host metrics. Burn-rate alerts cut noise without hiding real risk.
- Start with 2–3 SLO-backed user journeys. Prove value before boiling the ocean.
- Automate SLO creation in CI/CD so every new service ships with a contract.
- Use multi-window, multi-burn-rate alerting to catch both fast regressions and slow burns.
- Make SLOs the language in incident review and change windows; the culture shift is as important as the tech.
Implementation checklist
- Pick critical user journeys and define SLIs (availability, latency, correctness).
- Set SLO targets and error budgets aligned to business impact.
- Implement burn-rate alerts in Prometheus with Sloth or OpenSLO.
- Route alerts by budget burn to reduce noise and speed triage.
- Gate risky rollouts with SLO-aware canaries in Argo Rollouts/Flagger.
- Embed SLO ownership in on-call, runbooks, and postmortems.
- Automate SLO creation as part of your service template in GitOps.
Questions we hear from teams
- SLO vs SLA vs SLI — which do I alert on?
- Alert on SLO burn (via SLIs). SLAs are contracts with customers—don’t page your team on legal terms. SLIs are the raw signals; SLOs set expectations; error budgets determine when to page.
- We don’t use Prometheus. Can we still do this?
- Yes. The pattern works with Datadog, New Relic, or Cloud Monitoring. Use their query languages to implement burn-rate alerts. We used Prometheus/Sloth here because it was already in place and easy to automate via GitOps.
- How many SLOs per service?
- Start with 2–3 per user journey (availability + latency). Too many SLOs become noise; too few miss real failures. Expand only when a journey proves important to the business.
- Do SLOs slow down delivery?
- They speed it up. By gating risky changes and reducing alert noise, teams shipped 38% more frequently in this case. Error budgets clarify when to push and when to pause.
- What about AI features with probabilistic outputs?
- Treat correctness as an SLI. Track rejection rates, groundedness checks, or human override rates. The same burn-rate principles apply—define what ‘good enough’ is and alert when you’re burning too fast.
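For example, a correctness SLI for an assistant feature wires up exactly like availability. A sketch assuming you export counters for suggestions served and suggestions a human overrode (both metric names are hypothetical):
# Hypothetical counters: ai_suggestions_total, ai_suggestions_overridden_total
- record: job:assistant_override_ratio:1h
  expr: |
    sum(rate(ai_suggestions_overridden_total[1h]))
    /
    sum(rate(ai_suggestions_total[1h]))
Point the same burn-rate alerts at that ratio once you’ve agreed on what override rate counts as ‘good enough.’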
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
