The SLO Rollout That Stopped the Pager Storm: Cutting MTTR 77% in 90 Days
Turning noisy alerts into decisive action with Prometheus, Sloth, and error budgets.
We didn’t fix incidents by adding more dashboards. We fixed them by agreeing on what ‘good’ is and letting error budgets drive the pager.
The Pager Was Always On Fire
I walked into a mid-market B2B SaaS (think ~120 engineers, AWS EKS, Istio, ArgoCD) where on-call looked like a slot machine at 2 a.m. The incident channel read like a weather alert. CPU spikes. GC pauses. Disk IO. All “critical.” None tied to what users actually felt.
The numbers were ugly:
- 62 pages/month across platform teams
- MTTR ~140 minutes
- 45% false-positive alerts
- SLA credits paid three quarters in a row
They’d done what many of us did in the pre-SRE era: instrument everything, alert on everything, and call it “observability.” The team had Grafana dashboards for days, but zero shared truth about what “good” looked like. Leadership was asking for fewer incidents; engineers were begging for fewer alerts. I’ve seen this movie. The fix wasn’t more dashboards. It was SLOs.
Why SLOs, Not More Dashboards
Dashboards help you look; SLOs help you decide. SLOs turn reliability into a budget you can spend intentionally. If you haven’t done pragmatic SRE before, here’s the quick refresher:
- SLI: The thing we measure (e.g., http 5xx rate, p95 latency).
- SLO: The target we promise ourselves (e.g., 99.5% monthly availability).
- Error budget: 100% − SLO (your allowable failure).
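To make the budget concrete, here’s the arithmetic for the 99.5% monthly objective we ended up using (30-day month):
30 days × 24 h × 60 min = 43,200 minutes
43,200 minutes × (100% − 99.5%) ≈ 216 minutes of tolerated full outage per month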
Industry context: Google’s SRE book made this mainstream. Teams at Slack and Shopify talk publicly about using SLOs to control change velocity. The delta between theory and reality is wiring it into your alerting and operations without hiring a dozen SREs. Our constraints here:
- Mixed estate: a Ruby monolith + 14 Go microservices
- Regulated customers (SOC 2), but no 24/7 NOC
- One overworked platform team; no appetite for big-bang migrations
- Already on Prometheus, Alertmanager, Grafana, ArgoCD, and Istio – use the stack they had
Define Signals That Actually Matter
We started with two revenue-critical journeys:
- Checkout API (/v1/checkout): availability and latency
- Workspace load (/workspaces/:id): latency perceived by logged-in users
We set conservative monthly SLOs to earn trust:
- Availability: 99.5% (≈216 minutes error budget/month)
- Latency: p95 < 300ms
We used Sloth (a simple SLO generator for Prometheus) to create recording rules and burn-rate alerts. Here’s a simplified Sloth SLO for the Checkout API availability using istio_requests_total:
# slo-checkout-availability.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: checkout-availability
namespace: sre
spec:
service: checkout-api
labels:
team: payments
slos:
- name: availability
objective: 99.5
description: Availability of /v1/checkout from edge
sli:
events:
errorQuery: |
sum(rate(istio_requests_total{reporter="destination",destination_workload="checkout",response_code=~"5.."}[5m]))
totalQuery: |
sum(rate(istio_requests_total{reporter="destination",destination_workload="checkout"}[5m]))
alerting:
name: checkout-availability
labels:
severity: page
annotations:
summary: "Checkout availability budget burn"
# Multi-window, multi-burn-rate
burnrates:
          - alert: PageQuick
            for: 2m
            factor: 14.4 # burns the 30-day budget in ~2 days (≈2% per hour)
            window: 5m
          - alert: PageSlow
            for: 15m
            factor: 6 # burns the budget in ~5 days (≈5% per 6h)
            window: 30m
          - alert: Ticket
            for: 2h
            factor: 2 # heads-up, file a ticket
            window: 6h
For latency, we used histogram quantiles:
# slo-checkout-latency.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
name: checkout-latency
namespace: sre
spec:
service: checkout-api
labels:
team: payments
slos:
- name: p95-latency
objective: 99.0 # percent of requests below threshold
      description: Share of checkout requests served under 300ms (tracks the p95 < 300ms target)
sli:
raw:
errorRatioQuery: |
1 - (
sum(rate(http_request_duration_seconds_bucket{le="0.3",job="checkout"}[5m]))
/
sum(rate(http_request_duration_seconds_count{job="checkout"}[5m]))
)
alerting:
name: checkout-latency
labels:
severity: page
burnrates:
- alert: PageQuick
for: 5m
factor: 14.4
window: 5m
- alert: PageSlow
for: 30m
factor: 6
            window: 30m
Apply with GitOps, not click-ops:
kubectl apply -f slo-checkout-availability.yaml
kubectl apply -f slo-checkout-latency.yaml
Wire Alerts to Error Budgets (Not Hosts)
We killed 27 host-level alerts the first week. If the error budget is healthy, I don’t care that node 7 is at 82% CPU. When the budget is burning fast, I care a lot.
The Sloth CRDs generate the Prometheus recording and alerting rules. A burn-rate factor is just “how many times faster than the budget allows” you’re burning: sustained 14.4× empties a 30-day budget in about two days, 6× in about five. For teams not using Sloth, here’s the gist of a manual burn-rate alert using PromQL:
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: checkout-slo-alerts
namespace: sre
spec:
groups:
- name: checkout-slo
rules:
- record: job:checkout_error_ratio:5m
expr: |
sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
- alert: CheckoutErrorBudgetBurn
expr: |
(job:checkout_error_ratio:5m > (1-0.995) * 14.4) or
(avg_over_time(job:checkout_error_ratio:5m[30m]) > (1-0.995) * 6)
for: 10m
labels:
severity: page
annotations:
summary: "Checkout error budget burning fast"
            runbook_url: "https://runbooks.internal/checkout-slo"
Then route in Alertmanager by severity and team:
# alertmanager.yaml (fragment)
route:
receiver: default
routes:
- matchers:
- severity="page"
- team="payments"
receiver: payments-pager
group_by: [alertname, service]
group_wait: 30s
group_interval: 5m
repeat_interval: 2h
receivers:
- name: payments-pager
pagerduty_configs:
      - routing_key: ${PD_PAYMENTS_KEY}
Make It Operational (Runbooks, CI Policy, and Canaries)
Tech alone doesn’t change behavior. We made SLOs the contract every service had to ship with.
- Git template: service-template includes an sre/slo/*.yaml folder with Sloth specs.
- CI policy: PRs that change deploy/ must include SLOs or bump an existing one. We enforced it with a simple bash check.
#!/usr/bin/env bash
# .ci/check-slo.sh: fail the build when deploy/ changes ship without an SLO spec
set -euo pipefail
changed=$(git diff --name-only origin/main...HEAD)
if echo "$changed" | grep -q "deploy/"; then
if ! echo "$changed" | grep -q "sre/slo/"; then
echo "SLO missing: changes to deploy/ require an SLO spec" >&2
exit 1
fi
fi
- Runbooks: Every page routes to a wiki with the SLO, SLI queries, and a rollback command.
- Dashboards: Grafana shows error budget remaining front and center. If you can’t see the budget, you can’t spend it.
- Change policy tied to budget (gated on the budget-remaining query sketched after this list):
  - >50% budget remaining: free to deploy
  - 20–50%: canary-only
  - <20%: incident commander approval
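Sloth emits its own recording rules for the budget math; if you’d rather drive the Grafana stat and the change-policy gate from one explicit series, a minimal sketch looks like this (the rule name is illustrative, not one Sloth generates):
# error-budget-remaining.yaml (illustrative)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-error-budget
  namespace: sre
spec:
  groups:
    - name: checkout-error-budget
      rules:
        # Fraction of the 30-day error budget still unspent (1.0 = untouched, 0 = exhausted).
        # 0.005 is the allowance for a 99.5% objective.
        - record: job:checkout_error_budget_remaining:ratio
          expr: |
            1 - (
              (
                sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[30d]))
                /
                sum(increase(istio_requests_total{destination_workload="checkout"}[30d]))
              )
              / 0.005
            )
A CI or deploy gate can then compare that series against the 0.5 and 0.2 thresholds above before promoting.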
For canaries, we used Argo Rollouts with a simple analysis template that checks the error ratio during a rollout:
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: checkout-slo-analysis
spec:
metrics:
- name: error-ratio
interval: 1m
count: 10
    successCondition: result[0] < 0.005
failureLimit: 1
provider:
prometheus:
address: http://prometheus.monitoring.svc:9090
query: |
sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5.."}[5m]))
/
          sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
Hook the template into the rollout:
# rollout.yaml (fragment)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: checkout
spec:
strategy:
canary:
steps:
- setWeight: 20
- pause: {duration: 60}
- analysis:
templates:
- templateName: checkout-slo-analysis
- setWeight: 50
- pause: {duration: 120}
- analysis:
templates:
- templateName: checkout-slo-analysis
      - setWeight: 100
Results After 90 Days
By week two, the team felt the difference. By day 90, leadership had numbers they could take to the board:
- Pages/month: 62 → 14 (−77%)
- MTTR: 140 mins → 32 mins (−77%)
- False positives: 45% → 8%
- Change failure rate: 26% → 11%
- Deploy frequency: +38% (canaries + confidence)
- SLA credits: zero for the first quarter in a year
Qualitatively, on-call stopped being a hazing ritual. People slept. The CFO stopped asking for “reliability dashboards that look green.” Product started negotiating tradeoffs with real numbers: “We have 40% of the budget left—do we ship the risky refactor this week or next?”
We did hit a snag around latency SLOs on the monolith. GC pauses made the p95 swingy. We split SLOs by endpoint and introduced a p99.9 debug panel for capacity planning, not paging. Don’t page on p99.9 unless you like being angry.
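If you hit the same wall, the per-endpoint split is just the latency SLI grouped by route. A sketch of the recording rule, assuming the monolith’s histograms carry a route label and a job="monolith" scrape label (both are assumptions; adjust to your instrumentation):
# Per-route share of requests slower than 300ms; label names are assumptions.
- record: route:monolith_latency_error_ratio:5m
  expr: |
    1 - (
      sum by (route) (rate(http_request_duration_seconds_bucket{le="0.3",job="monolith"}[5m]))
      /
      sum by (route) (rate(http_request_duration_seconds_count{job="monolith"}[5m]))
    )
Each revenue-critical route then gets its own objective instead of sharing one p95 that GC noise can swing.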
What We Learned (And What You Can Steal)
I’ve seen SLO programs die as slideware. Here’s what actually worked:
- Start with two journeys. Prove value, then expand.
- Use multi-window burn-rate alerts. Google’s recipe exists for a reason (the standard numbers are listed after this list).
- Remove a page for every SLO page you add. Net page count must go down.
- Make SLOs part of the PR template and incident review. Culture follows tooling.
- Tie change policy to budget. Don’t rely on vibes to decide if you can deploy.
- Keep SLOs boring. 99.5% beats a flashy 99.99% you can’t keep.
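For reference, the canonical multi-window, multi-burn-rate recipe from the SRE Workbook (assuming a 30-day budget), which we adapted above:
- 2% of budget burned in 1h (5m short window) → burn rate 14.4 → page
- 5% burned in 6h (30m short window) → burn rate 6 → page
- 10% burned in 3d (6h short window) → burn rate 1 → ticket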
If you’re starting tomorrow:
- Implement Sloth or Pyrra with Prometheus; don’t roll your own unless you love yak-shaving.
- Pick SLIs that map to real user pain: 5xx rate, p95 < threshold, availability from edge.
- Define an error budget policy that product agrees with.
- Gate canaries on SLO queries in Argo Rollouts or Flagger.
- Measure results in the metrics leadership understands: MTTR, pages/month, change failure rate.
GitPlumbers came in to glue this together using the stack they already had. No rip-and-replace, just aligning signals to outcomes. That’s the job.
Key takeaways
- Tie alerts to error budgets, not host metrics. Burn-rate alerts cut noise without hiding real risk.
- Start with 2–3 SLO-backed user journeys. Prove value before boiling the ocean.
- Automate SLO creation in CI/CD so every new service ships with a contract.
- Use multi-window, multi-burn-rate alerting to catch both fast regressions and slow burns.
- Make SLOs the language in incident review and change windows; the culture shift is as important as the tech.
Implementation checklist
- Pick critical user journeys and define SLIs (availability, latency, correctness).
- Set SLO targets and error budgets aligned to business impact.
- Implement burn-rate alerts in Prometheus with Sloth or OpenSLO.
- Route alerts by budget burn to reduce noise and speed triage.
- Gate risky rollouts with SLO-aware canaries in Argo Rollouts/Flagger.
- Embed SLO ownership in on-call, runbooks, and postmortems.
- Automate SLO creation as part of your service template in GitOps.
Questions we hear from teams
- SLO vs SLA vs SLI — which do I alert on?
- Alert on SLO burn (via SLIs). SLAs are contracts with customers—don’t page your team on legal terms. SLIs are the raw signals; SLOs set expectations; error budgets determine when to page.
- We don’t use Prometheus. Can we still do this?
- Yes. The pattern works with Datadog, New Relic, or Cloud Monitoring. Use their query languages to implement burn-rate alerts. We used Prometheus/Sloth here because it was already in place and easy to automate via GitOps.
- How many SLOs per service?
- Start with 2–3 per user journey (availability + latency). Too many SLOs become noise; too few miss real failures. Expand only when a journey proves important to the business.
- Do SLOs slow down delivery?
- They speed it up. By gating risky changes and reducing alert noise, teams shipped 38% more frequently in this case. Error budgets clarify when to push and when to pause.
- What about AI features with probabilistic outputs?
- Treat correctness as an SLI. Track rejection rates, groundedness checks, or human override rates. The same burn-rate principles apply—define what ‘good enough’ is and alert when you’re burning too fast.
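For example, a correctness SLI for an assistant feature wires up exactly like availability. A sketch assuming you export counters for suggestions served and suggestions a human overrode (both metric names are hypothetical):
# Hypothetical counters: ai_suggestions_total, ai_suggestions_overridden_total
- record: job:assistant_override_ratio:1h
  expr: |
    sum(rate(ai_suggestions_overridden_total[1h]))
    /
    sum(rate(ai_suggestions_total[1h]))
Point the same burn-rate alerts at that ratio once you’ve agreed on what override rate counts as ‘good enough.’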
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
