From Pager Hell to Predictable On-Call: How SLOs Cut Pages 65% in 90 Days
We stopped chasing CPU graphs and started defending error budgets. The result: fewer pages, faster MTTR, calmer engineers, safer releases.
The incident treadmill we walked into
I walked into a Monday standup where the on-call had slept on the couch in the office. Again. The company (let’s call them MarketForge) runs a high-traffic marketplace on AWS EKS with Istio, Prometheus, and Alertmanager. They'd bolted on Datadog for host metrics, Sentry for errors, and PagerDuty for paging.
The problem: incidents were defined by whatever metric tripped first. CPU spikes? Page. Pod restarts? Page. A minor 5xx blip on a non-critical endpoint at 3 a.m.? Page. They were doing drive-by dashboarding and vibe debugging. AI-assisted PRs were shipping faster than the monitoring could keep up, and the team was drowning in noise.
- Average pages per week: 38 (with bursts >60 during releases)
- MTTR: ~6 hours
- MTTD: ~20 minutes (read: humans noticed before dashboards)
- Change failure rate: ~32%
- Compliance constraint: PCI scope on checkout; zero tolerance for silent failures there
I’ve seen this movie: without SLOs, you’re optimizing for graphs, not users. We flipped the script.
Why SLOs changed the game
Dashboards are for humans; SLOs are contracts with your users. When we centered incident response on SLOs, three things happened immediately:
- We stopped paging on infra noise and started paging on user pain.
- We could quantify risk with error budgets instead of arguing about severity.
- We created a common language across engineering, product, and compliance.
We anchored on two simple rules:
- Define SLIs around critical user journeys: `login`, `search`, `checkout`.
- Alert only when SLO error budgets burned at meaningful rates (fast and slow).
We didn’t fix incidents; we fixed what we alert on.
What we implemented (the boring, critical details)
We kept it boring on purpose. Fancy observability with no governance is just expensive noise.
- SLIs for critical paths
- Availability: ratio of `2xx|3xx` responses to all responses per service
- Latency: `p95` under a threshold (e.g., `300ms` for search, `500ms` for checkout)
- Source: `Istio` metrics exported to `Prometheus`
PromQL examples:
# Checkout availability SLI, expressed as an error ratio (5xx + throttles count as errors)
sum(rate(istio_requests_total{destination_workload="checkout", response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))

# Checkout latency SLI (p95 under 500ms; the histogram is in milliseconds)
histogram_quantile(0.95,
  sum by (le) (rate(istio_request_duration_milliseconds_bucket{destination_workload="checkout"}[5m]))
) < 500

- SLOs with error budgets
- Checkout availability SLO: 99.9% over 28 days
- Search latency SLO: p95 < 300ms for 99% of requests over 28 days
- Login availability SLO: 99.95% over 28 days
We codified these with Sloth so SLOs, alerts, and dashboards are generated from one YAML source of truth and deployed with ArgoCD.
# sloth.yaml
version: "prometheus/v1"
service: checkout
slos:
  - name: http-availability
    objective: 99.9
    description: "HTTP 2xx/3xx ratio over 28d"
    labels:
      team: core-commerce
      tier: critical
    sli:
      events:
        error_query: |
          sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[{{.window}}]))
        total_query: |
          sum(rate(istio_requests_total{destination_workload="checkout"}[{{.window}}]))
    alerting:
      name: checkout-slo
      labels:
        severity: page
      annotations:
        runbook: https://runbooks.marketforge.internal/checkout/slo
      page_alert: { disable: false }
      ticket_alert: { disable: false }

- Multi-window, multi-burn rate alerts
We used the Google SRE pattern to catch both fast burns (explosions) and slow burns (leaks):
- Fast burn: 2h window, burn rate > 14x (page now)
- Slow burn: 6h and 24h windows, burn rate > 6x and > 1x (ticket or page depending on tier)
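One nuance the simplified rules below gloss over: "multi-window" means a long and a short window both have to be burning before anyone gets paged, so a spike that has already recovered doesn't wake someone up. A sketch of that fast-burn condition in raw PromQL, with illustrative 1h/5m windows and the same 14x threshold (tune the windows to your own rules):

# Fast-burn page condition: both the long and the short window exceed 14x the allowed error rate.
(
  sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[1h]))
  /
  sum(rate(istio_requests_total{destination_workload="checkout"}[1h]))
) > (14 * (1 - 0.999))
and
(
  sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
  /
  sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
) > (14 * (1 - 0.999))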
# prometheus-rules.yaml (generated by Sloth, simplified)
groups:
  - name: slo-burn
    rules:
      - alert: SLOErrorBudgetBurnFast
        expr: |
          (
            sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
            /
            sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
          ) > (14 * (1 - 0.999))
        for: 10m
        labels:
          severity: page
          service: checkout
        annotations:
          summary: "Checkout SLO fast burn"
          runbook: "https://runbooks.marketforge.internal/checkout/slo"
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          (
            avg_over_time(
              (
                sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
                /
                sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))
              )[6h:]
            ) > (6 * (1 - 0.999))
          )
        for: 30m
        labels:
          severity: ticket
          service: checkout
        annotations:
          summary: "Checkout SLO slow burn"
          runbook: "https://runbooks.marketforge.internal/checkout/slo"

- Pager routing that respects sleep
Alertmanager routed only `severity: page` alerts from SLO burns to PagerDuty. Everything else opened a ticket in Jira or Slack.
# alertmanager.yaml (routing snippet)
route:
  receiver: default
  routes:
    - matchers:
        - severity="page"
      receiver: pagerduty
      group_by: [service]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
    - matchers:
        - severity="ticket"
      receiver: jira
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
  - name: jira
    webhook_configs:
      - url: https://jira.marketforge.internal/hooks/alerts

- GitOps everything
- SLO YAML lived under `sre/slos/` in the mono-repo. `ArgoCD` synced `PrometheusRule` CRDs, `Alertmanager` config, and `Grafana` dashboards.
- Changes required a PR, code review, and a canary rollout. No click-ops.
- Make it visible
We carved out a Grafana folder: one dashboard per service, top-left panel is remaining error budget over 28 days, plus burn rate sparkline. The first graph every on-call saw was user impact, not pod counts.
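The budget panel is just math over the SLI. A sketch of the kind of query behind it, written against the raw Istio metrics (Sloth also generates SLI recording rules you could point the panel at instead), for the 99.9%/28-day checkout SLO, whose full budget works out to roughly 40 minutes of total downtime:

# Remaining 28-day error budget for checkout, as a fraction (1.0 = untouched, 0 = exhausted).
# Budget = 1 - SLO = 0.1% of requests over 28d (about 40 minutes of full outage at steady traffic).
1 - (
  (
    sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[28d]))
    /
    sum(increase(istio_requests_total{destination_workload="checkout"}[28d]))
  )
  /
  (1 - 0.999)
)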
How incident response actually changed
We didn’t just add alerts; we rewrote the on-call contract.
- Paging policy: Only SLO burn pages. Node/pod/K8s noise became tickets with rational priorities.
- Triage flow started at the SLI panel, not the pod list. The second thing on-call checked was the `ArgoCD` deploy history. 80% of SLO burns correlated with a deploy in the last 30 minutes. Shocking, I know.
- Rollback and flag strategy: `Argo Rollouts` for canary, `LaunchDarkly` for feature flags. If a canary burned 2% of the error budget in 15 minutes, `rollout abort` was the default, not a debate.
- Runbooks: Each SLO had a runbook with `kubectl` one-liners, `istioctl proxy-status`, SLI queries, and feature flag kill switches.
# Quick triage snippets from the runbook
kubectl -n checkout get pods -o wide --sort-by=.status.containerStatuses[0].restartCount
# Check last deploy
argocd app history checkout | head -n 5
# Compare SLI before/after deploy
promtool query instant http://prometheus:9090 \
'sum(rate(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{destination_workload="checkout"}[5m]))'

The social change was the hardest part. We had to deprogram the “CPU>85% == page” reflex. But once engineers saw reduced noise and clearer priorities, they leaned in.
Results after 30/60/90 days
We measured hard outcomes, not vibes.
- Pages/week: 38 → 14 (−63%) by day 60; stabilized at 13–16 by day 90
- MTTR: 6h → 1h 50m at day 30 → 48m at day 90 (−87%)
- MTTD: 20m → 5m (multi-window alerts caught slow burns early)
- Change failure rate: 32% → 12% (canaries + flag kills on burn)
- On-call satisfaction (internal survey): 2.1/5 → 4.0/5
- Unplanned downtime on checkout (28d): 220 minutes → 24 minutes
- Compliance posture: PCI evidence packs included SLO dashboards and error budget policy; zero nonconformities in the audit
Business impact wasn’t subtle: conversion recovered 1.8 points after we stopped burying checkout in noisy restart pages and started defending the SLO. Product stopped arguing with SRE about “is this critical?” We had numbers.
Lessons learned (and what I’d do differently)
- Don’t start with 20 SLOs. We started with three services and two SLIs each. That was enough to flip the culture.
- Pick SLO targets you can actually meet. Shipping a 99.99% SLO on day one just means permanent pages and no credibility.
- Keep SLIs boring. We resisted “weighted blended” weirdness. Ratios and histograms won.
- Codify or it didn’t happen. YAML + `Sloth` + `ArgoCD` meant changes were reviewable and auditable.
- Tie to release policy. Error budget exhaustion paused non-critical launches for a sprint (a gate query sketch follows this list). Product grumbled, then loved the predictability.
- Watch for AI-induced regressions. The fastest route to a burned budget was an “optimizing” AI patch that subtly changed retry semantics. SLOs turned those from Friday-night mysteries into Tuesday-morning blips.
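On the release-policy point, the gate can be as simple as a query your CI runs before promoting a non-critical launch, for example via the `promtool query instant` pattern from the runbook above. A sketch reusing the 28-day budget math, returning 1 while checkout still has budget and 0 once it is exhausted:

# Release gate: 1 while the checkout 28d error budget has headroom, 0 once it's spent.
(
  1 - (
    (
      sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[28d]))
      /
      sum(increase(istio_requests_total{destination_workload="checkout"}[28d]))
    )
    / (1 - 0.999)
  )
) > bool 0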
Do this next week: a 7-step playbook
- List your top three user journeys and pick an availability and latency SLI for each.
- Set SLOs with a 28d window and error budget policies (what pauses when you breach?).
- Generate Prometheus rules via `Sloth` (or hand-roll if you must) and deploy with `ArgoCD`.
- Implement two alert rules per SLO: fast (2h burn) pages, slow (6h/24h) tickets.
- Route non-SLO alerts away from `PagerDuty`. Sleep is a feature.
- Write runbooks that start with SLI graphs and recent deploys. Practice once a month.
- Add canary/flags to shorten MTTR: `Argo Rollouts` + `LaunchDarkly` is a strong combo.
If you want a sanity check on your first SLOs or help wrangling the PromQL, this is exactly what we do at GitPlumbers. We’ve cleaned up enough AI-generated “observability” YAML to know where the footguns are.
Key takeaways
- Page on user pain, not node metrics: alert on SLO error budget burn, not CPU or pod restarts.
- Define SLIs from top user journeys; keep them simple and objective.
- Use multi-window, multi-burn-rate alerts to catch both fast and slow burns without noisy flapping.
- Codify SLOs, alerts, and routing via GitOps so changes get reviewed, tested, and rolled out predictably.
- Tie incident response to error budgets: on breach, slow changes, add guardrails, and fix causes, not symptoms.
Implementation checklist
- Map 3-5 critical user journeys and define one availability and one latency SLI for each.
- Pick realistic SLO targets (e.g., 99.9% over 28d) with a clear error budget policy.
- Implement multi-window burn rate alerts (e.g., 2h/1h fast, 6h/24h slow) with `severity: page`.
- Route non-SLO alerts to tickets; reserve pages for error budget burns.
- Codify SLOs with a generator like Sloth, deploy via `ArgoCD`, and visualize error budgets in `Grafana`.
- Create runbooks that start with SLI graphs and recent deploys; practice drills monthly.
- Use feature flags/canaries (`LaunchDarkly`, `Argo Rollouts`) to reduce blast radius when SLOs burn.
Questions we hear from teams
- How do I pick my first SLO targets?
- Start with what you can meet based on historical data. If your checkout availability has been 99.7–99.85% over the last quarter, set 99.8% or 99.85%, not 99.99%. Use a 28-day window and revisit quarterly. The point is to create a useful error budget signal, not win an uptime beauty contest.
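To put a number on that, a query along these lines (using the same Istio metrics as the rest of this post; swap in your own request counter) reports observed checkout availability over the last 30 days:

# Observed 30-day availability for checkout: 1 - (errored requests / total requests).
1 - (
  sum(increase(istio_requests_total{destination_workload="checkout",response_code=~"5..|429"}[30d]))
  /
  sum(increase(istio_requests_total{destination_workload="checkout"}[30d]))
)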
- Do I need Istio to implement SLIs/SLOs?
- No, but a mesh makes it easier. You can use NGINX Ingress, Envoy, or app-level metrics (OpenTelemetry) to expose request counts and latency histograms. What matters is consistent metrics for total vs error counts and latency buckets.
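For example, with ingress-nginx the same availability SLI falls out of the controller's request counter. A sketch, assuming the stock `nginx_ingress_controller_requests` metric and an Ingress named `checkout` (verify the label names your controller version actually exports):

# Error ratio at the ingress, no mesh required (ingress-nginx controller metrics).
sum(rate(nginx_ingress_controller_requests{ingress="checkout", status=~"5..|429"}[5m]))
/
sum(rate(nginx_ingress_controller_requests{ingress="checkout"}[5m]))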
- What about composite services and partial failures?
- Model user journeys at the edge if possible (e.g., via API gateway metrics) and add service-level SLOs where necessary. If the page is triggered from the edge SLO, use service SLOs to triage. Avoid clever weighted composites until you’ve mastered basics.
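On Istio, measuring at the edge can mean scoping the same SLI to traffic reported by the ingress gateway. A sketch that assumes the default gateway workload name and that the checkout journey is identifiable by destination service (adjust the matchers to your topology):

# Edge-level error ratio for the checkout journey, as reported by the ingress gateway's proxy.
sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", destination_service_name=~"checkout.*", response_code=~"5..|429"}[5m]))
/
sum(rate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway", destination_service_name=~"checkout.*"}[5m]))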
- Won’t slow-burn alerts cause alert fatigue?
- Use multi-window thresholds and route slow burns to tickets unless they continue to burn. We page on fast burns; we ticket on slow burns. Most teams see fewer, earlier, and more actionable signals with this pattern.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
