The SLIs That Actually Change On‑Call: Predict Failures, Gate Rollouts, Ship Calmly
Stop paging on CPU and start shipping based on error budgets that predict pain before customers feel it.
“If your alert doesn’t change someone’s behavior, it’s not an alert. It’s a dashboard.”
You don’t need another dashboard. You need better predictors.
I’ve sat through too many 2 a.m. pages where the only signal was CPU 92% on some node pool. Nobody made a better decision because of it. The incidents that bit us were always telegraphed by earlier, quieter signals: tail latency creeping, retry storms, queue backlogs, connection pool saturation. When we flipped our SLOs to track those, on‑call went from whack‑a‑mole to calm, and incident volume dropped.
If your SLOs don’t change how you roll out or triage, they’re just KPIs in a nicer font.
This is the playbook we implement at GitPlumbers when we’re called into burnt‑out teams who’ve tried “observability” and got only more noise.
Define SLIs that predict pain (not vanity)
Your SLIs should be the earliest reliable proxies for user harm on critical journeys. A few that consistently work:
- Request success rate for customer‑facing endpoints, not service‑to‑service. Example: `POST /checkout` success, not “cluster 2xx”. Tie it to `user_journey=checkout`.
- Tail latency (p95/p99) on those same journeys. p50 tells you the happy path; p95 tells you who’s about to file a ticket.
- Saturation headroom where contention hurts first: DB connection pool (e.g., `pgbouncer_stats_free_slots`), thread pools, Kafka consumer lag, Kubernetes workqueue depth.
- Retry and throttle signals: 429/503 ratios, client retry counts, circuit‑breaker open rate. Retry storms predict meltdowns.
- Backlog growth: SQS/Kafka consumer lag, `workqueue_depth` in controllers. Backlog climbing + steady input == future SLO breach.
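The “backlog climbing + steady input” test is just arithmetic on in/out rates, which makes it easy to page on the prediction instead of the breach. A minimal sketch, where the `minutes_until_breach` helper and all the numbers are hypothetical:

```python
# Hypothetical sketch: estimate when a growing consumer backlog will cross
# the lag threshold that breaks the latency SLO.
def minutes_until_breach(current_lag: int, produce_rate: float,
                         consume_rate: float, max_lag: int) -> float:
    """Rates are messages/minute. Returns minutes of runway until lag
    exceeds max_lag, or infinity if consumers are keeping up."""
    growth = produce_rate - consume_rate
    if growth <= 0:
        return float("inf")
    return (max_lag - current_lag) / growth

# 120k messages behind, producing 10k/min, consuming 8k/min,
# alerting when lag would hit 200k:
eta = minutes_until_breach(120_000, 10_000, 8_000, 200_000)  # 40.0 minutes
```

Forty minutes of runway is a calm page; the lag alert after the fact is a Sev‑1.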
What to drop from paging:
- Raw CPU/memory/disk unless they directly cause user harm and have no better proxy.
- Averages of anything. Averages are where incidents go to hide.
- Global availability across 10 services—scope SLIs to the user journey and owning team.
Instrument consistently:
- Use OpenTelemetry to attach `service.name`, `deployment.environment`, `team`, `tier`, and `user_journey` to traces/metrics/logs.
- Normalize HTTP metrics: status code family, method, route template. Avoid cardinality bombs (no full URLs).
Put SLOs where the pager changes hands (and behavior)
An SLO isn’t “99.9% because marketing.” It’s a contract: when the error budget burns at a certain rate, you take a different action.
- Example: for the Checkout API, the objective is `99.9%` monthly success. If burn rate > 14× over 5m, page primary immediately; if > 2× over 6h, page within business hours.
- Tie actions to budget state:
- Burn < 25%: ship freely, allow canary to auto‑promote.
- Burn 25–75%: require canary analysis to pass stricter gates.
- Burn > 75%: freeze risky changes; only hotfixes with rollback prepped.
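Those tiers are easy to encode next to your deploy tooling so the policy is code, not tribal knowledge. A hypothetical sketch (the `rollout_policy` helper and tier names mirror this post, not any standard API):

```python
# Illustrative mapping from error-budget burn to rollout policy.
def rollout_policy(budget_burned: float) -> str:
    """budget_burned = fraction of this month's error budget already spent."""
    if budget_burned < 0.25:
        return "auto-promote"   # ship freely; canary auto-promotes
    if budget_burned <= 0.75:
        return "strict-canary"  # require canary analysis with tighter gates
    return "freeze"             # hotfixes only, rollback prepped
```

Wire the return value into your pipeline gate and nobody argues about freezes in Slack.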
This is where I’ve seen teams finally reduce incidents: the SLO is not just plotted in Grafana; it gates rollouts and dictates triage.
Wire telemetry to triage and rollout automation
Make the data drive decisions automatically. The boring plumbing pays off fast.
- Alerting: Use multi‑window, multi‑burn rate alerts (from the Google SRE book) so you only wake people when users feel it.
- Triage routing: PagerDuty Event Orchestration uses labels to route/suppress. If the SLO isn’t burning, suppress node CPU alerts or reclassify as low.
- Runbooks in code: Every page fires an action—link a runbook, attach last failed deploy, and top suspect services from traces.
- Rollouts gated by SLO: Argo Rollouts or Flagger query Prometheus. If success rate dips or p95 spikes, the canary halts or rolls back. Feature flags follow the same rule.
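For intuition on why those burn-rate multipliers are safe to page on: a burn rate times its window tells you how much of the period’s budget it consumes. A quick sketch with this post’s 99.9%/30‑day numbers (the `budget_consumed` helper is illustrative):

```python
# Fraction of a period's error budget consumed by a sustained burn:
# budget_fraction = burn_rate * window / period.
def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """How much of the period's error budget is spent if burn_rate
    holds for window_hours."""
    return burn_rate * window_hours / period_hours

fast = budget_consumed(14, 1)  # 14x for 1h ~ 1.9% of the monthly budget
slow = budget_consumed(2, 6)   # 2x for 6h ~ 1.7%
# At a sustained 14x burn the entire budget is gone in 720/14 ~ 51 hours,
# which is why the fast-burn alert pages immediately.
```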
I’ve watched a fintech cut Sev‑1s by 40% in a quarter by doing just this: SLO burn gates on Argo + PD orchestration to de‑noise infra alerts.
Concrete configs you can copy‑paste
Here are minimal, production‑proven snippets you can adapt.
Prometheus: recording rules and multi‑window burn alerts
# prometheus-rules.yaml
groups:
  - name: checkout-slo
    rules:
      - record: job:http_request_total:rate5m
        expr: sum by (job, user_journey) (rate(http_requests_total{user_journey="checkout"}[5m]))
      - record: job:http_request_errors:rate5m
        expr: sum by (job, user_journey) (rate(http_requests_total{user_journey="checkout", status=~"5..|429|499"}[5m]))
      - record: slo:checkout:error_ratio5m
        expr: job:http_request_errors:rate5m / job:http_request_total:rate5m
      - record: slo:checkout:error_ratio1h
        expr: avg_over_time(slo:checkout:error_ratio5m[1h])
      - record: slo:checkout:error_ratio6h
        expr: avg_over_time(slo:checkout:error_ratio5m[6h])
      # Multi-window burn alerts for 99.9% SLO (0.1% budget)
      # Fast burn (page now): error ratio over 5m > 14x budget
      - alert: CheckoutFastBurn
        expr: slo:checkout:error_ratio5m > (0.001 * 14)
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Checkout SLO fast burn"
          description: "Error ratio > 14x budget over 5m. User impact likely."
      # Slow burn (page during hours): over 6h > 2x budget
      - alert: CheckoutSlowBurn
        expr: slo:checkout:error_ratio6h > (0.001 * 2)
        for: 10m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Checkout SLO slow burn"
          description: "Error ratio > 2x budget over 6h. Investigate within business hours."
Sloth: codify the SLO (so it’s reviewable and versioned)
# slo-checkout.yaml (Sloth)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-availability
spec:
  service: checkout
  labels:
    team: payments
    user_journey: checkout
  slos:
    - name: availability
      objective: 99.9
      description: Checkout success rate monthly
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{user_journey="checkout",status=~"5..|429|499"}[5m]))
          totalQuery: sum(rate(http_requests_total{user_journey="checkout"}[5m]))
      alerting:
        name: Checkout
        labels:
          severity: page
        annotations:
          runbook: https://runbooks.example.com/checkout
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket
Argo Rollouts: gate promotion on SLO queries
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-analysis
spec:
  args:
    - name: version
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            1 - (sum(rate(http_requests_total{user_journey="checkout",status=~"5..|429|499",version="{{args.version}}"}[1m])) /
            sum(rate(http_requests_total{user_journey="checkout",version="{{args.version}}"}[1m])))
      successCondition: result[0] >= 0.999
    - name: p95-latency
      interval: 1m
      count: 5
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{user_journey="checkout",version="{{args.version}}"}[1m])))
      successCondition: result[0] < 0.3
# rollout.yaml (snippet)
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-canary-analysis
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
Flagger + Istio: automatic rollback with success rate and latency
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 80
    gateways:
      - public-gateway
    hosts:
      - checkout.example.com
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99.9
        interval: 30s
      - name: request-duration
        thresholdRange:
          max: 300
        interval: 30s
Istio: stop retry storms and eject bad pods early
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-dr
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
---
# Retry caps live on the route (VirtualService), not the DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-vs
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout
      retries:
        attempts: 2
        perTryTimeout: 300ms
PagerDuty Event Orchestration: suppress vanity alerts when SLO is healthy
{
  "rules": [
    {
      "conditions": [
        {"operator": "and", "subconditions": [
          {"field": "details.alert_type", "operator": "equals", "value": "cpu_high"},
          {"field": "details.slo_burning", "operator": "equals", "value": false}
        ]}
      ],
      "actions": {"suppress": true, "annotations": {"note": "CPU high suppressed unless SLO burning"}}
    }
  ]
}
Make leading indicators part of triage
Now that you’ve got the wiring, teach the pager to speak the language of prediction.
- Page titles: include `burn_rate`, top impacted `user_journey`, and last deploy SHA. Example: “Checkout SLO fast burn (14x) – deploy 3f2a9bc 8m ago”.
- Enrich alerts with suspects: top N spans by error rate from traces (e.g., `SpanMetricsProcessor`), DB saturation headroom, and retry ratios.
- First actions in runbook:
  - Check canary status; if failing, roll back (`kubectl argo rollouts undo ...`).
  - If retries > threshold, tighten Istio outlier detection by one notch.
  - If backlog rising, scale consumers before chasing code paths.
On‑call should move from “hunt for dashboards” to “confirm and act in <5 minutes.”
Run the loop weekly: tune, delete, and dare to freeze
The teams that win treat SLOs as a feedback loop, not a one‑time ceremony.
- Review error‑budget burn every week. If burn was near zero, your SLO is either too loose or you’re under‑shipping—loosen rollout gates. If you ran hot, freeze risky changes next sprint.
- Prune alerts that didn’t change behavior. If an alert never changed the on‑call action in 90 days, it’s a dashboard, not a page.
- Tighten thresholds on leading indicators (retry rate, backlog depth) as you gain headroom.
- Keep SLOs in Git next to service code via Sloth or Nobl9. PRs change SLOs; releases reference their budget state.
I’ve seen this fail when leadership treats SLOs as vanity goals. The fix is simple: wire budget state to the deploy pipeline and hold teams accountable to their own gates.
What we stopped measuring (and what it bought us)
At one marketplace client, we killed 60% of alerts in two weeks:
- Deleted: node CPU/mem pages, generic 5xx across the whole mesh, cluster “NotReady” spam. None changed the on‑call action.
- Added: `checkout` p99 latency, DB pool free slots for the payment write path, consumer lag on the `orders` topic, 429/503 ratio from the edge.
- Gated rollouts: Argo Rollouts promoted only when success rate ≥ 99.9% and p95 < 300ms for 5 consecutive minutes.
Results in 90 days:
- 42% fewer Sev‑1/Sev‑2s.
- MTTR down from 52m to 19m.
- Page volume down 58%, and engineers started sleeping again.
No silver bullets. Just SLIs that predict pain and plumbing that acts on them. That’s the boring, durable win.
Key takeaways
- Pick SLIs that are leading indicators of customer pain: tail latency, saturation, retry storms, and backlog growth.
- Express SLOs where a human would take a different action at the pager—otherwise it’s not an SLO, it’s a metric.
- Use multi-window burn-rate alerts to page only when it matters and to classify urgency.
- Gate rollouts and feature flags with error budgets; automate rollback when the burn spikes.
- Tie telemetry to triage: route, suppress, or enrich alerts based on SLO state and ownership.
Implementation checklist
- Define 3–5 user-journey SLIs (p95 latency, success rate, backlog depth, saturation headroom).
- Codify SLOs with error budgets and multi-window burn-rate alerts.
- Tag telemetry with `service`, `team`, `tier`, and `user_journey` via OpenTelemetry.
- Integrate SLO state with rollout controllers (Argo Rollouts or Flagger) to auto‑halt/rollback.
- Adopt weekly error‑budget reviews and tighten/relax gates based on data.
- Delete vanity alerts (CPU, disk) that don’t change on‑call behavior.
Questions we hear from teams
- How many SLOs per service is reasonable?
- Start with 1–2 SLOs per critical user journey, usually 3–5 per product. More than that and you’ll spend your life in meetings. Keep the long tail of metrics on dashboards, not pagers.
- What if we don’t have Prometheus/Argo/Istio?
- Great. Use what you have: Datadog monitors for burn rate, LaunchDarkly or OpenFeature for flag gates, AWS ALB/NLB metrics for success/latency, Spinnaker for automated rollouts. The pattern matters more than the tools.
- How do we pick the right objective (99.9 vs 99.95)?
- Back into it from user tolerance and incident review. If a 10‑minute checkout outage a month is acceptable, 99.98% is your ceiling. Then stress‑test with historical data to ensure you can operate within budget most months.
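That back‑of‑envelope is worth writing down once. A sketch, assuming a 30‑day month (the `objective_from_downtime` helper is illustrative):

```python
# Convert tolerated full downtime into an availability objective.
def objective_from_downtime(downtime_minutes: float,
                            period_days: int = 30) -> float:
    """Availability (%) implied by tolerating this much downtime per period."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)

obj = objective_from_downtime(10)  # ~99.977, so ~99.98% is the ceiling
```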
- Won’t gating rollouts slow us down?
- Only if your quality is already poor. Healthy teams pass gates quickly. And when things do go wrong, automatic rollback is much faster than a human hunt.
- What about internal platforms and batch jobs?
- Use SLIs that match the job: queue latency, backlog depth, and SLA completion window success. Page on predicted misses (backlog growing + steady input), not on CPU spikes.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
