Stop the Pager Pinball: Intelligent Alert Routing that Predicts Incidents and Triggers Safe Rollbacks
Cut pages by focusing on SLO burn and saturation—not CPU blips—and wire alerts to triage, canaries, and feature flags so rollouts self-protect.
Page on burn rates and saturation, not CPU. Then let the rollout abort itself. That’s how you sleep at night.
The 2 a.m. Slack Symphony You Don’t Miss
A few years back, a retail platform’s Black Friday deploy flooded on‑call with 60+ pages in 10 minutes. CPU. Disk. JVM GC. Kafka lag. Every downstream service screamed. No root cause, just noise. We fixed it in a week by doing what should’ve been done from day one: page only on leading indicators, route to owners with context, and wire the alerts to pause the rollout. Next Friday was quiet—two pages, both legit, and the canary auto‑aborted before customers felt it.
This isn’t magic. It’s a boring, disciplined stack: Prometheus + Alertmanager + PagerDuty (or Opsgenie) + Argo Rollouts (or Flagger) + your feature flags (LaunchDarkly, Unleash) + a service catalog (Backstage).
Stop Paging on Vanity Metrics
If you page on raw CPU > 80% or disk > 85%, you’re burning human attention. Those are lagging or irrelevant. Page on leading indicators that correlate with user pain or systemic failure:
- SLO burn rate: short + long window, e.g., 5m/1h for fast burn. If burn rate is hot, users are hurting—page.
- Saturation/backpressure: queue depth growing (Kafka consumer lag), thread pool saturation, DB connection pool at capacity.
- Latency tail growth: p99/p99.9 exploding toward SLO thresholds, even before hard errors.
- Circuit breakers: open ratio > threshold, retry storms detected.
- Resource exhaustion with context: container `oom_kill` count, not “memory > 80%”.
What goes to Slack/email only (no page):
- Node CPU jitters, pod restarts without user impact, cache miss rate if no SLO burn, “instance down” in a scalable ASG.
Tie it to business. If checkout’s 99.9% availability over 28 days gives a 0.1% error budget, alerting should be framed around “how fast we’re burning the budget,” not “CPU spiked.”
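That budget framing is easy to sanity-check in a few lines. A minimal sketch, assuming the 99.9%/28-day checkout SLO above (the error ratios are made up for illustration):

```python
# Error-budget math for a 99.9% availability SLO over a 28-day window.
SLO = 0.999
WINDOW_HOURS = 28 * 24   # 672h rolling window
BUDGET = 1 - SLO         # 0.1% of requests may fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are failing."""
    return error_ratio / BUDGET

def hours_to_exhaustion(error_ratio: float) -> float:
    """At the current error ratio, when is the whole budget gone?"""
    return WINDOW_HOURS / burn_rate(error_ratio)

# A 1.44% error ratio is a 14.4x burn; the 28-day budget is gone in under two days.
print(round(burn_rate(0.0144), 1), round(hours_to_exhaustion(0.0144), 1))
```

Framed this way, a page means “we will blow the month’s budget in N hours”—a conversation product managers actually understand.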
Build a Routing Topology That Respects Ownership
You need labels on metrics and alerts: service, env, tier, owner. Without them, routing is guesswork. Use Alertmanager to group, dedupe, and inhibit cascades (don’t page every downstream if api-gateway is red in prod).
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'service', 'env']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 2h
  receiver: 'slack-low'
  routes:
    - matchers:
        - severity="page"
        - env="prod"
      receiver: 'pagerduty'
      continue: false
    - matchers:
        - severity="ticket"
      receiver: 'jira'
    - matchers:
        - env="prod"
      receiver: 'slack-prod'
inhibit_rules:
  # If upstream is down, suppress child alerts in same env
  - source_matchers: ['alertname="UpstreamDown"']
    target_matchers: ['env=~"prod|staging"']
    equal: ['env']
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        severity: '{{ .CommonLabels.severity | default "critical" }}'
        details:
          service: '{{ .CommonLabels.service }}'
          env: '{{ .CommonLabels.env }}'
          owner: '{{ .CommonLabels.owner }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'jira'
    webhook_configs:
      # ticket-queue integration, e.g., a Jira automation webhook
      - url: ${JIRA_WEBHOOK}
  - name: 'slack-prod'
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#prod-alerts'
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.service }} {{ .CommonLabels.env }}: {{ .CommonLabels.alertname }}'
        text: '{{ template "slack.text" . }}'
  - name: 'slack-low'
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#noise'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'
templates:
  - '/etc/alertmanager/templates/*.tmpl'
```

Tips I’ve seen work:
- Group by `service` and `env` to avoid 20 tickets for one incident.
- Use inhibition aggressively to stop downstream storms when the upstream gateway/db is red.
- Keep `repeat_interval` long enough to avoid churning responders; rely on dashboards for updates.
- Use your service catalog (Backstage) to populate `owner` so PagerDuty auto-assigns to the right on‑call.
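That last tip is automatable with a nightly job. A sketch, assuming Backstage’s standard catalog endpoint (`/api/catalog/entities`); the output path, the `group:` prefix handling, and the fallback team name are illustrative assumptions:

```python
import json
from urllib.request import urlopen  # stdlib only; runs fine from cron

CATALOG_URL = "https://backstage.acme.internal/api/catalog/entities?filter=kind=component"

def owners_by_service(entities: list[dict]) -> dict[str, str]:
    """Map component name -> owning team, failing closed to SRE when unowned."""
    owners = {}
    for e in entities:
        name = e.get("metadata", {}).get("name")
        owner = e.get("spec", {}).get("owner") or "sre-oncall"  # no owner -> SRE
        if name:
            owners[name] = owner.removeprefix("group:")
    return owners

def sync() -> None:
    """Fetch the catalog and write a lookup file for relabeling/alert rules."""
    entities = json.load(urlopen(CATALOG_URL))
    with open("/etc/prometheus/owners.json", "w") as f:
        json.dump(owners_by_service(entities), f)
```

Failing closed to a default on‑call beats an unroutable page: somebody always gets it, and the missing label becomes a ticket instead of a mystery.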
Make Alerts Predictive: SLO Burn + Saturation Rules
Use multi-window multi-burn SLO alerts. Don’t reinvent the math; steal from Google’s SRE workbook. Create recording rules and then alert off burned ratios.
```yaml
# prometheus-rules.yml
groups:
  - name: slo-burn
    rules:
      - record: slo:checkout:error_ratio:rate5m
        expr: |
          sum(rate(http_request_errors_total{job="checkout",env="prod"}[5m]))
          /
          sum(rate(http_requests_total{job="checkout",env="prod"}[5m]))
      - record: slo:checkout:error_ratio:rate1h
        expr: |
          sum(rate(http_request_errors_total{job="checkout",env="prod"}[1h]))
          /
          sum(rate(http_requests_total{job="checkout",env="prod"}[1h]))
      # Convert to burn rate given 99.9% SLO (0.1% budget)
      - record: slo:checkout:burnrate5m
        expr: slo:checkout:error_ratio:rate5m / 0.001
      - record: slo:checkout:burnrate1h
        expr: slo:checkout:error_ratio:rate1h / 0.001
  - name: predictive-signals
    rules:
      - alert: SLOFastBurn
        expr: slo:checkout:burnrate5m > 14.4 and slo:checkout:burnrate1h > 6
        for: 5m
        labels:
          severity: page
          service: checkout
          env: prod
          owner: payments-oncall
        annotations:
          summary: 'Checkout SLO fast-burn'
          runbook_url: 'https://runbooks.acme.internal/checkout/slo-burn'
          dashboard_url: 'https://grafana.acme.internal/d/checkout'
      - alert: KafkaConsumerLagGrowing
        expr: increase(kafka_consumergroup_lag{consumergroup="checkout"}[5m]) > 10000
        for: 10m
        labels:
          severity: page
          service: checkout
          env: prod
          owner: data-platform
        annotations:
          summary: 'Checkout consumer lag is growing quickly'
      - alert: DBCxnPoolExhaustion
        expr: |
          avg(db_client_pool_in_use{service="checkout",env="prod"})
          /
          avg(db_client_pool_capacity{service="checkout",env="prod"}) > 0.9
        for: 5m
        labels:
          severity: page
          service: checkout
          env: prod
          owner: payments-oncall
        annotations:
          summary: 'Checkout DB connection pool near exhaustion'
```

Other good predictors I’ve used:
- `rpc_client_pending_requests` trending up + tail latency rising.
- `hystrix_circuit_open_total` (or `resilience4j_circuitbreaker_state` = open) past threshold.
- `container_oom_events_total` spikes.
- GC pause p99 growing with allocation rate steady (memory leak forming).
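Where do `14.4` and `6` come from? They’re the SRE-workbook conventions: page if 2% of a 30-day budget burns in 1 hour, or 5% in 6 hours. A quick sketch of the arithmetic:

```python
# Burn-rate threshold = budget fraction consumed * (SLO window / alert window).
def threshold(budget_fraction: float, slo_window_h: float, alert_window_h: float) -> float:
    return budget_fraction * slo_window_h / alert_window_h

print(threshold(0.02, 30 * 24, 1))  # fast burn: 2% of the budget gone in 1h
print(threshold(0.05, 30 * 24, 6))  # slower burn: 5% of the budget gone in 6h
```

If your SLO window is 28 days instead of 30, recompute rather than copy: the same 2%-in-1h policy gives ~13.4, not 14.4.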
Tie Telemetry to Triage: Context or It Didn’t Happen
Pages without context might as well be spam. Enrich at collection time and in the alert payload:
- Add `owner`, `service`, `tier`, `runbook_url`, `dashboard_url` as labels/annotations.
- Make PagerDuty incidents auto-assign to the team from `owner`.
- Include a query link to the relevant Grafana panel and the latest deploy SHA.
Using OpenTelemetry Collector to stamp ownership and link runbooks:
```yaml
receivers:
  otlp:
    protocols:
      http:
      grpc:
processors:
  attributes:
    actions:
      - key: service.namespace
        action: upsert
        value: prod
      - key: service.owner
        action: upsert
        value: payments-oncall
      - key: runbook.url
        action: upsert
        value: https://runbooks.acme.internal/checkout
exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes]
      exporters: [prometheus]
```

Pro tip: keep runbooks short. The first screen should answer: Is this user-impacting? What’s the fastest mitigation? Who owns the dependency? Link to kubectl commands, feature flag names, and rollback instructions.
Fast suppression matters, too. Maintenance windows? Silence with amtool:
```bash
amtool silence add alertname=DeployWindow env=prod \
  --duration=2h \
  --comment="prod deploy window" \
  --author="$(whoami)"
```

I’ve seen teams cut MTTA by 50% just by adding `runbook_url` and real ownership labels. No fancy AI required; just basic hygiene.
Close the Loop: Alert-Driven Rollouts and Kill Switches
Alerting should change the system, not just wake people. Wire your burn-rate alerts to pause/abort canaries with Argo Rollouts or Flagger. Example using an AnalysisTemplate that queries Prometheus and aborts if burn rate is hot:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-burn
spec:
  metrics:
    - name: error-burn
      interval: 1m
      count: 5
      successCondition: result[0] < 6
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: slo:checkout:burnrate1h
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 90}
        - analysis:
            templates:
              - templateName: checkout-slo-burn
        - setWeight: 25
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-slo-burn
        - setWeight: 50
        - pause: {}
```

If the analysis fails, Argo pauses/aborts automatically. That one loop closes 80% of “we shipped a problem” incidents.
Feature flags are your blast-radius lever. When the burn-rate page hits, auto-toggle a kill switch via webhook. Minimal example with Alertmanager webhook -> small handler -> LaunchDarkly API:
```bash
# naive example: on alert payload, disable a flag
curl -X PATCH \
  -H "Authorization: Bearer $LD_TOKEN" \
  -H "Content-Type: application/json" \
  https://app.launchdarkly.com/api/v2/flags/acme/checkout-kill \
  -d '[{"op":"replace","path":"/environments/prod/on","value":false}]'
```

If you’re a Flagger shop, you can simplify by embedding metric checks in the Canary CRD; same concept, fewer moving parts. For Spinnaker users, Kayenta does canary analysis with similar burn-rate gates.
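In production you’d put a small handler between Alertmanager’s webhook receiver and the flag API instead of a human with curl. A stdlib-only sketch; the `<service>-kill` flag-key convention, the `acme` project key, and the port are assumptions, not LaunchDarkly requirements:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

LD_TOKEN = os.environ.get("LD_TOKEN", "")

def kill_switch_patch(alert_payload: dict) -> tuple[str, bytes]:
    """Turn an Alertmanager webhook payload into a LaunchDarkly PATCH request."""
    service = alert_payload["commonLabels"]["service"]
    env = alert_payload["commonLabels"].get("env", "prod")
    url = f"https://app.launchdarkly.com/api/v2/flags/acme/{service}-kill"
    body = [{"op": "replace", "path": f"/environments/{env}/on", "value": False}]
    return url, json.dumps(body).encode()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        payload = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if payload.get("status") == "firing":  # ignore resolved notifications
            url, body = kill_switch_patch(payload)
            req = Request(url, data=body, method="PATCH",
                          headers={"Authorization": f"Bearer {LD_TOKEN}",
                                   "Content-Type": "application/json"})
            urlopen(req)  # flip the kill switch off
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__" and os.environ.get("RUN_SERVER"):
    HTTPServer(("", 8080), Handler).serve_forever()
```

Point an Alertmanager `webhook_configs` receiver at it, and scope that receiver to `severity=page` routes so Slack-only noise can never flip a flag.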
What Changes When You Do This Right
Real numbers from a fintech client we helped last year:
- Pages per week: 42 -> 15 (-64%) in 30 days.
- MTTA: 11m -> 4m (owners auto-assigned, runbooks linked).
- Rollback MTTR: 23m -> 8m (canary auto-abort + flags).
- False pages: down 70% (inhibition + grouping).
Secondary effects you’ll actually feel:
- On‑call stops dreading deploys; more daytime fixes, fewer heroics.
- Fewer “no-op” incident reviews because the system self-protected.
- Product teams push more often because rollouts are guardrailed.
Common Pitfalls I’ve Seen (And How to Dodge Them)
- Alert per instance. Group by `service`/`env` or you’ll get paged 50 times for one outage.
- “Everything is critical.” Reserve `severity=page` for user-impacting signals. Ticket or Slack the rest.
- Dead dashboards. If your annotation links 404, responders will ignore them next time. Keep them fresh.
- Flappy alerts. Add `for:` to require persistence. Use recording rules so expressions are stable.
- Owner drift. Sync `owner` from Backstage nightly; fail closed (no owner -> SRE on‑call + a ticket to fix labels).
If I Had to Start Tomorrow
- Pick one critical service. Define a 28‑day availability SLO and the error budget.
- Add 5m/1h burn-rate recording rules and a `SLOFastBurn` alert with `severity=page`.
- Add saturation alerts: DB pool at 90%, consumer lag growing, circuit breaker open ratio.
- Add `service`, `env`, `owner`, `runbook_url` to metrics/alerts.
- Put Alertmanager grouping, inhibition, and PagerDuty routing in place.
- Wire an `AnalysisTemplate` in Argo Rollouts to pause/abort when the burn-rate alert trips.
- Track pages/week and MTTA for 30 days; adjust thresholds monthly.
Do that, and your 2 a.m. looks less like chaos and more like a boring, predictable system. That’s the goal.
Key takeaways
- Page on leading indicators (SLO burn, saturation, queue lag), not vanity metrics (raw CPU, disk).
- Route by ownership and severity using labels; group, dedupe, and inhibit noisy children.
- Enrich alerts with runbooks and service ownership to cut MTTA.
- Automate rollouts: pause/abort canaries when burn-rate trips; kill switches via feature flags.
- Measure alert volume/page rate and tighten thresholds iteratively.
- Silence during maintenance and block downstream pages when upstream is red.
Implementation checklist
- Define SLOs with error budgets per critical service.
- Create multi-window SLO burn-rate rules and saturation alerts (queue lag, connection pool, thread pools).
- Label metrics and alerts with `service`, `env`, `owner`, `tier` for routing.
- Implement Alertmanager routes, grouping, deduplication, and inhibition.
- Enrich events with runbook URLs and service catalog ownership.
- Wire alerts to rollout controls (Argo Rollouts/Flagger) and feature-flag kill switches.
- Track MTTA/MTTR and pages-per-engineer; iterate thresholds and routing monthly.
Questions we hear from teams
- What’s a leading indicator in alerting?
- A signal that predicts user impact before it fully materializes: SLO burn rates (multi-window), queue lag growth, tail latency acceleration, connection pool saturation, circuit breaker open ratio, and OOM events. These correlate with incidents far better than raw CPU or disk.
- How do I avoid flapping alerts?
- Use recording rules to stabilize expressions, add `for:` durations to require persistence, and set multi-window conditions (e.g., 5m AND 1h). Group alerts and use inhibition to prevent cascades. Test in staging with replayed traffic if possible.
- How do I route to the right on-call automatically?
- Label alerts with `owner` from your service catalog (e.g., Backstage). Configure Alertmanager to include that label in the PagerDuty event so incidents auto-assign to the service’s on‑call schedule.
- Can I automate rollbacks without Argo Rollouts?
- Yes. Flagger, Spinnaker (Kayenta), and even CI/CD hooks can gate deploys on Prometheus queries. As a fallback, wire an Alertmanager webhook to a small handler that pauses a deployment or flips a feature flag.
- What should I measure to know it’s working?
- Pages per engineer per week, MTTA, MTTR, false-positive rate, alert volume by source, and deploy-related incident rate. Expect a 40–70% page reduction if you shift to SLO burn + saturation and fix routing.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
