Stop Paging the Whole Org: Intelligent Alert Routing That Predicts Incidents and Drives Rollbacks
Cut 50–70% of pages by routing on leading indicators, not vanity metrics—and wire those alerts to triage bots and safe rollbacks.
You don’t fix alert fatigue with “Do Not Disturb.” You fix it by making alerts the actuator in your control loop.
The 2 a.m. Slackpocalypse you don’t need
We had a Monday morning deploy at a fintech where a canary tickled a database lock bug. Within minutes, Slack lit up: 200+ alerts from every layer—pods flapping, p99 latency up, error logs flooding. Three teams got paged. Half the noise was downstream symptoms. What actually mattered: the error budget burn for the checkout SLO had crossed the critical threshold, and the canary needed to roll back. When we wired routing to that single leading indicator—and taught the system to auto-pause the rollout—we cut pages by 63% and caught the incident 15 minutes earlier.
If your alerting is a firehose, it’s not just annoying; it hides the real fire. Here’s how to fix it with intelligent routing and automation that actually shifts MTTR and sleep quality.
Measure what predicts pain (not what looks pretty on a dashboard)
If an alert can’t predict an incident or trigger a decision, it’s noise. Focus on leading indicators tied to user-facing SLOs and system saturation/backlog dynamics, not vanity metrics.
Leading indicators that work:
- Error budget burn rate (multi-window): predicts breach before customers churn.
- Saturation/backlog: Kafka consumer lag, Redis connection pool saturation, NGINX upstream queue length.
- Latency tail drift: p99 rising together with retries (signals cascading failure risk).
- Resource throttling: CPU throttles in `container_cpu_cfs_throttled_seconds_total` predict latency spikes before CPU “usage” looks bad.
- GC pause ratio / heap pressure: JVM `gc_pause_seconds` p95 > 200ms is a precursor to timeouts.
- Network health: TCP retransmits, TLS handshake failures—catch cross-zone issues fast.
Things to retire or demote to dashboards:
- Averages (CPU, latency) without tail percentiles.
- Raw request counts, log volume, “disk usage > 80%” without rate-of-change.
- “Pod restarted” spam without rate/context (watch series drift, not single restarts).
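To make the rate-of-change point concrete, here is a minimal sketch (a hypothetical `hours_until_full` helper, not from any monitoring library) of alerting on projected time-to-full instead of a static “disk usage > 80%” threshold:

```python
def hours_until_full(samples, capacity):
    """Fit a least-squares slope over (hour, bytes_used) samples and
    project hours until `capacity` is reached. Returns infinity when
    usage is flat or shrinking, so a quiet disk never pages."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = num / den  # growth per hour
    if slope <= 0:
        return float("inf")
    return (capacity - samples[-1][1]) / slope
```

A disk at 60% that will fill in 8 hours is urgent; a disk parked at 85% for a year is not. PromQL’s `predict_linear` expresses the same idea at query time.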
A concrete PromQL example for multi-window, multi-burn alerts for a 99% SLO:
```yaml
# prometheus-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-slo-burn
spec:
  groups:
    - name: slo.web
      rules:
        - record: slo:http_error_rate_5m
          expr: |
            sum(rate(http_requests_total{job="web", code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="web"}[5m]))
        - record: slo:http_error_rate_1h
          expr: |
            sum(rate(http_requests_total{job="web", code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="web"}[1h]))
        # Critical if burning >14x budget in both 5m and 1h windows
        - alert: SLOErrorBudgetBurn
          expr: (slo:http_error_rate_5m > (0.01 * 14)) and (slo:http_error_rate_1h > (0.01 * 14))
          for: 5m
          labels:
            severity: critical
            service: web
            env: prod
            owner: team-web
          annotations:
            summary: "Web SLO burn rate critical"
            runbook_url: "https://runbooks.company.com/web/slo-burn"
        # Warning if burning >6x budget in both 30m and 6h windows (omitted for brevity)
```

Replace HTTP with gRPC status codes, Kafka `lag_seconds`, or a custom RED/GOLD SLI as needed. The point: burn rate + saturation beats “CPU > 80%” every day.
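The warning tier called out in the final comment can be sketched the same way; a hedged example, assuming 30m/6h recording rules that mirror the 5m/1h pair above (the `for` duration and annotation text are illustrative):

```yaml
# Sketch of the warning rule: >6x budget burn in both 30m and 6h windows.
# Assumes slo:http_error_rate_30m / slo:http_error_rate_6h recording rules
# defined exactly like the 5m/1h ones above.
- alert: SLOErrorBudgetBurnWarning
  expr: (slo:http_error_rate_30m > (0.01 * 6)) and (slo:http_error_rate_6h > (0.01 * 6))
  for: 15m  # illustrative; tune to your paging tolerance
  labels:
    severity: warning
    service: web
    env: prod
    owner: team-web
  annotations:
    summary: "Web SLO burn rate elevated"
    runbook_url: "https://runbooks.company.com/web/slo-burn"
```

Warning-tier burns go to Slack, not PagerDuty, which is exactly what the routing config in the next section enforces.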
Route alerts like traffic, not like a megaphone
Your routing layer decides who wakes up. Use Alertmanager (or Opsgenie/PagerDuty Event Orchestration) like a router, not a dump pipe.
What actually works:
- Group by problem, not host: `group_by: ['alertname','service','env']` to collapse flapping noise.
- Inhibit downstream spam: when `ApiserverDown` fires, suppress all `PodNotReady` in that cluster.
- Route by owner: enrich labels so `owner=team-web` pages the right on-call automatically.
- Escalate by severity: `warning` to Slack, `critical` to PagerDuty; repeat intervals that respect human capacity.
- Annotate with action: add `runbook_url`, `dashboard_url`, `rollback_cmd` right in the notification.
Example Alertmanager config that does all of the above:
```yaml
# alertmanager.yaml
route:
  group_by: ['alertname', 'service', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-common'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-sre'
      continue: true
    - matchers:
        - owner="team-web"
      receiver: 'slack-team-web'
receivers:
  - name: 'pagerduty-sre'
    pagerduty_configs:
      - routing_key: '<pd-key>'
        severity: '{{ .CommonLabels.severity }}'
        class: '{{ .CommonLabels.service }}'
        component: '{{ .CommonLabels.env }}'
        details:
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'slack-common'
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#alerts'
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.service }})'
        text: '{{ template "slack.default.text" . }}'
  - name: 'slack-team-web'
    slack_configs:
      - api_url: https://hooks.slack.com/services/YYY
        channel: '#team-web-alerts'
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.service }})'
        text: '{{ template "slack.default.text" . }}'
inhibit_rules:
  - source_matchers: ['alertname="ApiserverDown"']
    target_matchers: ['alertname=~"Pod.*NotReady"']
    equal: ['cluster']
```

When we implemented this at a SaaS company with ~200 microservices, the number of distinct pages per incident dropped from 7–12 to 1–3. The on-call got a single, actionable “SLO burn” with a runbook and dashboard link—no more “kubectl bingo” at 2 a.m.
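The grouping behavior is easy to reason about in code. An illustrative sketch (not Alertmanager’s actual implementation) of how `group_by` collapses a burst into a handful of notifications:

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "service", "env")):
    """Bucket alerts by their group key; each bucket becomes one
    notification, no matter how many pods are flapping inside it."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k, "") for k in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Five `PodNotReady` alerts from five pods share one key and produce one notification; grouping by hostname would have produced five.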
Enrich telemetry with ownership (so routing Just Works)
Routing is only as good as your labels. If you don’t have owner, service, and env stamped on metrics, logs, and traces, you’ll route to the wrong team. Use OpenTelemetry Collector to enrich at ingest, pulling ownership from your service catalog (Backstage, OpsLevel, Cortex).
```yaml
# otel-collector.yaml (snippet)
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  attributes/service-owner:
    actions:
      - key: owner
        action: upsert
        from_attribute: service.owner  # set upstream, or map via lookup extension
  resource/standard:
    attributes:
      - key: service
        action: upsert
        from_attribute: service.name
      - key: env
        action: upsert
        from_attribute: deployment.environment
exporters:
  prometheus:
    endpoint: 0.0.0.0:9464
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/service-owner, resource/standard]
      exporters: [prometheus]
```

If you can’t add `owner` at the source, write a small enricher that joins on `service` using your catalog API and injects the label. The goal: Alertmanager matches `owner=team-data` without regex games.
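Such an enricher is a few lines of code. A minimal sketch, assuming a hypothetical in-memory catalog standing in for your Backstage/OpsLevel API client:

```python
# Hypothetical catalog mapping; in practice, query your service catalog API.
CATALOG = {"web": "team-web", "orders-consumer": "team-data"}

def enrich(alert):
    """Join ownership on the `service` label and inject `owner`,
    leaving any owner label already set upstream untouched."""
    labels = alert.setdefault("labels", {})
    owner = CATALOG.get(labels.get("service", ""))
    if owner and "owner" not in labels:
        labels["owner"] = owner
    return alert
```

Run it as a webhook proxy in front of Alertmanager, or as a processor in your telemetry pipeline; either way, routing rules match on `owner` directly.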
Tie alerts to triage and safe rollbacks
Alerts should kick off actions, not just ping humans. Two automations that consistently pay off:
- First-response triage bots for known failure modes.
  - Example: When `KafkaConsumerLagHigh` fires in `orders-consumer`, scale replicas x2 and post context in Slack.
  - Use `StackStorm`, `Rundeck`, or `GitHub Actions` triggered by webhooks from Alertmanager or PagerDuty.
```yaml
# stackstorm rule (gp.auto-scale.yaml)
---
name: auto-scale-kafka-consumers
pack: gp
trigger:
  type: core.webhook
  parameters:
    url: /alerts
criteria:
  trigger.body.commonLabels.alertname:
    type: equals
    pattern: KafkaConsumerLagHigh
action:
  ref: kubernetes.scale_deployment
  parameters:
    name: orders-consumer
    namespace: prod
    replicas: 6
```

- Rollout gates that pause/rollback on SLO regressions.
  - With Argo Rollouts or Flagger, wire Prometheus queries into canary analysis.
```yaml
# analysis template + rollout (argo)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: web-slo-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      # Prometheus provider returns a result array; index the first value
      successCondition: result[0] < 0.01
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="web-canary",code=~"5.."}[1m])) /
            sum(rate(http_requests_total{job="web-canary"}[1m]))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: web
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - analysis: {templates: [{templateName: web-slo-check}]}
        - setWeight: 25
        - pause: {duration: 5m}
        - analysis: {templates: [{templateName: web-slo-check}]}
        - setWeight: 50
        - pause: {duration: 10m}
        - analysis: {templates: [{templateName: web-slo-check}]}
```

This wiring transforms alerts from FYIs into control signals. When the canary trips the SLO metric, the rollout pauses and pages the right owner with a rollback button. At one e-commerce client, this cut failed-deploy impact from 35 minutes of partial outage to under 8 minutes on average.
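The analysis loop reduces to simple logic. An illustrative sketch (a hypothetical `canary_verdict` helper mirroring `successCondition` and `failureLimit`, not Argo’s implementation):

```python
def canary_verdict(error_rates, threshold=0.01, failure_limit=3):
    """Evaluate per-interval canary error rates: fail the analysis
    once `failure_limit` measurements breach the threshold."""
    failures = 0
    for rate in error_rates:
        if not (rate < threshold):  # mirrors successCondition: result[0] < 0.01
            failures += 1
            if failures >= failure_limit:
                return "rollback"
    return "promote"
```

Three bad minutes out of a ten-minute analysis window is enough to roll back; a single blip is not, which keeps flaky canaries from thrashing deploys.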
A 30-day implementation plan that survives reality
You don’t need a platform rebuild. Ship this in four weeks:
- Week 1: SLOs and signals
  - Pick 2–3 critical user journeys; define 99% or 99.5% SLOs.
  - Implement multi-window burn rate rules and 2–3 saturation/backlog indicators per service.
  - Add `owner`, `service`, `env` labels via OpenTelemetry Collector or sidecars.
- Week 2: Routing and noise gates
  - Deploy Alertmanager grouping/inhibition; route `warning` to Slack, `critical` to PagerDuty.
  - Annotate alerts with `runbook_url`, `dashboard_url`, and known remediation.
  - Hold a 60-minute noise review; delete or fix the top 10 noisiest rules.
- Week 3: Triage bots and runbooks
  - Codify 3 high-confidence automations (scale consumer, restart stuck job, toggle feature flag via OpenFeature/LaunchDarkly).
  - Standardize runbook templates; link them in alerts. Version in Git and review via PRs.
- Week 4: Rollout gates and KPIs
  - Add Argo Rollouts/Flagger analysis to 1–2 high-risk services.
  - Start tracking: pages per eng per week, actionable rate, MTTA/MTTR, rollback lead time.
  - Run a gameday: inject failures (Gremlin/Chaos Mesh) and tune thresholds.
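The week-4 KPIs are worth computing mechanically rather than by gut feel. An illustrative sketch, assuming a hypothetical event shape pulled from your paging tool’s API:

```python
def alert_kpis(events, engineers, weeks=1):
    """Compute pages per engineer per week and the actionable rate,
    where 'actionable' means the alert led to a change or decision."""
    pages = [e for e in events if e.get("paged")]
    actionable = [e for e in pages if e.get("led_to_action")]
    return {
        "pages_per_eng_week": len(pages) / (engineers * weeks),
        "actionable_rate": len(actionable) / len(pages) if pages else 0.0,
    }
```

Review these numbers in the weekly noise retro; a falling actionable rate is the earliest sign that alert debt is creeping back in.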
What “good” looks like (and what to watch)
Outcomes we see after the first month when teams really commit:
- 50–70% fewer pages with the same or better incident detection.
- 25–40% lower MTTR because alerts kick off first-response steps and rollouts auto-pause.
- 30–50% higher actionable rate, measured as “alert led to a change or decision.”
- Rollback lead time under 5 minutes post-regression.
Watchouts I’ve learned the hard way:
- Don’t route by hostname. Route by `service` and `owner`, or you’ll wake the wrong team.
- Don’t alert on single thresholds; prefer rate-of-change and burn rates.
- Beware high-cardinality label explosions (user_id, session_id) in Prometheus; keep labels curated.
- Keep repeat intervals humane; flooding PagerDuty repeats destroys trust.
- Bake these rules into GitOps (ArgoCD/Flux) so drift doesn’t creep in.
You don’t fix alert fatigue with “Do Not Disturb.” You fix it by making alerts the actuator in your control loop.
Tools that play nice together
- Prometheus/Alertmanager for SLO burn and routing.
- OpenTelemetry to standardize signals and enrich with ownership.
- Argo Rollouts/Flagger for canary analysis and automatic pause/rollback.
- PagerDuty/Opsgenie for on-call scheduling and event orchestration.
- Backstage/OpsLevel for service ownership that drives routing.
- Grafana dashboards linked directly from alert annotations for one-click context.
If your stack’s different (Datadog, New Relic, Honeycomb, Spinnaker/Kayenta), the same principles apply. The glue is labels and queries that reflect the business, not the machines.
Key takeaways
- Alert on leading indicators tied to SLOs (burn rate, saturation, backlog), not vanity metrics.
- Use an alert routing graph: group, inhibit, and route by `service`, `env`, and `owner` labels.
- Enrich telemetry with ownership from your service catalog to route to the right team automatically.
- Tie alerts to automation: kick off runbooks and rollout gates (Argo Rollouts, Flagger) from alert context.
- Adopt multi-window, multi-burn SLO alerts to predict incidents with fewer false positives.
- Measure outcomes: pages/eng/week, actionable alert rate, MTTA/MTTR, rollback lead time.
Implementation checklist
- Define SLOs and compute multi-window burn rates for critical user journeys.
- Instrument saturation and backlog: queue depth, connection pool usage, throttle rate, GC pause time.
- Standardize labels: `service`, `env`, `owner`, `tier`, `region` across metrics/logs/traces.
- Implement Alertmanager grouping/inhibition; route by `owner` and `severity`.
- Add runbook URLs, dashboards, and rollback commands to alert annotations.
- Enable canary analysis with Prometheus queries that gate rollouts (Argo Rollouts/Flagger).
- Automate first-response triage for known failure modes (scale, restart, feature-flag off).
- Track and review alert KPIs weekly; delete or fix noisy rules.
Questions we hear from teams
- How do we pick SLOs if we don’t have great historical data?
- Start with your most critical user journeys (checkout, login, API POST /orders). Choose a conservative 99% target and instrument SLIs you can query today (error rate, latency p95/p99). Iterate monthly. The biggest win is moving to burn-rate style alerts, not picking the perfect number on day one.
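To make the burn-rate arithmetic concrete, a quick sketch (hypothetical helper names; the math itself is standard):

```python
def burn_threshold(slo, burn):
    """Error-rate threshold for a given burn factor: burn * (1 - slo).
    For a 99% SLO at 14x burn this is 0.14, i.e. the (0.01 * 14) term."""
    return burn * (1 - slo)

def hours_to_exhaustion(burn, window_days=30):
    """At a constant burn factor, the window's entire error budget is
    consumed in window / burn, independent of the SLO target."""
    return window_days * 24 / burn
```

A 14x burn exhausts a 30-day budget in roughly two days, which is why 14x over short windows pages immediately while 6x only warns.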
- Won’t automation cause more incidents if it acts on bad alerts?
- Scope automation to high-confidence remediations with clear rollbacks (scale a deployment, pause a rollout, toggle a feature flag). Gate actions behind multi-window conditions and short ‘for’ durations, and always notify the on-call with exactly what changed.
- We’re on Datadog/New Relic, not Prometheus—does this still apply?
- Yes. The principles are vendor-agnostic: compute burn rates, enrich with ownership, group and inhibit alerts, and wire notifications to triage and rollout APIs. Datadog monitors and event pipelines support the same patterns; so do New Relic workflows and Honeycomb triggers.
- How do we handle noisy flapping alerts during incidents?
- Use Alertmanager grouping and inhibition to collapse downstream symptoms. Increase group_wait to 30–60s to coalesce bursts. Add rate-of-change or consecutive evaluation windows so a single blip doesn’t page. After the incident, run a noise retro and fix or delete the loudest rules.
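The “consecutive evaluation windows” idea is easy to express in code; an illustrative sketch (the analogue of a Prometheus `for:` duration, not any library’s API):

```python
def should_fire(window_results, required=3):
    """Page only after `required` consecutive failing evaluations,
    so a single blip resets the streak instead of paging anyone."""
    streak = 0
    for failing in window_results:
        streak = streak + 1 if failing else 0
        if streak >= required:
            return True
    return False
```

Flapping alternates pass/fail and never builds a streak; a real regression fails every window and pages within `required` evaluations.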
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
