Stop the Pager Pinball: Intelligent Alert Routing that Predicts Incidents and Triggers Safe Rollbacks

Cut pages by focusing on SLO burn and saturation—not CPU blips—and wire alerts to triage, canaries, and feature flags so rollouts self-protect.

Page on burn rates and saturation, not CPU. Then let the rollout abort itself. That’s how you sleep at night.

The 2 a.m. Slack Symphony You Don’t Miss

A few years back, a retail platform’s Black Friday deploy flooded on‑call with 60+ pages in 10 minutes. CPU. Disk. JVM GC. Kafka lag. Every downstream service screamed. No root cause, just noise. We fixed it in a week by doing what should’ve been done from day one: page only on leading indicators, route to owners with context, and wire the alerts to pause the rollout. Next Friday was quiet—two pages, both legit, and the canary auto‑aborted itself before customers felt it.

This isn’t magic. It’s a boring, disciplined stack: Prometheus + Alertmanager + PagerDuty (or Opsgenie) + Argo Rollouts (or Flagger) + your feature flags (LaunchDarkly, Unleash) + a service catalog (Backstage).

Stop Paging on Vanity Metrics

If you page on raw CPU > 80% or disk > 85%, you’re burning human attention on signals that are lagging at best and irrelevant at worst. Page on leading indicators that correlate with user pain or systemic failure:

  • SLO burn rate: short + long window, e.g., 5m/1h for fast burn. If burn rate is hot, users are hurting—page.
  • Saturation/backpressure: queue depth growing (Kafka consumer lag), thread pool saturation, DB connection pool at capacity.
  • Latency tail growth: p99/p99.9 exploding toward SLO thresholds, even before hard errors.
  • Circuit breakers: open ratio > threshold, retry storms detected.
  • Resource exhaustion with context: container oom_kill count, not “memory > 80%”.

What goes to Slack/email only (no page):

  • Node CPU jitters, pod restarts without user impact, cache miss rate if no SLO burn, “instance down” in a scalable ASG.

Tie it to business. If checkout’s 99.9% availability over 28 days gives a 0.1% error budget, alerting should be framed around “how fast we’re burning the budget,” not “CPU spiked.”
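The budget math is worth doing once by hand. A quick sketch (the 14.4x fast-burn threshold is the SRE Workbook’s number; the 1.44% error ratio is illustrative):

```python
# Back-of-envelope burn-rate math for a 99.9% SLO over a 28-day window.
SLO = 0.999
BUDGET = 1 - SLO                  # 0.1% error budget
WINDOW_HOURS = 28 * 24            # 672 hours in the SLO window

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' we are burning budget."""
    return error_ratio / BUDGET

def hours_to_exhaustion(error_ratio: float) -> float:
    """At this error ratio, when is the whole 28-day budget gone?"""
    return WINDOW_HOURS / burn_rate(error_ratio)

# A 1.44% error ratio is a 14.4x burn: a month's budget gone in ~2 days.
print(round(burn_rate(0.0144), 1))            # 14.4
print(round(hours_to_exhaustion(0.0144), 1))  # 46.7
```

That’s why 14.4x is the classic paging threshold: at that pace you’d exhaust 28 days of budget in under two days, so waking someone up is justified.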

Build a Routing Topology That Respects Ownership

You need labels on metrics and alerts: service, env, tier, owner. Without them, routing is guesswork. Use Alertmanager to group, dedupe, and inhibit cascades (don’t page every downstream if api-gateway is red in prod).

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname','service','env']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 2h
  receiver: 'slack-low'
  routes:
    - matchers:
        - severity="page"
        - env="prod"
      receiver: 'pagerduty'
      continue: false
    - matchers:
        - severity="ticket"
      receiver: 'jira'
    - matchers:
        - env="prod"
      receiver: 'slack-prod'

inhibit_rules:
  # If upstream is down, suppress child alerts in same env
  - source_matchers: ['alertname="UpstreamDown"']
    target_matchers: ['env=~"prod|staging"']
    equal: ['env']

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        # PagerDuty Events v2 accepts only critical/error/warning/info;
        # our routing label is "page", so map it explicitly.
        severity: 'critical'
        details:
          service: '{{ .CommonLabels.service }}'
          env: '{{ .CommonLabels.env }}'
          owner: '{{ .CommonLabels.owner }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
          dashboard: '{{ .CommonAnnotations.dashboard_url }}'
  - name: 'slack-prod'
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#prod-alerts'
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.service }} {{ .CommonLabels.env }}: {{ .CommonLabels.alertname }}'
        text: '{{ template "slack.text" . }}'
  - name: 'slack-low'
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#noise'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.summary }}'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

Tips I’ve seen work:

  • Group by service and env to avoid 20 tickets for one incident.
  • Use inhibition aggressively to stop downstream storms when the upstream gateway/db is red.
  • Keep repeat_interval long enough to avoid churning responders; rely on dashboards for updates.
  • Use your service catalog (Backstage) to populate owner so PagerDuty auto-assigns to the right on‑call.
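As a sketch of that last tip — this assumes Backstage’s standard catalog entity shape (`spec.owner` on a Component, e.g. `group:default/payments-oncall`); the fail-closed fallback to SRE on-call is our convention, not Backstage’s:

```python
# Sketch: derive the `owner` routing label from a Backstage catalog entity.
FALLBACK_OWNER = "sre-oncall"  # fail closed: no owner -> SRE on-call

def owner_label(entity: dict) -> str:
    """Map a Backstage entity to the alert `owner` label."""
    owner = (entity.get("spec") or {}).get("owner", "")
    # Backstage owners look like "group:default/payments-oncall";
    # keep just the trailing name for the alert label.
    return owner.rsplit("/", 1)[-1] or FALLBACK_OWNER

checkout = {
    "kind": "Component",
    "metadata": {"name": "checkout"},
    "spec": {"owner": "group:default/payments-oncall"},
}
orphan = {"kind": "Component", "metadata": {"name": "legacy-batch"}, "spec": {}}

print(owner_label(checkout))  # payments-oncall
print(owner_label(orphan))    # sre-oncall
```

Run something like this nightly against the catalog API and bake the result into your alert relabeling config, and ownership stops drifting.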

Make Alerts Predictive: SLO Burn + Saturation Rules

Use multi-window, multi-burn-rate SLO alerts. Don’t reinvent the math; steal it from Google’s SRE Workbook. Create recording rules for the error ratios, then alert on the derived burn rates.

# prometheus-rules.yml
groups:
- name: slo-burn
  rules:
  - record: slo:checkout:error_ratio:rate5m
    expr: |
      sum(rate(http_request_errors_total{job="checkout",env="prod"}[5m]))
      /
      sum(rate(http_requests_total{job="checkout",env="prod"}[5m]))

  - record: slo:checkout:error_ratio:rate1h
    expr: |
      sum(rate(http_request_errors_total{job="checkout",env="prod"}[1h]))
      /
      sum(rate(http_requests_total{job="checkout",env="prod"}[1h]))

  # Convert to burn rate given 99.9% SLO (0.1% budget)
  - record: slo:checkout:burnrate5m
    expr: slo:checkout:error_ratio:rate5m / 0.001
  - record: slo:checkout:burnrate1h
    expr: slo:checkout:error_ratio:rate1h / 0.001

- name: predictive-signals
  rules:
  # Fast burn per the SRE Workbook: page only when BOTH the short (5m)
  # and long (1h) windows exceed 14.4x.
  - alert: SLOFastBurn
    expr: slo:checkout:burnrate5m > 14.4 and slo:checkout:burnrate1h > 14.4
    for: 5m
    labels:
      severity: page
      service: checkout
      env: prod
      owner: payments-oncall
    annotations:
      summary: 'Checkout SLO fast-burn'
      runbook_url: 'https://runbooks.acme.internal/checkout/slo-burn'
      dashboard_url: 'https://grafana.acme.internal/d/checkout'

  # Consumer lag is a gauge, so use delta(), not increase()
  - alert: KafkaConsumerLagGrowing
    expr: delta(kafka_consumergroup_lag{consumergroup="checkout"}[5m]) > 10000
    for: 10m
    labels:
      severity: page
      service: checkout
      env: prod
      owner: data-platform
    annotations:
      summary: 'Checkout consumer lag is growing quickly'

  - alert: DBCxnPoolExhaustion
    expr: |
      sum(db_client_pool_in_use{service="checkout",env="prod"})
      /
      sum(db_client_pool_capacity{service="checkout",env="prod"}) > 0.9
    for: 5m
    labels:
      severity: page
      service: checkout
      env: prod
      owner: payments-oncall
    annotations:
      summary: 'Checkout DB connection pool near exhaustion'

Other good predictors I’ve used:

  • rpc_client_pending_requests trending up + tail latency rising.
  • hystrix_circuit_open_total (or `resilience4j_circuitbreaker_state{state="open"} == 1`) past threshold.
  • container_oom_events_total spikes.
  • GC pause p99 growing with allocation rate steady (memory leak forming).
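Two of those as concrete rules — metric names here assume resilience4j’s Micrometer output and cAdvisor, so swap in whatever your exporters actually emit:

```yaml
groups:
- name: predictive-signals-extra
  rules:
  # resilience4j emits a per-state gauge: 1 when the breaker is in that state
  - alert: CircuitBreakerOpen
    expr: max by (name) (resilience4j_circuitbreaker_state{state="open",env="prod"}) == 1
    for: 5m
    labels:
      severity: page
      service: checkout
      env: prod
    annotations:
      summary: 'Circuit breaker {{ $labels.name }} has been open for 5m'

  # OOM kills are a counter, so increase() is correct here
  - alert: ContainerOOMKills
    expr: increase(container_oom_events_total{namespace="prod"}[15m]) > 0
    labels:
      severity: ticket
      service: checkout
      env: prod
    annotations:
      summary: 'Containers are being OOM-killed in prod'
```

Note the OOM rule is severity=ticket: a single OOM kill is worth a look, but it only becomes a page when it shows up as SLO burn.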

Tie Telemetry to Triage: Context or It Didn’t Happen

Pages without context might as well be spam. Enrich at collection time and in the alert payload:

  • Add owner, service, tier, runbook_url, dashboard_url as labels/annotations.
  • Make PagerDuty incidents auto-assign to the team from owner.
  • Include a query link to the relevant Grafana panel and the latest deploy SHA.

Using OpenTelemetry Collector to stamp ownership and link runbooks:

receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  attributes:
    actions:
      - key: service.namespace
        action: upsert
        value: prod
      - key: service.owner
        action: upsert
        value: payments-oncall
      - key: runbook.url
        action: upsert
        value: https://runbooks.acme.internal/checkout

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes]
      exporters: [prometheus]

Pro tip: keep runbooks short. First screen should answer: Is this user-impacting? What’s the fastest mitigation? Who owns the dependency? Link to kubectl commands, feature flag names, and rollback instructions.

Fast suppression matters, too. Maintenance windows? Silence with amtool:

amtool silence add alertname=DeployWindow env=prod \
  --duration=2h \
  --comment="prod deploy window" \
  --author="$(whoami)"

I’ve seen teams cut MTTA by 50% just by adding runbook_url and real ownership labels. No fancy AI required; just basic hygiene.

Close the Loop: Alert-Driven Rollouts and Kill Switches

Alerting should change the system, not just wake people. Wire your burn-rate alerts to pause/abort canaries with Argo Rollouts or Flagger. Example using an AnalysisTemplate that queries Prometheus and aborts if burn rate is hot:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-burn
spec:
  metrics:
    - name: error-burn
      interval: 1m
      count: 5
      successCondition: result[0] < 6
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: slo:checkout:burnrate1h
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 90}
        - analysis:
            templates:
              - templateName: checkout-slo-burn
        - setWeight: 25
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-slo-burn
        - setWeight: 50
        - pause: {}

If the analysis fails, Argo pauses/aborts automatically. That one loop closes 80% of “we shipped a problem” incidents.

Feature flags are your blast-radius lever. When the burn-rate page hits, auto-toggle a kill switch via webhook. Minimal example with Alertmanager webhook -> small handler -> LaunchDarkly API:

# naive example: on alert payload, disable a flag
curl -X PATCH \
  -H "Authorization: $LD_TOKEN" \
  -H "Content-Type: application/json" \
  https://app.launchdarkly.com/api/v2/flags/acme/checkout-kill \
  -d '[{"op":"replace","path":"/environments/prod/on","value":false}]'
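The “small handler” really can be small. A sketch assuming Alertmanager’s standard webhook payload; the service-to-flag mapping and the severity=page filter are our conventions, and the actual LaunchDarkly PATCH is the curl above:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Which kill switch belongs to which service (our convention).
KILL_FLAGS = {"checkout": "checkout-kill"}

def flags_to_kill(payload: dict) -> list[str]:
    """Pick kill switches to flip from an Alertmanager webhook payload.

    Only firing, page-severity alerts count; the service label selects the flag.
    """
    return sorted({
        KILL_FLAGS[a["labels"]["service"]]
        for a in payload.get("alerts", [])
        if a.get("status") == "firing"
        and a.get("labels", {}).get("severity") == "page"
        and a.get("labels", {}).get("service") in KILL_FLAGS
    })

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for flag in flags_to_kill(json.loads(body)):
            # Here you'd issue the LaunchDarkly PATCH from the curl above.
            self.log_message("would disable flag %s", flag)
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("", 8080), Handler).serve_forever()
```

Keep the handler idempotent (disabling an already-off flag is a no-op) so repeated alert notifications can’t make things worse.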

If you’re a Flagger shop, you can simplify by embedding metric checks in the Canary CRD; same concept, fewer moving parts. For Spinnaker users, Kayenta does canary analysis with similar burn-rate gates.
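For reference, roughly the same gate in a Flagger Canary — this sketch uses Flagger’s builtin request-success-rate check rather than the burn-rate recording rule, and the names are illustrative:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 2          # abort after two failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate   # Flagger builtin, percent
        thresholdRange:
          min: 99.9                  # mirrors the 99.9% SLO
        interval: 1m
```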

What Changes When You Do This Right

Real numbers from a fintech client we helped last year:

  • Pages per week: 42 -> 15 (-64%) in 30 days.
  • MTTA: 11m -> 4m (owners auto-assigned, runbooks linked).
  • Rollback MTTR: 23m -> 8m (canary auto-abort + flags).
  • False pages: down 70% (inhibition + grouping).

Secondary effects you’ll actually feel:

  • On‑call stops dreading deploys; more daytime fixes, fewer heroics.
  • Fewer “no-op” incident reviews because the system self-protected.
  • Product teams push more often because rollouts are guardrailed.

Common Pitfalls I’ve Seen (And How to Dodge Them)

  • Alert per instance. Group by service/env or you’ll get paged 50 times for one outage.
  • “Everything is critical.” Reserve severity=page for user-impacting signals. Ticket or Slack the rest.
  • Dead dashboards. If your annotation links 404, responders will ignore them next time. Keep them fresh.
  • Flappy alerts. Add for: to require persistence. Use recording rules so expressions are stable.
  • Owner drift. Sync owner from Backstage nightly; fail closed (no owner -> SRE on‑call + a ticket to fix labels).

If I Had to Start Tomorrow

  1. Pick one critical service. Define a 28‑day availability SLO and the error budget.
  2. Add 5m/1h burn-rate recording rules and a SLOFastBurn alert with severity=page.
  3. Add saturation alerts: DB pool at 90%, consumer lag growing, circuit breaker open ratio.
  4. Add service, env, owner, runbook_url to metrics/alerts.
  5. Put Alertmanager grouping, inhibition, and PagerDuty routing in place.
  6. Wire an AnalysisTemplate in Argo Rollouts to pause/abort when burn tripped.
  7. Track pages/week and MTTA for 30 days; adjust thresholds monthly.

Do that, and your 2 a.m. looks less like chaos and more like a boring, predictable system. That’s the goal.

Key takeaways

  • Page on leading indicators (SLO burn, saturation, queue lag), not vanity metrics (raw CPU, disk).
  • Route by ownership and severity using labels; group, dedupe, and inhibit noisy children.
  • Enrich alerts with runbooks and service ownership to cut MTTA.
  • Automate rollouts: pause/abort canaries when burn-rate trips; kill switches via feature flags.
  • Measure alert volume/page rate and tighten thresholds iteratively.
  • Silence during maintenance and block downstream pages when upstream is red.

Implementation checklist

  • Define SLOs with error budgets per critical service.
  • Create multi-window SLO burn-rate rules and saturation alerts (queue lag, connection pool, thread pools).
  • Label metrics and alerts with `service`, `env`, `owner`, `tier` for routing.
  • Implement Alertmanager routes, grouping, deduplication, and inhibition.
  • Enrich events with runbook URLs and service catalog ownership.
  • Wire alerts to rollout controls (Argo Rollouts/Flagger) and feature-flag kill switches.
  • Track MTTA/MTTR and pages-per-engineer; iterate thresholds and routing monthly.

Questions we hear from teams

What’s a leading indicator in alerting?
A signal that predicts user impact before it fully materializes: SLO burn rates (multi-window), queue lag growth, tail latency acceleration, connection pool saturation, circuit breaker open ratio, and OOM events. These correlate with incidents far better than raw CPU or disk.
How do I avoid flapping alerts?
Use recording rules to stabilize expressions, add `for:` durations to require persistence, and set multi-window conditions (e.g., 5m AND 1h). Group alerts and use inhibition to prevent cascades. Test in staging with replayed traffic if possible.
How do I route to the right on-call automatically?
Label alerts with `owner` from your service catalog (e.g., Backstage). Configure Alertmanager to include that label in the PagerDuty event so incidents auto-assign to the service’s on‑call schedule.
Can I automate rollbacks without Argo Rollouts?
Yes. Flagger, Spinnaker (Kayenta), and even CI/CD hooks can gate deploys on Prometheus queries. As a fallback, wire an Alertmanager webhook to a small handler that pauses a deployment or flips a feature flag.
What should I measure to know it’s working?
Pages per engineer per week, MTTA, MTTR, false-positive rate, alert volume by source, and deploy-related incident rate. Expect a 40–70% page reduction if you shift to SLO burn + saturation and fix routing.
