We Cut MTTD From 14 Minutes to 90 Seconds by Alerting on What Fails Next, Not What Looks Pretty Now

Stop paging on dashboards. Start predicting breakage. Here’s the telemetry, configs, and rollout glue that actually reduces mean time to detection.


The outage you didn’t see coming

Two Black Fridays ago, a retailer called us when their JVM fleet kept faceplanting every 30–40 minutes. Dashboards were all green until the moment the cart API went 500-happy. The actual cause? Container CPU throttling spiked during checkout traffic bursts, GC pauses jumped, Kafka lag climbed, and then the API fell off a cliff. Their alerts were pretty (APM scorecards, CPU averages), but none predicted the cliff. We flipped the playbook: alert on leading indicators and tie those alerts directly to rollouts and ownership. MTTD dropped from ~14 minutes (customer tweets) to ~90 seconds (automated detection + canary rollback).

Stop alerting on vanity metrics

If you’re paging on average CPU, request count, or single-node disk utilization, you’re alerting on vibes. Those are debugging signals, not detectors.

Alert fatigue comes from:

  • Lagging indicators: 5xx rate after the blast radius is big.
  • Averages: hide p99 latency and tail risk.
  • Detached signals: alerts with no deployment/version context.

What works:

  • Saturation: throttling, backlog, pool exhaustion.
  • Error-budget burn: the earliest customer impact story that matters.
  • Correlation to rollout: tie alerts to version, rollout_id, and service so you can auto-rollback or auto-route.

Leading indicators that actually predict incidents

Here are the ones that have paid rent at scale (Kubernetes + microservices + Kafka + Postgres):

  • CPU throttling ratio: rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_usage_seconds_total[5m]) > 0.2 predicts latency cliffs in JVM, Node, and Python under burst.
  • Queue backlog / consumer lag: Kafka kafka_consumergroup_lag or RabbitMQ queue_messages_ready rising faster than consumers can drain.
  • Connection pool saturation: Postgres active backends (from pg_stat_activity) / max_connections > 0.85, or driver-level pool saturation metrics.
  • Garbage collection pressure: increase(jvm_gc_pause_seconds_sum[5m]) and heap occupancy rising with allocation spikes.
  • p99 latency SLI: tail latency moves first; don’t page on averages.
  • Node resource pressure: kube_node_status_condition{condition="DiskPressure",status="true"} and inode exhaustion precede eviction storms.
  • SLO burn rate (multi-window): catch real customer pain early without flapping.
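The burn-rate bullets reduce to simple arithmetic on the error budget. A minimal sketch (function names are illustrative):

```javascript
// Burn rate = (observed error ratio) / (error budget).
// Example: a 99% availability SLO leaves a 1% error budget; burning it
// 14x faster than allowed exhausts a 30-day budget in roughly 2 days.

// Error-ratio threshold for an alert window at a given burn factor.
function burnThreshold(errorBudget, burnFactor) {
  return errorBudget * burnFactor;
}

// How many hours the budget survives at a sustained burn factor.
function hoursToExhaust(budgetDays, burnFactor) {
  return (budgetDays * 24) / burnFactor;
}

// The SLOBurnFast rule below pages at burnThreshold(0.01, 14),
// i.e. roughly a 14% 5xx ratio over 5m; at that pace,
// hoursToExhaust(30, 14) is about 51 hours of budget left.
```

The 5m/14x and 1h/6x pairs in the rules that follow are the standard multi-window pattern; tune the factors to your own budget and windows.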

Here are PrometheusRule (Prometheus Operator) examples that catch trouble before customers do:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: leading-indicators
  labels:
    team: payments
spec:
  groups:
  - name: saturation.rules
    rules:
    - alert: HighCPUThrottling
      expr: |
        sum by (pod,container,namespace,service,version) (
          rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
          / clamp_min(rate(container_cpu_usage_seconds_total{container!=""}[5m]), 0.001)
        ) > 0.2
      for: 2m
      labels:
        severity: predicted
        service: cart-api
      annotations:
        summary: "CPU throttling > 20% ({{ $labels.pod }})"
        runbook_url: https://runbooks.example.com/cpu-throttling
        rollout_id: "{{ $labels.rollout_id }}"
    - alert: KafkaConsumerLagGrowing
      expr: |
        sum by (consumergroup,topic,service,version) (delta(kafka_consumergroup_lag[5m])) > 100
      for: 5m
      labels:
        severity: predicted
      annotations:
        summary: "Kafka lag growing for {{ $labels.consumergroup }}"
        runbook_url: https://runbooks.example.com/kafka-lag
  - name: slo.rules
    rules:
    - record: job:http_error_ratio
      expr: |
        sum(rate(http_server_requests_seconds_count{status=~"5..",job="cart"}[5m]))
        /
        sum(rate(http_server_requests_seconds_count{job="cart"}[5m]))
    - alert: SLOBurnFast
      expr: job:http_error_ratio > (0.01 * 14)  # 1% SLO, 14x burn over 5m
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "Fast SLO burn (5m) for cart"
        runbook_url: https://runbooks.example.com/slo-burn
    - alert: SLOBurnSlow
      expr: avg_over_time(job:http_error_ratio[1h]) > (0.01 * 6) # 6x over 1h
      for: 15m
      labels:
        severity: page
      annotations:
        summary: "Sustained SLO burn (1h) for cart"
        runbook_url: https://runbooks.example.com/slo-burn
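The connection-pool indicator from the list above fits the same pattern. A sketch, assuming postgres_exporter metric names (`pg_stat_activity_count`, `pg_settings_max_connections`); adjust for your exporter:

```yaml
# Append to spec.groups in the PrometheusRule above
- name: saturation.pg.rules
  rules:
  - alert: PgConnectionPoolSaturation
    expr: |
      sum(pg_stat_activity_count{state="active"})
        / max(pg_settings_max_connections)
      > 0.85
    for: 3m
    labels:
      severity: predicted
    annotations:
      summary: "Postgres active connections > 85% of max_connections"
      runbook_url: https://runbooks.example.com/pg-pool
```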

Add deployment metadata to your metrics via labels or exemplars. With an OpenTelemetry Collector:

receivers:
  otlp:
    protocols:
      http:
exporters:
  otlphttp:
    endpoint: https://api.honeycomb.io
    headers: { "x-honeycomb-team": "${env:HONEYCOMB_KEY}" }
processors:
  resource:
    attributes:
    # assumes an upstream k8sattributes processor (not shown) populated these
    - key: service.version
      from_attribute: k8s.deployment.version
      action: upsert
    - key: deployment.rollout_id
      from_attribute: k8s.rollout.uid
      action: upsert
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]

Now every alert and span points to the rollout that likely caused it.
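Getting those k8s.* attributes populated in the first place can be as simple as stamping pod labels into env vars via the Downward API. A sketch; the label keys depend on your CD tooling (`rollouts-pod-template-hash` is what Argo Rollouts sets):

```yaml
# Deployment pod template: expose version/rollout labels to the app,
# then let the OTel SDK pick them up via OTEL_RESOURCE_ATTRIBUTES.
env:
- name: APP_VERSION
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['app.kubernetes.io/version']
- name: ROLLOUT_ID
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['rollouts-pod-template-hash']
- name: OTEL_RESOURCE_ATTRIBUTES
  value: "service.version=$(APP_VERSION),deployment.rollout_id=$(ROLLOUT_ID)"
```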

Wire telemetry to triage: alerts that do real work

Don’t send raw Prometheus alerts to Slack and hope humans sort it out. Route and reduce.

  1. Classify severity by predictiveness: severity=predicted (leading indicators) → Slack + Jira + feature flag disable; severity=page (confirmed SLO burn) → PagerDuty.
  2. Enrich with runbook and owner: include service, team, oncall, rollout_id, dashboard_url.
  3. Automate routing: Event orchestration that maps service=cart-api to the right on-call and auto-tags the incident with the rollout.

Alertmanager example:

route:
  receiver: default
  routes:
  - matchers:
    - severity = "predicted"
    receiver: slack-predicted
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 2h
  - matchers:
    - severity = "page"
    receiver: pagerduty
    group_wait: 10s
    group_interval: 1m
    repeat_interval: 1h
receivers:
- name: slack-predicted
  slack_configs:
  - channel: '#reliability'
    title: '{{ template "slack.title" . }}'
    text: |
      *{{ .CommonLabels.alertname }}* {{ .CommonAnnotations.summary }}
      service={{ .CommonLabels.service }} version={{ .CommonLabels.version }} rollout={{ .CommonAnnotations.rollout_id }}
      runbook={{ .CommonAnnotations.runbook_url }}
- name: pagerduty
  pagerduty_configs:
  - routing_key: ${PAGERDUTY_KEY}
    severity: 'critical'
    class: '{{ .CommonLabels.service }}'
    component: '{{ .CommonLabels.service }}'
    details:
      version: '{{ .CommonLabels.version }}'
      rollout: '{{ .CommonAnnotations.rollout_id }}'

PagerDuty Event Orchestration can auto-assign based on service and attach a Slack war-room. Conceptually (rule syntax abridged; check the Event Orchestration docs for the exact condition grammar):

{
  "label": "Route cart rollout incidents",
  "conditions": [
    { "expression": "event.custom_details.rollout exists and event.class matches 'cart-api'" }
  ],
  "actions": {
    "route_to": "cart-primary",
    "annotate": "war-room: #inc-cart"
  }
}

The point: triage shouldn’t require a human to read five dashboards to figure out who owns the mess.

Close the loop: rollout automation and safe rollback

If your detector fires and a human still has to click through a runbook to undo a bad deploy, you’ve left minutes on the floor. Use canary analysis with automatic rollback.

Argo Rollouts example: canary pauses while Prometheus metrics stay healthy; rollback on failure.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: cart-api
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 120}
      - analysis:
          templates:
          - templateName: prom-sli
          args:
          - name: version
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 50
      - pause: {duration: 180}
      - analysis:
          templates:
          - templateName: prom-sli
          args:
          - name: version
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 100
  template:
    metadata:
      labels:
        app: cart-api
    spec:
      containers:
      - name: cart
        image: ghcr.io/org/cart:1.28.3
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: prom-sli
spec:
  args:
  - name: version
  metrics:
  - name: error-ratio
    interval: 60s
    successCondition: result[0] < 0.01
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        # the "rollout" label is assumed to carry the canary pod-template-hash via relabeling
        query: |
          sum(rate(http_server_requests_seconds_count{job="cart",status=~"5..",rollout="{{args.version}}"}[2m]))
          /
          sum(rate(http_server_requests_seconds_count{job="cart",rollout="{{args.version}}"}[2m]))
  - name: p99-latency
    interval: 60s
    successCondition: result[0] < 0.350
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: |
          histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{job="cart",rollout="{{args.version}}"}[2m])))

Prefer Flagger? Same idea with Istio/Linkerd and threshold-based rollback. The result: push risky changes at 10% traffic, bail automatically if error or latency moves past tight SLO guardrails.
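For reference, a comparable Flagger Canary sketch, assuming a mesh provider is already configured; thresholds mirror the Argo analysis above and use Flagger's built-in metrics:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: cart-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cart-api
  service:
    port: 8080
  analysis:
    interval: 60s
    threshold: 2          # abort + rollback after 2 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate   # built-in: percentage of non-5xx
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration       # built-in: p99 latency in ms
      thresholdRange:
        max: 350
      interval: 1m
```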

Example: shipping a risky JVM upgrade without pager fatigue

We helped a fintech move from Java 11 to 21 on a latency-sensitive service. Historically, they’d roll to 100% and pray. This time we:

  • Instrumented throttling, GC pause, p99 latency, and Postgres pool saturation.
  • Set severity=predicted alerts on throttling > 15% and GC pauses > 300ms/5m.
  • Built multi-window burn on the API SLI (0.5% error budget/day target).
  • Used Argo Rollouts canary (10% → 50% → 100%), auto-rollback on failure.
  • Wired PagerDuty only on SLO burn; predicted signals went to Slack + Jira with assignee pre-set.

Results over two weeks:

  • MTTD: 14m → 1.5m (median), 45s when the canary tripped automatically.
  • False pages: down 62%; predicted alerts were 80% “action-only” (no human page).
  • Deploy velocity: from 1/day to 6/day on that service.

We found an -XX:MaxRAMPercentage regression at the 10% canary step, rolled back automatically, tuned CPU requests to cut throttling, and shipped the upgrade the next day.

Build it in 30 days without boiling the ocean

You don’t need a platform team of 40. Do this in four sprints:

  1. Week 1: SLOs + SLIs + labels
    • Define one SLI per top-3 customer journey (availability or latency).
    • Add service, version, rollout_id labels to metrics/traces.
    • Stand up synthetic probes from the edge (e.g., cloudprober).
probe {
  name: "checkout"
  type: HTTP
  targets { host_names: "shop.example.com" }
  interval_msec: 15000
  timeout_msec: 3000
  http_probe {
    protocol: HTTPS
    relative_url: "/checkout"
    method: GET
  }
}
  2. Week 2: Leading indicators + burn rate

    • Add Prometheus rules for throttling, lag, pool saturation, p99.
    • Implement multi-window burn (5m/1h) for each SLO.
  3. Week 3: Triage automation

    • Alertmanager routes for predicted vs page.
    • PagerDuty Event Orchestration mapping service → escalation.
    • Enrich alerts with runbook links and dashboards.
  4. Week 4: Rollout guardrails

    • Argo Rollouts or Flagger with Prometheus analysis.
    • Feature flag killswitch for expensive paths (e.g., LaunchDarkly).
// Node + LaunchDarkly: degrade before you die
const flag = await ldClient.variation('use-new-recommender', user, false);
if (!flag || circuitBreaker.tripped()) {
  return cachedRecommendations(); // graceful degradation
}
return liveRecommendations();
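The `circuitBreaker` in the snippet above is left abstract; a minimal rolling-error-rate breaker could look like this (illustrative, not a specific library's API):

```javascript
// Minimal error-rate circuit breaker: trips when the recent failure
// ratio crosses a threshold, and half-opens after a cooldown.
class CircuitBreaker {
  constructor({ threshold = 0.5, windowSize = 20, cooldownMs = 30000 } = {}) {
    this.threshold = threshold;
    this.windowSize = windowSize;
    this.cooldownMs = cooldownMs;
    this.results = [];      // most recent call outcomes (true = failure)
    this.trippedAt = null;
  }

  record(failed) {
    this.results.push(failed);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter(Boolean).length;
    if (this.results.length === this.windowSize &&
        failures / this.windowSize >= this.threshold) {
      this.trippedAt = Date.now();
    }
  }

  tripped() {
    if (this.trippedAt === null) return false;
    if (Date.now() - this.trippedAt > this.cooldownMs) {
      // half-open: clear state and let traffic probe the live path again
      this.trippedAt = null;
      this.results = [];
      return false;
    }
    return true;
  }
}
```

Call `record(true/false)` after each live-path request; `tripped()` then gates the fallback exactly as in the flag check above.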

Ship each week; don’t wait for perfection.

What we’d do differently (and what you can avoid)

  • Don’t combine predicted and paging alerts into one queue. Keep your SLO pages scarce.
  • Resist the urge to alert on every metric. If it’s not tied to action, it’s a dashboard.
  • Keep thresholds tight and windows short for canary analysis; expand for steady-state.
  • Surface rollout metadata everywhere—traces, logs, metrics—and make it clickable.
  • Test rollbacks during business hours with chaos drills; don’t wait for an actual fire.

If your system needs a human to notice a spike and a Slack thread to decide what to do, you’re minutes—sometimes millions—late. Make the system decide for the obvious cases.


Key takeaways

  • Alert on leading indicators like saturation, queue depth, and error-budget burn—not on dashboards or vanity metrics.
  • Attach deployment/version context to every signal so triage routes to the right humans and systems automatically.
  • Use multi-window SLO burn-rate alerts to avoid noise while catching real customer impact early.
  • Close the loop with automated rollbacks via Argo Rollouts or Flagger when indicators cross thresholds.
  • Keep alerts small, annotated, and actionable; everything else is a chart, not a page.

Implementation checklist

  • Define SLIs/SLOs and compute burn rate with two windows (e.g., 5m + 1h).
  • Instrument leading indicators: CPU throttling, GC pause, queue lag, connection pool saturation, p99 latency.
  • Add rollout metadata to telemetry: service, version, commit, rollout_id.
  • Create Alertmanager routes for predicted incidents vs. customer-impacting ones.
  • Automate triage with PagerDuty Event Orchestration or Opsgenie rules.
  • Enable canary analysis and rollback using Argo Rollouts or Flagger.
  • Run synthetic checks from the edge to catch DNS/TLS/CDN early.

Questions we hear from teams

Why not just alert on 5xx rate?
Because it’s a lagging indicator—customers are already impacted. Use 5xx rate as part of your SLO burn rate to page humans, and rely on leading indicators (throttling, queue backlog, p99 latency) to predict and auto-mitigate before customers feel it.
Isn’t this going to spam my team with alerts?
Not if you split signals: predicted alerts route to Slack/Jira with automation, while only SLO burn rate pages humans. Multi-window burn rules dramatically cut false pages while catching real impact early.
We’re on Datadog/New Relic/Honeycomb. Do we need Prometheus?
No. The strategy is tooling-agnostic. Datadog has anomaly and composite monitors, Honeycomb has SLOs and Burn Alerts, and New Relic supports NRQL-based multi-window logic. We show Prometheus because it’s easy to demo and widely used.
How do we add rollout metadata to metrics?
Inject labels via your metrics SDK or OpenTelemetry `resource` processor. In Kubernetes, annotate deployments with version and rollout IDs and propagate them to metrics/traces via environment variables or OTEL resource detectors.
What if automated rollback makes things worse?
Scope automation to canaries and well-understood metrics, keep failure limits low, and test during business hours with chaos drills. Argo Rollouts and Flagger both support conservative step-ups and quick aborts.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about cutting your MTTD
Get our SLO burn-rate rule templates
