We Cut MTTD From 14 Minutes to 90 Seconds by Alerting on What Fails Next, Not What Looks Pretty Now
Stop paging on dashboards. Start predicting breakage. Here’s the telemetry, configs, and rollout glue that actually reduces mean time to detection.
Alert on what fails next, not what looks pretty now.
The outage you didn’t see coming
Two Black Fridays ago, a retailer called us when their JVM fleet kept faceplanting every 30–40 minutes. Dashboards were all green until the moment the cart API went 500-happy. The actual cause? Container CPU throttling spiked during checkout traffic bursts, GC pauses jumped, Kafka lag climbed, and then the API fell off a cliff. Their alerts were pretty (APM scorecards, CPU averages), but none predicted the cliff. We flipped the playbook: alert on leading indicators and tie those alerts directly to rollouts and ownership. MTTD dropped from ~14 minutes (customer tweets) to ~90 seconds (automated detection + canary rollback).
Stop alerting on vanity metrics
If you’re paging on average CPU, request count, or single-node disk utilization, you’re alerting on vibes. Those are debugging signals, not detectors.
Alert fatigue comes from:
- Lagging indicators: 5xx rate after the blast radius is big.
- Averages: hide p99 latency and tail risk.
- Detached signals: alerts with no deployment/version context.
What works:
- Saturation: throttling, backlog, pool exhaustion.
- Error-budget burn: the earliest customer impact story that matters.
- Correlation to rollout: tie alerts to `version`, `rollout_id`, and `service` so you can auto-rollback or auto-route.
Leading indicators that actually predict incidents
Here are the ones that have paid rent at scale (Kubernetes + microservices + Kafka + Postgres):
- CPU throttling ratio: `rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_usage_seconds_total[5m])` > 0.2 predicts latency cliffs in JVM, Node, and Python under burst.
- Queue backlog / consumer lag: Kafka `kafka_consumergroup_lag` or RabbitMQ `queue_messages_ready` rising faster than consumers can drain.
- Connection pool saturation: Postgres `pg_stat_activity` active / `max_connections` > 0.85, or driver-level pool saturation metrics.
- Garbage collection pressure: `increase(jvm_gc_pause_seconds_sum[5m])` and heap occupancy rising with allocation spikes.
- p99 latency SLI: tail latency moves first; don't page on averages.
- Node resource pressure: `kube_node_status_condition{condition="DiskPressure",status="true"}` and inode exhaustion precede eviction storms.
- SLO burn rate (multi-window): catch real customer pain early without flapping.
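As a sketch of the pool-saturation check, here is one way to express it in PromQL. This is an illustrative rule fragment, assuming the standard postgres_exporter metric names (`pg_stat_activity_count`, `pg_settings_max_connections`); adjust to whatever your exporter actually emits:

```yaml
# Hypothetical rule fragment; metric names assume postgres_exporter defaults.
- alert: PgPoolNearExhaustion
  expr: |
    sum by (instance) (pg_stat_activity_count{state="active"})
      / on (instance) pg_settings_max_connections
    > 0.85
  for: 3m
  labels:
    severity: predicted
  annotations:
    summary: "Postgres connections > 85% of max ({{ $labels.instance }})"
```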
Prometheus `PrometheusRule` examples that catch trouble before customers do:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: leading-indicators
  labels:
    team: payments
spec:
  groups:
    - name: saturation.rules
      rules:
        - alert: HighCPUThrottling
          expr: |
            sum by (pod,container,namespace,service,version) (
              rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
              / clamp_min(rate(container_cpu_usage_seconds_total{container!=""}[5m]), 0.001)
            ) > 0.2
          for: 2m
          labels:
            severity: predicted
            service: cart-api
          annotations:
            summary: "CPU throttling > 20% ({{ $labels.pod }})"
            runbook_url: https://runbooks.example.com/cpu-throttling
            rollout_id: "{{ $labels.rollout_id }}"
        - alert: KafkaConsumerLagGrowing
          # deriv() (not rate()) because lag is a gauge; fires when lag grows >100 msgs/s
          expr: |
            sum by (consumergroup,topic,service,version) (deriv(kafka_consumergroup_lag[5m])) > 100
          for: 5m
          labels:
            severity: predicted
          annotations:
            summary: "Kafka lag growing for {{ $labels.consumergroup }}"
            runbook_url: https://runbooks.example.com/kafka-lag
    - name: slo.rules
      rules:
        - record: job:http_error_ratio
          expr: |
            sum(rate(http_server_requests_seconds_count{status=~"5..",job="cart"}[5m]))
            /
            sum(rate(http_server_requests_seconds_count{job="cart"}[5m]))
        - alert: SLOBurnFast
          expr: job:http_error_ratio > (0.01 * 14)  # 1% SLO, 14x burn over 5m
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Fast SLO burn (5m) for cart"
            runbook_url: https://runbooks.example.com/slo-burn
        - alert: SLOBurnSlow
          expr: avg_over_time(job:http_error_ratio[1h]) > (0.01 * 6)  # 6x over 1h
          for: 15m
          labels:
            severity: page
          annotations:
            summary: "Sustained SLO burn (1h) for cart"
            runbook_url: https://runbooks.example.com/slo-burn

Add deployment metadata to your metrics via labels or exemplars. With OpenTelemetry:
receivers:
  otlp:
    protocols:
      http:
exporters:
  otlphttp:
    endpoint: https://api.honeycomb.io
    headers: { "x-honeycomb-team": "${HONEYCOMB_KEY}" }
processors:
  resource:
    attributes:
      - key: service.version
        from_attribute: k8s.deployment.version
        action: upsert
      - key: deployment.rollout_id
        from_attribute: k8s.rollout.uid
        action: upsert
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource]
      exporters: [otlphttp]

Now every alert and span points to the rollout that likely caused it.
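On the Kubernetes side, one way to feed those attributes to the SDK is the downward API plus the standard `OTEL_RESOURCE_ATTRIBUTES` env var. A minimal sketch, assuming you stamp a `version` label on the pod template (label and service names are illustrative):

```yaml
# Sketch: propagate deployment labels into OTel resource attributes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cart-api
spec:
  selector:
    matchLabels: { app: cart-api }
  template:
    metadata:
      labels:
        app: cart-api
        version: "1.28.3"
    spec:
      containers:
        - name: cart
          image: ghcr.io/org/cart:1.28.3
          env:
            # Downward API: read the pod's own "version" label
            - name: APP_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['version']
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "service.name=cart-api,service.version=$(APP_VERSION)"
```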
Wire telemetry to triage: alerts that do real work
Don’t send raw Prometheus alerts to Slack and hope humans sort it out. Route and reduce.
- Classify severity by predictiveness: `severity=predicted` (leading indicators) → Slack + Jira + feature flag disable; `severity=page` (confirmed SLO burn) → PagerDuty.
- Enrich with runbook and owner: include `service`, `team`, `oncall`, `rollout_id`, `dashboard_url`.
- Automate routing: event orchestration that maps `service=cart-api` to the right on-call and auto-tags the incident with the rollout.
Alertmanager example:
route:
  receiver: default
  routes:
    - matchers:
        - severity = "predicted"
      receiver: slack-predicted
      group_wait: 30s
      group_interval: 2m
      repeat_interval: 2h
    - matchers:
        - severity = "page"
      receiver: pagerduty
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 1h
receivers:
  - name: slack-predicted
    slack_configs:
      - channel: '#reliability'
        title: '{{ template "slack.title" . }}'
        text: |
          *{{ .CommonLabels.alertname }}* {{ .CommonAnnotations.summary }}
          service={{ .CommonLabels.service }} version={{ .CommonLabels.version }} rollout={{ .CommonAnnotations.rollout_id }}
          runbook={{ .CommonAnnotations.runbook_url }}
  - name: pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_KEY}
        severity: 'critical'
        class: '{{ .CommonLabels.service }}'
        component: '{{ .CommonLabels.service }}'
        details:
          version: '{{ .CommonLabels.version }}'
          rollout: '{{ .CommonAnnotations.rollout_id }}'

PagerDuty Event Orchestration can auto-assign based on service and attach a Slack war-room:
{
  "conditions": [{
    "expression": "details.rollout != '' && class == 'cart-api'",
    "actions": {
      "routes": [{"id": "cart-primary"}],
      "annotations": {"slack_channel": "#inc-cart"}
    }
  }]
}

The point: triage shouldn’t require a human to read five dashboards to figure out who owns the mess.
Close the loop: rollout automation and safe rollback
If your detector fires and a human still has to click through a runbook to undo a bad deploy, you’ve left minutes on the floor. Use canary analysis with automatic rollback.
Argo Rollouts example: canary pauses while Prometheus metrics stay healthy; rollback on failure.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: cart-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: prom-sli
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 50
        - pause: {duration: 180}
        - analysis:
            templates:
              - templateName: prom-sli
            args:
              - name: version
                valueFrom:
                  podTemplateHashValue: Latest
        - setWeight: 100
  template:
    metadata:
      labels:
        app: cart-api
    spec:
      containers:
        - name: cart
          image: ghcr.io/org/cart:1.28.3
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: prom-sli
spec:
  args:
    - name: version
  metrics:
    - name: error-ratio
      interval: 60s
      successCondition: result < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_server_requests_seconds_count{job="cart",status=~"5..",version="{{args.version}}"}[2m]))
            /
            sum(rate(http_server_requests_seconds_count{job="cart",version="{{args.version}}"}[2m]))
    - name: p99-latency
      interval: 60s
      successCondition: result < 0.350
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{job="cart",version="{{args.version}}"}[2m])))

Prefer Flagger? Same idea with Istio/Linkerd and threshold-based rollback. The result: push risky changes at 10% traffic, bail automatically if error or latency moves past tight SLO guardrails.
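For comparison, a minimal Flagger sketch of the same guardrail, using Flagger's built-in `request-success-rate` and `request-duration` checks (service names and thresholds here are illustrative, not a drop-in config):

```yaml
# Sketch: Flagger canary with automatic abort on SLI regression.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: cart-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cart-api
  service:
    port: 8080
  analysis:
    interval: 60s
    threshold: 2        # abort and roll back after 2 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate   # Flagger built-in, percent
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration       # Flagger built-in, milliseconds
        thresholdRange:
          max: 350
        interval: 1m
```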
Example: shipping a risky JVM upgrade without pager fatigue
We helped a fintech move from Java 11 to 21 on a latency-sensitive service. Historically, they’d roll to 100% and pray. This time we:
- Instrumented throttling, GC pause, p99 latency, and Postgres pool saturation.
- Set `severity=predicted` alerts on throttling > 15% and GC pauses > 300ms/5m.
- Built multi-window burn on the API SLI (0.5% error budget/day target).
- Used Argo Rollouts canary (10% → 50% → 100%), auto-rollback on failure.
- Wired PagerDuty only on SLO burn; predicted signals went to Slack + Jira with assignee pre-set.
Results over two weeks:
- MTTD: 14m → 1.5m (median), 45s when the canary tripped automatically.
- False pages: down 62%; predicted alerts were 80% “action-only” (no human page).
- Deploy velocity: from 1/day to 6/day on that service.
We found an `-XX:MaxRAMPercentage` regression at 10%, rolled back automatically, tuned CPU requests to cut throttling, and shipped the upgrade the next day.
Build it in 30 days without boiling the ocean
You don’t need a platform team of 40. Do this in four sprints:
- Week 1: SLOs + SLIs + labels
  - Define one SLI per top-3 customer journey (availability or latency).
  - Add `service`, `version`, `rollout_id` labels to metrics/traces.
  - Stand up synthetic probes from the edge (e.g., `cloudprober`).

probe {
  name: "checkout"
  type: HTTP
  targets { host_names: "shop.example.com" }
  interval_msec: 15000
  timeout_msec: 3000
  http_probe {
    protocol: HTTPS
    relative_url: "/checkout"
    method: GET
  }
}

- Week 2: Leading indicators + burn rate
- Add Prometheus rules for throttling, lag, pool saturation, p99.
- Implement multi-window burn (5m/1h) for each SLO.
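One common refinement of the multi-window pattern is to require both windows to burn before paging, which kills flapping. A hedged sketch reusing the `job:http_error_ratio` recording rule from earlier (the 14.4x factor follows the Google SRE workbook convention; tune to your budget):

```yaml
# Sketch: page only when both the 5m and 1h windows exceed the burn threshold.
- alert: SLOBurnMultiWindow
  expr: |
    (job:http_error_ratio > (14.4 * 0.01))
    and
    (avg_over_time(job:http_error_ratio[1h]) > (14.4 * 0.01))
  for: 2m
  labels:
    severity: page
```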
- Week 3: Triage automation
  - Alertmanager routes for `predicted` vs `page`.
  - PagerDuty Event Orchestration mapping `service` → escalation.
  - Enrich alerts with runbook links and dashboards.
- Week 4: Rollout guardrails
  - Argo Rollouts or Flagger with Prometheus analysis.
  - Feature flag killswitch for expensive paths (e.g., LaunchDarkly).

// Node + LaunchDarkly: degrade before you die
const flag = await ldClient.variation('use-new-recommender', user, false);
if (!flag || circuitBreaker.tripped()) {
  return cachedRecommendations(); // graceful degradation
}
return liveRecommendations();

Ship each week; don’t wait for perfection.
What we’d do differently (and what you can avoid)
- Don’t combine predicted and paging alerts into one queue. Keep your SLO pages scarce.
- Resist the urge to alert on every metric. If it’s not tied to action, it’s a dashboard.
- Keep thresholds tight and windows short for canary analysis; expand for steady-state.
- Surface rollout metadata everywhere—traces, logs, metrics—and make it clickable.
- Test rollbacks during business hours with chaos drills; don’t wait for an actual fire.
If your system needs a human to notice a spike and a Slack thread to decide what to do, you’re minutes—sometimes millions—late. Make the system decide for the obvious cases.
Key takeaways
- Alert on leading indicators like saturation, queue depth, and error-budget burn—not on dashboards or vanity metrics.
- Attach deployment/version context to every signal so triage routes to the right humans and systems automatically.
- Use multi-window SLO burn-rate alerts to avoid noise while catching real customer impact early.
- Close the loop with automated rollbacks via Argo Rollouts or Flagger when indicators cross thresholds.
- Keep alerts small, annotated, and actionable; everything else is a chart, not a page.
Implementation checklist
- Define SLIs/SLOs and compute burn rate with two windows (e.g., 5m + 1h).
- Instrument leading indicators: CPU throttling, GC pause, queue lag, connection pool saturation, p99 latency.
- Add rollout metadata to telemetry: service, version, commit, rollout_id.
- Create Alertmanager routes for predicted incidents vs. customer-impacting ones.
- Automate triage with PagerDuty Event Orchestration or Opsgenie rules.
- Enable canary analysis and rollback using Argo Rollouts or Flagger.
- Run synthetic checks from the edge to catch DNS/TLS/CDN early.
Questions we hear from teams
- Why not just alert on 5xx rate?
- Because it’s a lagging indicator—customers are already impacted. Use 5xx rate as part of your SLO burn rate to page humans, and rely on leading indicators (throttling, queue backlog, p99 latency) to predict and auto-mitigate before customers feel it.
- Isn’t this going to spam my team with alerts?
- Not if you split signals: predicted alerts route to Slack/Jira with automation, while only SLO burn rate pages humans. Multi-window burn rules dramatically cut false pages while catching real impact early.
- We’re on Datadog/New Relic/Honeycomb. Do we need Prometheus?
- No. The strategy is tooling-agnostic. Datadog has anomaly and composite monitors, Honeycomb has SLOs and Burn Alerts, and New Relic supports NRQL-based multi-window logic. We show Prometheus because it’s easy to demo and widely used.
- How do we add rollout metadata to metrics?
- Inject labels via your metrics SDK or OpenTelemetry `resource` processor. In Kubernetes, annotate deployments with version and rollout IDs and propagate them to metrics/traces via environment variables or OTEL resource detectors.
- What if automated rollback makes things worse?
- Scope automation to canaries and well-understood metrics, keep failure limits low, and test during business hours with chaos drills. Argo Rollouts and Flagger both support conservative step-ups and quick aborts.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
