Killing MTTD: Leading-Indicator Alerts That Roll Back Before Users Notice
Stop paging on vanity dashboards. Build predictive signals tied to SLOs, auto-triage, and gated rollouts that catch incidents early and fix them faster than the tweetstorm starts.
You don’t reduce MTTD by staring at dashboards; you reduce it by teaching your system to page you before users scream.
The outage we should’ve caught
Black Friday, payments API, autoscaling ‘healthy’. We were staring at p95 latency like it was gospel. But the real story was hiding two layers down: DB connection pool saturation and a silent retry storm. p95 stayed flat until it didn’t; customers saw failures 27 minutes before our first page. When we dug into it later, two leading indicators lit up early: error-budget burn and queue lag growth. If we’d been alerting on those, we would’ve auto-aborted the canary at 5 minutes instead of declaring an incident at 27.
I’ve seen this movie at unicorns and at old-guard enterprises: teams page on vanity metrics (CPU average! success rate over 24h!) and miss the precursors that actually predict user pain. This is the playbook I use now: SLOs + burn-rate alerts, saturation and queue-growth signals, auto-triage, and rollouts that refuse to go bad quietly.
Measure signals that move first
Vanity metrics lie. Leading indicators whisper before they scream. Track these per tier:
- Web/API (RED method)
- Requests: rate(http_requests_total[1m])
- Errors: fast-moving error ratio (5xx + timeouts)
- Duration: p95/p99 on critical endpoints (not global averages)
- Service runtime saturation
- Thread/worker pool queue length and saturation percent
- In-flight requests per instance; concurrency vs configured limits
- GC pause time and young-gen allocation rate (JVM/Go)
- Data/store backpressure
- DB connection pool in-use vs max; wait time percentiles
- Redis blocked_clients and evicted_keys_total
- Kafka consumer lag and, more important, lag growth rate per group
- Edge and dependency signals
- Circuit breaker open rate and retry rate
- TLS handshake failures, upstream 429/503s per dependency
Pro tip: express signals as rates and ratios. Count trends beat instantaneous values. If you only have ‘CPU 70%’, you don’t have a signal; you have a vibe.
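Concretely, here's a sketch in PromQL (metric names assumed from the examples later in this post) of what "rates and ratios, not levels" looks like:

```promql
# A 5m error *ratio* is a signal; an instantaneous gauge is a vibe.
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Trend, not level: pool usage climbing while already past 60% of the ceiling.
deriv(db_pool_in_use[10m]) > 0 and (db_pool_in_use / db_pool_max) > 0.6
```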
Instrumentation that predicts pain
If you inherited services built by ‘move fast’ or AI-generated vibe code, odds are the telemetry is thin. Fix that first:
- Standardize on OpenTelemetry for traces and metrics.
- Emit service and runtime metrics with exemplars linking to trace IDs.
- Name critical endpoints explicitly so you can alert on them, not the whole app.
- Track structured error causes (timeouts, 5xx, validation) as dimensions.
Example: add RED + saturation metrics for a Go service.
// Go: handler metrics and pool saturation
import (
	"database/sql"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"

	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	reqs     = promauto.NewCounterVec(prometheus.CounterOpts{Name: "http_requests_total"}, []string{"route", "status"})
	dur      = promauto.NewHistogramVec(prometheus.HistogramOpts{Name: "http_request_duration_seconds", Buckets: prometheus.DefBuckets}, []string{"route"})
	inflight = promauto.NewGauge(prometheus.GaugeOpts{Name: "inflight_requests"})
	poolBusy = promauto.NewGauge(prometheus.GaugeOpts{Name: "db_pool_in_use"})
	poolMax  = promauto.NewGauge(prometheus.GaugeOpts{Name: "db_pool_max"})
)

func instrumented(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		route := "unknown"
		if cur := mux.CurrentRoute(r); cur != nil {
			route = cur.GetName()
		}
		inflight.Inc()
		defer inflight.Dec()
		start := time.Now()
		// Buffer the response so we can record the final status code per route.
		rr := httptest.NewRecorder()
		next.ServeHTTP(rr, r)
		for k, v := range rr.Header() {
			w.Header()[k] = v
		}
		w.WriteHeader(rr.Code)
		rr.Body.WriteTo(w)
		reqs.WithLabelValues(route, strconv.Itoa(rr.Code)).Inc()
		dur.WithLabelValues(route).Observe(time.Since(start).Seconds())
	})
}

// Sample DB pool stats on a ticker so saturation alerts always have data.
func sampleDBPoolStats(db *sql.DB) {
	for range time.Tick(15 * time.Second) {
		stats := db.Stats()
		poolBusy.Set(float64(stats.InUse))
		poolMax.Set(float64(stats.MaxOpenConnections))
	}
}
SLOs and burn-rate alerts that page early (and correctly)
Page on SLO budget burn, not raw error counts. Use multi-window, multi-burn alerts so you catch both fast regressions and slow leaks. Here’s a Prometheus rule set for a 99.9% availability SLO over 28 days:
# prometheus-rules/slo-burnrate.yaml
groups:
  - name: api-slo-burn
    interval: 30s
    rules:
      - record: job:http_request_error_ratio:rate5m
        expr: rate(http_requests_total{job='payments',status=~'5..'}[5m]) / rate(http_requests_total{job='payments'}[5m])
      - record: job:http_request_error_ratio:rate1h
        expr: rate(http_requests_total{job='payments',status=~'5..'}[1h]) / rate(http_requests_total{job='payments'}[1h])
      # 99.9% -> allowed error budget per 28 days ≈ 0.1%
      # Fast burn: require BOTH the short (5m) and long (1h) windows above
      # ~14.4x budget, per the Google SRE workbook; the short window makes the
      # alert reset quickly, the long window keeps blips from paging.
      - alert: PaymentsSLOFastBurn
        expr: >
          job:http_request_error_ratio:rate5m > (14.4 * 0.001)
          and
          job:http_request_error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          service: payments
          slo: availability-99.9
        annotations:
          summary: 'Payments fast burn rate SLO breach'
          description: 'Error ratio over 5m is burning the 28d budget too fast.'
      # Slow burn: catch smoldering incidents
      - alert: PaymentsSLOSlowBurn
        expr: job:http_request_error_ratio:rate1h > (6 * 0.001)
        for: 2h
        labels:
          severity: ticket
          service: payments
          slo: availability-99.9
        annotations:
          summary: 'Payments slow burn rate SLO breach'
          description: 'Sustained error ratio over 1h indicates slow burn of the SLO budget.'
Do the same for latency: alert if p99 latency for the ‘charge’ endpoint burns budget. Don’t page on global latency; page on the paths that map to user journeys and revenue.
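A sketch of the latency variant, expressed as a "bad events" ratio so the same burn-rate math applies. This assumes a 99.9% latency SLO on the charge route and an explicit 0.3s histogram bucket (DefBuckets does not include one, so add it to your Buckets list); the route label is an assumption about your naming:

```yaml
# Appended under the same rules: block as the availability SLO.
# Share of POST /charge requests slower than 300ms.
- record: job:charge_latency_bad_ratio:rate5m
  expr: >
    1 - (
      sum(rate(http_request_duration_seconds_bucket{job='payments',route='POST_/charge',le='0.3'}[5m]))
      /
      sum(rate(http_request_duration_seconds_count{job='payments',route='POST_/charge'}[5m]))
    )
- alert: ChargeLatencySLOFastBurn
  expr: job:charge_latency_bad_ratio:rate5m > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
    service: payments
    slo: latency-99.9
```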
Saturation and queue growth: your early smoke alarm
Two more categories catch issues before users notice:
- Saturation
- DB pool: db_pool_in_use / db_pool_max > 0.8 for 5m
- Thread pools: queue length or ‘tasks rejected’ rate
- Sidecars/proxies: Envoy pending requests > 0, upstream connect failures rising
- Queue growth
- Kafka: per-consumer-group lag growth, deriv(kafka_consumergroup_lag[5m]) > 0
- Job queues: wait time p95 trending up faster than workers added
Example Prometheus rules:
# prometheus-rules/infra-early-warnings.yaml
groups:
  - name: shared-infra
    interval: 30s
    rules:
      - alert: DBPoolSaturation
        expr: (db_pool_in_use{service='payments'} / db_pool_max{service='payments'}) > 0.8
        for: 5m
        labels:
          severity: page
          service: payments
        annotations:
          summary: 'DB pool saturation >80%'
          runbook: 'https://runbooks.company.internal/db-pool-saturation'
      - alert: KafkaLagGrowing
        expr: deriv(kafka_consumergroup_lag{group='payments-workers'}[5m]) > 0
        for: 10m
        labels:
          severity: page
          service: payments
        annotations:
          summary: 'Kafka consumer lag increasing'
          runbook: 'https://runbooks.company.internal/kafka-lag'
Auto-triage: from alert to hypothesis in 60 seconds
When it pages, it should already tell you what changed, where, and why. Enrich alerts with deploy metadata and kick off a triage workflow.
- Add labels/annotations in Alertmanager: service, region, commit SHA, Argo Rollouts URL, last deploy time.
- Fire a webhook to a triage job (Rundeck/GitHub Actions) that snapshots key facts.
- Attach links to Grafana dashboards with variables prefilled (service, pod, revision).
Alertmanager route and template:
# alertmanager/alertmanager.yaml
route:
  receiver: pagerduty
  routes:
    - match:
        severity: page
      receiver: triage-bot
      continue: true  # fall through so PagerDuty still gets the page
receivers:
  - name: triage-bot
    webhook_configs:
      - url: 'https://triage-bot.internal/webhooks/alert'
        send_resolved: false
        http_config:
          bearer_token_file: /etc/secrets/triage-bot.token
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<pd-key>'
        description: '{{ .CommonAnnotations.summary }} ({{ .CommonLabels.service }}) sha={{ .CommonLabels.sha }}'
        severity: 'critical'
        details:
          service: '{{ .CommonLabels.service }}'
          region: '{{ .CommonLabels.region }}'
          rollout: 'https://argo.company/rollouts/{{ .CommonLabels.service }}'
Triage job (abbreviated):
#!/usr/bin/env bash
# triage.sh: run by webhook with SERVICE and NAMESPACE env
set -euo pipefail
svc="$SERVICE"; ns="$NAMESPACE"
# 1) Last deploy
last_sha=$(kubectl -n "$ns" get rollout "$svc" -o json | jq -r '.status.currentPodHash')
last_started=$(kubectl -n "$ns" get rollout "$svc" -o json | jq -r '.status.conditions[] | select(.type=="Progressing").lastTransitionTime')
# 2) Hot endpoints
curl -s "http://prometheus/api/v1/query" \
  --data-urlencode "query=topk(5, rate(http_requests_total{job='$svc',status=~'5..'}[5m]))" > "/var/triage/${svc}-hot.json"
# 3) Dependency health snapshot (ports abbreviated; adjust per dependency)
for dep in payments-db redis-kv fraud-api; do
  kubectl -n "$ns" exec "deploy/$dep" -- bash -lc 'echo ping | nc -w1 localhost 5432 || true' || true
done
# 4) If rollout started <10m ago and an SLO alert is firing, abort the canary.
# Note: `kubectl argo rollouts` is a plugin, so flags go after the subcommand.
if [[ $(date -d "$last_started" +%s) -ge $(( $(date +%s) - 600 )) ]]; then
  kubectl argo rollouts abort "$svc" -n "$ns" || true
fi
# 5) Post summary
printf 'service=%s sha=%s rollout_started=%s\n' "$svc" "$last_sha" "$last_started"
Gate rollouts with metrics, not hope
Tie detection to mitigation. Use Argo Rollouts (or Flagger/Kayenta) to gate canaries on the same SLO and leading indicators you page on. If they regress, the rollout auto-aborts.
# argo/analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-canary-analysis
spec:
  metrics:
    - name: error-ratio
      interval: 1m
      successCondition: result[0] < 0.002 # 0.2%
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            rate(http_requests_total{job='payments',status=~'5..'}[5m]) / rate(http_requests_total{job='payments'}[5m])
    - name: p95-latency
      interval: 1m
      successCondition: result[0] < 0.300 # 300ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job='payments',route='POST_/charge'}[5m])))
    - name: kafka-lag-growth
      interval: 1m
      successCondition: result[0] <= 0
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            deriv(kafka_consumergroup_lag{group='payments-workers'}[5m])
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
        - setWeight: 25
        - pause: {duration: 3m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
        - setWeight: 50
        - pause: {duration: 5m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
If you use LaunchDarkly for risky features, point a kill-switch at the same metrics and wire a fast-fail rule: if the error ratio spikes, flip the flag globally and keep the build rolling forward.
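A minimal sketch of that fast-fail rule in Go. The FlagClient interface, flag name, and thresholds are assumptions, not the LaunchDarkly SDK; the point is the shape: poll the same error ratio the canary gate uses, and only kill the flag on a sustained breach so one scrape blip can't flap it.

```go
package main

import "fmt"

// FlagClient is a stand-in for your feature-flag SDK (hypothetical interface).
type FlagClient interface {
	Disable(flag string)
}

// killSwitch disables the flag when the error ratio stays above threshold
// for `sustain` consecutive samples.
type killSwitch struct {
	flag      string
	threshold float64
	sustain   int
	breaches  int
	client    FlagClient
}

// Observe feeds one error-ratio sample; returns true if the flag was killed.
func (k *killSwitch) Observe(errRatio float64) bool {
	if errRatio > k.threshold {
		k.breaches++
	} else {
		k.breaches = 0 // recovery resets the streak
	}
	if k.breaches >= k.sustain {
		k.client.Disable(k.flag)
		return true
	}
	return false
}

// logClient records disabled flags; swap in the real SDK client in production.
type logClient struct{ killed []string }

func (l *logClient) Disable(flag string) { l.killed = append(l.killed, flag) }

func main() {
	c := &logClient{}
	ks := &killSwitch{flag: "new-charge-path", threshold: 0.002, sustain: 3, client: c}
	// Simulate polls: one clean sample, then three above the 0.2% threshold.
	for _, r := range []float64{0.001, 0.004, 0.005, 0.006} {
		if ks.Observe(r) {
			fmt.Println("killed:", ks.flag)
		}
	}
}
```

Wire Observe to the same Prometheus query your AnalysisTemplate uses, on a 30s or 1m ticker, and the flag dies on the same evidence that would have aborted the canary.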
A two-sprint plan that actually ships
Sprint 1
- Define SLOs for 1–2 golden paths (e.g., ‘POST /charge’ 99.9% under 300ms; 99.95% availability).
- Instrument missing RED/USE metrics via OTel; add DB pool and queue signals.
- Ship Prometheus rules for burn-rate, saturation, and queue growth; run ‘promtool check rules’ in CI.
- Enrich alerts: add service, region, deploy SHA, runbook links; route pages vs tickets.
Sprint 2
- Implement triage-bot: last deploy, hot endpoints, dependency probes, auto-abort recent canary.
- Integrate Argo Rollouts/Flagger with AnalysisTemplates for SLO + early warnings.
- Run two game days: slow leak (queue growth) and fast burn (5xx spike). Measure alert-to-mitigation time.
- Wire weekly report: MTTD, false-positive rate, error-budget spend, most noisy alert.
Keep the ‘done’ bar high: a rollout isn’t ‘done’ until it can detect and self-abort on regressions.
Results and hard-earned lessons
What this looks like in the wild:
- At a fintech we supported, MTTD dropped from ~18m to 90s. False positives fell 40%. Canary aborts rose temporarily (good!) then stabilized as teams fixed hotspots.
- DB pool saturation alerts caught a connection leak introduced by an ORM ‘fix’ within 6 minutes of rollout—before latency moved.
- Kafka lag-growth alerts caught a misconfigured partition reassignment without user-visible errors.
What I’d do differently (so you don’t have to learn the hard way):
- Don’t put SLO alerts behind dashboards. Page the team that owns the budget. Tickets are for slow burns, pages for fast burns.
- Delete noisy alerts fast. If it doesn’t predict or localize user impact, it’s logging, not alerting.
- Tag everything with deploy metadata. If your alerts don’t tell you what changed, you’ll guess. Guessing at 3 a.m. is expensive.
- Validate weekly with game days. If rollback is manual, it won’t happen under pressure.
If you want a partner who’s done the trench work—untangling legacy code and AI-generated ‘vibe’ services, wiring OTel, Prometheus, and ArgoCD the right way—GitPlumbers lives here. We fix the plumbing so your team ships safely, again and again.
Key takeaways
- Detect incidents with leading indicators (burn rate, saturation, queue growth), not vanity averages.
- Use SLOs and multi-window burn alerts to page early and page right.
- Auto-triage should attach deploy SHA, owner, and runbooks in under 60 seconds.
- Gate rollouts with real metrics (Argo Rollouts/Flagger) so deploys self-abort before impact.
- Standardize on OTel for traces + metrics; tie exemplars to alerts for fast pivot-to-trace.
- Measure success in MTTD/MTTR, false-positive rate, and error-budget spend—not dashboard vibes.
Implementation checklist
- Define 1–3 user-facing SLOs per service (availability and latency).
- Implement USE/RED instrumentation via OpenTelemetry and service metrics.
- Ship multi-window burn-rate alerts in Prometheus for each SLO.
- Add saturation and queue-growth alerts for shared infra (DB pools, Kafka, caches).
- Enrich alerts with deploy metadata (commit SHA, version, Argo Rollouts link).
- Automate triage: last deploy, top error endpoints, dependency health.
- Gate rollouts with AnalysisTemplates that query Prometheus; auto-abort on regressions.
- Drill: run weekly game days to validate detection and rollback paths.
- Track MTTD, false positives, and alert-to-mitigation time in your on-call report.
Questions we hear from teams
- What’s the minimal viable setup to cut MTTD this quarter?
- Pick one high-traffic service. Define a 99.9% availability SLO for a single golden path. Ship a fast/slow burn-rate alert, a DB pool saturation alert, and a consumer lag-growth alert. Enrich pages with service + SHA + runbook. Add an Argo Rollouts AnalysisTemplate to gate canary at 10% and 25%. That alone will shave minutes off detection.
- How do I keep alerts from waking up the world?
- Multi-window burn-rate alerts, ownership labels, and hard routing rules. Page the team that owns the SLO; ticket everyone else. De-duplicate in Alertmanager. Measure false-positive rate and delete or tune noisy rules weekly.
- We use Datadog/New Relic, not Prometheus—does this still apply?
- Yes. The patterns (SLO, burn-rate, saturation, queue growth, gating) work anywhere. Datadog monitors can implement burn rate. Flagger supports multiple providers. The key is the signals, not the tool logo.
- Our codebase is partially AI-generated and lacks telemetry. Where do we start?
- Do a vibe code cleanup: add OTel middleware for traces and RED metrics, instrument resource pools, and emit structured error reasons. Bake metric names and labels into a shared library so teams can’t drift. You can bolt this on in days, not months.
- How do we prove this works to the business?
- Track MTTD, alert-to-mitigation time, and error-budget spend. After two sprints, you should see MTTD fall by 50–80% on the target service and fewer user-visible incidents per deploy. Tie that to revenue protection (checkout success rate, ride match rate, etc.).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
