Killing MTTD: Leading-Indicator Alerts That Roll Back Before Users Notice
Stop paging on vanity dashboards. Build predictive signals tied to SLOs, auto-triage, and gated rollouts that catch incidents early and fix them faster than the tweetstorm starts.
You don’t reduce MTTD by staring at dashboards; you reduce it by teaching your system to page you before users scream.
The outage we should’ve caught
Black Friday, payments API, autoscaling ‘healthy’. We were staring at p95 latency like it was gospel. But the real story was hiding two layers down: DB connection pool saturation and a silent retry storm. p95 stayed flat until it didn’t; customers saw failures 27 minutes before our first page. When we dug into it later, two leading indicators lit up early: error-budget burn and queue lag growth. If we’d been alerting on those, we would’ve auto-aborted the canary at 5 minutes instead of declaring an incident at 27.
I’ve seen this movie at unicorns and at old-guard enterprises: teams page on vanity metrics (CPU average! success rate over 24h!) and miss the precursors that actually predict user pain. This is the playbook I use now: SLOs + burn-rate alerts, saturation and queue-growth signals, auto-triage, and rollouts that refuse to go bad quietly.
Measure signals that move first
Vanity metrics lie. Leading indicators whisper before they scream. Track these per tier:
- Web/API (RED method)
- Requests: rate(http_requests_total[1m])
- Errors: fast-moving error ratio (5xx + timeouts)
- Duration: p95/p99 on critical endpoints (not global averages)
- Service runtime saturation
- Thread/worker pool queue length and saturation percent
- In-flight requests per instance; concurrency vs configured limits
- GC pause time and young-gen allocation rate (JVM/Go)
- Data/store backpressure
- DB connection pool in-use vs max; wait time percentiles
- Redis blocked_clients and evicted_keys_total
- Kafka consumer lag and, more important, lag growth rate per group
- Edge and dependency signals
- Circuit breaker open rate and retry rate
- TLS handshake failures, upstream 429/503s per dependency
Pro tip: express signals as rates and ratios. Count trends beat instantaneous values. If you only have ‘CPU 70%’, you don’t have a signal; you have a vibe.
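Concretely, here's a sketch in PromQL (metric names assumed from the examples later in this post) of what "rates and ratios, not levels" looks like:

```promql
# A 5m error *ratio* is a signal; an instantaneous gauge is a vibe.
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# Trend, not level: pool usage climbing while already past 60% of the ceiling.
deriv(db_pool_in_use[10m]) > 0 and (db_pool_in_use / db_pool_max) > 0.6
```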
Instrumentation that predicts pain
If you inherited services built by ‘move fast’ or AI-generated vibe code, odds are the telemetry is thin. Fix that first:
- Standardize on OpenTelemetry for traces and metrics.
- Emit service and runtime metrics with exemplars linking to trace IDs.
- Name critical endpoints explicitly so you can alert on them, not the whole app.
- Track structured error causes (timeouts, 5xx, validation) as dimensions.
Example: add RED + saturation metrics for a Go service.
// Go: handler metrics and pool saturation
import (
	"database/sql"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"

	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	reqs     = promauto.NewCounterVec(prometheus.CounterOpts{Name: "http_requests_total"}, []string{"route", "status"})
	dur      = promauto.NewHistogramVec(prometheus.HistogramOpts{Name: "http_request_duration_seconds", Buckets: prometheus.DefBuckets}, []string{"route"})
	inflight = promauto.NewGauge(prometheus.GaugeOpts{Name: "inflight_requests"})
	poolBusy = promauto.NewGauge(prometheus.GaugeOpts{Name: "db_pool_in_use"})
	poolMax  = promauto.NewGauge(prometheus.GaugeOpts{Name: "db_pool_max"})
)

func instrumented(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		route := "unknown"
		if cur := mux.CurrentRoute(r); cur != nil {
			route = cur.GetName()
		}
		inflight.Inc()
		defer inflight.Dec()
		start := time.Now()
		// Buffer the response so we can record the final status code per route.
		rr := httptest.NewRecorder()
		next.ServeHTTP(rr, r)
		for k, v := range rr.Header() {
			w.Header()[k] = v
		}
		w.WriteHeader(rr.Code)
		rr.Body.WriteTo(w)
		reqs.WithLabelValues(route, strconv.Itoa(rr.Code)).Inc()
		dur.WithLabelValues(route).Observe(time.Since(start).Seconds())
	})
}

// Sample DB pool stats on a ticker so saturation alerts always have data.
func sampleDBPoolStats(db *sql.DB) {
	for range time.Tick(15 * time.Second) {
		stats := db.Stats()
		poolBusy.Set(float64(stats.InUse))
		poolMax.Set(float64(stats.MaxOpenConnections))
	}
}
SLOs and burn-rate alerts that page early (and correctly)
Page on SLO budget burn, not raw error counts. Use multi-window, multi-burn alerts so you catch both fast regressions and slow leaks. Here’s a Prometheus rule set for a 99.9% availability SLO over 28 days:
# prometheus-rules/slo-burnrate.yaml
groups:
  - name: api-slo-burn
    interval: 30s
    rules:
      - record: job:http_request_error_ratio:rate5m
        expr: rate(http_requests_total{job='payments',status=~'5..'}[5m]) / rate(http_requests_total{job='payments'}[5m])
      - record: job:http_request_error_ratio:rate1h
        expr: rate(http_requests_total{job='payments',status=~'5..'}[1h]) / rate(http_requests_total{job='payments'}[1h])
      # 99.9% -> allowed error budget per 28 days ≈ 0.1%
      # Fast burn: require BOTH the short (5m) and long (1h) windows above
      # ~14.4x budget, per the Google SRE workbook; the short window makes the
      # alert reset quickly, the long window keeps blips from paging.
      - alert: PaymentsSLOFastBurn
        expr: >
          job:http_request_error_ratio:rate5m > (14.4 * 0.001)
          and
          job:http_request_error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
          service: payments
          slo: availability-99.9
        annotations:
          summary: 'Payments fast burn rate SLO breach'
          description: 'Error ratio over 5m is burning the 28d budget too fast.'
      # Slow burn: catch smoldering incidents
      - alert: PaymentsSLOSlowBurn
        expr: job:http_request_error_ratio:rate1h > (6 * 0.001)
        for: 2h
        labels:
          severity: ticket
          service: payments
          slo: availability-99.9
        annotations:
          summary: 'Payments slow burn rate SLO breach'
          description: 'Sustained error ratio over 1h indicates slow burn of the SLO budget.'
Do the same for latency: alert if p99 latency for the ‘charge’ endpoint burns budget. Don’t page on global latency; page on the paths that map to user journeys and revenue.
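A sketch of the latency variant, expressed as a "bad events" ratio so the same burn-rate math applies. This assumes a 99.9% latency SLO on the charge route and an explicit 0.3s histogram bucket (DefBuckets does not include one, so add it to your Buckets list); the route label is an assumption about your naming:

```yaml
# Appended under the same rules: block as the availability SLO.
# Share of POST /charge requests slower than 300ms.
- record: job:charge_latency_bad_ratio:rate5m
  expr: >
    1 - (
      sum(rate(http_request_duration_seconds_bucket{job='payments',route='POST_/charge',le='0.3'}[5m]))
      /
      sum(rate(http_request_duration_seconds_count{job='payments',route='POST_/charge'}[5m]))
    )
- alert: ChargeLatencySLOFastBurn
  expr: job:charge_latency_bad_ratio:rate5m > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
    service: payments
    slo: latency-99.9
```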
Saturation and queue growth: your early smoke alarm
Two more categories catch issues before users notice:
- Saturation
- DB pool: db_pool_in_use / db_pool_max > 0.8 for 5m
- Thread pools: queue length or ‘tasks rejected’ rate
- Sidecars/proxies: Envoy pending requests > 0, upstream connect failures rising
- Queue growth
- Kafka: per-consumer-group lag growth, deriv(kafka_consumergroup_lag[5m]) > 0
- Job queues: wait time p95 trending up faster than workers added
Example Prometheus rules:
# prometheus-rules/infra-early-warnings.yaml
groups:
  - name: shared-infra
    interval: 30s
    rules:
      - alert: DBPoolSaturation
        expr: (db_pool_in_use{service='payments'} / db_pool_max{service='payments'}) > 0.8
        for: 5m
        labels:
          severity: page
          service: payments
        annotations:
          summary: 'DB pool saturation >80%'
          runbook: 'https://runbooks.company.internal/db-pool-saturation'
      - alert: KafkaLagGrowing
        expr: deriv(kafka_consumergroup_lag{group='payments-workers'}[5m]) > 0
        for: 10m
        labels:
          severity: page
          service: payments
        annotations:
          summary: 'Kafka consumer lag increasing'
          runbook: 'https://runbooks.company.internal/kafka-lag'
Auto-triage: from alert to hypothesis in 60 seconds
When it pages, it should already tell you what changed, where, and why. Enrich alerts with deploy metadata and kick off a triage workflow.
- Add labels/annotations in Alertmanager: service, region, commit SHA, Argo Rollouts URL, last deploy time.
- Fire a webhook to a triage job (Rundeck/GitHub Actions) that snapshots key facts.
- Attach links to Grafana dashboards with variables prefilled (service, pod, revision).
Alertmanager route and template:
# alertmanager/alertmanager.yaml
route:
  receiver: pagerduty
  routes:
    - match:
        severity: page
      receiver: triage-bot
      continue: true  # fall through so PagerDuty still gets the page
receivers:
  - name: triage-bot
    webhook_configs:
      - url: 'https://triage-bot.internal/webhooks/alert'
        send_resolved: false
        http_config:
          bearer_token_file: /etc/secrets/triage-bot.token
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<pd-key>'
        description: '{{ .CommonAnnotations.summary }} ({{ .CommonLabels.service }}) sha={{ .CommonLabels.sha }}'
        severity: 'critical'
        details:
          service: '{{ .CommonLabels.service }}'
          region: '{{ .CommonLabels.region }}'
          rollout: 'https://argo.company/rollouts/{{ .CommonLabels.service }}'
Triage job (abbreviated):
#!/usr/bin/env bash
# triage.sh: run by webhook with SERVICE and NAMESPACE env
set -euo pipefail
svc="$SERVICE"; ns="$NAMESPACE"
# 1) Last deploy
last_sha=$(kubectl -n "$ns" get rollout "$svc" -o json | jq -r '.status.currentPodHash')
last_started=$(kubectl -n "$ns" get rollout "$svc" -o json | jq -r '.status.conditions[] | select(.type=="Progressing").lastTransitionTime')
# 2) Hot endpoints
curl -s "http://prometheus/api/v1/query" \
  --data-urlencode "query=topk(5, rate(http_requests_total{job='$svc',status=~'5..'}[5m]))" > "/var/triage/${svc}-hot.json"
# 3) Dependency health snapshot (ports abbreviated; adjust per dependency)
for dep in payments-db redis-kv fraud-api; do
  kubectl -n "$ns" exec "deploy/$dep" -- bash -lc 'echo ping | nc -w1 localhost 5432 || true' || true
done
# 4) If rollout started <10m ago and an SLO alert is firing, abort the canary.
# Note: `kubectl argo rollouts` is a plugin, so flags go after the subcommand.
if [[ $(date -d "$last_started" +%s) -ge $(( $(date +%s) - 600 )) ]]; then
  kubectl argo rollouts abort "$svc" -n "$ns" || true
fi
# 5) Post summary
printf 'service=%s sha=%s rollout_started=%s\n' "$svc" "$last_sha" "$last_started"
Gate rollouts with metrics, not hope
Tie detection to mitigation. Use Argo Rollouts (or Flagger/Kayenta) to gate canaries on the same SLO and leading indicators you page on. If they regress, the rollout auto-aborts.
# argo/analysis-templates.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-canary-analysis
spec:
  metrics:
    - name: error-ratio
      interval: 1m
      successCondition: result[0] < 0.002 # 0.2%
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            rate(http_requests_total{job='payments',status=~'5..'}[5m]) / rate(http_requests_total{job='payments'}[5m])
    - name: p95-latency
      interval: 1m
      successCondition: result[0] < 0.300 # 300ms
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job='payments',route='POST_/charge'}[5m])))
    - name: kafka-lag-growth
      interval: 1m
      successCondition: result[0] <= 0
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            deriv(kafka_consumergroup_lag{group='payments-workers'}[5m])
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
        - setWeight: 25
        - pause: {duration: 3m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
        - setWeight: 50
        - pause: {duration: 5m}
        - analysis: {templates: [{templateName: payments-canary-analysis}]}
If you use LaunchDarkly for risky features, point a kill-switch at the same metrics and wire a fast-fail rule: if the error ratio spikes, flip the flag globally and keep the build rolling forward.
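A minimal sketch of that fast-fail rule in Go. The FlagClient interface, flag name, and thresholds are assumptions, not the LaunchDarkly SDK; the point is the shape: poll the same error ratio the canary gate uses, and only kill the flag on a sustained breach so one scrape blip can't flap it.

```go
package main

import "fmt"

// FlagClient is a stand-in for your feature-flag SDK (hypothetical interface).
type FlagClient interface {
	Disable(flag string)
}

// killSwitch disables the flag when the error ratio stays above threshold
// for `sustain` consecutive samples.
type killSwitch struct {
	flag      string
	threshold float64
	sustain   int
	breaches  int
	client    FlagClient
}

// Observe feeds one error-ratio sample; returns true if the flag was killed.
func (k *killSwitch) Observe(errRatio float64) bool {
	if errRatio > k.threshold {
		k.breaches++
	} else {
		k.breaches = 0 // recovery resets the streak
	}
	if k.breaches >= k.sustain {
		k.client.Disable(k.flag)
		return true
	}
	return false
}

// logClient records disabled flags; swap in the real SDK client in production.
type logClient struct{ killed []string }

func (l *logClient) Disable(flag string) { l.killed = append(l.killed, flag) }

func main() {
	c := &logClient{}
	ks := &killSwitch{flag: "new-charge-path", threshold: 0.002, sustain: 3, client: c}
	// Simulate polls: one clean sample, then three above the 0.2% threshold.
	for _, r := range []float64{0.001, 0.004, 0.005, 0.006} {
		if ks.Observe(r) {
			fmt.Println("killed:", ks.flag)
		}
	}
}
```

Wire Observe to the same Prometheus query your AnalysisTemplate uses, on a 30s or 1m ticker, and the flag dies on the same evidence that would have aborted the canary.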
A two-sprint plan that actually ships
Sprint 1
- Define SLOs for 1–2 golden paths (e.g., ‘POST /charge’ 99.9% under 300ms; 99.95% availability).
- Instrument missing RED/USE metrics via OTel; add DB pool and queue signals.
- Ship Prometheus rules for burn-rate, saturation, and queue growth; run ‘promtool check rules’ in CI.
- Enrich alerts: add service, region, deploy SHA, runbook links; route pages vs tickets.
Sprint 2
- Implement triage-bot: last deploy, hot endpoints, dependency probes, auto-abort recent canary.
- Integrate Argo Rollouts/Flagger with AnalysisTemplates for SLO + early warnings.
- Run two game days: slow leak (queue growth) and fast burn (5xx spike). Measure alert-to-mitigation time.
- Wire weekly report: MTTD, false-positive rate, error-budget spend, most noisy alert.
Keep the ‘done’ bar high: a rollout isn’t ‘done’ until it can detect and self-abort on regressions.
Results and hard-earned lessons
What this looks like in the wild:
- At a fintech we supported, MTTD dropped from ~18m to 90s. False positives fell 40%. Canary aborts rose temporarily (good!) then stabilized as teams fixed hotspots.
- DB pool saturation alerts caught a connection leak introduced by an ORM ‘fix’ within 6 minutes of rollout—before latency moved.
- Kafka lag-growth alerts caught a misconfigured partition reassignment without user-visible errors.
What I’d do differently (so you don’t have to learn the hard way):
- Don’t put SLO alerts behind dashboards. Page the team that owns the budget. Tickets are for slow burns, pages for fast burns.
- Delete noisy alerts fast. If it doesn’t predict or localize user impact, it’s logging, not alerting.
- Tag everything with deploy metadata. If your alerts don’t tell you what changed, you’ll guess. Guessing at 3 a.m. is expensive.
- Validate weekly with game days. If rollback is manual, it won’t happen under pressure.
If you want a partner who’s done the trench work—untangling legacy code and AI-generated ‘vibe’ services, wiring OTel, Prometheus, and ArgoCD the right way—GitPlumbers lives here. We fix the plumbing so your team ships safely, again and again.
Key takeaways
- Detect incidents with leading indicators (burn rate, saturation, queue growth), not vanity averages.
- Use SLOs and multi-window burn alerts to page early and page right.
- Auto-triage should attach deploy SHA, owner, and runbooks in under 60 seconds.
- Gate rollouts with real metrics (Argo Rollouts/Flagger) so deploys self-abort before impact.
- Standardize on OTel for traces + metrics; tie exemplars to alerts for fast pivot-to-trace.
- Measure success in MTTD/MTTR, false-positive rate, and error-budget spend—not dashboard vibes.
Implementation checklist
- Define 1–3 user-facing SLOs per service (availability and latency).
- Implement USE/RED instrumentation via OpenTelemetry and service metrics.
- Ship multi-window burn-rate alerts in Prometheus for each SLO.
- Add saturation and queue-growth alerts for shared infra (DB pools, Kafka, caches).
- Enrich alerts with deploy metadata (commit SHA, version, Argo Rollouts link).
- Automate triage: last deploy, top error endpoints, dependency health.
- Gate rollouts with AnalysisTemplates that query Prometheus; auto-abort on regressions.
- Drill: run weekly game days to validate detection and rollback paths.
- Track MTTD, false positives, and alert-to-mitigation time in your on-call report.
Questions we hear from teams
- What’s the minimal viable setup to cut MTTD this quarter?
- Pick one high-traffic service. Define a 99.9% availability SLO for a single golden path. Ship a fast/slow burn-rate alert, a DB pool saturation alert, and a consumer lag-growth alert. Enrich pages with service + SHA + runbook. Add an Argo Rollouts AnalysisTemplate to gate canary at 10% and 25%. That alone will shave minutes off detection.
- How do I keep alerts from waking up the world?
- Multi-window burn-rate alerts, ownership labels, and hard routing rules. Page the team that owns the SLO; ticket everyone else. De-duplicate in Alertmanager. Measure false-positive rate and delete or tune noisy rules weekly.
- We use Datadog/New Relic, not Prometheus—does this still apply?
- Yes. The patterns (SLO, burn-rate, saturation, queue growth, gating) work anywhere. Datadog monitors can implement burn rate. Flagger supports multiple providers. The key is the signals, not the tool logo.
- Our codebase is partially AI-generated and lacks telemetry. Where do we start?
- Do a vibe code cleanup: add OTel middleware for traces and RED metrics, instrument resource pools, and emit structured error reasons. Bake metric names and labels into a shared library so teams can’t drift. You can bolt this on in days, not months.
- How do we prove this works to the business?
- Track MTTD, alert-to-mitigation time, and error-budget spend. After two sprints, you should see MTTD fall by 50–80% on the target service and fewer user-visible incidents per deploy. Tie that to revenue protection (checkout success rate, ride match rate, etc.).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
