The Playbook Problem: Building Incident Response That Scales Across Teams (And Predicts the Blast Before It Happens)
Stop paging on vanity metrics. Start wiring leading indicators to triage and rollout automation so incidents fix themselves—or never happen.
We stopped paging on CPU and started paging on queue age, retry storms, and error-budget burn. Incidents didn’t disappear—they just stopped being exciting.
The outage that didn’t page… until it was too late
I’ve watched teams get wrecked by the same pattern: green dashboards, then suddenly Slack is on fire. At one fintech, checkout looked “healthy” (CPU 45%, 200 OK rate stable). Meanwhile, Kafka consumer lag was climbing, connection pools were at 90% saturation, and retries were doubling every minute. No alert. Ten minutes later, p99 exploded, autoscaling thrashed, and we rolled back blind.
The fix wasn’t another dashboard. We rewired playbooks around leading indicators and tied them to automation. Incidents became boring. That’s the goal.
What to measure if you want to predict incidents
Skip the vanity: average CPU, node uptime, request count. They’re fine for capacity reviews, useless at 3 a.m. You want the precursors—the signals that move before users feel pain.
- Application layer
  - p99/p999 latency regression (not averages)
  - Retry storms (client and server) and `429`/`503` from dependencies
  - Queue depth/age: internal work queues, Sidekiq, SQS, Celery
  - Thread/connection pool saturation: >80% sustained is smoke
  - Error-rate anomaly: relative change, not fixed threshold
  - Cache miss rate spikes -> DB meltdown precursor
- Runtime/infra
  - GC pause p95/p99 (JVM `gc_pause_seconds`, Go GC pause) trending up
  - CPU steal on noisy neighbors in shared tenancy
  - File descriptor/ephemeral port exhaustion
  - Disk I/O wait, write amplification
- Data/streaming
  - Kafka consumer lag growth rate (slope), not just absolute
  - DB lock wait time/deadlock count
  - Replication lag (Postgres `pg_stat_replication`), cache fill backlog
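The "growth rate, not just absolute" point is worth making concrete: a lag of 500k that is draining is recovery, while a lag of 5k growing at 600/sec is an incoming outage. A minimal sketch (sample format and thresholds are illustrative, not from any specific tool):

```python
def lag_slope(samples):
    """Least-squares slope of (timestamp_sec, lag) samples, in lag units/sec."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

def should_page(samples, slope_threshold=500.0):
    # Page on sustained growth rate, not on the absolute lag value.
    return lag_slope(samples) > slope_threshold

# Small absolute lag growing fast -> page; big lag draining -> no page.
growing = [(0, 1_000), (60, 40_000), (120, 80_000)]
draining = [(0, 900_000), (60, 700_000), (120, 500_000)]
```

In PromQL the same idea is a `deriv()` or `rate()` over the lag gauge; the point is that the alert condition is the slope, not the level.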
Anchor alerts to SLOs and error budgets, not hunches. If you burn 2% of your monthly budget in 15 minutes, that’s a page—even if status is 200.
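The "2% of monthly budget in 15 minutes" rule translates directly into a burn-rate threshold. A hedged sketch of the arithmetic (function names and the 30-day period are assumptions):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed, relative to the allowed
    rate: observed error ratio divided by the budget ratio (1 - SLO)."""
    return error_ratio / (1.0 - slo_target)

def burns_budget_fraction(error_ratio, slo_target, window_sec,
                          fraction=0.02, period_sec=30 * 24 * 3600):
    """True if sustaining error_ratio over window_sec would consume
    `fraction` of the period's budget -- the '2% in 15 minutes' page."""
    threshold = fraction * period_sec / window_sec  # 0.02 * 2592000/900 = 57.6
    return burn_rate(error_ratio, slo_target) >= threshold

# 99.9% SLO: a sustained 6% error rate burns >2% of the monthly budget
# in 15 minutes (burn rate 60 vs threshold 57.6) -- page, even on 200s.
print(burns_budget_fraction(0.06, 0.999, 900))  # -> True
```

In practice you would express this as a multiwindow burn-rate alert in Prometheus; the math above is what the expression encodes.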
```yaml
# prometheus alerts: leading indicators over vanity
groups:
  - name: checkout-leading
    rules:
      - alert: LatencyRegressionP99
        expr: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: page
          service: checkout
          team: checkout
        annotations:
          summary: "Checkout p99 latency > 500ms for 10m"
          playbook: "https://git.company.com/runbooks/checkout#latency"
      - alert: KafkaConsumerLagSpike
        expr: sum(kafka_consumergroup_group_lag{consumergroup="orders"}) > 50000
        for: 5m
        labels: {severity: page, service: orders, team: orders}
        annotations:
          summary: "Orders consumer lag > 50k for 5m (growth likely)"
          playbook: "https://git.company.com/runbooks/orders#lag"
      - alert: ConnPoolSaturation
        expr: max(db_connection_pool_in_use{service="checkout"}) / max(db_connection_pool_size{service="checkout"}) > 0.85
        for: 5m
        labels: {severity: warn, service: checkout, team: checkout}
        annotations:
          summary: "DB connection pool >85% for 5m"
          playbook: "https://git.company.com/runbooks/checkout#db-saturation"
```

If you’re pushing traces, use tail-based sampling to keep the hot signals:
```yaml
# OpenTelemetry Collector: keep error and high-latency traces
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 500
```

Turn metrics into decisions: playbooks that read like code
The best playbooks aren’t PDFs. They’re short, specific, and linked from alerts with exactly one decision tree.
Structure every playbook like this:
- Trigger: the alert name and the metric query.
- Guardrails: SLO context, error budget remaining, blast radius.
- Decision tree: degrade/rollback/route traffic—no vague “investigate”.
- Automation: commands, scripts, or toggles with examples.
- Roll-forward notes: how to fix root cause without paging next time.
Example: Checkout p99 latency regression
- Trigger: `LatencyRegressionP99` fired for 10m.
- Guardrails: Error budget burn = 3% in 15m; canary at 10%.
- Decision tree:
  - If `retry_rate` rising and payments `5xx` > 1%, enable kill switch.
  - If canary active, pause, evaluate Argo analysis, and rollback if failing.
  - If DB connection pool >90%, enable read-only mode and shed non-critical traffic (feature flag `promo-deals`).
  - If consumer lag growth >500/sec, scale consumers and slow producers via token bucket.
- Automation:
  - Toggle feature flags (LaunchDarkly):

    ```javascript
    // graceful degradation on payments outage
    if (ldClient.variation("payments-kill-switch", user, false)) {
      return cachedResponse || { status: "degraded" };
    }
    ```

  - Run the traffic routing command or move weights via Argo Rollouts/Istio.
  - One-click rollback via `argocd app rollback checkout <rev>`.
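A decision tree like this can literally be code. A minimal sketch that maps a metric snapshot to the playbook's next action (all metric names and thresholds are hypothetical mirrors of the tree above, not a real API):

```python
def triage(m):
    """Map a snapshot of metrics (a dict) to the playbook's next action."""
    if m["retry_rate_rising"] and m["payments_5xx_ratio"] > 0.01:
        return "enable payments-kill-switch"
    if m["canary_active"]:
        return "pause canary; evaluate analysis; rollback if failing"
    if m["db_pool_utilization"] > 0.90:
        return "enable read-only mode; disable promo-deals flag"
    if m["consumer_lag_slope"] > 500:
        return "scale consumers; throttle producers (token bucket)"
    return "no action; keep watching leading indicators"

state = {"retry_rate_rising": True, "payments_5xx_ratio": 0.03,
         "canary_active": False, "db_pool_utilization": 0.60,
         "consumer_lag_slope": 120}
print(triage(state))  # -> enable payments-kill-switch
```

The payoff is that the same function can run in a bot or a controller, and the playbook document becomes a rendering of the code rather than a separate artifact that drifts.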
Pro tip: every step in the tree should be unambiguous and testable in staging. If it says “check logs,” it should specify the query.
Wire telemetry to rollout automation (so rollbacks aren’t heroics)
If your playbook says “rollback if error rate >1%,” make that a controller’s job.
Argo Rollouts with Prometheus analysis
```yaml
# Analysis: fail canary on error-rate or p99 regression
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_server_requests_seconds_count{service="checkout",status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{service="checkout"}[1m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.4
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{service="checkout"}[1m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: checkout-canary
        - setWeight: 50
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-canary
        - setWeight: 100
```

Flagger offers similar automation if you’re more App Mesh/Istio-centric.
Catch dependency failures early with circuit breakers:
```yaml
# Istio: outlier detection + connection pool limits
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
```

Now your playbook step “degrade on payments 5xx” is a toggle, not a midnight YAML edit.
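The mesh handles ejection between services; an app-level breaker gives you the degrade path even where there is no sidecar. A minimal consecutive-failure breaker, mirroring the outlier-detection idea above (thresholds and class shape are illustrative):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; retry after a cooldown."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()
for _ in range(5):
    breaker.record(ok=False)  # five consecutive 5xx from payments
print(breaker.allow())        # -> False: serve the cached/degraded response
```

When `allow()` returns `False`, that is exactly the moment the playbook's kill switch or cached response takes over.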
Triage that scales across teams (routing, ownership, and chatops)
Your on-call shouldn’t guess who owns payments-proxy-v2. Route alerts by service and team labels. Use a service catalog (Backstage) as the source of truth.
```yaml
# Backstage catalog: ownership + links to dashboards/runbooks
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout
  annotations:
    grafana/dashboard-url: https://grafana.company.com/d/checkout
    ops.playbook/url: https://git.company.com/runbooks/checkout
spec:
  type: service
  owner: team-checkout
  lifecycle: production
```

Alertmanager does the boring but critical routing:
```yaml
route:
  receiver: pagerduty
  group_by: ['service', 'team']
  routes:
    - matchers:
        - service="checkout"
      receiver: pd-checkout
receivers:
  - name: pd-checkout
    pagerduty_configs:
      - routing_key: <secret>
        severity: '{{ .CommonLabels.severity | default "error" }}'
```

Standardize the incident ladder (SEV-1 to SEV-4) and automate the boilerplate:
- Slack war room creation with pinned links (Grafana, runbook, rollout)
- PagerDuty or Opsgenie for paging, Jira/ServiceNow ticket creation
- Single command to pull the right dashboard: `!checkout dashboard`

```bash
# PagerDuty event with playbook + dashboard links
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "PD_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "Checkout p99 latency regression",
      "severity": "critical",
      "source": "prometheus",
      "custom_details": {
        "playbook": "https://gitplumbers.dev/runbooks/checkout#latency"
      }
    },
    "links": [{"href": "https://grafana.company.com/d/checkout"}]
  }'
```

If your org spans multiple time zones and teams, this wiring is the difference between a clean handoff and chaos.
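The `!checkout dashboard` command above reduces to a lookup against the service catalog. A sketch of such a chatops handler (the catalog dict and function names are hypothetical; a real bot would query Backstage's API):

```python
# Hypothetical in-memory mirror of the catalog annotations.
CATALOG = {
    "checkout": {
        "dashboard": "https://grafana.company.com/d/checkout",
        "runbook": "https://git.company.com/runbooks/checkout",
        "owner": "team-checkout",
    },
}

def chatops(command):
    """Resolve '!<service> dashboard|runbook' to a catalog link plus owner."""
    parts = command.lstrip("!").split()
    if len(parts) != 2:
        return "usage: !<service> dashboard|runbook"
    service, kind = parts
    entry = CATALOG.get(service)
    if not entry or kind not in entry:
        return f"unknown: {command}"
    return f"{entry[kind]} (owner: {entry['owner']})"

print(chatops("!checkout dashboard"))
# -> https://grafana.company.com/d/checkout (owner: team-checkout)
```

Because the bot and the pager both read the same catalog, ownership changes in one place and the routing follows.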
Keep playbooks evergreen: test them like code
I’ve seen beautiful playbooks rot in Confluence. Fix it with Git and tests.
- GitOps everything: playbooks, alerts, dashboards. Same repo as the service, or a shared `ops/` monorepo with ownership.
- PR review by on-call engineers. If they can’t follow it half-asleep, rewrite it.
- Gamedays monthly: simulate dependency `503`s, throttle the DB, inject latency with `tc` or `chaos-mesh`.
- Chaos in CI: ephemeral env + smoke scenario that exercises canary analysis.
- Score the playbook:
- MTTR for its triggers
- False-positive rate
- Number of manual steps (drive toward zero)
- “First meaningful action” time from page to toggle/rollback
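That scorecard is computable from incident records, which keeps the scoring honest. A sketch, assuming a hypothetical record shape (dicts with paged/first-action/resolved timestamps, a false-positive flag, and a manual-step count):

```python
def score_playbook(incidents):
    """Aggregate the scorecard from incident records for one playbook."""
    real = [i for i in incidents if not i["false_positive"]]
    n = len(incidents)
    return {
        "mttr_min": sum(i["resolved_sec"] - i["paged_sec"] for i in real) / len(real) / 60,
        "false_positive_rate": sum(i["false_positive"] for i in incidents) / n,
        "avg_manual_steps": sum(i["manual_steps"] for i in incidents) / n,
        "first_action_min": sum(i["first_action_sec"] - i["paged_sec"] for i in real) / len(real) / 60,
    }

incidents = [
    {"paged_sec": 0, "first_action_sec": 240, "resolved_sec": 1440,
     "false_positive": False, "manual_steps": 1},
    {"paged_sec": 0, "first_action_sec": 0, "resolved_sec": 0,
     "false_positive": True, "manual_steps": 0},
]
print(score_playbook(incidents))
# -> {'mttr_min': 24.0, 'false_positive_rate': 0.5,
#     'avg_manual_steps': 0.5, 'first_action_min': 4.0}
```

Trend these per playbook per quarter; a rising manual-step count is the earliest sign a playbook is rotting.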
As a rule: if a human did it twice during an incident, automate it.
What success looks like (numbers, not vibes)
At a SaaS we helped last year:
- MTTR for SEV-2s dropped from 78 minutes to 24 minutes in six weeks.
- 62% of rollbacks executed automatically via Argo Rollouts; zero missed canary failures.
- False-positive pages down 40% after replacing CPU alerts with leading indicators.
- On-call interrupts per engineer per week fell from 9.1 to 3.4.
- Exec-friendly SLO burn alerts replaced vague “high error rate” noise.
None of that required a new APM license. It required wiring what you already have to decisions and automation.
The starter kit (copy/paste and adapt)
- Pick three services. For each, define:
- SLOs: availability, latency (p99), and error-rate
- Leading indicators: retry rate, connection pool usage, queue depth, dependency `5xx`
- Implement alerts and runbooks with live links to dashboards.
- Add one automated rollback path (Argo Rollouts or Flagger) and one kill switch (feature flag).
- Route alerts by `service` and `team`. Validate PagerDuty/Jira integration.
- Schedule a 60-minute gameday. Iterate on what broke.
The boring incident is a good incident. Make decisions machine-readable and humans will sleep again.
Key takeaways
- Push playbooks into Git with service ownership and links from alerts to runbooks so responders never hunt for context.
- Alert on leading indicators (queue depth, p99 tail, retry storms, connection pool saturation, consumer lag growth) instead of uptime vanity.
- Wire metrics to automation: canaries and circuit breakers that pause/rollback without human heroics.
- Standardize triage: common severity ladder, routing by service/owner, Slack war room automation, ticket auto-creation.
- Continuously test playbooks with gamedays and chaos; measure MTTR and false-positive rate per playbook.
Implementation checklist
- Define SLOs and error budgets per service; align alerts to budget burn, not just thresholds.
- Pick 6–10 leading indicators per tier (app, runtime, dependencies, infra).
- Create decision trees that map metric states to actions (degrade, shed load, rollback).
- Automate canary analysis with Prometheus queries; fail fast on error-rate and p99 regression.
- Route alerts by service/team in Alertmanager with runbook links and dashboards.
- Bake kill switches and circuit breakers into code and mesh; document feature-flag toggles in playbooks.
- Version playbooks, dashboards, and alert rules together; require PR review by on-call engineers.
- Run monthly gamedays to validate automation; track MTTR and false positives per scenario.
Questions we hear from teams
- What’s a good starting set of leading indicators per service?
- p99 latency, error-rate anomaly (relative change), retry rate, internal queue depth/age, dependency 5xx, DB connection pool saturation, and for event-driven systems, consumer lag growth rate. Add runtime-specific ones (GC pause p95/p99 for JVM, FD usage, CPU steal).
- How do I prevent alert fatigue when adding more signals?
- Group alerts by service and page on budget burn or multi-signal conditions. Use Alertmanager grouping and deduplication, and route WARN to Slack while PAGE only on sustained leading indicators that correlate with user-impact or SLO burn.
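A "multi-signal condition" can be as simple as a gate that pages only when several sustained leading indicators fire together and warns otherwise. A sketch (signal names and the threshold of two are illustrative):

```python
def page_decision(signals, min_firing=2):
    """Decide page vs warn vs ok from sustained leading indicators.
    `signals` maps indicator name -> bool (sustained over threshold)."""
    firing = [name for name, sustained in signals.items() if sustained]
    if len(firing) >= min_firing:
        return ("page", firing)
    return ("warn" if firing else "ok", firing)

print(page_decision({"p99_regression": True, "retry_storm": True,
                     "pool_saturation": False}))
# -> ('page', ['p99_regression', 'retry_storm'])
```

In Alertmanager terms this corresponds to composite alert expressions plus routing WARNs to Slack; the gate logic is the same either way.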
- We’re on ECS/Lambda, not Kubernetes. Does this still apply?
- Yes. Replace Argo Rollouts with CodeDeploy blue/green + CloudWatch Alarms or Flagger on App Mesh. Same Prometheus/OpenTelemetry metrics, same decision trees, different controllers.
- Who owns the playbooks—SRE or product teams?
- Service-owning teams own their playbooks. SRE provides the framework: templates, tooling, routing, and quality bar. Make playbook PRs part of the service’s definition of done.
- How do we test playbooks without breaking prod?
- Use shadow traffic and staged canaries. Chaos test in staging with synthetic load. In prod, run controlled fault injections (low blast radius) during low traffic, paired with automatic rollback and SLO burn alerts.
