The Playbook Problem: Building Incident Response That Scales Across Teams (And Predicts the Blast Before It Happens)
Stop paging on vanity metrics. Start wiring leading indicators to triage and rollout automation so incidents fix themselves—or never happen.
We stopped paging on CPU and started paging on queue age, retry storms, and error-budget burn. Incidents didn’t disappear—they just stopped being exciting.
The outage that didn’t page… until it was too late
I’ve watched teams get wrecked by the same pattern: green dashboards, then suddenly Slack is on fire. At one fintech, checkout looked “healthy” (CPU 45%, 200 OK rate stable). Meanwhile, Kafka consumer lag was climbing, connection pools were at 90% saturation, and retries were doubling every minute. No alert. Ten minutes later, p99 exploded, autoscaling thrashed, and we rolled back blind.
The fix wasn’t another dashboard. We rewired playbooks around leading indicators and tied them to automation. Incidents became boring. That’s the goal.
What to measure if you want to predict incidents
Skip the vanity: average CPU, node uptime, request count. They’re fine for capacity reviews, useless at 3 a.m. You want the precursors—the signals that move before users feel pain.
- Application layer
  - p99/p999 latency regression (not averages)
  - Retry storms (client and server) and `429`/`503` from dependencies
  - Queue depth/age: internal work queues, Sidekiq, SQS, Celery
  - Thread/connection pool saturation: >80% sustained is smoke
  - Error-rate anomaly: relative change, not fixed threshold
  - Cache miss rate spikes -> DB meltdown precursor
- Runtime/infra
  - GC pause p95/p99 (JVM `gc_pause_seconds`, Go GC pause) trending up
  - CPU steal on noisy neighbors in shared tenancy
  - File descriptor/ephemeral port exhaustion
  - Disk I/O wait, write amplification
- Data/streaming
  - Kafka consumer lag growth rate (slope), not just absolute
  - DB lock wait time/deadlock count
  - Replication lag (Postgres `pg_stat_replication`), cache fill backlog
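The "growth rate, not just absolute" point is worth making concrete: a lag of 500k that is draining is recovery, while a lag of 5k growing at 600/sec is an incoming outage. A minimal sketch (sample format and thresholds are illustrative, not from any specific tool):

```python
def lag_slope(samples):
    """Least-squares slope of (timestamp_sec, lag) samples, in lag units/sec."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0

def should_page(samples, slope_threshold=500.0):
    # Page on sustained growth rate, not on the absolute lag value.
    return lag_slope(samples) > slope_threshold

# Small absolute lag growing fast -> page; big lag draining -> no page.
growing = [(0, 1_000), (60, 40_000), (120, 80_000)]
draining = [(0, 900_000), (60, 700_000), (120, 500_000)]
```

In PromQL the same idea is a `deriv()` or `rate()` over the lag gauge; the point is that the alert condition is the slope, not the level.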
Anchor alerts to SLOs and error budgets, not hunches. If you burn 2% of your monthly budget in 15 minutes, that’s a page—even if status is 200.
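The "2% of monthly budget in 15 minutes" rule translates directly into a burn-rate threshold. A hedged sketch of the arithmetic (function names and the 30-day period are assumptions):

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being consumed, relative to the allowed
    rate: observed error ratio divided by the budget ratio (1 - SLO)."""
    return error_ratio / (1.0 - slo_target)

def burns_budget_fraction(error_ratio, slo_target, window_sec,
                          fraction=0.02, period_sec=30 * 24 * 3600):
    """True if sustaining error_ratio over window_sec would consume
    `fraction` of the period's budget -- the '2% in 15 minutes' page."""
    threshold = fraction * period_sec / window_sec  # 0.02 * 2592000/900 = 57.6
    return burn_rate(error_ratio, slo_target) >= threshold

# 99.9% SLO: a sustained 6% error rate burns >2% of the monthly budget
# in 15 minutes (burn rate 60 vs threshold 57.6) -- page, even on 200s.
print(burns_budget_fraction(0.06, 0.999, 900))  # -> True
```

In practice you would express this as a multiwindow burn-rate alert in Prometheus; the math above is what the expression encodes.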
```yaml
# prometheus alerts: leading indicators over vanity
groups:
  - name: checkout-leading
    rules:
      - alert: LatencyRegressionP99
        expr: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: page
          service: checkout
          team: checkout
        annotations:
          summary: "Checkout p99 latency > 500ms for 10m"
          playbook: "https://git.company.com/runbooks/checkout#latency"
      - alert: KafkaConsumerLagSpike
        expr: sum(kafka_consumergroup_group_lag{consumergroup="orders"}) > 50000
        for: 5m
        labels: {severity: page, service: orders, team: orders}
        annotations:
          summary: "Orders consumer lag > 50k for 5m (growth likely)"
          playbook: "https://git.company.com/runbooks/orders#lag"
      - alert: ConnPoolSaturation
        expr: max(db_connection_pool_in_use{service="checkout"}) / max(db_connection_pool_size{service="checkout"}) > 0.85
        for: 5m
        labels: {severity: warn, service: checkout, team: checkout}
        annotations:
          summary: "DB connection pool >85% for 5m"
          playbook: "https://git.company.com/runbooks/checkout#db-saturation"
```

If you’re pushing traces, use tail-based sampling to keep the hot signals:
```yaml
# OpenTelemetry Collector: keep error and high-latency traces
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: high-latency
        type: latency
        latency:
          threshold_ms: 500
```

Turn metrics into decisions: playbooks that read like code
The best playbooks aren’t PDFs. They’re short, specific, and linked from alerts with exactly one decision tree.
Structure every playbook like this:
- Trigger: the alert name and the metric query.
- Guardrails: SLO context, error budget remaining, blast radius.
- Decision tree: degrade/rollback/route traffic—no vague “investigate”.
- Automation: commands, scripts, or toggles with examples.
- Roll-forward notes: how to fix root cause without paging next time.
Example: Checkout p99 latency regression
- Trigger: `LatencyRegressionP99` fired for 10m.
- Guardrails: Error budget burn = 3% in 15m; canary at 10%.
- Decision tree:
  - If `retry_rate` rising and payments `5xx` > 1%, enable kill switch.
  - If canary active, pause, evaluate Argo analysis, and rollback if failing.
  - If DB connection pool >90%, enable read-only mode and shed non-critical traffic (feature flag `promo-deals`).
  - If consumer lag growth >500/sec, scale consumers and slow producers via token bucket.
- Automation:
  - Toggle feature flags (LaunchDarkly):

    ```javascript
    // graceful degradation on payments outage
    if (ldClient.variation("payments-kill-switch", user, false)) {
      return cachedResponse || { status: "degraded" };
    }
    ```

  - Run the traffic routing command or move weights via Argo Rollouts/Istio.
  - One-click rollback via `argocd app rollback checkout <rev>`.
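A decision tree like this can literally be code. A minimal sketch that maps a metric snapshot to the playbook's next action (all metric names and thresholds are hypothetical mirrors of the tree above, not a real API):

```python
def triage(m):
    """Map a snapshot of metrics (a dict) to the playbook's next action."""
    if m["retry_rate_rising"] and m["payments_5xx_ratio"] > 0.01:
        return "enable payments-kill-switch"
    if m["canary_active"]:
        return "pause canary; evaluate analysis; rollback if failing"
    if m["db_pool_utilization"] > 0.90:
        return "enable read-only mode; disable promo-deals flag"
    if m["consumer_lag_slope"] > 500:
        return "scale consumers; throttle producers (token bucket)"
    return "no action; keep watching leading indicators"

state = {"retry_rate_rising": True, "payments_5xx_ratio": 0.03,
         "canary_active": False, "db_pool_utilization": 0.60,
         "consumer_lag_slope": 120}
print(triage(state))  # -> enable payments-kill-switch
```

The payoff is that the same function can run in a bot or a controller, and the playbook document becomes a rendering of the code rather than a separate artifact that drifts.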
Pro tip: every step in the tree should be unambiguous and testable in staging. If it says “check logs,” it should specify the query.
Wire telemetry to rollout automation (so rollbacks aren’t heroics)
If your playbook says “rollback if error rate >1%,” make that a controller’s job.
Argo Rollouts with Prometheus analysis
```yaml
# Analysis: fail canary on error-rate or p99 regression
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_server_requests_seconds_count{service="checkout",status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{service="checkout"}[1m]))
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] < 0.4
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket{service="checkout"}[1m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 60}
        - analysis:
            templates:
              - templateName: checkout-canary
        - setWeight: 50
        - pause: {duration: 120}
        - analysis:
            templates:
              - templateName: checkout-canary
        - setWeight: 100
```

Flagger offers similar automation if you’re more App Mesh/Istio-centric.
Catch dependency failures early with circuit breakers:
```yaml
# Istio: outlier detection + connection pool limits
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
```

Now your playbook step “degrade on payments 5xx” is a toggle, not a midnight YAML edit.
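The mesh handles ejection between services; an app-level breaker gives you the degrade path even where there is no sidecar. A minimal consecutive-failure breaker, mirroring the outlier-detection idea above (thresholds and class shape are illustrative):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; retry after a cooldown."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()
for _ in range(5):
    breaker.record(ok=False)  # five consecutive 5xx from payments
print(breaker.allow())        # -> False: serve the cached/degraded response
```

When `allow()` returns `False`, that is exactly the moment the playbook's kill switch or cached response takes over.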
Triage that scales across teams (routing, ownership, and chatops)
Your on-call shouldn’t guess who owns payments-proxy-v2. Route alerts by service and team labels. Use a service catalog (Backstage) as the source of truth.
```yaml
# Backstage catalog: ownership + links to dashboards/runbooks
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: checkout
  annotations:
    grafana/dashboard-url: https://grafana.company.com/d/checkout
    ops.playbook/url: https://git.company.com/runbooks/checkout
spec:
  type: service
  owner: team-checkout
  lifecycle: production
```

Alertmanager does the boring but critical routing:
```yaml
route:
  receiver: pagerduty
  group_by: ['service', 'team']
  routes:
    - matchers:
        - service="checkout"
      receiver: pd-checkout
receivers:
  - name: pd-checkout
    pagerduty_configs:
      - routing_key: <secret>
        severity: '{{ .CommonLabels.severity | default "error" }}'
```

Standardize the incident ladder (SEV-1 to SEV-4) and automate the boilerplate:
- Slack war room creation with pinned links (Grafana, runbook, rollout)
- PagerDuty or Opsgenie for paging, Jira/ServiceNow ticket creation
- Single command to pull the right dashboard: `!checkout dashboard`

```bash
# PagerDuty event with playbook + dashboard links
curl -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
    "routing_key": "PD_ROUTING_KEY",
    "event_action": "trigger",
    "payload": {
      "summary": "Checkout p99 latency regression",
      "severity": "critical",
      "source": "prometheus",
      "custom_details": {
        "playbook": "https://gitplumbers.dev/runbooks/checkout#latency"
      }
    },
    "links": [{"href": "https://grafana.company.com/d/checkout"}]
  }'
```

If your org spans multiple time zones and teams, this wiring is the difference between a clean handoff and chaos.
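The `!checkout dashboard` command above reduces to a lookup against the service catalog. A sketch of such a chatops handler (the catalog dict and function names are hypothetical; a real bot would query Backstage's API):

```python
# Hypothetical in-memory mirror of the catalog annotations.
CATALOG = {
    "checkout": {
        "dashboard": "https://grafana.company.com/d/checkout",
        "runbook": "https://git.company.com/runbooks/checkout",
        "owner": "team-checkout",
    },
}

def chatops(command):
    """Resolve '!<service> dashboard|runbook' to a catalog link plus owner."""
    parts = command.lstrip("!").split()
    if len(parts) != 2:
        return "usage: !<service> dashboard|runbook"
    service, kind = parts
    entry = CATALOG.get(service)
    if not entry or kind not in entry:
        return f"unknown: {command}"
    return f"{entry[kind]} (owner: {entry['owner']})"

print(chatops("!checkout dashboard"))
# -> https://grafana.company.com/d/checkout (owner: team-checkout)
```

Because the bot and the pager both read the same catalog, ownership changes in one place and the routing follows.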
Keep playbooks evergreen: test them like code
I’ve seen beautiful playbooks rot in Confluence. Fix it with Git and tests.
- GitOps everything: playbooks, alerts, dashboards. Same repo as the service, or a shared `ops/` monorepo with ownership.
- PR review by on-call engineers. If they can’t follow it half-asleep, rewrite it.
- Gamedays monthly: simulate dependency `503`s, throttle the DB, inject latency with `tc` or `chaos-mesh`.
- Chaos in CI: ephemeral env + smoke scenario that exercises canary analysis.
- Score the playbook:
- MTTR for its triggers
- False-positive rate
- Number of manual steps (drive toward zero)
- “First meaningful action” time from page to toggle/rollback
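That scorecard is computable from incident records, which keeps the scoring honest. A sketch, assuming a hypothetical record shape (dicts with paged/first-action/resolved timestamps, a false-positive flag, and a manual-step count):

```python
def score_playbook(incidents):
    """Aggregate the scorecard from incident records for one playbook."""
    real = [i for i in incidents if not i["false_positive"]]
    n = len(incidents)
    return {
        "mttr_min": sum(i["resolved_sec"] - i["paged_sec"] for i in real) / len(real) / 60,
        "false_positive_rate": sum(i["false_positive"] for i in incidents) / n,
        "avg_manual_steps": sum(i["manual_steps"] for i in incidents) / n,
        "first_action_min": sum(i["first_action_sec"] - i["paged_sec"] for i in real) / len(real) / 60,
    }

incidents = [
    {"paged_sec": 0, "first_action_sec": 240, "resolved_sec": 1440,
     "false_positive": False, "manual_steps": 1},
    {"paged_sec": 0, "first_action_sec": 0, "resolved_sec": 0,
     "false_positive": True, "manual_steps": 0},
]
print(score_playbook(incidents))
# -> {'mttr_min': 24.0, 'false_positive_rate': 0.5,
#     'avg_manual_steps': 0.5, 'first_action_min': 4.0}
```

Trend these per playbook per quarter; a rising manual-step count is the earliest sign a playbook is rotting.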
As a rule: if a human did it twice during an incident, automate it.
What success looks like (numbers, not vibes)
At a SaaS we helped last year:
- MTTR for SEV-2s dropped from 78 minutes to 24 minutes in six weeks.
- 62% of rollbacks executed automatically via Argo Rollouts; zero missed canary failures.
- False-positive pages down 40% after replacing CPU alerts with leading indicators.
- On-call interrupts per engineer per week fell from 9.1 to 3.4.
- Exec-friendly SLO burn alerts replaced vague “high error rate” noise.
None of that required a new APM license. It required wiring what you already have to decisions and automation.
The starter kit (copy/paste and adapt)
- Pick three services. For each, define:
- SLOs: availability, latency (p99), and error-rate
- Leading indicators: retry rate, connection pool usage, queue depth, dependency `5xx`
- Implement alerts and runbooks with live links to dashboards.
- Add one automated rollback path (Argo Rollouts or Flagger) and one kill switch (feature flag).
- Route alerts by `service` and `team`. Validate PagerDuty/Jira integration.
- Schedule a 60-minute gameday. Iterate on what broke.
The boring incident is a good incident. Make decisions machine-readable and humans will sleep again.
Key takeaways
- Push playbooks into Git with service ownership and links from alerts to runbooks so responders never hunt for context.
- Alert on leading indicators (queue depth, p99 tail, retry storms, connection pool saturation, consumer lag growth) instead of uptime vanity.
- Wire metrics to automation: canaries and circuit breakers that pause/rollback without human heroics.
- Standardize triage: common severity ladder, routing by service/owner, Slack war room automation, ticket auto-creation.
- Continuously test playbooks with gamedays and chaos; measure MTTR and false-positive rate per playbook.
Implementation checklist
- Define SLOs and error budgets per service; align alerts to budget burn, not just thresholds.
- Pick 6–10 leading indicators per tier (app, runtime, dependencies, infra).
- Create decision trees that map metric states to actions (degrade, shed load, rollback).
- Automate canary analysis with Prometheus queries; fail fast on error-rate and p99 regression.
- Route alerts by service/team in Alertmanager with runbook links and dashboards.
- Bake kill switches and circuit breakers into code and mesh; document feature-flag toggles in playbooks.
- Version playbooks, dashboards, and alert rules together; require PR review by on-call engineers.
- Run monthly gamedays to validate automation; track MTTR and false positives per scenario.
Questions we hear from teams
- What’s a good starting set of leading indicators per service?
- p99 latency, error-rate anomaly (relative change), retry rate, internal queue depth/age, dependency 5xx, DB connection pool saturation, and for event-driven systems, consumer lag growth rate. Add runtime-specific ones (GC pause p95/p99 for JVM, FD usage, CPU steal).
- How do I prevent alert fatigue when adding more signals?
- Group alerts by service and page on budget burn or multi-signal conditions. Use Alertmanager grouping and deduplication, and route WARN to Slack while PAGE only on sustained leading indicators that correlate with user-impact or SLO burn.
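A "multi-signal condition" can be as simple as a gate that pages only when several sustained leading indicators fire together and warns otherwise. A sketch (signal names and the threshold of two are illustrative):

```python
def page_decision(signals, min_firing=2):
    """Decide page vs warn vs ok from sustained leading indicators.
    `signals` maps indicator name -> bool (sustained over threshold)."""
    firing = [name for name, sustained in signals.items() if sustained]
    if len(firing) >= min_firing:
        return ("page", firing)
    return ("warn" if firing else "ok", firing)

print(page_decision({"p99_regression": True, "retry_storm": True,
                     "pool_saturation": False}))
# -> ('page', ['p99_regression', 'retry_storm'])
```

In Alertmanager terms this corresponds to composite alert expressions plus routing WARNs to Slack; the gate logic is the same either way.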
- We’re on ECS/Lambda, not Kubernetes. Does this still apply?
- Yes. Replace Argo Rollouts with CodeDeploy blue/green + CloudWatch Alarms or Flagger on App Mesh. Same Prometheus/OpenTelemetry metrics, same decision trees, different controllers.
- Who owns the playbooks—SRE or product teams?
- Service-owning teams own their playbooks. SRE provides the framework: templates, tooling, routing, and quality bar. Make playbook PRs part of the service’s definition of done.
- How do we test playbooks without breaking prod?
- Use shadow traffic and staged canaries. Chaos test in staging with synthetic load. In prod, run controlled fault injections (low blast radius) during low traffic, paired with automatic rollback and SLO burn alerts.
