Your Incidents Start 30 Minutes Before the Pager: Playbooks That Scale Across Teams
Stop paging on CPU and 500 counts. Build playbooks around leading indicators, wire telemetry to triage, and automate safe rollouts before users feel pain.
Incidents announce themselves 30 minutes early—you just have to listen to the right signals and let your playbook push the buttons.
The on-call that convinced me our playbooks were theater
At a previous gig, we paged the payments team at 2:17 AM for “CPU > 85%.” By 2:45 we realized the real problem: the card auth service was stuck in a retry storm because the downstream fraud model’s p99 latency had doubled after a redeploy. CPU was a red herring. The incident started 30 minutes before the pager, when p95 latency slope and error budget burn spiked on the canary. Our playbook didn’t look at either.
I’ve seen this movie at startups and at Fortune 100s. The fix isn’t “more dashboards.” It’s playbooks that encode leading indicators, route to the right humans automatically, and push buttons for you when the math says roll back.
The signals that predict trouble (and the ones that don’t)
Vanity metrics light up NOC TVs but won’t save your night. Stop paging on:
Average CPU or memory without saturation context
Total requests or 200-rate without success definition
Uptime pings that ignore user journeys
Instead, page on leading indicators wired to your SLOs and saturation:
Multi-window error budget burn: if you promise 99.9%, page on burn rates that will exhaust budget quickly
Latency slope: rapid p95/p99 growth rate, not just absolute thresholds
Queue depth and derivative: `kafka_consumer_lag` and `rate(lag[5m])`
Connection pool saturation: `db_pool_in_use / db_pool_size > 0.8` with rising trend
GC pressure: `jvm_gc_pause_seconds_sum` delta and heap occupancy growth
Cache health: falling `cache_hit_ratio` and rising `backend_call_rate`
Network turbulence: `tcp_retransmits` and `packet_drops` (eBPF visibility helps)
Retry storms: ratio of 5xx to retries; exponential growth is a fire alarm
Here’s how the SLO burn alert actually looks in Prometheus. Multi-window prevents noise while catching fast burns:
# For a 99.9% availability SLO (0.1% budget). Burn multipliers follow Google SRE patterns.
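# A 14.4x burn rate spends 2% of a 30-day error budget in a single hour (0.02 * 720h = 14.4).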
# Create recording rules first
- record: slo:request_error_ratio:5m
  expr: |
    sum(rate(http_requests_total{code=~"5..", slo="payments"}[5m]))
    /
    sum(rate(http_requests_total{slo="payments"}[5m]))
- record: slo:request_error_ratio:1h
  expr: |
    sum(rate(http_requests_total{code=~"5..", slo="payments"}[1h]))
    /
    sum(rate(http_requests_total{slo="payments"}[1h]))

# Alert when both short and long windows are burning fast
- alert: HighErrorBudgetBurn
  expr: (slo:request_error_ratio:5m > (0.001 * 14.4)) and (slo:request_error_ratio:1h > (0.001 * 14.4))
  labels:
    severity: page
    service: payments
    team: fincore
  annotations:
    summary: "Payments SLO burning fast"
    runbook: "https://git.company.com/runbooks/payments-slo.md"
For saturation, don’t wait for 100% anything. Page on trend plus level:
- alert: ThreadPoolSaturation
  expr: |
    (threadpool_in_use{pool="netty"} / threadpool_max{pool="netty"} > 0.8)
    and (deriv(threadpool_in_use{pool="netty"}[10m]) > 0)
  labels:
    severity: page
    service: api-gateway
    team: edge
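The same level-plus-trend idea applies to queue depth. A sketch for consumer lag, assuming a `kafka_consumergroup_lag` gauge from an exporter like kafka_exporter; the metric, group name, and threshold are placeholders for whatever your stack emits:
# Lag is high AND still climbing; tune the absolute threshold to your topic
- alert: ConsumerLagGrowing
  expr: |
    (kafka_consumergroup_lag{consumergroup="payments-consumer"} > 10000)
    and (deriv(kafka_consumergroup_lag{consumergroup="payments-consumer"}[10m]) > 0)
  labels:
    severity: page
    service: payments-api
    team: fincore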
Tie telemetry to ownership and triage
If an alert can’t tell you who owns the fix, it’s just noise. Standardize OpenTelemetry resource attributes and propagate them into logs, traces, and metrics. Then route and enrich based on those attributes:
# OpenTelemetry collector resource processor
processors:
  resource/add_ownership:
    attributes:
      - key: service.name
        action: upsert
        value: payments-api
      - key: service.namespace
        action: upsert
        value: prod
      - key: team
        action: upsert
        value: fincore
      - key: pagerduty.service
        action: upsert
        value: PD123ABC
      - key: slack.channel
        action: upsert
        value: "#oncall-fincore"
Now your alerting pipeline can route intelligently. With PagerDuty Event Orchestration, you can fan-in signals and apply rules instead of hardcoding routes in every tool:
{
  "routing_rules": [
    {
      "conditions": [
        {"field": "payload.severity", "operator": "equals", "value": "critical"},
        {"field": "payload.custom_details.team", "operator": "equals", "value": "fincore"}
      ],
      "actions": {
        "route_to": "PD123ABC",
        "set_priority": "P1",
        "annotate": "Runbook: https://git.company.com/runbooks/payments-slo.md"
      }
    }
  ]
}
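For those conditions to match, whatever fires the event has to carry the attributes. A PagerDuty Events API v2 payload shaped roughly like this would do it (routing key and summary are placeholders):
{
  "routing_key": "<integration-routing-key>",
  "event_action": "trigger",
  "payload": {
    "summary": "HighErrorBudgetBurn: payments SLO burning fast",
    "source": "alertmanager",
    "severity": "critical",
    "custom_details": {
      "team": "fincore",
      "service": "payments-api",
      "runbook": "https://git.company.com/runbooks/payments-slo.md"
    }
  }
}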
Your triage page should contain one-click links to:
Grafana dashboards filtered by `service.name` and `trace_id`
A prebuilt LogQL/SQL query for the last 30 minutes of errors (sketched after this list, alongside the kubectl snippet)
A `kubectl` command snippet with the correct namespace/label selector
The canary analysis report from Argo Rollouts
The feature flag console pre-filtered to the service
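The prebuilt queries don’t need to be clever. Something like this is enough to drop people in the right place, assuming Loki with an `app` label and a `prod` namespace (swap in your own labels and selectors):
# LogQL: error lines for the service (set the Grafana/logcli time range to the last 30 minutes)
{namespace="prod", app="payments-api"} |= "level=error"

# kubectl: quick state of the workload behind the alert
kubectl -n prod get pods -l app=payments-api -o wide
kubectl -n prod rollout history deployment/payments-api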
Automate the first 15 minutes
If your playbook starts with “ssh into the box,” you’ve already lost. Turn common steps into buttons. I’ve had good results with Rundeck or StackStorm backed by AWS SSM for safe, auditable actions.
Automate:
Cache flush for a specific keyspace
Toggle read-only mode
Restart a single deployment or scale out a replica set
Roll back the last canary
Pause or reduce traffic percentage in the service mesh (Istio/Linkerd); see the VirtualService sketch after this list
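For the mesh lever, “reduce traffic” usually means editing route weights on the VirtualService that fronts the service. A hand-rolled sketch that shifts everything back to stable; host and subset names are assumptions, and Argo Rollouts manages these weights for you when it owns the rollout:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
  namespace: prod
spec:
  hosts:
    - payments-api
  http:
    - route:
        - destination:
            host: payments-api
            subset: stable
          weight: 100
        - destination:
            host: payments-api
            subset: canary
          weight: 0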
A typical auto-triage flow triggered by the alert could be:
Capture context: fetch last N exception traces and top error fingerprints
Check saturation: thread pools, connection pools, Kafka lag deltas
Validate canary health vs. baseline
If canary bad, pause rollout; if severe, trigger automated rollback
Post a concise update to `#oncall-<team>` and create a Jira incident with links
Here’s a simple StackStorm rule that pauses an Argo Rollout when SLO burn hits the fast threshold:
---
name: pause_rollout_on_fast_burn
pack: sre
trigger:
  type: sensu.alert
criteria:
  trigger.name:
    type: eq
    pattern: HighErrorBudgetBurn
action:
  ref: argo.pause_rollout
  parameters:
    namespace: prod
    rollout: payments-api
    reason: "SLO fast burn detected via Prometheus"
Bake safety into rollouts: canaries and flags
If your only rollback is redeploying main, you’re gambling. Use Argo Rollouts (or Flagger) for progressive delivery and wire the same Prometheus signals you alert on. This way, the system self-governs before customers scream.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p95-latency
        - setWeight: 25
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p95-latency
      trafficRouting:
        istio:
          virtualService:
            name: payments-vs
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: req_success_rate
      interval: 1m
      successCondition: result[0] >= 0.999
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            1 - (sum(rate(http_requests_total{service="payments-api",code=~"5..",version="$CANARY"}[1m]))
            /
            sum(rate(http_requests_total{service="payments-api",version="$CANARY"}[1m])))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency
spec:
  metrics:
    - name: p95_latency
      interval: 1m
      successCondition: result[0] <= 200  # ms
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="payments-api",version="$CANARY"}[1m])) by (le)) * 1000
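When an analysis gate fails, Argo Rollouts pauses or aborts on its own. For the times a human needs to take the wheel, the kubectl plugin covers the escape hatches (namespace assumed):
kubectl argo rollouts get rollout payments-api -n prod --watch   # live canary status and analysis results
kubectl argo rollouts abort payments-api -n prod                 # stop the canary, shift traffic back to stable
kubectl argo rollouts undo payments-api -n prod                  # roll back to the previous revision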
Pair canaries with feature flags (LaunchDarkly, Unleash) and a literal kill switch:
A dedicated flag for “disable payments auth path” with a 1-minute TTL
Pre-authorized on-call to flip it without PRs
Audit trail back to the incident ticket
Standardize playbooks with GitOps so they actually scale
Every team writing its own runbooks guarantees drift. You need a standard template and a place to put it. We keep playbooks in a repo next to service manifests, reviewed like any other code. That makes ArgoCD your distribution channel.
Start with a template like this and require every service to fill it out:
# Service: payments-api
## Triggers
- Alerts: HighErrorBudgetBurn, ThreadPoolSaturation
- Dashboards: grafana.com/d/abc123?var-service=payments-api
## Triage Checklist (First 15 minutes)
1. Confirm scope via SLO burn and latency slope
2. Check canary report in Argo Rollouts
3. Inspect top error fingerprints in logs
4. Validate DB pool saturation and Kafka lag delta
## Automated Actions
- Pause rollout: `st2 run argo.pause_rollout rollout=payments-api`
- Scale out: `st2 run k8s.scale deployment=payments-api replicas=+2`
- Kill switch: LaunchDarkly flag `payments-auth-enabled=false`
## Rollback / Roll-forward
- Rollback last version if canary failing twice
- If infra-only, toggle mesh route back to stable
## Comms
- Slack: #oncall-fincore, #incident-bridge
- Jira: create from template JINC-42
## Ownership
- Team: fincore
- PagerDuty: PD123ABC
Because it’s Git, you can enforce:
Linting: check for missing ownership fields and runbook links (a CI sketch follows this list)
Tests: run tabletop simulations with `chaos-mesh` or scripted drills
Version history: diff what changed between incidents
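The linting step can be a trivially small CI job. A sketch in GitHub Actions syntax, assuming playbooks live under playbooks/ and follow the template above (the path and required strings are assumptions about your repo):
name: playbook-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Require ownership and automation sections
        run: |
          fail=0
          for f in $(find playbooks -name '*.md'); do
            for required in "## Ownership" "## Automated Actions" "Team:" "PagerDuty:"; do
              grep -q "$required" "$f" || { echo "$f is missing '$required'"; fail=1; }
            done
          done
          exit $fail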
And yes, write playbooks for platforms too: Kafka, Redis, Postgres, Istio, not just product services.
Prove it works: program-level metrics and outcomes
When we implement this with clients, we measure:
MTTA: should drop by 30–50% once routing + triage links exist
MTTR: usually down 20–40% with automation and auto-rollbacks
False-page rate: aim for <10% after migrating to SLO/burn alerts
Auto-remediation success: >60% of deploy-related incidents resolved by canary halt or rollback without human action
Adoption: % services with valid playbooks in repo, reviewed quarterly
One client saw deployment-related P1s fall from weekly to monthly after we wired Argo Rollouts to SLO metrics and replaced CPU alerts with burn + saturation. Another cut “mystery pages” by tagging `team` and `pagerduty.service` at the OpenTelemetry collector and auto-routing. None of this required a platform rewrite—just discipline and plumbing.
What I’d do tomorrow if I were you
Pick one critical journey and define a 99.9% SLO. Implement the multi-window burn alert above.
Add `team`, `service.name`, and `pagerduty.service` to telemetry at the collector. Route via PagerDuty rules.
Convert your most common mitigation (rollback, flag) into a Rundeck/StackStorm job and link it in the alert.
Put Argo Rollouts on one service and gate canaries using those same Prometheus queries.
Move your playbooks into a repo, adopt the template, and schedule a 60-minute tabletop drill per quarter.
None of this is flashy. It’s the boring, repeatable stuff that keeps customers off Twitter and your team out of firefighting mode.
When to call GitPlumbers
If you want this wired up end-to-end—SLOs, OTel taxonomy, PD routing, canaries, and the runbooks that actually get used—this is literally what we do. We’ve rescued orgs mid-migration, post-acquisition, and mid-GenAI pivot. We’ll meet you where you are, fix the plumbing, and hand you a system that scales across teams without the theater.
Key takeaways
- Use leading indicators: multi-window SLO burn rates, saturation signals, queue depth deltas, and GC/eviction precursors.
- Tag everything with ownership in `OpenTelemetry` resource attributes and route alerts via `PagerDuty Event Orchestration`.
- Automate the first 15 minutes: runbooks-as-code with `Rundeck`/`StackStorm` and prebuilt queries/dashboards.
- Bake recovery into delivery: `Argo Rollouts` canaries + feature flag kill-switches tied to telemetry gates.
- Standardize a playbook template and version it via `GitOps` so every team plays the same game.
Implementation checklist
- Define SLOs per critical user journey and implement multi-window burn rate alerts.
- Instrument saturation indicators: connection pool usage, queue depth growth, GC pause time, retry storm detection.
- Add `team`, `service`, and `pagerduty_service` attributes to telemetry via `OpenTelemetry` resource attributes.
- Create PagerDuty routing rules based on these attributes and incident type.
- Automate runbook steps for restart/rollback/cache purge with `Rundeck`/`StackStorm` and tag actions in audit logs.
- Implement `Argo Rollouts` (or `Flagger`) with metric-based canary gates; wire to the same Prometheus data as alerts.
- Adopt a standardized playbook template in a repo; require PR review and tabletop tests per quarter.
- Track MTTA, MTTR, false-page rate, and “automated remediation success rate” as program KPIs.
Questions we hear from teams
- What if we don’t have SLOs yet?
- Start with one critical user journey and define a simple availability SLO (e.g., 99.9%). Instrument success vs. failure at the request layer and implement the multi-window burn alert. Don’t wait for a company-wide SLO initiative—prove it on one service and spread.
- Won’t this create more alerts?
- Done right, you’ll have fewer, better alerts. Multi-window burn rates and saturation trends reduce noise compared to static CPU thresholds. We routinely cut false pages below 10% by eliminating vanity alerts.
- Do we need Kubernetes/Argo to benefit?
- No. The patterns hold on VMs and serverless. Replace `Argo Rollouts` with your deploy tool (e.g., Spinnaker, CodeDeploy) and use feature flags for fast rollback. The key is gating rollouts with the same metrics you alert on.
- Centralized or team-owned playbooks?
- Both. Platform defines the template, telemetry taxonomy, and automation interfaces. Teams own their service-specific playbooks and tests. Store them together in Git and enforce via PR review and quarterly drills.