Your Incidents Start 30 Minutes Before the Pager: Playbooks That Scale Across Teams

Stop paging on CPU and 500 counts. Build playbooks around leading indicators, wire telemetry to triage, and automate safe rollouts before users feel pain.

Incidents announce themselves 30 minutes early—you just have to listen to the right signals and let your playbook push the buttons.

The on-call that convinced me our playbooks were theater

At a previous gig, we paged the payments team at 2:17 AM for “CPU > 85%.” By 2:45 we realized the real problem: the card auth service was stuck in a retry storm because the downstream fraud model’s p99 latency had doubled after a redeploy. CPU was a red herring. The incident started 30 minutes before the pager, when p95 latency slope and error budget burn spiked on the canary. Our playbook didn’t look at either.

I’ve seen this movie at startups and at Fortune 100s. The fix isn’t “more dashboards.” It’s playbooks that encode leading indicators, route to the right humans automatically, and push buttons for you when the math says roll back.

The signals that predict trouble (and the ones that don’t)

Vanity metrics light up NOC TVs but won’t save your night. Stop paging on:

  • Average CPU or memory without saturation context

  • Total requests or 200-rate without a definition of success

  • Uptime pings that ignore user journeys

Instead, page on leading indicators wired to your SLOs and saturation:

  • Multi-window error budget burn: if you promise 99.9%, page when the burn rate would exhaust the monthly budget in days, not weeks (a 14.4x burn eats 2% of a 30-day budget every hour)

  • Latency slope: rapid p95/p99 growth rate, not just absolute thresholds

  • Queue depth and its derivative: kafka_consumer_lag and deriv(kafka_consumer_lag[5m])

  • Connection pool saturation: db_pool_in_use / db_pool_size > 0.8 with rising trend

  • GC pressure: jvm_gc_pause_seconds_sum delta and heap occupancy growth

  • Cache health: falling cache_hit_ratio and rising backend_call_rate

  • Network turbulence: tcp_retransmits and packet_drops (eBPF visibility helps)

  • Retry storms: ratio of 5xx to retries; exponential growth is a fire alarm

Here’s how the SLO burn alert actually looks in Prometheus. Multi-window prevents noise while catching fast burns:

# For a 99.9% availability SLO (0.1% budget). Burn multipliers follow Google SRE patterns.
groups:
  - name: payments-slo
    rules:
      # Create recording rules first
      - record: slo:request_error_ratio:5m
        expr: |
          sum(rate(http_requests_total{code=~"5..", slo="payments"}[5m]))
          /
          sum(rate(http_requests_total{slo="payments"}[5m]))
      - record: slo:request_error_ratio:1h
        expr: |
          sum(rate(http_requests_total{code=~"5..", slo="payments"}[1h]))
          /
          sum(rate(http_requests_total{slo="payments"}[1h]))

      # Alert when both short and long windows are burning fast
      - alert: HighErrorBudgetBurn
        expr: (slo:request_error_ratio:5m > (0.001 * 14.4)) and (slo:request_error_ratio:1h > (0.001 * 14.4))
        labels:
          severity: page
          service: payments
          team: fincore
        annotations:
          summary: "Payments SLO burning fast"
          runbook: "https://git.company.com/runbooks/payments-slo.md"

For saturation, don’t wait for 100% anything. Page on trend plus level:

- alert: ThreadPoolSaturation
  expr: (threadpool_in_use{pool="netty"} / threadpool_max{pool="netty"} > 0.8)
        and (deriv(threadpool_in_use{pool="netty"}[10m]) > 0)
  labels:
    severity: page
    service: api-gateway
    team: edge
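
Queue depth deserves the same level-plus-trend treatment. A minimal sketch, reusing the kafka_consumer_lag gauge from the list above; the group label and the 10k threshold are assumptions you'd tune per topic, and it's deriv() rather than rate() because lag is a gauge:

- alert: ConsumerLagGrowing
  expr: (kafka_consumer_lag{group="payments"} > 10000)
        and (deriv(kafka_consumer_lag{group="payments"}[10m]) > 0)
  for: 5m
  labels:
    severity: page
    service: payments
    team: fincore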

Tie telemetry to ownership and triage

If an alert can’t tell you who owns the fix, it’s just noise. Standardize OpenTelemetry resource attributes and propagate them into logs, traces, and metrics. Then route and enrich based on those attributes:

# OpenTelemetry collector resource processor
processors:
  resource/add_ownership:
    attributes:
      - key: service.name
        action: upsert
        value: payments-api
      - key: service.namespace
        action: upsert
        value: prod
      - key: team
        action: upsert
        value: fincore
      - key: pagerduty.service
        action: upsert
        value: PD123ABC
      - key: slack.channel
        action: upsert
        value: "#oncall-fincore"

Now your alerting pipeline can route intelligently. With PagerDuty Event Orchestration, you can fan-in signals and apply rules instead of hardcoding routes in every tool:

{
  "routing_rules": [
    {
      "conditions": [
        {"field": "payload.severity", "operator": "equals", "value": "critical"},
        {"field": "payload.custom_details.team", "operator": "equals", "value": "fincore"}
      ],
      "actions": {
        "route_to": "PD123ABC",
        "set_priority": "P1",
        "annotate": "Runbook: https://git.company.com/runbooks/payments-slo.md"
      }
    }
  ]
}
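
If your Prometheus alerts pass through Alertmanager on the way to PagerDuty, the same team label drives routing there too. A minimal sketch; the receiver names and the <integration-key> placeholder are illustrative:

route:
  receiver: default
  routes:
    - matchers:
        - team="fincore"
      receiver: pagerduty-fincore
receivers:
  - name: default
  - name: pagerduty-fincore
    pagerduty_configs:
      - routing_key: <integration-key>   # Events API v2 key for PD123ABC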

Your triage page should contain one-click links to:

  • Grafana dashboards filtered by service.name and trace_id

  • A prebuilt LogQL/SQL query for the last 30 minutes of errors

  • A kubectl command snippet with the correct namespace/label selector

  • The canary analysis report from Argo Rollouts

  • The feature flag console pre-filtered to the service
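
The cheapest way to put those links in front of the responder is to template them into the alert annotations so they ride along into PagerDuty and Slack. A sketch for the HighErrorBudgetBurn alert above, with hypothetical dashboard and log URLs:

annotations:
  runbook: "https://git.company.com/runbooks/payments-slo.md"
  dashboard: "https://grafana.company.com/d/abc123?var-service={{ $labels.service }}"
  logs: "https://logs.company.com/search?service={{ $labels.service }}&range=30m"
  kubectl: "kubectl -n prod logs -l app={{ $labels.service }} --since=30m"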

Automate the first 15 minutes

If your playbook starts with “ssh into the box,” you’ve already lost. Turn common steps into buttons. I’ve had good results with Rundeck or StackStorm backed by AWS SSM for safe, auditable actions.
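
“Backed by SSM” just means the button runs a pre-approved document with IAM and CloudTrail around it instead of an ad-hoc ssh session. A minimal sketch of such a command document; the document itself and the payments-api default are assumptions:

schemaVersion: "2.2"
description: Restart one service with an audit trail, no ssh required
parameters:
  ServiceName:
    type: String
    default: payments-api
mainSteps:
  - action: aws:runShellScript
    name: restartService
    inputs:
      runCommand:
        - "systemctl restart {{ ServiceName }}"
        - "systemctl is-active {{ ServiceName }}"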

Automate:

  • Cache flush for a specific keyspace

  • Toggle read-only mode

  • Restart a single deployment or scale out a replica set

  • Roll back the last canary

  • Pause or reduce traffic percentage in the service mesh (Istio/Linkerd); see the sketch after this list
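
That last item is just a weight edit on the mesh route, which makes it an easy automation target. A minimal sketch of the Istio VirtualService such a job would patch; the host and subset names are assumptions:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
  namespace: prod
spec:
  hosts:
    - payments-api
  http:
    - route:
        - destination:
            host: payments-api
            subset: stable
          weight: 100   # the automation shifts weight back here during an incident
        - destination:
            host: payments-api
            subset: canary
          weight: 0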

A typical auto-triage flow triggered by the alert could be:

  1. Capture context: fetch last N exception traces and top error fingerprints

  2. Check saturation: thread pools, connection pools, Kafka lag deltas

  3. Validate canary health vs. baseline

  4. If canary bad, pause rollout; if severe, trigger automated rollback

  5. Post a concise update to #oncall-<team> and create a Jira incident with links

Here’s a simple StackStorm rule that pauses an Argo Rollout when SLO burn hits the fast threshold:

---
name: pause_rollout_on_fast_burn
pack: sre
description: Pause the payments-api rollout when the fast-burn SLO alert fires
enabled: true
trigger:
  type: sensu.alert
criteria:
  trigger.name:
    type: equals
    pattern: HighErrorBudgetBurn
action:
  ref: argo.pause_rollout
  parameters:
    namespace: prod
    rollout: payments-api
    reason: "SLO fast burn detected via Prometheus"

Bake safety into rollouts: canaries and flags

If your only rollback is redeploying main, you’re gambling. Use Argo Rollouts (or Flagger) for progressive delivery and wire the same Prometheus signals you alert on. This way, the system self-governs before customers scream.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p95-latency
        - setWeight: 25
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: p95-latency
      trafficRouting:
        istio:
          virtualService:
            name: payments-vs
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    # Supplied by the Rollout's analysis step (e.g., the canary image version)
    - name: canary-version
  metrics:
    - name: req_success_rate
      interval: 1m
      successCondition: result[0] >= 0.999
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            1 - (sum(rate(http_requests_total{service="payments-api",code=~"5..",version="{{args.canary-version}}"}[1m]))
                /
                sum(rate(http_requests_total{service="payments-api",version="{{args.canary-version}}"}[1m])))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-latency
spec:
  args:
    - name: canary-version
  metrics:
    - name: p95_latency
      interval: 1m
      successCondition: result[0] <= 200 # ms
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{service="payments-api",version="{{args.canary-version}}"}[1m])) by (le)
            ) * 1000

Pair canaries with feature flags (LaunchDarkly, Unleash) and a literal kill switch:

  • A dedicated code flag for “disable payments auth path” with a 1-minute TTL

  • Pre-authorized on-call to flip it without PRs

  • Audit trail back to the incident ticket

Standardize playbooks with GitOps so they actually scale

Every team writing its own runbooks guarantees drift. You need a standard template and a place to put it. We keep playbooks in a repo next to service manifests, reviewed like any other code. That makes ArgoCD your distribution channel.

Start with a template like this and require every service to fill it out:

# Service: payments-api

## Triggers
- Alerts: HighErrorBudgetBurn, ThreadPoolSaturation
- Dashboards: grafana.com/d/abc123?var-service=payments-api

## Triage Checklist (First 15 minutes)
1. Confirm scope via SLO burn and latency slope
2. Check canary report in Argo Rollouts
3. Inspect top error fingerprints in logs
4. Validate DB pool saturation and Kafka lag delta

## Automated Actions
- Pause rollout: `st2 run argo.pause_rollout rollout=payments-api`
- Scale out: `st2 run k8s.scale deployment=payments-api replicas=+2`
- Kill switch: LaunchDarkly flag `payments-auth-enabled=false`

## Rollback / Roll-forward
- Roll back the last version if the canary fails analysis twice
- If infra-only, toggle mesh route back to stable

## Comms
- Slack: #oncall-fincore, #incident-bridge
- Jira: create from template JINC-42

## Ownership
- Team: fincore
- PagerDuty: PD123ABC

Because it’s Git, you can enforce:

  • Linting: check for missing ownership fields and runbook links (a CI sketch follows this list)

  • Tests: run tabletop simulations with chaos-mesh or scripted drills

  • Version history: diff what changed between incidents
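
The linting bullet can be a ten-line CI job rather than a platform project. A sketch assuming GitLab CI and a playbooks/ directory; the required section names come from the template above:

playbook-lint:
  stage: test
  script:
    - |
      status=0
      for f in playbooks/*.md; do
        for section in "## Ownership" "## Automated Actions" "## Rollback" "## Comms"; do
          grep -q "$section" "$f" || { echo "$f is missing '$section'"; status=1; }
        done
        grep -Eq "runbooks|grafana" "$f" || { echo "$f has no runbook or dashboard link"; status=1; }
      done
      exit $status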

And yes, write playbooks for platforms too: Kafka, Redis, Postgres, Istio, not just product services.

Prove it works: program-level metrics and outcomes

When we implement this with clients, we measure:

  • MTTA: should drop by 30–50% once routing + triage links exist

  • MTTR: usually down 20–40% with automation and auto-rollbacks

  • False-page rate: aim for <10% after migrating to SLO/burn alerts

  • Auto-remediation success: >60% of deploy-related incidents resolved by canary halt or rollback without human action

  • Adoption: % services with valid playbooks in repo, reviewed quarterly

One client saw deployment-related P1s fall from weekly to monthly after we wired Argo Rollouts to SLO metrics and replaced CPU alerts with burn + saturation. Another cut “mystery pages” by tagging team and pagerduty.service at the OpenTelemetry collector and auto-routing. None of this required a platform rewrite—just discipline and plumbing.

What I’d do tomorrow if I were you

  1. Pick one critical journey and define a 99.9% SLO. Implement the multi-window burn alert above.

  2. Add team, service.name, and pagerduty.service to telemetry at the collector. Route via PagerDuty rules.

  3. Convert your most common mitigation (rollback, flag) into a Rundeck/StackStorm job and link it in the alert.

  4. Put Argo Rollouts on one service and gate canaries using those same Prometheus queries.

  5. Move your playbooks into a repo, adopt the template, and schedule a 60-minute tabletop drill per quarter.

None of this is flashy. It’s the boring, repeatable stuff that keeps customers off Twitter and your team out of firefighting mode.

When to call GitPlumbers

If you want this wired up end-to-end—SLOs, OTel taxonomy, PD routing, canaries, and the runbooks that actually get used—this is literally what we do. We’ve rescued orgs mid-migration, post-acquisition, and mid-GenAI pivot. We’ll meet you where you are, fix the plumbing, and hand you a system that scales across teams without the theater.

Key takeaways

  • Use leading indicators: multi-window SLO burn rates, saturation signals, queue depth deltas, and GC/eviction precursors.
  • Tag everything with ownership in `OpenTelemetry` resource attributes and route alerts via `PagerDuty Event Orchestration`.
  • Automate the first 15 minutes: runbooks-as-code with `Rundeck`/`StackStorm` and prebuilt queries/dashboards.
  • Bake recovery into delivery: `Argo Rollouts` canaries + feature flag kill-switches tied to telemetry gates.
  • Standardize a playbook template and version it via `GitOps` so every team plays the same game.

Implementation checklist

  • Define SLOs per critical user journey and implement multi-window burn rate alerts.
  • Instrument saturation indicators: connection pool usage, queue depth growth, GC pause time, retry storm detection.
  • Add `team`, `service.name`, and `pagerduty.service` to telemetry via `OpenTelemetry` resource attributes.
  • Create PagerDuty routing rules based on these attributes and incident type.
  • Automate runbook steps for restart/rollback/cache purge with `Rundeck`/`StackStorm` and tag actions in audit logs.
  • Implement `Argo Rollouts` (or `Flagger`) with metric-based canary gates; wire to the same Prometheus data as alerts.
  • Adopt a standardized playbook template in a repo; require PR review and tabletop tests per quarter.
  • Track MTTA, MTTR, false-page rate, and “automated remediation success rate” as program KPIs.

Questions we hear from teams

What if we don’t have SLOs yet?
Start with one critical user journey and define a simple availability SLO (e.g., 99.9%). Instrument success vs. failure at the request layer and implement the multi-window burn alert. Don’t wait for a company-wide SLO initiative—prove it on one service and spread.
Won’t this create more alerts?
Done right, you’ll have fewer, better alerts. Multi-window burn rates and saturation trends reduce noise compared to static CPU thresholds. We routinely cut false pages below 10% by eliminating vanity alerts.
Do we need Kubernetes/Argo to benefit?
No. The patterns hold on VMs and serverless. Replace `Argo Rollouts` with your deploy tool (e.g., Spinnaker, CodeDeploy) and use feature flags for fast rollback. The key is gating rollouts with the same metrics you alert on.
Centralized or team-owned playbooks?
Both. Platform defines the template, telemetry taxonomy, and automation interfaces. Teams own their service-specific playbooks and tests. Store them together in Git and enforce via PR review and quarterly drills.
