The Playbook That Stopped Pager Roulette: Predictive Signals + Push‑Button Rollbacks Across 12 Teams

Stop shipping hope as a strategy. Build incident playbooks that predict trouble, triage in seconds, and trigger rollbacks and feature flag kills without a war room.

If your alert can’t tell a bot how to rollback, it’s not a playbook — it’s a suggestion.

The Friday rollout that broke six teams — and how we stopped it

Two summers ago, six teams shared a payment path. A Friday deploy looked green in Grafana — average CPU steady, error rate “fine,” synthetic checks passing. At 6:14 PM, p99 checkout latency crept from 650ms to 1.8s. Retries kicked in, Kafka lag ballooned, Node pools started CPU throttling, and the on-call Slack went feral.

We didn’t fix it with more dashboards. We built a playbook that read leading indicators and let a bot push the buttons: pause the rollout, kill a flag, drain a bad subset, and page the one team that could actually fix it. Median MTTR dropped from 72 minutes to 14 over three months, and pages per incident fell by more than half.

Here’s the version that scales across teams without becoming binder-ware.

Chase leading indicators, not dashboards that lie

If you only track vanity metrics — average CPU, total requests, APDEX — you’ll miss the cliff edge. The playbook starts with signals that move before users churn.

What actually predicts trouble:

  • Error budget burn rate: 1h window > 14x or 6h window > 6x are the classic multiwindow gates.
  • Tail latency: p99/p99.9 for critical RPCs and DB calls, not p50.
  • Saturation proxies: CPU throttling (container_cpu_cfs_throttled_seconds_total), Kafka consumer lag, thread/conn pool exhaustion.
  • Retry/amplification: rising 5xx with client retries > 1.2x baseline; circuit breakers oscillating.
  • GC/stop-the-world: JVM gc_pause_seconds_sum spikes or Go STW > 50ms sustained.
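To make the burn-rate gates concrete, here's the math as a minimal TypeScript sketch. Prometheus does this arithmetic in production; the SLO and threshold values below are illustrative.

```typescript
// Multiwindow burn-rate gate, assuming you can already compute error ratios
// per window (e.g. from Prometheus range queries).
const SLO = 0.999;             // 99.9% availability target
const ERROR_BUDGET = 1 - SLO;  // 0.001

// Burn rate = observed error ratio / error budget.
// 1x means you exhaust the budget exactly at the end of the SLO window.
function burnRate(errorRatio: number): number {
  return errorRatio / ERROR_BUDGET;
}

// Page only when BOTH windows agree: the short window proves it's happening
// now, the long window proves it's not a blip.
function shouldPage(shortRatio: number, longRatio: number, threshold = 14): boolean {
  return burnRate(shortRatio) > threshold && burnRate(longRatio) > threshold;
}
```

A 1.4% error ratio against a 99.9% SLO is a 14x burn — the classic page-now threshold.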

A Prometheus rule set that actually fires early:

# rules-burnrate.yaml
groups:
- name: slo-burnrate
  rules:
  - alert: SLOHighBurnShort
    expr: |
      sum(rate(http_request_errors_total{service=~"checkout|payment"}[5m]))
        /
      sum(rate(http_requests_total{service=~"checkout|payment"}[5m])) > (14 * 0.001)
    for: 5m
    labels:
      severity: critical
      team: payments
      playbook: checkout-latency
    annotations:
      summary: High burn rate (short window)
      runbook: https://git.company/runbooks/checkout-latency

- name: saturation-and-tail
  rules:
  - alert: CPUThrottlingHigh
    expr: rate(container_cpu_cfs_throttled_seconds_total{container!="",container!="POD"}[5m])
          / rate(container_cpu_cfs_periods_total{container!="",container!="POD"}[5m]) > 0.2
    for: 10m
    labels:
      severity: warning
      playbook: capacity-throttle
  - alert: TailLatencyDegradation
    expr: histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket{route="/checkout"}[2m])) by (le)) > 1.2
    for: 5m
    labels:
      severity: critical
      playbook: checkout-latency

If you can’t label consistently across teams (service, team, playbook), your playbook won’t scale. Standardize that in CI.
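One way to enforce that in CI is a small lint over the parsed rule files. This sketch assumes the YAML has already been parsed into plain objects (e.g. with js-yaml); the type and function names are illustrative.

```typescript
// Reject any alerting rule that lacks the labels the playbooks key off.
type Rule = { alert?: string; labels?: Record<string, string> };
type Group = { name: string; rules: Rule[] };

const REQUIRED = ['severity', 'team', 'playbook'];

function missingLabels(groups: Group[]): string[] {
  const failures: string[] = [];
  for (const g of groups) {
    for (const r of g.rules) {
      if (!r.alert) continue; // recording rules don't need routing labels
      for (const key of REQUIRED) {
        if (!r.labels?.[key]) failures.push(`${g.name}/${r.alert}: missing ${key}`);
      }
    }
  }
  return failures;
}
```

Fail the pipeline if the list is non-empty; the error strings double as the review comment.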

Tie telemetry to triage: one page, three buttons

Every page should map to an action a bot can execute. Don’t send humans into Grafana spelunking while customers time out.

The triage model we deploy across orgs:

  • Client issue: Roll back or kill a feature flag. Pause canary. No infra changes.
  • Dependency issue: Route around a degraded downstream, widen timeouts, open the circuit.
  • Capacity issue: Scale HPA, raise limits, shed non-critical traffic.

A runbook template that forces the link between signal and action:

# Runbook: Checkout Latency

- Owner: @payments-oncall
- SLO: 99.9% under 1s
- Tag: playbook=checkout-latency

## Leading Indicators
- Burn rate (1h > 14x)
- p99 /checkout latency > 1.2s
- CPU throttling > 20%
- Kafka consumer lag > 5k msgs

## Immediate Actions (choose one)
1. [Bot] Pause Argo Rollout: `argo rollouts pause checkout`
2. [Bot] Kill Experiment Flag: `ldctl flag set checkout_newflow off`
3. [Bot] Raise HPA minReplicas: `kubectl patch hpa checkout --type merge -p '{"spec":{"minReplicas":60}}'`

## Next Steps
- If dependency `payments-core` 5xx > 5%, route to vN-1 via Istio.
- If throttling persists, apply `resources.limits.cpu` +20% and restart.

## Escalation
- Pager: `#payments-pager`
- Slack: `#incident-bridge`

Pro tip: enforce that every Alertmanager route includes a playbook label with a resolvable runbook URL and mapped automation.

Automate rollouts and rollbacks from alerts

Manual rollbacks are where minutes burn. Wire alerts to automation that pauses bad rollouts, blocks promotion, or kills risky flags.

  • Canary gates with Argo Rollouts: Use AnalysisTemplate to query Prometheus and fail fast.
  • Alertmanager webhooks: Route specific alerts to a bot that calls Argo/Flag APIs.
  • Feature flag kills: Kill risky variations via LaunchDarkly when leading indicators trip.

Example AnalysisTemplate for p99 checks:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-p99
spec:
  args:
  - name: version
  metrics:
  - name: p99-latency
    interval: 1m
    count: 5
    successCondition: result[0] < 1.0
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99, sum(rate(http_server_duration_seconds_bucket{route="/checkout",version="{{args.version}}"}[1m])) by (le))

Alertmanager route to automation:

route:
  receiver: ops-bot
  group_by: [team, playbook]
receivers:
- name: ops-bot
  webhook_configs:
  - url: https://ops-bot.company/hooks/alert
    send_resolved: true

Bot handler (simplified) that pauses rollouts or kills a flag:

#!/usr/bin/env bash
set -euo pipefail

PLAYBOOK="$1"   # from the alert's playbook label
SERVICE="$2"    # from the alert's service label

case "$PLAYBOOK" in
  checkout-latency)
    # Stop the bleeding: halt the rollout, then disable the risky code path.
    # "|| true" keeps the handler going if one lever is already pulled.
    argo rollouts pause "$SERVICE" || true
    ldctl flag set checkout_newflow off || true
    ;;
  capacity-throttle)
    kubectl -n "$SERVICE" patch hpa "$SERVICE" --type merge -p '{"spec":{"minReplicas":60}}'
    ;;
esac
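Upstream of that script, the ops-bot parses Alertmanager's webhook payload (the standard v4 JSON, whose `commonLabels` carry the routing keys set in the rules) and refuses to act unless those keys are present. The handler and script names here are assumptions.

```typescript
// Map an Alertmanager webhook payload to a bot command, or null for no-op.
interface AlertmanagerPayload {
  status: 'firing' | 'resolved';
  commonLabels: Record<string, string>;
}

function toBotCommand(payload: AlertmanagerPayload): string | null {
  if (payload.status !== 'resolved' && payload.status !== 'firing') return null;
  if (payload.status === 'resolved') return null;  // never act on resolution
  const { playbook, service } = payload.commonLabels;
  if (!playbook || !service) return null;          // refuse to act without routing keys
  return `./handle-alert.sh ${playbook} ${service}`;
}
```

Refusing to act on missing labels is the safety net that makes the CI label gate worth enforcing.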

Service meshes help too. If a dependency is sick, open the circuit and route around it:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-core
spec:
  host: payments-core
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 10
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Make it scale: a shared module, not a wiki

If each team hand-rolls rules and runbooks, drift wins. Publish a Terraform/Helm module that emits:

  • A standard set of Prometheus rules (burn rate, tail latency, saturation).
  • Alertmanager routes with team, service, playbook labels.
  • Dashboards wired to runbook URLs.
  • OTel Collector config with enforced semantic conventions.

Terraform to stamp alerts per service:

module "service_alerts" {
  source           = "git::ssh://git.company/infra//modules/alerts"
  service_name     = var.service_name
  team             = var.team
  slo_target       = 0.999
  latency_route    = "/checkout"
  playbook_url     = "https://git.company/runbooks/${var.service_name}.md"
}

OpenTelemetry Collector snippet enforcing labels:

receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
processors:
  attributes:
    actions:
    - key: service
      from_attribute: service.name
      action: upsert
    - key: team
      value: payments
      action: upsert
    - key: playbook
      value: checkout-latency
      action: upsert
exporters:
  otlphttp:
    endpoint: http://tempo:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes]
      exporters: [otlphttp]

Gate merges on instrumentation compliance. We add a CI check that rejects code if spans/metrics lack service, team, playbook labels.

Results from the field (what actually changed)

On a real engagement (12 teams, ~180 services, EKS + ArgoCD + Prometheus), we measured after 90 days:

  • MTTR: 72m → 14m median; p90 140m → 32m.
  • Automated remediation rate: 9% → 54% of incidents had at least one bot action.
  • False positives: alert acknowledgement with no action 38% → 12%.
  • Rollout safety: canary auto-aborts tripled, catching bad versions before customer impact (by design).
  • Pages per incident: 5.2 → 2.1; Slack bridges got quieter.

The big unlock was predictiveness. Burn rate + tail latency caught issues 10–20 minutes before business KPIs moved. We didn’t chase “availability 99.99%” as a vanity metric — we used the error budget to tell us when to stop shipping and stabilize.

Harden the playbook: drills, chaos, and the AI code wildcard

Don’t trust a playbook you haven’t rehearsed.

  • Monthly game day: Break a dependency (we use Litmus or Gremlin) and verify the bot pauses rollouts, flips flags, and routes traffic as expected.
  • Dependency budget: Set per-downstream SLOs; page the upstream owner who’s burning the budget.
  • Retry discipline: Cap retries and jitter backoff. AI-generated “vibe code” loves while(true) retry() — we fix those loops weekly.
  • Circuit breakers: Ensure they trip before queues avalanche. Validate with chaos.
  • Post-incident PRs: If a human did it once, a bot should do it next time. Add the automation.

A quick guard against retry storms that we now standardize in service templates:

// TypeScript/Node HTTP client with bounded retries (got v12 options)
import pLimit from 'p-limit';
import got from 'got';

const limit = pLimit(100); // cap in-flight requests so retries can't stampede

async function callPayment(url: string) {
  return got(url, {
    // Retry at most twice, only on gateway errors. Only retry POST if the
    // endpoint accepts idempotency keys — otherwise you risk double charges.
    retry: { limit: 2, methods: ['GET', 'POST'], statusCodes: [502, 503, 504], backoffLimit: 2000 },
    timeout: { request: 1500 }, // fail fast; the caller's latency budget is 1s p99
  });
}

export const guarded = (u: string) => limit(() => callPayment(u));

I’ve seen one “AI helper” PR add unbounded retries and blow a Kafka cluster. If you’re cleaning up AI-generated code, bake these guards into your templates and CI linters. GitPlumbers does a lot of that vibe code cleanup in parallel with reliability hardening because the two problems are twins.


Key takeaways

  • Leading indicators beat vanity metrics. Watch burn rate, saturation, and tail latencies—not average CPU.
  • Tie alerts to an action. Every page should map to a playbook step that a bot can execute.
  • Standardize telemetry and labels so playbooks work across services and teams.
  • Automate rollbacks and flag kills using Alertmanager webhooks, Argo Rollouts, and LaunchDarkly.
  • Use shared Terraform/Helm modules to scale rules and routing, not copy-paste wiki pages.
  • Practice with game days and chaos; validate that automation actually fires under stress.

Implementation checklist

  • Define SLOs and burn-rate alerts per service with consistent labels.
  • Choose 5–7 leading indicators (saturation, queue delay, GC pause, retry rate, tail latency).
  • Create runbooks with a one-page, three-button triage model (client, dependency, capacity).
  • Wire Alertmanager routes to automation webhooks (rollback, flag kill, scale up).
  • Adopt Argo Rollouts AnalysisTemplates for canary/promo gates using Prometheus.
  • Publish a Terraform module that provisions alert rules, routes, and dashboards per service.
  • Instrument with OpenTelemetry and enforce label conventions in CI.
  • Run monthly game days; measure MTTR, false positives, and automated remediation rate.

Questions we hear from teams

What are the best leading indicators to standardize across teams?
Start with SLO burn rate (short and long windows), tail latency (p99/99.9), CPU throttling ratio, queue/consumer lag, and retry rate amplification. These move before customers complain and are portable across stacks.
We’re small. Is this overkill?
Automate the first two buttons: pause rollout and kill a flag. Use Argo Rollouts + Prometheus AnalysisTemplates and a simple webhook. You’ll save hours even with two services.
How do we avoid alert noise across 10+ teams?
Label consistently and route by team + playbook. Use multiwindow burn-rate math to avoid flapping, tune `for:` durations, and track “alerts acknowledged with no action” as a KPI. Prune quarterly.
What if we’re on a legacy monolith?
You can still run canaries at the infra level (blue/green) and use feature flags for risky code paths. Leading indicators (burn rate, tail latency, throttling) still apply; wire Alertmanager to your deploy tool (Spinnaker, ArgoCD, or even a Jenkins job).
Which tools matter most?
Prometheus/Alertmanager for math and routing, OpenTelemetry for consistent labels, Argo Rollouts or Spinnaker for canaries and rollbacks, and LaunchDarkly/Unleash for flag kills. The tools are replaceable; the pattern (leading indicator → triage → automation) is not.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

