The Chaos Engineering Playbook We Actually Run: Resilience Tests That Don’t Torch Prod

Practical, low-blast-radius chaos engineering for teams that ship. Metrics, guardrails, configs, and tooling that hold up under fire.

Hope is not a fallback strategy. Test it, or it will fail in prod.

You don’t earn resilience by hoping

I’ve been in too many incident reviews where someone says, “We never thought payments would slow down Redis.” Cool story. Hope is not a fallback strategy. The teams that ship safely run chaos the same way they run deploys: small blast radius, observable, with rollbacks and receipts.

This is the playbook we use at GitPlumbers when a senior team says, “Make our platform harder to kill,” and they don’t want a vendor to light their prod on fire. It’s stepwise, boring on purpose, and it works.


1) Start with the contract: steady state, SLOs, and error budgets

If you can’t define “healthy,” chaos is cosplay. Tie experiments to user-visible outcomes via SLOs.

  • Steady-state SLIs (per service):
    • http_request_duration_seconds p95/p99 latency
    • http_requests_total error rate (5xx)
    • Queue depth / saturation (Kafka lag, Redis ops/sec, DB connections)
    • Resource saturation (CPU throttling, memory RSS, GC pause)
  • SLO/Objective examples:
    • API latency: 99% of requests served in <= 300ms, measured over 30 days
    • Availability 99.9% with a 43m monthly error budget
  • Prove it exists with Prometheus/Grafana and alerts. If you’re fancy, generate SLO alerts with Sloth or Nobl9.
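That 43m budget is just arithmetic: a 99.9% objective leaves 0.1% of the month for errors. A quick sketch, assuming a 30-day month:

```shell
# Error budget in minutes: (1 - objective) * days * 24h * 60m
budget_minutes() {
  awk -v slo="$1" -v days="$2" 'BEGIN { printf "%.1f\n", (1 - slo) * days * 24 * 60 }'
}

budget_minutes 0.999 30   # -> 43.2 (the "43m monthly error budget" above)
budget_minutes 0.99 30    # -> 432.0
```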

Here’s a simple Sloth config that enforces a latency SLO using Prometheus histograms:

# sloth: checkout latency SLO
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-latency
spec:
  service: checkout
  slos:
    - name: latency
      objective: 99.0
      labels:
        tier: "critical"
      sli:
        events:
          # Bad events: requests at or slower than 300ms (total minus the fast bucket)
          errorQuery: sum(rate(http_request_duration_seconds_count{service="checkout"}[5m])) - sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.3"}[5m]))
          totalQuery: sum(rate(http_request_duration_seconds_count{service="checkout"}[5m]))
      alerting:
        name: CheckoutLatencySLOBurn
        # Sloth expands this into multiwindow, multi-burn-rate alerts automatically
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket

Checkpoints:

  1. You have dashboards showing p95/p99, error rate, and saturation per service.
  2. You can answer: “What’s our error budget this week and how fast are we burning it?”
  3. Alerting is wired to page on fast burn and ticket on slow burn.

2) Guardrails before grenades: blast radius, budgets, and aborts

Before you inject failure, constrain it. The way we do it:

  • Traffic controls:
    • Enable circuit breakers/outlier detection (Istio/Envoy).
    • Cap concurrency and queue sizes. Order matters: timeouts first, then bounded retries, always with jittered backoff.
  • Capacity protections:
    • HPA min replicas and PodDisruptionBudget to prevent total drain.
    • Priority classes to avoid starving critical control planes.
  • Abort conditions:
    • Auto-stop if any SLO alert triggers or error budget burn > X.
    • Manual stop button in Slack via a runbook script.

Examples:

# Kubernetes PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
---
# Istio circuit breaker/outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
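Timeouts and bounded retries belong in the mesh next to the DestinationRule above. A hedged VirtualService sketch; the 2s budget and retry policy are illustrative numbers, not recommendations:

```yaml
# Istio: per-request timeout plus bounded, selective retries
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
    - payments.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments.default.svc.cluster.local
      timeout: 2s
      retries:
        attempts: 2
        perTryTimeout: 500ms
        retryOn: 5xx,reset,connect-failure
```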

Checkpoints:

  1. PDBs exist for all critical workloads; HPA not set to scale-to-zero.
  2. Circuit breaker/rate limit policies exist at service mesh or gateway.
  3. Abort conditions are codified and tested.

If your first chaos run pages everyone, your guardrails are wrong—not the experiment.
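The "manual stop button" from the guardrails list can be a ten-line script instead of tribal knowledge. A sketch, assuming Chaos Mesh CRDs and a dedicated chaos namespace (both are conventions, not requirements):

```shell
#!/usr/bin/env bash
# chaos-stop: delete every active Chaos Mesh experiment in one shot.
# Defaults to dry-run so you can rehearse it; set DRY_RUN=0 to actually fire.
set -euo pipefail
NAMESPACE="${NAMESPACE:-chaos}"
KINDS="networkchaos,podchaos,stresschaos,iochaos"
CMD="kubectl delete $KINDS --all -n $NAMESPACE"
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "would run: $CMD"
else
  $CMD
fi
```

Wire it to a Slack slash command or a pinned runbook link, and rehearse it before the first real run.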


3) Pick experiments that match real outages

Start with three that account for 80% of the pain:

  • Network latency/packet loss between app and dependency (DB, payments, cache)
  • Dependency brownout (timeouts, slow remote) rather than total failure
  • Node/Pod eviction (chaotic reschedules, disk pressure)

Tooling that won’t fight you:

  • Kubernetes: Chaos Mesh, LitmusChaos, Pumba for container kills
  • Edges/VMs/Cloud: Gremlin, AWS Fault Injection Simulator (FIS), Azure Chaos Studio
  • App-level: Toxiproxy (brutally reliable), tc netem, feature-flag toggles

Examples you can drop in:

# Chaos Mesh: add 200ms latency to checkout pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-latency-checkout
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: checkout
  delay:
    latency: "200ms"
    correlation: "0.5"
    jitter: "50ms"
  duration: "10m"
# Toxiproxy: add latency in front of payments
docker run -d --name toxiproxy -p 8474:8474 -p 8666:8666 ghcr.io/shopify/toxiproxy
toxiproxy-cli create -l 0.0.0.0:8666 -u payments.svc:443 payments
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 payments
# AWS FIS: stop one EC2 instance in an ASG for 5 minutes (JSON template snippet)
aws fis create-experiment-template \
  --cli-input-json file://fis-stop-one.json
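The fis-stop-one.json referenced above isn't shown; here's a hedged sketch of roughly what such a template contains. The tag selector, alarm ARN, and role ARN are placeholders to swap for your own:

```shell
# Write a minimal FIS template: stop one tagged instance, restart after 5 minutes,
# and auto-abort if a CloudWatch SLO-burn alarm fires.
cat > fis-stop-one.json <<'EOF'
{
  "description": "Stop one instance behind checkout for 5 minutes",
  "targets": {
    "one-instance": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "asg": "checkout" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stop-one": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "one-instance" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:REGION:ACCOUNT:alarm:slo-burn" }
  ],
  "roleArn": "arn:aws:iam::ACCOUNT:role/fis-experiment-role"
}
EOF
```

The stopCondition is the important part: FIS halts the experiment on its own if your burn alarm fires.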

Checkpoints:

  1. Each experiment has a clear owner and a change window.
  2. You can revert with a single command.
  3. You have a shadow load generator (k6, vegeta) to keep the system hot during tests.

4) Design the experiment like a hypothesis, not a dare

Write it down in your repo next to your service. Treat it like test code.

  • Hypothesis: “If payments adds 200ms latency, checkout p99 stays < 300ms and error rate < 0.5%.”
  • Abort conditions: “If burn rate > 2x over 1h window or p99 > 600ms for 5m, stop.”
  • Expected mitigations: Retries backoff, circuit breaker trips < 10% of traffic, queue depth stabilizes.
  • Rollback: kubectl delete -f experiment.yaml, plus feature flag reset.
  • Runbook: Slack channel, pager target, Grafana dashboard link.

Template we actually use:

# experiments/checkout-latency-200ms.yaml
experiment:
  id: CHK-NT-001
  service: checkout
  hypothesis: p99 stays < 300ms; errors < 0.5%
  blastRadius: single pod
  preChecks:
    - dashboard: grafana.com/d/checkout
    - kubectl: kubectl get pods -l app=checkout
  abort:
    - sloBurnRate: 
        window: 1h
        threshold: 2
    - latencyP99:
        window: 5m
        thresholdMs: 600
  apply:
    - cmd: kubectl apply -f chaos/network-latency.yaml
  observe:
    - promql: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
    - promql: sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout"}[5m]))
  rollback:
    - cmd: kubectl delete -f chaos/network-latency.yaml
  notes: |
    Expect small spike in retries; ensure memq depth < 80%.

Checkpoints:

  1. Hypothesis uses concrete thresholds.
  2. Observability is pre-linked; no hunting for dashboards mid-run.
  3. Rollback path is rehearsed.

5) Observe and score: PromQL, burn, and MTTR

If you can’t score it, you didn’t learn. The handful of queries we lean on:

  • Latency p99:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
  • Error rate:
sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout"}[5m]))
  • Error budget burn (availability objective 99.9%):
# burn rate = error_rate / (1 - 0.999)
(
  sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
/
  sum(rate(http_requests_total{service="checkout"}[5m]))
) / 0.001
  • Saturation (CPU throttling, queue depth): instrument and graph. For Kafka:
max(kafka_consumergroup_lag{consumergroup="checkout"})
  • Circuit breaker pressure (Istio): upstream 5xx toward the dependency is a workable proxy for ejections
sum(rate(istio_requests_total{destination_service="payments",response_code=~"5.."}[5m]))

We also capture:

  • Time-to-detect (TTD) from injection to first alert
  • MTTR from alert to recovery
  • Change failure rate of chaos runs (should trend down)
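TTD and MTTR fall out of three timestamps you should already be logging per run: injection, first alert, recovery. A minimal sketch using epoch seconds (the function name and fields are ours, not a standard):

```shell
# Score a run: time-to-detect and time-to-recover from three epoch timestamps
score_run() {
  local injected=$1 alerted=$2 recovered=$3
  echo "ttd_seconds=$(( alerted - injected ))"
  echo "mttr_seconds=$(( recovered - alerted ))"
}

# Injection at t, first alert 130s later, recovery 450s after the alert
score_run 1700000000 1700000130 1700000580
```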

Deliver a small report: hypothesis, results vs thresholds, screenshots, follow-ups. Put it in the repo.


6) Automate with GitOps so it’s boring and repeatable

Chaos that relies on someone’s terminal is chaos you won’t run next quarter. Make it code.

  • Repo layout:
    • services/<svc>/experiments/*.yaml (hypothesis files)
    • services/<svc>/chaos/*.yaml (Chaos Mesh, Toxiproxy configs)
    • scripts/score-slo.sh (PromQL scoring)
  • GitOps apply with ArgoCD or Flux using a dedicated chaos app/namespace
  • CI kicker via GitHub Actions; schedule off-hours canaries; slack notify; auto-rollback on failure

Example CI job:

name: chaos-experiment
on:
  workflow_dispatch:
  schedule:
    - cron: "0 3 * * 2"  # Tuesdays 03:00 UTC
jobs:
  run-chaos:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Context
        run: |
          echo "Running CHK-NT-001"
      - name: Apply experiment
        run: |
          kubectl apply -f services/checkout/chaos/network-latency.yaml
      - name: Warm traffic
        run: |
          k6 run load/checkout-smoke.js
      - name: Score SLO
        run: |
          ./scripts/score-slo.sh --service checkout --window 10m --objective 0.99
      - name: Rollback chaos
        if: failure()
        run: |
          kubectl delete -f services/checkout/chaos/network-latency.yaml || true

Script sketch:

#!/usr/bin/env bash
set -euo pipefail
# Parse the long options the CI job passes (--service, --window, --objective)
while [ $# -gt 0 ]; do
  case "$1" in
    --service)   SERVICE=$2; shift 2;;
    --window)    WINDOW=$2;  shift 2;;
    --objective) OBJ=$2;     shift 2;;
    *) echo "Unknown flag: $1" >&2; exit 2;;
  esac
done
PROM="${PROM:-http://prometheus.k8s.svc:9090}"
ERR_RATE=$(curl -sG "$PROM/api/v1/query" --data-urlencode \
  "query=sum(rate(http_requests_total{service=\"$SERVICE\",code=~\"5..\"}[$WINDOW]))/sum(rate(http_requests_total{service=\"$SERVICE\"}[$WINDOW]))" \
  | jq -r '.data.result[0].value[1] // "0"')
OBJ_ERR=$(awk -v o="$OBJ" 'BEGIN{print 1-o}')
BURN=$(awk -v e="$ERR_RATE" -v o="$OBJ_ERR" 'BEGIN{print e/o}')
echo "error_rate=$ERR_RATE burn=$BURN"
# Exit non-zero on fast burn so CI triggers the rollback step
if [ "$(awk -v b="$BURN" 'BEGIN{print (b>2)}')" -eq 1 ]; then
  echo "Abort: fast burn"
  exit 1
fi

Checkpoints:

  1. Experiments are declarative and peer-reviewed.
  2. CI can run, score, and rollback without SSHing into prod.
  3. ArgoCD/Flux tracks drift and gives you audit trails.
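For the dedicated chaos app, a hedged ArgoCD Application sketch (repo URL and path are placeholders). One reasonable choice is to leave automated sync off so chaos applies stay deliberate, via manual sync or CI:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/platform.git  # placeholder
    path: services/checkout/chaos
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: chaos
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
    # No automated sync: experiments apply on purpose, never on drift
```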

7) What good looks like (and what it buys you)

Real outcomes we’ve seen after 4–6 weeks of disciplined chaos (1–2 experiments/week):

  • p99 stability: 18–35% reduction in tail spikes during dependency slowdowns
  • MTTR: 30–50% faster recovery on correlated container restarts (node pressure)
  • Error budget: burn halved during peak season by adding timeouts, jittered retries, and bulkheads
  • Incidents: 20–40% fewer paging incidents attributed to cascading failures
  • Runbooks: responders cut “find the right dashboard” time from 10m to <2m

Business translation: same headcount, fewer Sev1s, safer feature velocity. You pay down the invisible debt that only shows up on Black Friday.


8) Avoid these traps (seen them all)

  • Running chaos in prod first. Do staging with prod-like load, then a tiny prod window.
  • Skipping guardrails. No PDB + chaos == self-inflicted outage.
  • Testing total kills only. Brownouts catch you far more often than hard downs.
  • State corruption. Be cautious with DB chaos—prefer read replicas, feature flags, and validate consistency.
  • No comms plan. Book a change window, notify support, and pin a Slack channel.
  • Not fixing findings. Each report must have owners and backlog items with dates.

If you want a second set of eyes, GitPlumbers helps teams wire this end-to-end—SLOs, guardrails, experiments-as-code, and the political theater with risk and compliance. We’ve done it for fintechs on SOC2 leash and retailers who can’t sneeze during Q4.


Key takeaways

  • Tie chaos to SLOs and error budgets or you’re just breaking stuff for sport.
  • Limit blast radius with traffic controls, budgets, and abort conditions—codified, not tribal.
  • Start with boring failures: latency, packet loss, dependency brownouts, and node evictions.
  • Score every experiment with PromQL and burn-rate math; publish a report, not a vibe.
  • Automate via GitOps and CI so experiments are reproducible and auditable.
  • Use chaos to drive investment decisions: queue limits, timeouts, retries, circuit breakers, and runbooks.

Implementation checklist

  • Define steady-state metrics, SLOs, and error budgets per service.
  • Codify guardrails: `PDB`, `HPA` min replicas, `DestinationRule` outlier detection, and abort conditions.
  • Choose 2–3 first experiments (network latency, dependency timeout, node kill) with clear hypotheses.
  • Instrument Prometheus queries and alerts for latency, error rate, saturation, and burn.
  • Automate experiment apply/rollback in CI and gate with feature flags.
  • Run canary-style in staging with prod-like load before a small, capped-prod window.
  • Track MTTR, change failure rate, and error budget burn deltas after each run.

Questions we hear from teams

Can we run chaos engineering in regulated environments (SOC2, PCI)?
Yes. Treat experiments as change-managed work with tickets, approvals, and auditable Git history. Use staging with prod-like load first, time-bounded prod windows, documented abort conditions, and evidence (dashboards, reports). We’ve implemented this with risk/control mappings for SOC2 and PCI DSS.
How do we avoid data loss when testing databases?
Prefer read replicas, followers, or shadow traffic. Test network faults and failover logic rather than destructive writes. If you must test primary failure, do it with snapshot/point-in-time restore rehearsals and explicit data validation checks post-run.
We’re not on Kubernetes—does this still apply?
Absolutely. Use `Gremlin`, `AWS FIS`, or `Toxiproxy` at the VM/app layer. The principles—SLOs, guardrails, small blast radius, scoring—are platform-agnostic.
What’s the minimum viable chaos program?
Two SLOs per critical service, one guardrail PR (timeouts + circuit breaker), one latency experiment per week, scored and reported. In a month you’ll have findings that justify deeper investment.
How do we get executive buy-in?
Tie experiments to error budget burn, MTTR, and change failure rate. Show a before/after dashboard and one story where chaos prevented an incident. Frame it as risk reduction with auditability, not cowboy testing.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run a guided chaos dry-run with GitPlumbers, or download our Chaos Experiment Template.
