The Chaos Engineering Playbook We Actually Run: Resilience Tests That Don’t Torch Prod
Practical, low-blast-radius chaos engineering for teams that ship. Metrics, guardrails, configs, and tooling that hold up under fire.
You don’t earn resilience by hoping
I’ve been in too many incident reviews where someone says, “We never thought payments would slow down Redis.” Cool story. Hope is not a fallback strategy. The teams that ship safely run chaos the same way they run deploys: small blast radius, observable, with rollbacks and receipts.
This is the playbook we use at GitPlumbers when a senior team says, “Make our platform harder to kill,” and they don’t want a vendor to light their prod on fire. It’s stepwise, boring on purpose, and it works.
1) Start with the contract: steady state, SLOs, and error budgets
If you can’t define “healthy,” chaos is cosplay. Tie experiments to user-visible outcomes via SLOs.
- Steady-state SLIs (per service):
- `http_request_duration_seconds` p95/p99 latency
- `http_requests_total` error rate (5xx)
- Queue depth / saturation (Kafka lag, Redis ops/sec, DB connections)
- Resource saturation (CPU throttling, memory RSS, GC pause)
- SLO/Objective examples:
- API latency p99 <= 300ms for 99% of requests over 30 days
- Availability 99.9% with a 43m monthly error budget
- Prove it exists with Prometheus/Grafana and alerts. If you’re fancy, generate SLO alerts with Sloth or Nobl9.
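Before reaching for tooling, it's worth sanity-checking the budget math by hand. A quick sketch in plain Python (no dependencies; a 30-day window is assumed):

```python
# Error budget math for an availability SLO.
# Assumes a 30-day month; Sloth/Nobl9 compute this for you in practice.

def error_budget_minutes(objective: float, days: int = 30) -> float:
    """Minutes of allowed unavailability over the window."""
    total_minutes = days * 24 * 60
    return (1 - objective) * total_minutes

def burn_rate(error_rate: float, objective: float) -> float:
    """How fast you're eating the budget: 1.0 = exactly on budget."""
    return error_rate / (1 - objective)

print(f"{error_budget_minutes(0.999):.1f} min/month")  # about 43 minutes
print(f"{burn_rate(0.005, 0.999):.1f}x")               # roughly 5x: paging territory
```

This is where the "43m monthly error budget" above comes from: 0.1% of 43,200 minutes.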
Here’s a simple Sloth config that enforces a latency SLO using Prometheus histograms:
# sloth: checkout latency SLO
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: checkout-latency
spec:
  service: checkout
  slos:
    - name: latency
      objective: 99.0
      labels:
        tier: "critical"
      sli:
        events:
          # Error events: requests slower than 300ms (total minus good)
          errorQuery: |
            sum(rate(http_request_duration_seconds_count{service="checkout"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{service="checkout",le="0.3"}[{{.window}}]))
          totalQuery: sum(rate(http_request_duration_seconds_count{service="checkout"}[{{.window}}]))
      alerting:
        name: CheckoutLatencySLOBurn
        pageAlert:
          labels:
            severity: page
        ticketAlert:
          labels:
            severity: ticket
Checkpoints:
- You have dashboards showing p95/p99, error rate, and saturation per service.
- You can answer: “What’s our error budget this week and how fast are we burning it?”
- Alerting is wired to page on fast burn and ticket on slow burn.
2) Guardrails before grenades: blast radius, budgets, and aborts
Before you inject failure, constrain it. The way we do it:
- Traffic controls:
  - Enable circuit breakers/outlier detection (Istio/Envoy).
  - Cap concurrency and queue sizes. Timeouts > retries > backoff.
- Capacity protections:
  - `HPA` min replicas and `PodDisruptionBudget` to prevent total drain.
  - Priority classes to avoid starving critical control planes.
- Abort conditions:
- Auto-stop if any SLO alert triggers or error budget burn > X.
- Manual stop button in Slack via a runbook script.
Examples:
# Kubernetes PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: checkout
---
# Istio circuit breaker/outlier detection
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http:
        http2MaxRequests: 1000
        maxRequestsPerConnection: 100
Checkpoints:
- PDBs exist for all critical workloads; HPA not set to scale-to-zero.
- Circuit breaker/rate limit policies exist at service mesh or gateway.
- Abort conditions are codified and tested.
If your first chaos run pages everyone, your guardrails are wrong—not the experiment.
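"Codified and tested" means the abort decision itself is executable, not a judgment call mid-incident. A minimal sketch in plain Python; the thresholds mirror the examples above, and in practice you'd feed `metrics` from Prometheus queries rather than a literal dict:

```python
# Abort-decision logic for a chaos run: stop on fast burn or a latency breach.
# Metric names and thresholds here are illustrative, not a standard schema.

def should_abort(metrics: dict,
                 max_burn_rate: float = 2.0,
                 max_p99_ms: float = 600.0) -> tuple[bool, str]:
    if metrics["burn_rate_1h"] > max_burn_rate:
        return True, f"burn rate {metrics['burn_rate_1h']:.1f}x > {max_burn_rate}x"
    if metrics["p99_ms_5m"] > max_p99_ms:
        return True, f"p99 {metrics['p99_ms_5m']:.0f}ms > {max_p99_ms:.0f}ms"
    return False, "within limits"

abort, reason = should_abort({"burn_rate_1h": 3.4, "p99_ms_5m": 280.0})
print(abort, reason)  # aborts on the 3.4x burn before latency is even checked
```

Run this on a loop during the experiment and wire the `True` branch to your rollback command and the Slack stop button.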
3) Pick experiments that match real outages
Start with three that account for 80% of the pain:
- Network latency/packet loss between app and dependency (DB, payments, cache)
- Dependency brownout (timeouts, slow remote) rather than total failure
- Node/Pod eviction (chaotic reschedules, disk pressure)
Tooling that won’t fight you:
- Kubernetes: `Chaos Mesh`, `LitmusChaos`, `Pumba` for container kills
- Edge/VMs/Cloud: `Gremlin`, `AWS Fault Injection Simulator (FIS)`, `Azure Chaos Studio`
- App-level: `Toxiproxy` (brutally reliable), `tc netem`, feature-flag toggles
Examples you can drop in:
# Chaos Mesh: add 200ms latency to checkout pods
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: add-latency-checkout
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: checkout
  delay:
    latency: "200ms"
    correlation: "0.5"
    jitter: "50ms"
  duration: "10m"

# Toxiproxy: add latency in front of payments
docker run -d --name toxiproxy -p 8474:8474 -p 8666:8666 shopify/toxiproxy
toxiproxy-cli create payments -l 0.0.0.0:8666 -u payments.svc:443
toxiproxy-cli toxic add payments -t latency -a latency=500 -a jitter=100

# AWS FIS: stop one EC2 instance in an ASG for 5 minutes (JSON template snippet)
aws fis create-experiment-template \
  --cli-input-json file://fis-stop-one.json
Checkpoints:
- Each experiment has a clear owner and a change window.
- You can revert with a single command.
- You have a shadow load generator (`k6`, `vegeta`) to keep the system hot during tests.
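The FIS command above reads its template from `fis-stop-one.json`. The exact template depends on your account, but a sketch looks roughly like this; every ARN and tag value here is a placeholder you'd replace:

```json
{
  "description": "Stop one EC2 instance in the checkout ASG, restart after 5 minutes",
  "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",
  "targets": {
    "one-asg-instance": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "aws:autoscaling:groupName": "checkout-asg" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stop-instance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "one-asg-instance" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-slo-burn"
    }
  ]
}
```

Note the `stopConditions` block: pointing it at your SLO burn alarm makes FIS abort the experiment automatically, which is exactly the guardrail behavior from section 2.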
4) Design the experiment like a hypothesis, not a dare
Write it down in your repo next to your service. Treat it like test code.
- Hypothesis: “If payments adds 200ms latency, checkout p99 stays < 300ms and error rate < 0.5%.”
- Abort conditions: “If burn rate > 2x over 1h window or p99 > 600ms for 5m, stop.”
- Expected mitigations: Retries backoff, circuit breaker trips < 10% of traffic, queue depth stabilizes.
- Rollback: `kubectl delete -f experiment.yaml`, plus feature flag reset.
- Runbook: Slack channel, pager target, Grafana dashboard link.
Template we actually use:
# experiments/checkout-latency-200ms.yaml
experiment:
  id: CHK-NT-001
  service: checkout
  hypothesis: p99 stays < 300ms; errors < 0.5%
  blastRadius: single pod
  preChecks:
    - dashboard: grafana.com/d/checkout
    - kubectl: kubectl get pods -l app=checkout
  abort:
    - sloBurnRate:
        window: 1h
        threshold: 2
    - latencyP99:
        window: 5m
        thresholdMs: 600
  apply:
    - cmd: kubectl apply -f chaos/network-latency.yaml
  observe:
    - promql: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
    - promql: sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout"}[5m]))
  rollback:
    - cmd: kubectl delete -f chaos/network-latency.yaml
  notes: |
    Expect small spike in retries; ensure memq depth < 80%.
Checkpoints:
- Hypothesis uses concrete thresholds.
- Observability is pre-linked; no hunting for dashboards mid-run.
- Rollback path is rehearsed.
5) Observe and score: PromQL, burn, and MTTR
If you can’t score it, you didn’t learn. The handful of queries we lean on:
- Latency p99:
  histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
- Error rate:
  sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{service="checkout"}[5m]))
- Error budget burn (availability objective 99.9%):
  # burn rate = error_rate / (1 - 0.999)
  (
    sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m]))
  ) / 0.001
- Saturation (CPU throttling, queue depth): instrument and graph. For Kafka:
  max(kafka_consumergroup_lag{consumergroup="checkout"})
- Circuit breaker trips (Istio):
  sum(rate(istio_requests_total{destination_service="payments",response_code=~"5.."}[5m]))

We also capture:
- Time-to-detect (TTD) from injection to first alert
- MTTR from alert to recovery
- Change failure rate of chaos runs (should trend down)
Deliver a small report: hypothesis, results vs thresholds, screenshots, follow-ups. Put it in the repo.
6) Automate with GitOps so it’s boring and repeatable
Chaos that relies on someone’s terminal is chaos you won’t run next quarter. Make it code.
- Repo layout:
  - `services/<svc>/experiments/*.yaml` (hypothesis files)
  - `services/<svc>/chaos/*.yaml` (Chaos Mesh, Toxiproxy configs)
  - `scripts/score-slo.sh` (PromQL scoring)
- GitOps apply with `ArgoCD` or `Flux` using a dedicated `chaos` app/namespace
- CI kicker via GitHub Actions; schedule off-hours canaries; Slack notify; auto-rollback on failure
Example CI job:
name: chaos-experiment
on:
  workflow_dispatch:
  schedule:
    - cron: "0 3 * * 2" # Tuesdays 03:00 UTC
jobs:
  run-chaos:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Context
        run: |
          echo "Running CHK-NT-001"
      - name: Apply experiment
        run: |
          kubectl apply -f services/checkout/chaos/network-latency.yaml
      - name: Warm traffic
        run: |
          k6 run load/checkout-smoke.js
      - name: Score SLO
        run: |
          ./scripts/score-slo.sh -s checkout -w 10m -o 0.99
      - name: Rollback chaos
        if: failure()
        run: |
          kubectl delete -f services/checkout/chaos/network-latency.yaml || true
Script sketch:
#!/usr/bin/env bash
set -euo pipefail

# Usage: score-slo.sh -s <service> -w <window> -o <objective>
while getopts s:w:o: flag; do
  case "$flag" in
    s) SERVICE=$OPTARG;;
    w) WINDOW=$OPTARG;;
    o) OBJ=$OPTARG;;
    *) echo "usage: $0 -s service -w window -o objective" >&2; exit 2;;
  esac
done

PROM="http://prometheus.k8s.svc:9090"
ERR_RATE=$(curl -sG "$PROM/api/v1/query" --data-urlencode \
  "query=sum(rate(http_requests_total{service=\"$SERVICE\",code=~\"5..\"}[$WINDOW]))/sum(rate(http_requests_total{service=\"$SERVICE\"}[$WINDOW]))" \
  | jq -r '.data.result[0].value[1]')
OBJ_ERR=$(awk -v o="$OBJ" 'BEGIN{print 1-o}')
BURN=$(awk -v e="$ERR_RATE" -v o="$OBJ_ERR" 'BEGIN{print e/o}')
echo "error_rate=$ERR_RATE burn=$BURN"

# Fail the run on fast burn. Use an if, not `[ ... ] && { ...; exit 1; }` as
# the last command: that form exits nonzero even when the burn is fine.
if [ "$(awk -v b="$BURN" 'BEGIN{print (b>2)}')" -eq 1 ]; then
  echo "Abort: fast burn"
  exit 1
fi
Checkpoints:
- Experiments are declarative and peer-reviewed.
- CI can run, score, and rollback without SSHing into prod.
- ArgoCD/Flux tracks drift and gives you audit trails.
7) What good looks like (and what it buys you)
Real outcomes we’ve seen after 4–6 weeks of disciplined chaos (1–2 experiments/week):
- p99 stability: 18–35% reduction in tail spikes during dependency slowdowns
- MTTR: 30–50% faster recovery on correlated container restarts (node pressure)
- Error budget: burn halved during peak season by adding timeouts, jittered retries, and bulkheads
- Incidents: 20–40% fewer paging incidents attributed to cascading failures
- Runbooks: responders cut “find the right dashboard” time from 10m to <2m
Business translation: same headcount, fewer Sev1s, safer feature velocity. You pay down the invisible debt that only shows up on Black Friday.
8) Avoid these traps (seen them all)
- Running chaos in prod first. Do staging with prod-like load, then a tiny prod window.
- Skipping guardrails. No `PDB` + chaos == self-inflicted outage.
- Testing total kills only. Brownouts catch you far more often than hard downs.
- State corruption. Be cautious with DB chaos—prefer read replicas, feature flags, and validate consistency.
- No comms plan. Book a change window, notify support, and pin a Slack channel.
- Not fixing findings. Each report must have owners and backlog items with dates.
If you want a second set of eyes, GitPlumbers helps teams wire this end-to-end—SLOs, guardrails, experiments-as-code, and the political theater with risk and compliance. We’ve done it for fintechs on SOC2 leash and retailers who can’t sneeze during Q4.
Key takeaways
- Tie chaos to SLOs and error budgets or you’re just breaking stuff for sport.
- Limit blast radius with traffic controls, budgets, and abort conditions—codified, not tribal.
- Start with boring failures: latency, packet loss, dependency brownouts, and node evictions.
- Score every experiment with PromQL and burn-rate math; publish a report, not a vibe.
- Automate via GitOps and CI so experiments are reproducible and auditable.
- Use chaos to drive investment decisions: queue limits, timeouts, retries, circuit breakers, and runbooks.
Implementation checklist
- Define steady-state metrics, SLOs, and error budgets per service.
- Codify guardrails: `PDB`, `HPA` min replicas, `DestinationRule` outlier detection, and abort conditions.
- Choose 2–3 first experiments (network latency, dependency timeout, node kill) with clear hypotheses.
- Instrument Prometheus queries and alerts for latency, error rate, saturation, and burn.
- Automate experiment apply/rollback in CI and gate with feature flags.
- Run canary-style in staging with prod-like load before a small, capped-prod window.
- Track MTTR, change failure rate, and error budget burn deltas after each run.
Questions we hear from teams
- Can we run chaos engineering in regulated environments (SOC2, PCI)?
- Yes. Treat experiments as change-managed work with tickets, approvals, and auditable Git history. Use staging with prod-like load first, time-bounded prod windows, documented abort conditions, and evidence (dashboards, reports). We’ve implemented this with risk/control mappings for SOC2 and PCI DSS.
- How do we avoid data loss when testing databases?
- Prefer read replicas, followers, or shadow traffic. Test network faults and failover logic rather than destructive writes. If you must test primary failure, do it with snapshot/point-in-time restore rehearsals and explicit data validation checks post-run.
- We’re not on Kubernetes—does this still apply?
- Absolutely. Use `Gremlin`, `AWS FIS`, or `Toxiproxy` at the VM/app layer. The principles—SLOs, guardrails, small blast radius, scoring—are platform-agnostic.
- What’s the minimum viable chaos program?
- Two SLOs per critical service, one guardrail PR (timeouts + circuit breaker), one latency experiment per week, scored and reported. In a month you’ll have findings that justify deeper investment.
- How do we get executive buy-in?
- Tie experiments to error budget burn, MTTR, and change failure rate. Show a before/after dashboard and one story where chaos prevented an incident. Frame it as risk reduction with auditability, not cowboy testing.