The Bottleneck Playbooks I Reach For When Prod Starts Smoking
Stop guessing. Ship a small set of rehearsed performance plays you can run at 3 a.m.—with checkpoints, metrics, and tools that actually work.
The 3 a.m. page is not the time to invent your strategy
I’ve watched smart teams spend an hour debating a theory while p99 latency blew through the SLO and the error budget melted. The teams that win don’t guess—they run a small set of proven plays. This guide gives you those plays: what to measure, which buttons to push, and how to prove you fixed it.
- Audience: senior engineers and leads who’ve been burned by hand-wavy advice.
- Goal: ship a handful of performance playbooks you can reuse across services.
- Scope: API latency spikes, database hotspots, cache stampedes, GC pauses, and Kubernetes resource pain.
Performance isn’t a project; it’s a set of rehearsed plays you can run under pressure.
Start with SLOs and a Baseline You Trust
Before you touch a knob, make performance observable and tied to user impact.
- Define SLOs per critical path.
  - Latency: p50/p95/p99 on key endpoints (e.g., checkout p99 < 800ms).
  - Availability: success rate (5xx and timeouts). Use an error budget to decide when to pause feature work.
  - Capacity: target RPS/QPS and saturation (CPU, memory, I/O, queue lag).
- Instrument the RED/USE core.
- RED: Requests, Errors, Duration. USE: Utilization, Saturation, Errors.
  - Tools: Prometheus + Grafana; OpenTelemetry traces to Jaeger or Tempo.
- Add production-safe profiling paths now.
  - Go: enable net/http/pprof.
  - JVM: JFR, async-profiler.
  - Node: clinic flame or 0x.
- Create an alert that links to a runbook.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-latency
spec:
  groups:
    - name: api-slo
      rules:
        - alert: APIHighLatencyP99
          expr: histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket{service="payments"}[5m]))) > 0.800
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "payments p99 > 800ms for 10m"
            runbook_url: https://git.company.com/runbooks/api-latency.md

Checkpoint: You can answer “What’s the p99 for checkout right now? What’s saturated?” within 60 seconds.
A Reusable Playbook Template You Can Drop in Any Repo
Codify the choreography. Here’s a minimal YAML you can put in runbooks/perf/api-latency.yaml and render in Backstage/Docs.
name: API latency spike (p99)
trigger:
  alert: APIHighLatencyP99
  threshold: p99 > 800ms for 10m
dashboards:
  - grafana: /d/service-latency/payments
  - traces: /jaeger/search?service=payments
steps:
  - check: "Capacity"
    run:
      - kubectl top pods -n prod -l app=payments
      - promql: rate(container_cpu_cfs_throttled_seconds_total{pod=~"payments-.*"}[5m])
    checkpoint: "CPU throttle < 10%"
  - check: "Hot path"
    run:
      - open flamegraph: /profiling/payments
      - inspect: "upstream calls > 1 per request?"
    checkpoint: "No >20% self time in JSON parse"
mitigations:
  - "Temporarily raise CPU limit to 2 cores"
  - "Enable gzip offload in NGINX"
rollback:
  - "Scale back limits and create ticket PERF-123"
owner: "payments-oncall"

Keep each playbook:
- Tied to a specific SLO and alert.
- Stepwise with checkpoints you can prove.
- Containing both a quick mitigation and a follow-up fix.
- With a named owner and rollback.
Playbook 1: API Latency Spike (CPU-bound, lock contention, or I/O)
Symptoms: p99 erupts, RPS steady or increasing, error rate may rise, pods not OOMing.
Steps:
- Validate the alert and scope.
  - Grafana: compare p99 vs RPS, error rate, and saturation.
  - Check container_cpu_cfs_throttled_seconds_total and pod restarts.
- Trace the hot path.
  - Open recent OpenTelemetry traces for the slow endpoint.
  - Look for N+1 calls, external dependency waits, or serialization hotspots.
- Profile before tuning.
  - Go:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers
)

func main() {
    go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()
    // app
}

go tool pprof -http=:0 http://localhost:6060/debug/pprof/profile?seconds=60

  - JVM:

jcmd <pid> JFR.start name=profile settings=profile duration=120s filename=/tmp/app.jfr
./profiler.sh -d 60 -e cpu -f /tmp/cpu.html <pid>  # async-profiler

  - Node:

npm i -g clinic
clinic flame -- node server.js

- Checkpoints to decide action.
  - If CPU throttle ratio > 20% → increase resources.limits.cpu or remove limits temporarily.
  - If time spent in JSON/XML decode > 30% → switch to jsoniter/serde/a streaming parser, or shrink the payload.
  - If an outbound call dominates → introduce a circuit breaker + timeout; cache responses.
  - If locks/mutex hotspots → shard or reduce critical section scope.
- Mitigations you can apply fast.
- Raise CPU limit + HPA min replicas by 1–2x; add a canary before full rollout.
  - Add request rate limits in the API gateway (e.g., Envoy/NGINX) to protect the backend.
  - Enable gzip/brotli at the edge to shrink payloads.
- Verify.
  - Rerun a synthetic load test (k6) and confirm p99 is back under SLO.
k6 script you can keep in-repo:
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = { thresholds: { http_req_duration: ['p(99)<800'] }, vus: 50, duration: '5m' };
export default function () {
const res = http.get(__ENV.TARGET + '/v1/payments?id=123');
check(res, { 'status was 200': (r) => r.status === 200 });
sleep(1);
}

Expected result: p99 returns under SLO and CPU throttle < 10% within 10–15 minutes.
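The lock-sharding checkpoint above is easier to act on with a concrete shape in mind. A sketch of mutex sharding for a hot counter map — the type and names are illustrative, not a library API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16

// shardedCounters spreads a hot read-modify-write across independent
// mutexes so goroutines stop serializing on a single lock.
type shardedCounters struct {
	shards [numShards]struct {
		mu sync.Mutex
		m  map[string]int64
	}
}

func newShardedCounters() *shardedCounters {
	c := &shardedCounters{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int64)
	}
	return c
}

// shard hashes the key so each key always lands on the same mutex.
func (c *shardedCounters) shard(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numShards)
}

func (c *shardedCounters) Inc(key string) {
	s := &c.shards[c.shard(key)]
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *shardedCounters) Get(key string) int64 {
	s := &c.shards[c.shard(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := newShardedCounters()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Inc("checkout") }()
	}
	wg.Wait()
	fmt.Println(c.Get("checkout")) // 100
}
```

The same idea applies to any map-plus-mutex hot spot: keep the critical section tiny and let unrelated keys proceed in parallel.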
Playbook 2: Database Hotspots (slow queries, locks, connection storms)
95% of “mysterious” latency is one slow query or a lock chain. I’ve seen teams scale Postgres to the moon when one missing index was the culprit.
Steps:
- Identify the top offenders.
  - Ensure pg_stat_statements is enabled.

-- On Postgres 13+ the columns are total_exec_time / mean_exec_time
SELECT query, calls, total_time, mean_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

  - Enable auto_explain to catch nasty surprises in the logs.
- Explain, don’t guess.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT /* your slow query */

- Check for seq scans on large tables, nested loops over big sets, or misestimates.
- Check locks and waiting queries.
SELECT
bl.pid AS blocked_pid,
a.usename AS blocked_user,
ka.query AS blocking_query,
now() - ka.query_start AS blocking_duration
FROM pg_stat_activity a
JOIN pg_locks bl ON bl.pid = a.pid AND NOT bl.granted
JOIN pg_locks kl ON kl.transactionid = bl.transactionid AND kl.granted
JOIN pg_stat_activity ka ON ka.pid = kl.pid;

- Checkpoints to decide action.
  - If a query dominates total_time and lacks a selective index → add one; verify with EXPLAIN.
  - If row estimates are off by >10x → run ANALYZE; consider extended statistics.
  - If wait events show LWLock:buffer_content or IO → increase shared_buffers, tune work_mem, or reduce fan-out.
  - If CPU is low but latency is high with many connections → deploy PgBouncer in transaction mode; cap max_connections at sane values.
- Mitigations you can apply fast.
- Create a covering index and a quick migration; verify on a canary.
- Add a read replica and route read-heavy endpoints via feature flag.
- Break long transactions; ensure application sets sane timeouts.
Checkpoint: pg_stat_statements.mean_time for the offender drops by >50% and DB CPU/IO returns to baseline.
Nice-to-have: add a Grafana panel for pg_stat_statements top-N and pg_locks wait trees.
Playbook 3: Cache Misses and Stampedes (Redis/HTTP/CDN)
If your cache hit ratio drops under 80% during peak, your origin is about to eat glass. I’ve watched a single naive invalidation melt a cluster.
Steps:
- Measure the basics.
  - Redis: INFO stats hit ratio, SLOWLOG, and latency doctor.

redis-cli info stats | egrep 'keyspace_hits|keyspace_misses'
redis-cli slowlog get 10

  - CDN/NGINX: add a header to see cache status.

add_header X-Cache $upstream_cache_status;

- Introduce request coalescing to prevent dogpiles.
  - Go example with singleflight:

var g singleflight.Group
v, err, _ := g.Do("user:123", func() (any, error) {
    return expensiveFetch(123)
})

- Add edge or local caching with predictable keys and TTLs.
- NGINX microcache:
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m inactive=10m max_size=1g;
server {
    location /v1/data {
        proxy_cache api_cache;
        proxy_cache_key $request_uri;
        proxy_cache_valid 200 1m;
        add_header X-Cache $upstream_cache_status;
        proxy_pass http://backend;
    }
}

- Checkpoints to decide action.
- If hit ratio < 80% and TTLs are tiny → bump TTLs; add soft TTL + background refresh.
- If stampedes happen on invalidation → gate with a lock (SETNX) or singleflight.
- If Redis CPU > 80% with small objects → use pipelining/batching and avoid large Lua locks in hot paths.
- Verify.
- Cache hit ratio recovers >90%; origin RPS drops; p95 improves proportionally.
Bonus: align HTTP caching headers (Cache-Control, ETag) with CDN behavior; add a dashboard panel for X-Cache breakdown.
Playbook 4: GC Pauses and Memory Leaks (Go/JVM)
When p99 stalls but CPU looks fine, suspect GC or leaks. I’ve seen runaway allocs from innocent JSON marshaling grind services.
Steps:
- Confirm GC involvement.
- Go: run with GC traces.
GODEBUG=gctrace=1 ./service

- JVM: check pause times via JFR or GC logs.
- Profile allocations.
- Go heap profile:
curl -s http://localhost:6060/debug/pprof/heap > /tmp/heap.pb.gz
go tool pprof -http=:0 /tmp/heap.pb.gz

- JVM allocation/CPU:

jcmd <pid> JFR.start name=alloc settings=profile duration=120s filename=/tmp/app.jfr
./profiler.sh -d 60 -e alloc -f /tmp/alloc.html <pid>

- Checkpoints to decide action.
  - If short-lived allocations dominate (>70%) → preallocate buffers, reuse them with sync.Pool (Go), and avoid reflection-heavy codecs.
  - If the heap grows across requests → look for caches without bounds; add size/TTL caps.
  - If GC pauses > 100ms on latency-critical endpoints → reduce object churn; for Go, tune GOGC; for the JVM, evaluate G1/ZGC and right-size the heap.
- Mitigations you can apply fast.
- Reduce response payloads; compress at edge, not in app.
- Cap in-process caches; move to Redis with eviction.
  - For Go, raise GOGC above its default of 100 (e.g., export GOGC=200) to trade memory for fewer GC cycles temporarily.
Verify: GC pause p99 < 50ms, RSS stable, and endpoint p99 back under SLO within one deploy.
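The sync.Pool bullet above, as a minimal sketch — the JSON-ish rendering is just a stand-in for any per-request buffer churn:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles buffers so each request stops allocating a fresh one,
// cutting the short-lived garbage that drives GC frequency.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold a previous request's bytes
	defer bufPool.Put(buf)
	buf.WriteString(`{"data":"`)
	buf.WriteString(payload)
	buf.WriteString(`"}`)
	return buf.String()
}

func main() {
	fmt.Println(render("ok")) // {"data":"ok"}
}
```

The Reset call is the part teams forget: a pool hands back dirty objects, and skipping it turns a GC fix into a data-leak bug.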
Playbook 5: Kubernetes Resource Thrash (limits, throttling, HPA)
More than once, the “CPU is only 40%” line fooled teams—because the kernel was throttling them to death. Watch the right metrics.
Steps:
- Check for throttling and OOMs.
rate(container_cpu_cfs_throttled_seconds_total{container!="",pod!=""}[5m])
/ rate(container_cpu_cfs_periods_total{container!="",pod!=""}[5m]) > 0.2

- Inspect kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} and restarts.
- Compare requests/limits to actual usage.
  - kubectl top pods -n prod and the Grafana Container CPU/Memory dashboard.
- Apply sane autoscaling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "20"

- Checkpoints to decide action.
- If throttling ratio > 10% → raise CPU limit or remove it; ensure requests reflect typical usage.
- If OOMs during spikes → raise memory requests; verify GC/leaks aren’t the root cause.
- If HPA flaps → add stabilization windows and scale on an SLI (RPS/latency) not just CPU.
- Verify.
- Throttling < 5%, restarts flat, p95 stabilizes under SLO.
Note: Consider VPA in recommend mode to right-size over time; commit the recommendations via GitOps, not clicks.
Operationalize It: GitOps, Alerts, and Drills
This is where teams either make it stick or let it rot on Confluence.
- Put everything in Git.
  - Runbooks: runbooks/perf/*.yaml.
  - Alerts: prometheusrules/*.yaml with runbook_url pointing to the repo.
  - Dashboards: JSON exports in grafana/.
  - Load tests: perf/k6/*.js with Makefile targets.
- Wire it to your deploy flow.
  - ArgoCD/Flux applies alerts/dashboards.
  - CI runs a smoke k6 test after canary; blocks the rollout if SLO thresholds fail.
- Run drills quarterly.
- Trigger the alert in staging with a controlled load; time the MTTR.
- Rotate on-call to run the playbook end-to-end.
- Track business metrics.
- Tie SLOs to conversion, abandonment, and cost. E.g., shaving 200ms off checkout p99 lifted conversion 2.1% at one client—real money.
If you don’t rehearse, you won’t execute when it matters. This is SRE 101, but it’s shocking how many orgs skip it.
GitPlumbers note: we audit and harden these playbooks, then pressure-test them with your stack. If you want a second set of eyes, we’re here.
Key takeaways
- Performance work should be a small set of rehearsed plays tied to SLOs and error budgets, not ad hoc heroics.
- Each playbook needs triggers, reproducible diagnostics, checkpoints, mitigations, and a rollback path.
- Profile before you tune: flame graphs and traces beat log-diving and hunches every time.
- Focus on the usual suspects: API hot paths, database hotspots, cache stampedes, GC pauses, and K8s resource throttling.
- Operationalize via GitOps: version runbooks, alerts, and dashboards; test them with synthetic load and chaos drills.
Implementation checklist
- Define user-facing SLOs with p50/p95/p99 and an error budget.
- Instrument RED/USE metrics and wire tracing with OpenTelemetry.
- Create a runbook template with trigger, steps, checkpoints, mitigations, and rollback.
- Author Prometheus alerts with runbook_url annotations for each play.
- Stand up profiling (pprof/JFR/async-profiler) and ensure it’s safe to use in prod.
- Create DB observability: pg_stat_statements, auto_explain, slow query logs.
- Put a cache policy in writing: TTLs, keys, coalescing, and circuit breakers.
- Right-size K8s requests/limits; add HPA (and VPA where safe); watch CPU throttling.
- Store the playbooks, alerts, dashboards, and k6 tests in Git; run drills quarterly.
Questions we hear from teams
- How do I pick SLO thresholds without over-optimizing?
- Start from user journeys and business outcomes. If users tolerate 1s on search but only 500ms on checkout, encode that. Use historical p95/p99 plus error budget math to set realistic targets. Revisit quarterly as traffic and features change.
- Is profiling in production safe?
- Yes, with care. Go pprof CPU at 60s sampling and JFR/async-profiler are low overhead when used briefly. Keep endpoints locked down (localhost/mtls), timebox profiles, and store artifacts off-box.
- Why p99 and not average?
- Averages hide pain. Users experience tail latency. p95 reveals systemic issues; p99 exposes outliers and contention that torpedo conversion. Track all three, alert on p99 for critical paths.
- When should I cache vs. scale?
- Cache if the data is read-heavy and tolerates staleness for a short TTL; scale if data is highly dynamic and correctness is critical. Often it’s both: microcache at the edge plus origin capacity and backpressure.
- What about microservices and cross-service latency?
- Invest in tracing. Add per-hop budgets and circuit breakers (Istio/Envoy). Use bulkheads so one slow dependency doesn’t cascade. A good playbook includes dependency maps and fallback behavior.
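A bulkhead doesn't require a mesh to get started; a minimal Go sketch using a buffered channel as a per-dependency semaphore (the limit of 2 is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// bulkhead caps concurrent calls to one dependency so a slow downstream
// sheds load instead of consuming every worker in the service.
type bulkhead struct {
	slots chan struct{}
}

func newBulkhead(max int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, max)}
}

var errShed = errors.New("bulkhead full: load shed")

// Try runs fn if a slot is free; otherwise it fails fast.
func (b *bulkhead) Try(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		return fn()
	default:
		return errShed
	}
}

func main() {
	b := newBulkhead(2)
	fmt.Println(b.Try(func() error { return nil })) // <nil>: a slot was free

	// Fill both slots with in-flight calls, then watch the third fail fast.
	hold := make(chan struct{})
	started := make(chan struct{}, 2)
	for i := 0; i < 2; i++ {
		go b.Try(func() error { started <- struct{}{}; <-hold; return nil })
	}
	<-started
	<-started
	fmt.Println(b.Try(func() error { return nil })) // bulkhead full: load shed
	close(hold)
}
```

Pair the fast failure with a fallback (cached value, degraded response) so the slow dependency degrades one feature rather than the whole service.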