Five Battle‑Tested Performance Playbooks: CPU Hot Paths, DB Latency, GC Pauses, I/O Stall, and Lock Contention
When prod melts, you don’t need platitudes. You need playbooks with hard checkpoints, precise metrics, and the right tools. This is what we actually run at 2 a.m.
You don’t fix performance at 2 a.m. with guesses. You fix it with playbooks, metrics, and a clean rollback.
The reality: performance fires don’t care about your sprint plan
I’ve lost count of the launches where a marketing bump turned into a paging storm. Same story: dashboards missing, everyone guessing, someone proposes a rewrite, and we’re two hours into a war room with zero proof. The teams that survive don’t “move fast”; they run the right playbook fast.
This guide covers the five playbooks we actually use at GitPlumbers when prod melts: CPU hot paths, DB latency, GC pauses, I/O stall, and lock contention. Each has triggers, steps, checkpoints, and tooling. Copy them, adapt them, and stop improvising at 2 a.m.
If it’s not measured, it didn’t happen. If it’s not repeatable, it won’t stick.
1) Baseline and guardrails before touching a line of code
You can’t optimize what you can’t see. Get your baseline and rollback ready.
- SLOs: Define availability and latency objectives. Example: p95 API latency <= 300ms, error rate < 1%.
- Golden signals: latency, traffic, errors, saturation (RED/USE). Make them one-click in Grafana.
- Traceability: OpenTelemetry traces to show where time goes; propagate trace_id in logs (see the sketch after this list).
- Load generator: k6, Vegeta, or wrk against a prod-like staging. Freeze feature flags during tests.
- Release safety: Canary deploys with Argo Rollouts or Flagger; circuit breakers in Envoy/Istio.
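Traceability is the piece teams skip. A minimal sketch of propagating trace_id into logs, assuming the OpenTelemetry Go SDK is already initialized and slog is your logger (the middleware name is ours, not a library API):
// Hypothetical middleware: copy the active span's trace ID into structured logs
// so you can pivot between a slow trace and its log lines during an incident.
package middleware

import (
	"log/slog"
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

func WithTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sc := trace.SpanContextFromContext(r.Context())
		logger := slog.Default()
		if sc.HasTraceID() {
			logger = logger.With("trace_id", sc.TraceID().String())
		}
		logger.Info("request", "method", r.Method, "route", r.URL.Path)
		next.ServeHTTP(w, r)
	})
}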
Quick starters:
# k6 smoke to establish baseline
k6 run --vus 50 --duration 2m --summary-trend-stats="avg,p(90),p(95),p(99)" baseline.js
# Vegeta sustained RPS
echo "GET https://api.example.com/v1/search?q=test" | vegeta attack -rate=200 -duration=60s | vegeta reportPromQL you’ll need:
# p95 latency by route
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
# error rate
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# saturation
sum(node_load1) / count(node_cpu_seconds_total{mode="idle"})
Checkpoint: You have baselines for p50/p95, error rate, CPU, memory, GC pause, DB qps/latency, and iowait. Canary + rollback works. If not, stop here and fix that first. I’ve seen teams shave 5ms off a handler and lose 50% of traffic because their rollback failed. Don’t be that team.
2) Playbook: CPU hot paths (when cores peg and p95 climbs)
Trigger: CPU > 80% for 5+ minutes, runq > 1 per core, flame graphs show a single function dominating.
Steps:
- Triage: scale first to buy time.
  - Kubernetes: bump replicas or enable HPA.
  - Add a short-lived cache (Redis or in-memory) for the hot endpoint.
- Profile: capture real profiles under load (if pprof isn’t exposed yet, see the sketch right after this list).
  - pprof (Go); eBPF via parca-agent or Pyroscope’s eBPF agent for any language.
  - For Node: clinic flame --on-port 3000.
- Optimize: attack the top 1–2 stacks from the flame graph.
  - Remove needless JSON marshalling, precompute templates, vectorize loops, avoid regex backtracking.
- Verify: re-run the load test; canary with a guardrail on p95 latency and CPU.
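The profile step assumes you can pull a CPU profile on demand. If a Go service doesn’t expose pprof yet, a minimal sketch looks like this (port 6060 is an assumption to match the curl example further down):
// Expose net/http/pprof on a private port; the blank import registers
// the /debug/pprof/* handlers on http.DefaultServeMux.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		// Keep the profiling port off the public listener.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... rest of the service starts here ...
}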
HPA example for a quick mitigation:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
Go pprof grab:
# Enable net/http/pprof in your app, then:
curl -s http://api:6060/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof -http=:0 cpu.prof
Checkpoint: CPU < 70% at steady state; p95 latency back under SLO; flame top stack reduced by >50%. If not, consider algorithmic changes or precomputation (e.g., Bloom filters, memoization, denormalized reads).
3) Playbook: Database latency (slow queries, missing indexes, pool saturation)
Trigger: p95 DB latency > SLO, connection pool at 100%, cache hit rate < 90%, or application thread wait on DB > 30%.
Steps:
- Triage: reduce load and fix pool limits.
  - Increase the pgBouncer pool; add a request queue / rate limit at the edge.
  - Turn on a read-through cache for hot keys with a tight TTL.
- Identify hot queries.
  - Postgres: pg_stat_statements for top total_time and mean_time; EXPLAIN (ANALYZE, BUFFERS) on the worst offenders.
- Optimize: add missing indexes, rewrite N+1s (see the batching sketch after this list), narrow SELECT lists, batch operations.
- Verify: look for index-only scans, cache warming, and pool headroom.
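For the N+1 rewrites, the usual move is to collect the keys and fetch them in a single query. A rough Go sketch with database/sql and lib/pq; the orders(user_id, created_at) shape mirrors the index example below, but treat the code as illustrative, not a drop-in:
// One round trip instead of one query per user ID.
package orders

import (
	"context"
	"database/sql"
	"time"

	"github.com/lib/pq"
)

type Order struct {
	UserID    string
	CreatedAt time.Time
}

func ordersForUsers(ctx context.Context, db *sql.DB, userIDs []string) ([]Order, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT user_id, created_at FROM orders WHERE user_id = ANY($1) ORDER BY created_at DESC`,
		pq.Array(userIDs))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []Order
	for rows.Next() {
		var o Order
		if err := rows.Scan(&o.UserID, &o.CreatedAt); err != nil {
			return nil, err
		}
		out = append(out, o)
	}
	return out, rows.Err()
}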
SQL helpers:
-- Top slow queries (on Postgres 13+ the columns are total_exec_time / mean_exec_time)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
-- Locks and waiters
SELECT relation::regclass, mode, granted, pid, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
WHERE NOT granted;
-- Example index
CREATE INDEX CONCURRENTLY idx_orders_user_created
ON orders(user_id, created_at DESC);
-- Validate plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id=$1 ORDER BY created_at DESC LIMIT 20;
Read-through cache snippet (Node + Redis):
async function getUserOrders(userId: string) {
const key = `orders:${userId}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);
const rows = await db.query('SELECT ...');
await redis.setEx(key, 30, JSON.stringify(rows)); // 30s TTL
return rows;
}
PromQL to watch DB saturation:
# Pool saturation
sum(db_pool_in_use) / sum(db_pool_size)
# Cache hit rate (Postgres)
(sum(rate(pg_stat_database_blks_hit[5m])) /
sum(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])))
Checkpoint: p95 DB latency within SLO; pool saturation < 80%; top query mean_time reduced > 40%; cache hit > 90%. If not, consider read replicas for GET-heavy endpoints and introduce CQRS or precomputed materialized views.
4) Playbook: GC pauses and heap pressure (JVM, Go, Node)
Trigger: GC pause p95 > 100ms (JVM/Node), heap growth without return, frequent young-gen promotions, or Go GODEBUG=gctrace=1 shows high STW %.
Steps:
- Triage: reduce allocation rate.
  - Increase batch sizes, reuse buffers (see the sync.Pool sketch after this list), avoid per-request object creation.
- Capture GC telemetry.
  - JVM: enable GC logs and JFR; Go: pprof heap; Node: clinic heapprofiler.
- Tune collectors carefully; test under load.
- Fix the leaks; watch object retention in profiles.
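For the buffer-reuse step, a minimal Go sketch with sync.Pool; the handler and payload shape are invented for illustration:
// Reuse bytes.Buffers across requests instead of allocating one per request,
// which cuts allocation rate and young-gen churn.
package main

import (
	"bytes"
	"sync"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func renderResponse(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	buf.WriteString(`{"data":`)
	buf.Write(payload)
	buf.WriteString(`}`)
	// Copy out before the buffer goes back to the pool.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}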
JVM flags (G1GC sane start):
JAVA_TOOL_OPTIONS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled \
-XX:+UnlockExperimentalVMOptions -XX:+AlwaysPreTouch -XX:+UseStringDeduplication \
-XX:InitiatingHeapOccupancyPercent=30 -Xms4g -Xmx4g"
Go memory tuning:
# Lower GC aggressiveness to reduce CPU if heap is stable
export GOGC=150
# Heap profile under load
curl -s http://svc:6060/debug/pprof/heap > heap.prof
go tool pprof -http=:0 heap.prof
Node heap snapshot with Clinic:
npx clinic heapprofiler -- node server.js
Checkpoint: GC pause p95 under target (e.g., <100ms), stable heap at steady state, allocation rate down >30%. If still red, consider off-heap caches, pooling, or moving large object graphs out of the request path.
5) Playbook: I/O stall and network pain (disks, TLS, proxies)
Trigger: High iowait, EBS credit depletion, upstream timeouts, or tail latency spikes at TLS handshakes.
Steps:
- Triage: raise timeouts and add retries with jitter; enable keepalives (see the Go sketch after this list).
- Measure: node_disk_io_time_seconds_total, node_network_transmit_queue_length, proxy logs.
- Fix: right-size the storage class, tune thread pools, optimize TLS, and set sensible proxy timeouts.
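On the client side, the triage step usually means connection reuse plus retries with jitter. A rough Go sketch with net/http; the limits and backoff values are illustrative, not prescriptive:
// Shared client with keepalive connection reuse, plus full-jitter exponential
// backoff. Only retry requests that are idempotent and safe to replay.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

var client = &http.Client{
	Timeout: 5 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        200,
		MaxIdleConnsPerHost: 50,
		IdleConnTimeout:     90 * time.Second, // keep upstream connections warm
	},
}

func getWithRetry(url string, attempts int) (*http.Response, error) {
	backoff := 50 * time.Millisecond
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("upstream returned %d", resp.StatusCode)
			resp.Body.Close()
		}
		// Full jitter: sleep a random duration up to the current backoff.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		if backoff < 300*time.Millisecond {
			backoff *= 2
		}
	}
	return nil, lastErr
}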
Envoy circuit breaker + timeout example:
cluster:
  name: upstream_api
  connect_timeout: 2s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
      - max_connections: 1024
        max_pending_requests: 2048
        max_retries: 3
  outlier_detection:
    consecutive_5xx: 5
    interval: 5s
    base_ejection_time: 30s
    max_ejection_percent: 50
Nginx keepalive and proxy tuning:
# upstream keepalive also needs a keepalive directive in the upstream {} block
proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_requests 1000;
keepalive_timeout 65s;
proxy_connect_timeout 2s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;
Storage checks on Linux:
iostat -xz 1
# watch r/s, w/s, await, svctm
# EBS volume throughput checks on AWS
aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name VolumeThroughputPercentage ...
Checkpoint: iowait < 5% steady, proxy error rate < 0.5%, p99 handshake < 50ms with TLS session reuse. If not, move hot data to faster media (gp3/io2), shard, or move heavy uploads off sync paths (S3 pre-signed URLs).
6) Playbook: Lock contention and queue backpressure
Trigger: Thread dumps show blocked threads, Go mutex profile hot, DB pg_locks waits, or consumer lag rising with flat producer rate.
Steps:
- Triage: apply backpressure and circuit breakers; shed load gracefully.
- Measure: queue depth, consumer lag, lock wait time, mutex hold time.
- Fix: reduce lock scope, adopt RW locks, batch, or move to idempotent async patterns.
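For “reduce lock scope, adopt RW locks”, a minimal Go sketch (the cache type is invented for illustration):
// Shrink the critical section: readers share an RWMutex, and the expensive
// computation happens outside the write lock.
package main

import "sync"

type rateCache struct {
	mu    sync.RWMutex
	rates map[string]float64
}

func newRateCache() *rateCache {
	return &rateCache{rates: make(map[string]float64)}
}

func (c *rateCache) Get(key string) (float64, bool) {
	c.mu.RLock() // concurrent readers don't block each other
	v, ok := c.rates[key]
	c.mu.RUnlock()
	return v, ok
}

func (c *rateCache) Set(key string, compute func() float64) {
	v := compute() // do the slow work before taking the write lock
	c.mu.Lock()
	c.rates[key] = v
	c.mu.Unlock()
}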
Go mutex profile:
// Enable mutex profiling at startup (import "runtime"); sample ~1 in 5 contention events
runtime.SetMutexProfileFraction(5)
// Then fetch the profile from /debug/pprof/mutex (net/http/pprof must be exposed)
Postgres lock view during incident:
SELECT now(), locktype, relation::regclass, mode, granted, pid, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
ORDER BY granted, relation;
Resiliency policy (Envoy retries with backoff; pair with retry budgets on the cluster):
retry_policy:
  retry_on: connect-failure,reset,5xx
  num_retries: 2
  retry_back_off:
    base_interval: 50ms
    max_interval: 300ms
Checkpoint: queue depth stabilizes; consumer lag decreasing; lock wait times < 10ms p95; error budget burn normalized. If not, split critical sections, introduce partitioning/sharding keys, or move hot workflows to outbox + async workers.
7) Make it stick: dashboards, canaries, and runbooks
Performance fixes rot unless you bake them into operations.
- Dashboards: one Grafana folder per service with RED + resource views, GC, DB, queue metrics.
- SLOs & burn alerts: alert on
burn_rate > 2for 30m windows; page on p99 if it correlates with error budget burn. - GitOps: track HPA, proxy timeouts, and circuit breaker configs in Git; diff them like code.
- Canary: use
Argo Rolloutswith metric analysis on p95 latency and error rate. - Drills: quarterly load drills; record MTTR, rollback time, and change failure rate.
Example Argo Rollouts metric gate:
analysis:
  templates:
    - name: latency-check
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/v1/search"}[2m])) by (le))
        threshold: 0.3 # seconds
  startingStep: 2
Results we typically see when teams adopt these playbooks:
- 30–70% reduction in p95 latency on hot endpoints within 2–4 weeks.
- 20–50% infra cost reduction by right-sizing CPU/memory and storage classes.
- MTTR dropping from hours to minutes because mitigations are scripted and safe.
Write the runbooks. Treat them like code. And run the drills. That’s the difference between a one-off save and a reliable system.
Key takeaways
- Playbooks beat ad‑hoc heroics: define triggers, checkpoints, and rollback upfront.
- Instrument first: if it’s not in Prometheus/Otel, it doesn’t exist for performance work.
- Always separate mitigations (buy time) from optimizations (fix root cause).
- Validate changes with load tests and canaries; never trust local benchmarks alone.
- Codify wins into dashboards, SLOs, and runbooks; make optimizations durable.
Implementation checklist
- Define SLOs and error budgets; agree on p50/p95/p99 targets.
- Establish golden signals dashboards and RED/USE views.
- Stand up a load generator (k6/Vegeta/wrk) and a safe staging environment.
- Create canary + rollback pipeline (Argo Rollouts/Flagger).
- For each playbook, document triggers, first actions, tooling, and success criteria.
- Run drills quarterly; measure MTTR and change failure rate.
Questions we hear from teams
- How do I profile safely in Kubernetes without changing the app?
- Use eBPF-based profilers like Parca or Pyroscope’s eBPF agent. They sample kernel/user stacks without in-process agents. Scope via namespace and limit overhead by sampling at 1–10 Hz during incidents.
- Should we scale first or optimize first?
- Scale first to stop the bleeding (HPA, cache, rate limit). Then optimize with profilers and EXPLAIN ANALYZE. Always separate mitigations (reversible, fast) from optimizations (risky, lasting).
- What if I don’t have a prod-like staging for load tests?
- Carve out a shadow env that mirrors prod topology with 10–20% of prod capacity. Replay sampled traffic with tc mirroring or service mesh traffic split. If that’s impossible, use a small canary with strict abort thresholds.
- How do I make sure a change actually improved things?
- Define success criteria upfront (e.g., p95 < 300ms, CPU < 70%). Run the same workload before/after, compare Prometheus metrics, and only complete rollout if both latency and error rate improve. Record in a runbook PR.
- Any quick JVM GC win without deep tuning?
- Switch to G1GC if you’re on CMS/Parallel, set Xms=Xmx to avoid catastrophic resizing, cap pause target ~200ms, and fix allocation hotspots first. Most wins come from reducing churn, not GC flags.
- Are feature flags safe during performance work?
- Yes—if they’re typed and server-evaluated with consistent hashing. Pair with canary analysis so you can roll back instantly when a flag increases error budget burn.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
