The p95 Kill Kit: Battle‑Tested Playbooks for CPU, DB, GC, and Cache Bottlenecks
Stop vibe‑debugging. Use these focused playbooks to cut p95 latency, tame costs, and ship without fear.
“Performance work is boring on purpose. The playbook is the point.”
Baseline first: speed limits and gauges
If you don’t have SLOs and a shared dashboard, you’re optimizing by vibes. I’ve watched teams shave 5 ms off an endpoint that only runs once a day while the checkout API burns its error budget by lunch.
- Define SLOs per API and job: p95 < 200ms, error_rate < 0.1%, availability > 99.9%.
- Adopt RED + USE:
  - RED (for services): rate, errors, duration
  - USE (for resources): utilization, saturation, errors
- Instrument once, leverage everywhere:
  - Tracing: OpenTelemetry to Jaeger/Tempo/Datadog (a minimal bootstrap is sketched after this list).
  - Metrics: Prometheus + Grafana with histograms for latency.
  - Profiles: pprof, py-spy/rbspy, async-profiler, eBPF (Parca/Pixie).
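If tracing isn't wired yet, a minimal OpenTelemetry bootstrap in Go looks roughly like this; the collector address is an assumption, and Jaeger, Tempo, and Datadog all accept OTLP:
import (
  "context"
  "log"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing exports spans to an OTLP collector; "otel-collector:4317" is a placeholder.
func initTracing(ctx context.Context) func() {
  exp, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector:4317"),
    otlptracegrpc.WithInsecure(),
  )
  if err != nil {
    log.Fatalf("otlp exporter: %v", err)
  }
  tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
  otel.SetTracerProvider(tp)
  return func() { _ = tp.Shutdown(ctx) } // flush spans on shutdown
}

// Usage: ctx, span := otel.Tracer("checkout").Start(ctx, "charge-card"); defer span.End()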
Example Prometheus histogram (Go):
var apiLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
  Namespace: "checkout",
  Name:      "http_latency_seconds",
  Buckets:   prometheus.ExponentialBuckets(0.005, 2, 12),
}, []string{"route", "status"})
Quick OS sanity checks during an incident:
# CPU, runqueue, context switches
mpstat -P ALL 1 | head -50
vmstat 1 5
# I/O wait and disk saturation
iostat -xz 1 5
# Top kernel stack consumers (root cause hunting)
sudo perf top --sort comm,dso
Checkpoint
- You have a single dashboard showing p50/p95/p99, RPS, error rate, CPU, memory, GC, I/O wait, DB connections, cache hit rate.
- You can click from a slow trace into its spans and see db.statement (redacted) and external calls.
CPU hot paths: starve the flame
Symptom: high CPU, p95 blows up with RPS, or the box is “busy” but not doing useful work. I’ve seen teams scale pods 10x before doing a 30‑minute profile that cut CPU in half.
- Capture a profile in prod (30–60s window):
  - Go:
    import _ "net/http/pprof"
    func init() { go http.ListenAndServe(":6060", nil) }
    go tool pprof -http=:0 http://svc:6060/debug/pprof/profile?seconds=30
  - Python:
    pip install py-spy
    sudo py-spy record -p $(pidof yourproc) -o cpu.svg --duration 30
  - JVM:
    ./async-profiler/profiler.sh -d 30 -e cpu -f cpu.svg <PID>
- Read the flamegraph: wide stacks are your budget. Look for JSON marshalling, regexes, crypto, or poorly batched loops.
- Fix patterns that always pay:
  - Replace per-request allocations with pooling (see the sync.Pool sketch after this list).
  - Avoid O(n^2) joins in hot code; precompute maps.
  - Batch I/O and RPCs.
  - In Go, cap goroutines and use singleflight to dedupe.
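To make the pooling bullet concrete, here is a minimal sync.Pool sketch for JSON encode buffers (encodeResponse is a hypothetical helper, not part of the service above); the singleflight snippet that follows covers the dedupe bullet:
import (
  "bytes"
  "encoding/json"
  "sync"
)

// bufPool reuses encode buffers instead of allocating one per request.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func encodeResponse(v any) ([]byte, error) {
  buf := bufPool.Get().(*bytes.Buffer)
  defer func() { buf.Reset(); bufPool.Put(buf) }()
  if err := json.NewEncoder(buf).Encode(v); err != nil {
    return nil, err
  }
  // Copy out so the buffer's backing array can safely go back to the pool.
  out := make([]byte, buf.Len())
  copy(out, buf.Bytes())
  return out, nil
}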
var g singleflight.Group
val, err, _ := g.Do(key, func() (any, error) { return fetchExpensive(), nil })
Metrics & checkpoints
- CPU utilization drops by 30–60% under the same load.
- p95 improves proportionally; no increase in GC pauses or I/O wait.
- Flamegraph top frame shrinks; hot function < 20% of total.
Memory and GC: stop death by pause
Symptom: latency spikes with RPS, GC pauses in logs, RSS climbs until OOMKill. I’ve watched a JVM go from 250ms p95 to 80ms by switching GC and killing a rogue cache.
- Measure first:
  - JVM: JFR / async-profiler for allocation hot spots; export GC logs.
  - Go: GODEBUG=gctrace=1 and pprof /heap.
  - Node: clinic flame or --trace-gc for pause insight.
- Kill allocation hot paths:
  - Reuse buffers (e.g., bytes.Buffer pools in Go; ByteBuffer in Java).
  - Avoid per-request JSON encode/decode when you can cache.
- Tune GC sanely:
  - JVM (G1 as a sane default for mixed workloads):
    -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -Xlog:gc*:stdout:time
  - Go: raise or lower the heap growth target based on headroom (see the sketch after this list):
    GOGC=100 ./server  # start here; consider 150–200 if you have lots of headroom
- Cap caches and watch cardinality (Prometheus labels too): unbounded caches and high‑card metrics are classic RAM eaters.
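Prefer code over environment variables? runtime/debug exposes the same Go levers (the memory limit needs Go 1.19+); the numbers below are illustrative, not recommendations:
import "runtime/debug"

func init() {
  // Same knob as GOGC: let the heap grow 100% over live data before the next cycle.
  debug.SetGCPercent(100)
  // Soft heap cap: the GC works harder as you approach it instead of waiting for the OOMKiller.
  // 3 GiB is a placeholder -- leave headroom below the container limit.
  debug.SetMemoryLimit(3 << 30)
}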
Metrics & checkpoints
- GC total pause time < 1% CPU wall time during load.
- RSS stable within 10% over a 60‑minute soak.
- p95 improves; heap/alloc profiles show top allocators reduced by 50%.
Database hotspots: make Postgres do less
If you’re not running pg_stat_statements, you’re flying blind. Most “database is slow” incidents I’ve handled were 2–3 bad queries and a mis‑sized connection pool.
- Find the real offenders:
  -- Enable once (superuser):
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
  -- Top 10 by total time (on Postgres 13+ the columns are total_exec_time / mean_exec_time)
  SELECT query, calls, total_time, mean_time
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 10;
- Explain with buffers:
  EXPLAIN (ANALYZE, BUFFERS) SELECT ...
  - Look for Seq Scan on large tables, high Rows Removed by Filter, or Bitmap Heap Scan with high rechecks.
- Add the right index without downtime:
  CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_created
    ON orders (user_id, created_at DESC);
- Fix N+1s and batch: fetch related entities with JOIN/IN or application-side batching (e.g., DataLoader); see the sketch after this list.
- Right-size connection pools (avoid oversubscription):
  - HikariCP: maximumPoolSize=20 minimumIdle=5 connectionTimeout=30000
  - Don't set the pool size above num_cores * 2 per app; let Postgres work.
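For the N+1 bullet above, a sketch of application-side batching in Go: one = ANY query instead of one query per user. The Order shape is hypothetical and the array bind assumes lib/pq:
import (
  "context"
  "database/sql"

  "github.com/lib/pq"
)

type Order struct {
  ID     int64
  UserID int64
}

// ordersByUser replaces an N+1 loop (one query per user) with a single batched query.
func ordersByUser(ctx context.Context, db *sql.DB, userIDs []int64) (map[int64][]Order, error) {
  rows, err := db.QueryContext(ctx,
    `SELECT id, user_id FROM orders WHERE user_id = ANY($1)`, pq.Array(userIDs))
  if err != nil {
    return nil, err
  }
  defer rows.Close()
  out := make(map[int64][]Order, len(userIDs))
  for rows.Next() {
    var o Order
    if err := rows.Scan(&o.ID, &o.UserID); err != nil {
      return nil, err
    }
    out[o.UserID] = append(out[o.UserID], o)
  }
  return out, rows.Err()
}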
Metrics & checkpoints
- p95 query latency down 30–70% for offenders.
- Cache hit ratio (Postgres) > 99% for hot tables.
- Lock wait time near zero; active connections < max_connections * 0.7.
Pro tip: put query fingerprints in traces with OpenTelemetry and redact literals. You’ll correlate slow spans to the exact plan.
Cache misses and thundering herd: serve stale, not fail
Every outage postmortem I read includes “cache was cold” or “everyone stampeded origin.” Fixing this earns you cheap, durable wins.
- Measure and expose cache hit rate per keyspace (see the sketch after this list).
- Add stale‑while‑revalidate and circuit breakers at the edge.
- Coalesce duplicate work in the app layer.
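To make that first bullet measurable, a small per-keyspace hit/miss counter in Go (a sketch; wire observe into whatever cache-get helper you already use):
import "github.com/prometheus/client_golang/prometheus"

// cacheOps tracks lookups by keyspace and outcome; hit rate is hit / (hit + miss) in PromQL.
var cacheOps = prometheus.NewCounterVec(prometheus.CounterOpts{
  Namespace: "checkout",
  Name:      "cache_ops_total",
  Help:      "Cache lookups by keyspace and outcome.",
}, []string{"keyspace", "outcome"})

func init() { prometheus.MustRegister(cacheOps) }

// observe records one lookup result; call it from your cache-get helper.
func observe(keyspace string, hit bool) {
  outcome := "miss"
  if hit {
    outcome = "hit"
  }
  cacheOps.WithLabelValues(keyspace, outcome).Inc()
}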
Nginx edge cache with stale‑on‑error:
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:50m inactive=10m use_temp_path=off;

server {
  location /api/ {
    proxy_pass http://origin;
    proxy_cache api_cache;
    proxy_cache_valid 200 1m;
    proxy_cache_use_stale updating error timeout http_500 http_502 http_503 http_504;
    add_header X-Cache-Status $upstream_cache_status;
  }
}
App-side dogpile prevention (Go):
import "golang.org/x/sync/singleflight"
var g singleflight.Group
func cached(key string, fetch func() (string, error)) (string, error) {
if v, ok := redisGet(key); ok { return v, nil }
v, err, _ := g.Do(key, func() (any, error) { return fetch() })
if err == nil { redisSetEX(key, v.(string), 60*time.Second) }
return v.(string), err
}Metrics & checkpoints
- Cache hit rate > 90% for hot keys.
- Origin RPS drops 5–10x during spikes; p95 remains within SLO using stale.
- No more than N simultaneous origin fetches for the same key (coalescing works).
Thread pools and event loops: avoid self‑DoS
I’ve seen Tomcat with maxThreads=5000 and a Postgres pool at 200. The system looked “concurrent” and then face‑planted under backpressure.
- Expose saturation metrics: request queue depth, thread pool utilization, event loop lag.
  - Node event loop lag:
    const { monitorEventLoopDelay } = require('perf_hooks');
    const h = monitorEventLoopDelay();
    h.enable();
    setInterval(() => console.log('eventLoopLagMs', Math.round(h.mean() / 1e6)), 1000);
- Right-size pools and timeouts:
  - Tomcat:
    <Connector port="8080" maxThreads="200" minSpareThreads="20" acceptCount="100" connectionTimeout="20000"/>
  - HikariCP pool <= DB core count * 2; align it with the servlet thread pool.
  - Node: avoid blocking sync work; if you lean on the libuv worker pool (fs, crypto, dns), cap UV_THREADPOOL_SIZE sanely (e.g., 16).
- Apply backpressure early: fail fast with circuit breakers and timeouts.
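A minimal in-process version of that fail-fast bullet in Go: a buffered channel as a semaphore, so excess requests get a quick 503 instead of queueing until every timeout fires. The Istio example below applies the same idea at the mesh layer; the limits here are placeholders.
import (
  "net/http"
  "time"
)

// limit allows at most 200 requests in flight; callers wait at most 50ms for a slot.
var limit = make(chan struct{}, 200)

func withBackpressure(next http.Handler) http.Handler {
  return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
    select {
    case limit <- struct{}{}:
      defer func() { <-limit }()
      next.ServeHTTP(w, r)
    case <-time.After(50 * time.Millisecond):
      // Shed load early; upstream retries and fallbacks handle the rest.
      http.Error(w, "overloaded", http.StatusServiceUnavailable)
    }
  })
}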
Istio example (retries + timeouts):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: checkout }
spec:
  hosts: ["checkout"]
  http:
  - route: [{ destination: { host: checkout } }]
    retries: { attempts: 2, perTryTimeout: 300ms, retryOn: "connect-failure,refused-stream,reset,5xx" }
    timeout: 800ms
Metrics & checkpoints
- Queue depth near zero at steady state; short spikes drain quickly.
- p95 stable under burst tests; no thread starvation or Node lag > 50ms.
- Retries < 5% of requests; no retry storms during incidents.
Tail latency in the mesh: budget your retries
99th percentile is where user happiness goes to die. A few slow replicas or a flaky downstream can torch your error budget via retry amplification.
- Add outlier detection and circuit breakers:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: { name: payments }
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http: { http1MaxPendingRequests: 100 }
      tcp: { maxConnections: 100 }
- Use retry budgets, not blind retries. Track the retry ratio and cap it (see the sketch after this list).
- Precompute and cache tail‑heavy calls (idempotent reads) or move them to async.
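For the retry-budget bullet, a rough client-side sketch in Go: allow retries only while they stay under roughly 10% of primary requests. The counters are cumulative for brevity; a production version would use a sliding window.
import "sync/atomic"

// retryBudget caps retries at ~10% of primary requests.
type retryBudget struct {
  primaries atomic.Int64
  retries   atomic.Int64
}

func (b *retryBudget) RecordPrimary() { b.primaries.Add(1) }

// AllowRetry returns true only while retries stay under the 10% budget.
func (b *retryBudget) AllowRetry() bool {
  if b.retries.Load()*10 >= b.primaries.Load() {
    return false
  }
  b.retries.Add(1)
  return true
}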
Metrics & checkpoints
- p99 within 2x p95; outlier ejections occur during incidents.
- Retry ratio < 2% at steady state, < 10% during failures.
Prove it: canary, load, and lock in the win
The only performance change that matters is one that survives traffic.
- Write a load test that matches your mix (k6):
import http from 'k6/http';
import { sleep, check } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<200'],
    http_req_failed: ['rate<0.001'],
  },
};

export default function () {
  const res = http.get(__ENV.TARGET);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
- Canary with automated analysis (Argo Rollouts):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: success-slo
- Alert on budget burn, not CPU. Prometheus:
- alert: ErrorBudgetBurn
  expr: |
    (sum(rate(http_request_duration_seconds_bucket{le="+Inf",status!~"5.."}[5m]))
      / sum(rate(http_request_duration_seconds_count[5m])))
    < 0.99
  for: 10m
  labels: { severity: page }
  annotations:
    description: "SLO burn detected; investigate the latest deploy or dependency."
- Document the win: before/after metrics, links to profiles and queries, and a rollback plan baked into runbooks.
Metrics & checkpoints
- Canary passes thresholds automatically; rollout completes.
- p95 down X%, infra spend down Y% (CPU/req, DB time/req).
- Regression alerts configured; dashboards updated.
Key takeaways
- Define SLOs first; optimize to budget, not vibes.
- Use a repeatable playbook per bottleneck type; don’t improvise under pager pressure.
- Instrument once, reuse everywhere: tracing + RED/USE metrics + profiles.
- Always verify gains under load with canaries and thresholds—no “it feels faster” sign‑offs.
- Lock in wins with guardrails: alerting, dashboards, and regression tests.
Implementation checklist
- Have p95/p99 latency SLOs per critical endpoint.
- Enable tracing (OpenTelemetry) and RED/USE metrics in prod and staging.
- Profile in prod safely (pprof, eBPF, py-spy/rbspy) with time‑boxed windows.
- Enable pg_stat_statements and sample EXPLAIN ANALYZE for top queries.
- Track cache hit rate and configure stale‑while‑revalidate/circuit breakers.
- Right‑size thread/connection pools; monitor queue depth and saturation.
- Use canary rollouts with automated checks (Argo Rollouts + k6 thresholds).
- Set alerts on budget exhaustion (e.g., 25% of SLO error budget burned in 1h).
Questions we hear from teams
- How do I profile in production safely?
- Time‑box profiles (30–60s), sample not trace every call, and whitelist access to pprof endpoints via mTLS/IP allowlists. eBPF profilers (Parca, Pixie) provide low‑overhead continuous profiles with aggregation.
- We’re on Kubernetes—what K8s settings matter most for perf?
- Right‑size `requests/limits` to avoid CPU throttling, set `podDisruptionBudget` to keep capacity during rollouts, and use HPA on a meaningful metric (RPS or queue depth via custom metrics), not CPU% alone.
- What if AI‑generated code made things worse?
- Seen it. Use the same playbooks: profile, trace, and measure. Then do a vibe code cleanup: remove unnecessary layers, reduce allocations, and add tests around perf‑critical paths. GitPlumbers does code rescue and AI code refactoring with trace‑driven guidance.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
