The P99 Killers: Playbooks We Actually Use to Fix DB Hotspots, Thread Pools, and K8s Throttle Loops
If you keep firefighting the same latency spikes, it’s time to standardize the fix. Here are the performance playbooks we install at clients so teams can diagnose and resolve bottlenecks in hours, not quarters.
If you can’t prove the fix with a graph, it didn’t happen.
You don’t need another war room; you need repeatable playbooks
I’ve watched teams burn weeks debating whether the p99 spike is “the network” or “GC” while the answer was sitting in pg_stat_statements. I’ve also seen AI-generated code ship a 12-table join that looked clever and blew the cache. Here’s the kit we install at GitPlumbers when a client asks for performance help: opinionated playbooks for the bottlenecks that account for 80% of incidents—database hotspots, thread pool saturation, cache storms, GC pauses, and Kubernetes CPU throttling.
Each playbook below includes: the symptom, metrics to verify, tooling, a step-by-step fix, and a quick way to prove it worked. Keep this tight, keep it in Git, and stop reinventing the wheel in the middle of an outage.
Pro tip: standardize on RED/USE. RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources. It keeps the team aligned when the alarms go off.
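If you want RED to be one query away mid-incident, precompute it. A minimal Prometheus recording-rule sketch, assuming the `http_request_duration_seconds` histogram used in the baseline below (the `status` label and rule names are our own conventions, not a standard):
# rules/red_recording.yml (sketch)
groups:
  - name: red-recording
    rules:
      - record: route:http_requests:rate5m        # R: request rate per route
        expr: sum by (route) (rate(http_request_duration_seconds_count[5m]))
      - record: route:http_errors:ratio5m         # E: error ratio (assumes a status label on the histogram)
        expr: sum by (route) (rate(http_request_duration_seconds_count{status=~"5.."}[5m])) / sum by (route) (rate(http_request_duration_seconds_count[5m]))
      - record: route:http_request_duration:p99_5m  # D: p99 duration per route
        expr: histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))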
Baseline once; reuse forever
Before you touch app code, get the fundamentals in place.
- Metrics: Prometheus + Grafana. Collect `http_request_duration_seconds_bucket`, `process_cpu_seconds_total`, `go_threads`/`jvm_threads`, `node_cpu_seconds_total`, `container_cpu_cfs_throttled_seconds_total`.
- Tracing: OpenTelemetry SDKs + Jaeger/Tempo for sampling and trace exemplars in histograms.
- Profiling: `async-profiler` (JVM), `pprof` (Go), `py-spy` (Python), `clinic flame` (Node), `perf`/`bpftrace` (Linux) for eBPF-based insights.
- Load gen: `k6`, `vegeta`, or `wrk` for controlled experiments.
- Release control: ArgoCD + Istio/Envoy for canaries, feature flags for toggles.
Baseline dashboard (copy/paste into Grafana):
# Prometheus RED panel (p95/p99) example query
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

And a cheap, high-signal alert:
# alerts/p99_latency.yml
groups:
  - name: service-latency
    rules:
      - alert: P99LatencyHigh
        expr: histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))) > 0.800
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "p99 latency > 800ms on {{ $labels.route }}"
          runbook: "https://gitplumbers.dev/runbooks/p99"

Checkpoint: You can answer “what’s slow” in under 2 minutes with metrics + a trace link.
Playbook 1: Database hot query or lock contention
Symptom: p95/p99 creeps up under load; DB CPU pegged; connections pile up; traces show long spans in SELECT statements.
Verify with metrics
- `db_cpu_utilization`, `active_connections`, `locks_waiting`.
- `pg_stat_statements` (PostgreSQL) or `pt-query-digest` (MySQL) for top queries by total time.
Tools: EXPLAIN (ANALYZE, BUFFERS), pg_stat_statements, pganalyze, pgBouncer.
Steps
- Identify hot statements:
  -- PostgreSQL: top 10 by total time
  SELECT query, calls, total_time, mean_time
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 10;
- Reproduce with parameters from logs; run EXPLAIN ANALYZE:
  EXPLAIN (ANALYZE, BUFFERS) SELECT ... WHERE tenant_id = $1 AND created_at > now() - interval '7 days';
- Fix the actual plan problem:
  - Add or refine composite indexes (`(tenant_id, created_at)`); avoid functions on indexed columns.
  - Replace N+1 fetches with batched queries or `JOIN LATERAL`.
  - For lock contention: split writes, shorten transactions, move long reads to `READ COMMITTED`.
- Add connection pooling and timeouts:
  - pgBouncer in transaction mode.
  - App: `statement_timeout=2s`, `idle_in_transaction_session_timeout=5s`.
- Cache aggressively where safe: Redis with TTLs for metadata tables.
- Canary and validate under load (`k6`) with the exact route patterns.
Proof: p99 down >30%, total_time for the query cut in half, locks waiting near zero. Update runbook with the query fingerprint and the index rationale.
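To keep “locks waiting near zero” true after you leave, alert on it. A hedged sketch assuming postgres_exporter defaults (`pg_stat_activity_max_tx_duration`); swap in whatever lock-wait metric your exporter actually exposes, and treat the threshold as a placeholder:
# alerts/db_contention.yml (sketch — metric names depend on your Postgres exporter)
groups:
  - name: db-contention
    rules:
      - alert: LongRunningTransaction
        # long transactions hold locks and block writers; tune the 60s threshold to your workload
        expr: max by (datname) (pg_stat_activity_max_tx_duration) > 60
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Transaction open > 60s in {{ $labels.datname }} — check pg_stat_statements and lock waits"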
Example k6 script stub:
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = { vus: 50, duration: '5m' };
export default function () {
  const res = http.get('https://api.example.com/tenants/123/orders?since=7d');
  check(res, { 'p95<400ms': (r) => r.timings.duration < 400 });
  sleep(1);
}

Playbook 2: Thread pool saturation and tail latency
Symptom: p99 spikes during traffic bursts, CPU isn’t maxed, but you see queue growth; 5xx from timeouts.
Verify with metrics
- Queue length, active threads, and task wait time. For JVM: `jvm_threads_*`, plus a custom `executor_queue_depth`.
- `async-profiler`/`pprof` shows time in poll/wait or synchronous I/O.
Tools: async-profiler, pprof, service config for pools (Spring Boot, Netty, Go http.Server), Envoy circuit breaker.
Steps
- Instrument queues and timeouts. In Spring Boot:
  # application.properties
  server.tomcat.max-threads=300
  server.tomcat.accept-count=100
  server.connection-timeout=2s
  management.metrics.enable.executor=true
- Set budgets and timeouts end-to-end: client, proxy, service, DB.
- Prefer async I/O or batch external calls to reduce blocking.
- Add a circuit breaker at the edge (Envoy example):
  circuit_breakers:
    thresholds:
      - max_connections: 1024
        max_pending_requests: 512
        max_requests: 1024
        max_retries: 3
  outlier_detection:
    consecutive_5xx: 5
    interval: 5s
    base_ejection_time: 30s
- Right-size pools: keep queues short; reject fast with backpressure; set `max` based on CPU cores and external SLAs.
- Canary with a shadow read or 10% traffic in Istio and verify tail latency.
Proof: p99 reduced, queue depth stays flat under burst, error rate falls. Add alerts on queue saturation.
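The “alerts on queue saturation” part, sketched against the custom `executor_queue_depth` gauge from the instrumentation step (label names and the threshold are placeholders — size them to your pool and reject point):
# alerts/thread_pool.yml (sketch)
groups:
  - name: thread-pool-saturation
    rules:
      - alert: ExecutorQueueSaturated
        # executor_queue_depth is the custom gauge this playbook adds; service is a placeholder label
        expr: max by (service) (executor_queue_depth) > 50
        for: 3m
        labels:
          severity: page
        annotations:
          summary: "Thread pool queue backing up on {{ $labels.service }} — expect tail latency and timeouts"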
Playbook 3: Cache miss storms and fan-out amplification
Symptom: Random 10x latency; Redis CPU spikes; thundering herd after deploy or cache eviction; traces show 10+ downstream calls.
Verify with metrics
- Cache hit ratio by key space; downstream call count per request; Redis `evicted_keys`, `rejected_connections`.
Tools: Redis with allkeys-lru, request coalescing, singleflight (Go), bulkheads, rate limiting.
Steps
- Instrument cache hit/miss and fan-out per route.
- Add request coalescing to collapse duplicate concurrent lookups (Go):
  var g singleflight.Group
  v, err, _ := g.Do(key, func() (any, error) { return loadFromDB(key) })
- Use a two-tier cache: in-process LRU + Redis; warm critical keys on deploy.
- Set TTLs and memory policy:
  # redis.conf
  maxmemory 8gb
  maxmemory-policy allkeys-lru
  hz 10
- Cap fan-out: batch calls, add bulkheads (separate pools) for slow dependencies.
- Canary with a cold-cache scenario; validate miss penalty and herd containment.
Proof: Miss storms disappear; hit ratio > 95% on hot keys; downstream QPS stable under cache flush.
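To catch the next storm early, alert on hit ratio and evictions. A sketch assuming redis_exporter metric names (`redis_keyspace_hits_total`, `redis_keyspace_misses_total`, `redis_evicted_keys_total`) — adjust to whatever your exporter exposes:
# alerts/cache.yml (sketch — assumes redis_exporter metric names)
groups:
  - name: cache-health
    rules:
      - alert: CacheHitRatioLow
        # hit ratio over 5m; the 95% target matches the proof above
        expr: sum(rate(redis_keyspace_hits_total[5m])) / (sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m]))) < 0.95
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "Redis hit ratio below 95% — watch for miss storms and fan-out"
      - alert: RedisEvictionsSpiking
        # placeholder threshold; evictions should be near zero on a right-sized cache
        expr: rate(redis_evicted_keys_total[5m]) > 100
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: "Redis evicting keys — check maxmemory and TTL policy"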
Playbook 4: GC pauses and memory pressure
Symptom: Throughput cliff at higher load; p99 spikes align with GC logs; RSS grows until OOMKilled.
Verify with metrics
- Heap usage, allocation rate, GC pause time (`jvm_gc_pause_seconds_sum`), CPU steal, container memory limits.
Tools: async-profiler/Java Flight Recorder (JVM), pprof heap (Go), py-spy/tracemalloc (Python).
Steps (JVM example)
- Capture a 2–5 minute profile at load:
  async-profiler -e cpu -d 120 -f /tmp/cpu.html <pid>
  async-profiler -e alloc -d 120 -f /tmp/alloc.html <pid>
- Check GC configuration. For JDK 17+, try G1GC or ZGC for tail latency-sensitive services:
  JAVA_TOOL_OPTIONS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled"
- Reduce allocation churn: reuse buffers, avoid String concatenation in hot paths, pre-size collections.
- Right-size heap vs container limits: leave headroom to avoid kernel reclaim; set `-XX:MaxRAMPercentage=70` (see the deployment sketch after this list).
- Add safeguards: request limits on large payloads; cap batch sizes.
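One way to keep the heap/limit relationship honest is to declare both in the same manifest. A minimal sketch (image name and sizes are illustrative, not a recommendation):
# deployment container fragment (illustrative)
containers:
  - name: api
    image: registry.example.com/api:1.2.3   # placeholder image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:MaxRAMPercentage=70"
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "2"       # CPU limit = request to avoid CFS throttling (see Playbook 5)
        memory: 4Gi    # 70% of 4Gi ≈ 2.8Gi heap, leaving headroom for metaspace, buffers, and the kernel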
Proof: GC pause p99 < 200ms; allocations per request down; container OOMs eliminated. Save flamegraphs and configs in the repo.
Playbook 5: Kubernetes CPU throttling and wrong autoscaling signals
Symptom: p99 spikes under load even though pods aren’t at 100% CPU; container_cpu_cfs_throttled_seconds_total climbs; HPA flaps or lags.
Verify with metrics
- Throttling ratio, CPU requests vs limits, HPA event logs; service latency vs replica count.
Tools: kubectl top, kube-state-metrics, HPA/KEDA, custom metrics adapters.
Steps
- Check throttling:
  kubectl -n prod top pods
  kubectl -n prod describe pod <pod> | grep -i throttle -C2
- Align requests/limits: set `limit = request` for CPU on latency-critical services to avoid CFS throttling.
- Drive HPA with an SLO-correlated metric (p95 latency or RPS per pod), not raw CPU.
- Example HPA on custom metric:
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: api-hpa
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: api
    minReplicas: 4
    maxReplicas: 40
    metrics:
      - type: Pods
        pods:
          metric:
            name: http_p95_latency_ms
          target:
            type: AverageValue
            averageValue: "300"  # 300ms, matching the _ms metric
- Add Pod Disruption Budgets and max surge for smooth rollouts (sketch below). Validate with a synthetic load test after each deploy.
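The PDB and surge settings referenced above, as a minimal sketch (names and percentages are illustrative):
# pdb + rollout strategy (illustrative)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 75%          # keep capacity during node drains
  selector:
    matchLabels:
      app: api
---
# in the Deployment spec:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%            # add new pods before removing old ones
    maxUnavailable: 0        # never dip below steady-state capacity mid-rollout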
Proof: Throttling near zero; stable p95 during autoscale; replicas correlate with latency target, not with CPU noise.
Make it stick: codify, canary, and measure
Playbooks die if they live in someone’s head. Bake them into your repo and your release process.
- Codify runbooks: Each playbook gets a README.md, dashboard JSON, alert rules, and example configs in `/runbooks/<area>`.
- GitOps the knobs: Apply infra changes via ArgoCD; track diffs, canary safely, and roll back in seconds.
- SLOs and error budgets: Tie each service to p95/p99 and throughput targets. Stop changes that burn the budget.
- Regression tests: Add a `k6` scenario to CI that asserts latency under a baseline load; fail fast.
- Post-incident snapshots: Save flamegraphs, `EXPLAIN ANALYZE` output, and HPA events alongside the incident report.
- Clean up vibe code: If AI-generated “optimizations” slipped in (I’ve seen some wild `WHERE to_char(date)` gems), schedule a vibe code cleanup and AI code refactoring pass. Set lint rules that forbid pathological patterns.
Example CI guard using vegeta:
vegeta attack -duration=60s -rate=200 -targets=targets.txt | vegeta report -type=json \
| jq -e '.latencies.p95 <= 300000000'  # 300ms

Results worth bragging about: at a fintech client, the DB hot query playbook cut checkout p99 from 1.8s to 420ms in two days; at a marketplace, swapping HPA’s CPU target for p95 latency cut MTTR for brownouts by ~50% because scaling reacted to what users felt.
What to standardize right now
- Dashboards: one per service with RED, one shared infra with USE.
- Alerts: p99 latency, DB lock waits, thread queue depth, Redis evictions, CPU throttling.
- Runbooks: the five playbooks above, in Git, referenced in alerts.
- Tooling: OpenTelemetry tracing everywhere, profilers reachable in prod, `k6` scripts in the repo.
- Rollout: canary by default with Istio and ArgoCD. Canary bake time tied to traffic volume.
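What “canary by default” looks like at the mesh layer — a minimal Istio sketch (hosts and subsets are placeholders; let your rollout controller shift the weights):
# virtualservice (sketch — pair with a DestinationRule defining the stable/canary subsets)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.example.com
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 90
        - destination:
            host: api
            subset: canary
          weight: 10
ArgoCD owns the manifest, so weight changes are diffs you can review and revert, and bake time scales with traffic per the rollout note above.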
If you want a fast start, we’ve templated all of this at GitPlumbers. We install the dashboards, alerts, and runbooks, run a joint incident drill, and leave you with a system the team actually uses instead of a slide deck. When the next p99 spike hits, you’ll know which playbook to open and which knob to turn.
Key takeaways
- Codify bottleneck fixes as playbooks with steps, metrics, and rollback points.
- Instrument first: RED/USE + tracing. Guessing is slower than profiling.
- Treat the database, thread pools, caches, GC, and K8s throttling as first-class failure domains.
- Use canaries and p95/p99 SLOs to validate improvements; watch saturation, not just averages.
- Make the fix durable: dashboards, alerts, runbooks, and guardrail configs in Git (GitOps).
Implementation checklist
- Define SLOs (p95, p99, error budget) and baseline dashboards.
- Enable tracing and profiling in prod-like traffic paths.
- Pick the playbook matching the symptom (DB, threads, cache, GC, K8s).
- Run load tests (k6/vegeta/wrk) to reproduce and measure.
- Apply the fix behind a feature flag or config gate; canary with ArgoCD/Istio.
- Validate metrics vs SLOs; capture before/after flamegraphs.
- Document runbook, create alerts, and bake in regression tests.
Questions we hear from teams
- What’s the fastest way to tell if it’s the DB or the app?
- Check RED: if latency spikes align with `db_cpu` and `active_connections`, and traces show long DB spans, it’s the DB. Otherwise, capture a flamegraph; if most time is in wait/lock or syscalls, look at thread pools and external calls.
- Should we scale up hardware first?
- Only as a stopgap. Without fixing the bottleneck, you’ll pay more for the same problem at a higher RPS. Apply the matching playbook, then right-size infra.
- How do we keep AI-generated code from regressing performance?
- Add performance linters and tests in CI. Flag dangerous SQL patterns, enforce query timeouts, and run a light load test gate. Schedule periodic vibe code cleanup and code rescue reviews.
- We use a vendor APM. Do we still need OpenTelemetry?
- Yes. OTel gives you portable context propagation and avoids lock-in. Most APMs ingest OTel now; you can switch vendors without ripping out instrumentation.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
