The p95 Kill Kit: Battle‑Tested Playbooks for CPU, DB, GC, and Cache Bottlenecks

Stop vibe‑debugging. Use these focused playbooks to cut p95 latency, tame costs, and ship without fear.

“Performance work is boring on purpose. The playbook is the point.”

Baseline first: speed limits and gauges

If you don’t have SLOs and a shared dashboard, you’re optimizing by vibes. I’ve watched teams shave 5 ms off an endpoint that only runs once a day while the checkout API burns its error budget by lunch.

  • Define SLOs per API and job: p95 < 200ms, error_rate < 0.1%, availability > 99.9%.
  • Adopt RED + USE:
    • RED (for services): rate, errors, duration
    • USE (for resources): utilization, saturation, errors
  • Instrument once, leverage everywhere:
    • Tracing: OpenTelemetry to Jaeger/Tempo/Datadog.
    • Metrics: Prometheus + Grafana with histograms for latency.
    • Profiles: pprof, py-spy/rbspy, async-profiler, eBPF (Parca/Pixie).

Example Prometheus histogram (Go):

var apiLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Namespace: "checkout",
    Name:      "http_latency_seconds",
    Buckets:   prometheus.ExponentialBuckets(0.005, 2, 12),
}, []string{"route","status"})
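
To see data, observe durations where requests complete; a minimal middleware sketch (route/status label handling is simplified, and metric registration is assumed to happen at startup):

// Remember to prometheus.MustRegister(apiLatency) once at startup.
func instrument(route string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next(w, r)
        // In real code, derive the status label from a wrapping ResponseWriter.
        apiLatency.WithLabelValues(route, "200").Observe(time.Since(start).Seconds())
    }
}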

Quick OS sanity checks during an incident:

# CPU, runqueue, context switches
mpstat -P ALL 1 | head -50
vmstat 1 5

# I/O wait and disk saturation
iostat -xz 1 5

# Top kernel stack consumers (root cause hunting)
sudo perf top --sort comm,dso

Checkpoint

  • You have a single dashboard showing p50/p95/p99, RPS, error rate, CPU, memory, GC, I/O wait, DB connections, cache hit rate.
  • You can click from a slow trace into its spans and see db.statement (redacted) and external calls.

CPU hot paths: starve the flame

Symptom: high CPU, p95 blows up with RPS, or the box is “busy” but not doing useful work. I’ve seen teams scale pods 10x before doing a 30‑minute profile that cut CPU in half.

  1. Capture a profile in prod (30–60s window):
    • Go:
      import _ "net/http/pprof"
      func init() { go http.ListenAndServe(":6060", nil) }
      go tool pprof -http=:0 "http://svc:6060/debug/pprof/profile?seconds=30"
    • Python:
      pip install py-spy
      sudo py-spy record -p $(pidof yourproc) -o cpu.svg --duration 30
    • JVM:
      ./async-profiler/profiler.sh -d 30 -e cpu -f cpu.svg <PID>
  2. Read the flamegraph: wide stacks are your budget. Look for JSON marshalling, regexes, crypto, or poorly batched loops.
  3. Fix patterns that always pay:
    • Replace per‑request allocations with pooling.
    • Avoid O(n^2) joins in hot code; precompute maps (see the lookup‑map sketch after this list).
    • Batch I/O and RPCs.
    • In Go, cap goroutines and use singleflight to dedupe.
var g singleflight.Group
val, err, _ := g.Do(key, func() (any, error) { return fetchExpensive(), nil })
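
For the O(n^2) point, index one side once and probe it instead of nesting loops; a minimal sketch with hypothetical Order and User types:

// Build the lookup map once (O(m)), then join in O(n) instead of O(n*m).
usersByID := make(map[int64]User, len(users))
for _, u := range users {
    usersByID[u.ID] = u
}
for i := range orders {
    orders[i].User = usersByID[orders[i].UserID]
}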

Metrics & checkpoints

  • CPU utilization drops by 30–60% under the same load.
  • p95 improves proportionally; no increase in GC pauses or I/O wait.
  • Flamegraph top frame shrinks; hot function < 20% of total.

Memory and GC: stop death by pause

Symptom: latency spikes with RPS, GC pauses in logs, RSS climbs until OOMKill. I’ve watched a JVM go from 250ms p95 to 80ms by switching GC and killing a rogue cache.

  1. Measure first:
    • JVM: JFR/async-profiler for allocation hot spots; export GC logs.
    • Go: GODEBUG=gctrace=1 and pprof /heap.
    • Node: clinic flame or --trace-gc for pause insight.
  2. Kill allocation hot paths:
    • Reuse buffers (e.g., bytes.Buffer pools in Go; ByteBuffer in Java); see the sync.Pool sketch after this list.
    • Avoid per‑request JSON encode/decode when you can cache.
  3. Tune GC sanely:
    • JVM (G1 as default for mixed workloads):
      -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled -Xlog:gc*:stdout:time
    • Go: raise or lower heap growth target based on headroom:
      GOGC=100 ./server   # start here; consider 150–200 if lots of headroom
  4. Cap caches and watch cardinality (Prometheus labels too): unbounded caches and high‑card metrics are classic RAM eaters.
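
For the buffer reuse in step 2, a minimal Go sketch with sync.Pool; the JSON response-encoding use case is an assumption:

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func encodeResponse(v any) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)
    if err := json.NewEncoder(buf).Encode(v); err != nil {
        return nil, err
    }
    // Copy out: the buffer returns to the pool and may be reused by another request.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out, nil
}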

Metrics & checkpoints

  • GC total pause time < 1% of wall‑clock time during load.
  • RSS stable within 10% over a 60‑minute soak.
  • p95 improves; heap/alloc profiles show top allocators reduced by 50%.

Database hotspots: make Postgres do less

If you’re not running pg_stat_statements, you’re flying blind. Most “database is slow” incidents I’ve handled were 2–3 bad queries and a mis‑sized connection pool.

  1. Find the real offenders:
    -- Enable once (superuser):
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
    
    -- Top 10 by total execution time (Postgres 13+; on 12 and older use total_time/mean_time)
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
  2. Explain with buffers:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ...
    • Look for Seq Scan on large tables, high Rows Removed by Filter, or Bitmap Heap Scan with high rechecks.
  3. Add the right index without downtime:
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_created
    ON orders(user_id, created_at DESC);
  4. Fix N+1s and batch: fetch related entities with JOIN/IN or application‑side batching (e.g., DataLoader); see the batched‑query sketch after this list.
  5. Right‑size connection pools (avoid oversubscription):
    • HikariCP:
      maximumPoolSize=20
      minimumIdle=5
      connectionTimeout=30000
    • Don’t let the combined app pools exceed roughly the DB server’s core count * 2; let Postgres work.
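
For the batching in step 4, the query shape matters more than the library; a minimal Go sketch using database/sql with lib/pq (the Order type, db handle, userIDs, and column names are assumptions):

// One round trip for all users' orders instead of one query per user.
rows, err := db.QueryContext(ctx,
    `SELECT id, user_id, created_at FROM orders WHERE user_id = ANY($1)`,
    pq.Array(userIDs))
if err != nil {
    return nil, err
}
defer rows.Close()

byUser := make(map[int64][]Order)
for rows.Next() {
    var o Order
    if err := rows.Scan(&o.ID, &o.UserID, &o.CreatedAt); err != nil {
        return nil, err
    }
    byUser[o.UserID] = append(byUser[o.UserID], o)
}
return byUser, rows.Err()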

Metrics & checkpoints

  • p95 query latency down 30–70% for offenders.
  • Cache hit ratio (Postgres) > 99% for hot tables.
  • Lock wait time near zero; active connections < max_connections * 0.7.

Pro tip: put query fingerprints in traces with OpenTelemetry and redact literals. You’ll correlate slow spans to the exact plan.
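
If you set the attribute by hand, it is one call on the active span; a minimal Go sketch with the OpenTelemetry API, assuming redactedQuery is a fingerprint you computed yourself (many DB instrumentation libraries can do this for you):

import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

span := trace.SpanFromContext(ctx)
// Fingerprint with literals stripped, e.g. "SELECT ... FROM orders WHERE user_id = ?"
span.SetAttributes(
    attribute.String("db.system", "postgresql"),
    attribute.String("db.statement", redactedQuery),
)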


Cache misses and thundering herd: serve stale, not fail

Every outage postmortem I read includes “cache was cold” or “everyone stampeded origin.” Fixing this earns you cheap, durable wins.

  1. Measure and expose cache hit rate per keyspace (see the counter sketch after this list).
  2. Add stale‑while‑revalidate and circuit breakers at the edge.
  3. Coalesce duplicate work in the app layer.
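
For step 1, a minimal Prometheus counter sketch in Go; the namespace and keyspace labels are assumptions that mirror the histogram from earlier:

var cacheOps = prometheus.NewCounterVec(prometheus.CounterOpts{
    Namespace: "checkout",
    Name:      "cache_ops_total",
    Help:      "Cache lookups by keyspace and result (hit|miss).",
}, []string{"keyspace", "result"})

// On each lookup:
cacheOps.WithLabelValues("product", "hit").Inc()

// Hit rate per keyspace in PromQL:
//   sum(rate(checkout_cache_ops_total{result="hit"}[5m])) by (keyspace)
//     / sum(rate(checkout_cache_ops_total[5m])) by (keyspace)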

Nginx edge cache with stale‑on‑error:

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:50m inactive=10m use_temp_path=off;
server {
  location /api/ {
    proxy_pass http://origin;
    proxy_cache api_cache;
    proxy_cache_valid 200 1m;
    proxy_cache_use_stale updating error timeout http_500 http_502 http_503 http_504;
    add_header X-Cache-Status $upstream_cache_status;
  }
}

App‑side dogpile prevention (Go):

import "golang.org/x/sync/singleflight"
var g singleflight.Group
func cached(key string, fetch func() (string, error)) (string, error) {
  if v, ok := redisGet(key); ok { return v, nil }
  v, err, _ := g.Do(key, func() (any, error) { return fetch() })
  if err == nil { redisSetEX(key, v.(string), 60*time.Second) }
  return v.(string), err
}

Metrics & checkpoints

  • Cache hit rate > 90% for hot keys.
  • Origin RPS drops 5–10x during spikes; p95 remains within SLO using stale.
  • No simultaneous >N fetches for the same key (coalescing works).

Thread pools and event loops: avoid self‑DoS

I’ve seen Tomcat with maxThreads=5000 and a Postgres pool at 200. The system looked “concurrent” and then face‑planted under backpressure.

  1. Expose saturation metrics: request queue depth, thread pool utilization, event loop lag.
    • Node event loop utilization:
      const { monitorEventLoopDelay } = require('perf_hooks');
      const h = monitorEventLoopDelay(); h.enable();
      // h.mean is in nanoseconds; reset each tick for a rolling view.
      setInterval(() => { console.log('eventLoopLagMs', Math.round(h.mean / 1e6)); h.reset(); }, 1000);
  2. Right‑size pools and timeouts:
    • Tomcat:
      <Connector port="8080" maxThreads="200" minSpareThreads="20" acceptCount="100" connectionTimeout="20000"/>
    • HikariCP pool <= DB core count * 2; align with thread pool.
    • Node: avoid blocking sync work; if you lean on the libuv thread pool (fs, dns, crypto), size UV_THREADPOOL_SIZE sanely (e.g., 16).
  3. Apply backpressure early: fail fast with circuit breakers and timeouts.

Istio example (retries + timeouts):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: checkout }
spec:
  hosts: ["checkout"]
  http:
  - route: [{ destination: { host: checkout } }]
    retries:
      attempts: 2
      perTryTimeout: 300ms
      retryOn: connect-failure,refused-stream,reset,5xx
    timeout: 800ms
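
The mesh caps retries and timeouts between services; enforce the same deadline inside the service so work stops once the caller has given up. A minimal Go sketch, assuming you are in an HTTP handler with inbound request r and a hypothetical paymentsURL, with the 800ms budget matching the VirtualService timeout above:

// Propagate a deadline so downstream work is abandoned once the budget is spent.
ctx, cancel := context.WithTimeout(r.Context(), 800*time.Millisecond)
defer cancel()

req, err := http.NewRequestWithContext(ctx, http.MethodGet, paymentsURL, nil)
if err != nil {
    return err
}
resp, err := httpClient.Do(req) // httpClient: a shared *http.Client with its own Timeout set
if err != nil {
    // Deadline exceeded or connection failure: fail fast and let the retry budget decide.
    return err
}
defer resp.Body.Close()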

Metrics & checkpoints

  • Queue depth near zero at steady state; short spikes drain quickly.
  • p95 stable under burst tests; no thread starvation or Node lag > 50ms.
  • Retries < 5% of requests; no retry storms during incidents.

Tail latency in the mesh: budget your retries

99th percentile is where user happiness goes to die. A few slow replicas or a flaky downstream can torch your error budget via retry amplification.

  1. Add outlier detection and circuit breakers:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: { name: payments }
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    connectionPool:
      http: { http1MaxPendingRequests: 100 }
      tcp:  { maxConnections: 100 }
  2. Use retry budgets, not blind retries. Track retry ratio and cap it.
  3. Precompute and cache tail‑heavy calls (idempotent reads) or move them to async.

Metrics & checkpoints

  • p99 within 2x p95; outlier ejections fire during incidents, not at steady state.
  • Retry ratio < 2% at steady state, < 10% during failures.

Prove it: canary, load, and lock in the win

The only performance change that matters is one that survives traffic.

  1. Write a load test that matches your mix (k6):
import http from 'k6/http';
import { sleep, check } from 'k6';
export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_duration: ['p(95)<200'],
    http_req_failed: ['rate<0.001'],
  },
};
export default function () {
  const res = http.get(__ENV.TARGET);
  check(res, { 'status 200': r => r.status === 200 });
  sleep(1);
}
  2. Canary with automated analysis (Argo Rollouts):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: api }
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: success-slo
  3. Alert on budget burn, not CPU. Prometheus:
- alert: ErrorBudgetBurn
  expr: |
    (sum(rate(http_request_duration_seconds_bucket{le="+Inf",status!~"5.."}[5m]))
      / sum(rate(http_request_duration_seconds_count[5m])))
    < 0.99
  for: 10m
  labels: { severity: page }
  annotations:
    description: "SLO burn detected; investigate latest deploy or dependency."
  4. Document the win: before/after metrics, links to profiles and queries, and a rollback plan baked into runbooks.

Metrics & checkpoints

  • Canary passes thresholds automatically; rollout completes.
  • p95 down X%, infra spend down Y% (CPU/req, DB time/req).
  • Regression alerts configured; dashboards updated.



Key takeaways

  • Define SLOs first; optimize to budget, not vibes.
  • Use a repeatable playbook per bottleneck type; don’t improvise under pager pressure.
  • Instrument once, reuse everywhere: tracing + RED/USE metrics + profiles.
  • Always verify gains under load with canaries and thresholds—no “it feels faster” sign‑offs.
  • Lock in wins with guardrails: alerting, dashboards, and regression tests.

Implementation checklist

  • Have p95/p99 latency SLOs per critical endpoint.
  • Enable tracing (OpenTelemetry) and RED/USE metrics in prod and staging.
  • Profile in prod safely (pprof, eBPF, py-spy/rbspy) with time‑boxed windows.
  • Enable pg_stat_statements and sample EXPLAIN ANALYZE for top queries.
  • Track cache hit rate and configure stale‑while‑revalidate/circuit breakers.
  • Right‑size thread/connection pools; monitor queue depth and saturation.
  • Use canary rollouts with automated checks (Argo Rollouts + k6 thresholds).
  • Set alerts on budget exhaustion (e.g., 25% of SLO error budget burned in 1h).

Questions we hear from teams

How do I profile in production safely?
Time‑box profiles (30–60s), sample rather than trace every call, and restrict access to pprof endpoints with mTLS and IP allowlists. eBPF profilers (Parca, Pixie) provide low‑overhead continuous profiles with aggregation.
We’re on Kubernetes—what K8s settings matter most for perf?
Right‑size `requests/limits` to avoid CPU throttling, set `podDisruptionBudget` to keep capacity during rollouts, and use HPA on a meaningful metric (RPS or queue depth via custom metrics), not CPU% alone.
What if AI‑generated code made things worse?
Seen it. Use the same playbooks: profile, trace, and measure. Then do a vibe code cleanup: remove unnecessary layers, reduce allocations, and add tests around perf‑critical paths. GitPlumbers does code rescue and AI code refactoring with trace‑driven guidance.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a 60‑min performance triage
Get the performance playbook checklist (PDF)
