The P99 Killers: Playbooks We Actually Use to Fix DB Hotspots, Thread Pools, and K8s Throttle Loops

If you keep firefighting the same latency spikes, it’s time to standardize the fix. Here are the performance playbooks we install at clients so teams can diagnose and resolve bottlenecks in hours, not quarters.

If you can’t prove the fix with a graph, it didn’t happen.

You don’t need another war room; you need repeatable playbooks

I’ve watched teams burn weeks debating whether the p99 spike is “the network” or “GC” while the answer was sitting in pg_stat_statements. I’ve also seen AI-generated code ship a 12-table join that looked clever and blew the cache. Here’s the kit we install at GitPlumbers when a client asks for performance help: opinionated playbooks for the bottlenecks that account for 80% of incidents—database hotspots, thread pool saturation, cache storms, GC pauses, and Kubernetes CPU throttling.

Each playbook below includes: the symptom, metrics to verify, tooling, a step-by-step fix, and a quick way to prove it worked. Keep this tight, keep it in Git, and stop reinventing the wheel in the middle of an outage.

Pro tip: standardize on RED/USE. RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources. It keeps the team aligned when the alarms go off.

Baseline once; reuse forever

Before you touch app code, get the fundamentals in place.

  • Metrics: Prometheus + Grafana. Collect http_request_duration_seconds_bucket, process_cpu_seconds_total, go_threads/jvm_threads, node_cpu_seconds_total, container_cpu_cfs_throttled_seconds_total.
  • Tracing: OpenTelemetry SDKs exporting to Jaeger/Tempo, with head sampling and trace exemplars in histograms (a minimal Go bootstrap follows this list).
  • Profiling: async-profiler (JVM), pprof (Go), py-spy (Python), clinic flame (Node), plus perf and bpftrace on Linux for kernel-level and eBPF-based insights.
  • Load gen: k6, vegeta, or wrk for controlled experiments.
  • Release control: ArgoCD + Istio/Envoy for canaries, Feature Flags for toggles.
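
To make the tracing bullet concrete, here's a minimal Go bootstrap sketch. It assumes the OTLP gRPC exporter with a collector reachable at otel-collector:4317 and a 10% head-sampling ratio; both are placeholders to adapt, not a prescription:

// Minimal sketch, assuming OTel Go v1.x and an OTLP gRPC collector at
// otel-collector:4317 (both placeholders for your own setup).
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// InitTracing wires a sampled tracer provider; call the returned provider's
// Shutdown on exit to flush spans.
func InitTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp), // batch and export spans off the hot path
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))), // ~10% head sampling
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}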

Baseline dashboard (copy/paste into Grafana):

# Prometheus RED panel (p95/p99) example query
histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

And a cheap, high-signal alert:

# alerts/p99_latency.yml
groups:
- name: service-latency
  rules:
  - alert: P99LatencyHigh
    expr: histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))) > 0.800
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "p99 latency > 800ms on {{ $labels.route }}"
      runbook: "https://gitplumbers.dev/runbooks/p99"

Checkpoint: You can answer “what’s slow” in under 2 minutes with metrics + a trace link.

Playbook 1: Database hot query or lock contention

Symptom: p95/p99 creeps up under load; DB CPU pegged; connections pile up; traces show long spans in SELECT statements.

Verify with metrics

  • db_cpu_utilization, active_connections, locks_waiting.
  • pg_stat_statements (PostgreSQL) or pt-query-digest (MySQL) for top queries by total time.

Tools: EXPLAIN (ANALYZE, BUFFERS), pg_stat_statements, pganalyze, pgBouncer.

Steps

  1. Identify hot statements:
    -- PostgreSQL 13+: top 10 by total execution time
    -- (on 12 and older the columns are total_time / mean_time)
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
  2. Reproduce with parameters from logs; run EXPLAIN ANALYZE:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ... WHERE tenant_id = $1 AND created_at > now() - interval '7 days';
  3. Fix the actual plan problem:
    • Add or refine composite indexes ((tenant_id, created_at)), avoid functions on indexed columns.
    • Replace N+1 fetches with batched queries or JOIN LATERAL.
    • For lock contention: split writes, shorten transactions, move long reads to READ COMMITTED.
  4. Add connection pooling and timeouts (a minimal Go sketch follows this list):
    • pgBouncer in transaction mode.
    • App: statement_timeout=2s, idle_in_transaction_session_timeout=5s.
  5. Cache aggressively where safe: Redis with TTLs for metadata tables.
  6. Canary and validate under load (k6) with the exact route patterns.
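
For step 4, here's a minimal Go sketch of the application side, assuming the pgx stdlib driver behind pgBouncer; the pool sizes and 2-second deadline are illustrative, and the query mirrors the EXPLAIN example above:

// Minimal sketch: bounded pool plus a per-query deadline so one slow plan
// can't hold connections hostage. Assumes the pgx stdlib driver with pgBouncer
// in front of Postgres; numbers are illustrative.
package db

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func OpenPool(dsn string) (*sql.DB, error) {
	pool, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	pool.SetMaxOpenConns(20) // keep it small; pgBouncer multiplexes behind this
	pool.SetMaxIdleConns(10)
	pool.SetConnMaxIdleTime(5 * time.Minute)
	if err := pool.Ping(); err != nil {
		return nil, err
	}
	return pool, nil
}

func RecentOrderCount(ctx context.Context, pool *sql.DB, tenantID string) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second) // app-side analogue of statement_timeout
	defer cancel()
	var n int
	err := pool.QueryRowContext(ctx,
		`SELECT count(*) FROM orders
		 WHERE tenant_id = $1 AND created_at > now() - interval '7 days'`,
		tenantID).Scan(&n)
	return n, err
}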

Proof: p99 down >30%, total execution time for the query cut in half, locks waiting near zero. Update the runbook with the query fingerprint and the index rationale.

Example k6 script stub:

import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
  vus: 50,
  duration: '5m',
  // the threshold asserts the real p95; a per-request check can't do that
  thresholds: { http_req_duration: ['p(95)<400'] },
};
export default function () {
  const res = http.get('https://api.example.com/tenants/123/orders?since=7d');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}

Playbook 2: Thread pool saturation and tail latency

Symptom: p99 spikes during traffic bursts, CPU isn’t maxed, but you see queue growth; 5xx from timeouts.

Verify with metrics

  • Queue length, active threads, and task wait time. For JVM: jvm_threads_*, custom executor_queue_depth.
  • async-profiler/pprof shows time in poll/wait or synchronous I/O.

Tools: async-profiler, pprof, service config for pools (Spring Boot, Netty, Go http.Server), Envoy circuit breaker.

Steps

  1. Instrument queues and timeouts. In Spring Boot:
    # application.properties (Spring Boot 2.3+ property names)
    server.tomcat.threads.max=300
    server.tomcat.accept-count=100
    server.tomcat.connection-timeout=2s
    management.metrics.enable.executor=true
  2. Set budgets and timeouts end-to-end: client, proxy, service, DB.
  3. Prefer async I/O or batch external calls to reduce blocking.
  4. Add a circuit breaker at the edge (Envoy example):
    circuit_breakers:
      thresholds:
      - max_connections: 1024
        max_pending_requests: 512
        max_requests: 1024
        max_retries: 3
    outlier_detection:
      consecutive_5xx: 5
      interval: 5s
      base_ejection_time: 30s
  5. Right-size pools: keep queues short; reject fast with backpressure; set max based on CPU cores and external SLAs (see the Go sketch after this list).
  6. Canary with a shadow read or 10% traffic in Istio and verify tail latency.
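
For steps 2 and 5, here's a minimal Go sketch of fast rejection with a bounded in-flight count and an end-to-end deadline; the 256-slot cap, 2s budget, and /orders route are placeholders to tune against your cores and downstream SLAs:

// Minimal backpressure sketch: bound in-flight work and shed load fast instead
// of queueing into tail latency. Slot count, timeouts, and the /orders route
// are placeholders.
package main

import (
	"context"
	"net/http"
	"time"
)

var sem = make(chan struct{}, 256) // max in-flight requests

func withBackpressure(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // got a slot
			defer func() { <-sem }()
		default: // saturated: reject immediately rather than grow a hidden queue
			http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
			return
		}
		// End-to-end budget for this hop; downstream calls should respect r.Context().
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/orders", withBackpressure(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // real work goes here
	})))
	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  5 * time.Second, // budgets at the edge, not just in code
		WriteTimeout: 10 * time.Second,
	}
	_ = srv.ListenAndServe()
}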

Proof: p99 reduced, queue depth stays flat under burst, error rate falls. Add alerts on queue saturation.

Playbook 3: Cache miss storms and fan-out amplification

Symptom: Random 10x latency; Redis CPU spikes; thundering herd after deploy or cache eviction; traces show 10+ downstream calls.

Verify with metrics

  • Cache hit ratio by key space; downstream call count per request; Redis evicted_keys, rejected_connections.

Tools: Redis with allkeys-lru, request coalescing, singleflight (Go), bulkheads, rate limiting.

Steps

  1. Instrument cache hit/miss and fan-out per route.
  2. Add request coalescing to collapse duplicate concurrent lookups (Go):
    // concurrent misses for the same key collapse into a single loadFromDB call
    var g singleflight.Group // shared across requests
    v, err, _ := g.Do(key, func() (any, error) { return loadFromDB(key) })
  3. Use a two-tier cache: in-process LRU + Redis; warm critical keys on deploy (a fuller Go sketch of steps 2–3 follows this list).
  4. Set TTLs and memory policy:
    # redis.conf
    maxmemory 8gb
    maxmemory-policy allkeys-lru
    hz 10
  5. Cap fan-out: batch calls, add bulkheads (separate pools) for slow dependencies.
  6. Canary with a cold-cache scenario; validate miss penalty and herd containment.
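
Here's a fuller sketch of steps 2–3 in Go, assuming golang.org/x/sync/singleflight and a caller-supplied loadFromDB; the Redis tier and the 30-second TTL are left as placeholders:

// Sketch of request coalescing plus an in-process tier, assuming
// golang.org/x/sync/singleflight; loadFromDB is supplied by the caller and the
// Redis tier / 30s TTL are placeholders.
package cache

import (
	"context"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

type entry struct {
	val     string
	expires time.Time
}

type Cache struct {
	mu    sync.RWMutex
	local map[string]entry // tier 1: in-process; tier 2 (Redis) omitted for brevity
	g     singleflight.Group
}

func New() *Cache { return &Cache{local: make(map[string]entry)} }

func (c *Cache) Get(ctx context.Context, key string,
	loadFromDB func(context.Context, string) (string, error)) (string, error) {
	c.mu.RLock()
	if e, ok := c.local[key]; ok && time.Now().Before(e.expires) {
		c.mu.RUnlock()
		return e.val, nil // hit: zero fan-out
	}
	c.mu.RUnlock()

	// Miss: collapse concurrent loads for the same key into one DB call.
	v, err, _ := c.g.Do(key, func() (any, error) {
		val, err := loadFromDB(ctx, key)
		if err != nil {
			return "", err
		}
		c.mu.Lock()
		c.local[key] = entry{val: val, expires: time.Now().Add(30 * time.Second)}
		c.mu.Unlock()
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}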

Proof: Miss storms disappear; hit ratio > 95% on hot keys; downstream QPS stable under cache flush.

Playbook 4: GC pauses and memory pressure

Symptom: Throughput cliff at higher load; p99 spikes align with GC logs; RSS grows until OOMKilled.

Verify with metrics

  • Heap usage, allocation rate, GC pause time (jvm_gc_pause_seconds_sum), CPU steal, container memory limits.

Tools: async-profiler/Java Flight Recorder (JVM), pprof heap (Go), py-spy/tracemalloc (Python).

Steps (JVM example)

  1. Capture a 2–5 minute profile at load:
    # async-profiler ships an asprof launcher (profiler.sh in older releases)
    asprof -e cpu -d 120 -f /tmp/cpu.html <pid>
    asprof -e alloc -d 120 -f /tmp/alloc.html <pid>
  2. Check GC configuration. For JDK 17+, try G1GC or ZGC for tail latency-sensitive services:
    JAVA_TOOL_OPTIONS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled"
  3. Reduce allocation churn: reuse buffers, avoid String concatenation in hot paths, pre-size collections (a Go version of the same idea follows this list).
  4. Right-size heap vs container limits: leave headroom to avoid kernel reclaim; set -XX:MaxRAMPercentage=70.
  5. Add safeguards: request limits on large payloads; cap batch sizes.
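
The same step-3 advice translated to Go (the playbook already lists pprof heap for Go services): reuse buffers and pre-size before the hot loop. renderIDs and the 8-bytes-per-ID estimate are illustrative:

// Allocation-churn sketch in Go: pooled buffer, pre-sized growth, one string
// build per request. Names and sizes are illustrative.
package hotpath

import (
	"bytes"
	"strconv"
	"sync"
)

// Reuse scratch buffers across requests instead of allocating per call.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func renderIDs(ids []int64) string {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	defer bufPool.Put(b)

	b.Grow(len(ids) * 8) // pre-size: avoid repeated grow-and-copy allocations
	for i, id := range ids {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(strconv.FormatInt(id, 10))
	}
	return b.String() // copies once; the pooled buffer is reused next call
}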

Proof: GC pause p99 < 200ms; allocations per request down; container OOMs eliminated. Save flamegraphs and configs in the repo.

Playbook 5: Kubernetes CPU throttling and wrong autoscaling signals

Symptom: p99 spikes under load even though pods aren’t at 100% CPU; container_cpu_cfs_throttled_seconds_total climbs; HPA flaps or lags.

Verify with metrics

  • Throttling ratio, CPU requests vs limits, HPA event logs; service latency vs replica count.

Tools: kubectl top, kube-state-metrics, HPA/KEDA, custom metrics adapters.

Steps

  1. Check throttling (kubectl describe won't show it; read the cgroup stats or cAdvisor metrics):
    kubectl -n prod top pods
    kubectl -n prod exec <pod> -- cat /sys/fs/cgroup/cpu.stat   # cgroup v2: nr_throttled, throttled_usec
    # Prometheus: rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m])
  2. Align requests/limits: on latency-critical services either drop the CPU limit (keep an honest request) or set limit = request with the static CPU manager so pods get whole cores; a tight limit is what produces CFS throttling in the first place.
  3. Drive HPA with an SLO-correlated metric (p95 latency or RPS per pod), not raw CPU.
  4. Example HPA on custom metric:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: api-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api
      minReplicas: 4
      maxReplicas: 40
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_p95_latency_ms
          target:
            type: AverageValue
            averageValue: "300"  # metric is in milliseconds, so this targets p95 = 300ms per pod
  5. Add Pod Disruption Budgets and max surge for smooth rollouts. Validate with a synthetic load test after each deploy.

Proof: Throttling near zero; stable p95 during autoscale; replicas correlate with latency target, not with CPU noise.

Make it stick: codify, canary, and measure

Playbooks die if they live in someone’s head. Bake them into your repo and your release process.

  1. Codify runbooks: Each playbook gets a README.md, dashboard JSON, alert rules, and example configs in /runbooks/<area>.
  2. GitOps the knobs: Apply infra changes via ArgoCD; track diffs, canary safely, and roll back in seconds.
  3. SLOs and error budgets: Tie each service to p95/p99 and throughput targets. Stop changes that burn the budget.
  4. Regression tests: Add a k6 scenario to CI that asserts latency under a baseline load; fail fast.
  5. Post-incident snapshots: Save flamegraphs, EXPLAIN ANALYZE, HPA events alongside the incident report.
  6. Clean up vibe code: If AI-generated “optimizations” slipped in (I’ve seen some wild WHERE to_char(date) gems), schedule a vibe code cleanup and AI code refactoring pass. Set lint rules that forbid pathological patterns.

Example CI guard using vegeta:

vegeta attack -duration=60s -rate=200 -targets=targets.txt | vegeta report -type=json \
  | jq -e '.latencies["95th"] <= 300000000'  # nanoseconds, i.e. 300ms

Results worth bragging about: at a fintech client, the DB hot query playbook cut checkout p99 from 1.8s to 420ms in two days; at a marketplace, swapping HPA’s CPU target for p95 latency cut MTTR for brownouts by ~50% because scaling reacted to what users felt.

What to standardize right now

  • Dashboards: one per service with RED, one shared infra with USE.
  • Alerts: p99 latency, DB lock waits, thread queue depth, Redis evictions, CPU throttling.
  • Runbooks: the five playbooks above, in Git, referenced in alerts.
  • Tooling: OpenTelemetry tracing everywhere, profilers reachable in prod, k6 scripts in the repo.
  • Rollout: canary by default with Istio and ArgoCD. Canary bake time tied to traffic volume.

If you want a fast start, we’ve templated all of this at GitPlumbers. We install the dashboards, alerts, and runbooks, run a joint incident drill, and leave you with a system the team actually uses instead of a slide deck. When the next p99 spike hits, you’ll know which playbook to open and which knob to turn.


Key takeaways

  • Codify bottleneck fixes as playbooks with steps, metrics, and rollback points.
  • Instrument first: RED/USE + tracing. Guessing is slower than profiling.
  • Treat the database, thread pools, caches, GC, and K8s throttling as first-class failure domains.
  • Use canaries and p95/p99 SLOs to validate improvements; watch saturation, not just averages.
  • Make the fix durable: dashboards, alerts, runbooks, and guardrail configs in Git (GitOps).

Implementation checklist

  • Define SLOs (p95, p99, error budget) and baseline dashboards.
  • Enable tracing and profiling in prod-like traffic paths.
  • Pick the playbook matching the symptom (DB, threads, cache, GC, K8s).
  • Run load tests (k6/vegeta/wrk) to reproduce and measure.
  • Apply the fix behind a feature flag or config gate; canary with ArgoCD/Istio.
  • Validate metrics vs SLOs; capture before/after flamegraphs.
  • Document runbook, create alerts, and bake in regression tests.

Questions we hear from teams

What’s the fastest way to tell if it’s the DB or the app?
Check RED: if latency spikes align with `db_cpu` and `active_connections`, and traces show long DB spans, it’s the DB. Otherwise, capture a flamegraph; if most time is in wait/lock or syscalls, look at thread pools and external calls.
Should we scale up hardware first?
Only as a stopgap. Without fixing the bottleneck, you’ll pay more for the same problem at a higher RPS. Apply the matching playbook, then right-size infra.
How do we keep AI-generated code from regressing performance?
Add performance linters and tests in CI. Flag dangerous SQL patterns, enforce query timeouts, and run a light load test gate. Schedule periodic vibe code cleanup and code rescue reviews.
We use a vendor APM. Do we still need OpenTelemetry?
Yes. OTel gives you portable context propagation and avoids lock-in. Most APMs ingest OTel now; you can switch vendors without ripping out instrumentation.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

