Five Battle‑Tested Performance Playbooks: CPU Hot Paths, DB Latency, GC Pauses, I/O Stall, and Lock Contention

When prod melts, you don’t need platitudes. You need playbooks with hard checkpoints, precise metrics, and the right tools. This is what we actually run at 2 a.m.

You don’t fix performance at 2 a.m. with guesses. You fix it with playbooks, metrics, and a clean rollback.

The reality: performance fires don’t care about your sprint plan

I’ve lost count of the launches where a marketing bump turned into a paging storm. Same story: dashboards missing, everyone guessing, someone proposes a rewrite, and we’re two hours into a war room with zero proof. The teams that survive don’t “move fast”; they run the right playbook fast.

This guide covers the five playbooks we actually use at GitPlumbers when prod melts: CPU hot paths, DB latency, GC pauses, I/O stall, and lock contention. Each has triggers, steps, checkpoints, and tooling. Copy them, adapt them, and stop improvising at 2 a.m.

If it’s not measured, it didn’t happen. If it’s not repeatable, it won’t stick.

1) Baseline and guardrails before touching a line of code

You can’t optimize what you can’t see. Get your baseline and rollback ready.

  • SLOs: Define availability and latency objectives. Example: p95 API latency <= 300ms, error rate < 1%.
  • Golden signals: latency, traffic, errors, saturation (RED/USE). Make them one-click in Grafana.
  • Traceability: OpenTelemetry traces to show where time goes; propagate trace_id in logs.
  • Load generator: k6, Vegeta, or wrk against a prod-like staging. Freeze feature flags during tests.
  • Release safety: Canary deploys with Argo Rollouts or Flagger; circuit breakers in Envoy/Istio.

Quick starters:

# k6 smoke to establish baseline
k6 run --vus 50 --duration 2m --summary-trend-stats="avg,p(90),p(95),p(99)" baseline.js

# Vegeta sustained RPS
echo "GET https://api.example.com/v1/search?q=test" | vegeta attack -rate=200 -duration=60s | vegeta report

PromQL you’ll need:

# p95 latency by route
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))

# error rate
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# saturation: 1-minute load average per core across the fleet
sum(node_load1) / count(node_cpu_seconds_total{mode="idle"})

Checkpoint: You have baselines for p50/p95, error rate, CPU, memory, GC pause, DB qps/latency, and iowait. Canary + rollback works. If not, stop here and fix that first. I’ve seen teams shave 5ms off a handler and lose 50% traffic because their rollback failed. Don’t be that team.

2) Playbook: CPU hot paths (when cores peg and p95 climbs)

Trigger: CPU > 80% for 5+ minutes, runq > 1 per core, flame graphs show a single function dominating.

Steps:

  1. Triage: scale first to buy time.
    • Kubernetes: bump replicas or enable HPA.
    • Add a short-lived cache (Redis/in-memory) for the hot endpoint.
  2. Profile: capture real profiles under load.
    • pprof (Go), or eBPF profilers (Parca Agent, Pyroscope eBPF) for any language.
    • For Node: npx clinic flame -- node server.js (add --on-port 'autocannon localhost:$PORT' to drive load automatically).
  3. Optimize: attack the top 1–2 stacks from the flame.
    • Remove needless JSON marshalling, precompute templates, vectorize loops, avoid regex backtracking.
  4. Verify: re-run load test; canary with a guardrail on p95_latency and CPU.

HPA example for a quick mitigation:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Go pprof grab:

# Enable net/http/pprof in your app, then:
curl -s http://api:6060/debug/pprof/profile?seconds=30 > cpu.prof
go tool pprof -http=:0 cpu.prof
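
If pprof isn’t wired up yet, enabling it is a few lines. A minimal sketch (assumes port 6060 to match the curl above; keep it off the public ingress):

package main

import (
  "log"
  "net/http"
  _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
  go func() {
    // Expose only inside the cluster/network, never on the public ingress.
    log.Println(http.ListenAndServe(":6060", nil))
  }()

  // ... your service bootstrap continues here ...
  select {}
}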

Checkpoint: CPU < 70% at steady state; p95 latency back under SLO; flame top stack reduced by >50%. If not, consider algorithmic changes or precomputation (e.g., Bloom filters, memoization, denormalized reads).
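
If the flame graph keeps pointing at work recomputed on every request, precomputation is often the cheapest fix. A minimal memoization sketch (assumes the result is stable for the process lifetime; buildExpensiveTable is a hypothetical stand-in for the hot work):

package pricing

import "sync"

var (
  tableOnce sync.Once
  table     map[string]float64
)

// buildExpensiveTable stands in for whatever dominated the flame graph.
func buildExpensiveTable() map[string]float64 {
  return map[string]float64{"default": 1.0}
}

// lookup computes the table once and serves every later request from memory.
func lookup(key string) float64 {
  tableOnce.Do(func() {
    table = buildExpensiveTable()
  })
  return table[key]
}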

3) Playbook: Database latency (slow queries, missing indexes, pool saturation)

Trigger: p95 DB latency > SLO, connection pool at 100%, cache hit rate < 90%, or application thread wait on DB > 30%.

Steps:

  1. Triage: reduce load and fix pool limits.
    • Increase pgBouncer pool, add request queue / rate limit at the edge.
    • Turn on read-through cache for hot keys with a tight TTL.
  2. Identify hot queries.
    • Postgres: pg_stat_statements for top total_time and mean_time.
    • EXPLAIN (ANALYZE, BUFFERS) on the worst offenders.
  3. Optimize: add missing indexes, rewrite N+1s (see the batching sketch below), narrow SELECT lists, batch operations.
  4. Verify: look for index-only scans, cache warming, and pool headroom.

SQL helpers:

-- Top slow queries (Postgres 13+: use total_exec_time / mean_exec_time)
SELECT query, calls, total_time, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Locks and waiters
SELECT relation::regclass, mode, granted, pid, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
WHERE NOT granted;

-- Example index
CREATE INDEX CONCURRENTLY idx_orders_user_created
ON orders(user_id, created_at DESC);

-- Validate plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id=$1 ORDER BY created_at DESC LIMIT 20;
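
Step 3’s “rewrite N+1s” usually means collapsing per-row queries into one batched round trip. A sketch using database/sql with lib/pq (table and column names are assumptions borrowed from the index example above):

package orders

import (
  "context"
  "database/sql"

  "github.com/lib/pq"
)

type Order struct {
  ID     int64
  UserID int64
}

// ordersForUsers fetches orders for many users in one query instead of
// issuing one SELECT per user (the classic N+1).
func ordersForUsers(ctx context.Context, db *sql.DB, userIDs []int64) ([]Order, error) {
  rows, err := db.QueryContext(ctx,
    `SELECT id, user_id FROM orders WHERE user_id = ANY($1)`,
    pq.Array(userIDs))
  if err != nil {
    return nil, err
  }
  defer rows.Close()

  var out []Order
  for rows.Next() {
    var o Order
    if err := rows.Scan(&o.ID, &o.UserID); err != nil {
      return nil, err
    }
    out = append(out, o)
  }
  return out, rows.Err()
}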

Read-through cache snippet (Node + Redis):

async function getUserOrders(userId: string) {
  const key = `orders:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const rows = await db.query('SELECT ...');
  await redis.setEx(key, 30, JSON.stringify(rows)); // 30s TTL
  return rows;
}
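
One caveat with read-through caches: when a hot key expires, every in-flight request misses at once and stampedes the database. In Go, golang.org/x/sync/singleflight collapses concurrent misses into one fetch; a sketch (fetchAndCacheOrders is a hypothetical miss handler):

package cacheread

import (
  "context"

  "golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchAndCacheOrders stands in for the DB query + cache write on a miss.
func fetchAndCacheOrders(ctx context.Context, userID string) (any, error) {
  return nil, nil
}

// ordersWithSingleflight lets only one caller per key hit the database,
// even if hundreds of requests miss the cache at the same moment.
func ordersWithSingleflight(ctx context.Context, userID string) (any, error) {
  v, err, _ := group.Do("orders:"+userID, func() (any, error) {
    return fetchAndCacheOrders(ctx, userID)
  })
  return v, err
}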

PromQL to watch DB saturation:

# Pool saturation
sum(db_pool_in_use) / sum(db_pool_size)

# Cache hit rate (Postgres)
(sum(rate(pg_stat_database_blks_hit[5m])) /
 sum(rate(pg_stat_database_blks_hit[5m]) + rate(pg_stat_database_blks_read[5m])))

Checkpoint: p95 DB latency within SLO; pool saturation < 80%; top query mean_time reduced > 40%; cache hit > 90%. If not, consider read replicas for GET-heavy endpoints and introduce CQRS or precomputed materialized views.

4) Playbook: GC pauses and heap pressure (JVM, Go, Node)

Trigger: GC pause p95 > 100ms (JVM/Node), heap growth without return, frequent young-gen promotions, or Go GODEBUG=gctrace=1 shows high STW %.

Steps:

  1. Triage: reduce allocation rate.
    • Increase batch sizes, reuse buffers (see the sync.Pool sketch after this list), avoid per-request object creation.
  2. Capture GC telemetry.
    • JVM: enable GC logs and JFR; Go: pprof heap profiles; Node: clinic heapprofiler.
  3. Tune collectors carefully; test under load.
  4. Fix the leaks; watch object retention in profiles.
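
Step 1’s “reuse buffers” in Go usually means a sync.Pool around the serialization path. A minimal sketch:

package encode

import (
  "bytes"
  "encoding/json"
  "sync"
)

// bufPool recycles bytes.Buffers so each request stops allocating a fresh one.
var bufPool = sync.Pool{
  New: func() any { return new(bytes.Buffer) },
}

// marshal reuses a pooled buffer instead of allocating per call.
func marshal(v any) ([]byte, error) {
  buf := bufPool.Get().(*bytes.Buffer)
  defer func() {
    buf.Reset()
    bufPool.Put(buf)
  }()
  if err := json.NewEncoder(buf).Encode(v); err != nil {
    return nil, err
  }
  // Copy out: the pooled buffer’s memory is reused after Put.
  out := make([]byte, buf.Len())
  copy(out, buf.Bytes())
  return out, nil
}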

JVM flags (G1GC sane start):

JAVA_TOOL_OPTIONS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled \
-XX:+UnlockExperimentalVMOptions -XX:+AlwaysPreTouch -XX:+UseStringDeduplication \
-XX:InitiatingHeapOccupancyPercent=30 -Xms4g -Xmx4g"

Go memory tuning:

# Raise GOGC above the default 100 so GC runs less often (trades heap headroom for CPU)
export GOGC=150
# Heap profile under load
curl -s http://svc:6060/debug/pprof/heap > heap.prof
go tool pprof -http=:0 heap.prof

Node heap profiling with Clinic:

npx clinic heapprofiler -- node server.js

Checkpoint: GC pause p95 under target (e.g., <100ms), stable heap after steady state, allocation rate down >30%. If still red, consider off-heap caches, pooling, or moving large object graphs out of request path.

5) Playbook: I/O stall and network pain (disks, TLS, proxies)

Trigger: High iowait, EBS credit depletion, upstream timeouts, or tail latency spikes at TLS handshakes.

Steps:

  1. Triage: raise timeouts, add retries with jitter (client sketch after this list), and enable keepalives.
  2. Measure: node_disk_io_time_seconds_total, node_network_transmit_queue_length, proxy logs.
  3. Fix: right-size storage class, tune thread pools, optimize TLS, and set sensible proxy timeouts.
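
For step 1 on the client side, give every upstream call an explicit timeout, keep connections warm, and retry transient failures with exponential backoff plus jitter. A minimal Go sketch (the values are illustrative, not recommendations):

package upstream

import (
  "fmt"
  "math/rand"
  "net/http"
  "time"
)

// client bounds every call and reuses connections (keepalive) instead of
// paying a TCP+TLS handshake per request.
var client = &http.Client{
  Timeout: 5 * time.Second,
  Transport: &http.Transport{
    MaxIdleConnsPerHost: 100,
    IdleConnTimeout:     90 * time.Second,
  },
}

// getWithRetry backs off exponentially with jitter so synchronized clients
// don’t stampede a recovering upstream.
func getWithRetry(url string, attempts int) (*http.Response, error) {
  var lastErr error
  backoff := 50 * time.Millisecond
  for i := 0; i < attempts; i++ {
    resp, err := client.Get(url)
    if err == nil && resp.StatusCode < 500 {
      return resp, nil
    }
    if err != nil {
      lastErr = err
    } else {
      resp.Body.Close() // drain 5xx responses before retrying
      lastErr = fmt.Errorf("upstream returned %d", resp.StatusCode)
    }
    jitter := time.Duration(rand.Int63n(int64(backoff)))
    time.Sleep(backoff + jitter)
    backoff *= 2
  }
  return nil, lastErr
}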

Envoy circuit breaker + timeout example:

cluster:
  name: upstream_api
  connect_timeout: 2s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  circuit_breakers:
    thresholds:
      - max_connections: 1024
        max_pending_requests: 2048
        max_retries: 3
  outlier_detection:
    consecutive_5xx: 5
    interval: 5s
    base_ejection_time: 30s
    max_ejection_percent: 50

Nginx keepalive and proxy tuning:

proxy_http_version 1.1;
proxy_set_header Connection "";
keepalive_requests 1000;
keepalive_timeout 65s;
proxy_connect_timeout 2s;
proxy_read_timeout 5s;
proxy_send_timeout 5s;

Storage checks on Linux:

iostat -xz 1
# watch r/s, w/s, r_await/w_await, and %util (svctm is deprecated in recent sysstat)

# EBS volume throughput checks on AWS
aws cloudwatch get-metric-statistics --namespace AWS/EBS --metric-name VolumeThroughputPercentage ...

Checkpoint: iowait < 5% steady, proxy error rate < 0.5%, p99 handshake < 50ms with TLS session reuse. If not, move hot data to faster media (gp3/io2), shard, or move heavy uploads off sync paths (S3 pre-signed URLs).

6) Playbook: Lock contention and queue backpressure

Trigger: Thread dumps show blocked threads, Go mutex profile hot, DB pg_locks waits, or consumer lag rising with flat producer rate.

Steps:

  1. Triage: apply backpressure and circuit breakers; shed load gracefully.
  2. Measure: queue depth, consumer lag, lock wait time, mutex hold time.
  3. Fix: reduce lock scope, adopt RW locks, batch, or move to idempotent async patterns.
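
A minimal sketch for step 3: shrink the critical section and let readers run in parallel with an RWMutex.

package registry

import "sync"

type Registry struct {
  mu   sync.RWMutex
  data map[string]string
}

func NewRegistry() *Registry {
  return &Registry{data: make(map[string]string)}
}

// Get takes a read lock, so concurrent readers never block each other.
func (r *Registry) Get(key string) (string, bool) {
  r.mu.RLock()
  v, ok := r.data[key]
  r.mu.RUnlock()
  return v, ok
}

// Set holds the write lock only for the map update; any slow work
// (I/O, serialization) happens outside the critical section.
func (r *Registry) Set(key, value string) {
  payload := value // do expensive preparation here, before locking
  r.mu.Lock()
  r.data[key] = payload
  r.mu.Unlock()
}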

Go mutex profile:

// Enable mutex contention profiling (report ~1 in 5 contention events)
runtime.SetMutexProfileFraction(5)
// Then pull the profile via net/http/pprof: /debug/pprof/mutex

Postgres lock view during incident:

SELECT now(), locktype, relation::regclass, mode, granted, pid, query
FROM pg_locks JOIN pg_stat_activity USING (pid)
ORDER BY granted, relation;

Resiliency policy (Envoy retries with capped backoff):

retry_policy:
  retry_on: connect-failure,reset,5xx
  num_retries: 2
  retry_back_off:
    base_interval: 50ms
    max_interval: 300ms

Checkpoint: queue depth stabilizes; consumer lag decreasing; lock wait times < 10ms p95; error budget burn normalized. If not, split critical sections, introduce partitioning/sharding keys, or move hot workflows to outbox + async workers.
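
If you land on “outbox + async workers”, the core idea is to commit the state change and its event in the same database transaction and let a worker publish later. A minimal sketch (assuming a hypothetical outbox table with payload and published_at columns):

package checkout

import (
  "context"
  "database/sql"
)

// placeOrder writes the order and its event atomically; nothing in the
// request path waits on a broker, so there is nothing to contend on.
func placeOrder(ctx context.Context, db *sql.DB, userID string, payload []byte) error {
  tx, err := db.BeginTx(ctx, nil)
  if err != nil {
    return err
  }
  defer tx.Rollback() // no-op after a successful Commit

  if _, err := tx.ExecContext(ctx,
    `INSERT INTO orders (user_id) VALUES ($1)`, userID); err != nil {
    return err
  }
  if _, err := tx.ExecContext(ctx,
    `INSERT INTO outbox (payload) VALUES ($1)`, payload); err != nil {
    return err
  }
  // A separate worker polls outbox WHERE published_at IS NULL, publishes to
  // the broker, then marks rows as published.
  return tx.Commit()
}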

7) Make it stick: dashboards, canaries, and runbooks

Performance fixes rot unless you bake them into operations.

  • Dashboards: one Grafana folder per service with RED + resource views, GC, DB, queue metrics.
  • SLOs & burn alerts: alert on burn_rate > 2 over 30m windows (expression below); page on p99 if it correlates with error budget burn.
  • GitOps: track HPA, proxy timeouts, and circuit breaker configs in Git; diff them like code.
  • Canary: use Argo Rollouts with metric analysis on p95 latency and error rate.
  • Drills: quarterly load drills; record MTTR, rollback time, and change failure rate.
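
For the burn-rate alert in the list above, a simple expression (assuming a 99% availability SLO, i.e. a 1% error budget):

# error-budget burn rate over 30m; sustained > 2 burns a 30-day budget in ~15 days
(
  sum(rate(http_requests_total{code=~"5.."}[30m]))
  /
  sum(rate(http_requests_total[30m]))
) / 0.01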

Example Argo Rollouts metric gate:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    # result[0] is the scalar the query returns, in seconds (0.3 = 300ms)
    successCondition: result[0] <= 0.3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/v1/search"}[2m])) by (le))
# Reference it from the Rollout via strategy.canary.analysis
# (templateName: latency-check, startingStep: 2).

Results we typically see when teams adopt these playbooks:

  • 30–70% reduction in p95 latency on hot endpoints within 2–4 weeks.
  • 20–50% infra cost reduction by right-sizing CPU/memory and storage classes.
  • MTTR dropping from hours to minutes because mitigations are scripted and safe.

Write the runbooks. Treat them like code. And run the drills. That’s the difference between a one-off save and a reliable system.


Key takeaways

  • Playbooks beat ad‑hoc heroics: define triggers, checkpoints, and rollback upfront.
  • Instrument first: if it’s not in Prometheus/Otel, it doesn’t exist for performance work.
  • Always separate mitigations (buy time) from optimizations (fix root cause).
  • Validate changes with load tests and canaries; never trust local benchmarks alone.
  • Codify wins into dashboards, SLOs, and runbooks; make optimizations durable.

Implementation checklist

  • Define SLOs and error budgets; agree on p50/p95/p99 targets.
  • Establish golden signals dashboards and RED/USE views.
  • Stand up a load generator (k6/Vegeta/wrk) and a safe staging environment.
  • Create canary + rollback pipeline (Argo Rollouts/Flagger).
  • For each playbook, document triggers, first actions, tooling, and success criteria.
  • Run drills quarterly; measure MTTR and change failure rate.

Questions we hear from teams

How do I profile safely in Kubernetes without changing the app?
Use eBPF-based profilers like Parca or Pyroscope’s eBPF agent. They sample kernel/user stacks without in-process agents. Scope via namespace and limit overhead by sampling at 1–10 Hz during incidents.
Should we scale first or optimize first?
Scale first to stop the bleeding (HPA, cache, rate limit). Then optimize with profilers and EXPLAIN ANALYZE. Always separate mitigations (reversible, fast) from optimizations (risky, lasting).
What if I don’t have a prod-like staging for load tests?
Carve out a shadow env that mirrors prod topology with 10–20% of prod capacity. Replay sampled traffic with tc mirroring or service mesh traffic split. If that’s impossible, use a small canary with strict abort thresholds.
How do I make sure a change actually improved things?
Define success criteria upfront (e.g., p95 < 300ms, CPU < 70%). Run the same workload before/after, compare Prometheus metrics, and only complete rollout if both latency and error rate improve. Record in a runbook PR.
Any quick JVM GC win without deep tuning?
Switch to G1GC if you’re on CMS/Parallel, set Xms=Xmx to avoid catastrophic resizing, cap pause target ~200ms, and fix allocation hotspots first. Most wins come from reducing churn, not GC flags.
Are feature flags safe during performance work?
Yes—if they’re typed and server-evaluated with consistent hashing. Pair with canary analysis so you can roll back instantly when a flag increases error budget burn.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run a performance fire drill with GitPlumbers
Download the performance playbook template (runbook YAML)
