The Performance Playbooks That Actually Move the Needle: Tail Latency, N+1 DB, Cache Storms, and Backpressure

Stop guessing. Standardize how your team hunts and kills the bottlenecks that blow up SLOs and burn cash.

Performance isn’t an art project. It’s a checklist and a stopwatch.

Stop guesswork: standardize performance triage

I’ve watched teams spend weeks “optimizing” the wrong thing because they never agreed on what good looks like or how to measure it. The fix: standardize the playbook format and instrument enough to see reality.

  • SLOs: Set p95/p99 targets per endpoint and per key background job. Define error budgets.
  • Methods: Use the RED method for services (Rate, Errors, Duration) and USE for resources (Utilization, Saturation, Errors).
  • Telemetry: OpenTelemetry traces + Prometheus metrics + logs you can actually join. Jaeger/Tempo for traces, Grafana for dashboards.

Quick baseline queries (Prometheus):

# Request rate and tail latency (histogram)
sum(rate(http_server_requests_seconds_count{job="api"}[5m])) by (route)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{job="api"}[5m])) by (le, route))

# Saturation: CPU run queue and GC pause (JVM)
node_load1 / count without(cpu, mode)(node_cpu_seconds_total{mode="system", job="node"})
sum(rate(jvm_gc_pause_seconds_sum{job="api"}[5m])) by (service)

Rule zero: no optimization without a failing SLO and a flame graph or trace that shows where the time goes.

Below are the playbooks we actually use at GitPlumbers when the alarms go off.

Playbook: API tail latency (p99 bursts over SLO)

Symptoms: p99 spikes above target during traffic ramps or specific tenants. Usually a chatty downstream, synchronous I/O, or missing timeouts.

  1. Reproduce and scope
    • Run k6 or wrk against the exact endpoint and tenant payloads. Warm caches.
    • Capture traces: ensure http.server.duration spans with traceparent propagation into downstreams.
    • Checkpoints:
      • p99 reproduced within 10% of production.
      • Trace count > 300 samples across the spike window.
k6 run --vus 50 --duration 5m scripts/api-smoke.js
wrk -t8 -c128 -d120s https://api.example.com/v1/orders
  2. Trace the slow path

    • In Jaeger, sort by duration; inspect critical path for serial downstream calls, missing indexes, or a retry storm.
    • Look for external spans > 100ms and any retry loops.
    • Checkpoints:
      • Identify top 2 contributors to span duration.
  3. Enforce timeouts and circuit breakers

    • Set per-route timeouts, max inflight requests, and outlier detection. Prefer to fail fast and degrade gracefully.
# Istio DestinationRule with circuit breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xx: 5
      interval: 5s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    tls:
      mode: ISTIO_MUTUAL
  4. De-chat and cache

    • Collapse sequential downstream calls; parallelize safe calls; memoize stable lookups for 30–120s.
    • Add ETag or per-tenant cache keys; apply TTL jitter to avoid thundering herds.
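The "parallelize safe calls" advice above can be sketched with plain goroutines. This is a minimal illustration, assuming two independent downstreams; fetchUser and fetchOrders are hypothetical stand-ins, and in real handlers you'd propagate context and handle errors per call:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchUser and fetchOrders stand in for independent downstream calls
// that used to run back-to-back on the request path.
func fetchUser(id int) string {
	time.Sleep(20 * time.Millisecond) // simulated downstream latency
	return fmt.Sprintf("user-%d", id)
}

func fetchOrders(id int) string {
	time.Sleep(20 * time.Millisecond)
	return fmt.Sprintf("orders-%d", id)
}

// fetchProfile runs both calls concurrently, so the handler pays
// max(latencies) instead of their sum.
func fetchProfile(id int) (user, orders string) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser(id) }()
	go func() { defer wg.Done(); orders = fetchOrders(id) }()
	wg.Wait()
	return user, orders
}

func main() {
	u, o := fetchProfile(42)
	fmt.Println(u, o) // user-42 orders-42
}
```

Only parallelize calls with no data dependency between them; a call that needs the previous call's result stays serial by definition.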
  5. OS and runtime sanity

    • Ensure somaxconn and accept queue aren’t capping throughput; thread pools sized to cores.
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.ip_local_port_range="10240 65535"

Acceptance

  • p99 <= SLO for 30 minutes at production QPS (+20% headroom)
  • Error rate flat; no new 5xx; downstream doesn’t exceed its SLO
  • Saturation (CPU run queue, thread pool queue) < 0.8

Playbook: N+1 and slow SQL (Postgres/MySQL)

Symptoms: p95 queries > 50–100ms, DB CPU spikes, lock waits. Common in ORMs (Rails, Django, Hibernate) under load.

  1. Turn on visibility
-- Postgres: enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
ALTER SYSTEM SET pg_stat_statements.max = 10000;
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();
  • Run EXPLAIN (ANALYZE, BUFFERS) for the top 5 queries in pg_stat_statements.
  • For MySQL, use pt-query-digest and SHOW PROCESSLIST.
  2. Kill N+1

    • Replace per-row lookups with JOIN + WHERE IN (...) or batch fetch.
    • In ORMs, use includes/eager_load (Rails), select_related/prefetch_related (Django).
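If you're building the batch query by hand rather than through an ORM, the shape is one parameterized IN clause instead of N round-trips. A sketch, assuming Postgres-style placeholders; the table and column names are illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchQuery collapses N per-row lookups into a single
// WHERE id IN (...) query with positional parameters.
func buildBatchQuery(ids []int) (string, []interface{}) {
	placeholders := make([]string, len(ids))
	args := make([]interface{}, len(ids))
	for i, id := range ids {
		placeholders[i] = fmt.Sprintf("$%d", i+1) // $1, $2, ...
		args[i] = id
	}
	query := "SELECT id, account_id, total FROM orders WHERE id IN (" +
		strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	q, args := buildBatchQuery([]int{7, 8, 9})
	fmt.Println(q)         // ... WHERE id IN ($1, $2, $3)
	fmt.Println(len(args)) // 3
}
```

Cap the batch size (a few hundred ids) so you don't trade N+1 for one pathological query plan.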
  3. Index the access path, not fantasies

-- Typical fix: composite index on filter + order by
CREATE INDEX CONCURRENTLY idx_orders_acct_created ON orders (account_id, created_at DESC);
  • Avoid over-indexing (write amplification). Verify with EXPLAIN that your index is used.
  4. Pooling and timeouts
    • Right-size connection pools: 4–8x CPU cores for pgbouncer (transaction pooling), 2–4x per app.
    • Set statement timeouts to protect the DB.
ALTER DATABASE appdb SET statement_timeout = '2s';

Acceptance

  • p95 query latency < 25ms for hot paths; lock wait time near zero
  • DB CPU < 70%, disk IOPS stable; slow query log quiet
  • App RPS unchanged or higher, error rate flat

Playbook: Cache miss storms and hot keys (Redis/Memcached)

Symptoms: sudden traffic to origin, Redis CPU single-thread pegged, latency spikes, or a single key dominates CPU.

  1. Inspect health
redis-cli INFO stats | egrep 'hit|evicted'
redis-cli --latency
redis-cli --latency-history
  • Monitor hit ratio, evictions, instantaneous_ops_per_sec, and top keys (use redis-cli --hotkeys with an LFU maxmemory policy, or keyspace sampling; avoid MONITOR in prod unless you’re drowning).
  2. Stop stampedes
    • Apply request coalescing.
// Go: singleflight to coalesce cache-miss fetches
// import "golang.org/x/sync/singleflight"
var g singleflight.Group
// Concurrent misses on the same key share a single origin fetch;
// everyone gets the same val/err.
val, err, _ := g.Do(key, func() (interface{}, error) {
    return fetchFromOrigin(ctx, key)
})
  • Add TTL jitter (±20%) to spread expirations.
  3. Hot key mitigation

    • Shard or namespace hot keys per tenant; if needed, add a small local in-process LRU (Caffeine/Ristretto).
  4. Sensible Redis config

# redis.conf
maxmemory 8gb
maxmemory-policy allkeys-lru
lazyfree-lazy-eviction yes

Acceptance

  • Cache hit ratio > 0.9 on read-heavy paths
  • No single key exceeds 5% of CPU for >5 minutes
  • Origin traffic increase < 10% during rotations/deploys

Playbook: Thread pool and socket exhaustion (app + kernel)

Symptoms: rising connection time, SYN backlog drops, 502/503 under low CPU. Seen this at unicorns and mom-and-pop shops alike.

  1. Verify saturation
ss -s
cat /proc/net/netstat | egrep 'ListenOverflows|ListenDrops'
ulimit -n
  • Watch app metrics: thread pool active vs max, queue length, and accept backlog.
  2. OS kernel tune-up
# Reasonable starters; persist via /etc/sysctl.d
sysctl -w net.core.somaxconn=4096
sysctl -w net.core.netdev_max_backlog=8192
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w fs.file-max=2000000
  3. Right-size application pools
# Spring Boot (Tomcat); on Boot 2.3+ the property is server.tomcat.threads.max
# (older versions use server.tomcat.max-threads)
server:
  tomcat:
    threads:
      max: 200
    accept-count: 200
    connection-timeout: 2s

# Node: run multiple workers (or use PM2)
# pm2 start app.js -i max
  • Set upstream/downstream timeouts; don’t let queues accumulate indefinitely.

Acceptance

  • Zero ListenOverflows/ListenDrops for 24h
  • Connection establishment p95 < 20ms; app thread queue length ~0 under steady state
  • No 502/503 at 1.2x target RPS

Playbook: Queue backpressure and consumer lag (Kafka/RabbitMQ/SQS)

Symptoms: growing lag, old messages, DLQ filling, “eventual consistency” turning into “tomorrow.”

  1. Measure the right things
  • Kafka: consumer lag per partition, records/sec, rebalances, max batch time
  • SQS: ApproximateAgeOfOldestMessage, inflight count
  • RabbitMQ: queue_messages_ready, prefetch
# Kafka lag snapshot
kafka-consumer-groups --bootstrap-server broker:9092 \
  --describe --group billing-consumer
  2. Scale by lag, not CPU
# KEDA: scale consumers on Kafka lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: billing-consumer
spec:
  scaleTargetRef:
    name: billing-consumer-deploy
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: broker:9092
      consumerGroup: billing-consumer
      topic: invoices
      lagThreshold: "5000"
  3. Increase batch and throughput safely
  • Tune max.poll.records (Kafka), prefetch (RabbitMQ), and batch DB writes with idempotency keys.
  • Ensure consumers are idempotent; use an outbox or dedupe table.
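The dedupe-table idea reduces to "apply each idempotency key at most once." A sketch with an in-memory map standing in for the real table (in production this lives in your DB inside the same transaction as the write):

```go
package main

import (
	"fmt"
	"sync"
)

// dedupe is an in-memory stand-in for a dedupe table keyed by an
// idempotency key carried on each message.
type dedupe struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedupe() *dedupe { return &dedupe{seen: make(map[string]bool)} }

// Process applies fn exactly once per key; redeliveries from retries
// or consumer rebalances become no-ops.
func (d *dedupe) Process(key string, fn func()) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[key] {
		return false // duplicate delivery, skip
	}
	d.seen[key] = true
	fn()
	return true
}

func main() {
	d := newDedupe()
	applied := 0
	d.Process("invoice-123", func() { applied++ })
	d.Process("invoice-123", func() { applied++ }) // redelivery
	fmt.Println(applied) // 1
}
```

The key point: check-and-mark and the side effect must commit atomically, or a crash between them reintroduces duplicates.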
  4. Shed and prioritize
  • Drop or degrade non-critical work during incidents; move heavy transforms to a side pipeline.
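Shedding can be as dumb as a depth threshold gated on criticality. A sketch; the threshold and the critical/non-critical split are illustrative and should come from your own SLOs:

```go
package main

import "fmt"

// shouldShed drops non-critical work once the backlog crosses a
// threshold; critical work always passes.
func shouldShed(queueDepth int, critical bool) bool {
	const maxDepth = 10000 // tune against your drain rate
	if critical {
		return false
	}
	return queueDepth > maxDepth
}

func main() {
	fmt.Println(shouldShed(15000, false)) // true: drop analytics events
	fmt.Println(shouldShed(15000, true))  // false: keep billing writes
}
```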

Acceptance

  • Lag drains to steady state (< 1 min at P50, < 5 min at P95) after a 10x burst
  • DLQ rate near zero; rebalances < 1/hour per consumer group
  • Producer p99 unaffected, no broker throttling

Playbook: JVM GC pauses (and general memory pressure)

Symptoms: p99 spikes with GC pause bursts, allocation rate spikes, or OOMKills in k8s.

  1. Observe
# Total GC pause time per 5m
sum(rate(jvm_gc_pause_seconds_sum[5m])) by (service)
# Allocation rate (Micrometer exposes allocated bytes as a counter,
# which rates cleanly; Eden bytes_used is a sawtoothing gauge)
sum(rate(jvm_gc_memory_allocated_bytes_total[1m])) by (service)
  2. JVM tuning
# Start here; verify with load tests
-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms2g -Xmx2g -XX:+AlwaysPreTouch
  • On JDK 17+, try ZGC for latency-sensitive services.
  • Avoid tiny containers: give GC room; set -Xms = -Xmx.
  3. Reduce churn in code
  • Reuse buffers; pre-size collections; avoid String concatenation in hot loops.
  • In Go, preallocate slices (make([]T, 0, n)), reuse byte buffers (sync.Pool).
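The Go bullets above look like this in practice. A minimal sketch combining sync.Pool buffer reuse with a pre-sized slice; render is a hypothetical hot-path function:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable byte buffers so the hot path stops
// allocating (and the GC stops collecting) a fresh buffer per call.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func render(items []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset() // pooled buffers carry stale contents
	for _, it := range items {
		buf.WriteString(it)
		buf.WriteByte(',')
	}
	return buf.String()
}

func main() {
	// Pre-size the slice: one allocation instead of repeated regrows.
	items := make([]string, 0, 3)
	items = append(items, "a", "b", "c")
	fmt.Println(render(items)) // a,b,c,
}
```

Forgetting Reset() on a pooled buffer is the classic bug here; the pool returns buffers with their previous contents intact.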

Acceptance

  • GC pause p99 < 100ms (or your SLO) across peak hour
  • Allocation rate stable; no OOMKills; CPU not materially higher post-change

Make it stick: a reusable playbook template

Playbooks rot unless they’re easy to find, run, and update. Bake them into your repo (/runbooks/perf/) and your on-call rotation. Here’s a template we use at GitPlumbers:

# Playbook: <Problem>

## Trigger
- Alert: <name>
- SLO breached: <definition>

## Metrics to watch
- <list of Prometheus/CloudWatch metrics>

## Tools
- <e.g., Jaeger, async-profiler, k6>

## Steps
1. <action>
2. <action>

## Checkpoints
- <observable criteria after each step>

## Rollback
- <how to revert safely>

## Owner
- <team/person>

## Post-fix tasks
- <tests, infra as code, dashboards>

When we rolled these into a fintech’s GitOps flow (ArgoCD + progressive canaries), they cut p99 on their money-mover endpoint by 48% in two weeks and decommissioned 20% of compute without touching business logic. No magic—just disciplined, repeatable playbooks.

If you want a second set of eyes, we’ve done this a hundred times. GitPlumbers will pair with your leads, not parachute in with a slide deck. See our Performance Engineering Services and a case study where we cut p99 in half.

Key takeaways

  • Codify performance fixes as playbooks with clear triggers, steps, and acceptance criteria.
  • Measure percentiles, saturation, and error budgets—never averages.
  • Instrument first: OpenTelemetry traces + Prometheus metrics + a flame profiler on hot paths.
  • Solve tail latency by tracing, setting timeouts/circuit breakers, and removing chatty dependencies.
  • Kill N+1/slow SQL with pg_stat_statements, EXPLAIN ANALYZE, and minimal indexes.
  • Stop cache stampedes using request coalescing and TTL jitter; track hot keys and hit rate.
  • Treat queue lag as a first-class SLO; scale by lag, batch, and ensure idempotency.
  • Tame thread/socket exhaustion with OS sysctls and right-sized pools; respect backpressure.

Implementation checklist

  • Define service SLOs (p95/p99) with budgets per quarter.
  • Ensure baseline telemetry: Prometheus, Grafana, OpenTelemetry, Jaeger.
  • Enable DB and cache visibility: pg_stat_statements, Redis INFO/latency.
  • Adopt a standard playbook template with triggers, steps, rollback, and owners.
  • Automate canaries and rollbacks (ArgoCD/Spinnaker + progressive delivery).
  • Set alerts on saturation signals (CPU steal, run queue, thread pool queue length, queue lag).
  • Review playbook outcomes in post-incident reviews; update playbooks monthly.

Questions we hear from teams

How do I prioritize which playbook to run first?
Start with breached SLOs and the biggest error-budget burners. Use tracing to confirm the actual critical path, then pick the playbook that addresses the slowest span (API tail latency vs DB vs cache vs queue).
What if we don’t have tracing yet?
Add OpenTelemetry SDKs and run an agent/collector sidecar. You can start with sampling at 1–5% to keep costs down. Without traces, rely on high-cardinality logs and service mesh metrics, but expect slower iterations.
How do we prove the optimizations worked?
Define acceptance checks upfront: p99 target, error rate, and saturation thresholds. Run a controlled load test (k6/wrk), compare pre/post dashboards, and attach snapshots to the PR. Keep a ‘perf-changes.md’ ledger in the repo.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a performance playbook started
See how we cut p99 in half
