The Performance Playbooks That Actually Move the Needle: Tail Latency, N+1 DB, Cache Storms, and Backpressure
Stop guessing. Standardize how your team hunts and kills the bottlenecks that blow up SLOs and burn cash.
Performance isn’t an art project. It’s a checklist and a stopwatch.
Stop guesswork: standardize performance triage
I’ve watched teams spend weeks “optimizing” the wrong thing because they never agreed on what good looks like or how to measure it. The fix: standardize the playbook format and instrument enough to see reality.
- SLOs: Set p95/p99 targets per endpoint and per key background job. Define error budgets.
- Methods: Use the `RED` method for services (Rate, Errors, Duration) and `USE` for resources (Utilization, Saturation, Errors).
- Telemetry: OpenTelemetry traces + Prometheus metrics + logs you can actually join. Jaeger/Tempo for traces, Grafana for dashboards.
Quick baseline queries (Prometheus):
```promql
# Request rate and tail latency (histogram)
sum(rate(http_server_requests_seconds_count{job="api"}[5m])) by (route)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{job="api"}[5m])) by (le, route))

# Saturation: CPU run queue and GC pause (JVM)
node_load1 / count without(cpu)(node_cpu_seconds_total{mode="system", job="node"})
sum(rate(jvm_gc_pause_seconds_sum{job="api"}[5m])) by (service)
```

Rule zero: no optimization without a failing SLO and a flame graph or trace that shows where the time goes.
Below are the playbooks we actually use at GitPlumbers when the alarms go off.
Playbook: API tail latency (p99 bursts over SLO)
Symptoms: p99 spikes above target during traffic ramps or specific tenants. Usually a chatty downstream, synchronous I/O, or missing timeouts.
- Reproduce and scope
  - Run `k6` or `wrk` against the exact endpoint and tenant payloads. Warm caches.
  - Capture traces: ensure `http.server.duration` spans with `traceparent` propagation into downstreams.
  - Checkpoints:
    - p99 reproduced within 10% of production.
    - Trace count > 300 samples across the spike window.

```shell
k6 run --vus 50 --duration 5m scripts/api-smoke.js
wrk -t8 -c128 -d120s https://api.example.com/v1/orders
```

- Trace the slow path
  - In Jaeger, sort by duration; inspect the critical path for serial downstream calls, missing indexes, or a retry storm.
  - Look for external spans > 100ms and any retry loops.
  - Checkpoints:
    - Identify the top 2 contributors to span duration.
- Enforce timeouts and circuit breakers
- Set per-route timeouts, max inflight requests, and outlier detection. Prefer to fail fast and degrade gracefully.
```yaml
# Istio DestinationRule with circuit breaking
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
        idleTimeout: 5s
    tls:
      mode: ISTIO_MUTUAL
```

- De-chat and cache
  - Collapse sequential downstream calls; parallelize safe calls; memoize stable lookups for 30–120s.
  - Add `ETag` or per-tenant cache keys; apply TTL jitter to avoid thundering herds.
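Collapsing serial calls is often just launching independent lookups concurrently and joining. A stdlib-only sketch (the two fetch functions stand in for real downstream calls):

```go
package main

import (
	"fmt"
	"sync"
)

// fetchBoth runs two independent downstream lookups concurrently, turning
// latencyA + latencyB on the critical path into max(latencyA, latencyB).
func fetchBoth(fetchUser, fetchOrders func() string) (string, string) {
	var user, orders string
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); user = fetchUser() }()
	go func() { defer wg.Done(); orders = fetchOrders() }()
	wg.Wait()
	return user, orders
}

func main() {
	u, o := fetchBoth(
		func() string { return "user:42" },
		func() string { return "orders:[A1,B2]" },
	)
	fmt.Println(u, o)
}
```

Only parallelize calls with no data dependency between them, and keep per-call timeouts so one slow branch can't stall the join.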
- OS and runtime sanity
  - Ensure `somaxconn` and the accept queue aren’t capping throughput; size thread pools to the core count.

```shell
sysctl -w net.core.somaxconn=4096
sysctl -w net.ipv4.ip_local_port_range="10240 65535"
```

Acceptance
- p99 <= SLO for 30 minutes at production QPS (+20% headroom)
- Error rate flat; no new 5xx; downstream doesn’t exceed its SLO
- Saturation (CPU run queue, thread pool queue) < 0.8
Playbook: N+1 and slow SQL (Postgres/MySQL)
Symptoms: p95 queries > 50–100ms, DB CPU spikes, lock waits. Common in ORMs (Rails, Django, Hibernate) under load.
- Turn on visibility
```sql
-- Postgres: enable pg_stat_statements
-- (requires pg_stat_statements in shared_preload_libraries; the .max
-- setting takes effect only after a restart, not a reload)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
ALTER SYSTEM SET pg_stat_statements.max = 10000;
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();
```

  - Run `EXPLAIN (ANALYZE, BUFFERS)` for the top 5 queries in `pg_stat_statements`.
  - For MySQL, use `pt-query-digest` and `SHOW PROCESSLIST`.
- Kill N+1
  - Replace per-row lookups with `JOIN` + `WHERE IN (...)` or a batch fetch.
  - In ORMs, use `includes`/`eager_load` (Rails) and `select_related`/`prefetch_related` (Django).
- Index the access path, not fantasies
```sql
-- Typical fix: composite index on filter + order by
CREATE INDEX CONCURRENTLY idx_orders_acct_created ON orders (account_id, created_at DESC);
```

  - Avoid over-indexing (write amplification). Verify with `EXPLAIN` that the index is actually used.
- Pooling and timeouts
- Right-size connection pools: 4–8x CPU cores for pgbouncer (transaction pooling), 2–4x per app.
- Set statement timeouts to protect the DB.
```sql
ALTER DATABASE appdb SET statement_timeout = '2s';
```

Acceptance
- p95 query latency < 25ms for hot paths; lock wait time near zero
- DB CPU < 70%, disk IOPS stable; slow query log quiet
- App RPS unchanged or higher, error rate flat
Playbook: Cache miss storms and hot keys (Redis/Memcached)
Symptoms: sudden traffic to origin, Redis CPU single-thread pegged, latency spikes, or a single key dominates CPU.
- Inspect health
```shell
redis-cli INFO stats | egrep 'hit|evicted'
redis-cli --latency
redis-cli --latency-history
```

  - Monitor the hit ratio, evictions, `instantaneous_ops_per_sec`, and top keys (use `redis-cli --hotkeys` or keyspace sampling; avoid `MONITOR` in prod unless you’re drowning).
- Stop stampedes
  - Apply request coalescing.

```go
// Go: singleflight (golang.org/x/sync/singleflight) coalesces concurrent
// cache-miss fetches for the same key into a single origin call.
var g singleflight.Group

func getWithCoalescing(ctx context.Context, key string) (interface{}, error) {
	val, err, _ := g.Do(key, func() (interface{}, error) {
		return fetchFromOrigin(ctx, key)
	})
	return val, err
}
```

  - Add TTL jitter (±20%) to spread expirations.
- Hot key mitigation
  - Shard or namespace hot keys per tenant; if needed, add a small local in-process LRU (Caffeine/Ristretto).
- Sensible Redis config

```ini
# redis.conf
maxmemory 8gb
maxmemory-policy allkeys-lru
lazyfree-lazy-eviction yes
```

Acceptance
- Cache hit ratio > 0.9 on read-heavy paths
- No single key exceeds 5% of CPU for >5 minutes
- Origin traffic increase < 10% during rotations/deploys
Playbook: Thread pool and socket exhaustion (app + kernel)
Symptoms: rising connection time, SYN backlog drops, 502/503 under low CPU. Seen this at unicorns and mom-and-pop shops alike.
- Verify saturation
```shell
ss -s
cat /proc/net/netstat | egrep 'ListenOverflows|ListenDrops'
ulimit -n
```

  - Watch app metrics: thread pool active vs max, queue length, and accept backlog.
- OS kernel tune-up
```shell
# Reasonable starters; persist via /etc/sysctl.d
sysctl -w net.core.somaxconn=4096
sysctl -w net.core.netdev_max_backlog=8192
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w fs.file-max=2000000
```

- Right-size application pools
```yaml
# Spring Boot (Tomcat)
server:
  tomcat:
    max-threads: 200
    accept-count: 200
    connection-timeout: 2s
```

```shell
# Node: run multiple workers (or use PM2)
pm2 start app.js -i max
```

  - Set upstream/downstream timeouts; don’t let queues accumulate indefinitely.
Acceptance
- Zero `ListenOverflows`/`ListenDrops` for 24h
- Connection establishment p95 < 20ms; app thread queue length ~0 under steady state
- No 502/503 at 1.2x target RPS
Playbook: Queue backpressure and consumer lag (Kafka/RabbitMQ/SQS)
Symptoms: growing lag, old messages, DLQ filling, “eventual consistency” turning into “tomorrow.”
- Measure the right things
  - Kafka: consumer lag per partition, records/sec, rebalances, max batch time
  - SQS: `ApproximateAgeOfOldestMessage`, inflight count
  - RabbitMQ: `queue_messages_ready`, `prefetch`
```shell
# Kafka lag snapshot
kafka-consumer-groups --bootstrap-server broker:9092 \
  --describe --group billing-consumer
```

- Scale by lag, not CPU
```yaml
# KEDA: scale consumers on Kafka lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: billing-consumer
spec:
  scaleTargetRef:
    name: billing-consumer-deploy
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker:9092
        consumerGroup: billing-consumer
        topic: invoices
        lagThreshold: "5000"
```

- Increase batch and throughput safely
  - Tune `max.poll.records` (Kafka) and prefetch (RabbitMQ); batch DB writes with idempotency keys.
  - Ensure consumers are idempotent; use an outbox or dedupe table.
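Idempotency is what makes lag-driven scaling safe, since rebalances and retries redeliver messages. A toy in-memory dedupe to show the shape (a real system uses a dedupe table or an outbox; the `charge` closure is a placeholder side effect):

```go
package main

import "fmt"

// processor drops messages whose ID it has already handled, so redelivery
// after a rebalance or retry doesn't double-apply side effects.
type processor struct {
	seen    map[string]bool
	handled int
}

func (p *processor) process(msgID string, handle func()) {
	if p.seen[msgID] {
		return // duplicate delivery: ack and move on
	}
	p.seen[msgID] = true
	handle()
	p.handled++
}

func main() {
	p := &processor{seen: map[string]bool{}}
	charge := func() { fmt.Println("charging invoice-17 once") }
	p.process("invoice-17", charge)
	p.process("invoice-17", charge) // redelivered: no second charge
	fmt.Println("handled:", p.handled) // → handled: 1
}
```

In production the `seen` set must survive restarts and be written in the same transaction as the side effect, which is exactly what the outbox/dedupe table buys you.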
- Shed and prioritize
- Drop or degrade non-critical work during incidents; move heavy transforms to a side pipeline.
Acceptance
- Lag drains to steady state (< 1 min at P50, < 5 min at P95) after a 10x burst
- DLQ rate near zero; rebalances < 1/hour per consumer group
- Producer p99 unaffected, no broker throttling
Playbook: JVM GC pauses (and general memory pressure)
Symptoms: p99 spikes with GC pause bursts, allocation rate spikes, or OOMKills in k8s.
- Observe
```promql
# Total GC pause time per 5m
sum(rate(jvm_gc_pause_seconds_sum[5m])) by (service)

# Allocation rate
sum(rate(jvm_memory_pool_bytes_used{pool="PS Eden Space"}[1m])) by (service)
```

- JVM tuning
```shell
# Start here; verify with load tests
-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms2g -Xmx2g -XX:+AlwaysPreTouch
```

  - On JDK 17+, try ZGC for latency-sensitive services.
  - Avoid tiny containers: give GC room; set `-Xms` = `-Xmx`.
- Reduce churn in code
  - Reuse buffers; pre-size collections; avoid `String` concatenation in hot loops.
  - In Go, preallocate slices (`make([]T, 0, n)`) and reuse byte buffers (`sync.Pool`).
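Preallocation matters because growing a slice re-allocates and copies; giving `append` its final capacity up front removes that churn. A tiny sketch:

```go
package main

import "fmt"

// buildIDs pre-sizes the slice so appending n items never re-allocates,
// versus starting at zero capacity and copying on every growth step.
func buildIDs(n int) []int {
	ids := make([]int, 0, n) // length 0, capacity n: one allocation total
	for i := 0; i < n; i++ {
		ids = append(ids, i)
	}
	return ids
}

func main() {
	ids := buildIDs(1000)
	fmt.Println(len(ids), cap(ids)) // → 1000 1000
}
```

The same idea applies to `strings.Builder.Grow` and pre-sized maps; the win shows up as a lower allocation rate, not faster individual calls.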
Acceptance
- GC pause p99 < 100ms (or your SLO) across peak hour
- Allocation rate stable; no OOMKills; CPU not materially higher post-change
Make it stick: a reusable playbook template
Playbooks rot unless they’re easy to find, run, and update. Bake them into your repo (`/runbooks/perf/`) and your on-call rotation. Here’s a template we use at GitPlumbers:
```markdown
# Playbook: <Problem>

## Trigger
- Alert: <name>
- SLO breached: <definition>

## Metrics to watch
- <list of Prometheus/CloudWatch metrics>

## Tools
- <e.g., Jaeger, async-profiler, k6>

## Steps
1. <action>
2. <action>

## Checkpoints
- <observable criteria after each step>

## Rollback
- <how to revert safely>

## Owner
- <team/person>

## Post-fix tasks
- <tests, infra as code, dashboards>
```

When we rolled these into a fintech’s GitOps flow (ArgoCD + progressive canaries), they cut p99 on their money-mover endpoint by 48% in two weeks and decommissioned 20% of compute without touching business logic. No magic, just disciplined, repeatable playbooks.
If you want a second set of eyes, we’ve done this a hundred times. GitPlumbers will pair with your leads, not parachute in with a slide deck. See our Performance Engineering Services and a case study where we cut p99 in half.
Key takeaways
- Codify performance fixes as playbooks with clear triggers, steps, and acceptance criteria.
- Measure percentiles, saturation, and error budgets—never averages.
- Instrument first: OpenTelemetry traces + Prometheus metrics + a flame profiler on hot paths.
- Solve tail latency by tracing, setting timeouts/circuit breakers, and removing chatty dependencies.
- Kill N+1/slow SQL with pg_stat_statements, EXPLAIN ANALYZE, and minimal indexes.
- Stop cache stampedes using request coalescing and TTL jitter; track hot keys and hit rate.
- Treat queue lag as a first-class SLO; scale by lag, batch, and ensure idempotency.
- Tame thread/socket exhaustion with OS sysctls and right-sized pools; respect backpressure.
Implementation checklist
- Define service SLOs (p95/p99) with budgets per quarter.
- Ensure baseline telemetry: Prometheus, Grafana, OpenTelemetry, Jaeger.
- Enable DB and cache visibility: pg_stat_statements, Redis INFO/latency.
- Adopt a standard playbook template with triggers, steps, rollback, and owners.
- Automate canaries and rollbacks (ArgoCD/Spinnaker + progressive delivery).
- Set alerts on saturation signals (CPU steal, run queue, thread pool queue length, queue lag).
- Review playbook outcomes in post-incident reviews; update playbooks monthly.
Questions we hear from teams
- How do I prioritize which playbook to run first?
- Start with breached SLOs and the biggest error-budget burners. Use tracing to confirm the actual critical path, then pick the playbook that addresses the slowest span (API tail latency vs DB vs cache vs queue).
- What if we don’t have tracing yet?
- Add OpenTelemetry SDKs and run an agent/collector sidecar. You can start with sampling at 1–5% to keep costs down. Without traces, rely on high-cardinality logs and service mesh metrics, but expect slower iterations.
- How do we prove the optimizations worked?
- Define acceptance checks upfront: p99 target, error rate, and saturation thresholds. Run a controlled load test (k6/wrk), compare pre/post dashboards, and attach snapshots to the PR. Keep a ‘perf-changes.md’ ledger in the repo.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
