The Bottleneck Playbooks I Reach For When Prod Starts Smoking
Stop guessing. Ship a small set of rehearsed performance plays you can run at 3 a.m.—with checkpoints, metrics, and tools that actually work.
The 3 a.m. page is not the time to invent your strategy
I’ve watched smart teams spend an hour debating a theory while p99 latency blew through the SLO and the error budget melted. The teams that win don’t guess—they run a small set of proven plays. This guide gives you those plays: what to measure, which buttons to push, and how to prove you fixed it.
- Audience: senior engineers and leads who’ve been burned by hand-wavy advice.
- Goal: ship a handful of performance playbooks you can reuse across services.
- Scope: API latency spikes, database hotspots, cache stampedes, GC pauses, and Kubernetes resource pain.
Performance isn’t a project; it’s a set of rehearsed plays you can run under pressure.
Start with SLOs and a Baseline You Trust
Before you touch a knob, make performance observable and tied to user impact.
- Define SLOs per critical path.
  - Latency: p50/p95/p99 on key endpoints (e.g., checkout p99 < 800ms).
  - Availability: success rate (5xx and timeouts). Use an error budget to decide when to pause feature work.
  - Capacity: target RPS/QPS and saturation (CPU, memory, I/O, queue lag).
- Instrument the RED/USE core.
- RED: Requests, Errors, Duration. USE: Utilization, Saturation, Errors.
  - Tools: Prometheus + Grafana; OpenTelemetry traces to Jaeger or Tempo.
- Add production-safe profiling paths now.
  - Go: enable net/http/pprof.
  - JVM: JFR, async-profiler.
  - Node: clinic flame or 0x.
- Create an alert that links to a runbook.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-latency
spec:
  groups:
    - name: api-slo
      rules:
        - alert: APIHighLatencyP99
          expr: histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket{service="payments"}[5m]))) > 0.800
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "payments p99 > 800ms for 10m"
            runbook_url: https://git.company.com/runbooks/api-latency.md

Checkpoint: You can answer “What’s the p99 for checkout right now? What’s saturated?” within 60 seconds.
A Reusable Playbook Template You Can Drop in Any Repo
Codify the choreography. Here’s a minimal YAML you can put in runbooks/perf/api-latency.yaml and render in Backstage/Docs.
name: API latency spike (p99)
trigger:
  alert: APIHighLatencyP99
  threshold: p99 > 800ms for 10m
dashboards:
  - grafana: /d/service-latency/payments
  - traces: /jaeger/search?service=payments
steps:
  - check: "Capacity"
    run:
      - kubectl top pods -n prod -l app=payments
      - promql: rate(container_cpu_cfs_throttled_seconds_total{pod=~"payments-.*"}[5m])
    checkpoint: "CPU throttle < 10%"
  - check: "Hot path"
    run:
      - open flamegraph: /profiling/payments
      - inspect: "upstream calls > 1 per request?"
    checkpoint: "No >20% self time in JSON parse"
mitigations:
  - "Temporarily raise CPU limit to 2 cores"
  - "Enable gzip offload in NGINX"
rollback:
  - "Scale back limits and create ticket PERF-123"
owner: "payments-oncall"

Keep each playbook:
- Tied to a specific SLO and alert.
- Stepwise with checkpoints you can prove.
- Containing both a quick mitigation and a follow-up fix.
- With a named owner and rollback.
Playbook 1: API Latency Spike (CPU-bound, lock contention, or I/O)
Symptoms: p99 erupts, RPS steady or increasing, error rate may rise, pods not OOMing.
Steps:
- Validate the alert and scope.
  - Grafana: compare p99 vs RPS, error rate, and saturation.
  - Check container_cpu_cfs_throttled_seconds_total and pod restarts.
- Trace the hot path.
  - Open recent OpenTelemetry traces for the slow endpoint.
  - Look for N+1 calls, external dependency waits, or serialization hotspots.
- Profile before tuning.
  - Go:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers
)

func main() {
    go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()
    // app
}

go tool pprof -http=:0 http://localhost:6060/debug/pprof/profile?seconds=60

  - JVM:

jcmd <pid> JFR.start name=profile settings=profile duration=120s filename=/tmp/app.jfr
./profiler.sh -d 60 -e cpu -f /tmp/cpu.html <pid>  # async-profiler

  - Node:

npm i -g clinic
clinic flame -- node server.js

- Checkpoints to decide action.
  - If CPU throttle ratio > 20% → increase resources.limits.cpu or remove limits temporarily.
  - If time spent in JSON/XML decode > 30% → switch to jsoniter/serde/a streaming parser, or shrink the payload.
  - If an outbound call dominates → introduce a circuit breaker + timeout; cache responses.
  - If locks/mutex hotspots → shard or reduce critical section scope.
- Mitigations you can apply fast.
- Raise CPU limit + HPA min replicas by 1–2x; add a canary before full rollout.
  - Add request rate limits in the API gateway (e.g., Envoy/NGINX) to protect the backend.
  - Enable gzip/brotli at the edge to shrink payloads.
- Verify.
  - Rerun a synthetic load test (k6) and confirm p99 is back under SLO.
k6 script you can keep in-repo:
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = { thresholds: { http_req_duration: ['p(99)<800'] }, vus: 50, duration: '5m' };
export default function () {
const res = http.get(__ENV.TARGET + '/v1/payments?id=123');
check(res, { 'status was 200': (r) => r.status === 200 });
sleep(1);
}

Expected result: p99 returns under SLO and CPU throttle < 10% within 10–15 minutes.
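The lock-sharding checkpoint above is easier to act on with a concrete shape in mind. A sketch of mutex sharding for a hot counter map — the type and names are illustrative, not a library API:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16

// shardedCounters spreads a hot read-modify-write across independent
// mutexes so goroutines stop serializing on a single lock.
type shardedCounters struct {
	shards [numShards]struct {
		mu sync.Mutex
		m  map[string]int64
	}
}

func newShardedCounters() *shardedCounters {
	c := &shardedCounters{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int64)
	}
	return c
}

// shard hashes the key so each key always lands on the same mutex.
func (c *shardedCounters) shard(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numShards)
}

func (c *shardedCounters) Inc(key string) {
	s := &c.shards[c.shard(key)]
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *shardedCounters) Get(key string) int64 {
	s := &c.shards[c.shard(key)]
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := newShardedCounters()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Inc("checkout") }()
	}
	wg.Wait()
	fmt.Println(c.Get("checkout")) // 100
}
```

The same idea applies to any map-plus-mutex hot spot: keep the critical section tiny and let unrelated keys proceed in parallel.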
Playbook 2: Database Hotspots (slow queries, locks, connection storms)
95% of “mysterious” latency is one slow query or a lock chain. I’ve seen teams scale Postgres to the moon when one missing index was the culprit.
Steps:
- Identify the top offenders.
  - Ensure pg_stat_statements is enabled.

-- On Postgres 13+ the columns are total_exec_time / mean_exec_time
SELECT query, calls, total_time, mean_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

  - Enable auto_explain to catch nasty surprises in the logs.
- Explain, don’t guess.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT /* your slow query */

- Check for seq scans on large tables, nested loops over big sets, or misestimates.
- Check locks and waiting queries.
SELECT
bl.pid AS blocked_pid,
a.usename AS blocked_user,
ka.query AS blocking_query,
now() - ka.query_start AS blocking_duration
FROM pg_stat_activity a
JOIN pg_locks bl ON bl.pid = a.pid AND NOT bl.granted
JOIN pg_locks kl ON kl.transactionid = bl.transactionid AND kl.granted
JOIN pg_stat_activity ka ON ka.pid = kl.pid;

- Checkpoints to decide action.
  - If a query dominates total_time and lacks a selective index → add one; verify with EXPLAIN.
  - If row estimates are off by >10x → run ANALYZE; consider extended statistics.
  - If wait events show LWLock:buffer_content or IO → increase shared_buffers, tune work_mem, or reduce fan-out.
  - If CPU is low but latency is high with many connections → deploy PgBouncer in transaction mode; cap max_connections at sane values.
- Mitigations you can apply fast.
- Create a covering index and a quick migration; verify on a canary.
- Add a read replica and route read-heavy endpoints via feature flag.
- Break long transactions; ensure application sets sane timeouts.
Checkpoint: pg_stat_statements.mean_time for the offender drops by >50% and DB CPU/IO returns to baseline.
Nice-to-have: add a Grafana panel for pg_stat_statements top-N and pg_locks wait trees.
Playbook 3: Cache Misses and Stampedes (Redis/HTTP/CDN)
If your cache hit ratio drops under 80% during peak, your origin is about to eat glass. I’ve watched a single naive invalidation melt a cluster.
Steps:
- Measure the basics.
  - Redis: INFO stats hit ratio, SLOWLOG, and latency doctor.

redis-cli info stats | egrep 'keyspace_hits|keyspace_misses'
redis-cli slowlog get 10

  - CDN/NGINX: add a header to see cache status.

add_header X-Cache $upstream_cache_status;

- Introduce request coalescing to prevent dogpiles.
  - Go example with singleflight:

var g singleflight.Group
v, err, _ := g.Do("user:123", func() (any, error) {
    return expensiveFetch(123)
})

- Add edge or local caching with predictable keys and TTLs.
- NGINX microcache:
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:10m inactive=10m max_size=1g;
server {
    location /v1/data {
        proxy_cache api_cache;
        proxy_cache_key $request_uri;
        proxy_cache_valid 200 1m;
        add_header X-Cache $upstream_cache_status;
        proxy_pass http://backend;
    }
}

- Checkpoints to decide action.
- If hit ratio < 80% and TTLs are tiny → bump TTLs; add soft TTL + background refresh.
- If stampedes happen on invalidation → gate with a lock (SETNX) or singleflight.
- If Redis CPU > 80% with small objects → use pipelining/batching and avoid large Lua locks in hot paths.
- Verify.
- Cache hit ratio recovers >90%; origin RPS drops; p95 improves proportionally.
Bonus: align HTTP caching headers (Cache-Control, ETag) with CDN behavior; add a dashboard panel for X-Cache breakdown.
Playbook 4: GC Pauses and Memory Leaks (Go/JVM)
When p99 stalls but CPU looks fine, suspect GC or leaks. I’ve seen runaway allocs from innocent JSON marshaling grind services.
Steps:
- Confirm GC involvement.
- Go: run with GC traces.
GODEBUG=gctrace=1 ./service

- JVM: check pause times via JFR or GC logs.
- Profile allocations.
- Go heap profile:
curl -s http://localhost:6060/debug/pprof/heap > /tmp/heap.pb.gz
go tool pprof -http=:0 /tmp/heap.pb.gz

- JVM allocation/CPU:

jcmd <pid> JFR.start name=alloc settings=profile duration=120s filename=/tmp/app.jfr
./profiler.sh -d 60 -e alloc -f /tmp/alloc.html <pid>

- Checkpoints to decide action.
  - If short-lived allocations dominate (>70%) → preallocate buffers, reuse them with sync.Pool (Go), and avoid reflection-heavy codecs.
  - If the heap grows across requests → look for caches without bounds; add size/TTL caps.
  - If GC pauses > 100ms on latency-critical endpoints → reduce object churn; for Go, tune GOGC; for the JVM, evaluate G1/ZGC and right-size the heap.
- Mitigations you can apply fast.
- Reduce response payloads; compress at edge, not in app.
- Cap in-process caches; move to Redis with eviction.
  - For Go, raise GOGC above its default of 100 (e.g., export GOGC=200) to trade memory for fewer GC cycles temporarily.
Verify: GC pause p99 < 50ms, RSS stable, and endpoint p99 back under SLO within one deploy.
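The sync.Pool bullet above, as a minimal sketch — the JSON-ish rendering is just a stand-in for any per-request buffer churn:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool recycles buffers so each request stops allocating a fresh one,
// cutting the short-lived garbage that drives GC frequency.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold a previous request's bytes
	defer bufPool.Put(buf)
	buf.WriteString(`{"data":"`)
	buf.WriteString(payload)
	buf.WriteString(`"}`)
	return buf.String()
}

func main() {
	fmt.Println(render("ok")) // {"data":"ok"}
}
```

The Reset call is the part teams forget: a pool hands back dirty objects, and skipping it turns a GC fix into a data-leak bug.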
Playbook 5: Kubernetes Resource Thrash (limits, throttling, HPA)
More than once, the “CPU is only 40%” line fooled teams—because the kernel was throttling them to death. Watch the right metrics.
Steps:
- Check for throttling and OOMs.
rate(container_cpu_cfs_throttled_seconds_total{container!="",pod!=""}[5m])
/ rate(container_cpu_cfs_periods_total{container!="",pod!=""}[5m]) > 0.2

- Inspect kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} and restarts.
- Compare requests/limits to actual usage.
  - kubectl top pods -n prod and the Grafana Container CPU/Memory dashboard.
- Apply sane autoscaling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "20"

- Checkpoints to decide action.
- If throttling ratio > 10% → raise CPU limit or remove it; ensure requests reflect typical usage.
- If OOMs during spikes → raise memory requests; verify GC/leaks aren’t the root cause.
- If HPA flaps → add stabilization windows and scale on an SLI (RPS/latency) not just CPU.
- Verify.
- Throttling < 5%, restarts flat, p95 stabilizes under SLO.
Note: Consider VPA in recommend mode to right-size over time; commit the recommendations via GitOps, not clicks.
Operationalize It: GitOps, Alerts, and Drills
This is where teams either make it stick or let it rot on Confluence.
- Put everything in Git.
  - Runbooks: runbooks/perf/*.yaml.
  - Alerts: prometheusrules/*.yaml with runbook_url pointing to the repo.
  - Dashboards: JSON exports in grafana/.
  - Load tests: perf/k6/*.js with Makefile targets.
- Wire it to your deploy flow.
  - ArgoCD/Flux applies alerts/dashboards.
  - CI runs a smoke k6 test after canary; blocks the rollout if SLO thresholds fail.
- Run drills quarterly.
- Trigger the alert in staging with a controlled load; time the MTTR.
- Rotate on-call to run the playbook end-to-end.
- Track business metrics.
- Tie SLOs to conversion, abandonment, and cost. E.g., shaving 200ms off checkout p99 lifted conversion 2.1% at one client—real money.
If you don’t rehearse, you won’t execute when it matters. This is SRE 101, but it’s shocking how many orgs skip it.
GitPlumbers note: we audit and harden these playbooks, then pressure-test them with your stack. If you want a second set of eyes, we’re here.
Key takeaways
- Performance work should be a small set of rehearsed plays tied to SLOs and error budgets, not ad hoc heroics.
- Each playbook needs triggers, reproducible diagnostics, checkpoints, mitigations, and a rollback path.
- Profile before you tune: flame graphs and traces beat log-diving and hunches every time.
- Focus on the usual suspects: API hot paths, database hotspots, cache stampedes, GC pauses, and K8s resource throttling.
- Operationalize via GitOps: version runbooks, alerts, and dashboards; test them with synthetic load and chaos drills.
Implementation checklist
- Define user-facing SLOs with p50/p95/p99 and an error budget.
- Instrument RED/USE metrics and wire tracing with OpenTelemetry.
- Create a runbook template with trigger, steps, checkpoints, mitigations, and rollback.
- Author Prometheus alerts with runbook_url annotations for each play.
- Stand up profiling (pprof/JFR/async-profiler) and ensure it’s safe to use in prod.
- Create DB observability: pg_stat_statements, auto_explain, slow query logs.
- Put a cache policy in writing: TTLs, keys, coalescing, and circuit breakers.
- Right-size K8s requests/limits; add HPA (and VPA where safe); watch CPU throttling.
- Store the playbooks, alerts, dashboards, and k6 tests in Git; run drills quarterly.
Questions we hear from teams
- How do I pick SLO thresholds without over-optimizing?
- Start from user journeys and business outcomes. If users tolerate 1s on search but only 500ms on checkout, encode that. Use historical p95/p99 plus error budget math to set realistic targets. Revisit quarterly as traffic and features change.
- Is profiling in production safe?
- Yes, with care. Go pprof CPU at 60s sampling and JFR/async-profiler are low overhead when used briefly. Keep endpoints locked down (localhost/mtls), timebox profiles, and store artifacts off-box.
- Why p99 and not average?
- Averages hide pain. Users experience tail latency. p95 reveals systemic issues; p99 exposes outliers and contention that torpedo conversion. Track all three, alert on p99 for critical paths.
- When should I cache vs. scale?
- Cache if the data is read-heavy and tolerates staleness for a short TTL; scale if data is highly dynamic and correctness is critical. Often it’s both: microcache at the edge plus origin capacity and backpressure.
- What about microservices and cross-service latency?
- Invest in tracing. Add per-hop budgets and circuit breakers (Istio/Envoy). Use bulkheads so one slow dependency doesn’t cascade. A good playbook includes dependency maps and fallback behavior.
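A bulkhead doesn't require a mesh to get started; a minimal Go sketch using a buffered channel as a per-dependency semaphore (the limit of 2 is illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// bulkhead caps concurrent calls to one dependency so a slow downstream
// sheds load instead of consuming every worker in the service.
type bulkhead struct {
	slots chan struct{}
}

func newBulkhead(max int) *bulkhead {
	return &bulkhead{slots: make(chan struct{}, max)}
}

var errShed = errors.New("bulkhead full: load shed")

// Try runs fn if a slot is free; otherwise it fails fast.
func (b *bulkhead) Try(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		return fn()
	default:
		return errShed
	}
}

func main() {
	b := newBulkhead(2)
	fmt.Println(b.Try(func() error { return nil })) // <nil>: a slot was free

	// Fill both slots with in-flight calls, then watch the third fail fast.
	hold := make(chan struct{})
	started := make(chan struct{}, 2)
	for i := 0; i < 2; i++ {
		go b.Try(func() error { started <- struct{}{}; <-hold; return nil })
	}
	<-started
	<-started
	fmt.Println(b.Try(func() error { return nil })) // bulkhead full: load shed
	close(hold)
}
```

Pair the fast failure with a fallback (cached value, degraded response) so the slow dependency degrades one feature rather than the whole service.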