The Six Playbooks I Reuse to Cut p95 in Half: Monoliths, Meshes, Kafka, Serverless, SPAs, and AI Inference
Stop firefighting performance on Slack at 2 a.m. Build reusable, measurable playbooks that move user-facing metrics and revenue.
Performance wins are boring only if you measure the right things and can roll them out safely—boring is what prints money.
Why playbooks beat heroics
I’ve been paged for the same latency incident at three different companies: p95 jumped, checkout conversions dipped, marketing yelled, and someone proposed “maybe add more pods.” I’ve seen that fail. Throwing CPU at a broken path is how you burn cash and trust.
What works is having playbooks scoped to your architecture and tied to user-facing metrics. Not a wiki page with vague advice—actual configs, commands, and rollout steps. At GitPlumbers, we ship these as repo-ready modules with SLOs, dashboards, and test scripts. The outcome is boring on purpose: repeatable p95 wins and fewer 2 a.m. guesses.
Measure what users feel. If it doesn’t move p95/p99, Core Web Vitals, or checkout completion rate, it’s noise.
User-facing KPIs we anchor on:
- Web: LCP, INP, TTFB, and conversion rate
- APIs: p95/p99 latency, tail error rate, and cost/request
- Data/async: consumer lag, end-to-end event age
- AI: time-to-first-token, tokens/sec, and answer quality (hallucination rate)
Playbook 1: Monolith + CDN + Relational DB
When you’ve got a Rails/Django/Node monolith behind Nginx and Postgres, the fastest wins are boring:
- Cache the right stuff at the edge
- Kill N+1s and tune DB connections
- Push static and long-tail semi-static content off the app
Concrete steps:
- Edge caching with revalidation
- Set `Cache-Control` with `stale-while-revalidate` and ETags. I’ve cut p95 TTFB from 380ms → 120ms on product pages this way.

```nginx
location ~* \.(js|css|png|jpg|svg)$ {
  add_header Cache-Control "public, max-age=31536000, immutable";
}
location /catalog/ {
  proxy_cache catalogs;
  add_header Cache-Control "public, max-age=60, stale-while-revalidate=120";
  add_header ETag $upstream_http_etag;
}
```

Kill N+1s
- Use query logs and APM flamegraphs (`pg_stat_statements`, New Relic, Datadog). Prioritize the top three offenders. One client’s p95 went 1.2s → 480ms after preloading associations.
Connection pooling with PgBouncer
- You’ll drown Postgres with 1:1 app connections. Target ~`(cores*2)` active on Postgres; pool the rest.
```ini
# pgbouncer.ini
[databases]
app = host=postgres.local dbname=app

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
```

Indexes that pay rent
- Use `EXPLAIN (ANALYZE, BUFFERS)` to prove value. Create composite indexes that match query predicates.

```sql
CREATE INDEX CONCURRENTLY idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC);
```

Expected outcomes we’ve repeated:
- Catalog page LCP improved 35–55%
- API p95 down 40–60%
- DB CPU down 25–35% from pooling and fewer full scans
Playbook 2: Microservices + API Gateway/Service Mesh (Kubernetes)
I’ve seen teams ship retries everywhere and accidentally DDoS their own services. Meshes like Istio give you superpowers—and footguns. The playbook:
- Set retry budgets and circuit breakers
- Autoscale on the metric that matters
- Use canaries to de-risk changes
Configs:
- Retry/circuit breaker (Istio DestinationRule)
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-svc
spec:
  host: checkout
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_CONN
```

- Retry policy (Envoy/Istio VirtualService) with a budget (no infinite thrash):
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-vs
spec:
  hosts: ["checkout"]
  http:
  - route:
    - destination: { host: checkout, subset: v1 }
    retries:
      attempts: 2
      perTryTimeout: 300ms
      retryOn: "5xx,connect-failure,reset"
```

- Autoscaling on RPS or queue depth, not CPU
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "25"
```

- Canary with progressive traffic
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-canary
spec:
  hosts: ["checkout"]
  http:
  - route:
    - destination: { host: checkout, subset: v1 }
      weight: 90
    - destination: { host: checkout, subset: v2 }
      weight: 10
```

Observed wins:
- Checkout API p95: 900ms → 420ms by right-sizing retries and opening circuits early
- MTTR: down 30–50% with circuit breaking and better autoscaling signals
- Cloud bill: down 15–25% by scaling on RPS instead of CPU spikes
Playbook 3: Kafka/Event-Driven Pipelines
Throughput without backpressure = incident. The usual failures: single hot partition, tiny batches, and consumer GC pauses.
What works:
Partitioning and keys
- Align keys with access patterns. For truly hot keys, use a sharded key (`userId#shard`) to avoid one-partition hotspots.
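As a sketch (Python; names are illustrative), one way to shard a hot key is to round-robin a shard suffix per message so a single user's traffic spreads across partitions:

```python
import itertools

# Per-producer counter; ordering is preserved only within each shard,
# not across the whole logical key.
_slot = itertools.count()

def sharded_key(user_id: str, shards: int = 8) -> str:
    """Append a rotating shard suffix so one hot user's messages land
    on up to `shards` partitions instead of a single hot one."""
    return f"{user_id}#{next(_slot) % shards}"
```

Consumers strip the suffix to recover the logical key; only do this where per-key ordering across shards isn't required.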
Batching and compression
- Producers: `linger.ms` 5–20ms, `batch.size` 64–128KB, `compression.type` zstd.

```properties
# producer.properties
acks=all
linger.ms=10
batch.size=131072
compression.type=zstd
```

Consumer parallelism
- One consumer per partition. Use async processing inside the consumer only if you preserve ordering when required.
Idempotency + exactly-once semantics (EOS) where it matters
- Turn on producer idempotence; for stateful sinks, coordinate with a transactional outbox.
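A minimal sketch of the transactional outbox (Python with SQLite standing in for your relational store; table and column names are illustrative):

```python
import json
import sqlite3
import uuid

def place_order(db: sqlite3.Connection, order: dict) -> str:
    """Write the order and its event row in ONE transaction. A separate
    relay process ships rows from `outbox` to Kafka and marks them sent,
    so the event is published if and only if the order committed."""
    event_id = str(uuid.uuid4())
    with db:  # commits both inserts atomically, or neither
        db.execute("INSERT INTO orders(id, payload) VALUES (?, ?)",
                   (order["id"], json.dumps(order)))
        db.execute("INSERT INTO outbox(event_id, topic, payload) VALUES (?, ?, ?)",
                   (event_id, "orders", json.dumps(order)))
    return event_id

# Schema setup (in-memory for the sketch)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders(id TEXT PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE outbox(event_id TEXT PRIMARY KEY, topic TEXT, payload TEXT)")
```

The relay can then publish with an idempotent producer keyed by `event_id`, making replays harmless.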
Lag-based autoscaling in Kubernetes
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-consumer-hpa
spec:
  scaleTargetRef:            # target deployment (name assumed)
    apiVersion: apps/v1
    kind: Deployment
    name: orders-consumer
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_group_lag
        selector:
          matchLabels:
            group: orders-cg
      target:
        type: AverageValue
        averageValue: "500"
```

- JVM GC sanity (for Java consumers)
- G1GC, cap heap to avoid long pauses; instrument with `jvm_gc_pause_seconds_bucket` in Prometheus.
Repeated outcomes:
- Event age p95: 1.8s → 300ms
- Consumer CPU: down 20–35% via batching + compression
- Fewer replays thanks to idempotent producers and outbox patterns
Playbook 4: Serverless APIs (AWS Lambda, Cloud Run)
Most “serverless is slow” incidents are cold starts, VPC egress, or unbounded concurrency taking out a shared dependency (hello, RDS). Fixes that stick:
- Provisioned concurrency / min instances
- Keep 1–3 warm per AZ. Use schedules to ramp before traffic.
```shell
aws lambda put-provisioned-concurrency-config \
  --function-name checkout \
  --qualifier prod \
  --provisioned-concurrent-executions 6
```

Bundle smart, not big
- Tree-shake, native deps by layer, lazy-init SDKs. I’ve seen p95 cold starts drop from 1.2s → 200ms on Node by trimming 40MB of deps.
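The lazy-init point reduces to a one-liner shape (a Python sketch; the slow constructor is simulated here, but the same pattern applies to boto3 clients or DB drivers):

```python
import functools
import time

@functools.lru_cache(maxsize=1)
def get_client():
    # Stand-in for an expensive SDK/client constructor. lru_cache means
    # we pay the init cost once per container, not once per request —
    # and never during cold start if the first request doesn't need it.
    time.sleep(0.01)  # pretend this is slow initialization
    return object()

def handler(event, context=None):
    client = get_client()  # first call initializes; warm calls reuse it
    return {"status": 200, "client_id": id(client)}
```

Keep the heavy imports inside `get_client()` too, so module load (the cold-start path) stays cheap.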
VPC trade-offs
- Avoid VPC unless needed; if required, use NAT with keep-alives and RDS Proxy.
Concurrency guards
- Cap concurrency to protect downstreams; add a queue (SQS, Pub/Sub) if bursts are part of life.
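A load-shedding guard, as a sketch (Python; on Lambda you'd set reserved concurrency instead, but the same idea protects any shared downstream):

```python
import threading

class ConcurrencyGuard:
    """Reject (shed) calls beyond `limit` instead of queueing them
    forever and melting the shared downstream (e.g., RDS)."""
    def __init__(self, limit: int):
        self._sem = threading.BoundedSemaphore(limit)

    def try_call(self, fn, *args):
        if not self._sem.acquire(blocking=False):
            return None  # caller should return 429 or enqueue to SQS
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Shedding early keeps p95 flat for the requests you do serve; the queue absorbs the rest.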
Expected outcomes:
- API p95: 600–900ms → 200–350ms
- Error rate: down 30–60% during spikes with concurrency caps + queues
- Cost/request: down 15–25% after bundle and cache tweaks
Playbook 5: SPA + Edge (Next.js, Nginx, CloudFront/Fastly)
Users feel jank before they read your release notes. This playbook aims at Core Web Vitals and conversion.
- Push render work to build time
- `getStaticProps` for stable content; ISR for near-real-time.
```javascript
// next.config.js
module.exports = {
  experimental: { instrumentationHook: true },
  images: { formats: ['image/avif', 'image/webp'] },
  compress: true,
};
```

Preconnect + early hints
- Use `103 Early Hints` for fonts/APIs. Many CDNs support it now.
Code split + delay hydration
- Ship less JS. Measure INP and defer non-critical hydration.
Edge caching with SWR
```
Cache-Control: public, max-age=60, stale-while-revalidate=300
```

- Image optimization
- AVIF/WebP, responsive sizes; LCP often dominated by hero images.
Outcomes we’ve banked across e-comm and SaaS:
- LCP: 3.0s → 1.7s (mobile) in two sprints
- INP: 280ms → 140ms, measurable uplift in funnel completion (+2–4%)
Playbook 6: AI Inference Paths (LLM/RAG)
AI latency isn’t magic; it’s queueing + token throughput. The usual failure: maxing GPU memory with oversized models, no batching, and no streaming to the user.
Right-size the model + quantize
- If your use case tolerates it, 8/4-bit quantization with `vLLM` often halves cost while keeping quality.
Batching and KV cache
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto --max-num-seqs 64 --gpu-memory-utilization 0.9 \
  --enable-prefix-caching --tensor-parallel-size 1
```

Stream tokens to the UI
- Users perceive speed with time-to-first-token; wire SSE and flush ASAP.
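The SSE wiring reduces to framing each token as it arrives (a Python sketch; `token_stream` stands in for your model client's streaming iterator):

```python
def sse_events(token_stream):
    """Frame tokens as Server-Sent Events so the browser can render text
    as soon as the first token lands — optimize time-to-first-token,
    not total generation time."""
    for token in token_stream:
        yield f"data: {token}\n\n"  # SSE frame: 'data:' line + blank line
    yield "data: [DONE]\n\n"
```

Hand the generator to your framework's streaming response (and disable proxy buffering) so frames flush immediately.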
Canary new models with guardrails for cost and hallucination rate
- Route 5–10% traffic; track answer quality and deflection to human.
Cache embeddings and RAG chunks
- Redis/Valkey with TTLs; dedupe queries; warm caches on deploy.
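The dedupe idea as a sketch (Python; an in-process dict stands in for Redis/Valkey, and the normalization rule is an assumption to tune per product):

```python
import hashlib
import time

class TTLCache:
    """Tiny TTL cache. In production this would be Redis/Valkey (SETEX)
    keyed by a hash of the normalized query."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def key(query: str) -> str:
        # Normalize so trivially different phrasings dedupe to one entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        entry = self._store.get(self.key(query))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None

    def put(self, query, embedding):
        self._store[self.key(query)] = (time.monotonic() + self.ttl, embedding)
```

Hash-keyed entries also make deploy-time warming trivial: replay the top-N queries against the cache before cutting traffic over.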
Observed outcomes in production help desks and sales assistants:
- Time-to-first-token: >1.2s → 250–400ms with streaming and KV cache
- Tokens/sec: up 1.5–2.2× via batching
- Cost/session: down 30–45% with smaller/quantized models and cache hits
Guardrails, rollout, and proving ROI
Playbooks aren’t done until they’re safe by default and measurable end-to-end.
SLOs and error budgets
- Define per-journey SLOs (e.g., Checkout API p95 < 400ms, 99.9% of the time). Tie deployment gates to budgets.
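The budget math that gates deploys is simple enough to sketch (Python; window and thresholds are yours to pick):

```python
def error_budget_remaining(slo: float, total: int, errors: int) -> float:
    """Fraction of the error budget left in the window.
    slo=0.999 over 1M requests allows 1,000 bad requests."""
    allowed = total * (1 - slo)
    if allowed == 0:
        return 0.0
    return max(0.0, 1 - errors / allowed)
```

Gate the pipeline on this number: below some floor (say 25% remaining), canaries pause and the team fixes reliability before shipping features.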
Load test like reality (no synthetic fantasies)
```javascript
// k6 smoke that mimics your mix
import http from 'k6/http';
import { sleep } from 'k6';

export let options = { stages: [{ duration: '2m', target: 200 }] };

export default function () {
  http.get(__ENV.BASE_URL + '/catalog');
  sleep(1);
}
```

Observe everything
- Prometheus + Grafana + OpenTelemetry traces; correlate p95 drops with conversion lifts.
GitOps or it didn’t happen
- Terraform infra + ArgoCD app configs. No kubectl-in-prod heroics.
Chaos Engineering (safely)
- Kill pods, inject latency, verify circuits and budgets. Do it in staging first.
Business proof
- For each playbook run, capture: before/after p95, CWV deltas, error rate, infra cost, and revenue proxy (conversion, churn, support tickets). That’s what closes the loop with your CFO.
I’ve watched teams clean up AI-generated code that “worked on my laptop” but cratered p99 in prod. Bake these fixes into code, not Slack threads: feature flags, infra as code, dashboards, and runbooks. GitPlumbers calls this a code rescue, and yes, it includes vibe code cleanup and AI code refactoring when the genie has already written half your service.
Key takeaways
- Performance wins stick when you codify them as playbooks tied to user-facing KPIs.
- Each architecture has a small set of high-ROI levers—cache headers for monoliths, retry/circuit policies for meshes, batching for Kafka, concurrency for serverless, CWV for SPAs, and token streaming for AI.
- Measure improvements in p95/p99, Core Web Vitals (LCP/INP/TTFB), and cost-to-serve—not just synthetic benchmarks.
- Roll out with canaries, protect with SLO/error budgets, and validate with load tests that mimic production mixes.
- Bake configs into GitOps (ArgoCD + Terraform) so improvements don’t decay into tribal knowledge.
Implementation checklist
- Define SLOs for each user journey before touching configs.
- Instrument baseline with Prometheus/OpenTelemetry and real-user monitoring (RUM).
- Pick the playbook that matches your architecture; do not randomize optimizations.
- Roll out with canary + feature flags and watch error budgets.
- Document before/after metrics and codify as infra/app code (GitOps).
Questions we hear from teams
- How do we choose which playbook to start with?
- Start where users feel pain and where you control the levers. If LCP/INP is red, run the SPA + Edge playbook. If API p95 spikes under load, run Microservices + Mesh. If event age is high, run Kafka. Don’t mix everything—pick one, set an SLO, measure, and iterate.
- What if our stack is part monolith, part microservices?
- That’s normal. Apply playbooks per boundary: monolith behind CDN gets caching/DB fixes; microservices behind mesh get retry/circuit/autoscaling. Keep SLOs per user journey so improvements roll up to business metrics.
- How do we prevent regressions after we fix p95?
- Codify configs (Terraform + ArgoCD), add canary gates, and pin dashboards to PRs. Use error budgets to halt rollouts when p95 or error rate regresses. Bake load tests (k6) into CI for critical journeys.
- Can these help with AI-generated code that’s already in production?
- Yes. We run a code rescue: instrument hot paths, add SLOs, and apply the matching playbooks. We’ve cleaned up vibe-coded services by adding pooling, circuit breakers, and streaming—then refactoring safely behind feature flags.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
