The Six Playbooks I Reuse to Cut p95 in Half: Monoliths, Meshes, Kafka, Serverless, SPAs, and AI Inference
Stop firefighting performance on Slack at 2 a.m. Build reusable, measurable playbooks that move user-facing metrics and revenue.
Performance wins are boring only if you measure the right things and can roll them out safely—boring is what prints money.
Why playbooks beat heroics
I’ve been paged for the same latency incident at three different companies: p95 jumped, checkout conversions dipped, marketing yelled, and someone proposed “maybe add more pods.” I’ve seen that fail. Throwing CPU at a broken path is how you burn cash and trust.
What works is having playbooks scoped to your architecture and tied to user-facing metrics. Not a wiki page with vague advice—actual configs, commands, and rollout steps. At GitPlumbers, we ship these as repo-ready modules with SLOs, dashboards, and test scripts. The outcome is boring on purpose: repeatable p95 wins and fewer 2 a.m. guesses.
Measure what users feel. If it doesn’t move p95/p99, Core Web Vitals, or checkout completion rate, it’s noise.
User-facing KPIs we anchor on:
- Web: LCP, INP, TTFB, and conversion rate
- APIs: p95/p99 latency, tail error rate, and cost/request
- Data/async: consumer lag, end-to-end event age
- AI: time-to-first-token, tokens/sec, and answer quality (hallucination rate)
Playbook 1: Monolith + CDN + Relational DB
When you’ve got a Rails/Django/Node monolith behind Nginx and Postgres, the fastest wins are boring:
- Cache the right stuff at the edge
- Kill N+1s and tune DB connections
- Push static and long-tail semi-static content off the app
Concrete steps:
- Edge caching with revalidation
- Set `Cache-Control` with `stale-while-revalidate` and ETags. I’ve cut p95 TTFB from 380ms → 120ms on product pages this way.

```nginx
location ~* \.(js|css|png|jpg|svg)$ {
  add_header Cache-Control "public, max-age=31536000, immutable";
}
location /catalog/ {
  proxy_cache catalogs;
  add_header Cache-Control "public, max-age=60, stale-while-revalidate=120";
  add_header ETag $upstream_http_etag;
}
```

Kill N+1s
- Use query logs and APM flamegraphs (`pg_stat_statements`, New Relic, Datadog). Prioritize the top three offenders. One client’s p95 went 1.2s → 480ms after preloading associations.
Connection pooling with PgBouncer
- You’ll drown Postgres with 1:1 app connections. Target ~`(cores*2)` active on Postgres; pool the rest.
```ini
# pgbouncer.ini
[databases]
app = host=postgres.local dbname=app

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
```

Indexes that pay rent
- Use `EXPLAIN (ANALYZE, BUFFERS)` to prove value. Create composite indexes that match query predicates.

```sql
CREATE INDEX CONCURRENTLY idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC);
```

Expected outcomes we’ve repeated:
- Catalog page LCP improved 35–55%
- API p95 down 40–60%
- DB CPU down 25–35% from pooling and fewer full scans
Playbook 2: Microservices + API Gateway/Service Mesh (Kubernetes)
I’ve seen teams ship retries everywhere and accidentally DDoS their own services. Meshes like Istio give you superpowers—and footguns. The playbook:
- Set retry budgets and circuit breakers
- Autoscale on the metric that matters
- Use canaries to de-risk changes
Configs:
- Retry/circuit breaker (Istio DestinationRule)
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-svc
spec:
  host: checkout
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
    loadBalancer:
      simple: LEAST_CONN
```

- Retry policy (Envoy/Istio VirtualService) with a budget (no infinite thrash):
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-vs
spec:
  hosts: ["checkout"]
  http:
  - route:
    - destination: { host: checkout, subset: v1 }
    retries:
      attempts: 2
      perTryTimeout: 300ms
      retryOn: "5xx,connect-failure,reset"
```

- Autoscaling on RPS or queue depth, not CPU
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "25"
```

- Canary with progressive traffic
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-canary
spec:
  hosts: ["checkout"]
  http:
  - route:
    - destination: { host: checkout, subset: v1 }
      weight: 90
    - destination: { host: checkout, subset: v2 }
      weight: 10
```

Observed wins:
- Checkout API p95: 900ms → 420ms by right-sizing retries and opening circuits early
- MTTR: down 30–50% with circuit breaking and better autoscaling signals
- Cloud bill: down 15–25% by scaling on RPS instead of CPU spikes
Playbook 3: Kafka/Event-Driven Pipelines
Throughput without backpressure = incident. The usual failures: single hot partition, tiny batches, and consumer GC pauses.
What works:
Partitioning and keys
- Align keys with access patterns. For truly hot keys, use a sharded key (`userId#shard`) to avoid one-partition hotspots.
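As a sketch (Python; names are illustrative), one way to shard a hot key is to round-robin a shard suffix per message so a single user's traffic spreads across partitions:

```python
import itertools

# Per-producer counter; ordering is preserved only within each shard,
# not across the whole logical key.
_slot = itertools.count()

def sharded_key(user_id: str, shards: int = 8) -> str:
    """Append a rotating shard suffix so one hot user's messages land
    on up to `shards` partitions instead of a single hot one."""
    return f"{user_id}#{next(_slot) % shards}"
```

Consumers strip the suffix to recover the logical key; only do this where per-key ordering across shards isn't required.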
Batching and compression
- Producers: `linger.ms` 5–20ms, `batch.size` 64–128KB, `compression.type` zstd.

```properties
# producer.properties
acks=all
linger.ms=10
batch.size=131072
compression.type=zstd
```

Consumer parallelism
- One consumer per partition. Use async processing inside the consumer only if you preserve ordering when required.
Idempotency + exactly-once semantics (EOS) where it matters
- Turn on producer idempotence; for stateful sinks, coordinate with a transactional outbox.
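A minimal sketch of the transactional outbox (Python with SQLite standing in for your relational store; table and column names are illustrative):

```python
import json
import sqlite3
import uuid

def place_order(db: sqlite3.Connection, order: dict) -> str:
    """Write the order and its event row in ONE transaction. A separate
    relay process ships rows from `outbox` to Kafka and marks them sent,
    so the event is published if and only if the order committed."""
    event_id = str(uuid.uuid4())
    with db:  # commits both inserts atomically, or neither
        db.execute("INSERT INTO orders(id, payload) VALUES (?, ?)",
                   (order["id"], json.dumps(order)))
        db.execute("INSERT INTO outbox(event_id, topic, payload) VALUES (?, ?, ?)",
                   (event_id, "orders", json.dumps(order)))
    return event_id

# Schema setup (in-memory for the sketch)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders(id TEXT PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE outbox(event_id TEXT PRIMARY KEY, topic TEXT, payload TEXT)")
```

The relay can then publish with an idempotent producer keyed by `event_id`, making replays harmless.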
Lag-based autoscaling in Kubernetes
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-consumer-hpa
spec:
  scaleTargetRef:            # target deployment (name assumed)
    apiVersion: apps/v1
    kind: Deployment
    name: orders-consumer
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_group_lag
        selector:
          matchLabels:
            group: orders-cg
      target:
        type: AverageValue
        averageValue: "500"
```

- JVM GC sanity (for Java consumers)
- G1GC, cap heap to avoid long pauses; instrument with `jvm_gc_pause_seconds_bucket` in Prometheus.
Repeated outcomes:
- Event age p95: 1.8s → 300ms
- Consumer CPU: down 20–35% via batching + compression
- Fewer replays thanks to idempotent producers and outbox patterns
Playbook 4: Serverless APIs (AWS Lambda, Cloud Run)
Most “serverless is slow” incidents are cold starts, VPC egress, or unbounded concurrency taking out a shared dependency (hello, RDS). Fixes that stick:
- Provisioned concurrency / min instances
- Keep 1–3 warm per AZ. Use schedules to ramp before traffic.
```shell
aws lambda put-provisioned-concurrency-config \
  --function-name checkout \
  --qualifier prod \
  --provisioned-concurrent-executions 6
```

Bundle smart, not big
- Tree-shake, native deps by layer, lazy-init SDKs. I’ve seen p95 cold starts drop from 1.2s → 200ms on Node by trimming 40MB of deps.
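The lazy-init point reduces to a one-liner shape (a Python sketch; the slow constructor is simulated here, but the same pattern applies to boto3 clients or DB drivers):

```python
import functools
import time

@functools.lru_cache(maxsize=1)
def get_client():
    # Stand-in for an expensive SDK/client constructor. lru_cache means
    # we pay the init cost once per container, not once per request —
    # and never during cold start if the first request doesn't need it.
    time.sleep(0.01)  # pretend this is slow initialization
    return object()

def handler(event, context=None):
    client = get_client()  # first call initializes; warm calls reuse it
    return {"status": 200, "client_id": id(client)}
```

Keep the heavy imports inside `get_client()` too, so module load (the cold-start path) stays cheap.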
VPC trade-offs
- Avoid VPC unless needed; if required, use NAT with keep-alives and RDS Proxy.
Concurrency guards
- Cap concurrency to protect downstreams; add a queue (SQS, Pub/Sub) if bursts are part of life.
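A load-shedding guard, as a sketch (Python; on Lambda you'd set reserved concurrency instead, but the same idea protects any shared downstream):

```python
import threading

class ConcurrencyGuard:
    """Reject (shed) calls beyond `limit` instead of queueing them
    forever and melting the shared downstream (e.g., RDS)."""
    def __init__(self, limit: int):
        self._sem = threading.BoundedSemaphore(limit)

    def try_call(self, fn, *args):
        if not self._sem.acquire(blocking=False):
            return None  # caller should return 429 or enqueue to SQS
        try:
            return fn(*args)
        finally:
            self._sem.release()
```

Shedding early keeps p95 flat for the requests you do serve; the queue absorbs the rest.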
Expected outcomes:
- API p95: 600–900ms → 200–350ms
- Error rate: down 30–60% during spikes with concurrency caps + queues
- Cost/request: down 15–25% after bundle and cache tweaks
Playbook 5: SPA + Edge (Next.js, Nginx, CloudFront/Fastly)
Users feel jank before they read your release notes. This playbook aims at Core Web Vitals and conversion.
- Push render work to build time
- `getStaticProps` for stable content; ISR for near-real-time.
```javascript
// next.config.js
module.exports = {
  experimental: { instrumentationHook: true },
  images: { formats: ['image/avif', 'image/webp'] },
  compress: true,
};
```

Preconnect + early hints
- Use `103 Early Hints` for fonts/APIs. Many CDNs support it now.
Code split + delay hydration
- Ship less JS. Measure INP and defer non-critical hydration.
Edge caching with SWR
```
Cache-Control: public, max-age=60, stale-while-revalidate=300
```

- Image optimization
- AVIF/WebP, responsive sizes; LCP often dominated by hero images.
Outcomes we’ve banked across e-comm and SaaS:
- LCP: 3.0s → 1.7s (mobile) in two sprints
- INP: 280ms → 140ms, measurable uplift in funnel completion (+2–4%)
Playbook 6: AI Inference Paths (LLM/RAG)
AI latency isn’t magic; it’s queueing + token throughput. The usual failure: maxing GPU memory with oversized models, no batching, and no streaming to the user.
Right-size the model + quantize
- If your use case tolerates it, 8/4-bit quantization with `vLLM` often halves cost while keeping quality.
Batching and KV cache
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype auto --max-num-seqs 64 --gpu-memory-utilization 0.9 \
  --enable-prefix-caching --tensor-parallel-size 1
```

Stream tokens to the UI
- Users perceive speed with time-to-first-token; wire SSE and flush ASAP.
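The SSE wiring reduces to framing each token as it arrives (a Python sketch; `token_stream` stands in for your model client's streaming iterator):

```python
def sse_events(token_stream):
    """Frame tokens as Server-Sent Events so the browser can render text
    as soon as the first token lands — optimize time-to-first-token,
    not total generation time."""
    for token in token_stream:
        yield f"data: {token}\n\n"  # SSE frame: 'data:' line + blank line
    yield "data: [DONE]\n\n"
```

Hand the generator to your framework's streaming response (and disable proxy buffering) so frames flush immediately.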
Canary new models with guardrails for cost and hallucination rate
- Route 5–10% traffic; track answer quality and deflection to human.
Cache embeddings and RAG chunks
- Redis/Valkey with TTLs; dedupe queries; warm caches on deploy.
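The dedupe idea as a sketch (Python; an in-process dict stands in for Redis/Valkey, and the normalization rule is an assumption to tune per product):

```python
import hashlib
import time

class TTLCache:
    """Tiny TTL cache. In production this would be Redis/Valkey (SETEX)
    keyed by a hash of the normalized query."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}

    @staticmethod
    def key(query: str) -> str:
        # Normalize so trivially different phrasings dedupe to one entry.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        entry = self._store.get(self.key(query))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None

    def put(self, query, embedding):
        self._store[self.key(query)] = (time.monotonic() + self.ttl, embedding)
```

Hash-keyed entries also make deploy-time warming trivial: replay the top-N queries against the cache before cutting traffic over.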
Observed outcomes in production help desks and sales assistants:
- Time-to-first-token: >1.2s → 250–400ms with streaming and KV cache
- Tokens/sec: up 1.5–2.2× via batching
- Cost/session: down 30–45% with smaller/quantized models and cache hits
Guardrails, rollout, and proving ROI
Playbooks aren’t done until they’re safe by default and measurable end-to-end.
SLOs and error budgets
- Define per-journey SLOs (e.g., Checkout API p95 < 400ms, 99.9% of the time). Tie deployment gates to budgets.
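The budget math that gates deploys is simple enough to sketch (Python; window and thresholds are yours to pick):

```python
def error_budget_remaining(slo: float, total: int, errors: int) -> float:
    """Fraction of the error budget left in the window.
    slo=0.999 over 1M requests allows 1,000 bad requests."""
    allowed = total * (1 - slo)
    if allowed == 0:
        return 0.0
    return max(0.0, 1 - errors / allowed)
```

Gate the pipeline on this number: below some floor (say 25% remaining), canaries pause and the team fixes reliability before shipping features.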
Load test like reality (no synthetic fantasies)
```javascript
// k6 smoke that mimics your mix
import http from 'k6/http';
import { sleep } from 'k6';

export let options = { stages: [{ duration: '2m', target: 200 }] };

export default function () {
  http.get(__ENV.BASE_URL + '/catalog');
  sleep(1);
}
```

Observe everything
- Prometheus + Grafana + OpenTelemetry traces; correlate p95 drops with conversion lifts.
GitOps or it didn’t happen
- Terraform infra + ArgoCD app configs. No kubectl-in-prod heroics.
Chaos Engineering (safely)
- Kill pods, inject latency, verify circuits and budgets. Do it in staging first.
Business proof
- For each playbook run, capture: before/after p95, CWV deltas, error rate, infra cost, and revenue proxy (conversion, churn, support tickets). That’s what closes the loop with your CFO.
I’ve watched teams clean up AI-generated code that “worked on my laptop” but cratered p99 in prod. Bake these fixes into code, not Slack threads: feature flags, infra as code, dashboards, and runbooks. GitPlumbers calls this a code rescue, and yes, it includes vibe code cleanup and AI code refactoring when the genie has already written half your service.
Key takeaways
- Performance wins stick when you codify them as playbooks tied to user-facing KPIs.
- Each architecture has a small set of high-ROI levers—cache headers for monoliths, retry/circuit policies for meshes, batching for Kafka, concurrency for serverless, CWV for SPAs, and token streaming for AI.
- Measure improvements in p95/p99, Core Web Vitals (LCP/INP/TTFB), and cost-to-serve—not just synthetic benchmarks.
- Roll out with canaries, protect with SLO/error budgets, and validate with load tests that mimic production mixes.
- Bake configs into GitOps (ArgoCD + Terraform) so improvements don’t decay into tribal knowledge.
Implementation checklist
- Define SLOs for each user journey before touching configs.
- Instrument baseline with Prometheus/OpenTelemetry and real-user monitoring (RUM).
- Pick the playbook that matches your architecture; do not randomize optimizations.
- Roll out with canary + feature flags and watch error budgets.
- Document before/after metrics and codify as infra/app code (GitOps).
Questions we hear from teams
- How do we choose which playbook to start with?
- Start where users feel pain and where you control the levers. If LCP/INP is red, run the SPA + Edge playbook. If API p95 spikes under load, run Microservices + Mesh. If event age is high, run Kafka. Don’t mix everything—pick one, set an SLO, measure, and iterate.
- What if our stack is part monolith, part microservices?
- That’s normal. Apply playbooks per boundary: monolith behind CDN gets caching/DB fixes; microservices behind mesh get retry/circuit/autoscaling. Keep SLOs per user journey so improvements roll up to business metrics.
- How do we prevent regressions after we fix p95?
- Codify configs (Terraform + ArgoCD), add canary gates, and pin dashboards to PRs. Use error budgets to halt rollouts when p95 or error rate regresses. Bake load tests (k6) into CI for critical journeys.
- Can these help with AI-generated code that’s already in production?
- Yes. We run a code rescue: instrument hot paths, add SLOs, and apply the matching playbooks. We’ve cleaned up vibe-coded services by adding pooling, circuit breakers, and streaming—then refactoring safely behind feature flags.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
