The Performance Playbooks I Wish I’d Had: Pattern-by-Pattern, p95 Down, Revenue Up
Battle-tested optimization playbooks for the architectures you actually run—monoliths, microservices, queues, serverless, SPAs, and AI inference—focused on user-facing metrics and business impact.
Performance improvements don’t count until they show up in user metrics and revenue—everything else is LARPing as engineering.
Why playbooks beat hope-driven tuning
If you’ve ever stared at a Grafana board at 2 a.m. wondering why p95 blew up after a “small” release, you know the truth: ad‑hoc tuning is just wishful thinking. What actually works is having a playbook per architecture—a short list of moves you can execute quickly, with predictable outcomes, and measurable user impact.
We’ve used these playbooks at fintechs, marketplaces, and SaaS shops from 5 to 5,000 engineers. The patterns don’t change that much; the stakes do. Tie every move to user-facing metrics: p95/p99 latency, error rate, Core Web Vitals (LCP/INP), and success rates per critical journey (checkout, sign-in). Then tie those to business: conversion, retention, infra cost per transaction.
If you can’t show a graph where p95 goes down and revenue goes up, it didn’t happen.
Below are the playbooks we deploy at GitPlumbers when a system squeals. They’re short, specific, and oriented around money.
Monolith + RDBMS: kill the hot path first
Symptoms you’ve seen: slow TTFB on a few endpoints, DB CPU pegged, and a pile of AI-generated code that does three ORM calls where one SQL would do. I’ve seen teams throw read replicas at this and make it worse (more replicas ≠ fewer N+1s).
What moves the needle:
- Profile real user traffic first: enable `pg_stat_statements` (PostgreSQL 14+) and sample queries by latency and frequency.
- Fix the top 3 queries (hot path): add composite indexes, collapse ORM chatter into server-side joins, cache expensive lookups in Redis 7 with short TTLs.
- Pool connections: `pgbouncer` in transaction mode. Monoliths love to exhaust DB connections.
- Separate read-heavy endpoints to replicas once queries are sane.
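The "collapse ORM chatter into server-side joins" move looks like this in application code. A sketch assuming a hypothetical pg-style client whose `query(text, params)` resolves to `{ rows }`:

```javascript
// N+1 fix sketch: one parameterized join instead of a query per order.
// `db` is a hypothetical pg-style client, not a specific library's API.
async function ordersWithPlans(db, accountIds) {
  // one round trip for the whole batch
  const { rows } = await db.query(
    `SELECT o.id, o.total, a.plan
       FROM orders o
       JOIN accounts a ON a.id = o.account_id
      WHERE o.account_id = ANY($1)`,
    [accountIds]
  );
  return rows;
}
```

The ORM equivalent usually issues one query per order plus one per account; this issues exactly one, which is what moves p95.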
SQL and config you’ll actually use:
```sql
-- Find hot queries (PostgreSQL 13+ renamed mean_time to mean_exec_time)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time * calls DESC
LIMIT 10;

-- Typical fix: composite index for sort+filter
CREATE INDEX CONCURRENTLY idx_orders_account_created
  ON orders (account_id, created_at DESC);
```

```ini
; pgbouncer.ini
[databases]
app = host=db-primary port=5432 dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
reserve_pool_size = 20
server_reset_query = DISCARD ALL
```

```javascript
// Redis read-through cache with 30s TTL (node-redis v4-style client)
const key = `acct:${accountId}:plan`;
let plan = await redis.get(key);
if (plan) {
  plan = JSON.parse(plan); // cache hit: stored as JSON, parse on the way out
} else {
  const { rows } = await db.query('SELECT plan FROM accounts WHERE id = $1', [accountId]);
  plan = rows[0]?.plan;
  await redis.set(key, JSON.stringify(plan), { EX: 30 });
}
```

Measured results we’ve seen:
- Checkout p95 TTFB from 1.8s → 450ms; error rate −60%; conversion +2.3% week over week.
- DB CPU −35% and 1 read replica instead of 3 (infra savings).
Sync microservices: break the fan‑out and tame retries
You know the smell: one gateway call fans out to 6 services, two of them fan out again. Add a little jitter, and your p95 becomes p-never. I’ve watched teams “scale” this by adding pods; the real fix is reducing synchronous depth and isolating failure.
Moves that work:
- Collapse chatty calls into a purpose-built aggregator service or GraphQL resolver with data loader batching.
- Set circuit breakers and bulkheads in the mesh (Istio 1.20+/Envoy 1.27) to stop thundering herds.
- Time budget per request: enforce 200–300ms per hop, total 1s at gateway. Everything else goes async via a queue.
- Cache idempotent reads at the edge (CDN or an Envoy rate-limit/caching filter) for 30–120s.
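The data-loader batching in the first bullet can be sketched in a few lines. This mimics the idea behind the dataloader package, not its actual API: lookups issued in the same tick are coalesced into one backend call, which is what kills resolver-level fan-out.

```javascript
// Minimal batching loader sketch. batchFn(ids) -> Promise<results in order>.
function createLoader(batchFn) {
  let queue = [];
  return function load(id) {
    return new Promise((resolve, reject) => {
      queue.push({ id, resolve, reject });
      if (queue.length > 1) return; // a flush is already scheduled
      queueMicrotask(async () => {
        const batch = queue;
        queue = [];
        try {
          // one backend call for every load() issued this tick
          const results = await batchFn(batch.map((item) => item.id));
          batch.forEach((item, i) => item.resolve(results[i]));
        } catch (err) {
          batch.forEach((item) => item.reject(err));
        }
      });
    });
  };
}
```

A GraphQL resolver that calls `load(userId)` fifty times per request now triggers one batched query instead of fifty.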
Istio config we deploy on day one:
```yaml
# DestinationRule with sane connection pool + outlier detection
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: svc-payments
spec:
  host: svc-payments
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 50
      tcp:
        maxConnections: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
```

And a gateway timeout budget:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: api-gateway
spec:
  hosts: ["api.example.com"]
  http:
    - timeout: 1s
      retries:
        attempts: 2
        perTryTimeout: 300ms
      route:
        - destination:
            host: svc-gateway
```

Add tracing across hops (OpenTelemetry) and you’ll see the fan-out instantly. We pair this with resilience4j in JVM services or Envoy circuit breakers for polyglot stacks.
Outcomes we’ve seen:
- API p95 from 920ms → 310ms; tail p99 from 2.8s → 780ms.
- Incident MTTR −40% because retries stopped melting dependencies.
Event-driven & queues: control backpressure, not just speed
Queues buy you resilience and cost efficiency, but I’ve watched teams scale consumers until Kafka begs for mercy while downstreams drown. The playbook is about backpressure and batching.
Moves that work:
- Define SLO as end-to-end latency (event accepted → side effect completed). Optimize for that, not messages/sec.
- Tune batch size and concurrency per topic based on payload and downstream capacity.
- Autoscale consumers on lag with
KEDA, not CPU. - Right-size visibility timeouts (SQS) or max.poll.interval (Kafka) to avoid poison-pill re-drives.
Configs we ship:
```yaml
# KEDA autoscaling on Kafka lag
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    name: orders-consumer
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        topic: orders
        consumerGroup: orders-consumer
        lagThreshold: "1000"
```

```shell
# Kafka consumer overrides
KAFKA_MAX_POLL_RECORDS=500
KAFKA_FETCH_MIN_BYTES=1048576  # 1MB to encourage batching
```

Results:
- End-to-end SLO 95th from 12m → 2m at steady state; cost −18% by running fewer, fatter consumers.
- Retry storms eliminated after fixing visibility timeouts and idempotency keys.
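The idempotency-key half of that fix can be sketched with an in-memory set standing in for what should be a Redis `SET NX` with a TTL or a unique DB constraint in production:

```javascript
// Idempotent consumer sketch: a re-delivered message becomes a no-op.
// The in-memory Set is a stand-in; use atomic storage in production so
// "have we already done this side effect?" has one answer across replicas.
function createIdempotentHandler(sideEffect) {
  const seen = new Set();
  return async function handle(msg) {
    if (seen.has(msg.idempotencyKey)) return 'duplicate';
    await sideEffect(msg);
    seen.add(msg.idempotencyKey); // mark only after the side effect lands
    return 'processed';
  };
}
```

Producers stamp `idempotencyKey` once per logical event; redeliveries after a visibility-timeout re-drive then skip the charge, email, or write instead of repeating it.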
Serverless APIs: nuke cold starts and right-size timeouts
Serverless is great until p95 is just your runtime waking up. The fix is largely mechanical.
Moves that work:
- Turn on provisioned concurrency for interactive paths (Lambda) or minimum instances (Cloud Functions/Cloud Run).
- Bundle smarter: use `esbuild` to tree-shake, avoid giant `node_modules`, and prefer `arm64`.
- Keep connections warm: HTTP/2 keep-alive and RDS Proxy for databases.
Provisioned concurrency with Terraform:
```hcl
resource "aws_lambda_function" "api" {
  function_name = "api-handler"
  package_type  = "Image"
  image_uri     = data.aws_ecr_image.api.image_uri
  architectures = ["arm64"]
  memory_size   = 1024
  timeout       = 5
  publish       = true # provisioned concurrency needs a published version
}

resource "aws_lambda_provisioned_concurrency_config" "api" {
  function_name                     = aws_lambda_function.api.function_name
  provisioned_concurrent_executions = 10
  # $LATEST is not allowed here; target the published version (or an alias)
  qualifier = aws_lambda_function.api.version
}
```

Measured impact:
- p95 from 780ms → 180ms after provisioned concurrency (10) + bundle 7.8MB → 1.2MB.
- Checkout success +1.1% on mobile due to fewer timeouts.
SPAs & edge: Vitals or bust
If you haven’t looked at your LCP lately, your marketing team has. Frontends rot fast with “vibe coding” and AI-assisted bloat. We’ve done a lot of vibe code cleanup in SPAs where tree-shaking was an aspirational sticker, not a reality.
Moves that work:
- Ship fewer bytes: code-splitting, preload critical CSS, compress images (AVIF/WebP) with CDNs.
- Cache aggressively at the edge with immutable assets and short‑TTL HTML.
- Preconnect/preload origins; defer non-critical scripts; measure with RUM.
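Code-splitting’s core mechanic is a dynamic `import()` that the bundler lifts into its own chunk. A minimal load-once wrapper (the `./Chart` path in the comment is a placeholder):

```javascript
// Cache the loader's promise so the chunk is fetched exactly once, on first
// use, and every later call reuses the in-flight or completed load.
function lazyOnce(loader) {
  let cached;
  return () => (cached ??= loader());
}

// Usage in an SPA (any bundler that understands import() splits the chunk):
// const getChart = lazyOnce(() => import('./Chart'));
```

The heavy module stays out of the main bundle and off the LCP-critical path until the user actually reaches the route that needs it.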
Headers that actually help:
```javascript
// next.config.js — the headers() convention expects route-matched entries
module.exports = {
  async headers() {
    return [{
      source: '/:path*',
      headers: [
        { key: 'Cache-Control', value: 'public, max-age=60, stale-while-revalidate=600' },
        { key: 'CDN-Cache-Control', value: 'public, s-maxage=600' },
        { key: 'Strict-Transport-Security', value: 'max-age=63072000; includeSubDomains; preload' },
      ],
    }];
  },
};
```

Nginx for asset immutability:

```nginx
location /_next/static/ {
  add_header Cache-Control "public, max-age=31536000, immutable";
}
```

Results we’ve delivered:
- LCP p75 from 3.2s → 1.4s on mid-tier Android in India after code-splitting and image CDN. Organic conversion +3.0%.
- INP p75 from 380ms → 160ms by deferring 3rd-party scripts and reducing hydration work.
AI inference: throughput over micro-optimizations
Most AI latency problems aren’t in your Python for‑loop—they’re in batching, token streaming, and GPU starvation. We’ve rescued a few AI-generated code inference servers where the model ran at 15% GPU utilization while CPUs pegged doing tokenization and JSON logging.
Moves that work:
- Use a server that supports dynamic batching and PagedAttention: vLLM, Triton Inference Server.
- Batch at the right granularity and stream tokens to the client; prioritize tail latency over max batch size for interactive UX.
- Quantize where acceptable (INT8/FP8) and cache frequent prompts/responses.
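The prompt/response cache in the last bullet, sketched as an exact-match in-memory LRU. In production this would be Redis with a TTL, and it is only safe at temperature 0 or where slightly stale answers are acceptable:

```javascript
// Exact-match prompt cache: normalize whitespace so trivially different
// prompts hit the same entry; Map insertion order gives us a cheap LRU.
class PromptCache {
  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }
  normalize(prompt) {
    return prompt.trim().replace(/\s+/g, ' ');
  }
  get(prompt) {
    const key = this.normalize(prompt);
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value); // refresh recency
    return value;
  }
  set(prompt, response) {
    const key = this.normalize(prompt);
    this.map.delete(key);
    this.map.set(key, response);
    if (this.map.size > this.maxEntries) {
      this.map.delete(this.map.keys().next().value); // evict least recent
    }
  }
}
```

Even a single-digit hit rate pays for itself: every hit skips the full prefill-and-decode cost on the GPU.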
vLLM config we like:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --max-num-batched-tokens 2048 \
  --trust-remote-code
```

Prometheus GPU metrics (NVIDIA DCGM exporter) alerting on underutilization:

```yaml
- alert: GPUUnderutilized
  expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "GPU util low; check batching/tokenization bottlenecks"
```

Measured impact:
- Token throughput +2.4x; p95 first token latency 1.2s → 450ms; cost per 1K tokens −28%.
- Satisfaction scores on AI assistant flows up 12 points when we turned on streaming.
Make it stick: SLOs, GitOps, and safe rollout
This is where teams fall down: they get a win, then drift. Lock it in with process.
- Define SLOs per surface. Example: checkout p95 < 400ms with 99% success. Track error budget burn.
- Wire alerts to user pain, not node pain. Alert only when budget burns fast.
- Roll changes via GitOps (ArgoCD) and progressive delivery (Argo Rollouts). Canary, measure, then promote.
- Keep a short playbook doc in the repo. We version ours like code.
Prometheus SLO alert:
```yaml
- alert: CheckoutSLOBurn
  # fire when fewer than 95% of checkout requests finish under 400ms (p95 SLO)
  expr: |
    (
      sum(rate(http_request_duration_seconds_bucket{le="0.4",route="/checkout"}[5m]))
      /
      sum(rate(http_request_duration_seconds_count{route="/checkout"}[5m]))
    ) < 0.95
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Checkout p95 SLO burn"
```

Canary the risky change:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 300 }
        - analysis:
            templates:
              - templateName: p95-check
        - setWeight: 50
        - pause: { duration: 300 }
```

We’ve had clients try to “optimize” with a big rewrite while drowning in technical debt and AI hallucination-ridden code. Don’t. Use these playbooks to stabilize, measure the business win, and then refactor with confidence. If you need code rescue or AI code refactoring, that’s what we do.
Key takeaways
- Performance playbooks should be tied to user-facing metrics like p95 TTFB, LCP, and success rates—not server CPU graphs.
- Every architecture has a small set of high-leverage moves. Pick the right one for the pattern you’re running.
- Define SLOs and wire alerts to user-centric KPIs before you tune. Otherwise you’ll chase noise.
- Make changes safely with canaries and GitOps so you can iterate fast without waking up PagerDuty.
- Prove business impact: tie each optimization to conversion, retention, and infra cost per transaction.
Implementation checklist
- Instrument p50/p95/p99 latency and error rates per endpoint or user flow.
- Define 1–2 SLOs per product surface (e.g., p95 < 400ms, LCP < 2.5s).
- Add request tracing across boundaries (`OpenTelemetry`) to see fan-out and hot paths.
- Pick the playbook that matches your architecture; don’t cargo cult fixes from a different pattern.
- Change one variable at a time; use canaries (`Argo Rollouts`) and watch p95 and error budget burn.
- Document the win with before/after graphs and the business KPI it moved.
Questions we hear from teams
- How do I pick the right playbook for my system?
- Start with your critical user flow (e.g., checkout) and trace it end-to-end with OpenTelemetry. If most time is inside a single DB call, use the Monolith + RDBMS playbook. If you see synchronous fan-out across services, use the microservices playbook. If it’s event-driven, optimize queue backpressure. For API endpoints on Lambda/Cloud Run, use the serverless playbook. For web vitals, use the SPA & edge playbook. For generative AI endpoints, use the inference playbook.
- What KPIs should we monitor to prove business impact?
- Tie engineering metrics to product KPIs: p95 TTFB and LCP to conversion rate, error budget burn to churn/retention, infra cost per transaction to gross margin. Use cohort analysis to verify improvements stick beyond novelty (e.g., mobile vs desktop, geographies).
- Isn’t micro-optimizing code faster than changing architecture?
- Micro-optimizations can help, but the highest leverage usually comes from fixing the pattern-level bottlenecks: fan-out depth, hot DB paths, cold starts, lack of caching, and backpressure. We routinely see 2–5x wins from these changes, while code micro-tuning nets 5–15%.
- How do we roll these changes safely without new incidents?
- Use GitOps with ArgoCD, progressive canaries with Argo Rollouts, and SLO-based alerts. Change one variable at a time, add a rollback plan, and watch p95/p99 and error rate during each step. Keep blast radius small and iterate.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
