Seven Performance Playbooks That Actually Move the Needle (Core Web Vitals to Token Throughput)
Reusable, testable playbooks that tie p95 latency to revenue — for SPA+BFF, Monolith+DB, Microservices, Kafka, Serverless, and LLM inference.
You don’t need a silver bullet — you need seven boring playbooks you can run in your sleep.
The playbook mindset: from latency to revenue
I’ve seen the same movie at a dozen companies: teams run “performance sprints,” speed up a few endpoints, and six months later regressions creep back. The fix isn’t heroics — it’s playbooks. For each architecture you run, you need a repeatable checklist that ties user-facing metrics to business outcomes, with clear rollout and rollback.
- Metrics that matter: LCP, TTI, CLS (web), p95/p99 latency (APIs), TTFT and TPOT (LLM), Apdex, consumer_lag (Kafka), error rate, and cache hit ratio.
- Business linkage: conversion rate, revenue per minute, abandonment, retention, support tickets, infra spend.
- Proof: A/B or holdout groups; compare conversion and performance before/after. Don’t ship “faster” — ship “+0.8% conversion at p95 -200ms”.
The Amazon and Google playbooks have beaten this into us for years: 100ms can cost measurable revenue. I’ve watched an LCP drop from 3.1s to 1.9s lift mobile conversion 1–2% at a mid-market retail client. That’s the language finance understands.
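That holdout-style proof reduces to a small helper. A sketch, assuming you can export sessions and conversions per group; the function name and sample numbers are illustrative, not client data:

```javascript
// Relative conversion lift of a treatment group vs a holdout/control.
// Both arguments are { sessions, conversions } aggregates.
function conversionLift(treatment, control) {
  const t = treatment.conversions / treatment.sessions;
  const c = control.conversions / control.sessions;
  return (t - c) / c; // e.g. 0.008 means "+0.8% conversion"
}
```

Pair the lift with the latency delta on the same dashboard so the claim reads "+0.8% conversion at p95 -200ms", not just "faster".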
SPA + BFF + CDN: win the first interaction
This is the stack where Core Web Vitals pay the bills. Your goal is LCP < 2.5s, TTI < 2s, stable layouts, and predictable navigations.
Measure
- Run Lighthouse CI and web-vitals RUM. Export to Prometheus and chart in Grafana.
- Capture server-timing headers in the BFF for a TTFB breakdown.
Quick wins (often 30–60% LCP improvement in a week)
- Push static assets to a CDN with Cache-Control: public, max-age=31536000, immutable and ETag.
- Enable brotli and HTTP/3 (QUIC) at the edge (Cloudflare/Akamai/Fastly).
- Serve images as AVIF/WebP, use sizes/srcset, and lazy-load below-the-fold images.
- Inline critical CSS (<14KB) and defer the rest; eliminate render-blocking JS.
- In Next.js/Remix, prefer app/ routing and next/image; adopt React Server Components where feasible.
Deeper fixes
- Move HTML to the edge with stale-while-revalidate; add a cookie-aware bypass for logged-in users.
- Collapse N+1 BFF calls into a single aggregate; use HTTP/2 multiplexing with keep-alive.
- Ship priority hints: <link rel="preload" as="image" imagesrcset="..."> and <link rel="preconnect" href="https://api.yourbff.com">.
Config snippets
- NGINX brotli + cache:

```nginx
brotli on;
brotli_comp_level 6;

location /assets/ {
  expires 1y;
  add_header Cache-Control "public, max-age=31536000, immutable";
}
```

- Edge worker (Cloudflare) for SWR:

```javascript
export default {
  async fetch(request) {
    const ttl = 60, swr = 300;
    const html = await renderHtml(request); // however you produce the HTML (origin fetch, render, etc.)
    return new Response(html, {
      headers: {
        'Cache-Control': `public, max-age=${ttl}, stale-while-revalidate=${swr}`,
      },
    });
  },
};
```
Expected outcomes
- LCP -500–1200ms, TTI -300–800ms; 0.5–2.0% conversion lift on mobile product pages.
Monolith + RDBMS: stop making the database cry
90% of the wins come from query shape, indexes, and caching. I’ve watched teams throw read replicas at a SELECT * with a bad filter. Don’t be that team.
Measure
- Turn on pg_stat_statements and sample slow queries; trace with OpenTelemetry.
- EXPLAIN (ANALYZE, BUFFERS) your top 20 queries.
Quick wins
- Add covering indexes for hot paths. Example:

```sql
CREATE INDEX CONCURRENTLY idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC)
  INCLUDE (total);
```

- Kill ORM-generated N+1s; batch IN queries or add data loaders.
- Introduce a read-through cache (Redis) for idempotent reads: GET order:123 # miss -> fetch DB -> SETEX order:123 300 <json>
- Use pgbouncer in transaction mode; cap the app pool to protect the DB.
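The read-through pattern above is small enough to sketch. Here the cache client is injected so the logic stands alone; the `get`/`setex` shape mirrors Redis commands, and the key and TTL are illustrative:

```javascript
// Read-through cache: try the cache first, fall back to the loader (DB),
// then populate the cache with a TTL so the next read is a hit.
async function readThrough(cache, key, ttlSeconds, loader) {
  const hit = await cache.get(key);
  if (hit != null) return JSON.parse(hit); // cache hit: skip the DB
  const fresh = await loader();            // miss: fetch from the DB
  await cache.setex(key, ttlSeconds, JSON.stringify(fresh));
  return fresh;
}
```

Keep the loader idempotent; anything with side effects does not belong behind this pattern.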
Deeper fixes
- Move heavy reads to a materialized view refreshed by a job.
- Denormalize joins serving product pages into a precomputed document.
- Tune Postgres for your hardware: shared_buffers ~25% of RAM, work_mem per sort/hash, effective_cache_size set realistically.
Guardrails
- Timeouts everywhere: app query timeout <= 3000ms; cancel long-running queries.
- Circuit-breaker around Redis; prefer stale reads over failures for catalog pages.
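A minimal version of that circuit breaker around the cache; the threshold and cooldown are assumptions, and on an open circuit it returns the fallback (e.g. a stale catalog read) instead of hammering a failing Redis:

```javascript
// Failure-count circuit breaker sketch. After `threshold` consecutive
// failures the breaker opens for `cooldownMs`; while open, calls go
// straight to the fallback. `now` is injectable for testing.
function createBreaker({ threshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null;
  return async function call(fn, fallback) {
    if (openedAt !== null && now() - openedAt < cooldownMs) return fallback();
    try {
      const result = await fn();
      failures = 0;       // success closes the breaker
      openedAt = null;
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= threshold) openedAt = now(); // trip open
      return fallback();
    }
  };
}
```

After the cooldown the next call is allowed through (half-open); a success closes the breaker again.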
Expected outcomes
- API p95 -30–70%; DB CPU -40%; cache hit ratio 80–95%; infra spend -20–30%.
Microservices over REST: tame the network, not the team
When a request fans out to 6 services, you don’t optimize one handler — you control blast radius. The toolkit: timeouts, backpressure, retries, and bulkheads.
Measure
- Distributed traces (OpenTelemetry -> Tempo/Jaeger) and service mesh metrics (Istio/Envoy) for p95, error_rate, and retry storms.
Quick wins
- Set sane defaults in Envoy:

```yaml
route:
  timeout: 2s
  retry_policy:
    retry_on: 5xx,connect-failure,reset
    num_retries: 2
    per_try_timeout: 500ms
circuit_breakers:
  thresholds:
    max_connections: 1024
    max_pending_requests: 512
outlier_detection:
  consecutive_5xx: 5
  base_ejection_time: 30s
```

- Stop synchronous chains for non-critical work; enqueue it to a queue instead.
Deeper fixes
- Introduce a BFF to reduce client fan-out; cache GETs at the edge with ETag/max-age.
- Apply bulkheads: separate pools for third-party calls. Never let a flaky payment API starve product detail requests.
- Use canaries + SLO guards to prevent a bad deploy from burning the error budget in minutes.
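When a caller is not behind the mesh, the same retry policy belongs in application code. A sketch using AWS-style full jitter; the retry budget mirrors the Envoy example, while the function names and defaults are mine:

```javascript
// Full jitter: wait a random time in [0, base * 2^attempt] milliseconds,
// so synchronized clients don't retry in lockstep (no retry storms).
function fullJitter(baseMs, attempt, random = Math.random) {
  return Math.floor(random() * baseMs * 2 ** attempt);
}

// Retry wrapper: bounded retries with jittered backoff between tries.
async function retry(fn, { retries = 2, baseMs = 100, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((r) => setTimeout(r, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // budget exhausted: surface the error
      await wait(fullJitter(baseMs, attempt));
    }
  }
}
```

Combine this with a per-try timeout on `fn` itself; retrying an unbounded call just multiplies the damage.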
Expected outcomes
- p95 -20–50% on user flows; MTTR -30–60% thanks to fewer retry storms; fewer brownouts under peak.
Event-driven with Kafka: throughput without surprise lag
Kafka fixes fan-out costs, then adds its own footguns. The playbook is about producer batching, consumer backpressure, and observability of lag.
Measure
- Track consumer_lag, end-to-end time from event to user-visible effect, and DLQ rate.
Quick wins
- Producer config:

```properties
linger.ms=10
batch.size=131072
compression.type=zstd
acks=all
```

- Consumers: set max.poll.records and process in batches; checkpoint on success.
Deeper fixes
- Increase partitions to scale, but align to consumer concurrency and keying semantics.
- For read models, switch hot-path projections to incremental updates rather than full recomputes.
- Apply backpressure: pause consumption when downstream latency spikes; expose it in metrics.
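The pause/resume decision is worth making explicit, with hysteresis so the consumer does not flap around a single threshold. The watermark values here are assumptions; the commented calls mark where your Kafka client's pause/resume would go:

```javascript
// Backpressure gate: pause consumption when downstream p95 latency exceeds
// a high watermark, resume only once it drops below a lower one.
function backpressure({ pauseAboveMs = 500, resumeBelowMs = 200 } = {}) {
  let paused = false;
  return function onLatencySample(p95Ms) {
    if (!paused && p95Ms > pauseAboveMs) {
      paused = true;   // here: consumer.pause(); also export a gauge metric
    } else if (paused && p95Ms < resumeBelowMs) {
      paused = false;  // here: consumer.resume()
    }
    return paused;
  };
}
```

Exporting the paused state as a metric is the "expose it in metrics" half of the playbook; a silently paused consumer looks exactly like an outage.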
Guardrails
- DLQ with retention and alerts; replay tested via a staging topic.
- Idempotency keys at sinks to tolerate retries.
Expected outcomes
- End-to-end “write-to-visible” p95 from 2.5s -> 800ms; backlog recovery 5x faster; fewer paging incidents during spikes.
Serverless APIs (Lambda/Cloud Functions): kill cold starts and thundering herds
Serverless is great until cold starts meet chatty DBs. You need to pre-warm, proxy DBs, and right-size memory.
Measure
- Split initDuration from duration in logs. Track p95 and error rate under load tests (k6).
Quick wins
- Enable Provisioned Concurrency or SnapStart (Java). Use Lambda Power Tuning to pick the memory setting with the best ms/$.
- Put RDS behind RDS Proxy; reuse TCP connections. For DynamoDB, lean on adaptive capacity and add DAX for read-heavy tables.
- Trim bundles and reuse connections (keep-alive); move secrets to env vars or cache them.
Deeper fixes
- Precompute expensive data into Redis/ElastiCache with TTLs; serve stale on timeout.
- Fan-out heavy work to Step Functions/SQS to avoid synchronous timeouts.
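"Serve stale on timeout" is just a race between the fresh read and a deadline. A sketch; the function names and timeout value are illustrative:

```javascript
// Race the fresh read against a deadline; if the deadline wins,
// serve the cached (possibly stale) value instead of erroring out.
async function staleOnTimeout(fetchFresh, staleValue, timeoutMs) {
  const deadline = new Promise((resolve) =>
    setTimeout(() => resolve({ timedOut: true }), timeoutMs));
  const fresh = fetchFresh().then((value) => ({ timedOut: false, value }));
  const winner = await Promise.race([fresh, deadline]);
  return winner.timedOut ? staleValue : winner.value;
}
```

If the slow read eventually completes, write it back to the cache so the next request gets fresh data without paying the latency.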
Expected outcomes
- Cold start p95 -60–90%; API p95 -30–50%; 20–40% cost reduction at same throughput.
LLM/AI inference: tokens per second is your throughput
LLM latency feels different: users notice TTFT and tokens/sec more than raw request latency. Your levers are batching, KV cache, quantization, and admission control.
Measure
- Track TTFT, TPOT (ms/token), throughput (tokens/sec), GPU utilization, and rejection rate.
Quick wins
- Use an inference server with batching and a paged KV cache (vLLM, Triton).
- Quantize to 4/8-bit (bitsandbytes/AWQ) if quality allows; enable FlashAttention.
- Cap context length; cache system prompts; stream tokens to improve perceived latency.
Deeper fixes
- Batch by length; tune max_batch_size and max_tokens to keep GPUs >80% utilized.
- Preload popular models; pin them to GPUs with sufficient VRAM to avoid swaps.
- Admission control: shed long prompts during load; queue with user feedback.
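Admission control can be as simple as a gate in front of the request queue. The limits below are illustrative, not recommendations:

```javascript
// Admit or shed a request before it reaches the inference queue:
// reject prompts that are too long, and shed load when the queue is full.
function admit(queueDepth, promptTokens, { maxQueue = 100, maxPromptTokens = 4096 } = {}) {
  if (promptTokens > maxPromptTokens) return { ok: false, reason: 'prompt_too_long' };
  if (queueDepth >= maxQueue) return { ok: false, reason: 'overloaded' };
  return { ok: true };
}
```

Return the reason to the client (429 with a retry hint, or a "shorten your prompt" message) so shed load degrades gracefully instead of silently.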
Config sketch

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9
```

Expected outcomes
- TTFT -40–70%; tokens/sec +2–5x; infra cost per 1k tokens -30–60% with quantization and batching.
Operationalize your playbooks so they stick
A playbook isn’t a Confluence page. It’s code, alerts, and dashboards that survive reorgs.
- Versioned playbooks in a repo: /playbooks/<pattern>/README.md with metrics, quick wins, configs, runbooks, and dashboard links.
- GitOps delivery: Terraform for infra, ArgoCD for manifests; PRs update timeouts, autoscaling, or CDN rules.
- SLOs and error budgets: define p95/TTFT SLOs; alerts fire when burn rate > 2x budget.
- CI gates: k6 smoke tests and Lighthouse budgets block regressions.
- Dashboards: Grafana folders per playbook; “before vs after” panels and ROI overlays.
- Cadence: monthly load tests and quarterly GameDays. Archive learnings and update the playbooks.
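The 2x burn-rate alert translates directly into code. The SLO target and factor come from the playbook; the function names are mine, and a real setup would compute the error rate over a sliding window:

```javascript
// Burn rate: observed error rate divided by the budgeted error rate.
// At exactly the SLO target the burn rate is 1.0; alert above `factor`.
function burnRate(errorRate, sloTarget) {
  const budget = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return errorRate / budget;
}

function shouldAlert(errorRate, sloTarget, factor = 2) {
  return burnRate(errorRate, sloTarget) > factor;
}
```

In practice you alert on multiple windows (e.g. fast burn over 5m and slow burn over 1h) so a brief spike and a slow leak both page before the budget is gone.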
When GitPlumbers runs these with clients, we usually see: fewer firefights, clearer ROI on infra spend, and product managers asking when they can “buy another 500ms.” That’s when you know the playbooks are doing their job.
Key takeaways
- Tie every optimization to a user-facing metric (LCP, TTI, p95, TTFT) and a business KPI (conversion, churn, AOV).
- Create architecture-specific playbooks with quick wins, deeper fixes, and guardrails; version them like code.
- Measure with synthetic and RUM; enforce with SLOs and error budgets; prove impact with A/B or holdout tests.
- Automate rollout via GitOps (ArgoCD/Terraform), and memorialize learnings in runbooks and dashboards.
- Expect diminishing returns; chase the biggest deltas first (network, cache, query shape) before exotic tuning.
Implementation checklist
- Map each service/page to user-facing metrics (LCP, TTI, p95, Apdex) and business KPIs.
- Baseline with Lighthouse, k6, production traces (OpenTelemetry), and RUM.
- Prioritize quick wins: CDN, caching, timeouts, indexes, image formats, compression.
- Set SLOs and budgets; wire alerts to user-impacting thresholds, not CPU graphs.
- Automate changes with IaC and GitOps; ship with canaries and feature flags.
- Prove value with A/B or holdouts and publish before/after dashboards.
- Schedule regular load/regression tests and GameDays to keep playbooks fresh.
Questions we hear from teams
- How do I tie performance to revenue credibly?
- Use holdouts or A/B. For the impacted pages or APIs, track both performance (LCP, p95) and business KPIs (conversion, revenue per session). Compare deltas between test/control. Share a single dashboard that shows latency down, conversion up, with confidence intervals.
- What if my biggest bottleneck is a third-party API?
- Bulkhead it with separate connection pools, strict timeouts, retries with jitter, and a circuit breaker. Cache idempotent responses and design graceful degradation (placeholders, queued actions). Negotiate rate and latency SLOs with the vendor and monitor them as if they were your own service.
- We’re already on a service mesh. Isn’t that enough?
- Meshes give you the knobs; the playbook tells you where to set them and how to verify the business impact. You still need sane timeouts, retry budgets, SLOs, and canaries wired into rollout policy.
- How often should we run load tests?
- At minimum, before major launches and monthly thereafter. Automate a short k6 smoke in CI, plus a heavier off-peak run that mimics traffic mix. Re-baseline after big dependency upgrades (framework, runtime, DB).
- Can we standardize these across teams without becoming a platform bottleneck?
- Yes. Ship opinionated defaults as reusable Terraform/Helm modules and mesh policies, with escape hatches. Guard the SLOs, not the implementation details. Empower teams to deviate only with data.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
