The Playbooks That Actually Move the Needle: Performance Recipes for Monoliths, Microservices, and Serverless
You don’t need another “optimize your queries” blog. You need battle-tested playbooks tied to user metrics and revenue. Here’s what actually works—and how to measure it.
“Performance is a product feature. If you can’t tie it to LCP, p95, and conversion, it’s just an expensive hobby.”
The problem you’ve actually got
If you’ve ever stared at a Grafana dashboard at 2 a.m. wondering why checkout p95 went vertical while Kubernetes looked “green,” you know the feeling. I’ve watched teams spend quarters shaving CPU only to learn their real issue was LCP on mobile and a chatty API gateway. Performance is a product problem first. If it doesn’t move user metrics—p95/p99 latency, Core Web Vitals (LCP/INP), error rates, and ultimately conversion and retention—it’s just busywork.
At GitPlumbers, we ship performance playbooks that start with user journeys and work backward. Below are the recipes we deploy for common architectures. Each one includes the levers that actually work, the configs that matter, and the metrics we use to prove it.
Playbook: Monolith on a single DB (Rails/Django/Spring + Postgres)
You don’t need a microservices rewrite to get real wins. Most monoliths I see are one `pg_stat_statements` away from a 3–5x latency improvement.
- Target metrics
  - API p95 < 300 ms, error rate < 0.5%, DB CPU < 70%, cache hit rate > 85%
  - Business: +2–5% conversion, -15–25% infra cost (fewer wasteful DB calls)
- Steps that work
  - Turn on query insight and kill the top offenders:
    - Enable `pg_stat_statements` and the slow query log:

      ```sql
      CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
      ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
      ALTER SYSTEM SET log_min_duration_statement = 200;

      SELECT query, mean_exec_time, calls
      FROM pg_stat_statements
      ORDER BY mean_exec_time DESC
      LIMIT 10;
      ```

    - Add missing composite indexes and fix N+1s. In Rails, add `bullet` and fail CI on N+1s.
  - Cache what’s stable:
    - Layer `Redis` for 5–30s TTL caches on read-heavy endpoints. Measure hit rate in `Prometheus` (see the alert-rule sketch after this list).
    - Use `ETag`/`Last-Modified` and CDN edge caching for catalog-like pages.
  - Right-size concurrency:
    - `puma` workers = CPU cores, threads = 5–8; the DB pool must be ≥ total threads.
    - Cap the `ActiveRecord` pool to avoid DB thrash. Use `pgbouncer` in transaction mode.
  - I/O wins:
    - Gzip/Brotli and `Cache-Control: public, max-age=600, stale-while-revalidate=60` at `nginx`.
    - Ship HTTP/2; kill server-side template bloat.
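To keep those targets honest, alert on them instead of eyeballing dashboards. Below is a minimal Prometheus rule sketch for the API p95 and cache hit-rate targets above; it assumes your app exports an `http_request_duration_seconds` histogram and `cache_hits_total`/`cache_misses_total` counters, so swap in whatever names your exporters actually emit.

```yaml
# Metric names are assumptions; adjust to what your exporters emit.
groups:
  - name: monolith-performance
    rules:
      - alert: ApiP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)) > 0.3
        for: 10m
        annotations:
          summary: "API p95 above 300 ms for 10 minutes"
      - alert: CacheHitRateLow
        expr: |
          sum(rate(cache_hits_total[10m]))
            / (sum(rate(cache_hits_total[10m])) + sum(rate(cache_misses_total[10m]))) < 0.85
        for: 15m
        annotations:
          summary: "Redis cache hit rate below 85%"
```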
- Measurable outcome
  - Typical result: API p95 from 1.8s → 320ms, Postgres CPU -30%, checkout conversion +3.1% in 3 weeks.
If you’re still fighting the DB at p50, don’t talk sharding. Fix the 10 queries burning 80% of your time first.
Playbook: Microservices behind an API gateway (Envoy/Kong/Istio)
When microservices go slow, it’s almost always death-by-chattiness and missing budgets at the edge. Put the brakes and backoffs at the gateway, not buried in service #14.
- Target metrics
  - Gateway p95 < 200 ms, p99 < 500 ms, <1% 5xx; per-service SLOs with error budgets
  - Business: +1–3% conversion, -20–40% MTTR via predictable failure modes
- Controls that matter
  - Timeouts, retries, and circuit breakers at the gateway:

    ```yaml
    # Envoy example
    route_config:
      virtual_hosts:
        - name: api
          routes:
            - match: { prefix: "/checkout" }
              route:
                timeout: 1.5s
                retry_policy:
                  retry_on: 5xx,reset,connect-failure
                  num_retries: 2
                  per_try_timeout: 300ms
                max_stream_duration: { max_stream_duration: 3s }
    cluster_manager:
      clusters:
        - name: payments
          circuit_breakers:
            thresholds:
              - priority: DEFAULT
                max_connections: 2000
                max_requests: 5000
                max_pending_requests: 1000
                max_retries: 3
    ```

  - Bulkheads: run high-risk calls in isolated pools/queues; throttle at the gateway.
  - Collapse chatty fan-outs: aggregate read paths behind a dedicated `read-api` service.
  - Async the non-critical: use `SQS`/`Kafka` for receipts, emails, fraud signals; return 202 with idempotency keys.
  - Enforce budgets with `Istio` `VirtualService` timeouts and `DestinationRule` outlier detection.
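For services behind `Istio`, the same budgets live in a `VirtualService` and a `DestinationRule`. A minimal sketch, assuming a `payments` service in a `prod` namespace (names and thresholds are illustrative, not a drop-in config):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
  namespace: prod
spec:
  hosts:
    - payments.prod.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments.prod.svc.cluster.local
      timeout: 1.5s            # hard budget for the whole call
      retries:
        attempts: 2
        perTryTimeout: 300ms
        retryOn: 5xx,reset,connect-failure
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
  namespace: prod
spec:
  host: payments.prod.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1000
        http2MaxRequests: 5000
    outlierDetection:          # eject misbehaving pods from the load-balancing pool
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
```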
- Release safety
  - Canary with `Argo Rollouts` based on real SLOs (p95 and error rate). Auto-rollback on budget burn.
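A sketch of how that wiring usually looks: a `Rollout` with canary steps that runs an analysis against your SLO queries. It assumes an `api` workload and an `AnalysisTemplate` named `api-slo` (see the release-gates example further down); the image and step timings are illustrative.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: registry.example.com/api:latest   # illustrative image
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }    # let SLO metrics accumulate before widening
        - setWeight: 50
        - pause: { duration: 10m }
      analysis:                      # fail the rollout (and roll back) if the analysis fails
        templates:
          - templateName: api-slo
        startingStep: 1
```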
- Measurable outcome
  - In a recent cleanup: gateway p95 650ms → 180ms, p99 -60%, checkout 5xx -70%, MTTR -35%, conversion +1.8%.
Playbook: Event-driven pipelines (Kafka/Flink/Kinesis)
Throughput is pointless if your end-to-end latency blows your SLA. Most “Kafka is slow” pages are just bad partitioning and consumer configs.
- Target metrics
  - End-to-end latency p95 < 2s, consumer lag ~0 at steady state, checkpoint < 1s
  - Business: faster fraud decisions, stale data complaints down, support tickets -20%
- What to tune
  - Partitioning: key by a high-cardinality id to avoid hot partitions; aim for `partitions ≈ max consumer concurrency × 2`.
  - Batching: increase `fetch.max.bytes` and `max.poll.records` for throughput; cap them to meet the latency SLO.
  - Compression: `lz4` or `zstd` for network-bound topics.
  - Backpressure: expose consumer lag; autoscale consumers on lag with `KEDA` (see the sketch after this list).
  - Flink resiliency: checkpoints every 5–10s, incremental, `rocksdb` state backend when state > RAM.
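Autoscaling on lag is a single object with `KEDA`. A minimal sketch, assuming the consumers run as an `orders-etl` Deployment and the broker, group, and topic names below (all illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-etl
spec:
  scaleTargetRef:
    name: orders-etl          # Deployment running the consumers
  minReplicaCount: 2
  maxReplicaCount: 24         # keep ≤ partition count or the extra pods sit idle
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: orders-etl
        topic: orders
        lagThreshold: "5000"  # target average lag per replica before adding more
```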
- Useful commands
  - Lag: `kafka-consumer-groups --bootstrap-server $BROKER --group orders-etl --describe`
  - Kinesis Enhanced Fan-Out to isolate hot consumers; use `on-demand` for bursty traffic.
- Measurable outcome
  - After repartition + KEDA autoscale: p95 E2E 4.6s → 1.3s, lag spikes gone, alert volume -50%.
Playbook: Serverless web/API (AWS Lambda + API Gateway + DynamoDB)
Cold starts and over-slim functions kill tail latency. Memory buys CPU in Lambda; spend it wisely.
- Target metrics
  - p95 < 250 ms, p99 < 600 ms, init duration < 100 ms on hot paths
  - Business: +1–2% checkout completion, lower abandonment on mobile
- Concrete steps
  - Kill cold starts on critical routes with `Provisioned Concurrency` (see the template sketch after this list):

    ```bash
    aws lambda put-provisioned-concurrency-config \
      --function-name checkout-handler \
      --qualifier live \
      --provisioned-concurrent-executions 50
    ```

  - Right-size memory to lower CPU-bound latency; test 256 → 1024 → 1536 MB and pick the cheapest p95.
  - Keep packages lean: bundle only the deps you actually import; lazy-load SDKs. Avoid VPC attachments unless you need RDS/ElastiCache.
  - Use `API Gateway` `Latency` and `IntegrationLatency` to spot backend vs proxy time.
  - DynamoDB: define `PK/SK` access patterns up front, enable `DAX` for read-heavy tables, and lean on `Adaptive Capacity` for hotspots.
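If the function lives in an AWS SAM template rather than being configured by hand, put the same settings there so they survive redeploys. A minimal sketch; the handler path, runtime, and numbers are assumptions to adapt:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Resources:
  CheckoutHandler:
    Type: AWS::Serverless::Function
    Properties:
      Handler: src/checkout.handler    # illustrative handler path
      Runtime: nodejs20.x
      MemorySize: 1024                 # more memory buys more CPU in Lambda
      AutoPublishAlias: live           # publishes a version and points the "live" alias at it
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 50
```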
- Measurable outcome
  - With provisioned concurrency of 100 and 1024 MB of memory: p95 780ms → 210ms, p99 2.1s → 520ms. Cost rose 9%, but the +1.4% conversion lift more than covered it, which the CFO liked.
Playbook: SPA + Edge (Next.js/React + Cloudflare/Fastly)
If LCP is trash on 4G, nothing else matters. You don’t win Core Web Vitals in your Kubernetes cluster.
- Target metrics
  - LCP < 2.5s, INP < 200ms, TTFB < 200ms on mobile; >90 Lighthouse on product pages
  - Business: +3–8% mobile conversion, SEO lift, lower ad CAC
- What moves the needle
  - Server-render critical paths (`Next.js` `app` router), stream HTML, hydrate only where needed.
  - `next/image` with AVIF/WebP, responsive sizes, and `priority` on the hero.
  - Split bundles aggressively; `React.lazy` below the fold.
  - Edge-cache HTML for anonymous users with `stale-while-revalidate`; cache APIs at the CDN when safe.
  - Preconnect and preload key fonts; inline critical CSS ≤ 14KB.
- Example headers
  - `Cache-Control: public, max-age=600, s-maxage=600, stale-while-revalidate=60`
- Measure and enforce
  - `Lighthouse CI` in PRs; `WebPageTest` for mobile; RUM for real users via the `Next.js` `web-vitals` hook (see the config sketch after this list).
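To make the budget a merge gate rather than a dashboard, `Lighthouse CI` can fail a PR when LCP or the performance score regresses. A minimal `.lighthouserc.yml` sketch, assuming a reachable staging URL (illustrative):

```yaml
ci:
  collect:
    url:
      - https://staging.example.com/product/sample   # illustrative staging URL
    numberOfRuns: 3      # Lighthouse emulates mobile by default
  assert:
    assertions:
      largest-contentful-paint:
        - error
        - maxNumericValue: 2500   # ms, matches the LCP < 2.5s target
      "categories:performance":
        - error
        - minScore: 0.9
  upload:
    target: temporary-public-storage
```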
- Measurable outcome
  - After edge caching + image optimization: LCP p75 3.4s → 1.9s, bounce -6%, add-to-cart +4.2%.
Playbook: LLM/Vector retrieval features (FAISS/Pinecone/Qdrant)
AI features don’t get a pass on latency. A chat assistant with 2s response is fine; 7s feels broken. Most time is in retrieval and token streaming.
- Target metrics
  - Retrieval p95 < 200ms, first-token < 700ms, answer < 2.5s for short prompts
  - Business: higher task completion, lower abandonment in support flows
- What to control
  - ANN index tuned for the recall-latency tradeoff: `HNSW` with `M=32`, `ef_search=128` as a starting point; profile.
  - Cache frequent queries and embeddings in `Redis` with 5–60s TTL; dedupe by `fingerprint(prompt, user_ctx)`.
  - Batch embeddings; reuse normalized vectors.
  - Stream tokens early; use shorter system prompts and tools over longer context.
- Example Qdrant config

  ```yaml
  hnsw:
    m: 32
    ef_construct: 128
  quantization_config:
    scalar:
      type: int8
      always_ram: true
  ```
- Measurable outcome
  - Tuning `ef_search` and caching the top 5% of queries: retrieval p95 420ms → 160ms, first-token 1.4s → 650ms, self-serve deflection +9%.
Operational discipline: make it stick with SLOs and GitOps
Without guardrails, performance rots. Bake these playbooks into your delivery system.
- SLOs by surface
  - Public API: `p95 < 300ms`, error rate `< 0.5%`
  - Web: LCP p75 `< 2.5s`, INP `< 200ms`
  - Stream: E2E p95 `< 2s`
- Instrumentation
  - `OpenTelemetry` traces to `Jaeger`/`Tempo`; RED metrics in `Prometheus`; RUM for web vitals.
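The usual plumbing is a single OpenTelemetry Collector fronting those backends. A minimal collector config sketch, assuming apps export OTLP and a Tempo (or Jaeger-with-OTLP) endpoint at `tempo:4317`; endpoints are illustrative:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
exporters:
  otlp:
    endpoint: "tempo:4317"      # illustrative OTLP trace backend
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"    # scraped by Prometheus for RED metrics
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```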
- Release gates
  - `Argo Rollouts` canary with PromQL checks:

    ```yaml
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: api-slo                          # referenced from the Rollout's canary analysis
    spec:
      metrics:
        - name: api-latency-p95
          interval: 30s
          successCondition: result[0] < 0.3
          provider:
            prometheus:
              address: http://prometheus:9090   # adjust to your Prometheus service
              query: |
                histogram_quantile(0.95,
                  sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le))
    ```
- Capacity
  - K8s `HPA` on `RPS`/`CPU` for stateless, `VPA` for memory-heavy; `KEDA` on Kafka lag; Lambda `Provisioned Concurrency` for hot paths.
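For the common case of a stateless API scaled on CPU, the HPA is a few lines. A minimal sketch, assuming an `api` Deployment; scaling on RPS instead needs a custom-metrics adapter (e.g., the Prometheus adapter):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 4
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # leave headroom so p95 doesn't climb before scale-out
```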
- Reporting
  - Monthly: show p95, error rate, infra $, and conversion/retention deltas. Finance cares about the last two.
This is where GitPlumbers usually enters: we codify the playbooks, wire SLOs into rollout controllers, and leave you with dashboards leadership can actually use.
Key takeaways
- Performance work must be driven by user-facing metrics like p95 latency, LCP, and error budgets—not vanity infra graphs.
- Each architecture has a small set of high-ROI levers; focus on those and instrument them well.
- Codify your playbooks, wire them into GitOps, and gate releases on SLOs to avoid regression creep.
- Measure business impact alongside tech metrics; conversion and retention improvements make performance spend obvious to finance.
Implementation checklist
- Define SLOs by surface: API p95, web LCP/TTI, streaming end-to-end latency.
- Instrument tracing (`OpenTelemetry`), metrics (`Prometheus`), and web vitals (`Lighthouse`, `WebPageTest`).
- Pick one architecture playbook and run it top-to-bottom in 2 weeks; don’t boil the ocean.
- Automate: alerts on SLO burn, rollback via `Argo Rollouts`, and capacity via HPA/VPA or Provisioned Concurrency.
- Report both tech and business deltas: p95, errors, infra $, conversion, churn, and NPS.
Questions we hear from teams
- How do I pick which playbook to run first?
  - Start where user pain and dollar impact intersect. If you have a slow web journey (LCP > 2.5s), run the SPA + Edge playbook. If backend p95 spikes during load, do gateway + microservices. Tie the effort to a single SLO and a single KPI (e.g., checkout conversion) for 2 weeks.
- What if our infra team can’t support all these tools?
  - You don’t need everything on day one. Start with `OpenTelemetry` traces, `Prometheus` RED metrics, and `Lighthouse`. GitOps the configs you touch (gateway, HPA, rollouts). Add more once SLOs stabilize.
- How do we prevent performance regressions after we fix them?
  - Gate releases on SLOs with `Argo Rollouts` or `Flagger`, add perf budgets in CI (Lighthouse CI, k6 smoke), and alert on error budget burn rates. Make regressions visible in standups with a tiny scorecard.
- Will these changes blow up our cloud bill?
  - Usually the opposite. Killing N+1s, adding caches, and tuning timeouts reduce waste. If you add capacity (e.g., Provisioned Concurrency), measure business lift. We’ve seen +1–4% conversion dwarf single-digit percent cost increases.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.