Stop Guessing: Performance Playbooks That Actually Move User Metrics
Seven battle-tested optimization playbooks for the architectures you actually run—mapped to user-facing KPIs, rollout steps, and measurable impact.
“If you can’t see it in Prometheus or your RUM, it doesn’t exist.”
The performance advice nobody wants to hear
I’ve watched teams spend quarters debating database vendors while their p95 checkout latency sat north of 1.2s. Then we shipped a two-day caching change and conversion bumped 3.4% the same week. If you’ve been burned by “move to microservices” or “just add a CDN,” you already know: ship playbooks, not slogans.
What follows are the playbooks we use at GitPlumbers when a system is slow in the ways you’ve actually seen in prod. Each playbook maps symptoms → diagnostics → concrete fixes → safe rollout. And every step ties to user-facing metrics (p95, LCP, error rate) and business impact (conversion, retention, revenue per session).
If you can’t see it in Prometheus or your RUM tool, it doesn’t exist. Baseline first, then change one thing at a time.
Playbook 1: Monolith + Postgres bottleneck (e-commerce, SaaS dashboards)
When: p95 endpoint latency spikes under traffic; CPU is fine but DB avg_queue and locks climb; APM shows 70% time in ORM.
What users feel: slow product pages and cart updates; mobile LCP > 2.5s; checkout conversion drops 1–3%.
Steps:
- Baseline
  - Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/cart"}[5m])) by (le))
  - RUM: LCP and TTFB by route; split mobile vs desktop.
- Quick wins
  - Add response caching on read-heavy pages with Redis and ETag support (see the caching sketch after this list).
  - Kill ORM N+1 with explicit JOINs; add missing composite indexes.
- Medium lifts
  - Materialize expensive aggregates nightly; serve from Redis with TTL.
  - Add read replicas; route read traffic via a connection pooler (pgbouncer).
- Rollout
  - Feature flag (Unleash, LaunchDarkly) cache and query changes by cohort.
  - Canary 10% of traffic; abort on SLO breach (p95 +10% or error rate > 0.5%).
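A minimal sketch of the Redis + ETag quick win, assuming Node 18+, express, and ioredis; loadProductFromDb, the cache key, and the 60s TTL are placeholders for your own query and tuning:
import express from 'express';
import Redis from 'ioredis';
import { createHash } from 'node:crypto';

const app = express();
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Placeholder for your existing ORM/SQL call.
async function loadProductFromDb(id: string) {
  return { id, name: 'example' };
}

app.get('/products/:id', async (req, res) => {
  const key = `product:${req.params.id}`;
  let body = await redis.get(key);
  if (!body) {
    body = JSON.stringify(await loadProductFromDb(req.params.id));
    await redis.set(key, body, 'EX', 60); // 60s TTL; tune per route
  }
  // ETag lets repeat visitors get a cheap 304 instead of the full payload.
  const etag = `"${createHash('sha1').update(body).digest('hex')}"`;
  if (req.headers['if-none-match'] === etag) return res.status(304).end();
  res.set('ETag', etag).type('application/json').send(body);
});

app.listen(3000);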
Postgres index example:
-- hot query filter: (tenant_id, updated_at)
CREATE INDEX CONCURRENTLY idx_orders_tenant_updated ON orders (tenant_id, updated_at DESC) WHERE status = 'OPEN';
ANALYZE orders;
NGINX microcache for product pages:
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=PRODUCTS:100m max_size=1g inactive=10m use_temp_path=off;
server {
  location /products/ {
    proxy_cache PRODUCTS;
    proxy_cache_key "$scheme$proxy_host$request_uri";
    proxy_cache_valid 200 302 5m;
    proxy_ignore_headers Set-Cookie;
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://app;
  }
}
Expected outcome:
- p95 -30–50% on read-heavy routes in 1–2 weeks.
- LCP -300–600ms on product pages.
- Conversion +1–3% (seen at a DTC apparel client; same basket, same ad spend).
Playbook 2: Microservices REST sprawl (N+1, chatty calls, noisy neighbors)
When: trace waterfall shows 6–12 downstream calls per request; p99 tail latency ugly; intermittent 5xx during deploys.
What users feel: occasional page hangs; “Save” spins; API timeouts on mobile networks.
Steps:
- Baseline
  - Trace fan-out count (OpenTelemetry): avg spans/request and worst offenders.
  - SLOs: set route-level p95 and an error budget per month.
- Quick wins
  - Collapse N+1: backend for frontend (BFF) or aggregator service (see the sketch after this list).
  - Add client- and service-side timeouts and a circuit breaker.
- Medium lifts
  - Introduce async for non-critical writes (enqueue to Kafka or SQS).
  - Apply Istio retries with budgets, not infinite retries.
- Rollout
  - Canary deployment with Istio and ArgoCD under GitOps; guard with error budget burn alerts.
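A minimal BFF aggregation sketch, assuming Node 18+ (global fetch and AbortSignal.timeout) and express; the internal hostnames, routes, and the 800ms per-call budget are illustrative:
import express from 'express';

const app = express();

// Each upstream call gets its own deadline instead of hanging the whole request.
const get = (url: string) =>
  fetch(url, { signal: AbortSignal.timeout(800) }).then((r) => {
    if (!r.ok) throw new Error(`upstream ${r.status}`);
    return r.json();
  });

app.get('/bff/account/:id', async (req, res) => {
  try {
    // One client round trip; fan out in parallel instead of a serial N+1 chain.
    const [user, orders, prefs] = await Promise.all([
      get(`http://users.internal/users/${req.params.id}`),
      get(`http://orders.internal/orders?userId=${req.params.id}`),
      get(`http://prefs.internal/prefs/${req.params.id}`),
    ]);
    res.json({ user, orders, prefs });
  } catch {
    res.status(502).json({ error: 'upstream unavailable' });
  }
});

app.listen(3000);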
Istio circuit breaker + outlier detection:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: users-dr
spec:
  host: users.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Node.js opossum circuit breaker on a chatty client:
import CircuitBreaker from 'opossum';
import fetch from 'node-fetch';

const breaker = new CircuitBreaker((url: string) => fetch(url, { timeout: 800 }), {
  timeout: 900,
  errorThresholdPercentage: 50,
  resetTimeout: 5000
});
const res = await breaker.fire('https://svc/users?id=123');
Expected outcome:
- p95 per route -25–40%; p99 tail tamed.
- MTTR -30% from fewer cascading failures.
- Error rate -0.3–0.8 pts when deploys happen during traffic.
Playbook 3: Event-driven pipelines (Kafka) and back-pressure
When: consumer lag climbs during spikes; retries cause duplicate work; batch jobs starve interactive traffic.
What users feel: delayed notifications, “ghost orders,” or analytics dashboards hours behind.
Steps:
- Baseline
  - Track consumer lag and the end-to-end latency histogram.
  - Instrument DLQ volume; alert on spikes.
- Quick wins
  - Right-size partitions to target per-consumer throughput; cap max.poll.interval.ms.
  - Use idempotent consumers with natural keys; dedupe at sink (see the consumer sketch after this list).
- Medium lifts
  - Apply back-pressure via rate limits; bulkhead long-running batch consumers.
  - Split topics: hot-path events vs batch/analytics.
- Rollout
  - Canary new consumers on 10% of partitions; promote if DLQ stays steady and p95 E2E latency improves.
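A dedupe-at-sink consumer sketch, assuming kafkajs 2.x and node-postgres; the topic, group, table, and field names are illustrative, and the upsert mirrors the idempotency example further down:
import { Kafka } from 'kafkajs';
import { Pool } from 'pg';

const kafka = new Kafka({ clientId: 'invoice-sink', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'invoice-sink' });
const db = new Pool(); // reads PG* env vars

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topics: ['invoices'] });
  await consumer.run({
    autoCommit: false, // commit only after the write lands
    eachMessage: async ({ topic, partition, message }) => {
      const evt = JSON.parse(message.value!.toString());
      // Upsert on the natural key so redeliveries and retries are harmless.
      await db.query(
        `INSERT INTO invoice (id, amount_cents, status)
         VALUES ($1, $2, $3)
         ON CONFLICT (id) DO UPDATE SET amount_cents = EXCLUDED.amount_cents, status = EXCLUDED.status`,
        [evt.id, evt.amountCents, evt.status]
      );
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    },
  });
}

run().catch(console.error);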
Kafka consumer tuning (Java):
max.poll.records=500
fetch.min.bytes=1048576
fetch.max.wait.ms=200
enable.auto.commit=false
max.poll.interval.ms=300000
session.timeout.ms=10000
Idempotency example (Postgres upsert):
INSERT INTO invoice (id, amount_cents, status)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO UPDATE SET amount_cents = EXCLUDED.amount_cents, status = EXCLUDED.status;
Expected outcome:
- E2E p95 for notifications -40–60% under peak.
- DLQ volume -70% after idempotency and partitioning.
- Support 2–3x traffic spikes without manual babysitting.
Playbook 4: Edge/API Gateway under bursty traffic
When: TTFB spikes during promos; origin CPU ok but 5xx from gateway; TLS handshakes burn time; cold starts for auth.
What users feel: landing page slow; login spikes time out.
Steps:
- Baseline
  - RUM: TTFB by geo; LCP on landing pages.
  - CDN/gateway logs: cache hit ratio, origin fetch errors.
- Quick wins
  - Turn on CDN caching and stale-while-revalidate for static and semi-static JSON.
  - Pre-warm TLS and keep-alive; compress with brotli.
- Medium lifts
  - Move auth/session checks to edge workers; cache JWKS for 5–10 minutes (see the sketch after this list).
  - Rate limit abusive IPs; token bucket per route.
- Rollout
  - Canary CDN rules in one region; compare TTFB/LCP and error rate before going global.
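A JWKS-caching sketch for a Cloudflare Worker, assuming the Workers Cache API (caches.default); the IdP URL and the 10-minute max-age are placeholders:
// Hypothetical JWKS endpoint for your identity provider.
const JWKS_URL = 'https://auth.example.com/.well-known/jwks.json';

async function getJwks(): Promise<unknown> {
  const cacheKey = new Request(JWKS_URL);
  let res = await caches.default.match(cacheKey);
  if (!res) {
    const upstream = await fetch(JWKS_URL);
    res = new Response(upstream.body, upstream);
    res.headers.set('Cache-Control', 'public, max-age=600'); // ~10 minutes
    // Keep the keys at the edge so token checks skip the IdP round trip.
    await caches.default.put(cacheKey, res.clone());
  }
  return res.json();
}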
Cloudflare edge cache rule (Workers):
export default {
  async fetch(req, env, ctx) {
    const cacheKey = new Request(new URL(req.url).toString(), req);
    let res = await caches.default.match(cacheKey);
    if (!res) {
      const origin = await fetch(req);
      res = new Response(origin.body, origin);
      res.headers.set('Cache-Control', 'public, max-age=60, stale-while-revalidate=600');
      // Write the rewritten response to the edge cache so the next request hits it.
      ctx.waitUntil(caches.default.put(cacheKey, res.clone()));
    }
    return res;
  }
}
Expected outcome:
- TTFB -100–300ms globally; LCP -200–500ms on landing pages.
- 20–40% origin offload; infra cost -10–20% during campaigns.
- Fewer auth cold starts; login success +1–2 pts under load.
Playbook 5: AI inference in the critical path (RAG, moderation, re-rank)
When: A feature depends on LLM responses; p95 2–4s; costs spike; “vibe coding” shipped naive sync calls; hallucinations leak to users.
What users feel: slow typeahead and uneven quality; bounced sessions.
Steps:
- Baseline
  - Track p95 response for AI-backed endpoints; user-visible timeout at 1.5–2s.
  - Measure answer quality with guardrails; count interventions.
- Quick wins
  - Cache embeddings and rerank results; stream tokens to improve perceived latency.
  - Use a smaller model for the first pass; upgrade to a bigger model behind the scenes if needed (see the tiering sketch after this list).
- Medium lifts
  - Switch to async + optimistic UI; precompute embeddings; apply a circuit breaker to the provider.
  - Add RAG chunking and retrieval limits; clamp temperature to cut variance.
- Rollout
  - Feature flag by cohort; A/B measure session length, CTR, and support tickets.
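A cache-then-tier sketch, assuming ioredis and a hypothetical callModel wrapper over your provider SDK; the model names, 0.8 confidence threshold, and TTLs are placeholders:
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Placeholder for your provider SDK call; returns text plus a rough quality score.
async function callModel(model: string, prompt: string, timeoutMs: number) {
  return { text: '', confidence: 1 };
}

export async function answer(query: string): Promise<string> {
  const key = `rag:${query}`;
  const cached = await redis.get(key);
  if (cached) return cached; // cache hit: no model call, no cost

  // First pass on a small, fast model with a hard deadline.
  const fast = await callModel('small-model', query, 1200);
  if (fast.confidence >= 0.8) {
    await redis.set(key, fast.text, 'EX', 300);
    return fast.text;
  }

  // Escalate to the bigger model only when the cheap pass isn't good enough.
  const better = await callModel('large-model', query, 4000);
  await redis.set(key, better.text, 'EX', 300);
  return better.text;
}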
Provider circuit breaker with resilience4j (Java):
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .slowCallDurationThreshold(Duration.ofMillis(1200))
    .slowCallRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .build();
Simple streaming response (Node):
const resp = await fetch('/api/rag?q=' + encodeURIComponent(q));
for await (const chunk of resp.body) {
  streamToUI(chunk.toString()); // perceived latency drops ~800ms
}
Expected outcome:
- Perceived latency -600–1200ms with streaming and cache.
- Cost -20–40% by model tiering and caching.
- Hallucinations down with guardrails; support tickets -10–20%.
Note: If AI-generated code snuck into the hot path, do a vibe code cleanup. We’ve done code rescue on several teams where “vibe coding” added blocking calls in render. Fixes: async boundaries, bulk prefetch, back-off. GitPlumbers runs short AI code refactoring engagements for this exact mess.
Make improvements stick: SLOs, canaries, and GitOps
Here’s the part that separates a one-off hero sprint from sustainable performance.
- Set SLOs and error budgets per user journey. Example: checkout p95 < 600ms, error rate < 0.5%. Tie to conversion and NPS.
- Use canary + auto-revert. The Istio/ArgoCD example below flips back if SLOs breach.
- Automate rollout via GitOps so configs don’t drift. Terraform for infra, ArgoCD for deploys.
- Make Prometheus dashboards and alert rules part of the PR.
ArgoCD canary with SLO guard:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 300}
        - setWeight: 50
        - pause: {duration: 300}
      analysis:
        templates:
          - templateName: slo-check
Prometheus SLO probe (pseudo):
checkout_p95 = histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le))
checkout_error_ratio = sum(rate(http_requests_total{route="/checkout",code=~"5.."}[5m])) / sum(rate(http_requests_total{route="/checkout"}[5m]))
alert if checkout_p95 > 0.6 or checkout_error_ratio > 0.005
Terraform HPA for the chatty service:
resource "kubernetes_horizontal_pod_autoscaler_v2" "svc" {
  metadata { name = "svc" }
  spec {
    min_replicas = 2
    max_replicas = 20
    scale_target_ref {
      kind        = "Deployment"
      name        = "svc"
      api_version = "apps/v1"
    }
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 60
        }
      }
    }
  }
}
Results you can take to the CFO:
- Predictable rollouts reduced MTTR by 30–50% on bad deploys.
- Error budget policy cut “YOLO Friday merges.” SRE practices stuck because they lived in Git, not Confluence.
How to run a 2-week performance sprint without burning the team
- Day 1–2: Baseline and pick one playbook.
  - Build the dashboard; capture p50/p95/p99, LCP, error rate. Snapshot conversion/retention.
- Day 3–5: Quick wins behind flags.
  - Cache, indexes, timeouts, circuit breakers. Ship small, monitored changes.
- Day 6–8: Medium lifts.
  - Service aggregation, edge cache, async jobs, partition tuning.
- Day 9–10: Canary + measure.
  - 10% → 50% → 100% rollout; watch SLOs and business KPIs.
- Day 11–12: Backlog and docs.
  - Write it into the playbook repo; add alert rules to CI.
- Day 13–14: Debrief with finance and product.
  - Show p95, LCP, conversion deltas; decide the next playbook.
I’ve seen this fail when performance sprints “win” benchmarks but don’t move user KPIs. If your LCP didn’t drop and conversion didn’t rise, you optimized the wrong thing. Kill it and move on.
What we’ve learned the hard way
- Don’t scale before you cache. Most teams are 50ms from a win with microcaching or Redis.
- Avoid infinite retries; they create instant DDoS against yourself.
- Profiles beat opinions. Run pprof, flamegraph, or your APM’s CPU profiler on the top endpoints before rewriting.
- The cheapest fix is usually a config change (NGINX, Postgres index, Istio policy), not a re-architecture.
- Tie every change to a user-visible KPI and an executive-visible outcome.
If you’ve inherited a legacy modernization mid-flight or a pile of AI-generated code that “mostly works,” get help. GitPlumbers exists to turn vibe code into maintainable systems that scale without surprising your pager.
Key takeaways
- Ship playbooks, not platitudes: each pattern has triggers, diagnostics, fixes, and rollout steps.
- Tie every optimization to user-facing metrics (p95, LCP, error rate) and business outcomes (conversion, retention).
- Use canary + SLO guardrails to roll out safely; kill changes that violate error budgets.
- Cache and queue first; then profile the hot path; only then scale infrastructure.
- Instrument before you optimize—if you can’t see it in Prometheus or your RUM, you can’t improve it.
- Automate enforcement with GitOps (ArgoCD) so your playbooks actually stick.
Implementation checklist
- Define SLOs for each user flow (p95, error rate) and tie to conversion/retention targets.
- Instrument RUM (LCP, TTFB) and backend tracing before changes.
- Choose the playbook matching your architecture and trigger symptoms.
- Run the 5-step rollout: baseline → feature flag → canary → observe SLOs → full rollout.
- Add post-incident learnings back into the playbook repo; version in Git; enforce via CI/CD.
Questions we hear from teams
- How do we tie technical wins to business outcomes?
- Define SLOs per user journey (p95, error rate) and track conversion/retention deltas for cohorts exposed via feature flags. If LCP drops 300ms on product pages and conversion rises 1–2%, that’s the story for the CFO.
- We’re mid-migration to microservices. Should we pause for performance?
- Don’t pause; pick Playbook 2. Collapse chatty calls with a BFF, add timeouts/circuit breakers, and canary. You’ll stabilize now and keep the migration moving.
- How safe is canarying performance changes?
- Very, if you enforce SLO-based guardrails and auto-revert with ArgoCD/Istio. We target 10% → 50% → 100% with p95/error rate checks gating each step.
- What about AI features that are slow and inconsistent?
- Use Playbook 5: cache and stream for perceived speed, tier models by need, and add circuit breakers. If AI-generated code created blocking calls, run a short code rescue to refactor async boundaries.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
