Stop Chasing P99s in the Dark: A Practical Framework to Balance Performance and Cloud Spend
When the p95 spikes and Finance pings you on Slack, you need a framework that connects user happiness to dollars. Here’s the playbook I use to tune systems without torching the budget.
Performance work only matters if users feel it and Finance can measure it.
The day the p95s spiked and the CFO called
We’d just shipped a “simple” personalization tweak. Thirty minutes later, p95 on the product page jumped from 320ms to 950ms, error_rate crept to 2.2%, and autoscaling doubled our c5.2xlarge pool. Real User Monitoring showed LCP in Chrome Mobile drifting past 3s, and conversions dropped 8%. Finance pinged me: “Why is EC2 up 40% this hour?” I’ve seen this movie too many times: teams chase p99 graphs while burning money, and customers are quietly exiting.
Here’s what actually works: anchor every optimization to a user-facing metric and a unit-economic target. Then use a small set of brutally practical techniques to move those numbers—without letting the cloud bill run feral.
Define the metrics that actually move the business
Forget vanity dashboards. Track what your CFO and your users feel.
- User-facing: `LCP`, `TTFB`, `CLS` (web), `Apdex`, mobile cold start, checkout `p95`, error rate. Tools: `Datadog RUM`, `New Relic Browser`, `Google Lighthouse`, `Web Vitals`.
- Service SLOs: `p50`/`p95`, `error_rate`, and `availability` per API/domain. Instrument with `OpenTelemetry`, scrape with `Prometheus 2.x`, visualize in `Grafana 10` (use panels that overlay SLO targets).
- Cost lenses: `cost_per_1k_requests`, `cost_per_conversion`, `cost_per_active_user`. Tag infra with `owner`, `service`, `env` and pipe to `CloudWatch`/CUR or `GCP Billing` → `BigQuery` → Looker/Grafana.
- Capacity signals: queue depth, saturation (CPU throttle, `load_avg`, `pg_locks`), cache hit ratio, latency-vs-concurrency curves. These tell you when to scale vs optimize.
If a metric doesn’t tie to a user journey or a unit cost, it’s noise. Kill it.
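To make the cost lens concrete, here’s a minimal sketch of the unit-economics join. The spend and request numbers are hypothetical; in practice the inputs come from your CUR/Billing export (grouped by the `service` tag) and request counters in Prometheus or gateway logs:

```python
# Sketch: join daily tagged spend with request counts to get cost_per_1k_requests.
# Input dicts are hypothetical stand-ins for a CUR export and Prometheus counters.

def cost_per_1k_requests(spend_by_service, requests_by_service):
    """Return {service: dollars per 1k requests}, skipping services with no traffic."""
    out = {}
    for svc, dollars in spend_by_service.items():
        reqs = requests_by_service.get(svc, 0)
        if reqs > 0:
            out[svc] = round(dollars / reqs * 1000, 4)
    return out

spend = {"product-api": 86.40, "checkout": 121.00}       # $/day, tagged by service
reqs = {"product-api": 240_000, "checkout": 550_000}     # requests/day

print(cost_per_1k_requests(spend, reqs))
# {'product-api': 0.36, 'checkout': 0.22}
```

Plot the result next to `p95` and `error_rate` so a latency win that doubles spend is visible on the same screen.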
A pragmatic framework for resource optimization
I keep it to four steps. It’s boring. It works.
1. Baseline: Capture a 7-day window of RUM, SLO, and unit cost. Freeze major features. Example: product API `p95=380ms`, `error=0.3%`, `cost_per_1k_reqs=$0.36`.
2. Budget: Set targets that map to revenue. Example: “Reduce product page `LCP` to <2.5s and `cost_per_1k_reqs` to <$0.25 without pushing `error_rate` > 0.5%.”
3. Boundaries: Enforce guardrails before tuning: Kubernetes `requests`/`limits`, `ResourceQuota`, `LimitRange`, rate limits, and circuit breakers. Block regressions at the gate, not during an incident.
4. Iterate: Make single-variable changes, measure for 24–72 hours, roll forward or back via `ArgoCD`. If a change doesn’t move the user metric or unit cost, revert. No heroics.
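The keep-or-revert call in step 4 can be encoded so it isn’t argued about at 2 a.m. A sketch with illustrative numbers (this budget mirrors the case study later, not yours):

```python
# Sketch of the iterate step: after a 24-72h window, keep a change only if
# no guardrail is breached AND at least one target metric actually improved.
# Thresholds below are illustrative.

BUDGET = {"p95_ms": 400, "error_rate": 0.005, "cost_per_1k": 0.25}

def keep_change(baseline, observed, budget=BUDGET):
    within = (observed["p95_ms"] <= budget["p95_ms"]
              and observed["error_rate"] <= budget["error_rate"]
              and observed["cost_per_1k"] <= budget["cost_per_1k"])
    improved = (observed["p95_ms"] < baseline["p95_ms"]
                or observed["cost_per_1k"] < baseline["cost_per_1k"])
    return within and improved

baseline = {"p95_ms": 380, "error_rate": 0.003, "cost_per_1k": 0.36}
observed = {"p95_ms": 350, "error_rate": 0.004, "cost_per_1k": 0.24}
print("keep" if keep_change(baseline, observed) else "revert")  # keep
```

Wire the same predicate into your canary gate so rollback is automatic, not a judgment call.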
Dashboards that matter (one screen, no scrolling):
- Panel 1: `p50`/`p95`/`p99` with SLO bands, `error_rate` overlay.
- Panel 2: `cost_per_1k_requests` and total spend by service (tagged).
- Panel 3: cache hit/byte-hit ratios, DB average query time, `pg_stat_statements` top offenders.
- Panel 4: infra saturation: CPU throttle, memory RSS vs limit, GC pauses, queue depth.
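Panels 1 and 4 can be driven by PromQL along these lines. The histogram name `http_request_duration_seconds_bucket` is an assumption about your OpenTelemetry instrumentation; the throttle counters are standard cAdvisor metrics:

```promql
# Panel 1: p95 per service from a request-duration histogram
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

# Panel 4: fraction of CPU periods throttled, per container
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m])
```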
Tactics that deliver measurable wins
These are the fixes I’ve seen pay back in days, not quarters.
Right-size Kubernetes before you scale out (K8s 1.27+):
- Install `Goldilocks` and `VPA v0.13.0` in recommend mode. Open PRs to adjust `requests`/`limits`.
- Detect CPU throttling via `container_cpu_cfs_throttled_seconds_total`. If throttle > 2% at peak, bump CPU `requests` modestly (10–20%).
- Example outcome: one client dropped node count 25% by right-sizing 14 services; `p95` improved 18% from reduced throttling.
- Example `LimitRange` YAML:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "250m"
        memory: "256Mi"
```

Scale on meaningful signals: CPU is a terrible proxy for latency.
- Use `KEDA` to scale on `RPS`, Kafka lag, or queue depth; or HPA with custom metrics from `PrometheusAdapter`.
- Keep `maxSurge=1`, `maxUnavailable=0` in rollouts to avoid pileups.
- Result: a chatty API went from oscillating pods (thrash) to stable `p95` with 30% fewer replicas.
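An RPS trigger in `KEDA` looks roughly like this. The deployment name, Prometheus address, query, and threshold are placeholders for your environment, not a drop-in config:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: product-api-rps
spec:
  scaleTargetRef:
    name: product-api            # the Deployment to scale
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{service="product-api"}[2m]))
        threshold: "200"         # target RPS per replica
```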
Cache like you mean it:
- Edge: Add `Cache-Control: s-maxage=300, stale-while-revalidate=60` for product listings. `Fastly`/`CloudFront` will do the heavy lifting.
- App: `Redis` with TTLs and negative caching for 404s. Track hit ratio and byte-hit ratio; target > 80% and > 70% respectively.
- Result: cut `TTFB` by 120ms and origin egress by 55%, dropping `cost_per_1k_reqs` by ~$0.08.
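The two ratios diverge when your misses are the big objects, which is exactly when origin egress gets expensive. A small sketch with made-up log entries:

```python
# Sketch: compute cache hit ratio (request-weighted) and byte-hit ratio
# (size-weighted) from edge log counters. Entries below are fabricated.

def cache_ratios(entries):
    """entries: iterable of (hit: bool, bytes_served: int) per request."""
    hits = total = hit_bytes = total_bytes = 0
    for hit, size in entries:
        total += 1
        total_bytes += size
        if hit:
            hits += 1
            hit_bytes += size
    return hits / total, hit_bytes / total_bytes

log = [(True, 10_000)] * 85 + [(False, 50_000)] * 15  # big responses keep missing
hit_ratio, byte_hit = cache_ratios(log)
print(f"hit={hit_ratio:.0%} byte-hit={byte_hit:.0%}")  # hit=85% byte-hit=53%
```

An 85% hit ratio can hide a byte-hit ratio well below the 70% target; cache the heavy payloads first.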
Backpressure and circuit breaking: protect the golden path.
- At ingress (`Envoy 1.29`):

```yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 20000
      max_pending_requests: 5000
      max_requests: 10000
      max_retries: 3
```

- Budget queue time: reject at 200ms of queueing to keep `p95` under SLO. Users prefer fast failure to spinners.
- Result: `error_rate` steady at 0.4% during traffic spikes instead of cascading timeouts.

Kill the top 5 queries before buying bigger DBs (Postgres 14):
- Enable `pg_stat_statements`. Run:

```sql
SELECT query, calls, total_exec_time/1000 AS total_s,
       mean_exec_time AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

- Add missing composite indexes and trim payloads. If you’re doing `SELECT *` in hot paths, you’re lighting money on fire.
- Add `pgbouncer` transaction pooling and a read replica for search. Result: DB CPU -35%, API `p95` -28%.

Runtime tuning beats instance upgrades:
- Go: set `GOMAXPROCS` to available cores; tune GC with `GOGC=100–150` if RSS is high and pauses are low. Measure with `runtime.ReadMemStats` exported to `Prometheus`.
- Java 17: try `-XX:+UseZGC` for services with GC pauses > 200ms on G1. We cut tail latency 22% by flipping to ZGC on an IO-heavy service.
Serverless isn’t free (AWS Lambda):
- Memory setting controls CPU. For CPU-bound functions, bump from 512MB → 1024MB; latency may drop 40% while cost per request improves due to shorter duration.
- Use `Provisioned Concurrency` only on cold paths that actually hurt conversions; scale to 0 elsewhere. Track `cost_per_1k_invocations`.
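Lambda bills GB-seconds, so the memory/duration math is worth doing before you bump the slider. A sketch with a hypothetical rate and made-up durations (check your region’s actual pricing); note that cost per request only improves when the duration drop outpaces the memory increase:

```python
# Sketch: duration billing means 2x memory can still be cheaper per request,
# but only if the function finishes in less than half the time.

PRICE_PER_GB_S = 0.0000167  # assumed on-demand rate, not current AWS pricing

def cost_per_1m_invocations(memory_mb, avg_duration_ms):
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_S * 1_000_000

before = cost_per_1m_invocations(512, 800)   # CPU-starved at 512MB
after = cost_per_1m_invocations(1024, 360)   # more CPU, >2x faster
print(f"${before:.2f} vs ${after:.2f} per 1M invocations")
```

Run the same arithmetic with your own p50 duration at each memory step before enabling anything like Provisioned Concurrency.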
Case study: Taming a chatty checkout without burning cash
A mid-market retailer (K8s on EKS, microservices, Java/Go mix) saw checkout p95 drift to 900ms during promos and EC2 spend spike 38% at peak. Conversion was off by 6–9% versus baseline. They’d been told to “add nodes and shard the DB.” We did this instead:
- Baselined: stitched `Datadog RUM` LCP + checkout `p95` to `cost_per_1k_orders` in `Grafana`. Baseline: `p95=910ms`, `error=1.1%`, web `LCP=3.2s`, `cost_per_1k_orders=$0.42`.
- Budgeted: “`p95` < 400ms, `error` < 0.5%, `LCP` < 2.5s, `cost_per_1k_orders` < $0.25.”
- Boundaries: applied an org-wide `LimitRange`, enforced an OPA policy to block pods without limits, and added `Envoy` circuit breakers to the payment aggregator.
- Iterated:
  - Right-sized 11 services via `Goldilocks` PRs; removed 2x CPU throttling. Node count -23%.
  - Cached catalog and tax rates at edge with `stale-while-revalidate=60`. Origin requests -47%.
  - Killed 3 queries (missing composite index, N+1 in promotions, fat payload). DB CPU -33%.
  - Go runtime: `GOGC=125` shaved RSS 18% and stabilized pauses.
  - Switched HPA to scale on `RPS` instead of CPU. Removed thrash. Added a queue-time budget (200ms).
Results in 10 days:
- Checkout `p95`: 910ms → 280ms (−69%).
- `LCP`: 3.2s → 2.3s (−28%); mobile bounce rate −5.4%.
- `error_rate`: 1.1% → 0.4%.
- `cost_per_1k_orders`: $0.42 → $0.21 (−50%).
- Promo-day revenue +7.2% vs the prior promo at similar traffic. No new instances purchased.
Governance: bake cost–performance into the pipeline
If you don’t encode guardrails, entropy wins. Ship with seatbelts:
- GitOps everything with `ArgoCD`: requests/limits, HPA/KEDA, `Envoy` configs, budgets. No ad-hoc changes at 2 a.m.
- OPA/Gatekeeper policies: reject pods without limits; block `:latest` images; enforce `maxReplicas` and namespace `ResourceQuota`.
- SLO-first rollouts: canary with `Argo Rollouts` or `Flagger`. Auto-abort if `p95` or `error_rate` breaches for N minutes, not just `CPU%`.
- FinOps checks in CI: run an `infracost` job per PR; fail builds if projected `cost_per_1k_requests` exceeds budget. Tie to tags so Finance can see ownership.
- Planned chaos: run `chaos-mesh` or `Gremlin` monthly to test circuit breakers and queue budgets. If it fails under test, it will fail on Black Friday.
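The “no pods without limits” guardrail can ride on the stock `K8sContainerLimits` template from the Gatekeeper policy library. A sketch, assuming that template is installed in your cluster; the constraint name, excluded namespaces, and ceilings are illustrative:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits       # template from the Gatekeeper policy library
metadata:
  name: containers-must-have-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
  parameters:
    cpu: "2"                   # per-container CPU ceiling
    memory: "2Gi"              # per-container memory ceiling
```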
Field notes: what I’d do differently next time
- Don’t start with `p99`. It’s where dragons and tail noise live. Fix `p95` first; if conversion still suffers, then open the `p99` door.
- Don’t use CPU as the only HPA signal. Scale on RPS/queue depth. CPU-based scaling rewards slow code.
- Avoid runtimes that hide GC pain. Measure pauses explicitly; if your perf issue is GC, solve GC, not autoscaling.
- Make caching a first-class KPI. Edge hit ratio and byte-hit ratio belong next to LCP on the dashboard.
- Keep feature teams accountable for cost. Add `cost_per_1k_requests` to the service ownership checklist. Bad caches are everyone’s problem, but someone’s responsibility.
30/60/90-day playbook
30 days:
- Instrument `OpenTelemetry` traces for hot paths; wire to `Prometheus`/`Grafana`.
- Stand up a unified dashboard: SLO metrics + unit cost + saturation. Kill 20% of noisy panels.
- Deploy `Goldilocks` and open right-sizing PRs for the top 10 services.
- Enable `pg_stat_statements`; fix the top 3 queries.
60 days:
- Switch HPA to RPS/queue-based scaling where applicable (`KEDA` for async).
- Add `Envoy` circuit breakers and queue-time budgets at ingress.
- Implement CDN caching with `s-maxage` and `stale-while-revalidate` for static and low-churn APIs.
- Start canary rollouts gated on SLOs. Add OPA policies to block pods without limits.
90 days:
- Tie `infracost` into CI with budgets per service. Alert on `cost_per_1k_requests` regressions.
- Tune runtimes (Go `GOGC`, Java ZGC) for latency-heavy services.
- Run a chaos day to validate circuit breakers and throttling under failure. Document runbooks and rollback paths.
Key takeaways
- Tie performance work to user-facing metrics and unit economics, not vanity p99s.
- Instrument cost per request/user/transaction alongside latency and error rate.
- Use a simple, repeatable loop: baseline → budget → boundaries → iterate.
- Right-size resources with data (VPA/Goldilocks), not folklore or vendor defaults.
- Throttle and shed load before you autoscale; scale on meaningful signals (RPS, queue depth).
- Cache aggressively at the edge and app layer; measure hit ratio and byte hit ratio.
- Govern via GitOps: enforce requests/limits, budgets, and SLOs in the pipeline.
Implementation checklist
- Define SLOs mapped to revenue steps (e.g., checkout p95 < 400ms).
- Track cost per 1k requests and per conversion in the same dashboard as latency.
- Enable `pg_stat_statements` and kill the top 5 slow queries.
- Deploy `Goldilocks` and open PRs for right-sized requests/limits.
- Switch HPA to RPS/queue-depth metrics and set `maxUnavailable=0` for rollouts.
- Add CDN `Cache-Control` with `s-maxage` and `stale-while-revalidate` for static+API.
- Introduce `Envoy` circuit breakers and queue time budgets at ingress.
- Tune runtimes: `GOGC=100-150`, Java 17 `-XX:+UseZGC` where GC dominates.
- Automate guardrails: OPA policies blocking pods without limits; budget alerts in CI.
Questions we hear from teams
- How do I pick the right SLOs for performance vs cost?
- Anchor SLOs to critical user journeys (e.g., product detail, checkout) and pick `p95` thresholds where conversion inflects. Pair each SLO with a budgeted `cost_per_1k_requests` or `cost_per_conversion`. If latency is below the inflection point but cost is high, optimize spend; if above, optimize speed first.
- Isn’t autoscaling cheaper than optimizing code?
- Sometimes—until you hit saturation on shared components (DB, cache, network) and tail latency explodes. Autoscaling without backpressure hides problems and inflates spend. Fix the top 5 queries, cache aggressively, and scale on RPS/queue depth; it’s usually cheaper and faster than buying bigger nodes.
- How do I measure cost per request reliably?
- Tag resources by `service`, `env`, and `owner`. Export metered cost (via AWS CUR or GCP Billing) to a warehouse and join against request counts from `Prometheus` or your gateway logs. Plot `cost_per_1k_requests` in Grafana next to `p95` and `error_rate`. Automate with `infracost` in CI for projections per PR.
- What’s the quickest win if I have one week?
- Deploy `Goldilocks`, right-size top 10 services, add CDN caching for low-churn endpoints, and enable `pg_stat_statements` to fix the top queries. Add `Envoy` circuit breakers to cap tail latency. You’ll usually see a 20–40% latency improvement and 15–30% cost reduction in a week.
- When should I optimize `p99`?
- If your `p95` is healthy but you still see user pain (timeouts, retries, SLA penalties) or high-value workflows suffer from long tails (trading, payments), then target `p99`. Use queue time budgets, circuit breakers, and runtime GC tuning. Otherwise, `p99` is often noise that derails the team.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
