Stop Chasing P99s in the Dark: A Practical Framework to Balance Performance and Cloud Spend
When the p95 spikes and Finance pings you on Slack, you need a framework that connects user happiness to dollars. Here’s the playbook I use to tune systems without torching the budget.
Performance work only matters if users feel it and Finance can measure it.
The day the p95s spiked and the CFO called
We’d just shipped a “simple” personalization tweak. Thirty minutes later, `p95` on the product page jumped from 320ms to 950ms, `error_rate` crept to 2.2%, and autoscaling doubled our `c5.2xlarge` pool. Real User Monitoring showed LCP in Chrome Mobile drifting past 3s, and conversions dropped 8%. Finance pinged me: “Why is EC2 up 40% this hour?” I’ve seen this movie too many times: teams chase `p99` graphs while burning money, and customers quietly exit.
Here’s what actually works: anchor every optimization to a user-facing metric and a unit-economic target. Then use a small set of brutally practical techniques to move those numbers—without letting the cloud bill run feral.
Define the metrics that actually move the business
Forget vanity dashboards. Track what your CFO and your users feel.
- User-facing: `LCP`, `TTFB`, `CLS` (web), `Apdex`, mobile cold start, checkout `p95`, error rate. Tools: `Datadog RUM`, `New Relic Browser`, `Google Lighthouse`, `Web Vitals`.
- Service SLOs: `p50/p95`, `error_rate`, and `availability` per API/domain. Instrument with `OpenTelemetry`, scrape with `Prometheus 2.x`, visualize in `Grafana 10` (use panels that overlay SLO targets; a recording-rule sketch follows this list).
- Cost lenses: `cost_per_1k_requests`, `cost_per_conversion`, `cost_per_active_user`. Tag infra with `owner`, `service`, `env` and pipe to `CloudWatch/CUR` or `GCP Billing` → `BigQuery` → `Looker/Grafana`.
- Capacity signals: queue depth, saturation (CPU throttle, `load_avg`, `pg_locks`), cache hit ratio, latency-vs-concurrency curves. These tell you when to scale vs optimize.
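To keep the SLO panels cheap and consistent, precompute `p95` and `error_rate` as Prometheus recording rules. A minimal sketch, assuming your instrumentation emits a standard `http_request_duration_seconds` histogram and an `http_requests_total` counter with `service` and `status` labels (metric and rule names here are illustrative):

```yaml
# recording-rules.yaml — precomputed SLO series for the Grafana panels
groups:
  - name: slo-recordings
    rules:
      # p95 latency per service, derived from the request-duration histogram
      - record: service:latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      # error rate: 5xx responses over total requests
      - record: service:error_rate:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
```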
If a metric doesn’t tie to a user journey or a unit cost, it’s noise. Kill it.
A pragmatic framework for resource optimization
I keep it to four steps. It’s boring. It works.
- Baseline: Capture a 7-day window of RUM, SLO, and unit cost data. Freeze major features. Example: product API `p95=380ms`, `error=0.3%`, `cost_per_1k_reqs=$0.36`.
- Budget: Set targets that map to revenue. Example: “Reduce product page `LCP` to <2.5s and `cost_per_1k_reqs` to <$0.25 without pushing `error_rate` > 0.5%.”
- Boundaries: Enforce guardrails before tuning: Kubernetes `requests/limits`, `ResourceQuota`, `LimitRange`, rate limits, and circuit breakers. Block regressions at the gate, not during an incident (a minimal `ResourceQuota` sketch follows this list).
- Iterate: Make single-variable changes, measure for 24–72 hours, roll forward or back via `ArgoCD`. If a change doesn’t move the user metric or unit cost, revert. No heroics.
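For the Boundaries step, the guardrails are plain Kubernetes objects you can enforce in Git. A minimal `ResourceQuota` sketch for one namespace (namespace name and numbers are illustrative; pair it with the `LimitRange` shown later):

```yaml
# resource-quota.yaml — per-namespace ceiling so one service cannot eat the cluster
apiVersion: v1
kind: ResourceQuota
metadata:
  name: checkout-quota
  namespace: checkout
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "60"
```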
Dashboards that matter (one screen, no scrolling):
- Panel 1: `p50/p95/p99` with SLO bands, `error_rate` overlay.
- Panel 2: `cost_per_1k_requests` and total spend by service (tagged).
- Panel 3: cache hit/byte-hit ratios, DB average query time, `pg_stat_statements` top offenders.
- Panel 4: infra saturation: CPU throttle, memory RSS vs limit, GC pauses, queue depth.
Tactics that deliver measurable wins
These are the fixes I’ve seen pay back in days, not quarters.
Right-size Kubernetes before you scale out (K8s 1.27+):
- Install `Goldilocks` and `VPA v0.13.0` in recommend mode. Open PRs to adjust `requests/limits`.
- Detect CPU throttling via `container_cpu_cfs_throttled_seconds_total`. If throttle > 2% at peak, bump CPU `requests` modestly (10–20%); a sample throttle query follows the `LimitRange` example below.
- Example outcome: one client dropped node count 25% by right-sizing 14 services; `p95` improved 18% from reduced throttling.
- Example `LimitRange` YAML:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "250m"
        memory: "256Mi"
```
Scale on meaningful signals: CPU is a terrible proxy for latency.
- Use `KEDA` to scale on `RPS`, `Kafka lag`, or queue depth; or HPA with custom metrics from `PrometheusAdapter` (a `ScaledObject` sketch follows this list).
- Keep `maxSurge=1`, `maxUnavailable=0` in rollouts to avoid pileups.
- Result: a chatty API went from oscillating pods (thrash) to stable `p95` with 30% fewer replicas.
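Here is what scaling on RPS looks like with KEDA. A minimal `ScaledObject` sketch, assuming a Deployment named `product-api`, a Prometheus reachable at `prometheus.monitoring:9090`, and the request counter from earlier (all names and thresholds are illustrative):

```yaml
# scaledobject.yaml — scale on requests per second instead of CPU
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: product-api-rps
spec:
  scaleTargetRef:
    name: product-api                  # assumed Deployment name
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{service="product-api"}[2m]))
        threshold: "150"               # target RPS per replica; derive it from your latency-vs-concurrency curve
```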
Cache like you mean it:
- Edge: Add `Cache-Control: s-maxage=300, stale-while-revalidate=60` for product listings. `Fastly`/`CloudFront` will do the heavy lifting (a sketch of setting the header at the Envoy edge follows this list).
- App: `Redis` with TTLs and negative caching for 404s. Track hit ratio and byte hit ratio; target > 80%/70% respectively.
- Result: cut `TTFB` by 120ms and origin egress by 55%, dropping `cost_per_1k_reqs` by ~$0.08.
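One way to stamp the edge-cache header without touching app code is at the Envoy route. A route-configuration fragment, assuming a `/products` prefix and an upstream cluster named `product-api` (both illustrative); the CDN then honors `s-maxage` at the edge:

```yaml
# route-config fragment — add Cache-Control on low-churn listing routes
virtual_hosts:
  - name: catalog
    domains: ["*"]
    routes:
      - match:
          prefix: "/products"
        route:
          cluster: product-api          # assumed upstream cluster
        response_headers_to_add:
          - header:
              key: Cache-Control
              value: "s-maxage=300, stale-while-revalidate=60"
```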
Backpressure and circuit breaking: protect the golden path.
- At ingress (`Envoy 1.29`):
```yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 20000
      max_pending_requests: 5000
      max_requests: 10000
      max_retries: 3
```
- Budget queue time: reject at 200ms of queueing to keep `p95` under SLO. Users prefer fast failure to spinners.
- Result: `error_rate` held steady at 0.4% during traffic spikes instead of cascading timeouts.
Kill the top 5 queries before buying bigger DBs (Postgres 14):
- Enable `pg_stat_statements`. Run:
```sql
SELECT query, calls, total_exec_time/1000 AS total_s,
       mean_exec_time AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```
- Add missing composite indexes and trim payloads. If you’re doing `SELECT *` in hot paths, you’re lighting money on fire.
- Add `pgbouncer` transaction pooling and a read replica for search. Result: DB CPU -35%, API `p95` -28%.
Runtime tuning beats instance upgrades:
- Go: set `GOMAXPROCS` to your cores; tune GC with `GOGC=100–150` if RSS is high and pauses are low. Measure with `runtime.ReadMemStats` exported to `Prometheus` (a Deployment fragment with these knobs follows this list).
- Java 17: try `-XX:+UseZGC` for services with GC pauses > 200ms on G1. We cut tail latency 22% by flipping to ZGC on an IO-heavy service.
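Because these are environment knobs, they belong in the manifest, not a runbook. A Deployment fragment sketch for a Go service (names and values are illustrative):

```yaml
# deployment fragment — runtime tuning pinned in Git, not tribal knowledge
containers:
  - name: product-api                  # hypothetical Go service
    image: registry.example.com/product-api:1.42.0
    resources:
      limits:
        cpu: "4"
        memory: 1Gi
    env:
      - name: GOMAXPROCS
        value: "4"                     # match the CPU limit, not the node's core count
      - name: GOGC
        value: "125"                   # fewer GC cycles at the cost of slightly more RSS
# For a Java 17 service, the equivalent is an env var the JVM reads at startup:
#   - name: JAVA_TOOL_OPTIONS
#     value: "-XX:+UseZGC"
```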
Serverless isn’t free (AWS Lambda):
- Memory setting controls CPU. For CPU-bound functions, bump from 512MB → 1024MB; latency may drop 40% while cost per request improves due to shorter duration.
- Use `Provisioned Concurrency` only on cold paths that actually hurt conversions; scale to 0 elsewhere. Track `cost_per_1k_invocations` (a SAM-style sketch follows this list).
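In SAM terms, both knobs are a single property each. A sketch of a function definition (function name, handler, and numbers are illustrative; `ProvisionedConcurrencyConfig` requires `AutoPublishAlias`):

```yaml
# template.yaml fragment — memory buys CPU; provisioned concurrency only where cold starts cost money
Resources:
  CheckoutPricingFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: pricing/
      Handler: com.example.pricing.Handler::handleRequest
      Runtime: java17
      MemorySize: 1024                 # CPU scales with memory; shorter duration can offset the higher rate
      Timeout: 5
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 5
```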
Case study: Taming a chatty checkout without burning cash
A mid-market retailer (K8s on EKS, microservices, Java/Go mix) saw checkout `p95` drift to 900ms during promos and EC2 spend spike 38% at peak. Conversion was off by 6–9% versus baseline. They’d been told to “add nodes and shard the DB.” We did this instead:
- Baselined: stitched `Datadog RUM` LCP + checkout `p95` to `cost_per_1k_orders` in `Grafana`. Baseline: `p95=910ms`, `error=1.1%`, `LCP web=3.2s`, `cost_per_1k_orders=$0.42`.
- Budgeted: “`p95` < 400ms, `error` < 0.5%, `LCP` < 2.5s, `cost_per_1k_orders` < $0.25.”
- Boundaries: applied an org-wide `LimitRange`, enforced an OPA policy to block pods without limits, and added `Envoy` circuit breakers to the payment aggregator.
- Iterated:
  - Right-sized 11 services via `Goldilocks` PRs; removed 2x CPU throttling. Node count -23%.
  - Cached catalog and tax rates at the edge with `stale-while-revalidate=60`. Origin requests -47%.
  - Killed 3 queries (missing composite index, N+1 in promotions, fat payload). DB CPU -33%.
  - Go runtime: `GOGC=125` shaved RSS 18% and stabilized pauses.
  - Switched HPA to scale on `RPS` instead of `CPU`. Removed thrash. Added a queue-time budget (200ms).
Results in 10 days:
- Checkout `p95` 910ms → 280ms (−69%).
- `LCP` 3.2s → 2.3s (−28%), mobile bounce rate −5.4%.
- `error_rate` 1.1% → 0.4%.
- `cost_per_1k_orders` $0.42 → $0.21 (−50%).
- Promo-day revenue +7.2% vs prior promo at similar traffic. No new instances purchased.
Governance: bake cost–performance into the pipeline
If you don’t encode guardrails, entropy wins. Ship with seatbelts:
- GitOps everything with `ArgoCD`: requests/limits, HPA/KEDA, `Envoy` configs, budgets. No ad-hoc changes at 2 a.m.
- OPA/Gatekeeper policies: reject pods without limits; block `:latest` images; enforce `maxReplicas` and namespace `ResourceQuota`.
- SLO-first rollouts: canary with `Argo Rollouts` or `Flagger`. Auto-abort if `p95` or `error_rate` breach for N minutes, not just `CPU%` (a minimal `AnalysisTemplate` sketch follows this list).
- FinOps checks in CI: run an `infracost` job per PR; fail builds if projected `cost_per_1k_requests` exceeds budget. Tie to tags so Finance can see ownership.
- Planned chaos: run `chaos-mesh` or `Gremlin` monthly to test circuit breakers and queue budgets. If it fails under test, it will fail on Black Friday.
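For the SLO-first rollouts, an Argo Rollouts `AnalysisTemplate` sketch that gates a canary on the recorded series defined earlier (service label, Prometheus address, and thresholds are illustrative):

```yaml
# analysis-template.yaml — abort the canary when p95 or error rate breach the SLO
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slo-gate
spec:
  metrics:
    - name: p95-latency
      interval: 1m
      failureLimit: 3                        # tolerate brief blips, abort on sustained breach
      successCondition: result[0] <= 0.4     # seconds, i.e. p95 under 400ms
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: service:latency_seconds:p95{service="checkout"}
    - name: error-rate
      interval: 1m
      failureLimit: 3
      successCondition: result[0] <= 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: service:error_rate:ratio5m{service="checkout"}
```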
Field notes: what I’d do differently next time
- Don’t start with `p99`. It’s where dragons and tail noise live. Fix `p95` first; if conversion still suffers, then open the `p99` door.
- Don’t use CPU as the only HPA signal. Scale on RPS/queue depth. CPU-based scaling rewards slow code.
- Avoid runtimes that hide GC pain. Measure pauses explicitly; if your perf issue is GC, solve GC, not autoscaling.
- Make caching a first-class KPI. Edge hit ratio and byte-hit ratio belong next to LCP on the dashboard.
- Keep feature teams accountable for cost. Add `cost_per_1k_requests` to the service ownership checklist. Bad caches are everyone’s problem, but someone’s responsibility.
30/60/90-day playbook
30 days:
- Instrument `OpenTelemetry` traces for hot paths; wire them to `Prometheus/Grafana`.
- Stand up a unified dashboard: SLO metrics + unit cost + saturation. Kill 20% of the noisy panels.
- Deploy `Goldilocks` and open right-sizing PRs for the top 10 services.
- Enable `pg_stat_statements`; fix the top 3 queries.
60 days:
- Switch HPA to RPS/queue-based scaling where applicable (`KEDA` for async).
- Add `Envoy` circuit breakers and queue-time budgets at ingress.
- Implement CDN caching with `s-maxage` and `stale-while-revalidate` for static assets and low-churn APIs.
- Start canary rollouts gated on SLOs. Add OPA policies to block pods without limits.
90 days:
- Tie `infracost` into CI with budgets per service. Alert on `cost_per_1k_requests` regressions.
- Tune runtimes (Go `GOGC`, Java ZGC) for latency-heavy services.
- Run a chaos day to validate circuit breakers and throttling under failure. Document runbooks and rollback paths.
Key takeaways
- Tie performance work to user-facing metrics and unit economics, not vanity p99s.
- Instrument cost per request/user/transaction alongside latency and error rate.
- Use a simple, repeatable loop: baseline → budget → boundaries → iterate.
- Right-size resources with data (VPA/Goldilocks), not folklore or vendor defaults.
- Throttle and shed load before you autoscale; scale on meaningful signals (RPS, queue depth).
- Cache aggressively at the edge and app layer; measure hit ratio and byte hit ratio.
- Govern via GitOps: enforce requests/limits, budgets, and SLOs in the pipeline.
Implementation checklist
- Define SLOs mapped to revenue steps (e.g., checkout p95 < 400ms).
- Track cost per 1k requests and per conversion in the same dashboard as latency.
- Enable `pg_stat_statements` and kill the top 5 slow queries.
- Deploy `Goldilocks` and open PRs for right-sized requests/limits.
- Switch HPA to RPS/queue-depth metrics and set `maxUnavailable=0` for rollouts.
- Add CDN `Cache-Control` with `s-maxage` and `stale-while-revalidate` for static+API.
- Introduce `Envoy` circuit breakers and queue time budgets at ingress.
- Tune runtimes: `GOGC=100-150`, Java 17 `-XX:+UseZGC` where GC dominates.
- Automate guardrails: OPA policies blocking pods without limits; budget alerts in CI.
Questions we hear from teams
- How do I pick the right SLOs for performance vs cost?
- Anchor SLOs to critical user journeys (e.g., product detail, checkout) and pick `p95` thresholds where conversion inflects. Pair each SLO with a budgeted `cost_per_1k_requests` or `cost_per_conversion`. If latency is below the inflection point but cost is high, optimize spend; if above, optimize speed first.
- Isn’t autoscaling cheaper than optimizing code?
- Sometimes—until you hit saturation on shared components (DB, cache, network) and tail latency explodes. Autoscaling without backpressure hides problems and inflates spend. Fix the top 5 queries, cache aggressively, and scale on RPS/queue depth; it’s usually cheaper and faster than buying bigger nodes.
- How do I measure cost per request reliably?
- Tag resources by `service`, `env`, and `owner`. Export metered cost (via AWS CUR or GCP Billing) to a warehouse and join against request counts from `Prometheus` or your gateway logs. Plot `cost_per_1k_requests` in Grafana next to `p95` and `error_rate`. Automate with `infracost` in CI for projections per PR.
- What’s the quickest win if I have one week?
- Deploy `Goldilocks`, right-size top 10 services, add CDN caching for low-churn endpoints, and enable `pg_stat_statements` to fix the top queries. Add `Envoy` circuit breakers to cap tail latency. You’ll usually see a 20–40% latency improvement and 15–30% cost reduction in a week.
- When should I optimize `p99`?
- If your `p95` is healthy but you still see user pain (timeouts, retries, SLA penalties) or high-value workflows suffer from long tails (trading, payments), then target `p99`. Use queue time budgets, circuit breakers, and runtime GC tuning. Otherwise, `p99` is often noise that derails the team.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.