Stop Chasing P99s in the Dark: A Practical Framework to Balance Performance and Cloud Spend

When the p95 spikes and Finance pings you on Slack, you need a framework that connects user happiness to dollars. Here’s the playbook I use to tune systems without torching the budget.

Performance work only matters if users feel it and Finance can measure it.

The day the p95s spiked and the CFO called

We’d just shipped a “simple” personalization tweak. Thirty minutes later, p95 on the product page jumped from 320ms to 950ms, error_rate crept to 2.2%, and autoscaling doubled our c5.2xlarge pool. Real User Monitoring showed LCP in Chrome Mobile drifting past 3s, and conversions dropped 8%. Finance pinged me: “Why is EC2 up 40% this hour?” I’ve seen this movie too many times: teams chase p99 graphs while burning money, and customers are quietly exiting.

Here’s what actually works: anchor every optimization to a user-facing metric and a unit-economic target. Then use a small set of brutally practical techniques to move those numbers—without letting the cloud bill run feral.

Define the metrics that actually move the business

Forget vanity dashboards. Track what your CFO and your users feel.

  • User-facing: LCP, TTFB, CLS (web), Apdex, mobile cold start, checkout p95, error rate. Tools: Datadog RUM, New Relic Browser, Google Lighthouse, Web Vitals.

  • Service SLOs: p50/p95, error_rate, and availability per API/domain. Instrument with OpenTelemetry, scrape with Prometheus 2.x, visualize in Grafana 10 (use panels that overlay SLO targets).

  • Cost lenses: cost_per_1k_requests, cost_per_conversion, cost_per_active_user. Tag infra with owner, service, and env; pipe cost data from AWS CUR/CloudWatch or GCP Billing → BigQuery into Looker/Grafana (a recording-rule sketch follows this list).

  • Capacity signals: queue depth, saturation (CPU throttle, load_avg, pg_locks), cache hit ratio, Latency vs Concurrency curves. These tell you when to scale vs optimize.
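
To make the unit-cost lens concrete, here is a minimal Prometheus recording-rule sketch; cost_hourly_dollars, http_requests_total, and http_request_duration_seconds_bucket are assumed metric names from your billing export and instrumentation, so adapt them to whatever you actually emit.

# Recording rules (sketch): per-service p95 and dollars per 1k requests.
# Metric names below are assumptions, not a standard.
groups:
  - name: unit-economics
    rules:
      # Latency SLI: p95 over the last 5 minutes, per service.
      - record: service:request_latency_seconds:p95
        expr: histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Unit cost: hourly spend divided by thousands of requests per hour.
      - record: service:cost_per_1k_requests:dollars
        expr: |
          sum by (service) (cost_hourly_dollars)
          / (sum by (service) (rate(http_requests_total[1h])) * 3600 / 1000)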

If a metric doesn’t tie to a user journey or a unit cost, it’s noise. Kill it.

A pragmatic framework for resource optimization

I keep it to four steps. It’s boring. It works.

  1. Baseline: Capture a 7-day window of RUM, SLO, and unit cost. Freeze major features. Example: product API p95=380ms, error=0.3%, cost_per_1k_reqs=$0.36.

  2. Budget: Set targets that map to revenue. Example: “Reduce product page LCP to <2.5s and cost_per_1k_reqs to <$0.25 without pushing error_rate > 0.5%.”

  3. Boundaries: Enforce guardrails before tuning: Kubernetes requests/limits, ResourceQuota, LimitRange, rate limits, and circuit breakers (a ResourceQuota sketch follows this list). Block regressions at the gate, not during an incident.

  4. Iterate: Single-variable changes, measure for 24–72 hours, roll forward/back via ArgoCD. If a change doesn’t move the user metric or unit cost, revert. No heroics.
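
For step 3, the first boundary I usually put in place is a namespace ResourceQuota so a runaway rollout cannot silently double the node pool. A sketch follows; the namespace and numbers are illustrative, not recommendations.

# Namespace guardrail (sketch): caps aggregate requests/limits and pod count.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: checkout-quota
  namespace: checkout
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "60"
    limits.memory: 120Gi
    pods: "200"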

Dashboards that matter (one screen, no scrolling):

  • Panel 1: p50/p95/p99 with SLO bands, error_rate overlay.

  • Panel 2: cost_per_1k_requests and total spend by service (tagged).

  • Panel 3: cache hit/byte-hit, DB avg query time, pg_stat_statements top offenders.

  • Panel 4: infra saturation: CPU throttle, memory RSS vs limit, GC pauses, queue depth.

Tactics that deliver measurable wins

These are the fixes I’ve seen pay back in days, not quarters.

  • Right-size Kubernetes before you scale out (K8s 1.27+):

    • Install Goldilocks and VPA v0.13.0 in recommendation-only mode (updateMode: Off). Open PRs to adjust requests/limits from the recommendations.

    • Detect CPU throttling via container_cpu_cfs_throttled_seconds_total (or the periods-based ratio in the alert sketch after the LimitRange YAML below). If throttling exceeds 2% at peak, bump CPU requests modestly (10–20%).

    • Example outcome: one client dropped node count 25% by right-sizing 14 services; p95 improved 18% from reduced throttling.

    • Example LimitRange YAML:

apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
  • Scale on meaningful signals: CPU is a terrible proxy for latency.

    • Use KEDA to scale on RPS, Kafka lag, or queue depth, or use the HPA with custom metrics from the Prometheus Adapter (a KEDA sketch follows this list).

    • Keep maxSurge=1, maxUnavailable=0 in rollouts to avoid pileups.

    • Result: a chatty API went from oscillating pods (thrash) to stable p95 with 30% fewer replicas.
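
Here is what RPS-based scaling can look like with KEDA; the deployment name, Prometheus address, query, and the 150-RPS-per-replica threshold are all assumptions for illustration.

# KEDA ScaledObject (sketch): scale on request rate instead of CPU.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: product-api-rps
  namespace: shop
spec:
  scaleTargetRef:
    name: product-api        # Deployment to scale
  minReplicaCount: 3
  maxReplicaCount: 30
  cooldownPeriod: 120
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{service="product-api"}[2m]))
        threshold: "150"     # target RPS per replica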

  • Cache like you mean it:

    • Edge: add Cache-Control: s-maxage=300, stale-while-revalidate=60 for product listings and let Fastly/CloudFront do the heavy lifting (one way to stamp the header at the ingress is sketched after this list).

    • App: Redis with TTLs and negative caching for 404s. Track hit ratio and byte hit ratio; target > 80%/70% respectively.

    • Result: cut TTFB by 120ms and origin egress by 55%, dropping cost_per_1k_reqs by ~$0.08.
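
If the application cannot easily set the header itself, one option is to stamp it at the ingress. The sketch below assumes ingress-nginx with snippet annotations enabled; host, path, and service names are placeholders.

# ingress-nginx (sketch): add edge-cacheable headers on a low-churn endpoint.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: product-listings
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      add_header Cache-Control "s-maxage=300, stale-while-revalidate=60" always;
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /api/listings
            pathType: Prefix
            backend:
              service:
                name: product-api
                port:
                  number: 80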

  • Backpressure and circuit breaking: protect the golden path.

    • At ingress (Envoy 1.29):
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 20000
    max_pending_requests: 5000
    max_requests: 10000
    max_retries: 3
    • Budget queue time: reject requests once they’ve queued for 200ms to keep p95 under the SLO. Users prefer fast failure to spinners.

    • Result: error_rate held steady at 0.4% during traffic spikes instead of cascading into timeouts.

  • Kill the top 5 queries before buying bigger DBs (Postgres 14):

    • Enable pg_stat_statements. Run:
SELECT query, calls, total_exec_time/1000 AS total_s,
       mean_exec_time AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
    • Add missing composite indexes and trim payloads. If you’re doing SELECT * in hot paths, you’re lighting money on fire.

    • Add PgBouncer transaction pooling and a read replica for search. Result: DB CPU -35%, API p95 -28%.

  • Runtime tuning beats instance upgrades:

    • Go: set GOMAXPROCS to match the container CPU limit (uber-go/automaxprocs handles this); tune GOGC (default 100) toward 150 to trade a larger heap for less GC CPU when pauses hurt and you have memory headroom. Measure with runtime.ReadMemStats exported to Prometheus.

    • Java 17: try -XX:+UseZGC for services with GC pauses > 200ms on G1. We cut tail latency 22% by flipping to ZGC on an IO-heavy service.

  • Serverless isn’t free (AWS Lambda):

    • Memory setting controls CPU. For CPU-bound functions, bump from 512MB → 1024MB; latency may drop 40% while cost per request improves due to shorter duration.

    • Use Provisioned Concurrency only on cold-start paths that actually hurt conversions; let everything else scale to zero. Track cost_per_1k_invocations (a SAM sketch follows).
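
In SAM terms, the memory bump and targeted Provisioned Concurrency look roughly like this; the function name, runtime, and counts are placeholders, and the alias is what the provisioned concurrency attaches to.

# AWS SAM (sketch): more memory buys more CPU; provisioned concurrency only
# on the conversion-critical function.
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Resources:
  CheckoutPricingFn:
    Type: AWS::Serverless::Function
    Properties:
      Handler: pricing.handler
      Runtime: nodejs18.x
      CodeUri: ./pricing
      MemorySize: 1024            # was 512; Lambda CPU scales with memory
      Timeout: 5
      AutoPublishAlias: live      # provisioned concurrency attaches to the alias
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 5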

Case study: Taming a chatty checkout without burning cash

A mid-market retailer (K8s on EKS, microservices, Java/Go mix) saw checkout p95 drift to 900ms during promos and EC2 spend spike 38% at peak. Conversion was off by 6–9% versus baseline. They’d been told to “add nodes and shard the DB.” We did this instead:

  1. Baselined: stitched Datadog RUM LCP + checkout p95 to cost_per_1k_orders in Grafana. Baseline: p95=910ms, error=1.1%, LCP web=3.2s, cost_per_1k_orders=$0.42.

  2. Budgeted: “p95 < 400ms, error < 0.5%, LCP < 2.5s, cost_per_1k_orders < $0.25.”

  3. Boundaries: applied org-wide LimitRange, enforced OPA policy to block pods without limits, and added Envoy circuit breakers to the payment aggregator.

  4. Iterated:

    • Right-sized 11 services via Goldilocks PRs; removed 2x CPU throttling. Node count -23%.

    • Cached catalog and tax rates at edge with stale-while-revalidate=60. Origin requests -47%.

    • Killed 3 queries (missing composite index, N+1 in promotions, fat payload). DB CPU -33%.

    • Tuned the Go runtime to GOGC=125, which shaved RSS 18% and stabilized pauses.

    • Switched the HPA from CPU to RPS-based scaling, which removed the thrash. Added a queue-time budget (200ms).

Results in 10 days:

  • Checkout p95 910ms → 280ms (−69%).

  • LCP 3.2s → 2.3s (−28%), mobile bounce rate −5.4%.

  • error_rate 1.1% → 0.4%.

  • cost_per_1k_orders $0.42 → $0.21 (−50%).

  • Promo-day revenue +7.2% vs prior promo at similar traffic. No new instances purchased.

Governance: bake cost–performance into the pipeline

If you don’t encode guardrails, entropy wins. Ship with seatbelts:

  • GitOps everything with ArgoCD: requests/limits, HPA/KEDA, Envoy configs, budgets. No ad-hoc changes at 2 a.m.

  • OPA/Gatekeeper policies: reject pods without limits; block :latest images; enforce maxReplicas and namespace ResourceQuota (a constraint sketch follows this list).

  • SLO-first rollouts: canary with Argo Rollouts or Flagger. Auto-abort if p95 or error_rate breach for N minutes, not just CPU%.

  • FinOps checks in CI: run an infracost job per PR; fail builds if the projected cost_per_1k_requests exceeds the budget. Tie resources to tags so Finance can see ownership.

  • Planned chaos: run Chaos Mesh or Gremlin monthly to test circuit breakers and queue budgets. If it fails under test, it will fail on Black Friday.
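
The “no limits, no ship” rule is the one I’d automate first. A minimal constraint sketch follows, assuming the gatekeeper-library K8sContainerLimits template is already installed; the ceilings are illustrative.

# Gatekeeper constraint (sketch): reject pods whose containers omit limits or
# exceed the ceilings below. Requires the K8sContainerLimits ConstraintTemplate.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: containers-must-declare-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    excludedNamespaces: ["kube-system"]
  parameters:
    cpu: "2"
    memory: "2Gi"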

Field notes: what I’d do differently next time

  • Don’t start with p99. It’s where dragons and tail noise live. Fix p95 first; if conversion still suffers, then open the p99 door.

  • Don’t use CPU as the only HPA signal. Scale on RPS/queue depth. CPU-based scaling rewards slow code.

  • Avoid runtimes that hide GC pain. Measure pauses explicitly; if your perf issue is GC, solve GC—not autoscaling.

  • Make caching a first-class KPI. Edge hit ratio and byte-hit ratio belong next to LCP on the dashboard.

  • Keep feature teams accountable for cost. Add cost_per_1k_requests to the service ownership checklist. Bad caches are everyone’s problem, but someone’s responsibility.

30/60/90-day playbook

30 days:

  • Instrument OpenTelemetry traces for hot paths; wire to Prometheus/Grafana.

  • Stand up a unified dashboard: SLO metrics + unit cost + saturation. Kill 20% of noisy panels.

  • Deploy Goldilocks and open right-sizing PRs for top 10 services.

  • Enable pg_stat_statements; fix top 3 queries.

60 days:

  • Switch HPA to RPS/queue-based scaling where applicable (KEDA for async).

  • Add Envoy circuit breakers and queue time budgets at ingress.

  • Implement CDN caching with s-maxage and stale-while-revalidate for static and low-churn APIs.

  • Start canary rollouts gated on SLOs (an analysis-template sketch follows this list). Add OPA policies to block pods without limits.
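
A canary gate for those SLO-first rollouts can look like the Argo Rollouts sketch below; the Prometheus address, queries, and thresholds are assumptions tied to the checkout targets above.

# Argo Rollouts AnalysisTemplate (sketch): abort the canary when p95 or
# error rate breaches the SLO for several consecutive checks.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-slo-gate
spec:
  metrics:
    - name: p95-latency
      interval: 1m
      failureLimit: 3
      successCondition: result[0] < 0.4   # seconds
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: >
            histogram_quantile(0.95, sum by (le)
            (rate(http_request_duration_seconds_bucket{service="checkout",role="canary"}[5m])))
    - name: error-rate
      interval: 1m
      failureLimit: 3
      successCondition: result[0] < 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: >
            sum(rate(http_requests_total{service="checkout",role="canary",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout",role="canary"}[5m]))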

90 days:

  • Tie infracost into CI with budgets per service. Alert on cost_per_1k_requests regressions.

  • Tune runtimes (Go GOGC, Java ZGC) for latency-heavy services.

  • Run a chaos day to validate circuit breakers and throttling under failure. Document runbooks and rollback paths.
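
For the chaos day, a bounded Chaos Mesh experiment is enough to prove the circuit breakers hold; the namespaces, labels, and duration below are placeholders.

# Chaos Mesh (sketch): make one payment-aggregator pod unavailable for two
# minutes and watch the SLO dashboard, not just the pod count.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-aggregator-failure
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: one
  duration: "2m"
  selector:
    namespaces:
      - checkout
    labelSelectors:
      app: payment-aggregator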


Key takeaways

  • Tie performance work to user-facing metrics and unit economics, not vanity p99s.
  • Instrument cost per request/user/transaction alongside latency and error rate.
  • Use a simple, repeatable loop: baseline → budget → boundaries → iterate.
  • Right-size resources with data (VPA/Goldilocks), not folklore or vendor defaults.
  • Throttle and shed load before you autoscale; scale on meaningful signals (RPS, queue depth).
  • Cache aggressively at the edge and app layer; measure hit ratio and byte hit ratio.
  • Govern via GitOps: enforce requests/limits, budgets, and SLOs in the pipeline.

Implementation checklist

  • Define SLOs mapped to revenue steps (e.g., checkout p95 < 400ms).
  • Track cost per 1k requests and per conversion in the same dashboard as latency.
  • Enable `pg_stat_statements` and kill the top 5 slow queries.
  • Deploy `Goldilocks` and open PRs for right-sized requests/limits.
  • Switch HPA to RPS/queue-depth metrics and set `maxUnavailable=0` for rollouts.
  • Add CDN `Cache-Control` with `s-maxage` and `stale-while-revalidate` for static+API.
  • Introduce `Envoy` circuit breakers and queue time budgets at ingress.
  • Tune runtimes: `GOGC=100-150`, Java 17 `-XX:+UseZGC` where GC dominates.
  • Automate guardrails: OPA policies blocking pods without limits; budget alerts in CI.

Questions we hear from teams

How do I pick the right SLOs for performance vs cost?
Anchor SLOs to critical user journeys (e.g., product detail, checkout) and pick `p95` thresholds where conversion inflects. Pair each SLO with a budgeted `cost_per_1k_requests` or `cost_per_conversion`. If latency is below the inflection point but cost is high, optimize spend; if above, optimize speed first.
Isn’t autoscaling cheaper than optimizing code?
Sometimes—until you hit saturation on shared components (DB, cache, network) and tail latency explodes. Autoscaling without backpressure hides problems and inflates spend. Fix the top 5 queries, cache aggressively, and scale on RPS/queue depth; it’s usually cheaper and faster than buying bigger nodes.
How do I measure cost per request reliably?
Tag resources by `service`, `env`, and `owner`. Export metered cost (via AWS CUR or GCP Billing) to a warehouse and join against request counts from `Prometheus` or your gateway logs. Plot `cost_per_1k_requests` in Grafana next to `p95` and `error_rate`. Automate with `infracost` in CI for projections per PR.
What’s the quickest win if I have one week?
Deploy `Goldilocks`, right-size top 10 services, add CDN caching for low-churn endpoints, and enable `pg_stat_statements` to fix the top queries. Add `Envoy` circuit breakers to cap tail latency. You’ll usually see a 20–40% latency improvement and 15–30% cost reduction in a week.
When should I optimize `p99`?
If your `p95` is healthy but you still see user pain (timeouts, retries, SLA penalties) or high-value workflows suffer from long tails (trading, payments), then target `p99`. Use queue time budgets, circuit breakers, and runtime GC tuning. Otherwise, `p99` is often noise that derails the team.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about your cost–performance hotspots, or see how we cut checkout latency 69% without new hardware.
