The Latency Budget That Cut Our Cloud Bill 38% Without Slowing Users
A battle-tested framework to tie p95s to dollars, not dashboards — and ship faster without torching your cloud budget.
If your SLOs don’t drive a go/no-go decision in a deploy, they’re just wall art.
The Black Friday “AutoScale” that doubled our bill — and didn’t save conversion
We got called into a retailer whose checkout p95 spiked to 2.1s during promos. Their response was the classic “throw nodes at it” HPA policy — throughput stabilized, but cloud costs doubled in a day and conversion didn’t budge. I’ve seen this movie at marketplaces, adtech, and fintech: chasing raw throughput with autoscaling while the real issue is unindexed queries, chatty services, and image bloat.
What actually turned it around wasn’t a shinier instance class. It was a latency budget per user journey and a cost per request target. We optimized until both were green, then stopped. Result: checkout p95 dropped from 1.8s to 900ms, infra cost fell 38%, and conversion ticked up 2.3%. No heroics, just discipline.
Define a performance-to-cost contract per user journey
If you don’t make targets explicit, engineers will optimize the fun parts and ignore the expensive parts.
- Pick the top 3–5 user journeys: e.g., home → PDP → cart → checkout, or search → detail → add-to-cart.
- For each, set a p95 latency budget based on observed conversion elasticity. I like 750ms p95 for the checkout API and 2.5s LCP for the UI.
- Add a cost per request (CPR) target. Example: ≤ $0.003 per checkout API call all-in (compute + DB + CDN egress).
- Write them down as SLOs you can alert on, not aspirational wiki pages.
Example using OpenSLO for latency plus a simple custom CostBudget (we use a CRD backed by Kubecost/Cloud CUR):
# latency-slo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-latency-slo
spec:
  service: checkout
  timeWindow:
    duration: 30d
  objectives:
    - displayName: p95 latency under 750ms
      target: 0.99
      op: lte
      indicator:
        datasource: prometheus
        metric: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le))
---
# cost-budget.yaml (custom CRD backed by Kubecost)
apiVersion: finops.gitplumbers.dev/v1alpha1
kind: CostBudget
metadata:
  name: checkout-cost-budget
spec:
  window: 30d
  targetCostPerRequestUsd: 0.003
  selector:
    matchLabels:
      app: checkout

Treat the latency budget and cost budget as a contract: any change that improves one while violating the other needs a product or finance sign-off.
The optimization loop: measure, target, verify
Here’s the loop we run with clients. It’s boring; it works.
- Baseline: capture 14 days of p50/p95/p99 per key route, error rate, throughput, and CPR. Segment by device, region, and plan.
- Prioritize: find the 3 endpoints where reducing p95 by 200ms will move revenue (e.g., checkout, search autocomplete, auth). Ignore the rest.
- Hypothesize: choose techniques with quantified expectation (e.g., cache warm product details → -300ms p95, +$0.0002 CPR).
- Protect: wrap changes in a feature flag and set canary analysis gates on latency AND cost.
- Verify: promote if both SLOs are met; rollback if latency/cost regressions exceed thresholds.
- Codify: lock the win via infra as code — indexes, HPA/VPA, TTLs, rate limits, dashboards.
Canary analysis using Flagger with custom metrics for latency and CPR:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  analysis:
    interval: 1m
    threshold: 10
    stepWeight: 10
    maxWeight: 50
    metrics:
      - name: p95_latency
        templateRef:
          name: latency
        thresholdRange:
          max: 0.75 # seconds
      - name: error_rate
        thresholdRange:
          max: 0.01
      - name: cost_per_req
        templateRef:
          name: cpr
        thresholdRange:
          max: 0.003 # USD

Techniques that move metrics (and dollars)
Skip the exotic and start with the boring winners.
- Cache the hot path (5–10x ROI):
  - Read-through Redis/Memcached for product, pricing, feature flags.
  - TTL 300–900s; invalidate on writes.
// Node.js read-through cache for product details
const cacheKey = `product:${id}`;
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
const product = await db.getProduct(id);
await redis.set(cacheKey, JSON.stringify(product), { EX: 300, NX: true });
return product;

- Index where your query planner hurts (often 30–60% p95 drop):
-- Speed up "open orders by user" lookups
CREATE INDEX CONCURRENTLY idx_orders_user_created_at
ON orders (user_id, created_at DESC)
WHERE status = 'open';

- Kill chatty services (tail latency killer):
  - Collapse N round trips into 1 aggregate endpoint (see the sketch below).
  - Use gRPC/HTTP/2 for multiplexing if you must keep calls separate.
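Here is a minimal sketch of the aggregation idea in Node.js; Express and the stubbed service clients are illustrative assumptions, not production code:

// Hypothetical aggregate endpoint: one client round trip, parallel server-side fan-out,
// so user-perceived latency is roughly the max of the upstream calls, not their sum.
const express = require('express');
const app = express();

// Stubs standing in for internal service clients (illustrative only).
const productSvc = { get: async (id) => ({ id, title: 'Widget' }) };
const pricingSvc = { get: async (id) => ({ id, priceUsd: 19.99 }) };
const inventorySvc = { get: async (id) => ({ id, inStock: true }) };

app.get('/api/pdp/:id', async (req, res) => {
  const { id } = req.params;
  try {
    const [product, price, stock] = await Promise.all([
      productSvc.get(id),
      pricingSvc.get(id),
      inventorySvc.get(id),
    ]);
    res.json({ product, price, stock });
  } catch (err) {
    res.status(502).json({ error: 'upstream dependency failed' });
  }
});

app.listen(3000);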
- Compress and resize assets at the edge (LCP wins without CPU burns):
# NGINX/Ingress compression (add Brotli if supported)
gzip on;
gzip_types text/css application/javascript application/json;
brotli on;
brotli_types text/plain text/css application/javascript application/json;

- Precompute the expensive: materialize the top 1% of queries hourly (e.g., search facets, personalized recommendations) and serve them from a KV store.
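One way to wire the hourly precompute, sketched with node-redis; the query stub, key name, TTL, and schedule are placeholders to adapt:

// Hourly precompute job (run via cron or a Kubernetes CronJob): pay for the expensive
// aggregation once off the hot path, then serve a single O(1) KV read to users.
const { createClient } = require('redis');

// Hypothetical stand-in for the expensive aggregation (replace with your real query).
const db = { getTopSearchFacets: async () => ({ brands: [], categories: [] }) };

async function refreshSearchFacets() {
  const redis = createClient({ url: process.env.REDIS_URL });
  await redis.connect();
  try {
    const facets = await db.getTopSearchFacets(); // the slow query, no user waiting on it
    // TTL longer than the refresh interval so readers never hit a cold key.
    await redis.set('search:facets:v1', JSON.stringify(facets), { EX: 2 * 60 * 60 });
  } finally {
    await redis.quit();
  }
}

refreshSearchFacets().catch((err) => {
  console.error('facet refresh failed', err);
  process.exit(1);
});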
- Add circuit breakers and timeouts to protect the p95 from one slow dependency:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 50
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s

- Push work off the critical path: enqueue non-essential writes (emails, analytics, thumbnails) to a queue; return to user fast.
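A sketch of the offload pattern, assuming BullMQ over Redis; the queue name, payloads, and db stub are illustrative:

// Acknowledge the user as soon as the essential write lands; everything else goes to a
// queue that a separate worker drains. Assumes BullMQ; names and payloads are made up.
const { Queue } = require('bullmq');

const sideEffects = new Queue('order-side-effects', {
  connection: { host: process.env.REDIS_HOST || 'localhost', port: 6379 },
});

const db = { saveOrder: async (order) => order }; // stub for illustration

async function completeCheckout(order) {
  await db.saveOrder(order); // the only write that must block the response

  // Non-essential work leaves the critical path: email, analytics, thumbnails.
  await sideEffects.add('confirmation-email', { orderId: order.id });
  await sideEffects.add('analytics-event', { type: 'checkout', orderId: order.id });

  return { status: 'ok', orderId: order.id };
}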
Measured outcomes we’ve repeated across clients:
- Redis + index: checkout p95 1.2s → 720ms; DB CPU -35%; CPR -$0.0011.
- Brotli + image resizing: LCP 3.1s → 2.2s on 3G; egress -22%.
- Aggregation endpoint: tail p99 4.8s → 1.6s; HPA max replicas -40%.
Scale smart: autoscaling and right-sizing without surprises
Autoscaling isn’t a strategy; it’s a control loop. Feed it the right signals and it will save you real money.
- HPA on meaningful metrics (RPS or concurrency, not CPU alone):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "40"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60

- VPA for right-sizing memory hogs, capped to avoid thrash:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: "300m"
          memory: "256Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "2Gi"

- Karpenter/Cluster Autoscaler with consolidation to pack nodes efficiently:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: "2000"
  consolidation:
    enabled: true

- DB right-sizing: if you’re on Aurora/RDS, use Performance Insights to bound max connections and enable Aurora Serverless v2 for bursty workloads. Gate instance class changes behind the same latency/cost canary.
- Guardrails: if p95 is under budget and CPR is rising, scale down aggressively; if p95 is over budget and CPR is flat, allow scaling up.
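That guardrail is easy to express as a pure decision function; a sketch follows, with thresholds mirroring the checkout budgets above (the metric plumbing that feeds it is left out):

// Sketch of the scaling guardrail: compare live p95 and CPR against their budgets and
// pick a direction. Thresholds and inputs are illustrative, not a real API.
function scalingDecision({ p95Seconds, cprUsd }, budget = { p95: 0.75, cpr: 0.003 }) {
  const overLatency = p95Seconds > budget.p95;
  const overCost = cprUsd > budget.cpr;

  if (!overLatency && overCost) return 'scale-down';  // users fine, dollars not: reclaim capacity
  if (overLatency && !overCost) return 'scale-up';    // spend to protect the experience
  if (overLatency && overCost) return 'investigate';  // scaling alone will not fix this
  return 'hold';                                      // both budgets green: do nothing
}

// Example: p95 at 620ms with CPR at $0.0041 → 'scale-down'
console.log(scalingDecision({ p95Seconds: 0.62, cprUsd: 0.0041 }));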
Instrument cost next to latency
If you can’t see dollars next to milliseconds, you’ll optimize the wrong things.
- Tag everything (team, service, env) and pipe Cloud CUR + Kubecost to Prometheus.
- Compute Cost per Request in PromQL and chart it with p95.
# approximate CPR using Kubecost + request rate
sum(rate(kubecost_container_cpu_cost{namespace="checkout"}[5m])
  + rate(kubecost_container_memory_cost{namespace="checkout"}[5m]))
/
sum(rate(http_requests_total{namespace="checkout",route="/checkout"}[5m]))

- Add cost to canary analysis (see the earlier cost_per_req metric) and to your SLO burn alerts.
- Correlate latency/cost with business KPIs in one board: p95, CPR, conversion, AOV, error rate. No more swivel-chair analytics.
Make it stick: governance without the bureaucracy
The teams that sustain wins do three boring things:
- Error and latency budgets: treat them as spendable. If you burn budget, freeze features and fix. If you’re green, go ship.
- Change policy: no performance change without a flag, a canary, and a rollback. Use ArgoCD/GitOps so diffs tell the story.
- Quarterly calibration: revisit budgets. As features accrete, 750ms may become 850ms, or you may buy 100ms with image CDNs.
If your SLOs don’t drive a go/no-go decision in a deploy, they’re just wall art.
What it looks like when it works
At a subscription marketplace, we:
- Set a 700ms p95 budget for checkout and a $0.0028 CPR target.
- Added Redis read-through, two Postgres indexes, Brotli, and an aggregate cart endpoint.
- Right-sized with HPA-on-RPS and VPA caps; enabled Karpenter consolidation.
- Wired CPR + p95 into Flagger canaries and blocked three regressions before they hit 100%.
Results in 5 weeks:
- Checkout p95: 1.6s → 820ms.
- Infra cost: -38% for the services involved.
- Conversion: +2.1% overall, +3.4% on mobile.
- MTTR unchanged; fewer p99 spikes due to circuit breakers.
This is the playbook we run at GitPlumbers. No mysticism, no moonshots — just clear contracts, safe changes, and ruthless measurement.
Key takeaways
- Tie latency targets to business outcomes and cost per request. No more chasing vanity p99s.
- Define per-journey latency budgets and a cost SLO. Treat both as first-class constraints.
- Optimize with an explicit loop: baseline, target, change, canary, verify, codify.
- Focus on techniques that actually move user-facing metrics: cache, index, compress, precompute, and remove tail latency.
- Right-size automatically with HPA/VPA/Karpenter, but gate scale with latency and cost signals.
- Instrument cost next to latency in the same dashboard. Make trade-offs visible in dollars.
Implementation checklist
- Map top user journeys and set a p95 latency budget per path.
- Set a cost per request target alongside the latency SLO.
- Add Prometheus/Kubecost metrics for cost, and wire them into canary analysis.
- Right-size workloads with HPA/VPA; enable consolidation with Karpenter or cluster autoscaler.
- Ship optimizations behind flags, then canary with Flagger/Argo Rollouts against latency and cost.
- Codify wins (indexes, TTLs, cache keys) as migrations with rollback paths.
- Review latency budget and cost SLOs quarterly; adjust targets as product and infra evolve.
Questions we hear from teams
- How do I pick a realistic latency target?
- Start from observed conversion elasticity and device mix. If mobile dominates, target lower p95s on API endpoints (600–800ms) and 2–2.5s LCP for primary pages. Use current p95 as a baseline and improve in 10–20% increments rather than jumping to unrealistic p99s.
- Should we optimize p95 or p99?
- Optimize p95 for day-to-day experience and use p99 to detect tail risk. Most revenue correlates with p95; p99 is a stability signal. If p99 drives paging, add circuit breakers, timeouts, and retries to cap tails without gold-plating the entire stack.
- Do we need a FinOps tool to measure cost per request?
- You need cost allocation data, not necessarily a pricey tool. Start with Cloud CUR + tags and Kubecost. Export to Prometheus and compute CPR alongside your latency metrics. Upgrade tooling later if needed.
- Will autoscaling fix my latency?
- Only if compute is the bottleneck. In practice, it often isn’t. Autoscaling amplifies whatever you have: slow queries, chatty calls, cache misses. Fix those first, then let HPA/VPA keep you right-sized.
- Is serverless cheaper for this?
- Sometimes. Serverless shines for spiky, IO-heavy tasks where you can keep functions warm and avoid cross-service chatter. But for hot, high-RPS paths, containerized services with tuned autoscaling and caching are typically cheaper at scale.
