The Latency Budget That Cut Our Cloud Bill 38% Without Slowing Users

A battle-tested framework to tie p95s to dollars, not dashboards — and ship faster without torching your cloud budget.

If your SLOs don’t drive a go/no-go decision in a deploy, they’re just wall art.

The Black Friday “AutoScale” that doubled our bill — and didn’t save conversion

We got called into a retailer whose checkout p95 spiked to 2.1s during promos. Their response was the classic “throw nodes at it” HPA policy — throughput stabilized, but cloud costs doubled in a day and conversion didn’t budge. I’ve seen this movie at marketplaces, adtech, and fintech: chasing raw throughput with autoscaling while the real issue is unindexed queries, chatty services, and image bloat.

What actually turned it around wasn’t a shinier instance class. It was a latency budget per user journey and a cost per request target. We optimized until both were green, then stopped. Result: checkout p95 dropped from 1.8s to 900ms, infra cost fell 38%, and conversion ticked up 2.3%. No heroics, just discipline.

Define a performance-to-cost contract per user journey

If you don’t make targets explicit, engineers will optimize the fun parts and ignore the expensive parts.

  • Pick the top 3–5 user journeys: e.g., home → PDP → cart → checkout, or search → detail → add-to-cart.
  • For each, set a p95 latency budget based on observed conversion elasticity. I like 750ms p95 for checkout API and 2.5s LCP for the UI.
  • Add a cost per request (CPR) target. Example: ≤ $0.003 per checkout API call all-in (compute + DB + CDN egress).
  • Write them down as SLOs you can alert on, not aspirational wiki pages.

Example using OpenSLO for latency plus a simple custom CostBudget (we use a CRD backed by Kubecost/Cloud CUR):

# latency-slo.yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: checkout-latency-slo
spec:
  service: checkout
  timeWindow:
  - duration: 30d
    isRolling: true
  budgetingMethod: Occurrences
  indicator:
    metadata:
      name: checkout-p95-latency
    spec:
      thresholdMetric:
        metricSource:
          type: Prometheus
          spec:
            query: |
              histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le))
  objectives:
  - displayName: p95 latency under 750ms
    op: lte
    value: 0.75 # seconds
    target: 0.99
---
# cost-budget.yaml (custom CRD backed by Kubecost)
apiVersion: finops.gitplumbers.dev/v1alpha1
kind: CostBudget
metadata:
  name: checkout-cost-budget
spec:
  window: 30d
  targetCostPerRequestUsd: 0.003
  selector:
    matchLabels:
      app: checkout

Treat the latency budget and cost budget as a contract: any change that improves one while violating the other needs a product or finance sign-off.
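
The "alert on it" part can be as simple as a Prometheus alert against the same query the SLO uses. A minimal sketch using the Prometheus Operator's PrometheusRule; the duration, severity label, and rule names are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-latency-budget
spec:
  groups:
  - name: checkout-slo
    rules:
    - alert: CheckoutP95OverBudget
      expr: |
        histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)) > 0.75
      for: 10m
      labels:
        severity: page
      annotations:
        summary: Checkout p95 has been over its 750ms budget for 10 minutes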

The optimization loop: measure, target, verify

Here’s the loop we run with clients. It’s boring; it works.

  1. Baseline: capture 14 days of p50/p95/p99 per key route, error rate, throughput, and CPR. Segment by device, region, and plan.
  2. Prioritize: find the 3 endpoints where reducing p95 by 200ms will move revenue (e.g., checkout, search autocomplete, auth). Ignore the rest.
  3. Hypothesize: choose techniques with quantified expectation (e.g., cache warm product details → -300ms p95, +$0.0002 CPR).
  4. Protect: wrap changes in a feature flag and set canary analysis gates on latency AND cost.
  5. Verify: promote if both SLOs are met; rollback if latency/cost regressions exceed thresholds.
  6. Codify: lock the win via infra as code — indexes, HPA/VPA, TTLs, rate limits, dashboards.

Canary analysis using Flagger with custom metrics for latency and CPR:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 80 # Flagger requires a service definition; the port here is illustrative
  analysis:
    interval: 1m
    threshold: 10
    stepWeight: 10
    maxWeight: 50
    metrics:
    - name: p95_latency
      templateRef:
        name: latency
      thresholdRange:
        max: 0.75 # seconds
    - name: request-success-rate # Flagger built-in; equivalent to capping error rate at 1%
      thresholdRange:
        min: 99
    - name: cost_per_req
      templateRef:
        name: cpr
      thresholdRange:
        max: 0.003 # USD
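
The p95_latency and cost_per_req entries reference Flagger MetricTemplates named latency and cpr. The latency one can wrap the same histogram_quantile query as the SLO; here is a sketch of what cpr might look like, assuming a reachable Prometheus (the address is a placeholder) and the Kubecost-derived metric names used later in the cost section:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cpr
  namespace: web
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090 # placeholder; point at your Prometheus
  query: |
    sum(rate(kubecost_container_cpu_cost{namespace="{{ namespace }}"}[{{ interval }}])
      + rate(kubecost_container_memory_cost{namespace="{{ namespace }}"}[{{ interval }}]))
    /
    sum(rate(http_requests_total{namespace="{{ namespace }}",route="/checkout"}[{{ interval }}]))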

Techniques that move metrics (and dollars)

Skip the exotic and start with the boring winners.

  • Cache the hot path (5–10x ROI):
    • Read-through Redis/Memcached for product, pricing, feature flags.
    • TTL 300–900s; invalidate on writes.
// Node.js read-through cache for product details (node-redis v4 style client and a db module assumed)
async function getProductCached(id) {
  const cacheKey = `product:${id}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached); // cache hit: skip the DB round trip

  const product = await db.getProduct(id); // cache miss: read once from the DB
  // EX = TTL in seconds (matches the 300–900s guidance); NX avoids clobbering a concurrent writer
  await redis.set(cacheKey, JSON.stringify(product), { EX: 300, NX: true });
  return product;
}
  • Index where your query planner hurts (often 30–60% p95 drop):
-- Speed up "open orders by user" lookups
CREATE INDEX CONCURRENTLY idx_orders_user_created_at
ON orders (user_id, created_at DESC)
WHERE status = 'open';
  • Kill chatty services (tail latency killer):

    • Collapse N round trips into 1 aggregate endpoint.
    • Use gRPC/HTTP/2 for multiplexing if you must keep calls separate.
  • Compress and resize assets at the edge (LCP wins without CPU burns):

# NGINX/Ingress compression (add Brotli if supported)
gzip on;
gzip_types text/css application/javascript application/json;
brotli on;
brotli_types text/plain text/css application/javascript application/json;
  • Precompute the expensive: materialize top 1% queries hourly (e.g., search facets, personalized recommendations) and serve from a KV store.

  • Add circuit breakers and timeouts to protect the p95 from one slow dependency (circuit breaking below; a timeout sketch follows this list):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 50
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
  • Push work off the critical path: enqueue non-essential writes (emails, analytics, thumbnails) to a queue; return to user fast.
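
On the timeout half of the circuit-breaker bullet: in Istio, timeouts and retries live on the VirtualService rather than the DestinationRule. A minimal sketch with illustrative values:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    timeout: 2s # cap the worst case a single dependency can add to your p95
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: 5xx,reset,connect-failure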

Measured outcomes we’ve repeated across clients:

  • Redis + index: checkout p95 1.2s → 720ms; DB CPU -35%; CPR -$0.0011.
  • Brotli + image resizing: LCP 3.1s → 2.2s on 3G; egress -22%.
  • Aggregation endpoint: tail p99 4.8s → 1.6s; HPA max replicas -40%.

Scale smart: autoscaling and right-sizing without surprises

Autoscaling isn’t a strategy; it’s a control loop. Feed it the right signals and it will save you real money.

  • HPA on meaningful metrics (RPS or concurrency, not CPU alone); the requests_per_second metric assumes a custom-metrics adapter, sketched after this list:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  minReplicas: 4
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "40"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 20
        periodSeconds: 60
  • VPA for right-sizing memory hogs, capped to avoid thrash:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: "300m"
        memory: "256Mi"
      maxAllowed:
        cpu: "2000m"
        memory: "2Gi"
  • Karpenter/Cluster Autoscaler with consolidation to pack nodes efficiently:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
  limits:
    resources:
      cpu: "2000"
  consolidation:
    enabled: true
  • DB right-sizing: on Aurora/RDS, use Performance Insights to spot connection saturation and oversized instance classes, cap max connections in the parameter group, and consider Aurora Serverless v2 for bursty workloads. Gate instance class changes behind the same latency/cost canary.

  • Guardrails: if p95 is under budget and CPR is rising, scale down aggressively; if p95 is over budget and CPR is flat, allow scaling up.
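
About that requests_per_second metric in the HPA: Kubernetes does not expose it natively, so it typically comes from an adapter such as prometheus-adapter. A sketch of the adapter rule that could surface it, assuming an http_requests_total counter labeled by namespace and pod:

# prometheus-adapter config (rules section)
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'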

Instrument cost next to latency

If you can’t see dollars next to milliseconds, you’ll optimize the wrong things.

  • Tag everything (team, service, env) and pipe Cloud CUR + Kubecost to Prometheus.
  • Compute Cost per Request in PromQL and chart it with p95.
# approximate CPR using Kubecost cost metrics + request rate (exact metric names depend on your Kubecost setup)
sum(rate(kubecost_container_cpu_cost{namespace="checkout"}[5m])
  + rate(kubecost_container_memory_cost{namespace="checkout"}[5m]))
/
sum(rate(http_requests_total{namespace="checkout",route="/checkout"}[5m]))
  • Add cost to canary analysis (see earlier cost_per_req metric) and to your SLO burn alerts.
  • Correlate latency/cost with business KPIs in one board: p95, CPR, conversion, AOV, error rate. No more swivel-chair analytics.
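
To keep dashboards, burn alerts, and the Flagger cpr template pointed at one definition of CPR, a Prometheus recording rule is an option. A sketch, reusing the Kubecost-derived metric names from the query above:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-cpr
spec:
  groups:
  - name: checkout-cost
    rules:
    - record: checkout:cost_per_request_usd:rate5m
      expr: |
        sum(rate(kubecost_container_cpu_cost{namespace="checkout"}[5m])
          + rate(kubecost_container_memory_cost{namespace="checkout"}[5m]))
        /
        sum(rate(http_requests_total{namespace="checkout",route="/checkout"}[5m]))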

Make it stick: governance without the bureaucracy

The teams that sustain wins do three boring things:

  • Error and latency budgets: treat them as spendable. If you burn budget, freeze features and fix. If you’re green, go ship.
  • Change policy: no performance change without a flag, a canary, and a rollback. Use ArgoCD/GitOps so diffs tell the story.
  • Quarterly calibration: revisit budgets. As features accrete, 750ms may become 850ms, or you may buy 100ms with image CDNs.

If your SLOs don’t drive a go/no-go decision in a deploy, they’re just wall art.

What it looks like when it works

At a subscription marketplace, we:

  • Set a 700ms p95 budget for checkout and a $0.0028 CPR target.
  • Added Redis read-through, two Postgres indexes, Brotli, and an aggregate cart endpoint.
  • Right-sized with HPA-on-RPS and VPA caps; enabled Karpenter consolidation.
  • Wired CPR + p95 into Flagger canaries and blocked three regressions before they hit 100%.

Results in 5 weeks:

  • Checkout p95: 1.6s → 820ms.
  • Infra cost: -38% for the services involved.
  • Conversion: +2.1% overall, +3.4% on mobile.
  • MTTR unchanged; fewer p99 spikes due to circuit breakers.

This is the playbook we run at GitPlumbers. No mysticism, no moonshots — just clear contracts, safe changes, and ruthless measurement.

Key takeaways

  • Tie latency targets to business outcomes and cost per request. No more chasing vanity p99s.
  • Define per-journey latency budgets and a cost SLO. Treat both as first-class constraints.
  • Optimize with an explicit loop: baseline, target, change, canary, verify, codify.
  • Focus on techniques that actually move user-facing metrics: cache, index, compress, precompute, and remove tail latency.
  • Right-size automatically with HPA/VPA/Karpenter, but gate scale with latency and cost signals.
  • Instrument cost next to latency in the same dashboard. Make trade-offs visible in dollars.

Implementation checklist

  • Map top user journeys and set a p95 latency budget per path.
  • Set a cost per request target alongside the latency SLO.
  • Add Prometheus/Kubecost metrics for cost, and wire them into canary analysis.
  • Right-size workloads with HPA/VPA; enable consolidation with Karpenter or cluster autoscaler.
  • Ship optimizations behind flags, then canary with Flagger/Argo Rollouts against latency and cost.
  • Codify wins (indexes, TTLs, cache keys) as migrations with rollback paths.
  • Review latency budget and cost SLOs quarterly; adjust targets as product and infra evolve.

Questions we hear from teams

How do I pick a realistic latency target?
Start from observed conversion elasticity and device mix. If mobile dominates, target lower p95s on API endpoints (600–800ms) and 2–2.5s LCP for primary pages. Use current p95 as a baseline and improve in 10–20% increments rather than jumping to unrealistic p99s.
Should we optimize p95 or p99?
Optimize p95 for day-to-day experience and use p99 to detect tail risk. Most revenue correlates with p95; p99 is a stability signal. If p99 drives paging, add circuit breakers, timeouts, and retries to cap tails without gold-plating the entire stack.
Do we need a FinOps tool to measure cost per request?
You need cost allocation data, not necessarily a pricey tool. Start with Cloud CUR + tags and Kubecost. Export to Prometheus and compute CPR alongside your latency metrics. Upgrade tooling later if needed.
Will autoscaling fix my latency?
Only if compute is the bottleneck. In practice, it often isn’t. Autoscaling amplifies whatever you have: slow queries, chatty calls, cache misses. Fix those first, then let HPA/VPA keep you right-sized.
Is serverless cheaper for this?
Sometimes. Serverless shines for spiky, IO-heavy tasks where you can keep functions warm and avoid cross-service chatter. But for hot, high-RPS paths, containerized services with tuned autoscaling and caching are typically cheaper at scale.


  • Talk to GitPlumbers about a latency and cost budget sprint
  • Download the latency + cost SLO templates (OpenSLO + PromQL)
