The Cache Stack That Halved p95 TTFB and Cut Our Cloud Bill by 38%

Multi‑tier caching that speeds up the user experience and shrinks your infra bill—without waking PagerDuty.

A boring cache that always serves something beats a clever system that occasionally serves nothing.

The outage that sold me on boring caches

Black Friday at a mid-market retailer. Frontend looked fine in staging. Then traffic spiked. p95 TTFB jumped from 400ms to 2.8s, LCP went red, and the database did its best impression of a SpaceX static fire. We had a Redis cluster, a CDN, and a dozen microservices. But the cache keys were sloppy, TTLs were all over the place, and thundering herds stampeded the origin whenever something expired.

We fixed it in a week by building a boring, multi-tier cache with predictable keys, soft/hard TTLs, and edge rules that actually match user behavior. p95 TTFB dropped 51% and infra spend fell 38% the following month. No heroics. Just a cache that does its job.

Measure what users feel (and what finance will thank you for)

If your cache strategy isn’t tied to user-facing metrics and dollars, you’re accumulating technical debt.

  • User metrics: p95 TTFB, p75 LCP, Apdex, and error rate during cache misses.
  • Cache efficiency: cache hit ratio (CHR), miss penalty (origin latency on miss), revalidation rate, and offload percentage at each tier.
  • Cost signals: origin QPS, DB CPU/IO, egress from origin, CDN egress (watch regional anomalies), and autoscale events.

Translate improvements into dollars. Example from that retailer:

  • p95 TTFB: 2.8s → 1.37s (51% faster)
  • Home/product listing CHR: 42% → 89%
  • DB QPS: 14k → 4.1k (able to downsize RDS by one class)
  • Cloud bill: -38% month-over-month (less origin egress, smaller DB, fewer app nodes)

The three-tier cache that actually works

You need layers that align with data volatility and personalization.

  1. Edge CDN (Cloudflare/Fastly/Akamai)

    • Cache public, semi-static responses. Use s-maxage, stale-while-revalidate, and stale-if-error.
    • Tag content with surrogate keys for precise invalidation.
    • Serve stale on errors to protect your SLOs.
  2. Shared in-memory (Redis/Memcached)

    • Cache JSON fragments and API responses behind your gateway.
    • Use cache-aside with soft/hard TTLs.
    • Request coalescing to avoid stampedes.
  3. Per-process in-memory (Caffeine for Java, lru-cache for Node, Guava for legacy)

    • Micro-caches for hot, tiny lookups (feature flags, config, small reference data) with 10–60s TTL (sketch after this list).
    • Keep the miss penalty low even if Redis blips.
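
For the third tier, here is a minimal sketch in Node/TypeScript, assuming the lru-cache package; the names, sizes, and TTLs are illustrative rather than the retailer's production values:

import { LRUCache } from 'lru-cache'

// Tiny per-process cache for hot, small lookups (feature flags, config rows).
// ttl is in milliseconds for lru-cache v7+; keep it short so staleness stays bounded.
const local = new LRUCache<string, object>({ max: 5_000, ttl: 30_000 })

// Hypothetical loader: fall back to Redis (or the origin) only on a local miss.
export async function getConfig(key: string, loadFromRedis: (k: string) => Promise<object>) {
  const hit = local.get(key)
  if (hit !== undefined) return hit

  const value = await loadFromRedis(key)
  local.set(key, value)
  return value
}

Even a 30-second TTL here means a Redis blip costs one slow lookup per key per process instead of a flood of origin calls.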

Example: NGINX as a shield with Redis behind

The gateway caches safe GETs for 5–10 minutes and serves stale on errors. Redis holds finer-grained API responses and page fragments for 1–3 minutes with soft refresh.

# NGINX edge-ish cache in front of app tier
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:200m inactive=60m max_size=20g;

server {
  listen 443 ssl;
  location /api/ {
    proxy_cache api_cache;
    proxy_cache_key "$request_method|$host|$uri|$args";
    proxy_cache_valid 200 10m;
    proxy_cache_bypass $http_cache_control;  # bypass when the client sends Cache-Control (e.g., no-cache)
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_lock on;                     # request coalescing
    proxy_cache_lock_timeout 10s;
    proxy_cache_background_update on;        # stale-while-revalidate
    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://app_upstream;
  }
}

Keys, TTLs, and invalidation without pager fatigue

If you can’t explain your cache key in one sentence, you don’t have one.

  • Key normalization: Only include parameters that change variants. Hash or drop the rest (normalizer sketch after this list).
    • Example key format: GET|shop.example.com|/product/123?currency=USD&country=US
  • Versioned assets: /main.9f3c2.js and /styles.a1b2.css end the “purge all” habit.
  • Surrogate keys/tags: Group objects you want to purge together without touching everything.
  • Conditional GET: Support ETag/If-None-Match and Last-Modified/If-Modified-Since to reduce bandwidth and load.
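
Here is a sketch of that normalization in TypeScript; the allow-list is hypothetical and would be defined per route, and unknown params are dropped (hash them into the key instead if they genuinely matter):

// Only these query params produce distinct variants; everything else is dropped.
// Illustrative allow-list -- yours will differ per route.
const VARIANT_PARAMS = ['currency', 'country']

export function cacheKey(method: string, rawUrl: string): string {
  const url = new URL(rawUrl)
  const kept = new URLSearchParams()
  for (const name of VARIANT_PARAMS) {
    const value = url.searchParams.get(name)
    if (value !== null) kept.set(name, value)
  }
  kept.sort() // ?country=US&currency=USD and the reverse map to the same key

  const query = kept.toString()
  return `${method}|${url.host}|${url.pathname}${query ? '?' + query : ''}`
}

// cacheKey('GET', 'https://shop.example.com/product/123?currency=USD&country=US&utm_source=ad')
// => 'GET|shop.example.com|/product/123?country=US&currency=USD'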

Fastly and Cloudflare both support surrogate keys/tags:

HTTP/1.1 200 OK
Cache-Control: public, s-maxage=600, stale-while-revalidate=30, stale-if-error=86400
Surrogate-Control: max-age=600
Surrogate-Key: product-123 product-listing category-42

Cloudflare Workers can add Cache-Tag for precise purge:

// worker.mjs
export default {
  async fetch(req, env, ctx) {
    // Only GETs are safely cacheable; pass everything else straight through.
    if (req.method !== 'GET') return fetch(req)

    const cache = caches.default
    const cached = await cache.match(req)
    if (cached) return cached

    const originRes = await fetch(req)
    // Re-wrap the response so its headers are mutable.
    const res = new Response(originRes.body, originRes)
    res.headers.set('Cache-Control', 'public, s-maxage=600, stale-while-revalidate=30, stale-if-error=86400')
    res.headers.set('Cache-Tag', 'product-123 product-listing')

    // Populate the edge cache without blocking the response.
    ctx.waitUntil(cache.put(req, res.clone()))
    return res
  }
}
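
Once responses carry tags, purging everything that mentions a product is a single API call. Here is a sketch against Cloudflare's purge endpoint, assuming an Enterprise plan (purge-by-tag requires it) and placeholder credentials:

// Purge every cached object tagged with the given tags, wherever it sits at the edge.
// CF_ZONE_ID and CF_API_TOKEN are placeholders for your own zone and token.
async function purgeByTag(tags: string[]): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${process.env.CF_ZONE_ID}/purge_cache`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ tags }),
    },
  )
  if (!res.ok) throw new Error(`purge failed: ${res.status}`)
}

// e.g. after a price update: await purgeByTag(['product-123'])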

Stop the thundering herd: coalesce, jitter, and go stale

I’ve seen well-meaning teams take Redis from 20% CPU to 95% just by lining up expirations on the minute. Don’t do synchronized cache misses.

  • Request coalescing: One in-flight recompute per key; everyone else waits or gets stale.
  • Soft TTL + hard TTL: After soft expiry, serve stale and refresh in background until hard expiry.
  • Jitter: Add ±10–20% of randomness to TTLs so expirations de-sync.
  • Probabilistic early expiration: Refresh with a probability that rises as the TTL approaches zero (sketch after the cache-aside example below).

Cache-aside with soft/hard TTL in Node/TypeScript:

import Redis from 'ioredis'

const redis = new Redis(process.env.REDIS_URL!)

const SOFT_TTL_MS = 60_000
const HARD_TTL_MS = 300_000

type CacheEntry<T> = { v: T; softExp: number; hardExp: number }

// One in-flight load per key, so concurrent misses trigger a single recompute.
const inflight = new Map<string, Promise<unknown>>()

async function getWithCache<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const now = Date.now()
  const raw = await redis.get(key)
  if (raw) {
    const parsed = JSON.parse(raw) as CacheEntry<T>
    if (parsed.hardExp > now) {
      if (parsed.softExp > now) return parsed.v
      // soft expired: serve stale and refresh in background
      refresh(key, loader).catch(() => {})
      return parsed.v
    }
  }
  // miss or hard expired: coalesce so only one caller recomputes
  return refresh(key, loader)
}

async function refresh<T>(key: string, loader: () => Promise<T>): Promise<T> {
  const existing = inflight.get(key)
  if (existing) return existing as Promise<T>
  const p = (async () => {
    const val = await loader()
    await put(key, val)
    return val
  })()
  inflight.set(key, p)
  try {
    return await p
  } finally {
    inflight.delete(key)
  }
}

async function put<T>(key: string, v: T) {
  const now = Date.now()
  // jitter the soft TTL so hot keys don't all go stale at once
  const jitter = Math.floor(SOFT_TTL_MS * (0.9 + Math.random() * 0.2))
  const entry: CacheEntry<T> = { v, softExp: now + jitter, hardExp: now + HARD_TTL_MS }
  await redis.set(key, JSON.stringify(entry), 'PX', HARD_TTL_MS)
}

Result: during a cold start or popular product refresh, the origin sees one recompute instead of 1,000.
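
The soft-TTL path above only refreshes once a key has already gone stale. Probabilistic early expiration spreads refreshes out before that point; here is a minimal XFetch-style sketch, where beta and the recompute-time estimate are assumed tuning knobs rather than measured values:

// XFetch-style early refresh: the closer we are to softExp, the more likely a
// request is to volunteer a recompute, so refreshes don't all land at expiry.
// beta > 1 refreshes earlier and more aggressively; recomputeMs approximates
// how long the loader typically takes.
function shouldRefreshEarly(softExpMs: number, recomputeMs = 200, beta = 1.0): boolean {
  const earlyBy = recomputeMs * beta * -Math.log(Math.random()) // exponential draw
  return Date.now() + earlyBy >= softExpMs
}

// Wired into getWithCache's still-fresh branch:
//   if (parsed.softExp > now) {
//     if (shouldRefreshEarly(parsed.softExp)) refresh(key, loader).catch(() => {})
//     return parsed.v
//   }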

Edge rules that save milliseconds and money

Your CDN is the cheapest, fastest cache you’ll ever buy. Use it fully.

  • Set s-maxage, stale-while-revalidate, stale-if-error on cacheable routes. Favor edge TTLs of 5–15 minutes for semi-static pages (middleware example after this list).
  • Vary only when needed: Vary: Accept-Encoding, Accept-Language is common; avoid varying on cookies unless truly personalized.
  • Purge with tags/keys instead of “purge all.” Fastly’s Surrogate Keys and Cloudflare Cache Tags are your friends.
  • Image optimization at edge: AVIF/WEBP plus width-based variants. This reduces origin CPU and bandwidth dramatically.
  • GraphQL reality check: Use persisted queries and a gateway that attaches cache hints (@cacheControl in Apollo). Cache lists separately from item detail.
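
A sketch of the first rule as Express-style middleware; the route classes, directives, and TTLs are illustrative, not a canonical policy:

import express from 'express'

const app = express()

// Directive presets per cacheability class (numbers are illustrative).
const CACHE_PROFILES = {
  'semi-static': 'public, s-maxage=600, stale-while-revalidate=60, stale-if-error=86400',
  'static-asset': 'public, max-age=31536000, immutable', // versioned URLs only
  personalized: 'private, no-store',
} as const

function edgeCache(profile: keyof typeof CACHE_PROFILES) {
  return (_req: express.Request, res: express.Response, next: express.NextFunction) => {
    res.set('Cache-Control', CACHE_PROFILES[profile])
    // Vary only on what actually changes the response body.
    res.set('Vary', 'Accept-Encoding, Accept-Language')
    next()
  }
}

app.get('/product/:id', edgeCache('semi-static'), (req, res) => {
  res.json({ id: req.params.id }) // placeholder handler
})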

Prometheus queries to watch CHR and the miss-penalty trend:

# 5m CHR at the CDN (example labels will vary)
sum(rate(http_requests_total{cache="HIT"}[5m]))
/ clamp_min(sum(rate(http_requests_total[5m])), 1)

# Miss penalty p95 at gateway
histogram_quantile(0.95, sum by (le) (rate(gateway_upstream_latency_seconds_bucket{cache="MISS"}[5m])))

Turn cache hit ratio into dollar savings

A simple way to communicate value to finance without hand-waving:

  • Baseline: 10M requests/day; origin CHR 45%; origin QPS avg ~115; DB at 60% CPU; egress 6 TB/mo.
  • After: CHR 85%; origin QPS avg ~60; DB at 30% CPU; egress 2.3 TB/mo.

Rough impact:

  • App nodes: 30–40% fewer autoscale events (steady-state savings + fewer spikes)
  • DB: one instance class down (or postpone the shard you’re dreading)
  • Bandwidth: 60% less origin egress; CDN bill up slightly, but net spend down

For the retailer, this mapped to ~38% infra savings. More importantly, p95 TTFB halved, cart abandonment fell ~7%, and the team slept through the next promo.
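
If finance wants the arithmetic, a back-of-envelope model like the one below communicates it; every constant is a placeholder to replace with numbers from your own bill:

// Back-of-envelope: translate a CHR improvement into monthly dollars.
// All constants are placeholders -- substitute figures from your own bill.
const REQUESTS_PER_MONTH = 300_000_000      // total edge requests
const RESPONSE_GB_PER_MONTH = 11_000        // total bytes served, in GB
const ORIGIN_COST_PER_MILLION_REQ = 45      // $: app nodes + DB, amortized (assumed)
const ORIGIN_EGRESS_COST_PER_GB = 0.09      // $ (assumed)

// Only misses reach the origin, so origin cost scales with (1 - CHR).
function monthlyOriginCost(chr: number): number {
  const missShare = 1 - chr
  const requestCost = (REQUESTS_PER_MONTH * missShare / 1_000_000) * ORIGIN_COST_PER_MILLION_REQ
  const egressCost = RESPONSE_GB_PER_MONTH * missShare * ORIGIN_EGRESS_COST_PER_GB
  return requestCost + egressCost
}

console.log('estimated savings/mo ≈ $', (monthlyOriginCost(0.45) - monthlyOriginCost(0.85)).toFixed(0))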

Rollout playbook and guardrails (the part that saves your weekend)

You don’t flip this on everywhere. You test it like you would a risky schema migration.

  1. Map cacheability: tag endpoints as static, semi-static, personalized, or dynamic.
  2. Canary by route and region: start with 5–10% traffic on one route in one region.
  3. Dashboards: CHR, miss penalty, origin QPS, p95 TTFB, error rate; compare canary vs control.
  4. Feature flags: flags per tier—edge, shared, local. Roll back layer-by-layer.
  5. Game days: simulate PURGE of hot keys and an origin brownout; verify stale is served and error budgets hold.
  6. Runbooks: include commands for invalidation, bypass headers, and emergency TTL overrides.

Example emergency header to bypass cache during an incident:

curl -H "Cache-Control: no-cache" -H "Pragma: no-cache" https://api.example.com/product/123

And a gateway-level override of the origin's TTL via Envoy route config:

# Envoy route config (snippet)
response_headers_to_add:
  - header:
      key: Cache-Control
      value: public, s-maxage=60, stale-while-revalidate=30, stale-if-error=86400
    append_action: OVERWRITE_IF_EXISTS

What I’d do differently next time

  • Don’t trust defaults. Most frameworks either don’t cache or cache dangerously.
  • Normalize keys up front. We lost days just chasing accidental variant explosions.
  • Bake in coalescing from day one. It’s harder to retrofit when traffic scales.
  • Budget time for dashboards and alerts. You can’t fix what you can’t see.
  • If AI-generated code touched your data layer, assume no caching, N+1 queries, and zero invalidation strategy. Do a quick vibe code cleanup before scaling traffic.

If this sounds like the mess you’re inheriting, GitPlumbers has a playbook. We’ll pair with your team, fix the hot paths, and leave you with dashboards, runbooks, and guardrails—no black boxes.


Key takeaways

  • Design caches to serve users first: optimize p95 TTFB/LCP and Apdex, then translate offload to dollars.
  • Use a three-tier cache: edge CDN, shared Redis, and per-process in‑memory with soft/hard TTLs.
  • Prevent thundering herds with request coalescing, jitter, and stale‑while‑revalidate.
  • Make invalidation boring: versioned URLs, surrogate keys/tags, conditional GET, and cache-aware releases.
  • Instrument hit ratio, miss penalty, and origin QPS; alert on trend deltas, not absolute values.
  • Roll out with canaries and feature flags; run game days for cache flush and origin brownouts.

Implementation checklist

  • Map endpoints to cacheability; categorize as static, semi-static, user-personalized, and dynamic.
  • Define cache keys explicitly; normalize query params and headers that change variants.
  • Set soft/hard TTL policy per resource; enable stale-while-revalidate and stale-if-error at edge.
  • Implement request coalescing for hot keys; add jitter and probabilistic early expiration.
  • Tag content with surrogate keys; use versioned URLs for assets and bust only what changed.
  • Instrument CHR, miss penalty, origin QPS, p95 TTFB/LCP; wire dashboards and alerts.
  • Create a rollback plan: disable cache layer by layer via flags; document runbooks.
  • Pilot on 5–10% of traffic; compare deltas; then scale by region.

Questions we hear from teams

How do we cache personalized pages without leaking data?
Separate truly personalized content from cacheable shells. Cache the shell (HTML or JSON) at the edge and hydrate user-specific data via client-side or server-side includes with strict `Vary` rules. Use `Vary: Cookie` only when necessary and prefer signed, scoped tokens for per-user fragments in Redis with short TTLs.
Should we use Redis or Memcached?
Redis for most cases: richer data types, Lua/ACLs, Streams, and easier clustering. Memcached is fine for simple key/value with very low latency and huge object counts. We usually standardize on Redis 6+ with cluster mode, latency monitoring, and eviction policy `allkeys-lru` for fragment caches.
Is write-through better than cache-aside?
For reads, cache-aside is simplest and keeps cache failure from taking down writes. For hot, frequently updated resources (e.g., inventory counts), write-through or write-behind can work—paired with short TTLs and idempotent updates. Measure miss penalty and staleness tolerance before choosing.
How do we prevent cache poisoning at the edge?
Whitelist which headers/params affect cache keys. Strip untrusted headers, ignore query params by default, and sign any keys that depend on user input. Use WAF rules and content validation for HTML/JS. Enable response size limits to avoid oversized cache entries.
What’s a good hit ratio target?
Depends on your mix. Static assets should be 95%+. Semi-static pages (home, category, PDP) 70–90%. API fragments 60–85%. Track hit ratio alongside miss penalty; a high CHR with brutal misses still feels bad to users.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a cache architecture review
Download the cache rollout checklist
