Stop Recomputing the Same Bytes: Caching Architectures That Cut p95 In Half and Your Cloud Bill by a Third

Design a cache stack that hits user-facing SLOs and trims your infra bill—without gambling on consistency.

Cache is the cheapest compute you’ll ever buy—if you design it, measure it, and invalidate it on purpose.

The $120k Month We Could Have Avoided

A retail client was paying through the nose for “dynamic” pages that were 95% identical across users. Fastly passed every request through to origin because a cookie set by a marketing pixel nuked cacheability. p95 TTFB hovered at 1.2s during traffic spikes, and origin egress/compute hit a painful peak. We fixed three headers, added a surrogate key, and pushed stale-while-revalidate.

Thirty days later: edge hit ratio 92%, origin offload 78%, p95 TTFB down to 450ms, and cloud spend dropped 34%. Same app. Same code. Just smarter caching.

Start With the Metrics That Move the Business

If you can’t tie caching to business outcomes, it’s a hobby. Anchor on:

  • User-facing SLOs: p95 TTFB, p95/p99 API latency, LCP (Core Web Vitals). If you’re e-comm, every 100ms off TTFB often correlates with measurable conversion lift (a pattern Amazon and Shopify have both reported).
  • Origin offload: target % of requests served from edge/service cache. 70–90% is common for content/product APIs.
  • Miss penalty: p95 time to serve a cache miss (including downstream). Budget it.
  • Cost: compute-hours, DB QPS, and egress. Cache wins show up as fewer container-hours and smaller DBs.

Set explicit targets per surface. Example goals:

  • Product detail API: p95 < 150ms, offload ≥ 85%, miss penalty < 350ms
  • Category page HTML: p95 TTFB < 500ms, offload ≥ 75%, zero “global purge” incidents per quarter
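
These ratios are easy to compute from raw counters when you set targets like the ones above; a minimal sketch (function names and sample numbers are illustrative, not from any specific tooling):

```typescript
/** Fraction of requests served from cache (edge or service layer). */
function hitRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

/** Origin offload: share of total requests the origin never saw. */
function originOffload(totalRequests: number, originRequests: number): number {
  return totalRequests === 0 ? 0 : (totalRequests - originRequests) / totalRequests;
}

// Example: 9,200 edge hits and 800 misses is a 92% hit ratio;
// 2,200 of 10,000 requests reaching origin is 78% offload.
console.log(hitRatio(9200, 800));        // 0.92
console.log(originOffload(10000, 2200)); // 0.78
```

Track both per surface: a high hit ratio on a low-traffic path can hide a poor offload number where it matters.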

A Layered Cache Architecture That Actually Works

Stop arguing “Redis vs CDN.” You need layers, each with a job and owner:

  • Browser: honor Cache-Control, ETag, Last-Modified. Small TTLs on static assets; immutable where safe.
  • CDN/Edge (Fastly/Cloudflare/Akamai): cache shared, anonymous-friendly responses. Use surrogate keys for precise purges. Enable stale-while-revalidate and stale-if-error.
  • Gateway/Proxy (Nginx/Envoy/Varnish): coalesce requests, normalize headers/cookies, and cache authenticated-but-public data (e.g., catalog).
  • Service cache (Redis/Memcached): cache expensive query results and API aggregations with cache-aside. Keep hot sets in memory.
  • In-process LRU (Caffeine/Ristretto/Guava): micro-caches for micro-latency (5–50ms savings) and as a buffer when Redis blips.

Ownership matters: Platform/SRE owns edge/gateway policy; service teams own keys/TTLs and invalidation for their domains.
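
The in-process layer in the list above can be as small as a Map with TTL and LRU eviction; a hedged sketch (sizes and TTLs are placeholders, and a production service would reach for Caffeine/Ristretto/Guava instead):

```typescript
// Minimal in-process TTL + LRU micro-cache. Relies on Map preserving
// insertion order: re-inserting on read makes the oldest key the LRU victim.
class MicroCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxEntries = 1000, private ttlMs = 5000) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) { // expired: drop and report a miss
      this.store.delete(key);
      return undefined;
    }
    this.store.delete(key);             // refresh recency
    this.store.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.store.size >= this.maxEntries) { // evict least-recently-used
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const c = new MicroCache<string>(2, 60000);
c.set('a', '1'); c.set('b', '2');
c.get('a');              // touch 'a' so 'b' becomes least recently used
c.set('c', '3');         // evicts 'b'
console.log(c.get('b')); // undefined
```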

Keys, TTLs, and Invalidation You Can Live With

Caching fails when invalidation is a rumor. Here’s the playbook:

  • Key design: resource:{version}:{tenant}:{id}?{normalized-query}. Version keys when schema/logic changes; rotate during deploys.
  • TTL strategy: pick a base TTL (e.g., 10m), add jitter (±10–20%) to avoid synchronized expiry. Critical pages: 1–5m at edge, 10–30m in Redis, seconds in-process.
  • Cache-aside (read-through by code): on miss, fetch from origin, set cache, return. Simple, explicit ownership.
  • Write-through: on write, update DB and cache key synchronously. Good for leaderboards/hot objects.
  • Write-behind: buffer writes, update DB async. Use carefully; needs durability guarantees.
  • Stale-While-Revalidate (SWR): serve stale content for X seconds while refreshing in background. Users stay fast; origin breathes.
  • Negative caching: cache 404/empty results briefly (e.g., 30–60s) to squelch repeated misses.
  • Purge precisely: use surrogate keys/tags. Never global purge as a habit.
  • HTTP semantics: send ETag and handle If-None-Match. Conditional GETs cut bytes and time.
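
The key design bullet above can be sketched as a small builder: versioned, tenant-scoped, with a sorted and whitelisted query string so parameter order and junk params (UTM tags, etc.) don't fragment the cache. The version constant and parameter whitelist here are illustrative:

```typescript
const KEY_VERSION = 'v3'; // bump on schema/logic changes to rotate keys

function cacheKey(
  resource: string,
  tenant: string,
  id: string,
  query: Record<string, string> = {}
): string {
  // Whitelist and sort params so "?b=2&a=1" and "?a=1&b=2" map to one key.
  const allowed = ['page', 'sort', 'limit'];
  const normalized = Object.keys(query)
    .filter((k) => allowed.includes(k))
    .sort()
    .map((k) => `${k}=${query[k]}`)
    .join('&');
  return `${resource}:${KEY_VERSION}:${tenant}:${id}${normalized ? '?' + normalized : ''}`;
}

console.log(cacheKey('product', 'acme', '123', { sort: 'price', page: '2' }));
// product:v3:acme:123?page=2&sort=price
```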

Headers that tend to work:

Cache-Control: public, max-age=300, s-maxage=600, stale-while-revalidate=300, stale-if-error=86400
ETag: "v2-<content-hash>"
Vary: Accept-Encoding, Accept-Language
Surrogate-Key: product:123 catalog:summer
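
The ETag line above pairs with If-None-Match handling on the server; a minimal sketch using Node's crypto (the handler shape and version prefix are illustrative):

```typescript
import { createHash } from 'crypto';

// Compute a content-hash ETag like the "v2-<content-hash>" header above.
function etagFor(body: string, version = 'v2'): string {
  const hash = createHash('sha256').update(body).digest('hex').slice(0, 16);
  return `"${version}-${hash}"`;
}

// Conditional GET: if the client's If-None-Match matches, skip the body (304).
function respond(body: string, ifNoneMatch?: string): { status: number; body?: string; etag: string } {
  const etag = etagFor(body);
  if (ifNoneMatch === etag) return { status: 304, etag };
  return { status: 200, body, etag };
}

const first = respond('{"id":123}');
console.log(first.status);                              // 200
console.log(respond('{"id":123}', first.etag).status);  // 304: content unchanged
```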

Stampede Control and Consistency: Don’t Melt Your Origin

Everything is fine until a hot key expires during peak. Techniques that work in production:

  • Request coalescing at the proxy: one miss triggers one origin fetch; others wait or get stale.
  • Background refresh: refresh soon-to-expire keys asynchronously. Emit metrics to cap concurrency.
  • Locking: Redis `SET lock:key 1 NX PX 30000` before refresh; if the lock fails, serve stale.
  • Grace/stale: stale-if-error=86400; when downstream is flaky, keep users fast.
  • Pre-warm on deploy: hit top N keys after a cold cache (load test or worker job).
  • Canary TTL changes: roll TTL policy to 5–10% of traffic and watch miss penalty and error budget.

Set SLO guardrails: if hit ratio drops by >10 points and miss penalty > target for 5m, alert and auto-widen SWR grace.
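
Request coalescing from the list above can be sketched in-process with a shared promise per key (a simplified single-flight, not a proxy-level implementation; names are illustrative):

```typescript
// Single-flight: concurrent misses for the same key share one origin fetch.
const inFlight = new Map<string, Promise<string>>();

async function coalesced(key: string, fetchOrigin: () => Promise<string>): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) return existing; // piggyback on the in-progress fetch
  const p = fetchOrigin().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Ten concurrent callers, one origin hit.
let originCalls = 0;
const slowOrigin = () =>
  new Promise<string>((resolve) => {
    originCalls++;
    setTimeout(() => resolve('payload'), 50);
  });

Promise.all(Array.from({ length: 10 }, () => coalesced('hot-key', slowOrigin)))
  .then((results) => console.log(originCalls, results.length)); // 1 10
```

The same idea is what `proxy_cache_lock` in Nginx and `req.hash_ignore_busy`-style waiting in Varnish give you at the proxy layer.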

Concrete Configs You Can Copy-Paste

A few proven snippets that save real money.

Nginx as API cache with stale-while-revalidate

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:100m max_size=5g inactive=60m use_temp_path=off;

map $http_authorization $cache_bypass {
  default 1;
  "" 0;
}

server {
  listen 443 ssl;
  location /api/ {
    proxy_pass http://upstream;

    proxy_cache api_cache;
    proxy_cache_key "$request_method|$scheme://$host$request_uri";
    proxy_ignore_headers Set-Cookie;

    proxy_cache_valid 200 301 302 10m;
    proxy_cache_background_update on; # SWR-like behavior
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_bypass $cache_bypass; # skip cache lookup for authenticated requests
    proxy_no_cache $cache_bypass;     # and never store their responses

    add_header X-Cache-Status $upstream_cache_status always;
  }
}

Cloudflare Worker: set SWR and purge by tag

export default {
  async fetch(req: Request, env: any) {
    const url = new URL(req.url);

    if (req.method === 'PURGE' && url.searchParams.has('tag')) {
      const resp = await fetch(`https://api.cloudflare.com/client/v4/zones/${env.ZONE_ID}/purge_cache`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.API_TOKEN}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ tags: [url.searchParams.get('tag')] })
      });
      return new Response(await resp.text(), { status: resp.status });
    }

    const cache = caches.default;
    let res = await cache.match(req);
    if (!res && req.method === 'GET') {
      res = await fetch(req);
      const headers = new Headers(res.headers);
      headers.set('Cache-Control', 'public, max-age=600, stale-while-revalidate=3600, stale-if-error=86400');
      headers.set('Cache-Tag', 'product:123');
      const cached = new Response(res.body, { status: res.status, headers });
      if (res.ok) await cache.put(req, cached.clone()); // only store successful responses
      return cached;
    }
    return res ?? fetch(req); // non-GET misses go straight to origin
  }
};

Cache-aside with TTL jitter and lock (TypeScript + Redis)

import { createClient } from 'redis';
const redis = createClient();
await redis.connect(); // node-redis v4 requires an explicit connect

async function getProduct(id: string) {
  const key = `product:v3:${id}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Stampede lock
  const lockKey = `lock:${key}`;
  const gotLock = await redis.set(lockKey, '1', { NX: true, PX: 30000 });

  if (!gotLock) {
    // Another worker is refreshing; serve stale if present
    const stale = await redis.get(key + ':stale');
    if (stale) return JSON.parse(stale);
    // No stale copy: fall through and fetch origin ourselves
  }

  const fresh = await fetchOrigin(id); // your DB/service call

  const baseTtlSec = 600; // 10m
  // Jitter ±10% so hot keys don't expire in lockstep
  const jitter = Math.floor((Math.random() - 0.5) * 0.2 * baseTtlSec);
  const ttl = baseTtlSec + jitter;

  await redis.setEx(key, ttl, JSON.stringify(fresh));
  // Keep a longer-lived stale copy
  await redis.setEx(key + ':stale', 3600, JSON.stringify(fresh));
  if (gotLock) await redis.del(lockKey); // only release a lock we actually hold

  return fresh;
}

Varnish: grace + surrogate keys

sub vcl_backend_response {
  set beresp.ttl = 10m;
  set beresp.grace = 1h; # serve stale if origin is slow or failing
  if (beresp.http.Surrogate-Key) {
    set beresp.http.Surrogate-Control = "max-age=600, stale-while-revalidate=3600, stale-if-error=86400";
  }
}

Quick verification

# First GET warms the cache (MISS); repeat and expect HIT
# (use a real GET: with $request_method in the cache key, HEAD caches separately)
curl -s -o /dev/null -D - https://api.example.com/products/123 | grep -i x-cache-status

Proving ROI: How to Measure and Socialize the Win

Cache work pays for itself fast when you measure the right things:

  • Edge hit ratio and origin offload: from CDN logs (Fastly real-time stats, Cloudflare analytics). Target ≥ 80% for cacheable surfaces.
  • Service cache hit ratio: counters in Prometheus: cache_hits_total, cache_misses_total per keyspace.
  • Miss penalty: histogram of miss latencies; show p95 improvements after SWR/locking.
  • Downstream QPS: DB and microservice calls before/after; you want 30–70% reductions on hot paths.
  • User metrics: LCP/TTFB (RUM via Boomerang/SpeedCurve/Datadog RUM). Conversion/retention change if you’re B2C.

Real numbers we’ve delivered at GitPlumbers in the last year:

  • SaaS analytics vendor: API p95 from 780ms → 290ms, origin offload 82%, BigQuery costs -41%.
  • DTC retailer: HTML TTFB p95 from 1.2s → 450ms, infra spend -34%, CVR +0.8pp.
  • Fintech dashboard: service cache added to expensive aggregations; DB QPS -63%, p99 tail slashed by 52%.

Tell the story with joined dashboards: edge → proxy → service → DB. Executives love a single chart that shows latency down and dollars saved.

Pitfalls to Dodge (I’ve Seen These Take Down Launches)

  • Auth leakage: don’t cache personalized content in shared caches. Use Cache-Control: private or vary on auth/tenant. Strip unneeded cookies at the edge.
  • Cache fragmentation: wild Vary headers and marketing cookies nuke hit ratios. Normalize/whitelist.
  • Poisoning: validate hosts, strip hop-by-hop headers, pin to upstreams; restrict Purge to CI tokens.
  • GraphQL: naive caching fails due to POST and mixed fields. Cache resolvers’ data at service layer; add GET for idempotent queries when possible.
  • Multi-tenant bleed: include tenant/org in keys. Don’t rely on headers alone.
  • Overlong TTLs: product prices and inventory need fast purges. Use surrogate-key purges wired to your PIM/ERP events.
  • Global purges: last resort only. If you need them weekly, your invalidation design is broken.

Cache is a contract, not a best-effort hint. Treat it like an API: version it, test it, monitor it, and roll it back when it misbehaves.

Key takeaways

  • Design caching around user-facing SLOs (p95/p99) and origin offload targets, not just CPU graphs.
  • Layer caches: browser → CDN/edge → gateway/proxy → service/Redis → in-process LRU.
  • Use sane keys/TTLs: versioned keys, jitter, surrogate keys, and SWR to buy consistency and avoid stampedes.
  • Measure miss penalty and hit ratio per layer; budget for cache misses in your SLOs.
  • Purge precisely (tags/keys), not globally; automate via CI/CD and release pipelines.
  • Harden against stampedes with request coalescing, locks, and stale-on-error.
  • Keep auth/PII out of shared caches; use Vary and cookies sparingly to prevent cache fragmentation.

Implementation checklist

  • Define SLOs: p95 TTFB/LCP and origin offload% per surface (pages, APIs).
  • Map your cache layers and ownership: edge, gateway, service, in-proc.
  • Standardize headers: Cache-Control, ETag/If-None-Match, Surrogate-Key, SWR.
  • Pick strategies per data class: cache-aside for reads, write-through for hot keys.
  • Implement stampede controls: background refresh, request coalescing, Redis locks, stale-if-error.
  • Version keys and add TTL jitter to avoid synchronized expiry.
  • Add precise purging (tags/keys) wired to your deploy pipeline.
  • Instrument hit/miss, miss penalty, and downstream QPS; ship dashboards and alerts.
  • Run an A/B or canary for cache policy changes; watch p95 and error budgets.
  • Document cacheability contracts and ownership; run quarterly “cache fire drills.”

Questions we hear from teams

How do I choose TTLs without breaking freshness?
Classify data. For content that changes infrequently (marketing pages), 5–30 minutes at edge is safe with surrogate-key purges on publish. For semi-dynamic data (catalog, non-personalized pricing), 1–10 minutes edge, 10–30 minutes service cache, with SWR 10–60 minutes. For highly dynamic or personalized data (cart, balances), avoid shared edge caching; use short in-process/Redis TTLs and ETags for conditional requests.
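
The classification above can be captured as a small policy table; the values mirror the ranges in the answer, and the names and header-building helper are illustrative:

```typescript
// TTLs in seconds, one row per data class described above; tune per surface.
type DataClass = 'static-content' | 'semi-dynamic' | 'personalized';

interface TtlPolicy { edgeTtl: number; serviceTtl: number; swr: number }

const TTL_POLICY: Record<DataClass, TtlPolicy> = {
  'static-content': { edgeTtl: 1800, serviceTtl: 3600, swr: 3600 }, // marketing pages
  'semi-dynamic':   { edgeTtl: 300,  serviceTtl: 1200, swr: 1800 }, // catalog, shared pricing
  'personalized':   { edgeTtl: 0,    serviceTtl: 30,   swr: 0 },    // cart, balances: no shared edge cache
};

function cacheControlFor(cls: DataClass): string {
  const p = TTL_POLICY[cls];
  if (p.edgeTtl === 0) return 'private, no-store';
  return `public, max-age=${p.edgeTtl}, stale-while-revalidate=${p.swr}`;
}

console.log(cacheControlFor('semi-dynamic'));
// public, max-age=300, stale-while-revalidate=1800
```
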
What’s the fastest way to see ROI from caching?
Start at the edge: standardize `Cache-Control`, strip noisy cookies, enable SWR, and add surrogate keys. We routinely see 25–40% cost reductions and 2–3x p95 improvements in under 2 weeks. Then move to service-level cache-aside for your most expensive aggregations.
How do I avoid cache stampedes during traffic spikes?
Enable request coalescing at the proxy (Varnish/Envoy/Nginx), add Redis locks (`SET NX PX`) around refresh, keep a stale copy for 30–60 minutes, and add TTL jitter. Canary TTL/policy changes and monitor miss penalty. Pre-warm hot keys on deploy.
Is caching safe with authenticated users and GDPR/PII?
Yes, with boundaries. Don’t put PII or personalized responses in shared caches. Use `Cache-Control: private` for personalized content, segment keys by tenant/org, and encrypt at rest for Redis. Audit keys/values for sensitive fields and set short TTLs for any token-bearing entries. Document a data retention policy.
Can GraphQL be cached effectively?
Edge caching of GraphQL is hard because most clients POST and responses mix fields. Cache at the resolver/service layer: hot entities and lists keyed by args. Support GET for idempotent queries with persisted query hashes; normalize arguments; then cache GETs at edge/proxy with short TTLs and SWR.
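
Building on the persisted-query idea above, a cacheable GET can be sketched like this; the `extensions` shape follows Apollo's automatic-persisted-queries convention, and the endpoint and query are made up for the example:

```typescript
import { createHash } from 'crypto';

// Build a cacheable GET URL for an idempotent GraphQL query.
// The extensions payload mirrors Apollo persisted queries; treat it as an
// assumption to verify against your server.
function persistedQueryUrl(
  endpoint: string,
  query: string,
  variables: Record<string, unknown>
): string {
  const sha256Hash = createHash('sha256').update(query).digest('hex');
  const params = new URLSearchParams({
    extensions: JSON.stringify({ persistedQuery: { version: 1, sha256Hash } }),
    variables: JSON.stringify(variables),
  });
  return `${endpoint}?${params.toString()}`;
}

const url = persistedQueryUrl(
  'https://api.example.com/graphql',
  '{ product(id: 123) { name } }',
  {}
);
console.log(url.startsWith('https://api.example.com/graphql?')); // true
```

Because the hash is stable for a given query, the edge/proxy can key on the URL alone and apply the same short-TTL + SWR policy as any other GET.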

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run a 60-minute Cache Triage with GitPlumbers. Download the Cache Headers Cheat Sheet.
