Stop Recomputing the Same Bytes: Caching Architectures That Cut p95 In Half and Your Cloud Bill by a Third

Design a cache stack that hits user-facing SLOs and trims your infra bill—without gambling on consistency.

Cache is the cheapest compute you’ll ever buy—if you design it, measure it, and invalidate it on purpose.

The $120k Month We Could Have Avoided

A retail client was paying through the nose for “dynamic” pages that were 95% identical across users. Fastly passed every request through to origin because a cookie set by a marketing pixel nuked cacheability. p95 TTFB hovered at 1.2s during traffic spikes, and origin egress/compute hit a painful peak. We fixed three headers, added a surrogate key, and pushed stale-while-revalidate.

Thirty days later: edge hit ratio 92%, origin offload 78%, p95 TTFB down to 450ms, and cloud spend dropped 34%. Same app. Same code. Just smarter caching.

Start With the Metrics That Move the Business

If you can’t tie caching to business outcomes, it’s a hobby. Anchor on:

  • User-facing SLOs: p95 TTFB, p95/p99 API latency, LCP (Core Web Vitals). If you’re e-comm, every 100ms off TTFB often correlates with measurable conversion lift (a pattern Amazon and Shopify have both reported).
  • Origin offload: target % of requests served from edge/service cache. 70–90% is common for content/product APIs.
  • Miss penalty: p95 time to serve a cache miss (including downstream). Budget it.
  • Cost: compute-hours, DB QPS, and egress. Cache wins show up as fewer container-hours and smaller DBs.

Set explicit targets per surface. Example goals:

  • Product detail API: p95 < 150ms, offload ≥ 85%, miss penalty < 350ms
  • Category page HTML: p95 TTFB < 500ms, offload ≥ 75%, zero “global purge” incidents per quarter
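
These ratios are easy to compute from raw counters when you set targets like the ones above; a minimal sketch (function names and sample numbers are illustrative, not from any specific tooling):

```typescript
/** Fraction of requests served from cache (edge or service layer). */
function hitRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

/** Origin offload: share of total requests the origin never saw. */
function originOffload(totalRequests: number, originRequests: number): number {
  return totalRequests === 0 ? 0 : (totalRequests - originRequests) / totalRequests;
}

// Example: 9,200 edge hits and 800 misses is a 92% hit ratio;
// 2,200 of 10,000 requests reaching origin is 78% offload.
console.log(hitRatio(9200, 800));        // 0.92
console.log(originOffload(10000, 2200)); // 0.78
```

Track both per surface: a high hit ratio on a low-traffic path can hide a poor offload number where it matters.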

A Layered Cache Architecture That Actually Works

Stop arguing “Redis vs CDN.” You need layers, each with a job and owner:

  • Browser: honor Cache-Control, ETag, Last-Modified. Small TTLs on static assets; immutable where safe.
  • CDN/Edge (Fastly/Cloudflare/Akamai): cache shared, anonymous-friendly responses. Use surrogate keys for precise purges. Enable stale-while-revalidate and stale-if-error.
  • Gateway/Proxy (Nginx/Envoy/Varnish): coalesce requests, normalize headers/cookies, and cache authenticated-but-public data (e.g., catalog).
  • Service cache (Redis/Memcached): cache expensive query results and API aggregations with cache-aside. Keep hot sets in memory.
  • In-process LRU (Caffeine/Ristretto/Guava): micro-caches for micro-latency (5–50ms savings) and as a buffer when Redis blips.

Ownership matters: Platform/SRE owns edge/gateway policy; service teams own keys/TTLs and invalidation for their domains.
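
The in-process layer in the list above can be as small as a Map with TTL and LRU eviction; a hedged sketch (sizes and TTLs are placeholders, and a production service would reach for Caffeine/Ristretto/Guava instead):

```typescript
// Minimal in-process TTL + LRU micro-cache. Relies on Map preserving
// insertion order: re-inserting on read makes the oldest key the LRU victim.
class MicroCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private maxEntries = 1000, private ttlMs = 5000) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) { // expired: drop and report a miss
      this.store.delete(key);
      return undefined;
    }
    this.store.delete(key);             // refresh recency
    this.store.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.store.size >= this.maxEntries) { // evict least-recently-used
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const c = new MicroCache<string>(2, 60000);
c.set('a', '1'); c.set('b', '2');
c.get('a');              // touch 'a' so 'b' becomes least recently used
c.set('c', '3');         // evicts 'b'
console.log(c.get('b')); // undefined
```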

Keys, TTLs, and Invalidation You Can Live With

Caching fails when invalidation is a rumor. Here’s the playbook:

  • Key design: resource:{version}:{tenant}:{id}?{normalized-query}. Version keys when schema/logic changes; rotate during deploys.
  • TTL strategy: pick a base TTL (e.g., 10m), add jitter (±10–20%) to avoid synchronized expiry. Critical pages: 1–5m at edge, 10–30m in Redis, seconds in-process.
  • Cache-aside (read-through by code): on miss, fetch from origin, set cache, return. Simple, explicit ownership.
  • Write-through: on write, update DB and cache key synchronously. Good for leaderboards/hot objects.
  • Write-behind: buffer writes, update DB async. Use carefully; needs durability guarantees.
  • Stale-While-Revalidate (SWR): serve stale content for X seconds while refreshing in background. Users stay fast; origin breathes.
  • Negative caching: cache 404/empty results briefly (e.g., 30–60s) to squelch repeated misses.
  • Purge precisely: use surrogate keys/tags. Never global purge as a habit.
  • HTTP semantics: send ETag and handle If-None-Match. Conditional GETs cut bytes and time.
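
The key design bullet above can be sketched as a small builder: versioned, tenant-scoped, with a sorted and whitelisted query string so parameter order and junk params (UTM tags, etc.) don't fragment the cache. The version constant and parameter whitelist here are illustrative:

```typescript
const KEY_VERSION = 'v3'; // bump on schema/logic changes to rotate keys

function cacheKey(
  resource: string,
  tenant: string,
  id: string,
  query: Record<string, string> = {}
): string {
  // Whitelist and sort params so "?b=2&a=1" and "?a=1&b=2" map to one key.
  const allowed = ['page', 'sort', 'limit'];
  const normalized = Object.keys(query)
    .filter((k) => allowed.includes(k))
    .sort()
    .map((k) => `${k}=${query[k]}`)
    .join('&');
  return `${resource}:${KEY_VERSION}:${tenant}:${id}${normalized ? '?' + normalized : ''}`;
}

console.log(cacheKey('product', 'acme', '123', { sort: 'price', page: '2' }));
// product:v3:acme:123?page=2&sort=price
```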

Headers that tend to work:

Cache-Control: public, max-age=300, s-maxage=600, stale-while-revalidate=300, stale-if-error=86400
ETag: "v2-<content-hash>"
Vary: Accept-Encoding, Accept-Language
Surrogate-Key: product:123 catalog:summer
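
The ETag line above pairs with If-None-Match handling on the server; a minimal sketch using Node's crypto (the handler shape and version prefix are illustrative):

```typescript
import { createHash } from 'crypto';

// Compute a content-hash ETag like the "v2-<content-hash>" header above.
function etagFor(body: string, version = 'v2'): string {
  const hash = createHash('sha256').update(body).digest('hex').slice(0, 16);
  return `"${version}-${hash}"`;
}

// Conditional GET: if the client's If-None-Match matches, skip the body (304).
function respond(body: string, ifNoneMatch?: string): { status: number; body?: string; etag: string } {
  const etag = etagFor(body);
  if (ifNoneMatch === etag) return { status: 304, etag };
  return { status: 200, body, etag };
}

const first = respond('{"id":123}');
console.log(first.status);                              // 200
console.log(respond('{"id":123}', first.etag).status);  // 304: content unchanged
```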

Stampede Control and Consistency: Don’t Melt Your Origin

Everything is fine until a hot key expires during peak. Techniques that work in production:

  • Request coalescing at the proxy: one miss triggers one origin fetch; others wait or get stale.
  • Background refresh: refresh soon-to-expire keys asynchronously. Emit metrics to cap concurrency.
  • Locking: Redis `SET lock:key 1 NX PX 30000` before refresh; if the lock fails, serve stale.
  • Grace/stale: stale-if-error=86400; when downstream is flaky, keep users fast.
  • Pre-warm on deploy: hit top N keys after a cold cache (load test or worker job).
  • Canary TTL changes: roll TTL policy to 5–10% of traffic and watch miss penalty and error budget.

Set SLO guardrails: if hit ratio drops by >10 points and miss penalty > target for 5m, alert and auto-widen SWR grace.
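
Request coalescing from the list above can be sketched in-process with a shared promise per key (a simplified single-flight, not a proxy-level implementation; names are illustrative):

```typescript
// Single-flight: concurrent misses for the same key share one origin fetch.
const inFlight = new Map<string, Promise<string>>();

async function coalesced(key: string, fetchOrigin: () => Promise<string>): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) return existing; // piggyback on the in-progress fetch
  const p = fetchOrigin().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Ten concurrent callers, one origin hit.
let originCalls = 0;
const slowOrigin = () =>
  new Promise<string>((resolve) => {
    originCalls++;
    setTimeout(() => resolve('payload'), 50);
  });

Promise.all(Array.from({ length: 10 }, () => coalesced('hot-key', slowOrigin)))
  .then((results) => console.log(originCalls, results.length)); // 1 10
```

The same idea is what `proxy_cache_lock` in Nginx and `req.hash_ignore_busy`-style waiting in Varnish give you at the proxy layer.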

Concrete Configs You Can Copy-Paste

A few proven snippets that save real money.

Nginx as API cache with stale-while-revalidate

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=api_cache:100m max_size=5g inactive=60m use_temp_path=off;

map $http_authorization $cache_bypass {
  default 1;
  "" 0;
}

server {
  listen 443 ssl;
  location /api/ {
    proxy_pass http://upstream;

    proxy_cache api_cache;
    proxy_cache_key "$request_method|$scheme://$host$request_uri";
    proxy_ignore_headers Set-Cookie;

    proxy_cache_valid 200 301 302 10m;
    proxy_cache_background_update on; # SWR-like behavior
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_bypass $cache_bypass; # skip cache lookup for authenticated requests
    proxy_no_cache $cache_bypass;     # and never store their responses

    add_header X-Cache-Status $upstream_cache_status always;
  }
}

Cloudflare Worker: set SWR and purge by tag

export default {
  async fetch(req: Request, env: any) {
    const url = new URL(req.url);

    if (req.method === 'PURGE' && url.searchParams.has('tag')) {
      const resp = await fetch(`https://api.cloudflare.com/client/v4/zones/${env.ZONE_ID}/purge_cache`, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${env.API_TOKEN}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ tags: [url.searchParams.get('tag')] })
      });
      return new Response(await resp.text(), { status: resp.status });
    }

    const cache = caches.default;
    let res = await cache.match(req);
    if (!res && req.method === 'GET') {
      res = await fetch(req);
      const headers = new Headers(res.headers);
      headers.set('Cache-Control', 'public, max-age=600, stale-while-revalidate=3600, stale-if-error=86400');
      headers.set('Cache-Tag', 'product:123');
      const cached = new Response(res.body, { status: res.status, headers });
      if (res.ok) await cache.put(req, cached.clone()); // only store successful responses
      return cached;
    }
    return res ?? fetch(req); // non-GET misses go straight to origin
  }
};

Cache-aside with TTL jitter and lock (TypeScript + Redis)

import { createClient } from 'redis';
const redis = createClient();
await redis.connect(); // node-redis v4 requires an explicit connect

async function getProduct(id: string) {
  const key = `product:v3:${id}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Stampede lock
  const lockKey = `lock:${key}`;
  const gotLock = await redis.set(lockKey, '1', { NX: true, PX: 30000 });

  if (!gotLock) {
    // Another worker is refreshing; serve stale if present
    const stale = await redis.get(key + ':stale');
    if (stale) return JSON.parse(stale);
    // No stale copy: fall through and fetch origin ourselves
  }

  const fresh = await fetchOrigin(id); // your DB/service call

  const baseTtlSec = 600; // 10m
  // Jitter ±10% so hot keys don't expire in lockstep
  const jitter = Math.floor((Math.random() - 0.5) * 0.2 * baseTtlSec);
  const ttl = baseTtlSec + jitter;

  await redis.setEx(key, ttl, JSON.stringify(fresh));
  // Keep a longer-lived stale copy
  await redis.setEx(key + ':stale', 3600, JSON.stringify(fresh));
  if (gotLock) await redis.del(lockKey); // only release a lock we actually hold

  return fresh;
}

Varnish: grace + surrogate keys

sub vcl_backend_response {
  set beresp.ttl = 10m;
  set beresp.grace = 1h; # serve stale if origin is slow or failing
  if (beresp.http.Surrogate-Key) {
    set beresp.http.Surrogate-Control = "max-age=600, stale-while-revalidate=3600, stale-if-error=86400";
  }
}

Quick verification

# First GET warms the cache (MISS); repeat and expect HIT
# (use a real GET: with $request_method in the cache key, HEAD caches separately)
curl -s -o /dev/null -D - https://api.example.com/products/123 | grep -i x-cache-status

Proving ROI: How to Measure and Socialize the Win

Cache work pays for itself fast when you measure the right things:

  • Edge hit ratio and origin offload: from CDN logs (Fastly real-time stats, Cloudflare analytics). Target ≥ 80% for cacheable surfaces.
  • Service cache hit ratio: counters in Prometheus: cache_hits_total, cache_misses_total per keyspace.
  • Miss penalty: histogram of miss latencies; show p95 improvements after SWR/locking.
  • Downstream QPS: DB and microservice calls before/after; you want 30–70% reductions on hot paths.
  • User metrics: LCP/TTFB (RUM via Boomerang/SpeedCurve/Datadog RUM). Conversion/retention change if you’re B2C.

Real numbers we’ve delivered at GitPlumbers in the last year:

  • SaaS analytics vendor: API p95 from 780ms → 290ms, origin offload 82%, BigQuery costs -41%.
  • DTC retailer: HTML TTFB p95 from 1.2s → 450ms, infra spend -34%, CVR +0.8pp.
  • Fintech dashboard: service cache added to expensive aggregations; DB QPS -63%, p99 tail slashed by 52%.

Tell the story with joined dashboards: edge → proxy → service → DB. Executives love a single chart that shows latency down and dollars saved.

Pitfalls to Dodge (I’ve Seen These Take Down Launches)

  • Auth leakage: don’t cache personalized content in shared caches. Use Cache-Control: private or vary on auth/tenant. Strip unneeded cookies at the edge.
  • Cache fragmentation: wild Vary headers and marketing cookies nuke hit ratios. Normalize/whitelist.
  • Poisoning: validate hosts, strip hop-by-hop headers, pin to upstreams; restrict Purge to CI tokens.
  • GraphQL: naive caching fails due to POST and mixed fields. Cache resolvers’ data at service layer; add GET for idempotent queries when possible.
  • Multi-tenant bleed: include tenant/org in keys. Don’t rely on headers alone.
  • Overlong TTLs: product prices and inventory need fast purges. Use surrogate-key purges wired to your PIM/ERP events.
  • Global purges: last resort only. If you need them weekly, your invalidation design is broken.

Cache is a contract, not a best-effort hint. Treat it like an API: version it, test it, monitor it, and roll it back when it misbehaves.

Key takeaways

  • Design caching around user-facing SLOs (p95/p99) and origin offload targets, not just CPU graphs.
  • Layer caches: browser → CDN/edge → gateway/proxy → service/Redis → in-process LRU.
  • Use sane keys/TTLs: versioned keys, jitter, surrogate keys, and SWR to buy consistency and avoid stampedes.
  • Measure miss penalty and hit ratio per layer; budget for cache misses in your SLOs.
  • Purge precisely (tags/keys), not globally; automate via CI/CD and release pipelines.
  • Harden against stampedes with request coalescing, locks, and stale-on-error.
  • Keep auth/PII out of shared caches; use Vary and cookies sparingly to prevent cache fragmentation.

Implementation checklist

  • Define SLOs: p95 TTFB/LCP and origin offload% per surface (pages, APIs).
  • Map your cache layers and ownership: edge, gateway, service, in-proc.
  • Standardize headers: Cache-Control, ETag/If-None-Match, Surrogate-Key, SWR.
  • Pick strategies per data class: cache-aside for reads, write-through for hot keys.
  • Implement stampede controls: background refresh, request coalescing, Redis locks, stale-if-error.
  • Version keys and add TTL jitter to avoid synchronized expiry.
  • Add precise purging (tags/keys) wired to your deploy pipeline.
  • Instrument hit/miss, miss penalty, and downstream QPS; ship dashboards and alerts.
  • Run an A/B or canary for cache policy changes; watch p95 and error budgets.
  • Document cacheability contracts and ownership; run quarterly “cache fire drills.”

Questions we hear from teams

How do I choose TTLs without breaking freshness?
Classify data. For content that changes infrequently (marketing pages), 5–30 minutes at edge is safe with surrogate-key purges on publish. For semi-dynamic data (catalog, non-personalized pricing), 1–10 minutes edge, 10–30 minutes service cache, with SWR 10–60 minutes. For highly dynamic or personalized data (cart, balances), avoid shared edge caching; use short in-process/Redis TTLs and ETags for conditional requests.
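
The classification above can be captured as a small policy table; the values mirror the ranges in the answer, and the names and header-building helper are illustrative:

```typescript
// TTLs in seconds, one row per data class described above; tune per surface.
type DataClass = 'static-content' | 'semi-dynamic' | 'personalized';

interface TtlPolicy { edgeTtl: number; serviceTtl: number; swr: number }

const TTL_POLICY: Record<DataClass, TtlPolicy> = {
  'static-content': { edgeTtl: 1800, serviceTtl: 3600, swr: 3600 }, // marketing pages
  'semi-dynamic':   { edgeTtl: 300,  serviceTtl: 1200, swr: 1800 }, // catalog, shared pricing
  'personalized':   { edgeTtl: 0,    serviceTtl: 30,   swr: 0 },    // cart, balances: no shared edge cache
};

function cacheControlFor(cls: DataClass): string {
  const p = TTL_POLICY[cls];
  if (p.edgeTtl === 0) return 'private, no-store';
  return `public, max-age=${p.edgeTtl}, stale-while-revalidate=${p.swr}`;
}

console.log(cacheControlFor('semi-dynamic'));
// public, max-age=300, stale-while-revalidate=1800
```
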
What’s the fastest way to see ROI from caching?
Start at the edge: standardize `Cache-Control`, strip noisy cookies, enable SWR, and add surrogate keys. We routinely see 25–40% cost reductions and 2–3x p95 improvements in under 2 weeks. Then move to service-level cache-aside for your most expensive aggregations.
How do I avoid cache stampedes during traffic spikes?
Enable request coalescing at the proxy (Varnish/Envoy/Nginx), add Redis locks (`SET NX PX`) around refresh, keep a stale copy for 30–60 minutes, and add TTL jitter. Canary TTL/policy changes and monitor miss penalty. Pre-warm hot keys on deploy.
Is caching safe with authenticated users and GDPR/PII?
Yes, with boundaries. Don’t put PII or personalized responses in shared caches. Use `Cache-Control: private` for personalized content, segment keys by tenant/org, and encrypt at rest for Redis. Audit keys/values for sensitive fields and set short TTLs for any token-bearing entries. Document a data retention policy.
Can GraphQL be cached effectively?
Edge caching of GraphQL is hard because most clients POST and responses mix fields. Cache at the resolver/service layer: hot entities and lists keyed by args. Support GET for idempotent queries with persisted query hashes; normalize arguments; then cache GETs at edge/proxy with short TTLs and SWR.
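
Building on the persisted-query idea above, a cacheable GET can be sketched like this; the `extensions` shape follows Apollo's automatic-persisted-queries convention, and the endpoint and query are made up for the example:

```typescript
import { createHash } from 'crypto';

// Build a cacheable GET URL for an idempotent GraphQL query.
// The extensions payload mirrors Apollo persisted queries; treat it as an
// assumption to verify against your server.
function persistedQueryUrl(
  endpoint: string,
  query: string,
  variables: Record<string, unknown>
): string {
  const sha256Hash = createHash('sha256').update(query).digest('hex');
  const params = new URLSearchParams({
    extensions: JSON.stringify({ persistedQuery: { version: 1, sha256Hash } }),
    variables: JSON.stringify(variables),
  });
  return `${endpoint}?${params.toString()}`;
}

const url = persistedQueryUrl(
  'https://api.example.com/graphql',
  '{ product(id: 123) { name } }',
  {}
);
console.log(url.startsWith('https://api.example.com/graphql?')); // true
```

Because the hash is stable for a given query, the edge/proxy can key on the URL alone and apply the same short-TTL + SWR policy as any other GET.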

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run a 60-minute Cache Triage with GitPlumbers. Download the Cache Headers Cheat Sheet.
