Scale Out Without Melting Down: Horizontal Strategies for Stateless and Stateful Services That Actually Move the Needle

The practical playbook for scaling web stacks and data systems without torching SLOs or blowing up your AWS bill.

> Scaling stateless is an ops problem. Scaling stateful is a design decision. Get the second one wrong and the first won’t save you.

The outage that taught us (again)

We had a flash sale at a consumer marketplace. The web tier autoscaled from 20 to 120 pods in five minutes (nice). The cart database? A single-primary Postgres at 60% CPU and 95% I/O utilization, with replication lag creeping up. p95 jumped from 420ms to 1.9s, error rate hit 5%, and checkout conversion dropped 3.7% in 30 minutes. The stateless layer did its job; the stateful layer face-planted. I’ve seen this movie at startups and at FAANG scale. Horizontal scaling only works when you scale the right things, in the right order, tied to the metrics that actually matter.

Start with the only metrics that matter

If you don’t anchor scaling to user-facing outcomes, you’ll optimize the wrong graphs.

  • SLOs per journey: define p95 latency, p99 latency, and error budgets for home, search, add-to-cart, checkout, and auth. E.g., checkout p95 <= 600ms, p99 <= 1.2s, error rate < 0.5%.
  • Business KPIs: track conversion, revenue per session, retention, and customer support contacts per 1k sessions.
  • Cost guardrails: cost per successful request, infra cost as % of revenue, and scaling efficiency (requests per vCPU).

Tie autoscaling and capacity policies to these, not just CPU%. If p95 is blowing the budget while CPU is 30%, you have a concurrency or downstream bottleneck, not a compute shortage.
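
To make that concrete, here’s a minimal Go sketch of the decision the paragraph above describes: if p95 blows the SLO while CPU is cold, the answer isn’t more replicas. The types, thresholds, and verdict names are illustrative assumptions, not a drop-in autoscaler.

```go
package slogate

import (
  "sort"
  "time"
)

// SLO holds per-journey targets; the numbers you plug in are your own.
type SLO struct {
  P95      time.Duration // e.g. 600ms for checkout
  ErrorPct float64       // e.g. 0.5
}

// Verdict is what an autoscaler (or an on-call human) can act on.
type Verdict string

const (
  ScaleOut      Verdict = "scale-out"              // p95 blown and CPU hot: add replicas
  FixDownstream Verdict = "investigate-downstream" // p95 blown at low CPU: concurrency or dependency problem
  Healthy       Verdict = "healthy"
)

// p95 computes the 95th percentile of observed request latencies.
func p95(latencies []time.Duration) time.Duration {
  if len(latencies) == 0 {
    return 0
  }
  sorted := append([]time.Duration(nil), latencies...)
  sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
  idx := int(float64(len(sorted))*0.95) - 1
  if idx < 0 {
    idx = 0
  }
  return sorted[idx]
}

// Evaluate ties the scaling decision to the SLO, not to CPU alone.
func Evaluate(slo SLO, latencies []time.Duration, cpuPct float64) Verdict {
  switch {
  case p95(latencies) <= slo.P95:
    return Healthy
  case cpuPct >= 70: // compute-bound: scaling wide actually helps
    return ScaleOut
  default: // latency budget blown while CPU idles: look downstream
    return FixDownstream
  }
}
```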

Stateless: scale wide, keep it boring

Stateless services should scale like Legos. If they don’t, they’re not stateless yet.

  • Externalize state: sessions in Redis or Memcached. Kill sticky sessions at the LB. In Nginx/Envoy, ensure no ip_hash or cookie-based pinning unless it’s for canary traffic.
  • Right autoscaling signals: HPA on CPU is lazy. Add concurrency and queue length.
    • KEDA (keda.sh) for event-driven scaling off SQS queue depth or Kafka consumer lag.
    • HPA with custom.metrics.k8s.io for requests_in_flight or latency_p95 from Prometheus.
  • Concurrency limits: set GOMAXPROCS, UV_THREADPOOL_SIZE, or worker_connections based on load tests. Cap per-pod concurrency to avoid head-of-line blocking.
  • Idempotency + retries: generate idempotency keys at the edge for writes; return HTTP 429/503 under pressure and retry with jittered backoff. It keeps autoscaling smooth under spikes (see the first sketch after this list).
  • Edge and CDN: move hot reads to the edge with CloudFront/Fastly. Cache product pages with 30–120s TTL; purge on change via webhook. It’s the cheapest horizontal scale you’ll buy.
  • Bulkheads and circuit breakers: in Envoy/Istio, define per-route timeouts, outlier detection, and connection pools. It stops one slow dependency from dragging down the fleet (see the bulkhead sketch after this list).
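
Here’s what the idempotency-plus-retries bullet looks like in practice, as a hedged Go sketch using only the standard library. The `Idempotency-Key` header name and the full-jitter schedule are common conventions, not requirements of any particular framework.

```go
package client

import (
  "bytes"
  "crypto/rand"
  "encoding/hex"
  "fmt"
  "math/big"
  "net/http"
  "time"
)

// newIdempotencyKey returns a random key generated once per logical write,
// so retries of the same operation can be deduplicated server-side.
func newIdempotencyKey() (string, error) {
  b := make([]byte, 16)
  if _, err := rand.Read(b); err != nil {
    return "", err
  }
  return hex.EncodeToString(b), nil
}

// postWithRetry retries 429/503 with full-jitter backoff, reusing the same
// idempotency key so the server can safely collapse duplicates.
func postWithRetry(c *http.Client, url string, body []byte, attempts int) (*http.Response, error) {
  key, err := newIdempotencyKey()
  if err != nil {
    return nil, err
  }
  base := 100 * time.Millisecond
  for i := 0; i < attempts; i++ {
    req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
    if err != nil {
      return nil, err
    }
    req.Header.Set("Idempotency-Key", key) // a common convention, not a standard header
    resp, err := c.Do(req)
    if err == nil && resp.StatusCode != http.StatusTooManyRequests && resp.StatusCode != http.StatusServiceUnavailable {
      return resp, nil
    }
    if resp != nil {
      resp.Body.Close()
    }
    // full jitter: sleep a random duration in [0, base*2^i)
    ceil := base << uint(i)
    n, _ := rand.Int(rand.Reader, big.NewInt(int64(ceil)))
    time.Sleep(time.Duration(n.Int64()))
  }
  return nil, fmt.Errorf("gave up after %d attempts", attempts)
}
```

The detail that matters: the key is generated once per logical write, outside the retry loop, so a duplicate delivery is safe to collapse server-side.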
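
And the bulkhead idea, sketched at the application layer for the work Envoy can’t see (per-dependency concurrency inside a service). Slot counts and deadlines are placeholders you’d set from load tests.

```go
package bulkhead

import (
  "context"
  "errors"
)

// Bulkhead caps in-flight calls to one dependency so a slow downstream
// can't absorb every worker in the service.
type Bulkhead struct {
  slots chan struct{}
}

func New(maxInFlight int) *Bulkhead {
  return &Bulkhead{slots: make(chan struct{}, maxInFlight)}
}

var ErrRejected = errors.New("bulkhead: no capacity, failing fast")

// Do runs fn if a slot frees up before ctx expires; otherwise it sheds load
// instead of queueing behind a dependency that may never recover.
func (b *Bulkhead) Do(ctx context.Context, fn func(context.Context) error) error {
  select {
  case b.slots <- struct{}{}:
    defer func() { <-b.slots }()
    return fn(ctx)
  case <-ctx.Done():
    return ErrRejected
  }
}
```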

Measurable outcomes we’ve delivered:

  • Static + API cache at the edge cut origin RPS by 38% and p95 from 550ms to 320ms, improving conversion by 2.4% on mobile.
  • Adding an HPA signal on inflight_requests instead of CPU reduced the p99 tail by 27% during ad-driven spikes without adding nodes.

Stateful: pick your poison and design for it

Stateful systems don’t scale like web pods. You need to choose a scaling strategy up front.

  • Read replicas (cheap wins)

    • Good for read-heavy traffic. For PostgreSQL, set up async replicas and route SELECTs to them via PgBouncer in transaction pooling mode.
    • Watch replication lag and fall back to the primary (or time the read out) when lag exceeds ~250ms, so critical paths don’t serve stale reads.
    • Business impact: we’ve offloaded 60–80% of read traffic from primaries, freeing headroom for writes.
  • Sharding/partitioning

    • Use Vitess for MySQL or native Postgres partitioning by tenant/region/time.
    • Consistent hashing at the app layer with a shard map in Redis or a config repo (a minimal ring sketch follows this list). Keep shard rebalancing automated via a GitOps pipeline (ArgoCD).
    • Measurable win: a B2B SaaS moved from a single 48-core Postgres to 8 shards; p95 on tenant-heavy queries dropped from 1.4s to 380ms, cost went down 22%.
  • Distributed SQL

    • CockroachDB/YugabyteDB for multi-region writes with Raft. Great for availability, but you pay in write latency when transactions span regions.
    • Pin data to a locality with ALTER TABLE ... SET LOCALITY REGIONAL BY ROW (CockroachDB) and align your service topology to keep hot paths single-region.
  • CQRS/event sourcing

    • Split write models (normalized, consistent) from read models (denormalized, fast). Project events to a read store (Elastic/DynamoDB) for the hot UX.
    • This decouples UX p95 from write bottlenecks, at the cost of eventual consistency you must explain to Product.
  • Queues and backpressure

    • Put Kafka/SQS in front of write bursts. Scale consumers horizontally with idempotent processors.
    • Autoscale consumers off lag, cap producer concurrency, and drop non-critical messages under brownout policies.
  • Operational guardrails

    • Throttle at the top: per-tenant rate limits in Envoy backed by a Redis-based rate limit service, plus an emergency kill switch.
    • Schema change discipline: online migrations (gh-ost, pt-osc), write fences, and dual-writes during cutovers.
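
The shard map mentioned in the sharding bullet above can start as small as this: a consistent-hash ring in plain Go, standard library only. The virtual-node count and shard names are illustrative; real rebalancing, the Redis-backed map, and dual-read cutovers are deliberately out of scope.

```go
package shardmap

import (
  "fmt"
  "hash/fnv"
  "sort"
)

// Ring is a minimal consistent-hash ring: each shard gets several virtual
// nodes so keys spread evenly and rebalancing moves only a fraction of them.
type Ring struct {
  points []uint32          // sorted hash points on the ring
  owner  map[uint32]string // hash point -> shard name
}

func hashKey(s string) uint32 {
  h := fnv.New32a()
  h.Write([]byte(s))
  return h.Sum32()
}

func New(shards []string, vnodes int) *Ring {
  r := &Ring{owner: make(map[uint32]string)}
  for _, s := range shards {
    for v := 0; v < vnodes; v++ {
      p := hashKey(fmt.Sprintf("%s#%d", s, v))
      r.points = append(r.points, p)
      r.owner[p] = s
    }
  }
  sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
  return r
}

// Shard returns the shard that owns a key, e.g. a tenant_id.
func (r *Ring) Shard(key string) string {
  h := hashKey(key)
  i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
  if i == len(r.points) {
    i = 0 // wrap around the ring
  }
  return r.owner[r.points[i]]
}
```

Usage is one line, e.g. `New([]string{"orders-0", "orders-1"}, 64).Shard(tenantID)` (names are placeholders); adding a shard later only remaps the keys whose ring segment changes.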

Anti-patterns that keep me employed

Every incident review has one of these greatest hits.

  • Sticky sessions: they hide uneven load and kill resilience. Fix by externalizing sessions and returning the LB to round-robin.
  • Autoscaling on CPU only: web pods scale while DB keels over. Add inflight_requests and latency to HPA, and a scale-out cap tied to DB health.
  • Unbounded fan-out: one request triggering N downstream calls. Use request collapsing, caches, and bulk endpoints (a collapsing sketch follows this list).
  • Chatty N+1: ORM issuing 200 queries per request. Preload/JOIN intentionally or move to a read model.
  • Synchronous writes in the hot path: move to async pipelines with user-visible progress and idempotency.
  • Global transactions across services: if you hear “distributed 2PC,” run. Use sagas with compensations and accept eventual consistency.
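
Request collapsing, from the fan-out bullet above, is small enough to sketch with the standard library. In real Go services you’d likely reach for golang.org/x/sync/singleflight; this stdlib-only version just shows the shape of the idea: concurrent callers for the same key share one in-flight fetch instead of multiplying load on the origin.

```go
package collapse

import "sync"

type call struct {
  wg  sync.WaitGroup
  val interface{}
  err error
}

// Group deduplicates concurrent calls for the same key: the first caller does
// the work, later callers for that key wait and share the result.
type Group struct {
  mu       sync.Mutex
  inflight map[string]*call
}

func (g *Group) Do(key string, fn func() (interface{}, error)) (interface{}, error) {
  g.mu.Lock()
  if g.inflight == nil {
    g.inflight = make(map[string]*call)
  }
  if c, ok := g.inflight[key]; ok {
    g.mu.Unlock()
    c.wg.Wait() // another caller is already fetching this key
    return c.val, c.err
  }
  c := &call{}
  c.wg.Add(1)
  g.inflight[key] = c
  g.mu.Unlock()

  c.val, c.err = fn() // only this caller hits the downstream
  c.wg.Done()

  g.mu.Lock()
  delete(g.inflight, key)
  g.mu.Unlock()
  return c.val, c.err
}
```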

A rollout that doesn’t tank Friday revenue

Here’s the sequence we run for clients when we need scale now without surprises.

  1. Baseline and SLOs
    • Instrument RED/USE metrics with Prometheus and traces with Jaeger. Freeze the current p95, p99, error rate, saturation, and cost/request.
    • Define SLOs per user journey and set budgets in Grafana.
  2. Load model
    • Recreate the production traffic mix with k6 or Locust. Include burst patterns, cache misses, and cross-region calls (a throwaway generator sketch follows this list).
  3. Stateless first
    • Externalize session, add HPA with concurrency signals, implement circuit breakers. Push to 2x expected peak under load test.
  4. Stateful scale decision
    • Prototype read replicas vs shard vs distributed SQL on a prod-like dataset. Pick one based on write amplification, operator maturity, and failover needs.
  5. Safety rails
    • Rate limits, bulkheads, and brownout features in Envoy/Istio. Error budgets wired to feature flags to auto-dial down expensive features.
  6. Canary + GitOps
    • Use Argo Rollouts for 1%, 10%, 25%, 50% with automatic rollback on SLO regression. Configs in Git; rollbacks are git revert.
  7. Chaos and failover
    • Kill primaries during business hours with Chaos Mesh or Gremlin. Validate MTTR and make sure customers don’t notice.
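
For step 2, if you aren’t ready to stand up k6 or Locust yet, even a throwaway Go generator approximates a burst profile. The URL, rates, and durations below are placeholders; pull the real mix from production traces.

```go
package main

import (
  "fmt"
  "net/http"
  "sync/atomic"
  "time"
)

func main() {
  target := "http://localhost:8080/checkout" // placeholder endpoint
  var sent, failed int64

  // Two phases: steady state, then a promo-style burst at 5x the rate.
  phases := []struct {
    rps      int
    duration time.Duration
  }{
    {rps: 50, duration: 30 * time.Second},
    {rps: 250, duration: 60 * time.Second},
  }

  client := &http.Client{Timeout: 2 * time.Second}
  for _, ph := range phases {
    ticker := time.NewTicker(time.Second / time.Duration(ph.rps))
    stop := time.After(ph.duration)
  phase:
    for {
      select {
      case <-ticker.C:
        go func() {
          atomic.AddInt64(&sent, 1)
          resp, err := client.Get(target)
          if err != nil || resp.StatusCode >= 500 {
            atomic.AddInt64(&failed, 1)
          }
          if resp != nil {
            resp.Body.Close()
          }
        }()
      case <-stop:
        ticker.Stop()
        break phase
      }
    }
  }
  time.Sleep(2 * time.Second) // crude drain so late responses are counted
  fmt.Printf("sent=%d failed=%d\n", atomic.LoadInt64(&sent), atomic.LoadInt64(&failed))
}
```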

Results we expect and have seen:

  • 30–50% reduction in p95 during spikes, with 15–35% infra cost reduction per request.
  • MTTR under 10 minutes for DB failover events with zero data loss (RPO=0) on Vitess/Cockroach.
  • 2–4% lift in checkout conversion due to faster p95 and fewer errors.

Real-world examples (numbers included)

  • Retail checkout (Kubernetes, Postgres, Redis, Envoy)

    • Before: p95 850ms, 2.1% error under promo spikes, $0.012 cost/request.
    • After: Redis-backed session, HPA with requests_in_flight, Postgres read replicas for catalog, write queue for orders.
    • Result: p95 420ms (-50%), errors 0.4%, cost/request $0.008 (-33%), conversion +3.1%.
  • Multi-tenant SaaS (Golang, MySQL→Vitess, Kafka)

    • Before: single MySQL primary at 70% CPU, 1.2s p95 on tenant-heavy queries.
    • After: Vitess sharding by tenant_id, Kafka for audit/event stream, per-tenant rate limits.
    • Result: p95 380ms (-68%), zero downtime schema changes, new tenant onboarding with linear capacity planning.
  • Streaming analytics (Node.js, Kafka, ClickHouse)

    • Before: consumer lag during peak; dashboards stale by 5–10 minutes.
    • After: KEDA autoscaling on lag, partition expansion, ClickHouse replicas by region.
    • Result: p99 ingest latency from 4.2s → 900ms; freshness under 60s at p99; ad-spend optimization improved ROAS by 6%.

What I’d do differently (so you don’t have to)

  • Don’t let stateless autoscale outrun your database. Put in a governor that ties app scale-out to DB health (lag, locks, active_tx); a minimal sketch follows this list.
  • Always add an idempotency layer before adding retries. Otherwise you amplify outages.
  • Size pods for failure: if one of N nodes dies, the survivors must absorb its traffic (roughly an extra 1/(N-1) each) without violating SLOs.
  • Keep one migration path hot. If you pick sharding, invest in tooling early (move/reshard automation). If distributed SQL, practice region isolation drills.
  • Track cost alongside latency in the same dashboard. Engineers optimize what they see.
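
A governor like the one in the first bullet can start this small. Everything here is an assumption about your platform: `DBHealth` stands in for whatever probe you already run against Postgres/MySQL, and `setMaxReplicas` stands in for however you patch the HPA or publish an external metric — neither is a real client API.

```go
package governor

import (
  "context"
  "time"
)

// DBHealth is whatever probe you already have: replication lag, lock waits,
// active transactions. Fields and thresholds here are assumptions.
type DBHealth struct {
  ReplicationLag time.Duration
  LockWaits      int
  ActiveTx       int
}

type Limits struct {
  NormalMaxReplicas   int // e.g. 120
  DegradedMaxReplicas int // e.g. current count plus a little headroom
}

// Run clamps app scale-out while the database is unhealthy, so the stateless
// tier can't autoscale the stateful tier into the ground.
func Run(ctx context.Context, probe func(context.Context) (DBHealth, error),
  setMaxReplicas func(context.Context, int) error, lim Limits) {
  t := time.NewTicker(15 * time.Second)
  defer t.Stop()
  for {
    select {
    case <-ctx.Done():
      return
    case <-t.C:
      h, err := probe(ctx)
      if err != nil {
        continue // fail open on probe errors; alert on them separately
      }
      max := lim.NormalMaxReplicas
      if h.ReplicationLag > 250*time.Millisecond || h.LockWaits > 50 {
        max = lim.DegradedMaxReplicas // hold the line until the DB recovers
      }
      _ = setMaxReplicas(ctx, max) // e.g. patch the HPA or an external metric
    }
  }
}
```

Failing open on probe errors is deliberate: the governor should never become the thing that blocks a legitimate scale-out.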

TL;DR checklist you can paste into Jira

  • Define SLOs and budgets per user journey; align autoscaling to p95, not CPU.
  • Externalize session; remove sticky LB behavior; enable HPA with concurrency.
  • Add caches and edge TTLs; implement circuit breakers, bulkheads, and rate limits.
  • Pick a stateful strategy: replicas, shards (Vitess/Postgres partitions), or distributed SQL.
  • Add queues for bursty writes; make processors idempotent; autoscale off lag.
  • Canary with Argo Rollouts; observe with Prometheus/Grafana/Jaeger.
  • Drill chaos and failovers during work hours; measure MTTR and customer impact.
  • Report conversion, revenue/session, and cost/request weekly alongside SLOs.

Key takeaways

  • Stateless and stateful scale differently—treat them like different animals or you’ll starve one while overfeeding the other.
  • Design to your SLOs: tie p95 latency, error rate, and availability targets directly to autoscaling policies.
  • Externalize state and session—then shard, replicate, or partition the real stateful systems that remain.
  • Capacity isn’t a number; it’s a pattern. Use queues, backpressure, and bulkheads to shape traffic.
  • Measure business impact (conversion, retention, cost/request), not just CPU graphs.
  • Roll out with canaries and load models, not hope. Bake in chaos and failure testing from day one.

Implementation checklist

  • Set SLOs for p95/p99 latency and error rates per user journey before touching autoscaling.
  • Classify every service as stateless, stateful, or state-aware and document scaling limits.
  • Enable `HPA` for stateless services with tuned `requests/limits` and concurrency signals (not just CPU%).
  • Externalize session to `Redis` or `Memcached`; remove sticky sessions at the LB.
  • For databases, choose a strategy: read replicas, sharding (e.g., `Vitess`), or a distributed DB (`CockroachDB`).
  • Add queues (`SQS`, `Kafka`, `NATS`) and implement idempotency keys for retries.
  • Introduce circuit breakers, rate limits, and backpressure at Envoy/Istio.
  • Load test with production traffic models; canary with `Argo Rollouts`; observe with `Prometheus` + `Grafana` + `Jaeger`.
  • Run chaos experiments during business hours to validate SLOs and failover paths.
  • Track business KPIs (conversion, AOV, cost/request) alongside tech metrics in a single dashboard.

Questions we hear from teams

How do I know if I should shard or add read replicas?
If you’re write-bound or have tenant/region data affinity, shard. If you’re read-heavy with acceptable staleness, add replicas. Prototype both with prod-like data and measure: write latency, replica lag, failover MTTR, and operational complexity.
Can I keep sticky sessions for A/B testing or canaries?
Use header- or cookie-based routing at `Envoy`/`Istio` for canaries, not sticky sessions for state. Keep the application stateless and route explicitly for experiments.
What’s the fastest win for p95 latency without rewriting the database layer?
Edge caching of hot reads (30–120s TTL), HPA on concurrency/inflight metrics, and circuit breakers. These typically yield 20–50% p95 reductions in days, not months.
Do distributed SQL databases remove the need for sharding?
They shift where you do it. You still need to think about data locality, hotspots, and schema design. You’ll trade manual sharding for placement rules and transaction boundaries.
How do I prevent autoscaling from crushing my database?
Tie app HPA max replicas to DB health (replication lag, active connections, lock wait). Add a governor service or `HPA` external metric that slows scale-out when downstream is red.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about your scaling plan · Download our SLO-first scaling checklist
