Stop Treating Everything as Stateless: Designing Horizontal Scaling That Won’t Melt Under Real Traffic

The playbooks we use to scale web + data planes without wrecking user experience, SLOs, or your cloud bill.

Scale stateless by pods; scale state by topology. Confuse the two and your p95 will tell on you.


Key takeaways

  • Scale stateless and stateful differently—sessions, reads, writes, and durability need separate lanes.
  • Autoscaling on CPU alone is a trap; use RPS, queue length, concurrency, and SLO-aware safeguards.
  • Cut tail latency with connection pooling, circuit breakers, and aggressive caching—this moves conversion.
  • Stateful scaling is topology work: read replicas, partitioning/sharding, and backpressure—not just bigger nodes.
  • Measure what users feel: p95/p99, error rate, TTFB. Tie improvements to revenue, not just node counts.
  • Ship safely: canary + load testing in prod-like conditions; chaos test failover and queue backlogs.

Implementation checklist

  • Define SLOs around p95/p99 latency, error rate, and availability before touching infra.
  • Eliminate sticky sessions; use `JWT` or `Redis` for session state (see the session-store sketch after this list).
  • Set HPA signals on RPS/concurrency; use `KEDA` for queue length.
  • Introduce circuit breakers and sane connection pools (`Envoy`/`Istio`); see the breaker-and-pool sketch after this list.
  • Split read/write paths; use read replicas for scale, leader for writes.
  • Partition where it hurts (tenant_id, user_id); document shard keys.
  • Make writes idempotent and add backpressure on queues (idempotency/backpressure sketch after this list).
  • Canary with `Argo Rollouts`; run `k6` load tests; run chaos on failover paths.
  • Track business impact: conversion, abandonment, infra $/request, MTTR.
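A minimal sketch of the session-store item above, assuming a Go service and `go-redis` v9; the `Session` fields and the `sess:` key prefix are illustrative, not a prescribed schema.

```go
package session

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// Session is a hypothetical payload; keep it small so any pod can load it cheaply.
type Session struct {
	UserID string   `json:"user_id"`
	Roles  []string `json:"roles"`
}

type Store struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewStore(addr string, ttl time.Duration) *Store {
	return &Store{rdb: redis.NewClient(&redis.Options{Addr: addr}), ttl: ttl}
}

// Save writes the session under an opaque ID so no pod needs request affinity.
func (s *Store) Save(ctx context.Context, id string, sess Session) error {
	b, err := json.Marshal(sess)
	if err != nil {
		return err
	}
	return s.rdb.Set(ctx, "sess:"+id, b, s.ttl).Err()
}

// Load fetches the session by ID; a miss simply means "no session here".
func (s *Store) Load(ctx context.Context, id string) (*Session, error) {
	b, err := s.rdb.Get(ctx, "sess:"+id).Bytes()
	if err == redis.Nil {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var sess Session
	if err := json.Unmarshal(b, &sess); err != nil {
		return nil, err
	}
	return &sess, nil
}
```

With session state off the pod, any replica can serve any request, so the load balancer needs no affinity and scale-out stops fragmenting logins.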
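For the breaker-and-pool item, here is a rough in-process equivalent of what `Envoy`/`Istio` outlier detection and connection limits give you at the mesh layer, assuming `pgx/v5` and `sony/gobreaker`; the limits and the `orders` query are placeholders to tune against your own primary.

```go
package db

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/sony/gobreaker"
)

// NewPool caps connections so a burst queues in the app instead of exhausting
// Postgres; tune MaxConns to what the primary actually tolerates.
func NewPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return nil, err
	}
	cfg.MaxConns = 20
	cfg.MaxConnIdleTime = 5 * time.Minute
	return pgxpool.NewWithConfig(ctx, cfg)
}

// NewBreaker fails fast after repeated errors so callers stop piling onto a
// struggling dependency while it recovers.
func NewBreaker(name string) *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:    name,
		Timeout: 10 * time.Second, // how long the breaker stays open before probing again
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures >= 5
		},
	})
}

// CountOrders is an illustrative query guarded by both the pool and the breaker.
func CountOrders(ctx context.Context, cb *gobreaker.CircuitBreaker, pool *pgxpool.Pool, userID string) (int64, error) {
	res, err := cb.Execute(func() (interface{}, error) {
		var n int64
		err := pool.QueryRow(ctx, "SELECT count(*) FROM orders WHERE user_id = $1", userID).Scan(&n)
		return n, err
	})
	if err != nil {
		return 0, err
	}
	return res.(int64), nil
}
```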
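And for idempotent writes with backpressure, a sketch assuming a Postgres table with a unique `idempotency_key` column and a hypothetical `Payment` event; the bounded channel stands in for whatever queue you actually run.

```go
package ingest

import (
	"context"
	"errors"

	"github.com/jackc/pgx/v5/pgxpool"
)

var ErrBackpressure = errors.New("queue full, retry later")

// Payment is an illustrative event; Key is the client-supplied idempotency key.
type Payment struct {
	Key    string
	UserID string
	Cents  int64
}

type Ingestor struct {
	pool  *pgxpool.Pool
	queue chan Payment // bounded: the capacity is the backpressure threshold
}

func NewIngestor(pool *pgxpool.Pool, depth int) *Ingestor {
	return &Ingestor{pool: pool, queue: make(chan Payment, depth)}
}

// Enqueue rejects work instead of buffering without bound; surface the error
// as a 429/503 so upstream retries with jitter rather than piling on.
func (i *Ingestor) Enqueue(p Payment) error {
	select {
	case i.queue <- p:
		return nil
	default:
		return ErrBackpressure
	}
}

// Worker drains the queue; ON CONFLICT makes redelivery harmless, so the
// write can be retried or replayed without double-charging.
func (i *Ingestor) Worker(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case p := <-i.queue:
			_, _ = i.pool.Exec(ctx,
				`INSERT INTO payments (idempotency_key, user_id, cents)
				 VALUES ($1, $2, $3)
				 ON CONFLICT (idempotency_key) DO NOTHING`,
				p.Key, p.UserID, p.Cents)
		}
	}
}
```

Rejecting at the edge with a retryable error also keeps queue depth, and therefore the `KEDA` scaling signal, honest instead of letting backlog hide inside the process.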

Questions we hear from teams

Do I need microservices to scale horizontally?
No. You can scale a well-factored monolith extremely far with correct autoscaling signals, connection pooling, and data topology. The pain arrives when a single schema or write path becomes the choke point—then you partition or carve out services along clear consistency boundaries.
Should I shard or just add read replicas?
Start with read replicas—fastest ROI. If your write QPS or hot rows still crush the primary, partition first (same DB) and measure. If you’re still constrained, adopt a sharding layer (Vitess) or a distributed SQL store. Jumping straight to sharding without ops maturity is how outages get born.
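A sketch of what "document shard keys" ends up looking like in application code, assuming one `pgx/v5` pool per shard and `tenant_id` as the key; a real rollout needs a directory or a layer like Vitess so shards can split and move, but the routing decision itself is this small.

```go
package shard

import (
	"hash/fnv"

	"github.com/jackc/pgx/v5/pgxpool"
)

// Router maps a tenant_id to one of a fixed set of shard pools.
type Router struct {
	shards []*pgxpool.Pool // index = shard number; order must never change
}

func NewRouter(shards []*pgxpool.Pool) *Router {
	return &Router{shards: shards}
}

// ForTenant picks the shard deterministically from the documented shard key,
// so every service that touches this data lands on the same partition.
func (r *Router) ForTenant(tenantID string) *pgxpool.Pool {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return r.shards[int(h.Sum32())%len(r.shards)]
}
```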
How do I keep consistency with replicas?
Route critical read-after-write flows to the primary (or use `SET LOCAL synchronous_commit = remote_apply` on the specific transactions whose writes must be visible on replicas immediately). For replicas, monitor replication lag and expose a hint: if lag > N ms, degrade reads or switch to primary for those flows.
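A sketch of that routing rule, assuming Postgres physical replication and `pgx/v5`; the 200 ms threshold and the per-call lag probe are illustrative (in practice you would sample lag on a ticker and cache it rather than query it on every read).

```go
package reads

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

const maxLag = 200 * time.Millisecond // hypothetical "N ms" threshold

type Router struct {
	primary *pgxpool.Pool
	replica *pgxpool.Pool
}

// replicaLag asks the standby how far behind it is applying WAL. Note that on
// an idle system this apparent lag can grow even with nothing left to replay.
func (r *Router) replicaLag(ctx context.Context) (time.Duration, error) {
	var seconds float64
	err := r.replica.QueryRow(ctx,
		`SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::float8`).
		Scan(&seconds)
	if err != nil {
		return 0, err
	}
	return time.Duration(seconds * float64(time.Second)), nil
}

// Pick sends read-your-writes flows to the primary, and falls back to the
// primary whenever the replica is too far behind or unreachable.
func (r *Router) Pick(ctx context.Context, needsOwnWrites bool) *pgxpool.Pool {
	if needsOwnWrites {
		return r.primary
	}
	lag, err := r.replicaLag(ctx)
	if err != nil || lag > maxLag {
		return r.primary
	}
	return r.replica
}
```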
How do I control autoscaling cost?
Cap max replicas per service, use stabilization windows, and enforce $/request SLOs in your dashboards. Pre-provision a small base, scale out on RPS/queue depth, and scale in slowly. Warm pools or cheap spot pools for bursty traffic can reduce cost without hurting p95.
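A sketch of the arithmetic behind those guardrails in plain Go: a $/request figure for dashboards, plus a scale-in recommendation that caps replicas and only shrinks when every sample in the stabilization window agrees (the same idea as the HPA scale-down stabilization window). All numbers are placeholders.

```go
package scaling

import "time"

// CostPerRequest ties node spend to traffic so dashboards can alert on a
// $/request SLO instead of raw replica counts.
func CostPerRequest(replicas int, hourlyNodeCost, requestsPerHour float64) float64 {
	if requestsPerHour == 0 {
		return 0
	}
	return (float64(replicas) * hourlyNodeCost) / requestsPerHour
}

// Recommend scales out immediately on load but only scales in when the whole
// stabilization window agrees, so one quiet minute can't trigger flapping.
type Recommend struct {
	MaxReplicas int
	Window      time.Duration
	samples     []sample
}

type sample struct {
	at      time.Time
	desired int
}

// Observe records the latest desired count and returns the replica target:
// the peak desired value seen inside the window, capped per service.
func (r *Recommend) Observe(now time.Time, desired int) int {
	if desired > r.MaxReplicas {
		desired = r.MaxReplicas // hard cap per service
	}
	r.samples = append(r.samples, sample{at: now, desired: desired})
	cutoff := now.Add(-r.Window)
	for len(r.samples) > 0 && r.samples[0].at.Before(cutoff) {
		r.samples = r.samples[1:] // drop samples older than the window
	}
	peak := desired
	for _, s := range r.samples {
		if s.desired > peak {
			peak = s.desired
		}
	}
	return peak
}
```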

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Fix your scaling bottleneck
Schedule an architecture review
