Horizontal Scale Without Regret: Stateless vs Stateful, What Actually Works
What I learned scaling a checkout from 15 rps to 2,500 rps without blowing up p95 or the database.
Scale stateless by default; scale state by topology. If CPU is your north star, you’re buying hardware to keep slow code slow.
The spike that broke checkout—and how we fixed it
Black Friday, years ago. Retail client. Marketing turned the firehose on and our stateless API tier scaled like a champ. Then p95 blew past 1.8s because the “stateless” tier was masking a single Postgres primary wheezing under 35k connections from vibey, AI-glued code calling SELECT * in loops. We didn’t need more pods. We needed topology. Two weeks later we shipped: session externalization, idempotency keys, read replicas, partitioned hot tables, and HPA signals tied to concurrency and queue depth. p95 went from 1.8s to 220ms at 2,500 rps sustained, with error rate <0.2%. Revenue/minute up 14%. That’s what horizontal scale looks like when you tie it to customer experience.
Measure what the user feels, not what the node feels
If you scale to CPU, you’ll buy a lot of hardware to keep slow code slow. Tie everything to user-facing metrics and business impact.
- Primary SLOs: p95 latency by endpoint, error rate, availability, Apdex. Track p99 for tail pain.
- Business signals: conversion rate, revenue/min, cart abandonment, DAU retention, LTV churn indicators.
- Golden signals: latency, traffic, errors, saturation—per hop (edge, service, DB, queue).
- Per-key/tier metrics: tenant keyed p95, partition lag, replica lag. The tail hides in hotspots.
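Apdex, mentioned in the SLO list above, is simple arithmetic: requests under the threshold T count as satisfied, requests under 4T count half as tolerating, the rest are frustrated. A minimal sketch (function name and shape are illustrative, not a library API):

```typescript
// Apdex = (satisfied + tolerating / 2) / total.
// "satisfied" means latency <= T; "tolerating" means latency <= 4T.
// T is your per-endpoint target, e.g. 300ms for /checkout.
function apdex(latenciesMs: number[], thresholdMs: number): number {
  if (latenciesMs.length === 0) return 1; // no traffic, no pain
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= thresholdMs) satisfied++;
    else if (ms <= 4 * thresholdMs) tolerating++;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}
```

With T = 300ms, a sample of [100, 200, 500, 2000] scores 0.625: two satisfied, one tolerating, one frustrated.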
Example SLO alert (Prometheus):
groups:
- name: slos
  rules:
  - record: service:latency_p95
    expr: histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
  - alert: LatencySLOViolation
    expr: service:latency_p95{route="/checkout"} > 0.3
    for: 10m
    labels:
      severity: page
    annotations:
      summary: p95 > 300ms on /checkout
Stateless that really scales: externalize, idempotize, and cap fan-out
“Stateless” is table stakes, but teams still trip on two things: hidden state and unbounded work per request.
- Externalize session: No sticky sessions. Use Redis/Memcached. Keep it tiny and TTL’d.
- Idempotency keys: Prevent duplicate writes when autoscaling and retries pile up.
- Bound fan-out: Cap downstream calls per request. Batch, coalesce, and cache.
- Connection pools: Limit DB and HTTP clients to sane pool sizes; shard pools per host.
- Autoscale to concurrency: CPU lies. Concurrency and latency tell the truth.
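The idempotency-key idea is easy to sketch. Below is a minimal in-process version; in production the state would live in Redis with a TTL and the key would come from a client-supplied header (commonly `Idempotency-Key`). All names here are illustrative:

```typescript
// First request with a key runs the handler and caches the result;
// retries with the same key replay the cached response instead of
// re-executing the write. A Map stands in for Redis here.
type HandlerResult = { status: number; body: string };

class IdempotencyStore {
  private done = new Map<string, HandlerResult>();
  private inFlight = new Map<string, Promise<HandlerResult>>();

  async run(key: string, handler: () => Promise<HandlerResult>): Promise<HandlerResult> {
    const cached = this.done.get(key);
    if (cached) return cached; // completed earlier: replay the response
    const pending = this.inFlight.get(key);
    if (pending) return pending; // concurrent duplicate: share the work
    const p = handler()
      .then((r) => {
        this.done.set(key, r); // only successful results are replayable
        return r;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

The in-flight map matters as much as the result cache: when a client retries before the first attempt finishes (the common case during autoscaling churn), both requests share one execution instead of double-charging.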
Kubernetes HPA on concurrency via nginx_ingress_controller_requests or service-level in_flight_requests:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 200
  metrics:
  - type: Pods
    pods:
      metric:
        name: in_flight_requests
      target:
        type: AverageValue
        averageValue: "40"
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      stabilizationWindowSeconds: 0
    scaleDown:
      stabilizationWindowSeconds: 300
Protect downstreams with Envoy circuit breakers and timeouts:
clusters:
- name: checkout
  connect_timeout: 0.2s
  circuit_breakers:
    thresholds:
    - max_connections: 2000
      max_pending_requests: 1000
      max_requests: 4000
  outlier_detection:
    consecutive_5xx: 5
    interval: 2s
    base_ejection_time: 30s
Cache at the edge to cut p95 in half:
location /catalog {
  proxy_cache catalog_cache;
  proxy_cache_valid 200 60s;
  proxy_cache_use_stale error timeout updating;
}
Result we repeatedly see: API tier p95 drops 30–60%, error budgets stop burning during marketing spikes, and infra spend stays flat because you’re avoiding database thrash.
Stateful without tears: partition, replicate, and route by key
Horizontal scale for stateful systems is topology: partitions, replicas, and the router that knows where to send traffic.
- Databases
  - Read replicas for fan-out reads, primary for writes. Route with a read/write split in the app or via a proxy.
  - Partitioning/sharding for hot tables. Use natural keys (tenant, region) or consistent hashing for balance.
  - Connection budget: Keep client pools small; let a proxy (e.g., pgbouncer) multiplex.
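The app-side read/write split can be sketched in a few lines. The `Pool` interface below is a stand-in for your driver's pool (e.g. `pg.Pool`); a real router would also pin reads-after-writes to the primary, which this sketch omits:

```typescript
// Writes (and anything inside a transaction) go to the primary;
// plain reads round-robin across replicas.
interface Pool {
  name: string;
  query(sql: string): Promise<unknown>;
}

class ReadWriteRouter {
  private next = 0;
  constructor(private primary: Pool, private replicas: Pool[]) {}

  poolFor(sql: string, inTransaction = false): Pool {
    const isRead = /^\s*select\b/i.test(sql);
    if (!isRead || inTransaction || this.replicas.length === 0) {
      return this.primary; // writes and transactional reads stay primary-bound
    }
    const replica = this.replicas[this.next % this.replicas.length];
    this.next++;
    return replica;
  }
}
```

Whether you do this in the app or in a proxy, monitor replica lag before routing reads away; stale reads on checkout paths are a feature flag, not an accident.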
PostgreSQL native partitioning to isolate hot tenants:
CREATE TABLE orders (
  tenant_id text NOT NULL,
  order_id uuid NOT NULL,
  created_at timestamptz NOT NULL,
  ...
  PRIMARY KEY (tenant_id, order_id)  -- Postgres requires the partition key in the PK
) PARTITION BY LIST (tenant_id);
CREATE TABLE orders_us_east PARTITION OF orders FOR VALUES IN ('us-east');
CREATE TABLE orders_eu_west PARTITION OF orders FOR VALUES IN ('eu-west');
CREATE INDEX ON orders_us_east (created_at DESC);
Connection pooling with pgbouncer:
[databases]
app = host=db-primary port=5432 dbname=app
[pgbouncer]
max_client_conn = 10000
default_pool_size = 50
pool_mode = transaction
- Caches
  - Move to Redis Cluster when keyspace > single-node memory or you see eviction churn.
  - Use hash tags to co-locate related keys: cart:{user123}:items.
- Queues/Streams
  - Size partition count to consumer parallelism headroom; monitor lag, not just throughput.
  - Use consumer groups; keep max.poll.interval.ms sane to avoid rebalances causing tail spikes.
Kafka topic with headroom for 10x consumers:
kafka-topics --create --topic events.checkout \
  --partitions 48 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000
Routing by key (Node.js example) to keep affinity with shards:
// murmurhash: any stable string hash (e.g. the murmurhash npm package)
function shardFor(tenantId: string, shardCount = 16) {
  const hash = murmurhash(tenantId);
  return hash % shardCount; // use consistent hashing in prod
}
Trade-offs you must make explicit:
- Consistency vs availability: Can your cart be eventually consistent for 1–2s? Say it out loud.
- Repairability: Rebalancing shards will hurt. Budget for it. Automate with checksums and backfills.
- Failover capacity: Keep replicas warm and sized for N-1 failures; otherwise failover is just a slower outage.
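On the rebalancing point: the modulo routing shown above remaps almost every key when the shard count changes, which is a big part of why resharding hurts. A consistent-hash ring with virtual nodes remaps only the keys that lived on the shard that moved. A minimal sketch (FNV-1a stands in for murmurhash; vnode count and naming are illustrative):

```typescript
// Consistent-hash ring with virtual nodes. Removing a shard only remaps
// keys whose nearest-clockwise point belonged to that shard.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class HashRing {
  private ring: { point: number; shard: string }[] = [];

  constructor(shards: string[], vnodes = 100) {
    for (const shard of shards) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: fnv1a(`${shard}#${v}`), shard });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  shardFor(key: string): string {
    const h = fnv1a(key);
    // first point clockwise of h; wrap to the start if none is larger
    const hit = this.ring.find((n) => n.point >= h) ?? this.ring[0];
    return hit.shard;
  }
}
```

The linear scan keeps the sketch short; production rings binary-search a sorted point array. The payoff is the repairability budget: draining one shard touches ~1/N of keys, not all of them.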
Control planes that prevent 3 AM pages
Spikes are normal; uncontrolled spikes kill margins and SLOs. Put governors in place.
- Rate limiting at the edge (Envoy, Kong, NGINX, or cloud ALB/WAF). Allocate per-tenant budgets.
- Backpressure & queue depth scaling: Use KEDA to scale consumers on lag.
- Load shedding: Drop non-critical work first (recommendations, analytics) to save checkout.
- Retries with jitter and budgets: Bounded retries; otherwise you DDoS yourself.
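"Bounded retries" means two limits, not one: a per-call attempt cap and a service-wide retry budget so that a downstream brownout can't multiply traffic. A minimal sketch with full jitter (names and defaults are illustrative):

```typescript
// Each call retries at most maxAttempts times; the shared budget caps
// total retries across the service so retries can't amplify an outage.
class RetryBudget {
  constructor(private tokens: number) {}
  take(): boolean {
    if (this.tokens <= 0) return false;
    this.tokens--;
    return true;
  }
}

async function withRetries<T>(
  fn: () => Promise<T>,
  budget: RetryBudget,
  maxAttempts = 3,
  baseDelayMs = 50,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !budget.take()) throw err;
      // full jitter: sleep uniformly in [0, base * 2^attempt)
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

In practice the budget is refilled as a fraction of successful request volume (Finagle and Envoy both ship this as "retry budgets"), so retries stay a single-digit percentage of traffic instead of a self-inflicted DDoS.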
KEDA scaling on Kafka lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-consumer
spec:
  scaleTargetRef:
    name: checkout-consumer
  pollingInterval: 5
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 200
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: checkout
      topic: events.checkout
      lagThreshold: "5000"
Istio outlier detection to auto-eject bad pods and reduce tail latency:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-dr
spec:
  host: api
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 2s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Result: during flash sales we’ve seen 40–70% lower p99 without increasing spend, purely from ejecting slow instances and scaling consumers to lag instead of CPU.
Ship changes safely: canaries, SLO gates, and capacity tests
You don’t discover scale in staging. You earn it in prod—safely.
- GitOps with canaries: Use Argo Rollouts to shift by weight while watching p95 and error budget.
- SLO gates: Roll forward only if SLOs stay green; auto-abort on regression.
- Capacity tests: Weekly k6/Locust runs to 2x expected load, with DB and queue metrics on screen.
- Chaos drills: Kill a replica, degrade network by 200ms, watch autoscale and circuit breakers.
Argo Rollouts with Prometheus-based analysis:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio: { virtualService: api-vs, weight: 10 }
      steps:
      - setWeight: 10
      - pause: { duration: 300 }
      - analysis:
          templates:
          - templateName: p95-check
          args:
          - name: route
            value: /checkout
Analysis template snippet:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-check
spec:
  metrics:
  - name: p95-latency
    provider:
      prometheus:
        query: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{route="{{args.route}}"}[2m])))
    successCondition: result < 0.3
    failureLimit: 1
What moves the business: numbers we’ve actually hit
A few anonymized results we shipped at GitPlumbers:
Retail checkout (K8s/EKS, Node.js, Postgres, Redis, Kafka):
- p95: 1.8s -> 220ms, p99: 6.4s -> 700ms
- Error rate: 3.2% -> 0.18%
- Conversion: +14% during peak hour
- Infra: +8% cost with 11x throughput (autoscaled consumers and read replicas)
B2B analytics ingest (Golang, NATS, ClickHouse):
- Throughput: 120k -> 1.3M events/sec
- Backlog drain time: 4h -> 11m after KEDA + partitioning
- Support tickets: -62% (tail latency and retries fixed)
Fintech ledger (Java, PostgreSQL, Debezium):
- Write p95: 90ms -> 28ms via partitioning + pgbouncer
- MTTR: 42m -> 11m with outlier ejection and circuit breakers
- Revenue at risk during incident: -73%
These weren’t “rewrite it in Rust” wins. They were topology, autoscaling signals, and ruthless control planes. We also cleaned up a lot of AI-generated “vibe code” that hammered databases with N+1s. Don’t scale chaos; fix it first.
A pragmatic plan you can run this quarter
- Define SLOs by endpoint, plus business KPIs. Set budgets in dollars and error budgets in time.
- Make stateless real: externalize session, implement idempotency, bound fan-out, cap pools.
- Autoscale to load: HPA on concurrency/latency, KEDA on queue depth, enable Cluster Autoscaler/Karpenter.
- Protect state: read replicas, partition hot paths, configure pgbouncer/ProxySQL, size caches.
- Install governors: edge rate limits, circuit breakers, outlier detection, load shedding.
- Prove it: canaries with SLO gates, weekly load tests to 2x expected, quarterly chaos drills.
- Watch the tail: p99 per tenant/partition, queue lag, replica lag; fix hotspots before you add nodes.
If you want seasoned hands to sanity-check the plan—and rescue any AI-coded footguns—GitPlumbers can help. We’ll meet you where you are and make it boringly fast.
Key takeaways
- Scale stateless first, but make it truly stateless—externalize sessions, idempotency, and cache aggressively to protect downstreams.
- Stateful scale is about topology, not pods—partition, replicate, and route by key; measure p95 and tail at each hop.
- Autoscale to user-facing SLOs, not CPU—use request concurrency, queue depth, and latency as signals.
- Control the blast radius—rate-limit at the edge, apply circuit breakers and outlier detection, and shed lowest-value work first.
- Prove capacity before you need it—canary, load test in prod-like, and budget for failover capacity.
- Tie improvements to business metrics—conversion rate, revenue per minute, and support costs, not just system metrics.
Implementation checklist
- Define SLOs: p95 < X ms, error rate < Y%, availability Z9s, and Apdex threshold.
- Make services stateless: externalize session to `Redis`/`Memcached`, implement idempotency keys, decouple with queues.
- Autoscale to load: `HPA` on concurrency/latency, `KEDA` on queue depth, enable `Cluster Autoscaler`/`Karpenter`.
- Protect state: shard/partition, add read replicas, tune connection pools, enforce timeouts and retries with jitter.
- Control traffic: edge rate limiting, `Envoy`/`Istio` circuit breakers and outlier detection, implement load shedding.
- Capacity test: `k6`/`Locust` tests, canary with `Argo Rollouts`, chaos drills on replica loss and network jitter.
- Observe: golden signals in `Prometheus`/`Grafana`, SLO alerts, per-tenant and per-key tail latency tracking.
- Plan failure domains: AZ-aware routing, partition-aware hashing, runbooks with clear abort thresholds.
Questions we hear from teams
- How do I know if I should shard my database or add read replicas first?
- Add read replicas if reads dominate and you can route read-only traffic safely; monitor replica lag. If write p95 is high, or a few tenants/keys are hot, partition by a stable key (tenant, region) and keep writes local. Sharding is a topology choice—don’t do it until you’ve fixed N+1s, indexing, and connection pooling.
- Can I rely on CPU-based HPA for stateless services?
- Not for user latency SLOs. CPU is a lagging, noisy signal. Use concurrency (in-flight requests), request rate, and p95 latency as HPA inputs. For consumers, scale on queue depth or lag via KEDA.
- What’s the minimal setup to avoid cache stampedes?
- Enable per-key locks or request coalescing, use `proxy_cache_use_stale updating`/`stale-while-revalidate`, jitter TTLs, and set reasonable upstream timeouts. A small amount of coordination prevents a thundering herd during invalidations.
- How do I scale AI-generated code that’s already in prod?
- First, stop the bleeding: circuit breakers, rate limits, and bounded retries. Then perform a vibe code cleanup—remove N+1s, add connection pools, parameterize queries, and write SLO-driven tests. We’ve rescued several teams in under two weeks by pairing those fixes with better autoscaling signals.
- What tooling do you recommend to keep infra costs in check while scaling?
- Use `Prometheus`/`Grafana` for cost-aware dashboards, `Cluster Autoscaler`/`Karpenter` for bin-packing, `Terraform` to right-size instance classes, and load-based scaling to avoid idle capacity. Measure cost per 1k requests alongside p95.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
