The Autoscaler That Blew Our SLO: Horizontal Scale for Stateless vs Stateful That Actually Works
If your p95 slipped while the cluster doubled in size, you scaled the wrong thing. Here’s how we design horizontal scale that protects user-facing latency and revenue—across stateless frontends and the stateful beasts behind them.
If your autoscaler isn’t anchored to SLOs, it’s just an expensive random number generator.
When scale looks like success—until the SLO alarms fire
A few summers back, a consumer marketplace turned on their shiny HPA and watched pods triple during a promo push. Grafana looked like a ski slope. Finance cheered. Users? Not so much. p95 checkout latency slipped from 380ms to 1.2s, and conversion cratered 5%. We’d scaled CPU. The stateful parts—Postgres, Redis locks, Kafka consumers—flatlined. We traded infra spend for worse UX.
I’ve seen this fail at unicorns and banks alike. Horizontal scaling only works when you split the world into stateless paths you can replicate and stateful systems you scale by capacity, partitioning, and pooling. And you wire both to user-facing SLOs—not vanity cluster metrics.
Stateless: make replicas boring
Stateless is where you should be greedy with replicas. But you have to make it truly stateless and scale on the right signals.
- Externalize state: sessions to Redis/Memcached, feature flags via Unleash/LaunchDarkly, file uploads to S3/GCS.
- Idempotency and retries: safe to re-run after HPA churns pods.
- Scale on traffic, not just CPU: RPS, in-flight requests, and queue length map to user pain better than CPU percent.
- Protect tail: timeouts, budgets, circuit breakers (Envoy/NGINX), and per-endpoint limits.
- Spread risk: PDBs/anti-affinity so one node loss doesn’t nuke capacity.
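A minimal PodDisruptionBudget sketch for the frontend (the app: web-frontend label is an assumption; pair it with pod anti-affinity or topologySpreadConstraints so replicas actually land on different nodes):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb
spec:
  minAvailable: 80%   # voluntary disruptions (drains, upgrades) can't drop below 80% of replicas
  selector:
    matchLabels:
      app: web-frontend
```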
Kubernetes HPA v2 with custom metrics (via prometheus-adapter) is your friend:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 6
  maxReplicas: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"  # target 50 rps per pod
```
And wire in connection budgets and timeouts at the edge. With Envoy:
```yaml
static_resources:
  clusters:
    - name: web-backend
      connect_timeout: 0.25s
      circuit_breakers:
        thresholds:
          - max_connections: 2000
            max_pending_requests: 1000
            max_retries: 3
      outlier_detection:
        consecutive_5xx: 5
        interval: 5s
        base_ejection_time: 30s
```
Outcome you can bank: when we scaled a Node/Express API this way, p95 dropped 32% under 3x traffic, and we cut autoscale-to-steady-state time to under 45s.
Stateful: scale without sharding yourself into a corner
Stateful isn’t about replicas; it’s about capacity, partitioning, and being honest about consistency.
- Postgres/MySQL: start with read replicas and connection pooling. Graduate to Vitess (MySQL) or Citus (Postgres) if you truly need horizontal writes. If you can tolerate the extra write latency that consensus adds, CockroachDB or YugabyteDB can simplify multi-region writes.
- Redis: for scale-out, use Redis Cluster (hash slots). For HA without partitioning, Sentinel + R/W split. Don’t spray thousands of client conns; pool.
- Kafka: consumer parallelism is capped by partition count; a group can't usefully run more consumers than partitions. Scale partitions before you scale consumers.
- Cassandra/Scylla: add nodes and tune compaction_throughput_mb_per_sec, but model your partitions to avoid hotspots.
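For the Cassandra knob above, the same cap can also be adjusted at runtime with nodetool instead of editing cassandra.yaml (a sketch; the right value depends on disk headroom, and Scylla exposes its own tuning):

```bash
# Raise the compaction throughput cap on this node to 64 MB/s while new nodes stream in
nodetool setcompactionthroughput 64
# Confirm the effective setting
nodetool getcompactionthroughput
```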
Postgres: add a pooler before anything else. 1000 pods × 50 connections is how you DDoS your DB.
```ini
; pgbouncer.ini
[databases]
appdb = host=postgres-primary.example port=5432 dbname=appdb pool_size=500

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 50
query_timeout = 60        ; cancel queries that run longer than 60s
server_idle_timeout = 30
```
Promote reads to replicas via driver/router. For Kafka, increase partitions safely:
```bash
kafka-topics.sh \
  --bootstrap-server broker:9092 \
  --alter --topic orders --partitions 48
```
Note: increasing partitions doesn’t reshuffle existing keys—plan your keying to avoid hot shards, and rebalance consumers.
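After the bump, confirm the consumer group rebalanced and lag is draining (broker address and group name here match the examples in this post):

```bash
# Per-partition lag and current assignments for the orders consumer group
kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --describe --group orders-cg
```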
Redis Cluster remap example (off-peak, supervised):
```bash
redis-cli --cluster reshard 10.0.0.2:6379 \
  --cluster-from <source-node-id> \
  --cluster-to <target-node-id> \
  --cluster-slots 4096
```
Measured outcomes we’ve implemented:
- Moving a checkout’s read-heavy flows to replicas + PgBouncer: p95 from 420ms to 280ms, DB CPU -35%, checkout completion +2.3%.
- Kafka partitions from 12 → 48 with KEDA autoscaling consumers: order ingest p99 from 2.1s → 650ms at 4x burst.
User-facing metrics first: what to watch and how to autoscale on it
Infrastructure metrics are lagging indicators. Scale on the stuff users feel.
- p95/p99 latency per critical endpoint (e.g., /api/checkout, /search).
- Error rate on those endpoints (HTTP 5xx, gRPC non-OK).
- RPS and in-flight requests per pod.
- Queue depth / consumer lag for async paths.
Prometheus recording rule for p95 (for SLO checks, not direct HPA input):
```yaml
# prometheus-rule.yaml
record: http_server_request_duration_seconds:p95
expr: histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket{job="web-frontend"}[2m])) by (le))
```
Use HPA for CPU/RAM and request rate. Use KEDA for event-driven autoscaling on queues and Kafka lag:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    name: orders-consumer
  minReplicaCount: 2
  maxReplicaCount: 80
  cooldownPeriod: 120
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: broker-1:9092,broker-2:9092
        consumerGroup: orders-cg
        topic: orders
        lagThreshold: "5000"            # scale when lag > 5k
        activationLagThreshold: "500"   # don’t flap on tiny spikes
```
For web in-flight autoscaling, expose a gauge like inflight_requests and use the HPA Pods metric. Keep scale-up stabilization at 0–30s; scale-down at 5–10 minutes.
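Wired through prometheus-adapter, that gauge slots into the metrics list of the web-frontend HPA above as a Pods metric (a sketch; the metric name and per-pod target are assumptions to tune against your concurrency budget):

```yaml
# added under spec.metrics of the web-frontend HPA shown earlier
- type: Pods
  pods:
    metric:
      name: inflight_requests     # gauge your app exposes; name is an assumption
    target:
      type: AverageValue
      averageValue: "20"          # assumed budget of ~20 concurrent requests per pod
```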
Load shedding, backpressure, and queues: survive the stampede
If everything scales, nothing does. You need brakes.
- Queue hot paths you can tolerate being async (webhooks, image processing, email). Scale workers on backlog/lag.
- Rate limit at the edge per IP/API key; hard-fail early with a friendly 429.
- Deadlines: pass budgets through calls; cancel work you can’t finish in time.
- Circuit breakers: let the slowest dependency fail fast; you’re protecting your p99.
NGINX rate limiting example:
```nginx
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
  location /api/checkout {
    limit_req zone=perip burst=20 nodelay;
    proxy_read_timeout 800ms;
    proxy_connect_timeout 250ms;
  }
}
```
Java with Resilience4j for downstream calls:
```java
// io.github.resilience4j.* and java.util.concurrent imports omitted for brevity.
// Assumes `client`, `req`, and an ExecutorService `executor` already exist in scope.
var cb = CircuitBreaker.ofDefaults("payments");
var rt = RateLimiter.ofDefaults("payments");
var timeLimiter = TimeLimiter.of(Duration.ofMillis(800)); // hard 800ms budget
// Guard the downstream call with the circuit breaker and rate limiter.
var guarded = Decorators.ofSupplier(() -> client.charge(req))
    .withCircuitBreaker(cb)
    .withRateLimiter(rt)
    .decorate();
// Enforce the budget off-thread; throws on timeout, open circuit, or rate-limit rejection.
var result = timeLimiter.executeFutureSupplier(
    () -> CompletableFuture.supplyAsync(guarded, executor));
```
We’ve seen 40–60% p99 improvements on spiky traffic simply by shedding non-critical requests and pushing slow work to a queue.
Real results: what “10 ms faster” buys you
- At a B2C fintech, shaving 120ms off p95 on auth and balance reads increased session length +7% and reduced support chats by 18%—an actual OpEx reduction, not just a nicer graph.
- For a retail app’s Black Friday, we moved to KEDA on Kafka lag and added Envoy circuit breakers. Outcome: 0 budget burn on the cart SLO; infra spend +28% instead of +110% YoY; conversion up 3.1% under peak.
- A media API adopted Redis Cluster and per-key hashing for trending content. Cache hit rate +14 points; egress cost -22%; p50 20ms → 9ms, p99 600ms → 210ms.
Small latency wins compound. As Amazon and Google have both published, +100ms delays cost measurable revenue. You don’t need to beat physics—just protect tail latency where it impacts money.
The playbook we run at GitPlumbers
- Baseline with user-first SLOs: define p95/p99 and error budgets per key flow. Wire SLO burn alerts (see the alert sketch after this list).
- Trace and profile: OpenTelemetry + Jaeger, eBPF (parca, pixie) to find hot code, lock contention, and syscalls.
- Split the graph: list stateless vs stateful, synchronous vs async. Identify which stateful constraints are primary (IO, lock, partition).
- Right-size stateless: HPA on RPS and in-flight; PDBs; tune readiness; keep startup <5s so HPA can actually help.
- Secure stateful capacity: add PgBouncer; promote reads; scale Kafka partitions; plan Redis Cluster slots.
- Add backpressure: queues, deadlines, circuit breakers, and rate limits. Prove tail protection in load tests.
- Automate and watch: GitOps via ArgoCD; progressive deploys via Argo Rollouts/Flagger; dashboards for SLO, backlog/lag, in-flight.
- Chaos and game days: kill nodes, add 200ms RTT, drop a replica. Verify SLO holds and autoscalers react <60s.
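For the burn alerts in step one, a minimal fast-burn rule looks like this (a sketch for a 99.9% availability SLO; slo:checkout_error_ratio:rate5m and :rate1h are assumed recording rules over your checkout 5xx ratio, and the 14.4x factor is the standard multi-window burn-rate math for a 30-day SLO):

```yaml
# prometheus alert sketch: page when checkout burns error budget ~14x faster than sustainable
groups:
  - name: slo-burn
    rules:
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          slo:checkout_error_ratio:rate5m > (14.4 * 0.001)
          and
          slo:checkout_error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout is burning error budget ~14x too fast; check p95/p99 and 5xx now"
```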
We’ll pair with your team to ship this in 2–6 weeks, not quarters.
What I’d do differently next time
- Design for pooling on day one. PgBouncer (or RDS Proxy) would have saved three incidents.
- Pick one sharding key and live with it. We lost weeks re-keying Kafka and Redis because a “balanced” key ignored access patterns.
- Autoscale on fewer, better signals. RPS + in-flight + lag beat the 17-metric Frankenstein.
- Set budgets, not just limits. Timeouts without deadlines just move the pain.
- Make scale a release gate. No feature ships without a load test proving SLO under 2x baseline traffic.
Key takeaways
- Scale stateless paths on signals users feel (p95 latency, in-flight requests, RPS), not just CPU.
- Stateful scaling ≠ replicas. Use read replicas, partitioning, and connection pooling to avoid hot primaries.
- Autoscaling must be SLO-backed and include load shedding/backpressure to protect tail latency.
- Use queues to decouple bursts and scale workers on backlog/lag with KEDA.
- Measure the business, not just the cluster: track conversion, retention, and error budgets alongside infra metrics.
Implementation checklist
- Define SLOs for p95/p99 and error budgets per critical user path.
- Instrument RPS, in-flight requests, queue depth, and consumer lag as autoscaling inputs.
- Make stateless services idempotent; externalize sessions and caches.
- Introduce read replicas and a pooler (PgBouncer) before sharding.
- Set circuit breakers, timeouts, and queues to protect tail latency.
- Use KEDA for event-driven scaling; use HPA for CPU/RAM but bias toward request signals.
- Run load tests (k6/vegeta) and game days; verify scale-ups in <60s and tail latency protection.
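As a starting point for that last item, a vegeta run against the checkout path might look like this (URL, rate, and payload file are placeholders; hold roughly 2x baseline and read p95/p99 off the report while timing the scale-up):

```bash
# Hold ~2x baseline load on checkout for 10 minutes, then print latency percentiles
echo "POST https://staging.example.com/api/checkout" | \
  vegeta attack -rate=400 -duration=10m -body=checkout.json \
    -header "Content-Type: application/json" | \
  vegeta report -type=text
```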
Questions we hear from teams
- How do I know if a service is safe to horizontally scale as stateless?
- Check for externalized state (sessions, cache, files), idempotent handlers, and zero dependency on local disk or in-memory sticky data. If a single request can land on any replica and succeed, you’re stateless enough.
- What’s the first move to scale Postgres without a full re-architecture?
- Introduce PgBouncer in transaction mode, add read replicas, and route read-only queries there. This resolves connection storms and offloads 30–60% of read traffic in most CRUD apps.
- Should I scale on CPU or latency?
- Use CPU as a guardrail, but scale primary on load proxies users feel: RPS, in-flight requests, and backlog/lag. Latency is noisy for direct autoscaling but perfect for SLOs and alerting.
- When do I need to shard writes?
- Only after you’ve exhausted read replicas, pooling, and query/index tuning. If write QPS saturates a single primary or you need multi-region writes with low RTO, consider Vitess/Citus or a distributed SQL option like CockroachDB.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
