Horizontal Scale Without Regret: Stateless vs Stateful, What Actually Works
What I learned scaling a checkout from 15 rps to 2,500 rps without blowing up p95 or the database.
Scale stateless by default; scale state by topology. If CPU is your north star, you’re buying hardware to keep slow code slow.
The spike that broke checkout—and how we fixed it
Black Friday, years ago. Retail client. Marketing turned the firehose on and our stateless API tier scaled like a champ. Then p95 blew past 1.8s because the “stateless” tier was masking a single Postgres primary wheezing under 35k connections from vibey, AI-glued code calling SELECT * in loops. We didn’t need more pods. We needed topology. Two weeks later we shipped: session externalization, idempotency keys, read replicas, partitioned hot tables, and HPA signals tied to concurrency and queue depth. p95 went from 1.8s to 220ms at 2,500 rps sustained, with error rate <0.2%. Revenue/minute up 14%. That’s what horizontal scale looks like when you tie it to customer experience.
Measure what the user feels, not what the node feels
If you scale to CPU, you’ll buy a lot of hardware to keep slow code slow. Tie everything to user-facing metrics and business impact.
- Primary SLOs: p95 latency by endpoint, error rate, availability, Apdex. Track p99 for tail pain.
- Business signals: conversion rate, revenue/min, cart abandonment, DAU retention, LTV churn indicators.
- Golden signals: latency, traffic, errors, saturation—per hop (edge, service, DB, queue).
- Per-key/tier metrics: tenant keyed p95, partition lag, replica lag. The tail hides in hotspots.
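Apdex, mentioned in the SLO list above, is simple arithmetic: requests under the threshold T count as satisfied, requests under 4T count half as tolerating, the rest are frustrated. A minimal sketch (function name and shape are illustrative, not a library API):

```typescript
// Apdex = (satisfied + tolerating / 2) / total.
// "satisfied" means latency <= T; "tolerating" means latency <= 4T.
// T is your per-endpoint target, e.g. 300ms for /checkout.
function apdex(latenciesMs: number[], thresholdMs: number): number {
  if (latenciesMs.length === 0) return 1; // no traffic, no pain
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of latenciesMs) {
    if (ms <= thresholdMs) satisfied++;
    else if (ms <= 4 * thresholdMs) tolerating++;
  }
  return (satisfied + tolerating / 2) / latenciesMs.length;
}
```

With T = 300ms, a sample of [100, 200, 500, 2000] scores 0.625: two satisfied, one tolerating, one frustrated.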
Example SLO alert (Prometheus):
groups:
- name: slos
  rules:
  - record: service:latency_p95
    expr: histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
  - alert: LatencySLOViolation
    expr: service:latency_p95{route="/checkout"} > 0.3
    for: 10m
    labels:
      severity: page
    annotations:
      summary: p95 > 300ms on /checkout
Stateless that really scales: externalize, idempotize, and cap fan-out
“Stateless” is table stakes, but teams still trip on two things: hidden state and unbounded work per request.
- Externalize session: No sticky sessions. Use Redis/Memcached. Keep it tiny and TTL’d.
- Idempotency keys: Prevent duplicate writes when autoscaling and retries pile up.
- Bound fan-out: Cap downstream calls per request. Batch, coalesce, and cache.
- Connection pools: Limit DB and HTTP clients to sane pool sizes; shard pools per host.
- Autoscale to concurrency: CPU lies. Concurrency and latency tell the truth.
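The idempotency-key idea is easy to sketch. Below is a minimal in-process version; in production the state would live in Redis with a TTL and the key would come from a client-supplied header (commonly `Idempotency-Key`). All names here are illustrative:

```typescript
// First request with a key runs the handler and caches the result;
// retries with the same key replay the cached response instead of
// re-executing the write. A Map stands in for Redis here.
type HandlerResult = { status: number; body: string };

class IdempotencyStore {
  private done = new Map<string, HandlerResult>();
  private inFlight = new Map<string, Promise<HandlerResult>>();

  async run(key: string, handler: () => Promise<HandlerResult>): Promise<HandlerResult> {
    const cached = this.done.get(key);
    if (cached) return cached; // completed earlier: replay the response
    const pending = this.inFlight.get(key);
    if (pending) return pending; // concurrent duplicate: share the work
    const p = handler()
      .then((r) => {
        this.done.set(key, r); // only successful results are replayable
        return r;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

The in-flight map matters as much as the result cache: when a client retries before the first attempt finishes (the common case during autoscaling churn), both requests share one execution instead of double-charging.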
Kubernetes HPA on concurrency via nginx_ingress_controller_requests or service-level in_flight_requests:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 200
  metrics:
  - type: Pods
    pods:
      metric:
        name: in_flight_requests
      target:
        type: AverageValue
        averageValue: "40"
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      stabilizationWindowSeconds: 0
    scaleDown:
      stabilizationWindowSeconds: 300
Protect downstreams with Envoy circuit breakers and timeouts:
clusters:
- name: checkout
  connect_timeout: 0.2s
  circuit_breakers:
    thresholds:
    - max_connections: 2000
      max_pending_requests: 1000
      max_requests: 4000
  outlier_detection:
    consecutive_5xx: 5
    interval: 2s
    base_ejection_time: 30s
Cache at the edge to cut p95 in half:
location /catalog {
  proxy_cache catalog_cache;
  proxy_cache_valid 200 60s;
  proxy_cache_use_stale error timeout updating;
}
Result we repeatedly see: API tier p95 drops 30–60%, error budgets stop burning during marketing spikes, and infra spend stays flat because you’re avoiding database thrash.
Stateful without tears: partition, replicate, and route by key
Horizontal scale for stateful systems is topology: partitions, replicas, and the router that knows where to send traffic.
- Databases
  - Read replicas for fan-out reads, primary for writes. Route with a read/write split in the app or via a proxy.
  - Partitioning/sharding for hot tables. Use natural keys (tenant, region) or consistent hashing for balance.
  - Connection budget: Keep client pools small; let a proxy (e.g., pgbouncer) multiplex.
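The app-side read/write split can be sketched in a few lines. The `Pool` interface below is a stand-in for your driver's pool (e.g. `pg.Pool`); a real router would also pin reads-after-writes to the primary, which this sketch omits:

```typescript
// Writes (and anything inside a transaction) go to the primary;
// plain reads round-robin across replicas.
interface Pool {
  name: string;
  query(sql: string): Promise<unknown>;
}

class ReadWriteRouter {
  private next = 0;
  constructor(private primary: Pool, private replicas: Pool[]) {}

  poolFor(sql: string, inTransaction = false): Pool {
    const isRead = /^\s*select\b/i.test(sql);
    if (!isRead || inTransaction || this.replicas.length === 0) {
      return this.primary; // writes and transactional reads stay primary-bound
    }
    const replica = this.replicas[this.next % this.replicas.length];
    this.next++;
    return replica;
  }
}
```

Whether you do this in the app or in a proxy, monitor replica lag before routing reads away; stale reads on checkout paths are a feature flag, not an accident.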
PostgreSQL native partitioning to isolate hot tenants:
CREATE TABLE orders (
  tenant_id text NOT NULL,
  order_id uuid NOT NULL,
  created_at timestamptz NOT NULL,
  ...
  PRIMARY KEY (tenant_id, order_id)  -- Postgres requires the partition key in the PK
) PARTITION BY LIST (tenant_id);
CREATE TABLE orders_us_east PARTITION OF orders FOR VALUES IN ('us-east');
CREATE TABLE orders_eu_west PARTITION OF orders FOR VALUES IN ('eu-west');
CREATE INDEX ON orders_us_east (created_at DESC);
Connection pooling with pgbouncer:
[databases]
app = host=db-primary port=5432 dbname=app
[pgbouncer]
max_client_conn = 10000
default_pool_size = 50
pool_mode = transaction
- Caches
  - Move to Redis Cluster when keyspace > single-node memory or you see eviction churn.
  - Use hash tags to co-locate related keys: cart:{user123}:items.
- Queues/Streams
  - Size partition count to consumer parallelism headroom; monitor lag, not just throughput.
  - Use consumer groups; keep max.poll.interval.ms sane to avoid rebalances causing tail spikes.
Kafka topic with headroom for 10x consumers:
kafka-topics --create --topic events.checkout \
  --partitions 48 --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000
Routing by key (Node.js example) to keep affinity with shards:
// murmurhash: any stable string hash (e.g. the murmurhash npm package)
function shardFor(tenantId: string, shardCount = 16) {
  const hash = murmurhash(tenantId);
  return hash % shardCount; // use consistent hashing in prod
}
Trade-offs you must make explicit:
- Consistency vs availability: Can your cart be eventually consistent for 1–2s? Say it out loud.
- Repairability: Rebalancing shards will hurt. Budget for it. Automate with checksums and backfills.
- Failover capacity: Keep replicas warm and sized for N-1 failures; otherwise failover is just a slower outage.
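On the rebalancing point: the modulo routing shown above remaps almost every key when the shard count changes, which is a big part of why resharding hurts. A consistent-hash ring with virtual nodes remaps only the keys that lived on the shard that moved. A minimal sketch (FNV-1a stands in for murmurhash; vnode count and naming are illustrative):

```typescript
// Consistent-hash ring with virtual nodes. Removing a shard only remaps
// keys whose nearest-clockwise point belonged to that shard.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class HashRing {
  private ring: { point: number; shard: string }[] = [];

  constructor(shards: string[], vnodes = 100) {
    for (const shard of shards) {
      for (let v = 0; v < vnodes; v++) {
        this.ring.push({ point: fnv1a(`${shard}#${v}`), shard });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  shardFor(key: string): string {
    const h = fnv1a(key);
    // first point clockwise of h; wrap to the start if none is larger
    const hit = this.ring.find((n) => n.point >= h) ?? this.ring[0];
    return hit.shard;
  }
}
```

The linear scan keeps the sketch short; production rings binary-search a sorted point array. The payoff is the repairability budget: draining one shard touches ~1/N of keys, not all of them.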
Control planes that prevent 3 AM pages
Spikes are normal; uncontrolled spikes kill margins and SLOs. Put governors in place.
- Rate limiting at the edge (Envoy, Kong, NGINX, or cloud ALB/WAF). Allocate per-tenant budgets.
- Backpressure & queue depth scaling: Use KEDA to scale consumers on lag.
- Load shedding: Drop non-critical work first (recommendations, analytics) to save checkout.
- Retries with jitter and budgets: Bounded retries; otherwise you DDoS yourself.
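"Bounded retries" means two limits, not one: a per-call attempt cap and a service-wide retry budget so that a downstream brownout can't multiply traffic. A minimal sketch with full jitter (names and defaults are illustrative):

```typescript
// Each call retries at most maxAttempts times; the shared budget caps
// total retries across the service so retries can't amplify an outage.
class RetryBudget {
  constructor(private tokens: number) {}
  take(): boolean {
    if (this.tokens <= 0) return false;
    this.tokens--;
    return true;
  }
}

async function withRetries<T>(
  fn: () => Promise<T>,
  budget: RetryBudget,
  maxAttempts = 3,
  baseDelayMs = 50,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !budget.take()) throw err;
      // full jitter: sleep uniformly in [0, base * 2^attempt)
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

In practice the budget is refilled as a fraction of successful request volume (Finagle and Envoy both ship this as "retry budgets"), so retries stay a single-digit percentage of traffic instead of a self-inflicted DDoS.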
KEDA scaling on Kafka lag:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-consumer
spec:
  scaleTargetRef:
    name: checkout-consumer
  pollingInterval: 5
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 200
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: checkout
      topic: events.checkout
      lagThreshold: "5000"
Istio outlier detection to auto-eject bad pods and reduce tail latency:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-dr
spec:
  host: api
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 2s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
Result: during flash sales we’ve seen 40–70% lower p99 without increasing spend, purely from ejecting slow instances and scaling consumers to lag instead of CPU.
Ship changes safely: canaries, SLO gates, and capacity tests
You don’t discover scale in staging. You earn it in prod—safely.
- GitOps with canaries: Use Argo Rollouts to shift by weight while watching p95 and error budget.
- SLO gates: Roll forward only if SLOs stay green; auto-abort on regression.
- Capacity tests: Weekly k6/Locust runs to 2x expected load, with DB and queue metrics on screen.
- Chaos drills: Kill a replica, degrade network by 200ms, watch autoscale and circuit breakers.
Argo Rollouts with Prometheus-based analysis:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        istio: { virtualService: api-vs, weight: 10 }
      steps:
      - setWeight: 10
      - pause: { duration: 300 }
      - analysis:
          templates:
          - templateName: p95-check
          args:
          - name: route
            value: /checkout
Analysis template snippet:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p95-check
spec:
  metrics:
  - name: p95-latency
    provider:
      prometheus:
        query: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{route="{{args.route}}"}[2m])))
    successCondition: result < 0.3
    failureLimit: 1
What moves the business: numbers we’ve actually hit
A few anonymized results we shipped at GitPlumbers:
Retail checkout (K8s/EKS, Node.js, Postgres, Redis, Kafka):
- p95: 1.8s -> 220ms, p99: 6.4s -> 700ms
- Error rate: 3.2% -> 0.18%
- Conversion: +14% during peak hour
- Infra: +8% cost with 11x throughput (autoscaled consumers and read replicas)
B2B analytics ingest (Golang, NATS, ClickHouse):
- Throughput: 120k -> 1.3M events/sec
- Backlog drain time: 4h -> 11m after KEDA + partitioning
- Support tickets: -62% (tail latency and retries fixed)
Fintech ledger (Java, PostgreSQL, Debezium):
- Write p95: 90ms -> 28ms via partitioning + pgbouncer
- MTTR: 42m -> 11m with outlier ejection and circuit breakers
- Revenue at risk during incident: -73%
These weren’t “rewrite it in Rust” wins. They were topology, autoscaling signals, and ruthless control planes. We also cleaned up a lot of AI-generated “vibe code” that hammered databases with N+1s. Don’t scale chaos; fix it first.
A pragmatic plan you can run this quarter
- Define SLOs by endpoint, plus business KPIs. Set budgets in dollars and error budgets in time.
- Make stateless real: externalize session, implement idempotency, bound fan-out, cap pools.
- Autoscale to load: HPA on concurrency/latency, KEDA on queue depth, enable Cluster Autoscaler/Karpenter.
- Protect state: read replicas, partition hot paths, configure pgbouncer/ProxySQL, size caches.
- Install governors: edge rate limits, circuit breakers, outlier detection, load shedding.
- Prove it: canaries with SLO gates, weekly load tests to 2x expected, quarterly chaos drills.
- Watch the tail: p99 per tenant/partition, queue lag, replica lag; fix hotspots before you add nodes.
If you want seasoned hands to sanity-check the plan—and rescue any AI-coded footguns—GitPlumbers can help. We’ll meet you where you are and make it boringly fast.
Key takeaways
- Scale stateless first, but make it truly stateless—externalize sessions, idempotency, and cache aggressively to protect downstreams.
- Stateful scale is about topology, not pods—partition, replicate, and route by key; measure p95 and tail at each hop.
- Autoscale to user-facing SLOs, not CPU—use request concurrency, queue depth, and latency as signals.
- Control the blast radius—rate-limit at the edge, apply circuit breakers and outlier detection, and shed lowest-value work first.
- Prove capacity before you need it—canary, load test in prod-like, and budget for failover capacity.
- Tie improvements to business metrics—conversion rate, revenue per minute, and support costs, not just system metrics.
Implementation checklist
- Define SLOs: p95 < X ms, error rate < Y%, availability Z9s, and Apdex threshold.
- Make services stateless: externalize session to `Redis`/`Memcached`, implement idempotency keys, decouple with queues.
- Autoscale to load: `HPA` on concurrency/latency, `KEDA` on queue depth, enable `Cluster Autoscaler`/`Karpenter`.
- Protect state: shard/partition, add read replicas, tune connection pools, enforce timeouts and retries with jitter.
- Control traffic: edge rate limiting, `Envoy`/`Istio` circuit breakers and outlier detection, implement load shedding.
- Capacity test: `k6`/`Locust` tests, canary with `Argo Rollouts`, chaos drills on replica loss and network jitter.
- Observe: golden signals in `Prometheus`/`Grafana`, SLO alerts, per-tenant and per-key tail latency tracking.
- Plan failure domains: AZ-aware routing, partition-aware hashing, runbooks with clear abort thresholds.
Questions we hear from teams
- How do I know if I should shard my database or add read replicas first?
- Add read replicas if reads dominate and you can route read-only traffic safely; monitor replica lag. If write p95 is high, or a few tenants/keys are hot, partition by a stable key (tenant, region) and keep writes local. Sharding is a topology choice—don’t do it until you’ve fixed N+1s, indexing, and connection pooling.
- Can I rely on CPU-based HPA for stateless services?
- Not for user latency SLOs. CPU is a lagging, noisy signal. Use concurrency (in-flight requests), request rate, and p95 latency as HPA inputs. For consumers, scale on queue depth or lag via KEDA.
- What’s the minimal setup to avoid cache stampedes?
- Enable per-key locks or request coalescing, use `proxy_cache_use_stale updating`/`stale-while-revalidate`, jitter TTLs, and set reasonable upstream timeouts. A small amount of coordination prevents a thundering herd during invalidations.
- How do I scale AI-generated code that’s already in prod?
- First, stop the bleeding: circuit breakers, rate limits, and bounded retries. Then perform a vibe code cleanup—remove N+1s, add connection pools, parameterize queries, and write SLO-driven tests. We’ve rescued several teams in under two weeks by pairing those fixes with better autoscaling signals.
- What tooling do you recommend to keep infra costs in check while scaling?
- Use `Prometheus`/`Grafana` for cost-aware dashboards, `Cluster Autoscaler`/`Karpenter` for bin-packing, `Terraform` to right-size instance classes, and load-based scaling to avoid idle capacity. Measure cost per 1k requests alongside p95.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
