Kubernetes Added 200 Pods. Postgres Added 600ms: Horizontal Scale That Holds at P95
Your stateless tier will scale until your state makes it cry. Here’s how to design horizontal scale for both—and prove the business impact with user-facing metrics.
“Your pods will happily scale you straight into a database outage. Design for state, or enjoy your 3 a.m. page.”
The incident you’ve lived through
We shipped a “stateless” microservice, dialed the Deployment replicas from 20 to 220, and watched the dashboard go green. Then checkout P95 jumped from 280ms to 900ms. Why? Postgres went from 1.5k to 9k connections, CPU pegged, WAL backlog grew, and vacuum got starved. Requests timed out, retries cascaded, carts double-charged. I’ve seen this movie at two unicorns and one public retailer. Your pods scale. Your data doesn’t—unless you design for it.
This is the guide I wish I’d handed those teams before they learned the hard way. It’s pragmatic, tool-specific, and tied to user-facing metrics and business impact.
The only metrics that matter to the business
If you can’t tie scaling to these, you’re just burnishing dashboard art:
- P95/P99 latency of key user journeys (e.g., “Place Order”) — every 100ms at checkout usually moves conversion 1–3%.
- Error rate (HTTP 5xx/4xx for eligible endpoints) — cart abandonment spikes with >1% errors.
- Throughput at SLO — sustained RPS while keeping P95 under SLO thresholds.
- MTTR during incidents — can you shed load and recover in minutes, not hours?
- Cost per successful request — infra + licenses divided by good requests. This is how CFOs judge you.
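Quick math on that last one, with made-up dollar figures purely for illustration:
// Cost per successful request: total spend over requests that actually succeeded.
// All numbers below are hypothetical.
const infraSpendUsd = 42_000;       // monthly compute, storage, network
const licenseSpendUsd = 8_000;      // monthly licenses/SaaS
const totalRequests = 900_000_000;  // monthly requests
const errorRate = 0.004;            // 0.4% failed

const goodRequests = totalRequests * (1 - errorRate);
const costPerSuccess = (infraSpendUsd + licenseSpendUsd) / goodRequests;
console.log(`$${(costPerSuccess * 1000).toFixed(3)} per 1k successful requests`);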
You’ll optimize for these using active load tests (not “it felt fast on staging”) and guard rollouts with SLOs, not vibes.
Stateless scaling that doesn’t torch your state
If your web/API tier is truly stateless, it should be boring to scale. Here’s what actually works:
- Nuke sticky sessions. Ingress cookies feel harmless until HPA shifts traffic and half your pods go cold. Set sessionAffinity: None and move state to Redis or stateless JWTs.
- Idempotency first. Expect retries from Envoy/Nginx, client SDKs, and message queues. Use an Idempotency-Key header or deterministic request IDs when writing to state (see the sketch after this list).
- Throttle at the edge. Per-client rate limits stop noisy neighbors from stampeding your DB.
- Connection and concurrency caps. Your app should never be able to create more DB connections than the DB can handle. Cap worker pools and HTTP client pools.
- Autoscale on demand, not CPU. CPU lies for IO-bound services. Use RPS, queue depth, or latency as the signal.
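For the idempotency bullet, here’s a minimal sketch assuming Express and ioredis; createOrder, the Redis key names, and the TTLs are placeholders for your own write path:
// Minimal idempotency guard for a write endpoint (illustrative names).
import express from 'express';
import Redis from 'ioredis';

const app = express();
app.use(express.json());
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

app.post('/orders', async (req, res) => {
  const key = req.header('Idempotency-Key');
  if (!key) return res.status(400).json({ error: 'Idempotency-Key required' });

  // Reserve the key for 24h; NX means only the first request wins.
  const reserved = await redis.set(`idem:${key}`, 'pending', 'EX', 86_400, 'NX');
  if (!reserved) {
    const cached = await redis.get(`idem:${key}:response`);
    // Retry arrived: return the stored result, or 409 while the first attempt is in flight.
    return cached ? res.status(200).send(cached) : res.status(409).json({ error: 'in progress' });
  }

  const order = await createOrder(req.body); // placeholder for your actual write path
  await redis.set(`idem:${key}:response`, JSON.stringify(order), 'EX', 86_400);
  res.status(201).json(order);
});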
Example: autoscale a go-api Deployment by RPS with Prometheus Adapter in Kubernetes:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: go-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: go-api
  minReplicas: 10
  maxReplicas: 300
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "50" # target 50 RPS per pod
If you don’t have a clean per-pod RPS metric, use KEDA with queue depth:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer
spec:
  scaleTargetRef:
    kind: Deployment
    name: orders-consumer
  minReplicaCount: 2
  maxReplicaCount: 200
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/orders
        queueLength: "500" # 1 replica per 500 messages
Also, cache like an adult:
- CDN edge cache for GETs with correct Cache-Control and Vary headers.
- Request coalescing (single flight) to stop cache misses from stampeding origins (sketch below).
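Request coalescing sounds fancy but is a few lines if you do it in-process; a minimal sketch (in-memory, per instance, names are illustrative):
// Coalesce concurrent cache misses for the same key into one origin call.
const inFlight = new Map<string, Promise<unknown>>();

async function singleFlight<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>; // someone is already fetching this key

  const p = fetcher().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Usage: a thousand concurrent misses for "product:42" become one origin/DB call.
// const product = await singleFlight(`product:${id}`, () => loadProductFromDb(id));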
What this buys you: we’ve taken P95 from 800ms to 220ms on a Node.js API just by fixing sticky sessions, adding pgbouncer, and scaling on RPS instead of CPU—conversion went up 3.1%, and cost/request dropped 18%.
Stateful scaling: reads are easy, writes are honest
Datastores don’t horizontally scale by vibes. There’s a progression that works in the real world:
- Connection pooling in front of the DB (e.g., pgbouncer transaction pooling for Postgres) to keep connection storms from killing it.
- Read replicas with correct routing (read-mostly traffic off primary) and bounded replication lag.
- Partitioning/sharding by tenant or time to keep hot sets small.
- Queue the write path when possible (CQRS) and project read models.
Postgres example: add pgbouncer and stop the connection explosion.
# pgbouncer.ini
[databases]
app = host=postgres-primary port=5432 dbname=app
[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
server_reset_query = DISCARD ALL
Then point apps at pgbouncer and cap their pools below default_pool_size.
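For the read-replica routing mentioned above, the detail that bites people is replication lag; a rough sketch with node-postgres (pool endpoints and the 2-second lag budget are assumptions, and in practice you’d cache the lag check rather than run it per query):
// Route reads to a replica only when its replication lag is within budget.
import { Pool } from 'pg';

const primary = new Pool({ connectionString: process.env.PRIMARY_URL, max: 20 });
const replica = new Pool({ connectionString: process.env.REPLICA_URL, max: 40 });

const MAX_LAG_SECONDS = 2;

async function replicaIsFresh(): Promise<boolean> {
  // pg_last_xact_replay_timestamp() is only meaningful on a standby.
  const { rows } = await replica.query(
    `SELECT COALESCE(EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0) AS lag`
  );
  return Number(rows[0].lag) < MAX_LAG_SECONDS;
}

// Reads fall back to the primary when the replica is too far behind.
export async function readQuery(text: string, params: unknown[] = []) {
  const pool = (await replicaIsFresh()) ? replica : primary;
  return pool.query(text, params);
}

// Writes always hit the primary (ideally via pgbouncer on :6432).
export async function writeQuery(text: string, params: unknown[] = []) {
  return primary.query(text, params);
}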
Partition a hot table by tenant to keep indexes small and vacuums fast:
CREATE TABLE events (
tenant_id bigint NOT NULL,
occurred_at timestamptz NOT NULL,
payload jsonb NOT NULL
) PARTITION BY HASH (tenant_id);
CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ... p1..p7
Kafka example: if your consumer lag spikes under load, you probably don’t have enough partitions or your keying is skewed.
# Increase partitions safely (note: ordering guarantees per key only)
kafka-topics --bootstrap-server :9092 \
  --alter --topic orders --partitions 48
Redis example: switch to Redis Cluster for horizontal keyspace scaling and avoid global locks. Use EVALSHA for atomic hot-path operations, as sketched below.
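A sketch of that Redis hot path, assuming ioredis (the rate-limit script, key names, and limits are illustrative); defineCommand loads the Lua once and re-invokes it via EVALSHA:
// Atomic per-user rate limit on Redis Cluster via a Lua script.
import Redis from 'ioredis';

const cluster = new Redis.Cluster([{ host: 'redis-0', port: 6379 }]);

// INCR + EXPIRE as one atomic unit; no race between the two calls.
cluster.defineCommand('rateLimit', {
  numberOfKeys: 1,
  lua: `
    local current = redis.call('INCR', KEYS[1])
    if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
    return current
  `,
});

export async function allowRequest(userId: string, limit = 100, windowSec = 60) {
  // Hash tag {userId} keeps the key on one cluster slot.
  const count = await (cluster as any).rateLimit(`rl:{${userId}}`, windowSec);
  return count <= limit;
}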
Write scaling patterns that actually work:
- CQRS + outbox. Write once to a local table, a job ships it to a log (Kafka). Consumers build read models. Your primary write path stays short and reliable.
- Sagas for multi-entity consistency. Compensating actions beat distributed 2PC in 2025.
- Tenant isolation. Separate noisy enterprise tenants into their own shard/cluster.
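For the CQRS + outbox item, the part teams get wrong is atomicity; a minimal sketch assuming node-postgres and an outbox table that a separate relay tails into Kafka (schema and names are assumptions):
// CQRS + outbox: the order and its event commit in ONE local transaction.
// A separate relay process tails the outbox table and publishes to Kafka.
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.PRIMARY_URL });

export async function placeOrder(orderId: string, payload: object) {
  const client = await db.connect();
  try {
    await client.query('BEGIN');
    await client.query(
      'INSERT INTO orders (id, body) VALUES ($1, $2)',
      [orderId, JSON.stringify(payload)]
    );
    // Same transaction: the event can never be lost or published without the write.
    await client.query(
      `INSERT INTO outbox (aggregate_id, event_type, payload) VALUES ($1, 'order_placed', $2)`,
      [orderId, JSON.stringify(payload)]
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}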
Measured wins we’ve seen:
- Adding pgbouncer + read routing cut primary CPU 40% and dropped checkout P95 by 300ms.
- Hash partitioning a 2B-row audit table reduced index bloat by 60% and vacuum time by 70%.
- Increasing Kafka partitions from 12 to 48 lifted sustained throughput from 30k to 110k msgs/s at P99 < 200ms end-to-end.
Backpressure, circuit breakers, and the art of not falling over together
Horizontal scale without control is a DDoS on your own state.
- Queue everything that can wait. Scale consumers by queue depth (KEDA works). Keep the sync path thin.
- Circuit breakers to stop hammering dying dependencies. Envoy/Istio make this boring.
- Retries with jitter and caps plus idempotency keys. No unbounded retries.
- Admission control: return 429/503 with a Retry-After header under saturation; better a fast “no” than a slow “maybe”.
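Bounded retries with full jitter take about twenty lines; a sketch where the attempt cap, base delay, and ceiling are assumptions you should tune per dependency:
// Capped retries with full jitter; pair with an Idempotency-Key so retried writes are safe.
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 100;
const MAX_DELAY_MS = 2_000;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

export async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === MAX_ATTEMPTS) break;
      // Full jitter: random delay in [0, min(cap, base * 2^attempt)]
      const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
      await sleep(Math.random() * ceiling);
    }
  }
  throw lastError; // bubble up after the cap; let the breaker/admission control take over
}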
Envoy example: trip the breaker when upstreams misbehave.
clusters:
  - name: payments
    type: STRICT_DNS
    connect_timeout: 0.25s
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
        - max_connections: 1000
          max_pending_requests: 2000
          max_retries: 3
    outlier_detection:
      consecutive_5xx: 5
      interval: 2s
      base_ejection_time: 30s
      max_ejection_percent: 50
Result: in one fintech, this held P95 < 400ms during a third-party PSP brownout while error rate stayed <0.8% and MTTR was 9 minutes instead of a 2-hour meltdown.
Test like prod, roll out with SLO guardrails
If your “perf test” is 10 users on staging, you’re writing fiction. Make it boringly real:
- Load test with production traffic models and data sizes. We use k6 and vegeta.
- SLOs that map to customer journeys (e.g., checkout: P95 < 500ms, errors < 1%).
- GitOps + canaries via Argo Rollouts; roll back automatically on SLO burn.
Minimal k6 script to validate P95 and errors:
// k6 run script.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  vus: 300,
  duration: '10m',
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<500'],
  },
};

export default function () {
  const res = http.post(`${__ENV.API}/checkout`, JSON.stringify({ cartId: 'demo' }), {
    // Unique per VU + iteration so retries are safe but distinct requests aren't deduped
    headers: { 'Content-Type': 'application/json', 'Idempotency-Key': `${__VU}-${__ITER}` },
  });
  check(res, { 'status is 2xx': (r) => r.status >= 200 && r.status < 300 });
  sleep(1);
}
Wire rollouts to metrics, not gut feel. With Argo Rollouts + Prometheus, you can halt if P95 or error rate trips.
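If you’re not on Argo Rollouts yet, the same gate works as a plain pipeline step; a rough sketch that queries Prometheus and fails the deploy when the canary burns the SLO (the Prometheus URL, queries, and thresholds are assumptions):
// Fail a canary step when P95 latency or error rate breaches the SLO.
const PROM = process.env.PROM_URL ?? 'http://prometheus:9090';

async function promScalar(query: string): Promise<number> {
  const res = await fetch(`${PROM}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  return Number(body.data.result[0]?.value[1] ?? 0);
}

async function main() {
  const p95 = await promScalar(
    `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{canary="true"}[5m])) by (le))`
  );
  const errRate = await promScalar(
    `sum(rate(http_requests_total{canary="true",code=~"5.."}[5m])) / sum(rate(http_requests_total{canary="true"}[5m]))`
  );

  if (p95 > 0.5 || errRate > 0.01) {
    console.error(`SLO breach: p95=${p95.toFixed(3)}s err=${(errRate * 100).toFixed(2)}%`);
    process.exit(1); // CI/CD treats this as "abort and roll back"
  }
  console.log('Canary within SLO, promoting.');
}

main();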
Cost sanity: track $/success alongside P95. We’ve killed “optimizations” that made P95 5% better but cost/request 40% worse.
A pragmatic playbook you can run next sprint
- Map journeys to SLOs. Pick 3 endpoints that make money. Set P95 and error budgets.
- Kill sticky sessions. Move session state to Redis or JWT. Verify with a rolling restart.
- Autoscale on demand. Add HPA/KEDA using RPS or queue depth.
- Install pgbouncer. Cap app pools. Add at least one read replica and route reads.
- Introduce backpressure. Circuit breakers and bounded retries at the edge.
- Partition the hottest table. Tenant or time; prove it with index size/scan time drops.
- Load test and canary. k6 script + Argo Rollouts tied to P95/error thresholds.
What to expect when done right:
- 20–60% P95 reduction on critical paths.
- 2–5x sustained RPS at the same or lower error rate.
- 10–30% lower cost per successful request.
- MTTR measured in minutes because breakers and queues isolate failure domains.
And please: watch for AI-generated “vibe code” that spawns N+1 queries and per-request new DB clients. We’re rescuing more of that this year than anything else. A one-hour review catches most of it before you scale the blast radius.
Key takeaways
- Stateless tiers scale trivially—until they blow up your stateful dependencies. Design both together.
- Optimize for user-facing metrics first: P95 latency, error rate, and cost per successful request.
- Autoscale on real demand signals (RPS, queue depth, saturation), not just CPU.
- For databases: start with connection pooling and read replicas, then move to partitioning/sharding when writes hurt.
- Backpressure, circuit breakers, and idempotency are mandatory to prevent cascading failures.
- Prove impact with load tests and SLOs; wire rollouts to metrics, not vibes.
Implementation checklist
- Kill sticky sessions; store session in Redis or use stateless JWT.
- Set HPA/KEDA based on RPS or queue length, not CPU alone.
- Add pgbouncer (transaction pooling) in front of Postgres; keep total app pool connections at or below pgbouncer's max_client_conn.
- Introduce read replicas and route read-heavy traffic accordingly.
- Partition tables by tenant or time; avoid hot shards with consistent hashing.
- Scale consumers on queue depth; implement circuit breakers and retries with backoff + idempotency keys.
- Load test with k6; guard rollouts with P95 and error-rate SLOs via Argo Rollouts.
- Watch cost-per-success; prune noisy dependencies and unnecessary fan-out.
Questions we hear from teams
- How do I know if I’m autoscaling on the wrong signal?
- If CPU-based HPA scales but P95 won’t budge—or gets worse—you’ve picked the wrong signal. Switch to RPS, queue depth, or saturation (e.g., DB connection wait time). Tie it to user-facing latency, not host metrics.
- What’s the first stateful change to make if I can only do one?
- Add `pgbouncer` (transaction pooling) in front of Postgres or a connection pooler for your DB, then cap application connection pools. This alone often cuts P95 100–300ms under load.
- Do I need sharding right now?
- Probably not. Exhaust pooling, read replicas, and partitioning first. Shard when write amplification or working-set size still crushes the primary, or when a single noisy tenant impacts others.
- How do I prove business impact from these changes?
- Run a controlled canary with load generation. Track P95, error rate, throughput-at-SLO, and cost per successful request. Tie conversion/retention deltas to the canary window. If you can’t measure it, you didn’t ship it.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
