Stop Buying Bigger Boxes: Database Optimizations That Actually Scale With User Growth

User-facing performance is a database problem 8 times out of 10. Here’s the playbook we use to cut p95s and lift conversion without rewriting your app.

“Most ‘database problems’ are really product problems disguised as missing indexes and too many connections.”

The moment it hurts

Two Novembers ago, a retail client’s Black Friday preview doubled traffic overnight. App servers had headroom, but p95 checkout latency spiked from 780 ms to 1.8 s. CPU on the DB hovered at 35%, so infra said “we’re fine.” They weren’t. Connection spikes, N+1s, and an unindexed JSONB filter were quietly kneecapping throughput. We didn’t migrate, didn’t re-architect, and didn’t buy bigger RDS. We fixed the database.

  • Result in 10 days: p95 checkout from 1.8 s to 420 ms, error rate −73%, conversion +3.1%. Same instance size.

Start with the metric users feel

You don’t optimize the database; you optimize user outcomes.

  • KPIs that matter: p95/p99 endpoint latency, Apdex, TTFB, error rate, queue time (DB wait), and per-flow SLAs (search < 300 ms, checkout < 500 ms).
  • Translate to DB: top slow queries, lock wait time, rows read vs. returned, buffer cache hit ratio, replication lag.

Instrument before you touch anything:

-- PostgreSQL: find the real culprits
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- (on Postgres 12 and earlier these columns are total_time / mean_time / stddev_time)
SELECT query, calls, mean_exec_time, stddev_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
# Prometheus examples
# p95 latency on /checkout (assuming a client-side histogram)
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{route="/checkout"}[5m]))
)

# Postgres locks by mode (postgres_exporter)
sum by (mode) (pg_locks_count)

If you can’t tie a query to a user path and a dollar number, don’t optimize it first.

Kill N+1s and index for actual filters

N+1s and mismatched indexes are the 80/20. I’ve seen teams spend months on shards while a single composite index would’ve dropped p95 by half.

  1. Fix N+1s where they show up in traces
  • Enable DB call spans. You’ll see 20–60 tiny selects per request.
  • Batch and prefetch related data in the ORM.
// TypeORM / Node: batch load line items instead of per-row fetch
const orders = await repo.find({ where: { userId }, relations: { lineItems: true } });
// Spring Data JPA (Hibernate): JOIN FETCH to avoid N+1
@Query("SELECT o FROM Order o JOIN FETCH o.lineItems WHERE o.user.id = :userId")
List<Order> findWithItems(@Param("userId") Long userId);
  2. Create composite/covering indexes that match hot paths
  • If queries filter on (tenant_id, status, created_at) and return a small projection, build exactly that, in that order.
-- Postgres composite + covering index for list page
CREATE INDEX CONCURRENTLY idx_orders_tenant_status_created
ON orders (tenant_id, status, created_at DESC)
INCLUDE (total_amount, currency);
  • For JSONB filters, use GIN with appropriate operators.
CREATE INDEX CONCURRENTLY idx_products_gin_tags ON products USING GIN (tags jsonb_path_ops);
  3. Rewrite “SELECT *” and wide joins
  • Only select columns you render. Smaller tuples mean fewer page reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, total_amount, currency
FROM orders
WHERE tenant_id = $1 AND status = 'OPEN'
ORDER BY created_at DESC
LIMIT 50;
  • Track “rows read per row returned” (aim for < 10 for most endpoints). If it’s 1,000+, you’re scanning.
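A tiny helper makes the ratio concrete. This is a sketch with an assumed `QueryStats` shape; feed it numbers lifted from `EXPLAIN (ANALYZE, BUFFERS)` or `pg_stat_statements`, and the threshold of 10 matches the rule of thumb above.

```typescript
// Hypothetical helper: compute "rows read per row returned" and flag queries
// that scan instead of seek. Shape and threshold are assumptions, not a library API.
interface QueryStats {
  query: string;
  rowsRead: number;     // rows the executor touched
  rowsReturned: number; // rows sent back to the client
}

function scanRatio(s: QueryStats): number {
  return s.rowsRead / Math.max(1, s.rowsReturned);
}

function flagScanners(stats: QueryStats[], threshold = 10): QueryStats[] {
  return stats.filter((s) => scanRatio(s) > threshold);
}
```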

Expectation: These changes typically cut read latency 30–70% and reduce CPU/IO 25–50% on medium traffic APIs. We’ve seen list endpoints drop from 600 ms p95 to ~180 ms with one index and two JOIN rewrites.
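The batch-and-prefetch fix in step 1 generalizes beyond any one ORM. Here is a minimal DataLoader-style batcher sketch; `Batcher` and `batchFn` are illustrative names, and in practice `batchFn` would run one `WHERE id = ANY($1)` query (error handling omitted).

```typescript
// DataLoader-style batcher (sketch): coalesces per-row lookups made in the
// same tick into a single batched fetch, turning N+1 selects into 1+1.
class Batcher<K, V> {
  private queue: { key: K; resolve: (v: V | undefined) => void }[] = [];
  private scheduled = false;

  constructor(private batchFn: (keys: K[]) => Promise<Map<K, V>>) {}

  load(key: K): Promise<V | undefined> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush once all of this tick's load() calls have queued up.
        queueMicrotask(() => this.flush());
      }
    });
  }

  private async flush() {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    const results = await this.batchFn(batch.map((b) => b.key));
    for (const { key, resolve } of batch) resolve(results.get(key));
  }
}
```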

Pool connections and prepare statements

Unbounded connections kill throughput; each one has memory overhead and context switching. I’ve watched teams scale app pods to 50 and accidentally open 2,500 DB connections on a db.t3.medium.

  • Use PgBouncer in transaction mode (Postgres) or ProxySQL (MySQL) to multiplex.
  • Keep app pool sizes realistic: (CPU cores * 2) + small buffer per service, not per pod.
  • Reuse prepared statements to cut parse/plan time and lock churn.
# pgbouncer.ini
[databases]
app = host=db.internal port=5432 dbname=appdb pool_mode=transaction

[pgbouncer]
max_client_conn = 2000
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 20
server_reset_query = DISCARD ALL   ; only used in session mode (set server_reset_query_always=1 to force)
# Spring Boot example
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      minimum-idle: 5
      leak-detection-threshold: 20000
// Node pg: prepared statements
await client.query({
  name: 'get_open_orders_v1',
  text: 'SELECT id, total_amount FROM orders WHERE tenant_id=$1 AND status=$2 LIMIT $3',
  values: [tenantId, 'OPEN', 50],
});
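The “(CPU cores * 2) + small buffer, per service” rule is easy to encode: budget connections per service, then divide across pods so scaling pods out never multiplies DB connections. A sketch (function name is ours):

```typescript
// Per-pod pool size from a per-service connection budget (sketch).
// dbCores: cores on the database instance; podCount: replicas of this service.
function poolSizePerPod(dbCores: number, podCount: number, buffer = 4): number {
  const serviceBudget = dbCores * 2 + buffer; // the per-service ceiling
  return Math.max(1, Math.floor(serviceBudget / podCount));
}
```

Compare this with the anti-pattern from earlier: 50 pods each configured with a 50-connection pool is 2,500 server connections, regardless of what the database can handle.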

Expectation: Tail latency (p99) often drops 20–40% after right-sizing pools and enabling pooling proxies. Errors from too many connections disappear. CPU steadies.

Cache with intent (and correctness)

Caching works when you design invalidation first. “Just throw Redis in front” is how carts go stale.

  • Cache read-mostly views with short TTLs and per-tenant keys.
  • Invalidate on write using CDC (Debezium + Kafka) for correctness at scale.
  • Prefer idempotent endpoints; cache keys include version/feature flags when needed.
// Redis read-through cache with TTL
const key = `orders:list:${tenantId}:status=open:v2`;
let cached = await redis.get(key);
if (!cached) {
  const rows = await db.query('SELECT id, total_amount FROM ...');
  await redis.set(key, JSON.stringify(rows), { EX: 30 }); // 30s TTL
  cached = JSON.stringify(rows);
}
return JSON.parse(cached);
# Debezium connector for orders table (CDC; credentials and topic.prefix omitted)
name: orders-connector
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: db.internal
  database.dbname: appdb
  table.include.list: public.orders
  tombstones.on.delete: false
# Consumer invalidates cache keys on updates (pseudocode)
kafka-consume orders.public.orders | ./invalidate-cache --pattern "orders:list:*:v2"
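The invalidation consumer’s core logic is a pure mapping from change event to cache keys, which keeps it testable without Kafka. A hedged sketch, assuming a simplified Debezium event shape and the key scheme from the Redis example above:

```typescript
// Simplified Debezium-style change event for public.orders (assumed shape).
interface OrderChange {
  op: "c" | "u" | "d";                 // create / update / delete
  after: { tenant_id: string } | null; // row image after the change
  before: { tenant_id: string } | null; // row image before (needed for deletes)
}

// Map one change event to the cache-key patterns to delete.
function keysToInvalidate(ev: OrderChange): string[] {
  const row = ev.after ?? ev.before;
  if (!row) return [];
  // Drop every cached list variant for the affected tenant (scheme from above).
  return [`orders:list:${row.tenant_id}:*:v2`];
}
```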
  • For heavy aggregations, use materialized views or precompute to Redis/ClickHouse.
CREATE MATERIALIZED VIEW mv_sales_by_day AS
SELECT date_trunc('day', created_at) AS day, sum(total_amount) AS revenue
FROM orders
GROUP BY 1;

-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX CONCURRENTLY ON mv_sales_by_day (day);
-- Refresh off-peak or incrementally
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_sales_by_day;

Expectation: Well-targeted caches reduce DB read QPS 30–60% and improve TTFB on list/search endpoints by 100–300 ms. Conversion lifts of 1–3% are common on browse-heavy flows.

Partition and isolate hot data

One giant table ages like milk. Vacuum times grow, FK checks thrash, and every query drags history along for the ride.

  • Partition by time for event-style data; by tenant for multi-tenant apps.
  • Keep hot partitions in fast storage; archive cold partitions cheaply.
  • Move noisy neighbors (e.g., audits) into their own tablespace or database.
-- Postgres: range partitioning by month
CREATE TABLE events (
  id bigserial,
  occurred_at timestamptz not null,
  tenant_id uuid not null,
  payload jsonb,
  primary key (id, occurred_at) -- a PK on a partitioned table must include the partition key
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2025_10 PARTITION OF events
FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');

-- Default future partition to avoid runtime errors
CREATE TABLE events_default PARTITION OF events DEFAULT;
-- MySQL 8: KEY partitioning by tenant (HASH needs an integer key;
-- the partition key must appear in every unique key, including the PK)
ALTER TABLE customers PARTITION BY KEY (tenant_id) PARTITIONS 16;
  • Index each partition individually for tighter stats and smaller b-trees.
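Partition maintenance is easy to automate: create next month’s partition ahead of time so rows never pile up in the default partition. A sketch that emits the DDL string (helper name is ours; table and column names match the events example):

```typescript
// Generate the CREATE TABLE ... PARTITION OF statement for a given month.
function monthlyPartitionDDL(table: string, year: number, month: number): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  const from = `${year}-${pad(month)}-01`;
  const next = month === 12 ? { y: year + 1, m: 1 } : { y: year, m: month + 1 };
  const to = `${next.y}-${pad(next.m)}-01`;
  return (
    `CREATE TABLE ${table}_${year}_${pad(month)} PARTITION OF ${table}\n` +
    `FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}
```

Run it from a scheduled job a month ahead; the upper bound is exclusive, matching Postgres range-partition semantics.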

Expectation: Partitioning takes p95s from “randomly bad” to consistent and cuts vacuum/maintenance windows by 60–90%. We’ve seen log-heavy tables drop from 1.2 TB to 300 GB hot set after archiving, with 40% cost savings.

Scale reads without footguns

Read replicas look easy until a user complains their just-updated profile isn’t reflected. Be deliberate.

  • Route only read-safe endpoints to replicas; keep writes and read-after-write on primary.
  • Monitor replica lag and auto-fallback when lag exceeds threshold.
  • Make routing explicit in code or via SQL hints.
-- Postgres app: safe function wrapper
CREATE OR REPLACE FUNCTION read_consistent() RETURNS void AS $$
  SELECT pg_sleep(0); -- marker to signal replica-safe context
$$ LANGUAGE SQL;
// Hibernate read-only hint
entityManager.createQuery("...")
  .setHint("org.hibernate.readOnly", true)
  .unwrap(org.hibernate.query.Query.class)
  .setComment("ROUTE:REPLICA");
-- MySQL ProxySQL routing rule (simplified)
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES (10, 1, '^SELECT.*FROM orders', 20, 1); -- HG20 = replicas
LOAD MYSQL QUERY RULES TO RUNTIME;
  • If you must serve read-after-write from replicas, fence on LSNs: capture pg_current_wal_lsn() at commit on the primary and wait until the replica’s pg_last_wal_replay_lsn() catches up, but expect a tail-latency cost.
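Lag-aware routing boils down to one decision function. A sketch, with `maxLagMs` as an assumed SLO knob rather than a driver setting:

```typescript
type Target = "primary" | "replica";

// Route a query: writes and read-after-write always hit primary; other reads
// go to a replica only while measured replication lag is under threshold.
function route(opts: {
  isWrite: boolean;
  readAfterWrite: boolean;
  replicaLagMs: number; // sampled from monitoring
  maxLagMs?: number;    // assumed SLO knob, default 500 ms
}): Target {
  const maxLagMs = opts.maxLagMs ?? 500;
  if (opts.isWrite || opts.readAfterWrite) return "primary";
  return opts.replicaLagMs <= maxLagMs ? "replica" : "primary";
}
```

The auto-fallback bullet above is the `replicaLagMs > maxLagMs` branch: when lag spikes, every read quietly returns to the primary instead of serving stale data.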

Expectation: Offloading 40–70% of reads to replicas frees primary IO for writes and flattens p99 under peak. Plan for ~10–30 ms added latency per replica hop in most clouds.

When to shard (and when not to)

If you can still fit hot data + indexes in RAM on a single node with replicas, you probably don’t need sharding. When you do:

  • Use battle-tested systems: Vitess (MySQL), Citus (Postgres), CockroachDB/Yugabyte (NewSQL).
  • Shard by tenant or user ID with a stable hash; avoid range shards that create hot spots.
  • Build cross-shard query discipline: no scatter-gather for synchronous user-facing paths.
# Vitess keyspace sketch (illustrative; the real VSchema is JSON per keyspace)
keyspaces:
  - name: app
    shards:
      - name: -80   # keyspace IDs below 0x80
      - name: 80-   # keyspace IDs 0x80 and above
    vschema:
      tables:
        orders:
          column_vindexes:
            - column: tenant_id
              name: hash
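The “stable hash” point fits in a few lines. FNV-1a here is illustrative, and real systems like Vitess hash into a keyspace-ID range so shards can split without remapping every tenant:

```typescript
// FNV-1a over the tenant ID: a cheap, stable 32-bit hash (illustrative choice).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, mod 2^32
  }
  return h >>> 0;
}

// Same tenant always lands on the same shard; no lookup table needed.
function shardFor(tenantId: string, shardCount: number): number {
  return fnv1a(tenantId) % shardCount;
}
```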

Expectation: Sharding changes your ops life. Done right, it unlocks linear write scaling; done wrong, it doubles your complexity. Exhaust simpler moves first: partitioning, better indexes, archiving, replicas.

What this looks like in a quarter

A realistic 6–10 week sequence we run at GitPlumbers:

  1. Week 1–2: Metrics and query census. Turn on pg_stat_statements, add traces, define SLOs. Quick wins on top 10 queries.
  2. Week 3–4: Index + query shaping + N+1 cleanup. Roll PgBouncer/ProxySQL. p95s drop 30–50%.
  3. Week 5–6: Add targeted caches and materialized views. Partition hot tables.
  4. Week 7–8: Introduce read replicas with lag-aware routing. Archive cold data.
  5. Week 9–10: Capacity test, chaos drills on failover, and cost optimization.

Outcomes you can take to the business:

  • p95 checkout: 1.2 s → 280 ms
  • Error rate: 1.8% → 0.4%
  • Infra cost: −22% (smaller DB instance; fewer app nodes)
  • Conversion: +2–4% on browse → add-to-cart


Key takeaways

  • Treat p95/p99 user latency as your north star; back into DB metrics from there.
  • Fix N+1s and mismatched indexes first—cheapest wins with biggest impact.
  • Pool connections and use prepared statements to slash tail latencies.
  • Cache with intent: per-key TTLs + CDC-driven invalidation for correctness.
  • Partition by time or tenant to avoid hot tables and long vacuums.
  • Scale reads via replicas with lag-aware routing; don’t serve stale data blindly.
  • Sharding is a last resort; try logical partitioning and vertical split first.

Implementation checklist

  • Define user-facing SLOs (e.g., p95 page < 300 ms, checkout < 500 ms).
  • Turn on `pg_stat_statements` or MySQL Performance Schema; sample top 20 queries.
  • Kill N+1s in ORM and add composite/covering indexes for dominant endpoints.
  • Set up PgBouncer/ProxySQL; right-size pools and enable prepared statements.
  • Add Redis cache with explicit TTLs and CDC-driven invalidation where needed.
  • Partition large tables by time or tenant; archive cold partitions.
  • Introduce read replicas with lag-aware routing; ensure idempotency.
  • Track before/after metrics and tie to revenue/retention changes.

Questions we hear from teams

How do I choose index order for composite indexes?
Match the most selective AND most frequently filtered columns first, then add columns used for sorting. For example, if 95% of queries filter by tenant and 80% by status and sort by created_at, use (tenant_id, status, created_at DESC). Use INCLUDE (Postgres) or covering indexes (MySQL) to add non-filter columns that are returned but not part of the predicate.
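That ordering rule is mechanical enough to encode. A toy sketch (names and shapes are ours) that puts equality filters first by filter frequency, then the sort column, with returned-only columns in INCLUDE:

```typescript
interface Col {
  name: string;
  role: "filter" | "sort" | "projection";
  filterFreq?: number; // fraction of queries that filter on this column
}

// Emit Postgres-style index DDL following the ordering rule above.
function composeIndex(table: string, cols: Col[]): string {
  const filters = cols
    .filter((c) => c.role === "filter")
    .sort((a, b) => (b.filterFreq ?? 0) - (a.filterFreq ?? 0))
    .map((c) => c.name);
  const sorts = cols.filter((c) => c.role === "sort").map((c) => `${c.name} DESC`);
  const includes = cols.filter((c) => c.role === "projection").map((c) => c.name);
  const inc = includes.length ? ` INCLUDE (${includes.join(", ")})` : "";
  return `CREATE INDEX ON ${table} (${[...filters, ...sorts].join(", ")})${inc};`;
}
```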
When should I shard?
Only after you’ve exhausted: correct indexing, query shaping, connection pooling, targeted caching, partitioning, read replicas, and archiving. Shard when write QPS or dataset size can’t be handled by a single primary with replicas even on the largest practical instance. Plan for 3–6 months of platform work, not a sprint.
Aren’t ORMs the problem?
ORMs aren’t evil; unconfigured ORMs are. Turn on lazy/eager loading consciously, add query logging, and codify query hints. Keep critical paths in hand-written SQL with `EXPLAIN (ANALYZE, BUFFERS)` in CI for guardrails. The combo works.
Postgres or MySQL—does it matter?
Both can scale into the tens of thousands of TPS with good schemas and ops. Postgres has richer indexing (GIN/GIST), CTEs, and JSONB; MySQL has rock-solid replication and great support via Aurora and Vitess. Choose based on your team’s operational comfort and ecosystem tools.
What’s the fastest way to prove value to the exec team?
Pick one high-traffic endpoint (search or list), ship a composite index + query rewrite, and show p95 → conversion delta in a weekly report. We routinely show 20–50% latency cuts translating to 1–3% conversion lifts within two weeks.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a database performance assessment · See how we cut p95 by 60% for a marketplace
