The Database Tune-Up That Cut p95 Latency in Half Without Rewriting a Line of App Code

User growth shouldn’t mean slower checkouts, angry PMs, and ballooning RDS bills. Here’s what actually scales when your database becomes the bottleneck.

You don’t need a rewrite. You need to stop asking your database to do dumb things at scale.

The moment growth outpaces your database

We walked into a team where user growth was a champagne problem—daily actives up 3x after a marketing push. The hangover: p95 API latency jumped from 280ms to 1.4s, checkout errors spiked to 3.2%, and AWS RDS Aurora spend was up 42% month over month. The app code hadn’t changed much; the query patterns hitting the database had. We’ve seen this movie across Shopify-like storefronts, B2B SaaS with analytic dashboards, and fintech ledgers. The plot twist is always the same: you don’t need a total rewrite—you need to fix how your app asks the database to work.

Goal: Shrink user-visible latency and error rates while controlling cost—as in real numbers you can defend in a QBR—without taking a six-month platform detour.

Measure what the user feels, not just what the DB reports

Before touching a single index, wire the metrics that leadership cares about.

  • User-facing SLOs:
    • p95 API latency for login, search, and checkout under 300ms; p99 under 800ms.
    • Error rate under 0.5% (HTTP 5xx + DB timeouts).
    • For revenue flows, track conversion vs. latency; we’ve seen 300ms gains yield 1–3% conversion lift.
  • Correlate app and DB:
    • Propagate a trace ID (e.g., X-Request-ID) from the edge down to your DB client, and tie Jaeger/Tempo/X-Ray spans to query fingerprints (see the sketch below).
    • Enable pg_stat_statements (Postgres) or the MySQL slow query log with long_query_time=0.2. Aggregate top N by total time, not count.
  • Baseline dashboards (Prometheus + Grafana or Datadog):
    • Per-endpoint p95/p99 latency, error rate, saturation (in-flight requests).
    • DB: connections used, active vs. idle, lock wait time, buffer cache hit ratio, replica lag.
  • Budget: Agree on “good enough” targets with PMs. This stops you from gold-plating a non-bottleneck.

If you can’t tie a query to a slow user flow with a trace, you’re guessing. Guessing is how you spend three weeks tuning shared_buffers for a 1% gain while checkout still times out.
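
One lightweight way to make that tie-out possible is to stamp the trace ID into the SQL itself as a comment (the sqlcommenter approach), so whatever shows up in pg_stat_activity or the slow query log carries the span that issued it. A minimal sketch, assuming psycopg2 and a framework that exposes the incoming X-Request-ID header; the query and DSN are illustrative:

  import psycopg2

  def run_traced(conn, sql, params, trace_id):
      # Prepend the trace ID as a SQL comment (sqlcommenter-style) so the text
      # that lands in pg_stat_activity and the server log can be matched back
      # to the originating Jaeger/Tempo/X-Ray span. trace_id is generated
      # server-side (never user input), so interpolating it is safe here.
      tagged_sql = f"/* trace_id={trace_id} */ {sql}"
      with conn.cursor() as cur:
          cur.execute(tagged_sql, params)
          return cur.fetchall()

  conn = psycopg2.connect("dbname=app")  # hypothetical DSN
  rows = run_traced(
      conn,
      "SELECT id, status FROM orders WHERE user_id = %s",
      (42,),
      trace_id="req-8f3a9c",  # value pulled from the X-Request-ID header
  )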

Fix the query path first: index, simplify, delete work

Nine out of ten wins come from making the database do less work per request.

  1. Identify the top offenders
    • Postgres:
      CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
      SELECT queryid, calls, total_exec_time, mean_exec_time, query
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 10;
    • MySQL:
      SET GLOBAL slow_query_log = 1;
      SET GLOBAL long_query_time = 0.2;
      SET GLOBAL log_output = 'TABLE';  -- required so mysql.slow_log is populated
      SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;
  2. Prove with EXPLAIN
    • Postgres:
      EXPLAIN (ANALYZE, BUFFERS)
      SELECT ... WHERE user_id = $1 AND created_at > now() - interval '7 days';
    • Look for sequential scans on hot paths, misestimates, and row rechecks.
  3. Add the right index (covering, composite, or partial, matched to the query shape)
    • Covering example (Postgres):
      CREATE INDEX CONCURRENTLY idx_orders_user_created
        ON orders (user_id, created_at DESC) INCLUDE (status, total_amount);
    • Partial index to avoid bloat:
      CREATE INDEX CONCURRENTLY idx_orders_open
        ON orders (merchant_id, status)
        WHERE status IN ('open','awaiting_payment');
    • For time-series scans, consider BRIN on very large append-only tables:
      CREATE INDEX CONCURRENTLY idx_events_brin ON events USING brin(created_at);
  4. Simplify queries and N+1s
    • Replace ORM-generated monsters with explicit SQL; use SELECT columns, not SELECT *.
    • Batch lookups: collapse 100 N+1 calls into one WHERE id = ANY($1) (first sketch after this list).
    • Move non-critical joins to async pipelines (outbox/table + worker) when they aren’t needed inline.
  5. Delete work with precomputation
    • Materialized views for expensive aggregations; refresh on schedule or event-triggered (second sketch after this list).
    • Denormalize cautiously for read-heavy paths; guard with data contracts and tests.
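
Here is what killing an N+1 looks like in practice—a minimal sketch, assuming psycopg2; the orders columns mirror the index examples above and the function names are made up for illustration:

  import psycopg2

  def fetch_orders_n_plus_1(conn, order_ids):
      # Anti-pattern: one round trip per id; 100 ids means 100 queries.
      results = []
      with conn.cursor() as cur:
          for oid in order_ids:
              cur.execute(
                  "SELECT id, status, total_amount FROM orders WHERE id = %s",
                  (oid,),
              )
              results.append(cur.fetchone())
      return results

  def fetch_orders_batched(conn, order_ids):
      # One round trip: psycopg2 adapts a Python list to a Postgres array,
      # so WHERE id = ANY(%s) replaces the whole loop.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, status, total_amount FROM orders WHERE id = ANY(%s)",
              (list(order_ids),),
          )
          return cur.fetchall()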
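
And a minimal sketch of the precomputation route: build a materialized view for the heavy aggregate and refresh it from a worker. Assumes psycopg2; the view name, aggregate, and cadence are illustrative. REFRESH ... CONCURRENTLY needs a unique index on the view and lets reads continue during the refresh.

  import psycopg2

  # Ship this DDL once via your migration tool.
  DDL = """
  CREATE MATERIALIZED VIEW IF NOT EXISTS account_daily_active AS
  SELECT account_id,
         date_trunc('day', created_at) AS day,
         count(DISTINCT payload->>'user_id') AS distinct_users
  FROM events
  GROUP BY 1, 2;

  CREATE UNIQUE INDEX IF NOT EXISTS account_daily_active_key
    ON account_daily_active (account_id, day);
  """

  def refresh(conn):
      # Run from cron, Celery beat, or a Kubernetes CronJob every 5 minutes.
      with conn:
          with conn.cursor() as cur:
              cur.execute(
                  "REFRESH MATERIALIZED VIEW CONCURRENTLY account_daily_active"
              )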

What this looks like in numbers:

  • A B2B dashboard cut its p95 from 2.1s to 540ms by adding two covering indexes and moving a COUNT DISTINCT to a materialized view refreshed every 5 minutes.
  • A consumer app dropped checkout timeouts from 2.7% to 0.4% by killing N+1s and adding INCLUDE columns to match response fields.

Control concurrency: pool or die

The easiest way to take down a database is to “scale” your app tier and unleash 4,000 connections. Don’t. Control the firehose.

  • Use a real pool:
    • Postgres: pgBouncer in transaction mode for web traffic; use session mode only for clients that rely on session state (prepared statements, temp tables, advisory locks).
    • MySQL: ProxySQL to cap per-user concurrency and route read traffic.
  • Cap it at the edge:
    • Set max_connections on Postgres sanely (often < 500). Let the pool queue; it’s cheaper than backpressure at the DB.
    • Configure app-level concurrency budgets (gunicorn workers, puma threads, node cluster) to match pool size.
  • Example pgbouncer.ini:
    [databases]
    app = host=db-primary port=5432 dbname=app
    
    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 4000
    default_pool_size = 80
    min_pool_size = 20
    reserve_pool_size = 20
    server_reset_query = DISCARD ALL   ; ignored in transaction mode unless server_reset_query_always=1
    server_lifetime = 1800
    server_idle_timeout = 60
  • Protect the primary:
    • Set statement timeouts: SET LOCAL statement_timeout = '300ms' for hot endpoints (see the sketch after this list).
    • Kill zombies: idle_in_transaction_session_timeout = '15s'.
  • Lambda/serverless gotchas:
    • Use RDS Proxy or Cloud SQL Auth Proxy; cold starts can DDoS your DB with connections otherwise.
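
For the statement-timeout budget, here is a minimal sketch assuming psycopg2 behind a pgBouncer transaction-mode pool. SET LOCAL scopes the timeout to the current transaction, so it cannot leak onto other clients’ pooled server connections the way a session-level SET could:

  import psycopg2

  def checkout_summary(conn, user_id):
      with conn:  # opens a transaction, commits on success, rolls back on error
          with conn.cursor() as cur:
              cur.execute("SET LOCAL statement_timeout = '300ms'")
              cur.execute(
                  "SELECT status, total_amount FROM orders "
                  "WHERE user_id = %s ORDER BY created_at DESC LIMIT 20",
                  (user_id,),
              )
              return cur.fetchall()

  # A query that blows the budget raises psycopg2.errors.QueryCanceled;
  # catch it at the endpoint and serve a fallback instead of hanging.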

What this buys you:

  • A fintech client saw connection storms drop 95% and p95 latency improve 40% after moving to pgBouncer with a strict concurrency budget. Infra cost fell 18% because they stopped over-provisioning CPU just to handle spikes.

Scale reads with replicas and cache the obvious

Get reads off your primary so writes stay fast and predictable.

  • Read replicas done right:
    • Route GET endpoints that tolerate slight staleness to replicas; tag via readOnly=true in the data access layer.
    • Watch lag. If replica lag exceeds ~1s, automatically fall back to the primary for latency-sensitive reads (first sketch after this list).
    • For Aurora MySQL global databases with write forwarding enabled, aurora_replica_read_consistency = SESSION gives read-your-writes on forwarded reads.
  • Add a cache you can trust:
    • Redis with short TTLs (30–120s) for list pages and product catalogs.
    • Use key-versioning on invalidation to avoid thundering herds; prewarm on deploys.
    • Example pattern (expanded in the second sketch after this list):
      key = f"user:{id}:orders:v{schema_version}"
      ttl = 60
  • Precompute metrics:
    • Materialized views (REFRESH MATERIALIZED VIEW CONCURRENTLY) or summary tables updated via CDC/outbox processed by a worker (Debezium, Kafka, Airflow).
  • CQRS-lite:
    • Keep writes on primary; serve queries from a read model built asynchronously. This is often a simple table, not Kafka-and-microservices-the-movie.
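
A minimal sketch of the lag-guarded routing above, assuming psycopg2 and separate DSNs for primary and replica (the hostnames are placeholders). In production you would cache the lag check for a few seconds rather than run it per request:

  import psycopg2

  MAX_LAG_SECONDS = 1.0

  primary = psycopg2.connect("host=db-primary dbname=app")
  replica = psycopg2.connect("host=db-replica dbname=app")
  primary.autocommit = True
  replica.autocommit = True

  def replica_lag_seconds():
      # Seconds since the standby last replayed a transaction. Caveat: on an
      # idle primary this number grows even though nothing is actually behind.
      with replica.cursor() as cur:
          cur.execute(
              "SELECT extract(epoch FROM (now() - pg_last_xact_replay_timestamp()))"
          )
          lag = cur.fetchone()[0]
          return float(lag) if lag is not None else float("inf")

  def read_conn(read_only=True):
      # Stale-tolerant reads go to the replica unless it is lagging;
      # everything else (writes, read-after-write) stays on the primary.
      if read_only and replica_lag_seconds() <= MAX_LAG_SECONDS:
          return replica
      return primary

  def recent_orders(user_id):
      with read_conn(read_only=True).cursor() as cur:
          cur.execute(
              "SELECT id, status FROM orders WHERE user_id = %s "
              "ORDER BY created_at DESC LIMIT 20",
              (user_id,),
          )
          return cur.fetchall()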
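
And the versioned-key cache pattern fleshed out, assuming redis-py and psycopg2; SCHEMA_VERSION is an assumed constant you bump whenever the cached shape changes, which invalidates every old key at once without issuing deletes:

  import json
  import redis

  SCHEMA_VERSION = 3     # bump on shape changes to invalidate all old keys
  TTL_SECONDS = 60

  r = redis.Redis(host="localhost", port=6379)

  def user_orders(conn, user_id):
      key = f"user:{user_id}:orders:v{SCHEMA_VERSION}"
      cached = r.get(key)
      if cached is not None:
          return json.loads(cached)

      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, status, total_amount FROM orders "
              "WHERE user_id = %s ORDER BY created_at DESC LIMIT 50",
              (user_id,),
          )
          rows = [
              {"id": i, "status": s, "total": float(t)}
              for i, s, t in cur.fetchall()
          ]

      # Short TTL bounds staleness; per-user keys rarely stampede, so no lock
      # is needed here. For hot shared keys, add jitter or a singleflight lock.
      r.set(key, json.dumps(rows), ex=TTL_SECONDS)
      return rows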

Result patterns we routinely see:

  • 25–60% reduction in primary CPU after routing 40–70% of reads to replicas.
  • p99 improved from 1.1s to 320ms on search endpoints with Redis + partial index.

Partition before you shard (and archive aggressively)

Massive tables eventually hurt: index bloat, vacuum pain, slow scans. Partitioning keeps hot data hot and cold data out of the way.

  • Time-based partitioning (Postgres):
    • Native declarative partitions on created_at by month with a retention policy.
    • Example:
      CREATE TABLE events (
        id bigserial,
        created_at timestamptz not null,
        account_id bigint not null,
        payload jsonb not null,
        -- the partition key must be part of any primary key or unique constraint
        primary key (id, created_at)
      ) PARTITION BY RANGE (created_at);
      
      CREATE TABLE events_2025_10 PARTITION OF events
        FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');
    • Use pg_partman to auto-create/drop partitions.
  • Hash or list partition on hot keys (tenant, account) if time doesn’t cluster well.
  • Archive cold data:
    • Move >180-day partitions to cheaper storage (export to S3 as Parquet or CSV) and query with Trino/Athena for analytics (see the sketch after this list).
    • Keep only the last N partitions indexed for OLTP.
  • When to introduce horizontal scale:
    • Postgres: Citus for distributed tables when single-node CPU or IOPS is at the wall and you’ve already optimized queries.
    • MySQL: Vitess for online sharding and connection pooling at scale (YouTube did this for a reason).
    • Plan the key space and routing early; implement a dual-write + backfill to migrate safely.
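
A minimal sketch of the archive step, assuming psycopg2 and monthly partitions named like events_2024_01. Detach first so queries stop planning against the partition, dump the now-standalone table, then drop it; the Parquet conversion and S3 upload are left to whatever tooling you already run:

  import psycopg2

  def archive_partition(conn, partition):
      # partition (e.g. "events_2024_01") comes from your retention job,
      # never from user input, so interpolating the identifier is safe here.
      with conn:
          with conn.cursor() as cur:
              cur.execute(f"ALTER TABLE events DETACH PARTITION {partition}")

      with conn, conn.cursor() as cur, open(f"/tmp/{partition}.csv", "w") as out:
          # Dump the detached table; convert/upload (Parquet, S3) downstream.
          cur.copy_expert(f"COPY {partition} TO STDOUT WITH CSV HEADER", out)

      with conn:
          with conn.cursor() as cur:
              cur.execute(f"DROP TABLE {partition}")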

What this avoids:

  • A marketplace with a 1.2B-row events table shaved 70% off index maintenance and cut p95 by 35% after moving to monthly partitions and archiving 12+ month data to S3.

Make changes safely: migrations, canaries, and rollbacks

Schema changes and indexes can be production weapons—use a safety harness.

  • Online indexes:
    • Postgres CREATE INDEX CONCURRENTLY; MySQL ALGORITHM=INPLACE, LOCK=NONE (version-dependent).
  • Gate with feature flags:
    • Ship the index first; route only canary traffic (e.g., 5%) through the query path that exercises it. Watch p95 and DB CPU (see the sketch after this list).
  • Canary and dual-read:
    • For read-model refactors, dual-read old and new paths behind a flag; compare results for a subset of traffic.
  • Migrations with rollback scripts:
    • Use Sqitch, Flyway, or Liquibase. Always write a down script, even if it’s partial (e.g., drop index concurrently later).
  • Observability as a deployment gate:
    • Block rollout if replica lag > N seconds, lock wait spikes, or error budget burn > 2%/hour.
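
A minimal sketch of the flag-gated canary, assuming psycopg2; the percentage, the random split, and the account_daily_events summary table are stand-ins for your real flag provider and read model. The point is that rollback is “set the percentage to 0,” not a schema change:

  import random

  CANARY_PERCENT = 5   # start small; watch p95, error rate, and DB CPU before raising

  def weekly_event_count(conn, account_id):
      if random.uniform(0, 100) < CANARY_PERCENT:
          return weekly_event_count_new(conn, account_id)
      return weekly_event_count_old(conn, account_id)

  def weekly_event_count_old(conn, account_id):
      # Known-good path: scans the raw events table.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT count(*) FROM events "
              "WHERE account_id = %s AND created_at > now() - interval '7 days'",
              (account_id,),
          )
          return cur.fetchone()[0]

  def weekly_event_count_new(conn, account_id):
      # Canary path: reads a precomputed daily summary (assumed table).
      with conn.cursor() as cur:
          cur.execute(
              "SELECT coalesce(sum(event_count), 0) FROM account_daily_events "
              "WHERE account_id = %s AND day > now() - interval '7 days'",
              (account_id,),
          )
          return cur.fetchone()[0]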

We’ve watched more outages from indexing without CONCURRENTLY than from actual traffic spikes. Slow is smooth, smooth is fast.

What you can expect in 2–4 weeks

If you follow this sequence, here’s the boringly repeatable impact we see at clients after a focused engagement:

  • p95 latency on top 3 endpoints down 35–65%.
  • Error rate from DB timeouts/cancelled queries down 60–90%.
  • Primary CPU down 25–50%; infra cost down 15–30% by right-sizing.
  • MTTR down because traces point to specific queries, not vibes.
  • Business wins: 0.5–2.5% conversion lift on checkout/search; fewer abandoned sessions; happier finance team.

Execution plan we use at GitPlumbers (steal this):

  1. Instrument: turn on pg_stat_statements/slow log, wire traces, set clear SLOs.
  2. Triage top 10 queries by total time; run EXPLAIN (ANALYZE, BUFFERS) on each.
  3. Index and simplify; kill N+1; batch and denormalize where safe.
  4. Introduce pooling; enforce concurrency budgets.
  5. Route reads to replicas; add Redis for obvious wins; precompute heavy aggregates.
  6. Partition large tables; archive cold data; plan Citus/Vitess only if necessary.
  7. Automate safety: online migrations, canaries, and easy rollbacks.
  8. Prove it: before/after dashboards tied to dollars.

The quiet signals you’re winning

  • On-call pages shift from “primary CPU 95%” to “one bad query slipped to prod.”
  • Releases stop being roulette.
  • The CFO asks why your RDS bill went down while DAUs went up. You show a chart instead of a novel.


Key takeaways

  • Tie every optimization to user-facing metrics (p95 latency, error rate, conversion).
  • Start with the query path: measure, index, and simplify before touching infra.
  • Control concurrency with pooling—most outages are self-inflicted connection storms.
  • Use replicas and caches for reads; keep writes hot and simple.
  • Partition before you shard; make data movement boring and reversible.
  • Automate rollouts and rollback paths for schema changes and index builds.
  • Prove the business impact: faster flows, lower cloud spend, higher conversion.

Implementation checklist

  • Define SLOs for p95/p99 latency on key flows (login, search, checkout).
  • Enable `pg_stat_statements` or MySQL slow query log; baseline top offenders.
  • Use `EXPLAIN (ANALYZE, BUFFERS)` to confirm index coverage and row estimates.
  • Add covering/partial indexes; remove unused ones; verify with canary traffic.
  • Introduce `pgBouncer`/`ProxySQL` with transaction pooling; cap max concurrency.
  • Route read-heavy endpoints to replicas; add cache with strict TTLs and invalidation.
  • Partition large tables (time or hash); archive cold data; consider Citus/Vitess when needed.
  • Automate migrations with `Sqitch`/`Flyway`/`Liquibase`; add rollback scripts.
  • Track cost and performance deltas in the same dashboard; ship a post-incident doc.

Questions we hear from teams

How do I know if I need partitioning or full-blown sharding?
If your single-node primary still has CPU/IOPS headroom after indexing and caching, and most slow queries are long table scans, start with partitioning and archiving. Move to Citus/Vitess when you hit physical limits (CPU > 80% sustained, IOPS saturated) or multi-tenant isolation becomes a requirement.
Will read replicas hurt consistency for critical flows?
Only route flows that tolerate staleness; keep write-after-read paths on primary or use session-level consistency where available. Monitor replica lag and fail back to primary if it exceeds your tolerance.
Is Redis mandatory?
No. Start with indexes and query fixes. Redis is a force multiplier for read-heavy endpoints and computed aggregates, but bad invalidation can cause correctness issues. Keep TTLs short and use versioned keys.
How do I roll back a bad index?
Create indexes concurrently and gate usage via flags. If the index hurts, flip the flag off and drop the index concurrently during low traffic. Always keep a rollback script in your migration tool.
What about tuning Postgres parameters?
It’s usually the last 10%: `work_mem` for sorts/joins, `effective_cache_size` to match instance memory, realistic `random_page_cost` for SSDs. But don’t expect miracles—fixing queries and concurrency gives bigger wins.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to a GitPlumbers architect · See our performance case studies
