We Cut p95 Checkout Latency from 1.2s to 220ms by Fixing Three Queries—Here’s the Playbook
User growth isn’t what breaks your product—expensive queries do. This is the pragmatic database optimization plan that moves real product metrics without a rewrite.
The day your database became your sales funnel bottleneck
I’ve watched a p95 checkout endpoint drift from 350ms to 1.2s over a single quarter—not because of “hypergrowth,” but because three queries went pathological as the user table crossed 50M rows. Marketing blamed iOS; product blamed a feature flag. It was the database. We cut p95 to 220ms in two weeks without a rewrite. This is the playbook we use at GitPlumbers when the DB is choking revenue.
You don’t scale a query by adding nodes; you scale it by making it cheap.
If you’ve been burned by “let’s shard” or “let’s move to serverless” pitches, you’ll appreciate this: we start with user-facing metrics, then fix the cheapest root causes first. No heroics, no magic. Just the work.
What to measure (and report to the business)
Optimizing databases without user metrics is vibe coding. Tie every change to something a PM and CFO care about:
- p95/p99 endpoint latency (TTFB): checkout, search, feed load
- Apdex/SLI vs SLO: e.g., p95 < 250ms for /checkout
- Error rate: timeouts, deadlocks, 5xx
- Throughput: requests/s during peak; DB QPS
- Conversion/retention deltas: 100ms on checkout often moves conversion 1–3%
- Infra cost/query: $/1M requests (before/after)
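A minimal sketch of how these numbers fall out of raw latency samples—`percentile` and `apdex` are illustrative helpers, not from any specific library; the 250ms target mirrors the checkout SLO above:

```typescript
// Nearest-rank percentile over raw latency samples (ms)
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Apdex: (satisfied + tolerating/2) / total, tolerating = (T, 4T]
function apdex(samplesMs: number[], targetMs: number): number {
  let satisfied = 0;
  let tolerating = 0;
  for (const ms of samplesMs) {
    if (ms <= targetMs) satisfied++;
    else if (ms <= 4 * targetMs) tolerating++;
  }
  return (satisfied + tolerating / 2) / samplesMs.length;
}
```

In practice you'd pull these from Prometheus histograms rather than raw samples, but the definitions are the same.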
Quick wins for visibility:
- Add OpenTelemetry spans around ORM and raw SQL. Surface DB time in APM (Datadog/New Relic/Grafana Tempo).
- Create a Grafana dashboard with Prometheus queries:
```
# p95 latency per route (histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{route=~"/checkout|/search"}[5m])) by (le, route))
```

- For Postgres, enable extensions:

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```

- For MySQL 8:

```sql
UPDATE performance_schema.setup_consumers SET ENABLED='YES' WHERE NAME='events_statements_history';
```

Find the pain before touching knobs
I’ve seen teams crank shared_buffers and flip to bigger instances before asking “which query?” Don’t. Start with evidence.
- Pull the top offenders:
  - Postgres:

    ```sql
    -- Postgres 13+ renamed these columns to total_exec_time / mean_exec_time
    SELECT query, calls, total_time, mean_time
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 20;
    ```

  - MySQL: `pt-query-digest` on the slow log; `performance_schema.events_statements_summary_by_digest`
- Trace end-to-end: APM traces show N+1 patterns and missing indexes in real traffic.
- Run plans on ONE bad query with production-like stats:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.email = $1 AND o.status IN ('paid','settled')
ORDER BY o.created_at DESC
LIMIT 20;
```

Red flags you’ll see:
- Seq scans on multi-million-row tables with a selective filter (missing index)
- Nested loops over big sets (forgot a join index)
- High “Rows Removed by Filter” with `Bitmap Heap Scan` (need a covering index)
- Massive `shared read` buffers (cache misses)
Fix the single worst query, measure, then iterate. Ten 10% wins beat one “perfect” migration that never ships.
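When you can't reach `pg_stat_statements`, the same triage can be approximated from application query logs: group queries by shape (literals stripped) and count repeats—a hot shape with hundreds of near-identical calls per request is your N+1. A sketch; the normalization rules are illustrative, not a full SQL parser:

```typescript
// Collapse literals so "WHERE id = 1" and "WHERE id = 2" share one shape
function queryShape(sql: string): string {
  return sql
    .replace(/'[^']*'/g, "?")         // string literals
    .replace(/\b\d+(\.\d+)?\b/g, "?") // numeric literals
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

// Rank shapes by call count; the top entries are your backlog
function topShapes(queries: string[], limit = 5): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const q of queries) {
    const shape = queryShape(q);
    counts.set(shape, (counts.get(shape) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, limit);
}
```

Run it over one request's worth of logged SQL; any shape with count ≫ 1 is a candidate N+1.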
Fix read paths first: indexes, N+1, and caches that won’t bite you
Most latency comes from reads. Start here.
- Composite and covering indexes: index for your WHERE + ORDER BY + LIMIT patterns.

```sql
-- Postgres: concurrent build to avoid a table lock
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_user_status_created
  ON orders (user_id, status, created_at DESC)
  INCLUDE (total);
```

- Partial indexes for hot slices. Note the predicate must be immutable—Postgres rejects `now()` in an index predicate—so use a fixed cutoff and rebuild it periodically:

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_paid_recent
  ON orders (user_id, created_at DESC)
  WHERE status IN ('paid','settled') AND created_at > '2025-01-01';
```

- Kill N+1s at the ORM boundary. In Rails, preload; in Django, `select_related`/`prefetch_related`; in TypeORM, eager join loads. Or write one explicit SQL query with a proper `JOIN`.
- Read-through cache with stampede protection:
```typescript
// Node + ioredis example (db is your app's query client)
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);

async function getUserProfile(userId: string) {
  const key = `user:profile:${userId}`;
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const lockKey = `${key}:lock`;
  const gotLock = await redis.set(lockKey, '1', 'EX', 5, 'NX');
  if (!gotLock) {
    // Someone else is fetching; back off briefly, then re-check the cache
    await new Promise((r) => setTimeout(r, 50));
    return getUserProfile(userId);
  }
  try {
    const row = await db.query(/* SQL to fetch profile */);
    await redis.set(key, JSON.stringify(row), 'EX', 300); // 5 min TTL
    return row;
  } finally {
    await redis.del(lockKey); // always release the lock, even on error
  }
}
```

Rules of thumb:
- Cache only stable, high-hit data (profiles, product catalogs), not per-user secrets.
- Set sane TTLs and instrument cache hit rate; aim for 80%+ on target endpoints.
- Bust caches on writes via events; don’t rely on long TTLs alone.
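The N+1 fix can also be done generically with a DataLoader-style batch loader: queue the IDs a request needs, then issue one round trip. A synchronous sketch—`fetchByIds` is an assumed app function standing in for a single `SELECT ... WHERE id = ANY($1)`; a real loader would flush once per event-loop tick:

```typescript
type Row = { id: number };

function createBatchLoader<T extends Row>(fetchByIds: (ids: number[]) => T[]) {
  const pending = new Set<number>();
  const results = new Map<number, T>();
  return {
    // Queue an id instead of querying immediately (the N+1 pattern)
    want(id: number): void {
      pending.add(id);
    },
    // One round trip for everything queued since the last flush
    flush(): void {
      if (pending.size === 0) return;
      for (const row of fetchByIds([...pending])) results.set(row.id, row);
      pending.clear();
    },
    get(id: number): T | undefined {
      return results.get(id);
    },
  };
}
```

Fifty `want()` calls become one query instead of fifty—the whole point of killing N+1s.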
Control connections and contention: pooling, timeouts, backpressure
I’ve watched Postgres keel over with 2k app connections because someone scaled pods without a pool. The fix is boring and effective.
- Use PgBouncer (transaction mode) or app pools:

```ini
# pgbouncer.ini
[databases]
appdb = host=10.0.1.5 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
reserve_pool_size = 20
query_timeout = 5  # seconds, not milliseconds
```

(`server_reset_query` is ignored in transaction mode—leave it unset.)

Set timeouts and circuit breakers so one bad query doesn’t domino:
- Postgres: `statement_timeout = '2s'`, `idle_in_transaction_session_timeout = '5s'`
- App: HTTP timeouts, per-endpoint budgets, `retry: false` on non-idempotent ops
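A circuit breaker for the DB call path can be as small as a counter and a clock. A count-based sketch—the thresholds and the injected `now` clock are illustrative choices, not any particular library's API:

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  constructor(
    private maxFailures = 5,
    private cooldownMs = 10_000,
    private now: () => number = Date.now
  ) {}

  canRequest(): boolean {
    if (this.openedAt === null) return true; // closed: all traffic flows
    // Half-open: allow a probe once the cooldown has elapsed
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // close the circuit
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }
}
```

Wrap each DB call in `canRequest()` / `recordSuccess()` / `recordFailure()` and fail fast while the circuit is open instead of stacking timeouts on a sick database.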
Watch locks:

```sql
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

- Apply backpressure: queue requests when pools are saturated; shed load gracefully with `429` instead of letting the DB thrash.
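Backpressure can be sketched as a bounded gate in front of the pool: run up to N, queue up to M, shed the rest as `429`. The limits below are illustrative, and a real gate would also track queue wait time:

```typescript
class PoolGate {
  private inFlight = 0;
  private queued = 0;
  constructor(private maxConcurrent = 50, private maxQueued = 100) {}

  // Caller maps "shed" to an HTTP 429 response
  admit(): "run" | "queue" | "shed" {
    if (this.inFlight < this.maxConcurrent) {
      this.inFlight++;
      return "run";
    }
    if (this.queued < this.maxQueued) {
      this.queued++;
      return "queue";
    }
    return "shed"; // bounded queue full: fail fast, protect the DB
  }

  release(): void {
    if (this.queued > 0) this.queued--; // a queued request takes the slot
    else this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

The key property is the bounded queue: an unbounded one just moves the thrash from the database into your process.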
Result: fewer deadlocks, predictable tails, and the freedom to scale app instances without nuking the database.
Scale architecture without a rewrite: replicas, hot/cold split, partitioning
Once queries are cheap and connections are sane, scale out the reads and tame large tables.
Read replicas for safe GETs:
- Aurora or Postgres streaming replication works. Use lag-aware routing: only send idempotent queries when `replica_lag < 200ms`.
- Don’t route writes by accident through replicas; enforce `readOnly` at the connection level.
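Lag-aware routing can be a pure function over replica health. A sketch—`lagMs` would come from monitoring (e.g. `pg_stat_replication`), and the names here are illustrative:

```typescript
type Replica = { name: string; lagMs: number };

// Route a safe GET to the least-lagged healthy replica, or fall
// back to the primary rather than serving stale or failed reads.
function routeRead(replicas: Replica[], maxLagMs = 200): string {
  const healthy = replicas.filter((r) => r.lagMs < maxLagMs);
  if (healthy.length === 0) return "primary";
  healthy.sort((a, b) => a.lagMs - b.lagMs);
  return healthy[0].name;
}
```

Keep this behind a feature flag so you can fall back to primary-only routing in one deploy.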
Hot vs cold data:
- Keep 90 days of orders in the primary partition; archive older to cheaper storage or a warehouse (Snowflake/BigQuery) for analytics.
Partition by time or tenant before sharding:

```sql
-- Postgres native range partitioning
-- (the PK on a partitioned table must include the partition key)
CREATE TABLE orders (
  id BIGSERIAL,
  user_id BIGINT NOT NULL,
  status TEXT NOT NULL,
  total NUMERIC(10,2) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2025_01 PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```

Move heavy analytics off OLTP: materialize daily aggregates into a separate table/job; don’t let a CFO report run `COUNT(DISTINCT)` on the primary at 9am.

When to shard: only after partitions + replicas + caching still fail your SLOs. If you do, shard by user/tenant, not a random hash; keep routing logic in a library, not scattered across services.
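If you do end up sharding, the routing library can be tiny: a deterministic hash from tenant to shard. A sketch using FNV-1a—the hash choice and shard count are illustrative assumptions, and a production router would add a lookup table for tenant moves:

```typescript
// 32-bit FNV-1a: stable, fast, dependency-free
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Same tenant always lands on the same shard
function shardFor(tenantId: string, shardCount: number): number {
  return fnv1a(tenantId) % shardCount;
}
```

Resist modulo-by-growing-shard-count later—changing `shardCount` remaps almost every tenant, which is why the routing lives in one library you can swap for consistent hashing.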
Prove it with numbers: a 30‑day, low-drama plan
Here’s what actually worked on that 1.2s checkout:
Week 1
- Instrument p95/p99 per endpoint; enable `pg_stat_statements` and slow logs.
- Identify the top 10 queries by total time; capture plans in a doc.
- Add two composite indexes with `CONCURRENTLY`; kill one N+1 (ORM preload).
Week 2
- Introduce PgBouncer; set `statement_timeout=2s`; fix app pool sizes (Go: `db.SetMaxOpenConns(50)`).
- Add a read-through Redis cache for product details (5 min TTL, stampede lock).
- Canary deploy; watch p95 and error-budget burn in Grafana.
Week 3–4
- Partition `orders` by month; archive >12 months to S3 + Athena.
- Add one read replica; route catalog/search GETs with a lag guard.
- Clean up long transactions; add deadlock alarms; document runbooks.
Outcomes (real numbers from a mid-market retail app):
- p95 /checkout: 1.2s → 220ms
- Error rate: 1.8% → 0.3%
- Infra cost/1M requests: -$480 (smaller DB instance after fixes)
- Conversion uplift: +1.6% (A/B vs holdout)
- Engineering: on-call pages during peak -70%; MTTR -40%
No re-architecture, no weekend cutovers. Just disciplined, observable changes.
If AI-generated “vibe code” shipped questionable queries, the same playbook applies: we’ve had to do vibe code cleanup where GPT sprinkled SELECT * into hot paths. Lint it, explain it, index it, and put it behind budgets.
What I’d do differently (so you don’t learn the hard way)
- Bake `EXPLAIN (ANALYZE, BUFFERS)` into code reviews for risky endpoints; require a plan screenshot in the PR.
- Track query-level SLOs (“/search DB time p95 < 120ms”) alongside API SLOs.
- Put connection counts and lock wait time on every dashboard; they’re early smoke signals.
- Version and review indexes like code; roll out with `CONCURRENTLY`, roll back with `DROP INDEX CONCURRENTLY`.
- Keep a “top 20 queries by total time” doc alive; it becomes your backlog for cheap wins.
- Separate OLTP from analytics earlier than you think.
If you want a second set of eyes, GitPlumbers lives for this stuff—trenches, not theory. We fix AI-assisted and legacy code so your team ships faster and sleeps at night.
Key takeaways
- User-facing metrics first: measure p95/p99 endpoint latency, Apdex, and conversion before touching DB knobs.
- Target top 10 queries via `pg_stat_statements`/`performance_schema` and APM traces—don’t boil the ocean.
- Index for access paths you actually use: composite, covering, and partial indexes beat brute-force hardware.
- Fix N+1 at the ORM boundary and cache only what is stable—avoid stampedes with per-key locks.
- Keep connections under control with PgBouncer or app pools; set timeouts and backpressure to protect the DB.
- Scale reads with replicas and split hot/cold data; add partitioning before you consider sharding.
- Ship in safe increments: `CREATE INDEX CONCURRENTLY`, canary rollouts, and query-level feature flags.
Implementation checklist
- Enable slow query logging and `pg_stat_statements` (or MySQL `performance_schema`).
- Instrument p95/p99 per endpoint with OpenTelemetry + APM; correlate to DB spans.
- Run `EXPLAIN (ANALYZE, BUFFERS)` on the top 10 queries; capture plans in PRs.
- Add the missing composite/covering indexes; use `CONCURRENTLY` in Postgres.
- Eliminate N+1 queries; batch loads or use `JOIN`s; add read-through cache with stampede protection.
- Introduce PgBouncer or fix app pool sizing; enforce timeouts and circuit breakers.
- Add one read replica behind a feature flag; route safe GETs with lag-aware logic.
- Partition large tables by time or tenant; archive cold data; move analytics off OLTP.
Questions we hear from teams
- How do I know if I need sharding?
- If you haven’t added the right composite/covering indexes, controlled connections with a pooler, introduced a read replica, and partitioned large tables—sharding is premature. Most teams get 10–100x headroom with those steps. Shard only when you still miss SLOs and write throughput or dataset size exceed single-node limits.
- Aren’t too many indexes bad for writes?
- Yes—each index adds write overhead and vacuum work. Index only for the access paths you use. Prefer composite/covering indexes that serve multiple queries. Measure write amplification and use `pg_stat_user_indexes` to spot unused ones; drop with `DROP INDEX CONCURRENTLY`.
- Can caching hide bugs or stale data?
- It can. Cache stable resources (catalog, profiles) with explicit TTLs and busting on writes via events. Avoid caching per-user secrets or rapidly mutating rows. Add cache-hit metrics and alarms for stampedes; use per-key locks as shown.
- How do I avoid downtime when adding big indexes?
- Postgres: use `CREATE INDEX CONCURRENTLY`, with `maintenance_work_mem` and autovacuum tuned so the build doesn’t starve the workload. MySQL: use InnoDB online DDL or gh-ost/pt-online-schema-change. Canary the new index by verifying the query plan in staging first.
- Are ORMs the problem?
- ORMs aren’t evil; opaque queries are. Use them with discipline: explicit selects, preloads/joins to avoid N+1, and log the generated SQL. For hot paths, write raw SQL with prepared statements and measured plans.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
