The Database Tune-Up That Cut p95 Latency in Half Without Rewriting a Line of App Code

User growth shouldn’t mean slower checkouts, angry PMs, and ballooning RDS bills. Here’s what actually scales when your database becomes the bottleneck.

You don’t need a rewrite. You need to stop asking your database to do dumb things at scale.

The moment growth outpaces your database

We walked into a team where user growth was a champagne problem—daily actives up 3x after a marketing push. The hangover: p95 API latency jumped from 280ms to 1.4s, checkout errors spiked to 3.2%, and AWS RDS Aurora spend was up 42% month over month. The app code hadn’t changed much; the query patterns hitting the database had. We’ve seen this movie across Shopify-like storefronts, B2B SaaS with analytic dashboards, and fintech ledgers. The plot twist is always the same: you don’t need a total rewrite—you need to fix how your app asks the database to work.

Goal: Shrink user-visible latency and error rates while controlling cost—as in real numbers you can defend in a QBR—without taking a six-month platform detour.

Measure what the user feels, not just what the DB reports

Before touching a single index, wire the metrics that leadership cares about.

  • User-facing SLOs:
    • p95 API latency for login, search, and checkout under 300ms; p99 under 800ms.
    • Error rate under 0.5% (HTTP 5xx + DB timeouts).
    • For revenue flows, track conversion vs. latency; we’ve seen 300ms gains yield 1–3% conversion lift.
  • Correlate app and DB:
    • Propagate a trace ID (e.g., X-Request-ID) from the edge down to your DB client, and tie Jaeger/Tempo/X-Ray spans to query fingerprints (see the sketch below).
    • Enable pg_stat_statements (Postgres) or the MySQL slow query log with long_query_time=0.2. Aggregate top N by total time, not count.
  • Baseline dashboards (Prometheus + Grafana or Datadog):
    • Per-endpoint p95/p99 latency, error rate, saturation (in-flight requests).
    • DB: connections used, active vs. idle, lock wait time, buffer cache hit ratio, replica lag.
  • Budget: Agree on “good enough” targets with PMs. This stops you from gold-plating a non-bottleneck.

If you can’t tie a query to a slow user flow with a trace, you’re guessing. Guessing is how you spend three weeks tuning shared_buffers for a 1% gain while checkout still times out.
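
One lightweight way to make that tie-out possible is to stamp the trace ID into the SQL itself as a comment (the sqlcommenter approach), so whatever shows up in pg_stat_activity or the slow query log carries the span that issued it. A minimal sketch, assuming psycopg2 and a framework that exposes the incoming X-Request-ID header; the query and DSN are illustrative:

  import psycopg2

  def run_traced(conn, sql, params, trace_id):
      # Prepend the trace ID as a SQL comment (sqlcommenter-style) so the text
      # that lands in pg_stat_activity and the server log can be matched back
      # to the originating Jaeger/Tempo/X-Ray span. trace_id is generated
      # server-side (never user input), so interpolating it is safe here.
      tagged_sql = f"/* trace_id={trace_id} */ {sql}"
      with conn.cursor() as cur:
          cur.execute(tagged_sql, params)
          return cur.fetchall()

  conn = psycopg2.connect("dbname=app")  # hypothetical DSN
  rows = run_traced(
      conn,
      "SELECT id, status FROM orders WHERE user_id = %s",
      (42,),
      trace_id="req-8f3a9c",  # value pulled from the X-Request-ID header
  )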

Fix the query path first: index, simplify, delete work

Nine out of ten wins come from making the database do less work per request.

  1. Identify the top offenders
    • Postgres:
      CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
      SELECT queryid, calls, total_exec_time, mean_exec_time, query
      FROM pg_stat_statements
      ORDER BY total_exec_time DESC
      LIMIT 10;
    • MySQL:
      SET GLOBAL slow_query_log = 1;
      SET GLOBAL long_query_time = 0.2;
      SET GLOBAL log_output = 'TABLE';  -- required so mysql.slow_log is populated
      SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 10;
  2. Prove with EXPLAIN
    • Postgres:
      EXPLAIN (ANALYZE, BUFFERS)
      SELECT ... WHERE user_id = $1 AND created_at > now() - interval '7 days';
    • Look for sequential scans on hot paths, misestimates, and row rechecks.
  3. Add the right index (covering, composite, or partial, matched to the query shape)
    • Covering example (Postgres):
      CREATE INDEX CONCURRENTLY idx_orders_user_created
        ON orders (user_id, created_at DESC) INCLUDE (status, total_amount);
    • Partial index to avoid bloat:
      CREATE INDEX CONCURRENTLY idx_orders_open
        ON orders (merchant_id, status)
        WHERE status IN ('open','awaiting_payment');
    • For time-series scans, consider BRIN on very large append-only tables:
      CREATE INDEX CONCURRENTLY idx_events_brin ON events USING brin(created_at);
  4. Simplify queries and N+1s
    • Replace ORM-generated monsters with explicit SQL; use SELECT columns, not SELECT *.
    • Batch lookups: collapse 100 N+1 calls into one WHERE id = ANY($1) (first sketch after this list).
    • Move non-critical joins to async pipelines (outbox/table + worker) when they aren’t needed inline.
  5. Delete work with precomputation
    • Materialized views for expensive aggregations; refresh on schedule or event-triggered (second sketch after this list).
    • Denormalize cautiously for read-heavy paths; guard with data contracts and tests.
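
Here is what killing an N+1 looks like in practice—a minimal sketch, assuming psycopg2; the orders columns mirror the index examples above and the function names are made up for illustration:

  import psycopg2

  def fetch_orders_n_plus_1(conn, order_ids):
      # Anti-pattern: one round trip per id; 100 ids means 100 queries.
      results = []
      with conn.cursor() as cur:
          for oid in order_ids:
              cur.execute(
                  "SELECT id, status, total_amount FROM orders WHERE id = %s",
                  (oid,),
              )
              results.append(cur.fetchone())
      return results

  def fetch_orders_batched(conn, order_ids):
      # One round trip: psycopg2 adapts a Python list to a Postgres array,
      # so WHERE id = ANY(%s) replaces the whole loop.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, status, total_amount FROM orders WHERE id = ANY(%s)",
              (list(order_ids),),
          )
          return cur.fetchall()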
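
And a minimal sketch of the precomputation route: build a materialized view for the heavy aggregate and refresh it from a worker. Assumes psycopg2; the view name, aggregate, and cadence are illustrative. REFRESH ... CONCURRENTLY needs a unique index on the view and lets reads continue during the refresh.

  import psycopg2

  # Ship this DDL once via your migration tool.
  DDL = """
  CREATE MATERIALIZED VIEW IF NOT EXISTS account_daily_active AS
  SELECT account_id,
         date_trunc('day', created_at) AS day,
         count(DISTINCT payload->>'user_id') AS distinct_users
  FROM events
  GROUP BY 1, 2;

  CREATE UNIQUE INDEX IF NOT EXISTS account_daily_active_key
    ON account_daily_active (account_id, day);
  """

  def refresh(conn):
      # Run from cron, Celery beat, or a Kubernetes CronJob every 5 minutes.
      with conn:
          with conn.cursor() as cur:
              cur.execute(
                  "REFRESH MATERIALIZED VIEW CONCURRENTLY account_daily_active"
              )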

What this looks like in numbers:

  • A B2B dashboard cut its p95 from 2.1s to 540ms by adding two covering indexes and moving a COUNT DISTINCT to a materialized view refreshed every 5 minutes.
  • A consumer app dropped checkout timeouts from 2.7% to 0.4% by killing N+1s and adding INCLUDE columns to match response fields.

Control concurrency: pool or die

The easiest way to take down a database is to “scale” your app tier and unleash 4,000 connections. Don’t. Control the firehose.

  • Use a real pool:
    • Postgres: pgBouncer in transaction mode for web traffic; use session mode only for clients that rely on session state (prepared statements, temp tables, advisory locks).
    • MySQL: ProxySQL to cap per-user concurrency and route read traffic.
  • Cap it at the edge:
    • Set max_connections on Postgres sanely (often < 500). Let the pool queue; it’s cheaper than backpressure at the DB.
    • Configure app-level concurrency budgets (gunicorn workers, puma threads, node cluster) to match pool size.
  • Example pgbouncer.ini:
    [databases]
    app = host=db-primary port=5432 dbname=app
    
    [pgbouncer]
    pool_mode = transaction
    max_client_conn = 4000
    default_pool_size = 80
    min_pool_size = 20
    reserve_pool_size = 20
    server_reset_query = DISCARD ALL   ; ignored in transaction mode unless server_reset_query_always=1
    server_lifetime = 1800
    server_idle_timeout = 60
  • Protect the primary:
    • Set statement timeouts: SET LOCAL statement_timeout = '300ms' for hot endpoints (see the sketch after this list).
    • Kill zombies: idle_in_transaction_session_timeout = '15s'.
  • Lambda/serverless gotchas:
    • Use RDS Proxy or Cloud SQL Auth Proxy; cold starts can DDoS your DB with connections otherwise.
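
For the statement-timeout budget, here is a minimal sketch assuming psycopg2 behind a pgBouncer transaction-mode pool. SET LOCAL scopes the timeout to the current transaction, so it cannot leak onto other clients’ pooled server connections the way a session-level SET could:

  import psycopg2

  def checkout_summary(conn, user_id):
      with conn:  # opens a transaction, commits on success, rolls back on error
          with conn.cursor() as cur:
              cur.execute("SET LOCAL statement_timeout = '300ms'")
              cur.execute(
                  "SELECT status, total_amount FROM orders "
                  "WHERE user_id = %s ORDER BY created_at DESC LIMIT 20",
                  (user_id,),
              )
              return cur.fetchall()

  # A query that blows the budget raises psycopg2.errors.QueryCanceled;
  # catch it at the endpoint and serve a fallback instead of hanging.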

What this buys you:

  • A fintech client saw connection storms drop 95% and p95 latency improve 40% after moving to pgBouncer with a strict concurrency budget. Infra cost fell 18% because they stopped over-provisioning CPU just to handle spikes.

Scale reads with replicas and cache the obvious

Get reads off your primary so writes stay fast and predictable.

  • Read replicas done right:
    • Route GET endpoints that tolerate slight staleness to replicas; tag via readOnly=true in the data access layer.
    • Watch lag. If replica lag exceeds ~1s, automatically fall back to the primary for latency-sensitive reads (first sketch after this list).
    • For Aurora MySQL global databases with write forwarding enabled, aurora_replica_read_consistency = SESSION gives read-your-writes on forwarded reads.
  • Add a cache you can trust:
    • Redis with short TTLs (30–120s) for list pages and product catalogs.
    • Use key-versioning on invalidation to avoid thundering herds; prewarm on deploys.
    • Example pattern (expanded in the second sketch after this list):
      key = f"user:{id}:orders:v{schema_version}"
      ttl = 60
  • Precompute metrics:
    • Materialized views (REFRESH MATERIALIZED VIEW CONCURRENTLY) or summary tables updated via CDC/outbox processed by a worker (Debezium, Kafka, Airflow).
  • CQRS-lite:
    • Keep writes on primary; serve queries from a read model built asynchronously. This is often a simple table, not Kafka-and-microservices-the-movie.
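
A minimal sketch of the lag-guarded routing above, assuming psycopg2 and separate DSNs for primary and replica (the hostnames are placeholders). In production you would cache the lag check for a few seconds rather than run it per request:

  import psycopg2

  MAX_LAG_SECONDS = 1.0

  primary = psycopg2.connect("host=db-primary dbname=app")
  replica = psycopg2.connect("host=db-replica dbname=app")
  primary.autocommit = True
  replica.autocommit = True

  def replica_lag_seconds():
      # Seconds since the standby last replayed a transaction. Caveat: on an
      # idle primary this number grows even though nothing is actually behind.
      with replica.cursor() as cur:
          cur.execute(
              "SELECT extract(epoch FROM (now() - pg_last_xact_replay_timestamp()))"
          )
          lag = cur.fetchone()[0]
          return float(lag) if lag is not None else float("inf")

  def read_conn(read_only=True):
      # Stale-tolerant reads go to the replica unless it is lagging;
      # everything else (writes, read-after-write) stays on the primary.
      if read_only and replica_lag_seconds() <= MAX_LAG_SECONDS:
          return replica
      return primary

  def recent_orders(user_id):
      with read_conn(read_only=True).cursor() as cur:
          cur.execute(
              "SELECT id, status FROM orders WHERE user_id = %s "
              "ORDER BY created_at DESC LIMIT 20",
              (user_id,),
          )
          return cur.fetchall()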
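
And the versioned-key cache pattern fleshed out, assuming redis-py and psycopg2; SCHEMA_VERSION is an assumed constant you bump whenever the cached shape changes, which invalidates every old key at once without issuing deletes:

  import json
  import redis

  SCHEMA_VERSION = 3     # bump on shape changes to invalidate all old keys
  TTL_SECONDS = 60

  r = redis.Redis(host="localhost", port=6379)

  def user_orders(conn, user_id):
      key = f"user:{user_id}:orders:v{SCHEMA_VERSION}"
      cached = r.get(key)
      if cached is not None:
          return json.loads(cached)

      with conn.cursor() as cur:
          cur.execute(
              "SELECT id, status, total_amount FROM orders "
              "WHERE user_id = %s ORDER BY created_at DESC LIMIT 50",
              (user_id,),
          )
          rows = [
              {"id": i, "status": s, "total": float(t)}
              for i, s, t in cur.fetchall()
          ]

      # Short TTL bounds staleness; per-user keys rarely stampede, so no lock
      # is needed here. For hot shared keys, add jitter or a singleflight lock.
      r.set(key, json.dumps(rows), ex=TTL_SECONDS)
      return rows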

Result patterns we routinely see:

  • 25–60% reduction in primary CPU after routing 40–70% of reads to replicas.
  • p99 improved from 1.1s to 320ms on search endpoints with Redis + partial index.

Partition before you shard (and archive aggressively)

Massive tables eventually hurt: index bloat, vacuum pain, slow scans. Partitioning keeps hot data hot and cold data out of the way.

  • Time-based partitioning (Postgres):
    • Native declarative partitions on created_at by month with a retention policy.
    • Example:
      CREATE TABLE events (
        id bigserial,
        created_at timestamptz not null,
        account_id bigint not null,
        payload jsonb not null,
        -- the partition key must be part of any primary key or unique constraint
        primary key (id, created_at)
      ) PARTITION BY RANGE (created_at);
      
      CREATE TABLE events_2025_10 PARTITION OF events
        FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');
    • Use pg_partman to auto-create/drop partitions.
  • Hash or list partition on hot keys (tenant, account) if time doesn’t cluster well.
  • Archive cold data:
    • Move >180-day partitions to cheaper storage (export to S3 as Parquet or CSV) and query with Trino/Athena for analytics (see the sketch after this list).
    • Keep only the last N partitions indexed for OLTP.
  • When to introduce horizontal scale:
    • Postgres: Citus for distributed tables when single-node CPU or IOPS is at the wall and you’ve already optimized queries.
    • MySQL: Vitess for online sharding and connection pooling at scale (YouTube did this for a reason).
    • Plan the key space and routing early; implement a dual-write + backfill to migrate safely.
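
A minimal sketch of the archive step, assuming psycopg2 and monthly partitions named like events_2024_01. Detach first so queries stop planning against the partition, dump the now-standalone table, then drop it; the Parquet conversion and S3 upload are left to whatever tooling you already run:

  import psycopg2

  def archive_partition(conn, partition):
      # partition (e.g. "events_2024_01") comes from your retention job,
      # never from user input, so interpolating the identifier is safe here.
      with conn:
          with conn.cursor() as cur:
              cur.execute(f"ALTER TABLE events DETACH PARTITION {partition}")

      with conn, conn.cursor() as cur, open(f"/tmp/{partition}.csv", "w") as out:
          # Dump the detached table; convert/upload (Parquet, S3) downstream.
          cur.copy_expert(f"COPY {partition} TO STDOUT WITH CSV HEADER", out)

      with conn:
          with conn.cursor() as cur:
              cur.execute(f"DROP TABLE {partition}")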

What this avoids:

  • A marketplace with a 1.2B-row events table shaved 70% off index maintenance and cut p95 by 35% after moving to monthly partitions and archiving 12+ month data to S3.

Make changes safely: migrations, canaries, and rollbacks

Schema changes and indexes can be production weapons—use a safety harness.

  • Online indexes:
    • Postgres CREATE INDEX CONCURRENTLY; MySQL ALGORITHM=INPLACE, LOCK=NONE (version-dependent).
  • Gate with feature flags:
    • Ship the index first; route only canary traffic (e.g., 5%) through the query path that exercises it. Watch p95 and DB CPU (see the sketch after this list).
  • Canary and dual-read:
    • For read-model refactors, dual-read old and new paths behind a flag; compare results for a subset of traffic.
  • Migrations with rollback scripts:
    • Use Sqitch, Flyway, or Liquibase. Always write a down script, even if it’s partial (e.g., drop index concurrently later).
  • Observability as a deployment gate:
    • Block rollout if replica lag > N seconds, lock wait spikes, or error budget burn > 2%/hour.
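
A minimal sketch of the flag-gated canary, assuming psycopg2; the percentage, the random split, and the account_daily_events summary table are stand-ins for your real flag provider and read model. The point is that rollback is “set the percentage to 0,” not a schema change:

  import random

  CANARY_PERCENT = 5   # start small; watch p95, error rate, and DB CPU before raising

  def weekly_event_count(conn, account_id):
      if random.uniform(0, 100) < CANARY_PERCENT:
          return weekly_event_count_new(conn, account_id)
      return weekly_event_count_old(conn, account_id)

  def weekly_event_count_old(conn, account_id):
      # Known-good path: scans the raw events table.
      with conn.cursor() as cur:
          cur.execute(
              "SELECT count(*) FROM events "
              "WHERE account_id = %s AND created_at > now() - interval '7 days'",
              (account_id,),
          )
          return cur.fetchone()[0]

  def weekly_event_count_new(conn, account_id):
      # Canary path: reads a precomputed daily summary (assumed table).
      with conn.cursor() as cur:
          cur.execute(
              "SELECT coalesce(sum(event_count), 0) FROM account_daily_events "
              "WHERE account_id = %s AND day > now() - interval '7 days'",
              (account_id,),
          )
          return cur.fetchone()[0]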

We’ve watched more outages from indexing without CONCURRENTLY than from actual traffic spikes. Slow is smooth, smooth is fast.

What you can expect in 2–4 weeks

If you follow this sequence, here’s the boringly repeatable impact we see at clients after a focused engagement:

  • p95 latency on top 3 endpoints down 35–65%.
  • Error rate from DB timeouts/cancelled queries down 60–90%.
  • Primary CPU down 25–50%; infra cost down 15–30% by right-sizing.
  • MTTR down because traces point to specific queries, not vibes.
  • Business wins: 0.5–2.5% conversion lift on checkout/search; fewer abandoned sessions; happier finance team.

Execution plan we use at GitPlumbers (steal this):

  1. Instrument: turn on pg_stat_statements/slow log, wire traces, set clear SLOs.
  2. Triage top 10 queries by total time; run EXPLAIN (ANALYZE, BUFFERS) on each.
  3. Index and simplify; kill N+1; batch and denormalize where safe.
  4. Introduce pooling; enforce concurrency budgets.
  5. Route reads to replicas; add Redis for obvious wins; precompute heavy aggregates.
  6. Partition large tables; archive cold data; plan Citus/Vitess only if necessary.
  7. Automate safety: online migrations, canaries, and easy rollbacks.
  8. Prove it: before/after dashboards tied to dollars.

The quiet signals you’re winning

  • On-call pages shift from “primary CPU 95%” to “one bad query slipped to prod.”
  • Releases stop being roulette.
  • The CFO asks why your RDS bill went down while DAUs went up. You show a chart instead of a novel.


Key takeaways

  • Tie every optimization to user-facing metrics (p95 latency, error rate, conversion).
  • Start with the query path: measure, index, and simplify before touching infra.
  • Control concurrency with pooling—most outages are self-inflicted connection storms.
  • Use replicas and caches for reads; keep writes hot and simple.
  • Partition before you shard; make data movement boring and reversible.
  • Automate rollouts and rollback paths for schema changes and index builds.
  • Prove the business impact: faster flows, lower cloud spend, higher conversion.

Implementation checklist

  • Define SLOs for p95/p99 latency on key flows (login, search, checkout).
  • Enable `pg_stat_statements` or MySQL slow query log; baseline top offenders.
  • Use `EXPLAIN (ANALYZE, BUFFERS)` to confirm index coverage and row estimates.
  • Add covering/partial indexes; remove unused ones; verify with canary traffic.
  • Introduce `pgBouncer`/`ProxySQL` with transaction pooling; cap max concurrency.
  • Route read-heavy endpoints to replicas; add cache with strict TTLs and invalidation.
  • Partition large tables (time or hash); archive cold data; consider Citus/Vitess when needed.
  • Automate migrations with `Sqitch`/`Flyway`/`Liquibase`; add rollback scripts.
  • Track cost and performance deltas in the same dashboard; ship a post-incident doc.

Questions we hear from teams

How do I know if I need partitioning or full-blown sharding?
If your single-node primary still has CPU/IOPS headroom after indexing and caching, and most slow queries are long table scans, start with partitioning and archiving. Move to Citus/Vitess when you hit physical limits (CPU > 80% sustained, IOPS saturated) or multi-tenant isolation becomes a requirement.
Will read replicas hurt consistency for critical flows?
Only route flows that tolerate staleness; keep write-after-read paths on primary or use session-level consistency where available. Monitor replica lag and fail back to primary if it exceeds your tolerance.
Is Redis mandatory?
No. Start with indexes and query fixes. Redis is a force multiplier for read-heavy endpoints and computed aggregates, but bad invalidation can cause correctness issues. Keep TTLs short and use versioned keys.
How do I roll back a bad index?
Create indexes concurrently and gate usage via flags. If the index hurts, flip the flag off and drop the index concurrently during low traffic. Always keep a rollback script in your migration tool.
What about tuning Postgres parameters?
It’s usually the last 10%: `work_mem` for sorts/joins, `effective_cache_size` to match instance memory, realistic `random_page_cost` for SSDs. But don’t expect miracles—fixing queries and concurrency gives bigger wins.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to a GitPlumbers architect · See our performance case studies
