The Expand/Contract Playbook: Shipping Schema Changes Without Waking PagerDuty
You don’t need a maintenance window. You need a plan. Here’s the step-by-step zero-downtime schema migration strategy we actually use in production.
Zero downtime isn’t magic—it’s choreography. Expand safely, backfill patiently, flip deliberately, and only then contract.Back to all posts
Key takeaways
- Zero-downtime schema changes are about orchestration, not heroics—use expand/contract with strict guardrails.
- Always split risky operations into safe metadata changes first, then data backfill, then traffic cutover, then contract.
- Measure success with SLO-aligned metrics: error rate, p95 write latency, replication lag, and lock wait time.
- Automate with GitOps: migrations run as versioned, replayable jobs with clear rollback gates.
- Use online DDL tools (Postgres: CONCURRENTLY, NOT VALID; MySQL: INPLACE/gh-ost) instead of blocking ALTERs.
Implementation checklist
- Define the change as expand/contract; plan dual-write and read-compatibility.
- Add schema in expand phase with non-blocking operations only.
- Backfill in controlled batches with retry/idempotency.
- Flip reads via feature flag or canary; monitor error rate and latency.
- Remove dual-write; enforce constraints; drop old paths in contract phase.
- Create and validate indexes concurrently/online.
- Gate each step with metrics and rollback plans.
- Automate via CI/CD and GitOps hooks; tag artifacts and migrations.
Questions we hear from teams
- Can I skip dual-write and just backfill?
- Only if no new writes will occur to the old shape during the backfill window. In practice, production systems keep writing. Dual-write guarantees consistency while backfill catches up.
- Do I need gh-ost for every MySQL change?
- No. Use native `ALGORITHM=INPLACE, LOCK=NONE` when it’s supported for your operation. Use `gh-ost` or `pt-online-schema-change` for heavy operations (e.g., changing column type, adding NOT NULL to a huge table) or when metadata locks are a risk.
- How do I enforce NOT NULL on Postgres without downtime?
- Add the column nullable, backfill, add a `CHECK (col IS NOT NULL) NOT VALID`, `VALIDATE CONSTRAINT` (non-blocking), then `ALTER COLUMN SET NOT NULL` which is effectively instant because the constraint proves the invariant.
- What about multi-tenant or sharded systems?
- Run the same playbook shard-by-shard or tenant-by-tenant with canaries. Stagger backfills to keep aggregate load stable. For Vitess or Citus, use their online DDL facilities and run canary per shard.
- How do I measure success beyond “no alerts”?
- Tie it to SLOs and business metrics: no added error budget burn, p95 write latency within +10%, no deadlocks, orders/min flat or improved, and a clean diff on parity sampling. Attach these graphs to your change ticket.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.