The Expand/Contract Playbook: Shipping Schema Changes Without Waking PagerDuty

You don’t need a maintenance window. You need a plan. Here’s the step-by-step zero-downtime schema migration strategy we actually use in production.

Zero downtime isn’t magic—it’s choreography. Expand safely, backfill patiently, flip deliberately, and only then contract.

Key takeaways

  • Zero-downtime schema changes are about orchestration, not heroics—use expand/contract with strict guardrails.
  • Always split risky operations into safe metadata changes first, then data backfill, then traffic cutover, then contract.
  • Measure success with SLO-aligned metrics: error rate, p95 write latency, replication lag, and lock wait time.
  • Automate with GitOps: migrations run as versioned, replayable jobs with clear rollback gates.
  • Use online DDL tools (Postgres: CONCURRENTLY, NOT VALID; MySQL: INPLACE/gh-ost) instead of blocking ALTERs.

Implementation checklist

  • Define the change as expand/contract; plan dual-write and read-compatibility.
  • Add schema in expand phase with non-blocking operations only.
  • Backfill in controlled batches with retry/idempotency.
  • Flip reads via feature flag or canary; monitor error rate and latency.
  • Remove dual-write; enforce constraints; drop old paths in contract phase.
  • Create and validate indexes concurrently/online.
  • Gate each step with metrics and rollback plans.
  • Automate via CI/CD and GitOps hooks; tag artifacts and migrations.
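As a concrete sketch of the expand and backfill steps above, assuming a hypothetical `users` table gaining a `contact_email` column (Postgres syntax; names are illustrative):

```sql
-- Expand: metadata-only change, effectively instant, non-blocking
ALTER TABLE users ADD COLUMN contact_email text;

-- Index build that doesn't block concurrent writes
CREATE INDEX CONCURRENTLY idx_users_contact_email ON users (contact_email);

-- Backfill: small, idempotent batches; rerun until 0 rows are updated
UPDATE users
SET contact_email = email
WHERE id IN (
  SELECT id FROM users
  WHERE contact_email IS NULL
  ORDER BY id
  LIMIT 1000
);
```

Tune the batch size and pause between batches based on replication lag and lock wait time; the `WHERE contact_email IS NULL` predicate is what makes reruns safe.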

Questions we hear from teams

Can I skip dual-write and just backfill?
Only if no new writes will occur to the old shape during the backfill window. In practice, production systems keep writing. Dual-write guarantees consistency while backfill catches up.
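One way to get dual-write without touching every application code path is a temporary trigger. A minimal Postgres sketch, assuming the same hypothetical `email` → `contact_email` migration:

```sql
-- Keep the new shape in sync on every insert/update while backfill runs
CREATE OR REPLACE FUNCTION mirror_email() RETURNS trigger AS $$
BEGIN
  NEW.contact_email := NEW.email;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- EXECUTE FUNCTION requires Postgres 11+; use EXECUTE PROCEDURE on older versions
CREATE TRIGGER users_dual_write
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION mirror_email();
```

Drop the trigger in the contract phase, once reads have flipped and parity checks are clean.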
Do I need gh-ost for every MySQL change?
No. Use native `ALGORITHM=INPLACE, LOCK=NONE` when it’s supported for your operation. Use `gh-ost` or `pt-online-schema-change` for heavy operations (e.g., changing column type, adding NOT NULL to a huge table) or when metadata locks are a risk.
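For example, a native online index add, falling back to `gh-ost` for a heavy type change (database, table, and column names are illustrative; `gh-ost` also needs connection flags for your environment):

```sql
-- Native online DDL: fails fast if the operation can't actually run in place
ALTER TABLE orders
  ADD INDEX idx_orders_status (status),
  ALGORITHM=INPLACE, LOCK=NONE;
```

```shell
# Heavy change via gh-ost: ghost table + binlog replay, then atomic cutover
gh-ost \
  --database=shop --table=orders \
  --alter="MODIFY amount DECIMAL(12,2) NOT NULL" \
  --chunk-size=1000 --max-lag-millis=1500 \
  --execute
```

The `ALGORITHM`/`LOCK` clauses are assertions, not hints: if MySQL can't satisfy them, the statement errors out instead of silently taking a blocking path, which is exactly what you want in a migration pipeline.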
How do I enforce NOT NULL on Postgres without downtime?
Add the column nullable, backfill, add a `CHECK (col IS NOT NULL) NOT VALID` constraint, run `VALIDATE CONSTRAINT` (which scans the table without blocking writes), then `ALTER COLUMN ... SET NOT NULL`, which is effectively instant because the validated constraint already proves the invariant.
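Spelled out as DDL, that sequence looks like this (table and column names are illustrative; the `SET NOT NULL` fast path requires Postgres 12+):

```sql
ALTER TABLE users ADD COLUMN tier text;              -- expand: nullable, instant

-- ... backfill tier in batches here ...

ALTER TABLE users
  ADD CONSTRAINT tier_not_null
  CHECK (tier IS NOT NULL) NOT VALID;                -- instant: skips the full scan

ALTER TABLE users VALIDATE CONSTRAINT tier_not_null; -- scans without blocking writes

ALTER TABLE users ALTER COLUMN tier SET NOT NULL;    -- instant: constraint proves it

ALTER TABLE users DROP CONSTRAINT tier_not_null;     -- redundant once NOT NULL is set
```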
What about multi-tenant or sharded systems?
Run the same playbook shard-by-shard or tenant-by-tenant with canaries. Stagger backfills to keep aggregate load stable. For Vitess or Citus, use their online DDL facilities and run canary per shard.
How do I measure success beyond “no alerts”?
Tie it to SLOs and business metrics: no added error budget burn, p95 write latency within +10%, no deadlocks, orders/min flat or improved, and a clean diff on parity sampling. Attach these graphs to your change ticket.
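For the parity-sampling diff, a cheap spot check is a sampled comparison of the old and new shapes (names illustrative; `TABLESAMPLE` is Postgres 9.5+):

```sql
-- Expect 0: any mismatch means dual-write or backfill has a gap
SELECT count(*) AS mismatches
FROM users TABLESAMPLE SYSTEM (1)   -- roughly 1% of pages
WHERE contact_email IS DISTINCT FROM email;
```

`IS DISTINCT FROM` treats two NULLs as equal, so rows not yet backfilled on either side don't produce false positives.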
