Zero-Downtime Schema Changes That Don’t Page You at 2 a.m.: The Expand–Contract Playbook That Actually Works

Stop praying before migrations. Here’s the proven, stepwise path to evolve schemas in production without brownouts, lock storms, or 3 a.m. rollbacks.

If your migration plan depends on everyone stopping traffic for five minutes, you don’t have a plan—you have a wish.

The reality of zero-downtime schema changes

I’ve lost count of how many “five-minute maintenances” turned into paging marathons. The pattern is always the same: a straightforward ALTER TABLE on a hot path, unexpected table locks, replication falling behind, API timeouts, and a Slack channel that won’t stop blinking. Zero-downtime isn’t magic—it’s a disciplined sequence of safe, observable steps. If your plan depends on everyone stopping traffic “just for a sec,” you don’t have a plan.

This is the playbook we’ve used at GitPlumbers cleaning up after dangerous migrations in Postgres and MySQL at fintechs and marketplaces with 24/7 SLOs. It’s opinionated, tool-backed, and gives you concrete checkpoints that keep you off the pager.

The expand–contract playbook (safe by default)

The core idea: ship forward-compatible changes first (expand), migrate data and behavior under flags, then remove old paths (contract) once the world is stable. Do not combine behavior changes with schema enforcement in one step.

  • Expand: add nullable columns, create indexes concurrently/online, add constraints as NOT VALID (Postgres), introduce views or dual-write paths.
  • Migrate: backfill in chunks with strict rate limits, enable dual reads, monitor SLOs and replication.
  • Contract: validate constraints, flip NOT NULL, remove old columns/paths after a cool-off period.

If a single ALTER can take you down, you’ve already chosen the wrong step size.
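Treated as code, expand–contract is just a tiny state machine that only advances when the current phase's gates pass — a minimal TypeScript sketch (the names here are ours, not from any migration tool):

```typescript
// Hypothetical phase gate: advance only when the current phase's checks pass.
type Phase = "expand" | "migrate" | "contract";

const order: Phase[] = ["expand", "migrate", "contract"];

function nextPhase(current: Phase, checksPassed: boolean): Phase {
  const i = order.indexOf(current);
  if (!checksPassed || i === order.length - 1) return current; // hold position
  return order[i + 1];
}
```

The point of encoding it at all: a pipeline can refuse to run a contract migration while the state is still "migrate", which is exactly the skipped-step failure mode that pages you.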

Step-by-step with checkpoints and rollbacks

Here’s the sequence I’d sign my name to for a high-traffic users table in Postgres and a similar path for MySQL.

  1. Preflight analysis

    • Measure:
      • Size: SELECT pg_total_relation_size('users');
      • Hot path latency: p95 for endpoints touching users
      • Write QPS and row churn
      • Replication lag baseline (goal: < 200ms steady, < 1s peak)
    • Safety scan:
      • Postgres: can we use CONCURRENTLY for indexes? Will DEFAULT rewrite the table? (Avoid pre-11 defaults on large tables.)
      • MySQL: can ALGORITHM=INSTANT/INPLACE be used? If not, plan gh-ost/pt-online-schema-change.
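The MySQL half of that safety scan can be captured as a small decision function your preflight script calls with the facts it gathered; a sketch with illustrative inputs:

```typescript
type MySqlDdlPlan = "instant" | "inplace-lock-none" | "gh-ost";

// Hypothetical decision: prefer the least invasive online DDL path.
function pickMySqlPlan(
  supportsInstant: boolean,
  supportsInplace: boolean,
  hotTable: boolean
): MySqlDdlPlan {
  if (supportsInstant) return "instant";
  if (supportsInplace && !hotTable) return "inplace-lock-none";
  return "gh-ost"; // large/hot tables: external tool with throttling and controlled cutover
}
```

Writing the decision down forces the team to answer the support questions before the change ships, instead of discovering a table copy mid-migration.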
  2. Expand schema safely

    • Postgres: add column nullable and index concurrently.
-- Never set NOT NULL or a heavy DEFAULT on day one
SET lock_timeout = '1s';
SET statement_timeout = '5min';
ALTER TABLE users ADD COLUMN billing_country text; -- nullable
CREATE INDEX CONCURRENTLY idx_users_lower_email ON users ((lower(email)));

-- Add constraint without blocking table
ALTER TABLE users
  ADD CONSTRAINT users_email_format_chk
  CHECK (email ~* '^[^@]+@[^@]+\.[^@]+$') NOT VALID;
  • MySQL 8+ (prefer instant/online):
ALTER TABLE users ADD COLUMN billing_country VARCHAR(2) NULL, ALGORITHM=INSTANT;
-- If not supported, try INPLACE with LOCK=NONE (or fall back to gh-ost)
ALTER TABLE users ADD INDEX idx_users_lower_email ((lower(email))), ALGORITHM=INPLACE, LOCK=NONE; -- functional index requires MySQL 8.0.13+
  • If online DDL isn’t guaranteed, use gh-ost:
gh-ost \
  --host=primary.db \
  --user=migrator --password=$MYSQL_PWD \
  --database=app --table=users \
  --alter="ADD COLUMN billing_country VARCHAR(2) NULL" \
  --allow-on-master --switch-to-rbr --cut-over=default \
  --exact-rowcount --chunk-size=1000 \
  --max-load=Threads_running=80 --critical-load=Threads_running=120 \
  --cut-over-lock-timeout-seconds=3

Checkpoint A: No error-rate increase, p95 delta < 5%, replication lag < 1s for 30 min.

  3. Ship code with dual-read/write (behind flags)
  • Use a feature flag (LaunchDarkly, Unleash, or OpenFeature) to control read path. Initially write to both; read old path.
// pseudo-code: old column is billing_country_old, new is billing_country
type User = { id: number; billingCountry?: string; billingCountryOld?: string };

const useNewBilling = flags.get("use_new_billing_country");

function readBillingCountry(u: User) {
  return useNewBilling
    ? u.billingCountry ?? u.billingCountryOld
    : u.billingCountryOld ?? u.billingCountry;
}

async function saveUser(u: User) {
  // dual write: keep both columns in sync until the contract phase
  const country = u.billingCountry ?? u.billingCountryOld;
  await db.query(
    "UPDATE users SET billing_country=$1, billing_country_old=$1 WHERE id=$2",
    [country, u.id]
  );
}
  • If you can’t dual-write in app code, a temporary trigger works—but keep it obvious and temporary:
CREATE OR REPLACE FUNCTION sync_billing_country() RETURNS trigger AS $$
BEGIN
  IF NEW.billing_country IS NULL AND NEW.billing_country_old IS NOT NULL THEN
    NEW.billing_country := NEW.billing_country_old;
  END IF;
  RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_billing_country
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_billing_country();

Checkpoint B: Deploy app with flag off, ensure no increased UPDATE contention, no deadlock spikes.

  4. Backfill in chunks with guardrails
  • Treat backfill like a batch job with rate limiting. No giant transactions.
# Example: chunked backfill with psql
export PGOPTIONS='-c lock_timeout=1000 -c statement_timeout=60000'
# NB: point the lag probe at a replica; pg_last_xact_replay_timestamp()
# returns NULL on the primary.
for start in $(seq 1 5000 100000000); do
  end=$((start+4999))
  psql "$PGURL" -v start=$start -v end=$end <<'SQL'
  UPDATE users
     SET billing_country = 'US'
   WHERE id BETWEEN :start AND :end
     AND billing_country IS NULL;
SQL
  sleep 0.5 # give replicas a breather
  # Stop if replication lag is high
  LAG=$(psql "$PGURL" -t -c "select extract(epoch from now()-pg_last_xact_replay_timestamp())" | xargs)
  awk "BEGIN{exit !($LAG>1.0)}" && echo "Lag ${LAG}s > 1s; stopping" && break
done
done
  • On MySQL, pt-online-schema-change --execute handles the copy/backfill via its trigger-based table rebuild; prefer gh-ost for fewer surprises in row-based replication (RBR) setups.

Checkpoint C: Backfill progresses with < 1% error budget consumption, p95 writes < +10%, replication lag < 2s sustained.
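The pause-on-lag rule in the loop above generalizes to a throttle policy you can unit-test; a TypeScript sketch (thresholds are illustrative; a return of -1 means "pause the job"):

```typescript
// Hypothetical backoff: scale the inter-chunk sleep with replica lag.
function nextSleepMs(lagSeconds: number, baseMs = 500, ceilingSeconds = 1.0): number {
  if (lagSeconds <= ceilingSeconds) return baseMs;           // healthy: keep pace
  if (lagSeconds >= 4 * ceilingSeconds) return -1;           // way behind: signal "pause job"
  return Math.round(baseMs * (lagSeconds / ceilingSeconds)); // degrade gracefully
}
```

Degrading gradually instead of flipping between full speed and full stop keeps the backfill's load on replicas smooth, which matters when reconciliation jobs share them.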

  5. Flip reads and validate constraints
  • Gradually route reads to new path under the flag. Keep dual-write a while longer.
  • Postgres: validate the constraint online, then enforce NOT NULL after you’re sure.
ALTER TABLE users VALIDATE CONSTRAINT users_email_format_chk; -- non-blocking

-- Only after backfill success and a steady-state window.
-- Tip (PG 12+): a validated CHECK (billing_country IS NOT NULL) lets
-- SET NOT NULL skip the full-table scan and take only a brief lock.
ALTER TABLE users ALTER COLUMN billing_country SET NOT NULL;
  • Foreign keys: create as NOT VALID, then validate.
ALTER TABLE orders
  ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id)
  REFERENCES users(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_fk;

Checkpoint D: 24–72 hours of normal traffic, no error-rate regression, zero new nulls detected in billing_country.
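"Gradually route reads" in the step above usually means a deterministic percentage ramp keyed on a stable id, so a given user never flaps between old and new paths; a minimal sketch (the modulo bucketing is illustrative — production flag tools hash the key):

```typescript
// Deterministic ramp: the same id always lands in the same bucket,
// so a user sticks to one read path for the whole rollout.
function inNewReadPath(userId: number, rampPercent: number): boolean {
  const bucket = userId % 100; // crude but stable bucketing
  return bucket < rampPercent;
}
```

Ramping 1% → 10% → 50% → 100% with a steady-state soak at each stop is what turns "flip reads" from a cliff into a gradient.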

  6. Contract safely
  • Remove old code paths. Drop old columns last. Keep artifacts for at least one deploy cycle.
-- after cool-off
ALTER TABLE users DROP COLUMN billing_country_old;

Rollback plan at every step

  • Expand failure: changes are additive; roll forward or leave in place.
  • Backfill issues: stop the job; revert read flag; data remains consistent.
  • Read flip regression: flip the flag off; dual writes preserved state.
  • Contract surprise: revert code; column still exists if you followed the cool-off practice.

Tooling that actually helps (and what to avoid)

What works:

  • Postgres
    • CREATE INDEX CONCURRENTLY and ALTER INDEX ... RENAME instead of blocking variants.
    • NOT VALID constraints + VALIDATE CONSTRAINT for FKs and CHECKs.
    • pg_stat_activity, pg_stat_progress_create_index, and pg_stat_replication for visibility.
  • MySQL
    • MySQL 8 ALGORITHM=INSTANT for column adds; INPLACE, LOCK=NONE otherwise.
    • gh-ost for hot tables; it respects load, has controllable cutover.
  • Migration frameworks
    • Flyway or Liquibase with out-of-order execution disabled in prod, plus separate “expand”/“contract” migration directories.
  • Release control
    • ArgoCD/GitOps pipelines to gate migrations behind checks.
    • Feature flags (LaunchDarkly, Unleash, OpenFeature).

What to avoid:

  • ORMs that auto-migrate in prod (e.g., rails db:migrate applying table-rewriting changes on deploy). Use safe wrappers such as strong_migrations, or reviewed SQL.
  • Adding a column with NOT NULL DEFAULT 'X' on pre-11 Postgres—it rewrites the whole table.
  • Big-bang renames that break drivers/ORMs; prefer views or aliases during the transition.

Example Argo Rollouts guard (Prometheus-backed):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: db-migration-guard
spec:
  metrics:
  - name: api-latency-p95
    interval: 1m
    count: 5
    successCondition: result < 0.3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api"}[1m])) by (le))
  - name: pg-replication-lag
    interval: 30s
    count: 20
    successCondition: result < 1
    provider:
      prometheus:
        address: http://prometheus:9090
        query: max(pg_replication_lag_seconds)

Observability and gates: prove it’s safe

Track these in real time during each step:

  • Error rate: maintain SLO (e.g., < 0.1% 5xx/5m). Gate flips on error deltas.
  • Latency: p95 for endpoints touching the table; target < +5% during expand/backfill.
  • Replication lag: ceiling 1s; pause backfill above threshold.
  • Lock waits: pg_locks and MySQL’s sys.innodb_lock_waits—look for anomalies.
  • WAL/redo growth: watch disk headroom; long backfills build WAL.
  • Bloat and dead tuples: pg_stat_all_tables.n_dead_tup and vacuum cadence.
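The thresholds above compose into a single pass/fail predicate your pipeline can evaluate every interval before letting a step proceed; a sketch with illustrative field names:

```typescript
interface StepMetrics {
  errorRatePct: number;   // 5xx over 5m, as a percentage
  p95DeltaPct: number;    // p95 latency change vs. pre-migration baseline
  replicationLagS: number;
  lockWaits: number;      // sessions currently waiting on locks
}

// Gate a migration step: every signal must be inside its SLO band.
function stepIsSafe(m: StepMetrics): boolean {
  return (
    m.errorRatePct < 0.1 &&
    m.p95DeltaPct < 5 &&
    m.replicationLagS < 1 &&
    m.lockWaits === 0
  );
}
```

This is the same logic the Argo AnalysisTemplate earlier expresses in YAML; having it as one predicate means the backfill script, the flag ramp, and the CD pipeline all gate on identical thresholds.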

Quick Postgres check script in CI before “expand”:

psql "$PGURL" <<'SQL'
SELECT relname, n_dead_tup FROM pg_stat_all_tables WHERE schemaname='public' AND n_dead_tup > 100000;
SELECT pid, query, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event IS NOT NULL;
SQL

Case notes: evolving a payments ledger in production

At a payments company, we needed to introduce settlement_currency to a 400M-row ledger_entries table (Postgres 12), enforce non-null, and index for reconciliation jobs. Hard downtime was not an option.

  • Expand: added nullable settlement_currency, created CREATE INDEX CONCURRENTLY on (account_id, settlement_currency). Shipped code with dual-write via a LaunchDarkly flag.
  • Backfill: chunked 10k rows per batch with 200ms sleeps; paused automatically if replication lag > 1.5s. Ran during business hours to match traffic patterns.
  • Validate + enforce: CHECK on ISO code pattern NOT VALID then VALIDATE. After 48 hours steady, ALTER COLUMN ... SET NOT NULL.
  • Contract: removed fallback reads after one deploy cycle; dropped old derived field after two weeks.

Results (real numbers):

  • No incidents; p95 for write-heavy endpoints increased by ~2.3% during backfill windows.
  • Replication lag peaked at 0.9s; average 0.2s. Backfill total wall time: 36 hours.
  • Reconciliation job improved by 18% after the new composite index.

Common traps and how to escape them

  • Column renames: many ORMs bake column names in cached statements. Use additive columns + alias views, ship code, then drop the old column later.
  • Defaults that rewrite: pre-Postgres 11 ALTER TABLE ... ADD COLUMN ... DEFAULT 'X' rewrites the table. Add without default, backfill, then SET DEFAULT.
  • Foreign keys on huge tables: adding FK can block. Use NOT VALID + VALIDATE CONSTRAINT.
  • Big transactions: backfills that run in one giant UPDATE spam WAL and stall replicas. Chunk and sleep.
  • MySQL DDL surprises: not every INPLACE operation is lock-free; check the execution plan and watch SHOW WARNINGS. When in doubt, gh-ost.
  • Index builds: in Postgres, always CONCURRENTLY on hot tables. Monitor pg_stat_progress_create_index.
  • Timeouts: set lock_timeout short so you fail fast rather than pile up blocked sessions.

If this still feels risky, you’re not wrong. Most teams get into trouble by skipping observability and flags. We’ve been pulled in after these go sideways; the fix is almost always the same: smaller steps, better gates, and the discipline to leave rollback paths in place.

Key takeaways

  • Zero-downtime is a process, not a flag: use expand–contract with feature-gated reads/writes.
  • Never run blocking DDL blindly—use Postgres CONCURRENTLY, NOT VALID, and MySQL online DDL/gh-ost.
  • Backfills must be chunked and observable; treat them like long-running batch jobs, not migrations.
  • Gate every step with metrics: error rate, p95 latency, replication lag, lock wait time.
  • Have explicit rollback switches: feature flags to revert reads, and leave forward-compatible artifacts in place.

Implementation checklist

  • Preflight: size the tables, hot paths, QPS, and replication topology; simulate on prod-like data.
  • Plan expand–contract steps with idempotent migrations and feature flags for dual-read/write if needed.
  • Add columns/indexes with non-blocking DDL (Postgres CONCURRENTLY; MySQL gh-ost/INSTANT/INPLACE).
  • Run chunked backfills with rate limits and SLO guards; stop on lag or error thresholds.
  • Flip reads under a flag; observe for a steady-state window before contracting.
  • Validate constraints online (Postgres NOT VALID + VALIDATE); then enforce NOT NULL.
  • Remove dead paths only after a cool-off period and postmortem metrics review.

Questions we hear from teams

Is true zero-downtime possible for every schema change?
No. Some changes (e.g., shrinking VARCHAR length, changing column type incompatibly) require shims or multi-release choreography. The goal is zero user-visible downtime, not zero CPU pause. Use additive steps, views, and dual writes to emulate the new shape, then migrate gradually.
How do I handle column type changes safely?
Add a new column with the target type, backfill the conversion (the equivalent of a `USING` expression) in app code or a batch job, dual-read under a flag, then switch and drop the old column. In Postgres, avoid an in-place `ALTER COLUMN ... TYPE ... USING` on hot tables unless you’ve proven it’s lightweight.
What about multi-tenant or sharded databases?
Run the same expand–migrate–contract steps per shard with staggered rollouts. Canary 1–5% of tenants first, watch cross-shard metrics, and automate halt-on-lag. Keep playbooks shard-aware (connection strings, job distribution, and id ranges).
Do I need gh-ost or pt-online-schema-change on MySQL 8?
Often no—`ALGORITHM=INSTANT` and `INPLACE, LOCK=NONE` cover many cases. But for very large/hot tables or when server settings/DDL paths are unclear, gh-ost still gives safer, observable migrations with controlled cutover.
How long should I keep dual writes after flipping reads?
At least one full deploy cycle plus your longest data pipeline SLA (often 24–72 hours). Keep until you’ve validated metrics, backfills are done, and downstream consumers aren’t broken.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about your next migration, or download the zero-downtime migration runbook.
