The Migration That Didn’t Wake PagerDuty: A Real Zero‑Downtime Schema Strategy
I’ve watched teams brick prod with a “simple” ALTER TABLE. Here’s the expand/contract playbook—with metrics, commands, and guardrails—that actually ships without waking the on-call.
“Stop shipping DDL like it’s 2008. Schema changes deserve the same rollout discipline as your app.”
The outage you can avoid
I've watched more outages start with a confident ALTER TABLE than with bad deploys. At a unicorn fintech in 2019, a NOT NULL with a default on a 200M-row Postgres table rewrote the whole table, pinned IO, and sent p99 from 35ms to “nobody knows.” Replication fell behind by hours. It took a weekend to unstick. That team now runs zero-downtime schema changes as a staged rollout, just like code.
This guide is the playbook we use at GitPlumbers when a client says “We need to evolve the schema but prod can’t blink.” It’s opinionated, it’s boring, and it works.
Principles that keep you off the pager
Expand/contract over “big bang.” You add safe structures first (expand), migrate traffic/data, then remove old ones (contract).
Forward/backward compatibility across at least two app versions. Old code must run on expanded schema; new code must tolerate old data.
Gate with metrics, not vibes: error rate, p95/p99, queue depth, replication lag, and lock wait time.
Online changes only: `CONCURRENTLY` in Postgres; `gh-ost` or `pt-online-schema-change` in MySQL; `pg_repack` for rewrites.
Roll back fast: feature flags for reads/writes, reversible steps, and a “stop-the-backfill” kill switch.
Step-by-step: the expand/contract playbook
- Classify the change
Expand-safe: add nullable column, new table, new index (online).
Risky: adding `NOT NULL` with a default, renames, type changes, widening PKs, foreign keys on hot tables.
Contract: dropping old columns/indexes/constraints.
Rule: If it rewrites a large table or blocks writes, assume it’s risky and plan online tooling + canary.
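That rule can be encoded as a cheap CI lint over migration files. This is a hedged sketch: the patterns are regex heuristics, not a SQL parser, and the names are ours, not from any library.

```typescript
// Sketch of a CI lint that flags DDL needing online tooling + a canary plan.
// These patterns are heuristics for Postgres-flavored SQL — tune to your dialect.
const RISKY_DDL: Array<{ reason: string; pattern: RegExp }> = [
  { reason: "NOT NULL with DEFAULT may rewrite the table", pattern: /ADD\s+COLUMN.*NOT\s+NULL.*DEFAULT/is },
  { reason: "rename breaks old app versions", pattern: /RENAME\s+(COLUMN|TO)/i },
  { reason: "type change usually rewrites", pattern: /ALTER\s+COLUMN.*TYPE/is },
  { reason: "index without CONCURRENTLY blocks writes (Postgres)", pattern: /CREATE\s+INDEX\s+(?!CONCURRENTLY)/i },
  { reason: "contract step: drops need a deprecation window", pattern: /DROP\s+(COLUMN|TABLE|INDEX)/i },
];

// Returns the list of reasons a migration should be treated as risky (empty = expand-safe).
function classifyDdl(sql: string): string[] {
  return RISKY_DDL.filter(({ pattern }) => pattern.test(sql)).map(({ reason }) => reason);
}
```

Wire it into CI so a risky migration fails review unless it is tagged with an online-tooling runbook.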
- Instrument your gates
SLOs: HTTP 5xx < 0.1%, p95 < SLO + 20%, replication lag < 2s, lock wait < 500ms.
Postgres: `pg_stat_activity`, `pg_locks`, `pg_stat_replication`, `pg_stat_progress_create_index`.
MySQL: `performance_schema.events_statements`, `information_schema.innodb_trx`, replica `Seconds_Behind_Master`.
Export to Prometheus; alert in Slack. Example PromQL:
rate(http_server_errors_total{job="api"}[5m]) > 0.001
- Expand the schema safely
- Postgres example: add column and index without blocking writes:
-- Avoid table rewrite: no default; nullable
ALTER TABLE customer ADD COLUMN region_id bigint;
-- Build index without locking writes
CREATE INDEX CONCURRENTLY idx_customer_region_id ON customer(region_id);
- MySQL example with gh-ost:
gh-ost \
--host=primary.db.internal \
--database=app \
--table=customer \
--alter="ADD COLUMN region_id BIGINT NULL, ADD INDEX idx_customer_region_id(region_id)" \
--allow-on-master \
--cut-over=default \
--max-load=Threads_running=50 \
--critical-load=Threads_running=200 \
--exact-rowcount \
--chunk-size=1000 \
--execute
- Gate to proceed: p95 write latency ≤ SLO + 20%, error rate steady, replication lag < 2s for 15 minutes.
- Backfill without melting disks
- Throttle and batch; avoid long transactions. Example Postgres backfill in chunks:
-- Create a job table to track progress
CREATE TABLE IF NOT EXISTS _backfill_customer_region_progress (id int PRIMARY KEY DEFAULT 1, last_id bigint);
INSERT INTO _backfill_customer_region_progress(id, last_id) VALUES (1, 0) ON CONFLICT DO NOTHING;
-- Note: a DO block runs in a single transaction. For true chunked commits, run this
-- loop as a PROCEDURE with COMMIT per batch (Postgres 11+) or drive it from the app.
DO $$
DECLARE
  cur bigint;
  batch_max bigint;
BEGIN
  SELECT last_id INTO cur FROM _backfill_customer_region_progress FOR UPDATE;
  LOOP
    -- Postgres UPDATE has no ORDER BY/LIMIT: bound each batch by PK range instead
    SELECT max(id) INTO batch_max FROM (
      SELECT id FROM customer
      WHERE id > cur AND region_id IS NULL
      ORDER BY id LIMIT 5000
    ) batch;
    EXIT WHEN batch_max IS NULL; -- no more rows
    UPDATE customer c
    SET region_id = r.id
    FROM region r
    WHERE r.code = c.region_code
      AND c.id > cur AND c.id <= batch_max
      AND c.region_id IS NULL;
    cur := batch_max;
    UPDATE _backfill_customer_region_progress SET last_id = cur; -- resumable
    PERFORM pg_sleep(0.05); -- throttle
  END LOOP;
END $$;
Or run an application-driven backfill with id-range workers and SELECT ... FOR UPDATE SKIP LOCKED.
Gate to proceed: row count parity verified, shadow reads match (see next step), no sustained IO saturation (>80%).
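The application-driven variant can be sketched as a chunked worker with a kill switch. The `Db` interface below is a hypothetical stand-in for your data access layer (its batch query would use `SELECT ... FOR UPDATE SKIP LOCKED` under the hood); none of these names come from a specific library.

```typescript
// Sketch of an app-driven, chunked backfill worker. Db is an illustrative
// interface for your data layer; killSwitch is your feature-flag check.
type Db = {
  // e.g. SELECT id FROM customer WHERE id > $1 AND region_id IS NULL
  //      ORDER BY id LIMIT $2 FOR UPDATE SKIP LOCKED
  claimBatch(afterId: number, size: number): Promise<number[]>;
  backfillRows(ids: number[]): Promise<void>; // one short transaction per chunk
  saveProgress(lastId: number): Promise<void>; // resumable after restart
};

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function runBackfill(
  db: Db,
  opts: { startAfter: number; batchSize: number; throttleMs: number; killSwitch: () => boolean }
): Promise<number> {
  let cursor = opts.startAfter;
  let total = 0;
  while (!opts.killSwitch()) {
    const ids = await db.claimBatch(cursor, opts.batchSize);
    if (ids.length === 0) break;      // caught up
    await db.backfillRows(ids);
    cursor = ids[ids.length - 1];     // advance by primary-key range, never ORDER BY random()
    total += ids.length;
    await db.saveProgress(cursor);
    await sleep(opts.throttleMs);     // throttle to protect IO and replicas
  }
  return total;
}
```

The kill switch makes “stop the backfill” a flag flip, not a deploy.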
- Dual-write and shadow-read
- Put dual-writes behind a feature flag. Example in a TypeScript service:
async function createCustomer(input: CreateCustomer) {
const tx = await db.transaction();
try {
const customerId = await tx.customer.insert(input);
if (flags.dualWriteRegionId) {
await tx.customer.update({ id: customerId }, { region_id: await resolveRegion(input.region_code) });
}
await tx.commit();
return customerId;
} catch (e) {
await tx.rollback();
throw e;
}
}
async function getCustomer(id: number) {
const row = await db.customer.findById(id);
return flags.readFromNew ?
{ ...row, region: await regionById(row.region_id) } :
{ ...row, region: await regionByCode(row.region_code) };
}
- Shadow reads: compare old vs new results in the background and log diffs:
SELECT COUNT(*) AS mismatches
FROM customer c
JOIN region r ON r.code = c.region_code
LEFT JOIN region r2 ON r2.id = c.region_id
WHERE r.id IS DISTINCT FROM r2.id;
- Gate: dual-write error rate = 0, shadow mismatch = 0 for N hours (we use 4–24h depending on risk).
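Shadow reads can also run inside the application: serve the trusted old path, compare the new path off the hot path, and count mismatches. A minimal sketch under those assumptions (the names are illustrative, not a specific library API):

```typescript
// Sketch of a background shadow read: return the old (trusted) result, compare
// the new read path asynchronously, and report any mismatch as a metric.
function shadowCompare<T>(
  oldRead: () => Promise<T>,
  newRead: () => Promise<T>,
  onMismatch: (oldV: T, newV: T) => void
): Promise<T> {
  const oldP = oldRead();
  // Fire-and-forget: the shadow path must never add latency or user-facing errors.
  Promise.all([oldP, newRead()])
    .then(([a, b]) => {
      if (JSON.stringify(a) !== JSON.stringify(b)) onMismatch(a, b);
    })
    .catch(() => { /* record a metric; never surface shadow failures to users */ });
  return oldP; // callers always get the old result during the soak
}
```

Wire `onMismatch` to a counter; the cutover gate is that counter staying at zero for the soak window.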
- Cut over reads, then constraints
- Flip the read-path flag to use region_id. Watch p95/p99, cache hit rate, and DB CPU. If stable for a soak period, enforce constraints:
-- Postgres: backfill a default first if you truly need one. Only after the backfill
-- completes: SET NOT NULL takes an exclusive lock and scans the table (on v12+,
-- pre-validating a CHECK (region_id IS NOT NULL) NOT VALID constraint lets it skip the scan)
ALTER TABLE customer ALTER COLUMN region_id SET NOT NULL;
-- Optionally add the FK as NOT VALID so the ADD itself skips the full-table scan
ALTER TABLE customer
ADD CONSTRAINT fk_customer_region
FOREIGN KEY (region_id) REFERENCES region(id) NOT VALID;
-- Validate in the background
ALTER TABLE customer VALIDATE CONSTRAINT fk_customer_region;
- Gate: constraint validation completes; lock waits < 500ms; replicas remain < 2s lag.
- Contract (later)
- Remove old column/index after a deprecation window (we do 1–2 sprints).
ALTER TABLE customer DROP COLUMN region_code;
- Keep a rollback note: you can re-add region_code quickly if you haven’t rewritten the world.
- Rollback plan (write it first)
Back out flags to prior read path instantly.
Stop backfill job; leave expanded schema in place (harmless).
If DDL goes wrong mid-flight, online tools have abort mechanisms (e.g., gh-ost’s --panic-flag-file) to stop cleanly. Avoid schema downgrades that rewrite large tables in prod.
Tooling that won’t betray you at 2 a.m.
Schema runners: Flyway (simple, idempotent-ish), Liquibase (changelogs, rollbacks), Alembic (SQLAlchemy), Prisma Migrate (dev-friendly; hand-edit for prod).
Online DDL: gh-ost and pt-online-schema-change (MySQL); pg_repack and `CREATE INDEX CONCURRENTLY` (Postgres).
Verification: SQL shadow queries, `pg_dump --schema-only` diffs, `mysqldump --no-data`, and application-level parity checks.
Pipelines: GitOps with ArgoCD/Flux, or CI with guarded jobs. Example GitHub Actions gate:
name: db-migration
on: [workflow_dispatch]
jobs:
  expand:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: flyway/flyway-action@v2
        with:
          url: ${{ secrets.DATABASE_URL }}
          locations: filesystem:./migrations/expand
  verify:
    needs: expand
    runs-on: ubuntu-latest
    steps:
      - name: Check SLO gates
        run: ./scripts/check_prometheus_gates.sh
  contract:
    needs: verify
    if: ${{ success() }}
    runs-on: ubuntu-latest
    steps:
      - uses: flyway/flyway-action@v2
        with:
          url: ${{ secrets.DATABASE_URL }}
          locations: filesystem:./migrations/contract
Example: renaming a hot column without downtime
You can’t safely ALTER TABLE ... RENAME COLUMN on a hot, replicated table in many shops. Do this instead:
- Expand
ALTER TABLE orders ADD COLUMN status_v2 smallint; -- nullable, no default
CREATE INDEX CONCURRENTLY idx_orders_status_v2 ON orders(status_v2);
- Backfill
UPDATE orders o
SET status_v2 = CASE o.status
  WHEN 'PENDING' THEN 0 WHEN 'PAID' THEN 1 WHEN 'FAILED' THEN 2 ELSE NULL END
WHERE o.id IN (
  -- Postgres UPDATE has no LIMIT: bound the batch with a subquery; loop this with throttle
  SELECT id FROM orders
  WHERE status_v2 IS NULL
    AND updated_at < now() - interval '1 minute'
  LIMIT 5000
);
- Dual-write (app)
if (flags.dualWriteOrderStatus) {
await db.orders.update({ id }, { status_v2: encode(status) });
}
- Cutover reads
- SELECT status FROM orders WHERE id = $1;
+ SELECT status_v2 FROM orders WHERE id = $1;
- Enforce and contract
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
-- Optional: drop old enum/text
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status; -- if you really need the old name
Guardrails: watch p95 on write endpoints, pg_locks for any ACCESS EXCLUSIVE waits, and row mismatch counts between old/new encodings.
Gates, dashboards, and what to watch like a hawk
Latency: p95 and p99 for read/write endpoints touching the table. Expect small bumps during online operations; roll back if >20–30% sustained for 10m.
Errors: 5xx rate and DB timeouts (e.g., `statement_timeout`, `innodb_lock_wait_timeout`).
Replication: `pg_stat_replication` lag, `Seconds_Behind_Master`. Don’t cut over if >2s sustained.
Locks: Postgres `pg_locks` and `pg_blocking_pids()`. Anything requesting ACCESS EXCLUSIVE on a hot table is a red flag.
Storage/IO: queue depth, write IOPS, `checkpoint_write_time`/`checkpoint_sync_time` in Postgres. Throttle backfills if IO >80%.
Backfill progress: rows completed/min, estimated time remaining. Stop if error parity diverges.
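Gates like these can be enforced by a script such as the `check_prometheus_gates.sh` referenced in the pipeline above. Here is a minimal TypeScript sketch; the metric names, queries, and thresholds are illustrative assumptions, and the Prometheus evaluation is injected so the logic stays testable:

```typescript
// Sketch of an SLO gate check. `evaluate` returns the worst current sample for
// a query (in CI it would hit Prometheus's /api/v1/query endpoint and take the
// max over the instant vector). Gate names and thresholds are assumptions.
type Gate = { name: string; query: string; max: number };

const GATES: Gate[] = [
  { name: "error_rate", query: 'rate(http_server_errors_total{job="api"}[5m])', max: 0.001 },
  { name: "replication_lag_s", query: "pg_replication_lag_seconds", max: 2 },
  { name: "lock_wait_ms", query: "pg_lock_wait_milliseconds", max: 500 },
];

async function checkGates(
  evaluate: (query: string) => Promise<number>,
  gates: Gate[] = GATES
): Promise<{ ok: boolean; failed: string[] }> {
  const failed: string[] = [];
  for (const g of gates) {
    const value = await evaluate(g.query);
    if (value > g.max) failed.push(`${g.name}: ${value} > ${g.max}`);
  }
  return { ok: failed.length === 0, failed }; // fail the CI job when !ok
}
```

A failing gate should fail the pipeline job so the contract step never runs.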
Simple Postgres locks dashboard query:
SELECT bl.pid AS blocked_pid, a.usename, a.query, now() - a.query_start AS age,
ka.pid AS blocking_pid, ka.query AS blocking_query
FROM pg_locks bl
JOIN pg_stat_activity a ON a.pid = bl.pid
JOIN pg_locks kl
  ON kl.locktype = bl.locktype
 AND kl.database IS NOT DISTINCT FROM bl.database
 AND kl.relation IS NOT DISTINCT FROM bl.relation
 AND kl.granted
JOIN pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;
Anti-patterns I’ve seen nuke prod
`NOT NULL` + `DEFAULT` on a big Postgres table pre-11: full table rewrite. Prefer nullable + backfill, then set `NOT NULL`.
Creating an index without `CONCURRENTLY` on Postgres: blocks writes. Always use `CONCURRENTLY` unless you’re offline.
Long transactions for backfill: vacuum bloat, replication lag. Batch and commit. Use `SKIP LOCKED`.
Foreign keys on hot paths in MySQL: can lock long enough to page you. Consider app-level integrity + offline validation, or gh-ost with careful cutover windows.
Renames in-place: break old app versions. Use add-new + backfill + swap.
Backfilling with `ORDER BY random()`: you’ll DoS yourself. Use primary-key ranges.
Forgetting the read replicas: online tools may not be replica-safe by default; test and monitor lag.
Deploying app and DDL in one shot: separate them to keep the blast radius small.
Bottom line
Zero-downtime schema changes are a deployment strategy, not a lucky migration script. Treat them like you treat production rollouts: feature flags, canaries, gates, and a rollback that’s faster than your coffee cools. We’ve run this playbook at fintechs, marketplaces, and SaaS shops with billions of rows and strict SLOs. Boring wins.
Key takeaways
- Zero-downtime is a process, not a single migration script—treat schema changes like multi-stage rollouts.
- Use expand/contract: add safe structures first, dual-write, backfill, cut over, then retire old paths.
- Measure and gate each stage with latency/error SLOs, replication lag, and lock/IO telemetry.
- Use online-change tools (gh-ost/pt-osc/pg_repack) and safe DDL (e.g., Postgres CONCURRENTLY).
- Plan rollback as seriously as rollout—feature flags and reversible steps are non-negotiable.
Implementation checklist
- Classify the change (expand vs contract vs unsafe) and write the rollback path first.
- Pin tools/versions (e.g., Flyway, gh-ost, pg_repack) and dry-run in prod-like data volumes.
- Instrument gates: error rate, p95 latency, replication lag, lock wait time, disk/CPU headroom.
- Run expand DDL safely (CONCURRENTLY/online) and introduce new columns/tables nullable.
- Backfill incrementally with throttling; verify with shadow reads and row counts.
- Enable dual-writes behind a feature flag; monitor write error parity and lag.
- Cut over reads, enforce constraints, and remove old paths only after a soak period.
- Document deprecation and schedule the contract drop with another gate/rollback plan.
Questions we hear from teams
- Do I need gh-ost/pt-online-schema-change if I’m on Postgres?
- Not for most cases. Postgres has `CREATE INDEX CONCURRENTLY`, `ALTER TABLE ... VALIDATE CONSTRAINT`, and can add columns without rewrite if you avoid defaults. Use `pg_repack` for heavy rewrites (table bloat, clustering) without long locks.
- How do I keep replicas from falling over during backfill?
- Throttle batch size and sleep between chunks; keep transactions small. Monitor replication lag and pause the job if you exceed your gate (e.g., 2s). Consider running backfill on the primary only and let replicas catch up before cutover.
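That pause-on-lag policy is simple to encode. A sketch, where `lagSeconds` is a hypothetical callback that reads `pg_stat_replication` (or MySQL’s `Seconds_Behind_Master`) and the thresholds are illustrative:

```typescript
// Sketch of a lag-aware backfill pause: hold the next chunk until replica lag
// is back under the gate, or give up after a budget so a human gets paged.
const pause = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function waitForReplicas(
  lagSeconds: () => Promise<number>,
  opts = { maxLagS: 2, pollMs: 1000, maxPolls: 300 }
): Promise<boolean> {
  for (let i = 0; i < opts.maxPolls; i++) {
    if ((await lagSeconds()) < opts.maxLagS) return true; // safe to resume the backfill
    await pause(opts.pollMs); // replicas behind: hold the next chunk
  }
  return false; // persistent lag: stop the job rather than push replicas further
}
```

Call it between chunks; a `false` return is your cue to stop the backfill and investigate.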
- Can I collapse steps if I have a maintenance window?
- Yes, but treat that as a separate runbook. If your business approves a window, you can do blocking DDL. Most teams can’t, so the expand/contract approach is safer and repeatable.
- What about ORMs?
- Use them for scaffolding, not for prod DDL. Generate migrations, then hand-edit for online semantics. For example, convert generated Postgres index creation to `CREATE INDEX CONCURRENTLY` and remove `NOT NULL DEFAULT` on large tables.
- How long should the soak period be before contracting?
- Depends on risk. We usually keep old paths for 1–2 sprints. If data volumes are huge or the change is core to revenue, leave it longer and schedule the contract step with the same gates.
