The Migration That Didn’t Wake PagerDuty: A Real Zero‑Downtime Schema Strategy
I’ve watched teams brick prod with a “simple” ALTER TABLE. Here’s the expand/contract playbook—with metrics, commands, and guardrails—that actually ships without waking the on-call.
“Stop shipping DDL like it’s 2008. Schema changes deserve the same rollout discipline as your app.”
The outage you can avoid
I've watched more outages start with a confident ALTER TABLE than with bad deploys. At a unicorn fintech in 2019, a NOT NULL with a default on a 200M-row Postgres table rewrote the whole table, pinned IO, and sent p99 from 35ms to “nobody knows.” Replication fell behind by hours. It took a weekend to unstick. That team now runs zero-downtime schema changes as a staged rollout, just like code.
This guide is the playbook we use at GitPlumbers when a client says “We need to evolve the schema but prod can’t blink.” It’s opinionated, it’s boring, and it works.
Principles that keep you off the pager
Expand/contract over “big bang.” You add safe structures first (expand), migrate traffic/data, then remove old ones (contract).
Forward/backward compatibility across at least two app versions. Old code must run on expanded schema; new code must tolerate old data.
Gate with metrics, not vibes: error rate, p95/p99, queue depth, replication lag, and lock wait time.
Online changes only: `CONCURRENTLY` in Postgres; `gh-ost` or `pt-online-schema-change` in MySQL; `pg_repack` for rewrites.
Roll back fast: feature flags for reads/writes, reversible steps, and a “stop-the-backfill” kill switch.
Step-by-step: the expand/contract playbook
- Classify the change
Expand-safe: add nullable column, new table, new index (online).
Risky: adding `NOT NULL` with a default, renames, type changes, widening PKs, foreign keys on hot tables.
Contract: dropping old columns/indexes/constraints.
Rule: If it rewrites a large table or blocks writes, assume it’s risky and plan online tooling + canary.
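That rule can be encoded as a cheap CI lint over migration files. This is a hedged sketch: the patterns are regex heuristics, not a SQL parser, and the names are ours, not from any library.

```typescript
// Sketch of a CI lint that flags DDL needing online tooling + a canary plan.
// These patterns are heuristics for Postgres-flavored SQL — tune to your dialect.
const RISKY_DDL: Array<{ reason: string; pattern: RegExp }> = [
  { reason: "NOT NULL with DEFAULT may rewrite the table", pattern: /ADD\s+COLUMN.*NOT\s+NULL.*DEFAULT/is },
  { reason: "rename breaks old app versions", pattern: /RENAME\s+(COLUMN|TO)/i },
  { reason: "type change usually rewrites", pattern: /ALTER\s+COLUMN.*TYPE/is },
  { reason: "index without CONCURRENTLY blocks writes (Postgres)", pattern: /CREATE\s+INDEX\s+(?!CONCURRENTLY)/i },
  { reason: "contract step: drops need a deprecation window", pattern: /DROP\s+(COLUMN|TABLE|INDEX)/i },
];

// Returns the list of reasons a migration should be treated as risky (empty = expand-safe).
function classifyDdl(sql: string): string[] {
  return RISKY_DDL.filter(({ pattern }) => pattern.test(sql)).map(({ reason }) => reason);
}
```

Wire it into CI so a risky migration fails review unless it is tagged with an online-tooling runbook.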
- Instrument your gates
SLOs: HTTP 5xx < 0.1%, p95 < SLO + 20%, replication lag < 2s, lock wait < 500ms.
Postgres: `pg_stat_activity`, `pg_locks`, `pg_stat_replication`, `pg_stat_progress_create_index`.
MySQL: `performance_schema.events_statements`, `information_schema.innodb_trx`, replica `Seconds_Behind_Master`.
Export to Prometheus; alert in Slack. Example PromQL:
rate(http_server_errors_total{job="api"}[5m]) > 0.001
- Expand the schema safely
- Postgres example: add column and index without blocking writes:
-- Avoid table rewrite: no default; nullable
ALTER TABLE customer ADD COLUMN region_id bigint;
-- Build index without locking writes
CREATE INDEX CONCURRENTLY idx_customer_region_id ON customer(region_id);
- MySQL example with gh-ost:
gh-ost \
--host=primary.db.internal \
--database=app \
--table=customer \
--alter="ADD COLUMN region_id BIGINT NULL, ADD INDEX idx_customer_region_id(region_id)" \
--allow-on-master \
--cut-over=default \
--max-load=Threads_running=50 \
--critical-load=Threads_running=200 \
--exact-rowcount \
--chunk-size=1000 \
--execute
- Gate to proceed: p95 write latency ≤ SLO + 20%, error rate steady, replication lag < 2s for 15 minutes.
- Backfill without melting disks
- Throttle and batch; avoid long transactions. Example Postgres backfill in chunks:
-- Create a job table to track progress
CREATE TABLE IF NOT EXISTS _backfill_customer_region_progress (id int PRIMARY KEY DEFAULT 1, last_id bigint);
INSERT INTO _backfill_customer_region_progress(id, last_id) VALUES (1, 0) ON CONFLICT DO NOTHING;
-- Note: a DO block runs in a single transaction. For true chunked commits, run this
-- loop as a PROCEDURE with COMMIT per batch (Postgres 11+) or drive it from the app.
DO $$
DECLARE
  cur bigint;
  batch_max bigint;
BEGIN
  SELECT last_id INTO cur FROM _backfill_customer_region_progress FOR UPDATE;
  LOOP
    -- Postgres UPDATE has no ORDER BY/LIMIT: bound each batch by PK range instead
    SELECT max(id) INTO batch_max FROM (
      SELECT id FROM customer
      WHERE id > cur AND region_id IS NULL
      ORDER BY id LIMIT 5000
    ) batch;
    EXIT WHEN batch_max IS NULL; -- no more rows
    UPDATE customer c
    SET region_id = r.id
    FROM region r
    WHERE r.code = c.region_code
      AND c.id > cur AND c.id <= batch_max
      AND c.region_id IS NULL;
    cur := batch_max;
    UPDATE _backfill_customer_region_progress SET last_id = cur; -- resumable
    PERFORM pg_sleep(0.05); -- throttle
  END LOOP;
END $$;
Or run an application-driven backfill with id-range workers and SELECT ... FOR UPDATE SKIP LOCKED.
Gate to proceed: row count parity verified, shadow reads match (see next step), no sustained IO saturation (>80%).
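The application-driven variant can be sketched as a chunked worker with a kill switch. The `Db` interface below is a hypothetical stand-in for your data access layer (its batch query would use `SELECT ... FOR UPDATE SKIP LOCKED` under the hood); none of these names come from a specific library.

```typescript
// Sketch of an app-driven, chunked backfill worker. Db is an illustrative
// interface for your data layer; killSwitch is your feature-flag check.
type Db = {
  // e.g. SELECT id FROM customer WHERE id > $1 AND region_id IS NULL
  //      ORDER BY id LIMIT $2 FOR UPDATE SKIP LOCKED
  claimBatch(afterId: number, size: number): Promise<number[]>;
  backfillRows(ids: number[]): Promise<void>; // one short transaction per chunk
  saveProgress(lastId: number): Promise<void>; // resumable after restart
};

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function runBackfill(
  db: Db,
  opts: { startAfter: number; batchSize: number; throttleMs: number; killSwitch: () => boolean }
): Promise<number> {
  let cursor = opts.startAfter;
  let total = 0;
  while (!opts.killSwitch()) {
    const ids = await db.claimBatch(cursor, opts.batchSize);
    if (ids.length === 0) break;      // caught up
    await db.backfillRows(ids);
    cursor = ids[ids.length - 1];     // advance by primary-key range, never ORDER BY random()
    total += ids.length;
    await db.saveProgress(cursor);
    await sleep(opts.throttleMs);     // throttle to protect IO and replicas
  }
  return total;
}
```

The kill switch makes “stop the backfill” a flag flip, not a deploy.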
- Dual-write and shadow-read
- Put dual-writes behind a feature flag. Example in a TypeScript service:
async function createCustomer(input: CreateCustomer) {
const tx = await db.transaction();
try {
const customerId = await tx.customer.insert(input);
if (flags.dualWriteRegionId) {
await tx.customer.update({ id: customerId }, { region_id: await resolveRegion(input.region_code) });
}
await tx.commit();
return customerId;
} catch (e) {
await tx.rollback();
throw e;
}
}
async function getCustomer(id: number) {
const row = await db.customer.findById(id);
return flags.readFromNew ?
{ ...row, region: await regionById(row.region_id) } :
{ ...row, region: await regionByCode(row.region_code) };
}
- Shadow reads: compare old vs new results in the background and log diffs:
SELECT COUNT(*) AS mismatches
FROM customer c
JOIN region r ON r.code = c.region_code
LEFT JOIN region r2 ON r2.id = c.region_id
WHERE r.id IS DISTINCT FROM r2.id;
- Gate: dual-write error rate = 0, shadow mismatch = 0 for N hours (we use 4–24h depending on risk).
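Shadow reads can also run inside the application: serve the trusted old path, compare the new path off the hot path, and count mismatches. A minimal sketch under those assumptions (the names are illustrative, not a specific library API):

```typescript
// Sketch of a background shadow read: return the old (trusted) result, compare
// the new read path asynchronously, and report any mismatch as a metric.
function shadowCompare<T>(
  oldRead: () => Promise<T>,
  newRead: () => Promise<T>,
  onMismatch: (oldV: T, newV: T) => void
): Promise<T> {
  const oldP = oldRead();
  // Fire-and-forget: the shadow path must never add latency or user-facing errors.
  Promise.all([oldP, newRead()])
    .then(([a, b]) => {
      if (JSON.stringify(a) !== JSON.stringify(b)) onMismatch(a, b);
    })
    .catch(() => { /* record a metric; never surface shadow failures to users */ });
  return oldP; // callers always get the old result during the soak
}
```

Wire `onMismatch` to a counter; the cutover gate is that counter staying at zero for the soak window.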
- Cut over reads, then constraints
- Flip the read-path flag to use region_id. Watch p95/p99, cache hit rate, and DB CPU. If stable for a soak period, enforce constraints:
-- Postgres: backfill a default first if you truly need one. Only after the backfill
-- completes: SET NOT NULL takes an exclusive lock and scans the table (on v12+,
-- pre-validating a CHECK (region_id IS NOT NULL) NOT VALID constraint lets it skip the scan)
ALTER TABLE customer ALTER COLUMN region_id SET NOT NULL;
-- Optionally add the FK as NOT VALID so the ADD itself skips the full-table scan
ALTER TABLE customer
ADD CONSTRAINT fk_customer_region
FOREIGN KEY (region_id) REFERENCES region(id) NOT VALID;
-- Validate in the background
ALTER TABLE customer VALIDATE CONSTRAINT fk_customer_region;
- Gate: constraint validation completes; lock waits < 500ms; replicas remain < 2s lag.
- Contract (later)
- Remove old column/index after a deprecation window (we do 1–2 sprints).
ALTER TABLE customer DROP COLUMN region_code;
- Keep a rollback note: you can re-add region_code quickly if you haven’t rewritten the world.
- Rollback plan (write it first)
Back out flags to prior read path instantly.
Stop backfill job; leave expanded schema in place (harmless).
If DDL goes wrong mid-flight, online tools have abort mechanisms (e.g., gh-ost’s --panic-flag-file) to stop cleanly. Avoid schema downgrades that rewrite large tables in prod.
Tooling that won’t betray you at 2 a.m.
Schema runners: Flyway (simple, idempotent-ish), Liquibase (changelogs, rollbacks), Alembic (SQLAlchemy), Prisma Migrate (dev-friendly; hand-edit for prod).
Online DDL: gh-ost and pt-online-schema-change (MySQL); pg_repack and `CREATE INDEX CONCURRENTLY` (Postgres).
Verification: SQL shadow queries, `pg_dump --schema-only` diffs, `mysqldump --no-data`, and application-level parity checks.
Pipelines: GitOps with ArgoCD/Flux, or CI with guarded jobs. Example GitHub Actions gate:
name: db-migration
on: [workflow_dispatch]
jobs:
  expand:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: flyway/flyway-action@v2
        with:
          url: ${{ secrets.DATABASE_URL }}
          locations: filesystem:./migrations/expand
  verify:
    needs: expand
    runs-on: ubuntu-latest
    steps:
      - name: Check SLO gates
        run: ./scripts/check_prometheus_gates.sh
  contract:
    needs: verify
    if: ${{ success() }}
    runs-on: ubuntu-latest
    steps:
      - uses: flyway/flyway-action@v2
        with:
          url: ${{ secrets.DATABASE_URL }}
          locations: filesystem:./migrations/contract
Example: renaming a hot column without downtime
You can’t safely ALTER TABLE ... RENAME COLUMN on a hot, replicated table in many shops. Do this instead:
- Expand
ALTER TABLE orders ADD COLUMN status_v2 smallint; -- nullable, no default
CREATE INDEX CONCURRENTLY idx_orders_status_v2 ON orders(status_v2);
- Backfill
UPDATE orders o
SET status_v2 = CASE o.status
  WHEN 'PENDING' THEN 0 WHEN 'PAID' THEN 1 WHEN 'FAILED' THEN 2 ELSE NULL END
WHERE o.id IN (
  -- Postgres UPDATE has no LIMIT: bound the batch with a subquery; loop this with throttle
  SELECT id FROM orders
  WHERE status_v2 IS NULL
    AND updated_at < now() - interval '1 minute'
  LIMIT 5000
);
- Dual-write (app)
if (flags.dualWriteOrderStatus) {
await db.orders.update({ id }, { status_v2: encode(status) });
}
- Cutover reads
- SELECT status FROM orders WHERE id = $1;
+ SELECT status_v2 FROM orders WHERE id = $1;
- Enforce and contract
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
-- Optional: drop old enum/text
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status; -- if you really need the old name
Guardrails: watch p95 on write endpoints, pg_locks for any ACCESS EXCLUSIVE waits, and row mismatch counts between old/new encodings.
Gates, dashboards, and what to watch like a hawk
Latency: p95 and p99 for read/write endpoints touching the table. Expect small bumps during online operations; roll back if >20–30% sustained for 10m.
Errors: 5xx rate and DB timeouts (e.g., `statement_timeout`, `innodb_lock_wait_timeout`).
Replication: `pg_stat_replication` lag, `Seconds_Behind_Master`. Don’t cut over if >2s sustained.
Locks: Postgres `pg_locks` and `pg_blocking_pids()`. Anything requesting ACCESS EXCLUSIVE on a hot table is a red flag.
Storage/IO: queue depth, write IOPS, `checkpoint_write_time`/`checkpoint_sync_time` in Postgres. Throttle backfills if IO >80%.
Backfill progress: rows completed/min, estimated time remaining. Stop if error parity diverges.
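Gates like these can be enforced by a script such as the `check_prometheus_gates.sh` referenced in the pipeline above. Here is a minimal TypeScript sketch; the metric names, queries, and thresholds are illustrative assumptions, and the Prometheus evaluation is injected so the logic stays testable:

```typescript
// Sketch of an SLO gate check. `evaluate` returns the worst current sample for
// a query (in CI it would hit Prometheus's /api/v1/query endpoint and take the
// max over the instant vector). Gate names and thresholds are assumptions.
type Gate = { name: string; query: string; max: number };

const GATES: Gate[] = [
  { name: "error_rate", query: 'rate(http_server_errors_total{job="api"}[5m])', max: 0.001 },
  { name: "replication_lag_s", query: "pg_replication_lag_seconds", max: 2 },
  { name: "lock_wait_ms", query: "pg_lock_wait_milliseconds", max: 500 },
];

async function checkGates(
  evaluate: (query: string) => Promise<number>,
  gates: Gate[] = GATES
): Promise<{ ok: boolean; failed: string[] }> {
  const failed: string[] = [];
  for (const g of gates) {
    const value = await evaluate(g.query);
    if (value > g.max) failed.push(`${g.name}: ${value} > ${g.max}`);
  }
  return { ok: failed.length === 0, failed }; // fail the CI job when !ok
}
```

A failing gate should fail the pipeline job so the contract step never runs.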
Simple Postgres locks dashboard query:
SELECT bl.pid AS blocked_pid, a.usename, a.query, now() - a.query_start AS age,
ka.pid AS blocking_pid, ka.query AS blocking_query
FROM pg_locks bl
JOIN pg_stat_activity a ON a.pid = bl.pid
JOIN pg_locks kl
  ON kl.locktype = bl.locktype
 AND kl.database IS NOT DISTINCT FROM bl.database
 AND kl.relation IS NOT DISTINCT FROM bl.relation
 AND kl.granted
JOIN pg_stat_activity ka ON ka.pid = kl.pid
WHERE NOT bl.granted;
Anti-patterns I’ve seen nuke prod
`NOT NULL` + `DEFAULT` on a big Postgres table pre-11: full table rewrite. Prefer nullable + backfill, then set `NOT NULL`.
Creating an index without `CONCURRENTLY` on Postgres: blocks writes. Always use `CONCURRENTLY` unless you’re offline.
Long transactions for backfill: vacuum bloat, replication lag. Batch and commit. Use `SKIP LOCKED`.
Foreign keys on hot paths in MySQL: can lock long enough to page you. Consider app-level integrity + offline validation, or gh-ost with careful cutover windows.
Renames in-place: break old app versions. Use add-new + backfill + swap.
Backfilling with `ORDER BY random()`: you’ll DoS yourself. Use primary-key ranges.
Forgetting the read replicas: online tools may not be replica-safe by default; test and monitor lag.
Deploying app and DDL in one shot: separate them to keep the blast radius small.
Bottom line
Zero-downtime schema changes are a deployment strategy, not a lucky migration script. Treat them like you treat production rollouts: feature flags, canaries, gates, and a rollback that’s faster than your coffee cools. We’ve run this playbook at fintechs, marketplaces, and SaaS shops with billions of rows and strict SLOs. Boring wins.
Key takeaways
- Zero-downtime is a process, not a single migration script—treat schema changes like multi-stage rollouts.
- Use expand/contract: add safe structures first, dual-write, backfill, cut over, then retire old paths.
- Measure and gate each stage with latency/error SLOs, replication lag, and lock/IO telemetry.
- Use online-change tools (gh-ost/pt-osc/pg_repack) and safe DDL (e.g., Postgres CONCURRENTLY).
- Plan rollback as seriously as rollout—feature flags and reversible steps are non-negotiable.
Implementation checklist
- Classify the change (expand vs contract vs unsafe) and write the rollback path first.
- Pin tools/versions (e.g., Flyway, gh-ost, pg_repack) and dry-run in prod-like data volumes.
- Instrument gates: error rate, p95 latency, replication lag, lock wait time, disk/CPU headroom.
- Run expand DDL safely (CONCURRENTLY/online) and introduce new columns/tables nullable.
- Backfill incrementally with throttling; verify with shadow reads and row counts.
- Enable dual-writes behind a feature flag; monitor write error parity and lag.
- Cut over reads, enforce constraints, and remove old paths only after a soak period.
- Document deprecation and schedule the contract drop with another gate/rollback plan.
Questions we hear from teams
- Do I need gh-ost/pt-online-schema-change if I’m on Postgres?
- Not for most cases. Postgres has `CREATE INDEX CONCURRENTLY`, `ALTER TABLE ... VALIDATE CONSTRAINT`, and can add columns without rewrite if you avoid defaults. Use `pg_repack` for heavy rewrites (table bloat, clustering) without long locks.
- How do I keep replicas from falling over during backfill?
- Throttle batch size and sleep between chunks; keep transactions small. Monitor replication lag and pause the job if you exceed your gate (e.g., 2s). Consider running backfill on the primary only and let replicas catch up before cutover.
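That pause-on-lag policy is simple to encode. A sketch, where `lagSeconds` is a hypothetical callback that reads `pg_stat_replication` (or MySQL’s `Seconds_Behind_Master`) and the thresholds are illustrative:

```typescript
// Sketch of a lag-aware backfill pause: hold the next chunk until replica lag
// is back under the gate, or give up after a budget so a human gets paged.
const pause = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function waitForReplicas(
  lagSeconds: () => Promise<number>,
  opts = { maxLagS: 2, pollMs: 1000, maxPolls: 300 }
): Promise<boolean> {
  for (let i = 0; i < opts.maxPolls; i++) {
    if ((await lagSeconds()) < opts.maxLagS) return true; // safe to resume the backfill
    await pause(opts.pollMs); // replicas behind: hold the next chunk
  }
  return false; // persistent lag: stop the job rather than push replicas further
}
```

Call it between chunks; a `false` return is your cue to stop the backfill and investigate.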
- Can I collapse steps if I have a maintenance window?
- Yes, but treat that as a separate runbook. If your business approves a window, you can do blocking DDL. Most teams can’t, so the expand/contract approach is safer and repeatable.
- What about ORMs?
- Use them for scaffolding, not for prod DDL. Generate migrations, then hand-edit for online semantics. For example, convert generated Postgres index creation to `CREATE INDEX CONCURRENTLY` and remove `NOT NULL DEFAULT` on large tables.
- How long should the soak period be before contracting?
- Depends on risk. We usually keep old paths for 1–2 sprints. If data volumes are huge or the change is core to revenue, leave it longer and schedule the contract step with the same gates.
