The “One-Liner” ALTER TABLE That Took Production Down: Zero-Downtime Schema Changes That Actually Work

A pragmatic migration playbook for senior engineers: expand/contract, backfills, online indexes, rollout checkpoints, and the metrics that tell you when you’re safe.

If your migration plan fits in one `ALTER TABLE`, it probably fits in one postmortem too.

The real enemy: locks, long transactions, and “harmless” defaults

I’ve seen more outages from “simple” schema tweaks than from bad application code. The classic is a confident ALTER TABLE ... ADD COLUMN ... DEFAULT ... NOT NULL on a hot Postgres table. In staging it’s instant; in prod the ACCESS EXCLUSIVE lock queues behind one long-running transaction, every other write queues behind the lock (and on Postgres versions before 11, the default forces a full table rewrite on top of that), your p95 latency goes vertical, and the incident commander starts asking why the API is returning 502s.

Zero-downtime migrations aren’t magic. They’re a pattern: make the schema change backward-compatible, deploy code that can handle both worlds, backfill safely, validate, then remove old paths. If you can’t explain your change in those phases, you’re probably one DDL statement away from a bad day.

Step 0 — Preflight: know your blast radius before you type SQL

Before you touch the database, collect the numbers that predict pain:

  • Table size and bloat
    • Postgres: pg_total_relation_size, pg_stat_all_tables
    • MySQL: information_schema.TABLES
  • Write rate (QPS) and peak windows (from pg_stat_statements, ProxySQL, RDS Performance Insights, etc.)
  • Long-running transactions (these turn “online” operations into lock disasters)
    • Postgres: pg_stat_activity + xact_start
  • Replication lag (especially if you backfill hard)
  • SLO guardrails: baseline p95/p99 latency, error rate, and queue depth

Concrete preflight queries (Postgres):

-- Table size
SELECT
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname = 'orders';

-- Long transactions (watch these during migrations)
SELECT pid, usename, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC
LIMIT 20;
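
If your migrations run from CI, the same checks can become a hard gate that refuses to start when the database already looks busy. A minimal sketch using node-postgres (the thresholds are illustrative, not recommendations):

// Sketch: a CI preflight gate that refuses to start DDL when the primary already looks
// unhealthy. Assumes node-postgres; thresholds are placeholders you should tune.
import { Client } from 'pg';

async function preflightGate() {
  const db = new Client(); // connection settings come from the usual PG* env vars
  await db.connect();

  // Oldest open transaction: long transactions turn "online" DDL into lock queues
  const { rows: [tx] } = await db.query(
    `SELECT COALESCE(EXTRACT(EPOCH FROM max(now() - xact_start)), 0) AS oldest_s
     FROM pg_stat_activity
     WHERE xact_start IS NOT NULL`
  );

  // Replication lag in bytes, as seen from the primary
  const { rows: [repl] } = await db.query(
    `SELECT COALESCE(max(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0) AS lag_bytes
     FROM pg_stat_replication`
  );

  await db.end();

  if (Number(tx.oldest_s) > 300) {
    throw new Error(`oldest transaction is ${tx.oldest_s}s old; not safe to queue DDL behind it`);
  }
  if (Number(repl.lag_bytes) > 50 * 1024 * 1024) {
    throw new Error(`replicas are ${repl.lag_bytes} bytes behind; DDL and backfills will make it worse`);
  }
}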

Checkpoint:

  • You can name the largest tables, the peak write hour, and the longest transaction in production.
  • You have a plan for lock avoidance: lock_timeout, online DDL tooling, or both.

The strategy that holds up under load: expand → backfill → validate → contract

The only strategy I’ve seen consistently survive high-throughput systems (Stripe-style event tables, Shopify-scale order tables, you name it) is expand/contract.

Expand

Add new schema elements in a way that doesn’t break old code:

  • Add new nullable columns (no default rewrite)
  • Add new tables alongside old
  • Add indexes using online/lock-minimized methods
  • Add constraints as NOT VALID where supported

Backfill

Populate new structures without melting prod:

  • Chunked updates (by primary key range or timestamp)
  • Rate limits (sleep, token bucket, or job queue concurrency)
  • Idempotent logic (safe to rerun)

Validate

Prove correctness before flipping behavior:

  • Validate constraints
  • Run row count / checksum comparisons
  • Shadow reads (read from both, compare, don’t serve)
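
Shadow reads in particular benefit from a concrete sketch: read both paths, compare in the background, and always serve the old value. Pseudo-TypeScript, with readOld, readNew, and metrics as placeholders for your own data access and metrics client:

// Sketch: shadow read. Serve the old path, compare against the new one quietly,
// and never let the new path affect the response. readOld/readNew/metrics are placeholders.
async function shadowRead(id, readOld, readNew, metrics) {
  const served = await readOld(id); // this is what the caller actually gets

  try {
    const candidate = await readNew(id);
    if (JSON.stringify(candidate) !== JSON.stringify(served)) {
      metrics.increment('shadow_read.mismatch'); // alert on a sustained nonzero rate
    }
  } catch (err) {
    metrics.increment('shadow_read.error'); // new-path failures must never reach users
  }

  return served;
}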

Contract

Remove old paths only after stability:

  • Switch reads/writes to new schema
  • Stop dual-write
  • Drop old columns/indexes/tables in a later deploy

Checkpoint:

  • Every step has a rollback that preserves data compatibility (old app version still works against the “expanded” schema).

Worked example: rename a column in Postgres with zero downtime

“Rename column” is deceptively nasty because old code expects old_name and new code expects new_name. The safe way is add → dual-write → backfill → flip reads → remove.

Assume we want users.full_name to become users.display_name.

1) Expand (DDL)

-- Keep locks short so you fail fast instead of blocking prod
SET lock_timeout = '2s';
SET statement_timeout = '30s';

ALTER TABLE users ADD COLUMN display_name text;

-- Optional: add a check constraint, but don't validate yet
ALTER TABLE users
  ADD CONSTRAINT users_display_name_nonempty
  CHECK (display_name IS NULL OR length(display_name) > 0)
  NOT VALID;

2) Deploy app code: dual-write + read fallback

Pseudo-TypeScript (same concept in Rails/Java/etc.):

// write path
await db.query(
  `UPDATE users
   SET full_name = $1,
       display_name = $1
   WHERE id = $2`,
  [name, userId]
);

// read path (fallback while backfill is running)
const { rows } = await db.query(
  `SELECT COALESCE(display_name, full_name) AS name
   FROM users WHERE id = $1`,
  [userId]
);
const name = rows[0]?.name;

Gate the behavior behind a feature flag (LaunchDarkly, Unleash, Flagsmith) so you can canary it.
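
A sketch of that gating, with flags standing in for whatever flag SDK you use (the flag name and context shape are illustrative):

// Sketch: the same dual-write, gated so it can be canaried per user and rolled back instantly.
// `flags` stands in for your LaunchDarkly/Unleash/Flagsmith client.
async function writeName(db, flags, userId, name) {
  const dualWrite = await flags.isEnabled('users-dual-write-display-name', { userId });

  if (dualWrite) {
    await db.query(
      `UPDATE users SET full_name = $1, display_name = $1 WHERE id = $2`,
      [name, userId]
    );
  } else {
    // Old write path, untouched: rolling the flag back restores exactly this behavior
    await db.query(`UPDATE users SET full_name = $1 WHERE id = $2`, [name, userId]);
  }
}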

Checkpoint:

  • New code runs fine with either column populated.
  • Old code runs fine because full_name still exists.

3) Backfill (chunked, observable)

# Example: run a single backfill worker with a conservative batch size and pacing
export BATCH_SIZE=1000
export SLEEP_MS=100
node backfill-display-name.js

And the backfill logic should look like this (important bits: chunking + idempotency):

// backfill-display-name.js (run as an ES module so top-level await works)
import { Pool } from 'pg';

const db = new Pool(); // connection settings come from the usual PG* env vars
const BATCH_SIZE = Number(process.env.BATCH_SIZE ?? 1000);
const SLEEP_MS = Number(process.env.SLEEP_MS ?? 100);
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

while (true) {
  // Next chunk; skip rows whose full_name is NULL so the loop can actually terminate
  const { rows } = await db.query(
    `SELECT id
     FROM users
     WHERE display_name IS NULL
       AND full_name IS NOT NULL
     ORDER BY id
     LIMIT $1`,
    [BATCH_SIZE]
  );

  if (rows.length === 0) break;

  // Idempotent: the IS NULL guard means re-running a chunk is a no-op
  await db.query(
    `UPDATE users u
     SET display_name = u.full_name
     WHERE u.id = ANY($1)
       AND u.display_name IS NULL`,
    [rows.map(r => r.id)]
  );

  // Rate-limit so the primary and its replicas can keep up
  await sleep(SLEEP_MS);
}

await db.end();

Metrics to watch during backfill:

  • p95 write latency on the primary
  • replication lag (seconds behind)
  • deadlocks / lock waits
  • Backfill throughput: rows/sec and ETA
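
Throughput and ETA are cheap to compute inside the worker itself rather than reverse-engineering them from dashboards. A sketch (names like remainingCount are illustrative):

// Sketch: rows/sec and ETA reporting from inside the backfill worker.
// `totalRemaining` would come from a one-off count of rows still missing the new value.
function progressLogger(totalRemaining) {
  const startedAt = Date.now();
  let done = 0;

  return (chunkSize) => {
    done += chunkSize;
    const elapsedS = (Date.now() - startedAt) / 1000;
    const rowsPerSec = done / Math.max(elapsedS, 1);
    const etaMin = (totalRemaining - done) / Math.max(rowsPerSec, 1) / 60;
    console.log(
      `backfill: ${done}/${totalRemaining} rows, ` +
      `${Math.round(rowsPerSec)} rows/s, ETA ~${etaMin.toFixed(1)} min`
    );
  };
}

// In the loop above: const logProgress = progressLogger(remainingCount);
// ...then after each UPDATE: logProgress(rows.length);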

Checkpoint:

  • Backfill can be stopped/restarted without creating corruption.
  • You can answer: “If we need to pause, what happens?” (Nothing bad should happen.)

4) Validate + flip reads

-- Validate the constraint once most rows are populated
ALTER TABLE users VALIDATE CONSTRAINT users_display_name_nonempty;

-- Optional validation query
SELECT COUNT(*) AS missing
FROM users
WHERE display_name IS NULL;

Once missing is ~0 (or you have a defined exception set), flip reads to display_name only.

5) Contract (later deploy)

After a stability window (I like at least one full business cycle—often 1–2 weeks):

ALTER TABLE users DROP COLUMN full_name;

Indexes and constraints: where “online” is a lie unless you’re careful

Postgres: use CONCURRENTLY and staged constraints

  • Index builds: CREATE INDEX CONCURRENTLY avoids blocking writes, but it’s slower and can fail—make your migration tooling handle retries (see the sketch after these examples).

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
ON orders(customer_id);

  • NOT NULL on big tables: don’t just ALTER COLUMN ... SET NOT NULL on day one. Use a staged approach:

-- 1) Add a check constraint without validation
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_id_nn
  CHECK (customer_id IS NOT NULL)
  NOT VALID;

-- 2) Backfill missing values if needed

-- 3) Validate (this scans, but avoids long exclusive locks)
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_id_nn;

-- 4) Optionally set NOT NULL once proven
ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL;
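
About those retries: a CREATE INDEX CONCURRENTLY that fails partway leaves an INVALID index behind, and it has to be dropped before the next attempt. A sketch of what “handle retries” means in practice, assuming node-postgres and the index name from the example above:

// Sketch: retry a concurrent index build, dropping the INVALID leftover between attempts.
import { Client } from 'pg';

async function createIndexWithRetry(db, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // Drop an INVALID leftover from a previous failed attempt, if any
    const { rows } = await db.query(
      `SELECT 1
       FROM pg_index i
       JOIN pg_class c ON c.oid = i.indexrelid
       WHERE c.relname = 'idx_orders_customer_id' AND NOT i.indisvalid`
    );
    if (rows.length > 0) {
      await db.query('DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id');
    }

    try {
      // CONCURRENTLY cannot run inside a transaction block, so use a plain autocommit session
      await db.query(
        `CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
         ON orders(customer_id)`
      );
      return;
    } catch (err) {
      if (attempt === attempts) throw err;
    }
  }
}

// Usage: const db = new Client(); await db.connect(); await createIndexWithRetry(db);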

Also: set lock_timeout in migration sessions so you fail fast instead of blocking traffic.

MySQL: prefer online DDL or external tooling

If you’re on MySQL 8.0 / Aurora MySQL, many operations can be INSTANT/INPLACE, but “many” is not “all”. For big tables, assume you’ll need:

  • gh-ost (GitHub’s online schema migration tool)
  • pt-online-schema-change (Percona)

Example with gh-ost:

gh-ost \
  --host=prod-mysql \
  --database=app \
  --table=orders \
  --alter="ADD COLUMN display_name VARCHAR(255) NULL" \
  --execute

Checkpoint:

  • You can state whether the change is blocking, metadata-only, or copy-based, and you’ve chosen the right tool accordingly.

Rollout checkpoints: treat migrations like a production release, not a DBA task

Here’s what actually works in orgs that don’t want 3 a.m. pages:

  1. Dry run on prod-like data
    • Restore a recent snapshot (or use a sanitized clone).
    • Run the migration and record duration + locks.
  2. Canary the application behavior
    • Enable dual-write for 1–5% of traffic via feature flags.
    • Watch error rate and write amplification.
  3. Backfill with circuit breakers
    • Auto-pause if p95 latency increases > X% or replication lag > Y seconds (see the sketch after this list).
  4. Flip reads
    • Shadow reads first (compare values), then serve from new column.
  5. Cleanup later
    • Dropping old columns is a separate change with its own review.
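
The auto-pause in step 3 doesn’t need to be elaborate: a check between chunks that compares live metrics against your baseline is enough. A sketch, with the metric sources left as placeholders for whatever you already run (Prometheus, RDS Performance Insights, etc.):

// Sketch: circuit breaker for the backfill loop. Between chunks, compare live signals
// to the baseline and wait until they recover. Thresholds are illustrative.
const MAX_REPLICATION_LAG_S = 10;
const MAX_P95_INCREASE = 0.2; // pause if p95 is more than 20% over baseline

// Placeholders: wire these to your metrics source
async function getReplicationLagSeconds() { return 0; }
async function getP95WriteLatencyMs() { return 0; }

async function waitUntilHealthy(baselineP95Ms) {
  while (true) {
    const lagS = await getReplicationLagSeconds();
    const p95Ms = await getP95WriteLatencyMs();

    const healthy =
      lagS < MAX_REPLICATION_LAG_S &&
      p95Ms < baselineP95Ms * (1 + MAX_P95_INCREASE);
    if (healthy) return;

    console.warn(`backfill paused: replication lag ${lagS}s, p95 ${p95Ms}ms`);
    await new Promise((resolve) => setTimeout(resolve, 30_000)); // re-check every 30s
  }
}

// Call await waitUntilHealthy(baselineP95Ms) at the top of each backfill iteration.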

Metrics that should be on a dashboard during migration:

  • Error rate (5xx, DB errors, deadlocks)
  • p95/p99 latency at API and DB
  • DB lock waits (Postgres pg_locks, MySQL performance_schema)
  • Replication lag
  • Connection pool saturation
  • Migration/backfill throughput and estimated completion

If you don’t have these, you’re flying IFR with the instruments off.

Tooling: keep humans out of the critical path

You want repeatability and guardrails, not a heroic staff engineer SSH’ing into prod.

Good combos I’ve seen work:

  • Migration runner: Flyway or Liquibase (strong audit trail), or framework-native (Rails, Django, Alembic) with discipline.
  • GitOps delivery: ArgoCD or Flux for consistent promotion across environments.
  • Feature flags: LaunchDarkly, Unleash, Flagsmith.
  • Observability: Prometheus + Grafana, plus DB-native insights (RDS Performance Insights, pg_stat_statements).

Example: Flyway with safety settings (conceptual):

# flyway.conf pattern (exact keys vary by setup)
flyway:
  outOfOrder: false
  validateMigrationNaming: true
  placeholders:
    lock_timeout: "2s"
    statement_timeout: "30s"

And in SQL migrations, enforce timeouts explicitly:

SET lock_timeout = '2s';
SET statement_timeout = '30s';

Checkpoint:

  • Migrations run automatically in CI/CD with an audit trail.
  • Manual steps (backfill, flag flips) are scripted and reversible.

What GitPlumbers does when this goes sideways (and how to avoid getting there)

We get pulled in when a team did “the obvious migration,” it locked a table, and now they’re scared to touch the database. Or the app has AI-generated migration code that looks fine but ignores lock behavior, index build costs, or replication.

The fix is rarely exotic: we help teams implement the expand/contract discipline, add the missing guardrails (timeouts, canaries, feature flags), and wire up the dashboards that make go/no-go decisions boring.

If you’re planning a risky change (big table, hot write path, multi-tenant schema, or a MySQL online DDL situation), this is exactly the kind of work GitPlumbers is built for.


Key takeaways

  • Treat schema changes like distributed systems changes: **expand → backfill → validate → contract** across multiple deploys.
  • Use **online DDL** primitives (`CREATE INDEX CONCURRENTLY`, MySQL `ALGORITHM=INPLACE/INSTANT`, `VALIDATE CONSTRAINT`) and enforce **short lock timeouts**.
  • Backfills should be **chunked, rate-limited, observable**, and safe to stop/restart.
  • Use **feature flags** + dual paths (dual-write or read-fallback) to decouple app deploy risk from migration risk.
  • Your go/no-go should be based on metrics: **lock waits, p95 latency, error rate, replication lag, migration throughput**.

Implementation checklist

  • Migration plan written as expand/contract across at least 2 deploys
  • Preflight: table size, indexes, write QPS, replication lag, maintenance windows (if any)
  • Lock strategy: `lock_timeout`/`statement_timeout` (Postgres) or online DDL tooling (MySQL)
  • Backfill job is chunked, idempotent, rate-limited, and monitored
  • Dual-read/dual-write behavior gated behind a feature flag
  • Validation queries + constraints staged (`NOT VALID` then `VALIDATE`)
  • Rollback path documented (app rollback + data compatibility)
  • Post-migration cleanup (drop old column/index) scheduled after stability window

Questions we hear from teams

Can I do true zero-downtime schema changes without feature flags?
You can sometimes get away with it for additive changes (new nullable column, new table) if the app never touches the new schema until after deploy. The moment you need dual-read/dual-write, flags make it safer because you can canary and roll back behavior without rolling back binaries.
What’s the safest way to add a NOT NULL column on Postgres?
Add the column nullable (no default rewrite), deploy code that writes it, backfill, add a `CHECK (col IS NOT NULL) NOT VALID`, validate it, then (optionally) set `NOT NULL`. This avoids long exclusive locks and gives you validation checkpoints.
How do you prevent migrations from blocking production traffic?
Set aggressive `lock_timeout` and reasonable `statement_timeout` for migration sessions, use online primitives (`CREATE INDEX CONCURRENTLY`, staged constraint validation), and schedule copy-based MySQL changes via `gh-ost`/`pt-online-schema-change` with throttling tied to replication lag and query latency.
What metrics are the best early warning signs during a backfill?
p95 write latency on the primary, replication lag, deadlocks/lock waits, and connection pool saturation. Backfill throughput/ETA is also important so you can predict whether you’ll collide with peak traffic windows.
