The Zero‑Downtime Cutover Checklist We Actually Use in Production

A hands-on, step-by-step runbook for migrating a critical workload without a blip — with the guardrails, metrics, and tooling that keep you out of pager hell.

Zero downtime isn’t luck. It’s a boring, rehearsed checklist with a rollback faster than the forward path.

The scenario you’ve seen before

You’ve got a revenue-path service doing 3k TPS, backed by a tired MySQL primary in a region you want to exit. Or you’re moving from an ancient VM stack to Kubernetes with Istio. The VP says “zero downtime,” finance says “don’t double-spend,” and you remember the last time DNS bit you. We’ve seen this movie at fintechs, SaaS unicorns, and marketplaces. The zero-downtime migrations that worked all followed the same playbook.

The trick isn’t heroics at 3am. It’s a boring checklist, rehearsed twice, with a rollback that’s easier than the forward path.

1) Preflight: freeze, SLOs, and abort criteria

Lock the blast radius before you touch anything.

  • Change freeze: lock deploys for the service(s), schema, and infra. Create a change window in ServiceNow/Jira.
  • Owners on-call: app, DB, SRE, networking, observability. One decision-maker, one comms channel (#cutover-warroom).
  • SLOs + abort gates (write them down):
    • Error rate: 5xx < 0.5% sustained; abort if > 0.5% for 3 minutes.
    • Latency: p95 < 2x baseline; abort if > 2x for 5 minutes.
    • Saturation: CPU < 70%, DB connections < 80% of max.
    • Replication lag: < 500ms; abort if > 2s for 2 minutes.
  • Capacity: pre-provision target to 1.5–2x expected peak QPS. Do a 15-minute synthetic load test (k6, Vegeta) with production-like headers and auth (a Vegeta sketch follows this list).
  • Observability: create a cutover dashboard in Grafana:
    • 4 golden signals per service, plus DB lag, queue depth, Kafka consumer lag, LB error counts.
    • Separate panels for source vs target.
  • Toggle strategy: choose your primary switch:
    • Istio VirtualService weighted routing.
    • LB weighted backends (NGINX, HAProxy, AWS ALB target weights).
    • Feature flag (LaunchDarkly, OpenFeature) at the edge or in the app.
  • DNS plan if applicable: Route 53/Cloudflare TTL → 30–60s 24 hours before cutover. Don’t rely on instant TTL changes.
  • Runbook: minute-by-minute script with commands, dashboards, and “who does what if X happens.” Store it in git, versioned, reviewed.
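
For the capacity bullet above, a minimal Vegeta run might look like this (the target URL, the 4500/s rate, and the token are placeholders; substitute your real peak numbers and auth):

# 15-minute synthetic load at ~1.5x expected peak with production-like headers.
# The target URL, rate, and $CANARY_TOKEN are placeholders.
echo "GET https://payments-new.internal/api/v1/quote" | \
  vegeta attack -rate=4500/s -duration=15m \
    -header "Authorization: Bearer $CANARY_TOKEN" \
    -header "X-Cutover-Route: shadow" | \
  vegeta report -type=text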

Proof point: at a payments client (≈3.5k TPS), these gates kept us honest. We aborted once on a rehearsal when p95 doubled due to a missing Redis warmup. Saved the real cutover.

2) Data: replicate, backfill, cutover, verify

Traffic is easy. Data is where migrations die. Treat reads and writes separately.

  1. Choose your replication:
    • Postgres: logical replication (pglogical), or AWS DMS for cross-cloud. Schema drift tracked with Flyway/Liquibase.
    • MySQL: gh-ost or pt-online-schema-change for live schema, plus binlog replication.
    • MongoDB: add the target as a hidden secondary to sync, then reconfigure (unhide, raise priority) and step it up at cutover.
    • Kafka topics: mirror with MirrorMaker 2 or Confluent Replicator; monitor consumer lag.
  2. Backfill:
    • Bulk load historical data first (snapshots). For PG: COPY from S3; for MySQL: mysqldump → import or mydumper.
    • Verify row counts and checksums per table: MySQL’s CHECKSUM TABLE or a custom hash(id||updated_at) pass (sketch at the end of this section). Note that pg_checksums only validates on-disk page checksums, not source-vs-target parity.
  3. Dual-write or CDC:
    • Prefer app-level dual-write behind a feature flag for idempotent operations; include a retry queue.
    • If app changes are risky, use CDC (Debezium, AWS DMS) to stream changes.
  4. Read routing:
    • Keep reads on the source until lag on the target is < 500ms and read-only checks pass.
  5. Cutover:
    • Freeze writes for 30–120 seconds (only if absolutely needed) or ensure idempotency guarantees.
    • Point writers to the target; validate replication direction if you need a rollback path.
  6. Post-cutover verification:
    • Run diff reports on hot tables: row counts, sum of key columns, sample checksums.
    • Reconcile dead-letter queues and retry topics.

Example: we moved a 1.2B-row ledger from RDS to Aurora with pglogical and a Debezium sidecar. Backfill took 9 hours; max lag during cutover: 220ms; write freeze: 0s; total user-visible impact: 0.
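
For the backfill and post-cutover verification steps, a minimal per-table checksum pass (assuming Postgres on both sides) can be as simple as the loop below; the table list and DSNs are placeholders, and very large tables should be chunked by id range rather than hashed in one pass:

# Row count + ordered hash(id||updated_at) per hot table; run against source and target, diff the output.
# $SOURCE_DSN, $TARGET_DSN, and the table names are placeholders.
for t in ledger_entries payments refunds; do
  for dsn in "$SOURCE_DSN" "$TARGET_DSN"; do
    psql "$dsn" -At -c \
      "SELECT '$t', count(*), md5(string_agg(id::text || updated_at::text, ',' ORDER BY id)) FROM $t;"
  done
done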

3) Traffic control: canary first, then weighted

Do not big-bang your traffic. Prove the new path under real load, then roll forward.

  • Shadow traffic (optional but powerful): duplicate a slice of requests to the target with responses discarded.
    • Tools: Istio mirrors (a mirroring sketch follows the weighted-split example below), NGINX mirror module, or a service mesh tap.
    • Compare response codes and latency distributions.
  • Canary 1% → 5% → 25% → 50% → 100% with pause gates at each step.
  • Sticky sessions: turn them off or account for them. If not possible, cut by cohort (e.g., device IDs, tenants).
  • Circuit breakers: set conservative maxConnections, timeouts, and outlier detection (a DestinationRule sketch follows the routing examples below).

Istio example (weighted split):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc
spec:
  hosts: ["payments.internal"]
  http:
  - route:
    - destination: { host: payments-old, subset: v1 }
      weight: 95
    - destination: { host: payments-new, subset: v2 }
      weight: 5
    retries:
      attempts: 2
      perTryTimeout: 500ms
    timeout: 2s
    fault:
      # staged at 0% so fault injection can be dialed up during game-day drills
      abort: { percentage: { value: 0 }, httpStatus: 503 }
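
For the shadow-traffic option above, the same VirtualService shape can mirror a slice of requests to the new subset while still serving 100% from the old path; Envoy discards the mirrored responses. A sketch (the 5% sample is arbitrary):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc-shadow
spec:
  hosts: ["payments.internal"]
  http:
  - route:
    - destination: { host: payments-old, subset: v1 }
      weight: 100
    mirror: { host: payments-new, subset: v2 }
    mirrorPercentage: { value: 5.0 }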

NGINX example (weighted upstream):

upstream payments {
  server old:8080 weight=95 max_fails=1 fail_timeout=2s;
  server new:8080 weight=5  max_fails=1 fail_timeout=2s;
}
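
The v1/v2 subsets referenced above need a DestinationRule anyway, and it is the natural home for the circuit-breaker settings from the list; the limits below are illustrative starting points, not recommendations:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-new
spec:
  host: payments-new
  subsets:
  - name: v2
    labels: { version: v2 }
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 200 }
      http: { http1MaxPendingRequests: 100, maxRequestsPerConnection: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50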

Call out: if you depend on DNS, prep it a day early. Set TTL to 60s, validate resolvers honor it, and expect some long-lived clients to ignore you.
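
A quick way to confirm resolvers are honoring the lowered TTL (hostname and resolver are placeholders):

# The TTL column should read <= 60 and count down between runs; repeat against your corporate resolvers.
dig +noall +answer payments.example.com @1.1.1.1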

4) Observability: dashboards, SLOs, and hard stops

If you can’t see it, you can’t cut it over.

  • Dashboards (Prometheus/Grafana):
    • Request rate, error rate, p50/p95/p99 latency split by old vs new path.
    • DB metrics: replication lag, lock waits, deadlocks, connection pool saturation.
    • Infra: CPU, memory, GC pauses, thread pool queue length.
    • Queue/Kafka: consumer lag, DLQ rate.
  • Log correlation: tag requests by route (x-cutover-route: old|new|shadow). Sample traces in Jaeger/Tempo with that tag.
  • Synthetic checks: Blackbox Exporter hitting both old and new endpoints with a canary token.
  • Alerts tuned for cutover (a PrometheusRule sketch follows this list):
    • “Cutover error rate > 0.5% for 3m” → auto-trigger rollback playbook.
    • “DB replication lag > 2s for 2m” → pause traffic shift.
  • Business metrics: auth success rate, checkout conversion, payment success by PSP. Don’t ship if the business graph tanks.
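
A sketch of the two hard gates as Prometheus Operator rules; the metric names (http_requests_total with a route label, pg_replication_lag_seconds) are assumptions, so substitute whatever your exporters actually expose:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cutover-gates
spec:
  groups:
  - name: cutover
    rules:
    # 5xx ratio on the new path > 0.5% for 3 minutes: trigger the rollback playbook
    - alert: CutoverErrorRateHigh
      expr: |
        sum(rate(http_requests_total{route="new", code=~"5.."}[1m]))
          / sum(rate(http_requests_total{route="new"}[1m])) > 0.005
      for: 3m
      labels: { severity: page, action: rollback }
    # replication lag > 2s for 2 minutes: pause the traffic shift
    - alert: CutoverReplicationLagHigh
      expr: pg_replication_lag_seconds > 2
      for: 2m
      labels: { severity: page, action: pause-shift }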

We’ve shut down cutovers when p95 was clean but payment success dipped 1.2%. Turned out the PSP’s IP whitelist didn’t include the new egress IPs. Business metrics saved revenue.

5) Dry runs and game day

Rehearsals are where embarrassment is cheap.

  1. Clone prod shape: same schemas, masked data, similar cardinalities. Rehydrate caches. Use last week’s traffic replay with Gor/GoReplay or mizu (a capture/replay sketch follows this list).
  2. Run full migration: backfill, CDC, canary, rollback. Twice.
  3. Inject failure (chaos, but surgical): kill a replica, spike latency, force a DNS cache miss. Measure MTTR.
  4. Time it: document each phase duration, expected lag, and the slowest step. This becomes your day-of timeline buffers.
  5. Fix toil: every manual step becomes a script (make cutover, script/canary.sh).
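
A minimal GoReplay capture/replay pair for step 1; the port, file name, and rehearsal URL are placeholders:

# Capture a slice of production traffic on the old path
gor --input-raw :8080 --output-file requests.gor
# Replay it later against the rehearsal environment
gor --input-file requests.gor --output-http "https://payments-rehearsal.internal"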

Deliverables: a runbook PR with updated timings, screenshots of dashboards at each gate, and one recorded rollback drill.

6) Day-of runbook (minute-by-minute)

This is the exact sequence we hand to teams. Tailor numbers to your SLOs.

  1. T‑30m: confirm on-calls, change window opened, dashboards green, feature flags prepared but off.
  2. T‑25m: enable shadow traffic to target for 5% of requests; compare status codes and p95 (delta < 10%).
  3. T‑20m: start canary 1% user traffic to target.
  4. Hold 5m: monitor error rate (< 0.5%), p95 (< 2x), replication lag (< 500ms), CPU (< 70%).
  5. Ramp to 5% → hold 5m; fix any headroom issues (autoscale up if HPA lags).
  6. Ramp to 25% → hold 10m; warm caches explicitly (/warmup if you have it) and pre-prime JIT.
  7. Ramp to 50% → hold 10m; verify business KPIs (auth, payments, conversion).
  8. Ramp to 100%; keep shadow to old path for 10m for parity checks.
  9. Flip writers to target DB or confirm dual-writes are consistent; verify replication direction and lag < 500ms.
  10. Run post-cutover data diffs on hot tables; reconcile DLQ.
  11. Announce “provisional success”; hold for 30–60m with owners watching dashboards.
  12. If any hard gate trips, execute rollback immediately:
    • Set traffic weight back to 0% on target (single command).
    • Swap DB writers to source.
    • Drain in-flight queues; preserve audit logs.

Commands we typically prepare:

# Istio traffic shift (pre-staged manifests; pause ArgoCD auto-sync first, or flip weights via GitOps)
kubectl apply -f vs-weights-5.yaml
kubectl apply -f vs-weights-25.yaml
# NGINX weight flip (via GitOps)
git commit -m "canary 25%" && git push; argocd app sync payments-edge
# Feature flag kill switch (shown via the LaunchDarkly REST API; adjust project/flag keys to your setup)
curl -s -X PATCH "https://app.launchdarkly.com/api/v2/flags/default/payments.new_path" \
  -H "Authorization: $LD_API_TOKEN" \
  -H "Content-Type: application/json; domain-model=launchdarkly.semanticpatch" \
  -d '{"environmentKey": "production", "comment": "error_rate>0.5%", "instructions": [{"kind": "turnFlagOff"}]}'
# Read-only toggle (if needed for safety); ALTER SYSTEM cannot run inside a transaction, so use separate -c calls
psql -c "ALTER SYSTEM SET default_transaction_read_only = on;" -c "SELECT pg_reload_conf();"

7) Aftercare: verify, decommission, and cost sanity

You’re not done at 100% traffic.

  • Extended hold: 24–48 hours with elevated alerts. Nightly checksums on top tables.
  • Decommission in phases:
    • Keep the old path hot for 24h as a warm standby.
    • Remove dual-writes after reconciliation is clean for 48h.
    • Turn off CDC, then tear down old infra via Terraform with a change review.
  • Cost and performance:
    • Compare p95/p99 and error rate week-over-week.
    • Validate autoscaling targets and right-size instances (don’t strand that 2x capacity forever).
  • Postmortem (even if green): document surprises, timings vs expectations, and update the runbook.
  • Security and compliance: update data flow diagrams, DPIA, and vendor scopes; notify auditors if required.

At a subscription SaaS, this phase alone cut ~22% of monthly infra spend by right-sizing after the adrenaline wore off.

Tooling menu (pick what fits your stack)

  • Infra and deployment: Terraform, ArgoCD, Spinnaker, GitHub Actions, Flux.
  • Traffic: Istio, Envoy, NGINX, HAProxy, AWS ALB/NLB, Cloudflare, Route 53.
  • Feature flags: LaunchDarkly, OpenFeature, Flipt.
  • Data migration:
    • Postgres: pglogical, AWS DMS, pg_dump/COPY, pg_checksums.
    • MySQL: gh-ost, pt-online-schema-change, binlog replication.
    • MongoDB: replica set promotions.
    • Kafka: MirrorMaker 2, Confluent Replicator.
  • Observability: Prometheus, Grafana, Jaeger/Tempo, Blackbox Exporter, Loki/ELK.
  • Load/replay: k6, Vegeta, GoReplay, mizu.

If you need a second set of eyes, GitPlumbers can run a rehearsal, build the runbook, and sit in your war room so you sleep at night.


Key takeaways

  • Zero downtime isn’t magic — it’s a checklist, SLOs, and rehearsals.
  • Traffic and data are separate problems; treat them with different playbooks.
  • Define hard abort criteria upfront; automate rollbacks and don’t negotiate with dashboards.
  • Measure replication lag, error rate, p95 latency, and saturation; everything else is noise during cutover.
  • Practice the migration on production-like data; game days beat slide decks.

Implementation checklist

  • Freeze changes: code, schema, and infra, with an explicit change window and dry-run signed off by owners.
  • Document SLOs and hard abort criteria (e.g., 5xx > 0.5% for 3m, p95 > 2x baseline for 5m, replication lag > 2s for 2m).
  • Implement dual-run or shadow traffic to validate the target path before shifting user traffic.
  • Set up data replication: logical replication (Postgres `pglogical`), `gh-ost` for MySQL, or DMS/CDC for cloud hops.
  • Backfill and verify checksums/row counts; build a repeatable diff report with pass/fail gates.
  • Wire traffic control: `Istio VirtualService` weights or LB weighted backends; set DNS TTL to 30–60s if needed.
  • Pre-provision capacity to 1.5–2x expected peak; smoke test concurrency and failure modes.
  • Automate rollback with a single switch (feature flag/LB weight/DNS); dry run a full rollback twice.
  • Create a minute-by-minute runbook with owners, commands, dashboards, and comms channels.
  • Run the cutover, monitor 4 golden signals, hold for stability, then decommission the old path in phases.

Questions we hear from teams

Do I really need dual-writes for zero downtime?
Not always, but you need continuous consistency. If app-level dual-writes are risky, use CDC (Debezium, DMS) to mirror changes. If you can guarantee idempotency and have a brief write freeze of 30–120 seconds, you can sometimes avoid dual-writes—just be honest about the SLO impact.
What if the schema changes aren’t backward compatible?
Use an expand/contract pattern. Deploy code that can read both versions, expand schema (add new fields/tables), backfill, flip writers, then contract (drop old fields) only after traffic is 100% on the new path and retention windows close.
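
A minimal expand/contract pass sketched with psql; the table and column names are made up for illustration:

# Expand: additive, backward-compatible (old code simply ignores the new column)
psql "$DSN" -c "ALTER TABLE payments ADD COLUMN processor_ref text;"
# ...deploy code that reads both shapes, backfill in batches, flip writers, run at 100%...
# Contract: only after traffic is fully on the new path and retention windows close
psql "$DSN" -c "ALTER TABLE payments DROP COLUMN legacy_processor_ref;"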
Can we do this with DNS only?
You can, but it’s the spiciest option. Prep TTLs 24h in advance, expect stragglers to cache for minutes or hours, and have a fast rollback path that doesn’t depend on DNS propagation. Prefer LB or service mesh weight flips when possible.
How do queues and async jobs affect cutover?
Drain or duplicate them. Pause consumers, snapshot offsets, mirror topics/queues, then resume consumers pointing at the target. Monitor consumer lag and DLQ rate as first-class cutover metrics.
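
Snapshotting offsets and watching lag is one command with stock Kafka tooling; the broker address and group name are placeholders:

# Shows CURRENT-OFFSET / LOG-END-OFFSET / LAG per partition; capture before pausing, re-check after resuming on the target
kafka-consumer-groups.sh --bootstrap-server kafka-old:9092 --describe --group payments-consumer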
What’s a realistic rehearsal success signal?
Two clean end-to-end rehearsals with max replication lag < 500ms, p95 latency within 20% of baseline under replayed load, and a rollback that takes under 3 minutes from decision to steady state.
What if we can’t double capacity due to budget?
Prioritize prewarming the target and cut at off-peak. Use tighter canary steps (1%→2%→5%). Keep cache hit rates high and push heavier reports/batch to a maintenance window. But don’t skip abort gates—budget cuts don’t change physics.

Book a migration rehearsal with GitPlumbers, or read the payments cutover case study.
