The Zero‑Downtime Cutover Checklist We Actually Use in Production
A hands-on, step-by-step runbook for migrating a critical workload without a blip — with the guardrails, metrics, and tooling that keep you out of pager hell.
Zero downtime isn’t luck. It’s a boring, rehearsed checklist with a rollback faster than the forward path.
The scenario you’ve seen before
You’ve got a revenue-path service doing 3k TPS, backed by a tired MySQL primary in a region you want to exit. Or you’re moving from an ancient VM stack to Kubernetes with Istio. The VP says “zero downtime,” finance says “don’t double-spend,” and you remember the last time DNS bit you. We’ve run this movie at fintechs, SaaS unicorns, and marketplaces. The zero-downtime migrations that worked all followed the same playbook.
The trick isn’t heroics at 3am. It’s a boring checklist, rehearsed twice, with a rollback that’s easier than the forward path.
1) Preflight: freeze, SLOs, and abort criteria
Lock down the blast radius before you touch anything.
- Change freeze: lock deploys for the service(s), schema, and infra. Create a change window in ServiceNow/Jira.
- Owners on-call: app, DB, SRE, networking, observability. One decision-maker, one comms channel (`#cutover-warroom`).
- SLOs + abort gates (write them down):
  - Error rate: 5xx < 0.5% sustained; abort if > 0.5% for 3 minutes.
  - Latency: p95 < 2x baseline; abort if > 2x for 5 minutes.
  - Saturation: CPU < 70%, DB connections < 80% of max.
  - Replication lag: < 500ms; abort if > 2s for 2 minutes.
- Capacity: pre-provision the target to 1.5–2x expected peak QPS. Do a 15-minute synthetic load test (k6, Vegeta) with production-like headers and auth (see the Vegeta sketch after this list).
- Observability: create a cutover dashboard in Grafana:
  - 4 golden signals per service, plus DB lag, queue depth, Kafka consumer lag, LB error counts.
  - Separate panels for source vs target.
- Toggle strategy: choose your primary switch:
  - Istio `VirtualService` weighted routing.
  - LB weighted backends (NGINX, HAProxy, AWS ALB target weights).
  - Feature flag (LaunchDarkly, OpenFeature) at the edge or in the app.
- DNS plan if applicable: drop Route 53/Cloudflare TTLs to 30–60s 24 hours before cutover. Don’t rely on instant TTL changes.
- Runbook: minute-by-minute script with commands, dashboards, and “who does what if X happens.” Store it in git, versioned, reviewed.
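For the capacity check, a minimal Vegeta sketch; the endpoint, auth header, and rate are placeholders (size the rate to 1.5–2x your observed peak):

# targets.txt holds a production-like request (real path, real auth header)
cat > targets.txt <<EOF
GET https://payments-new.internal/api/v1/quotes
Authorization: Bearer ${CANARY_TOKEN}
EOF

# 15-minute attack at roughly 2x a 3k TPS peak, then a latency/error report
vegeta attack -targets=targets.txt -rate=6000/s -duration=15m | tee results.bin | vegeta report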
Proof point: at a payments client (≈3.5k TPS), these gates kept us honest. We aborted once on a rehearsal when p95 doubled due to a missing Redis warmup. Saved the real cutover.
2) Data: replicate, backfill, cutover, verify
Traffic is easy. Data is where migrations die. Treat reads and writes separately.
- Choose your replication:
  - Postgres: logical replication (`pglogical`), or AWS DMS for cross-cloud. Schema drift tracked with Flyway/Liquibase.
  - MySQL: `gh-ost` or `pt-online-schema-change` for live schema changes, plus binlog replication.
  - MongoDB: add the target as a hidden secondary, promote later.
  - Kafka topics: mirror with MirrorMaker 2 or Confluent Replicator; monitor consumer lag.
- Backfill:
  - Bulk load historical data first (snapshots). For Postgres: `COPY` from S3; for MySQL: `mysqldump` → import, or `mydumper`.
  - Verify row counts and checksums per table. Use `pg_checksums`, `CHECKSUM TABLE`, or a custom `hash(id||updated_at)` pass (sketched below).
- Dual-write or CDC:
  - Prefer app-level dual-write behind a feature flag for idempotent operations; include a retry queue.
  - If app changes are risky, use CDC (Debezium, AWS DMS) to stream changes (a Debezium sketch follows this list).
- Read routing:
  - Keep reads on the source until lag on the target is < 500ms and read-only checks pass.
- Cutover:
  - Freeze writes for 30–120 seconds (only if absolutely needed) or ensure idempotency guarantees.
  - Point writers to the target; validate the replication direction if you need a rollback path.
- Post-cutover verification:
  - Run diff reports on hot tables: row counts, sums of key columns, sample checksums.
  - Reconcile dead-letter queues and retry topics.
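If you go the CDC route, a minimal Debezium registration sketch; the connector name, hosts, credentials, and table list are placeholders, and the field names assume the Debezium 2.x Postgres connector:

# Register a Postgres CDC connector with Kafka Connect
curl -s -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ledger-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "plugin.name": "pgoutput",
      "database.hostname": "source-db.internal",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "changeme",
      "database.dbname": "payments",
      "topic.prefix": "ledger",
      "table.include.list": "public.ledger_entries"
    }
  }'

Treat the connector’s lag and snapshot progress the same way you treat replication lag: a hard gate, not a nice-to-have.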
Example: we moved a 1.2B-row ledger from RDS to Aurora with `pglogical` and a Debezium sidecar. Backfill took 9 hours; max lag during cutover: 220ms; write freeze: 0s; total user-visible impact: 0.
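For the checksum pass, a minimal sketch, assuming Postgres on both sides and an `updated_at` column; the table name and DSNs are illustrative:

# Run the same sampled checksum on source and target, then diff the output.
# The WHERE clause skips rows still inside the replication-lag window.
QUERY="SELECT count(*) AS row_count,
              sum(hashtext(id::text || updated_at::text)) AS checksum
       FROM ledger_entries
       WHERE updated_at < now() - interval '5 minutes';"

psql "$SOURCE_DSN" -Atc "$QUERY" > /tmp/source.chk
psql "$TARGET_DSN" -Atc "$QUERY" > /tmp/target.chk
diff /tmp/source.chk /tmp/target.chk && echo "PASS" || echo "FAIL: do not cut over"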
3) Traffic control: canary first, then weighted
Do not big-bang your traffic. Prove the new path under real load, then roll forward.
- Shadow traffic (optional but powerful): duplicate a slice of requests to the target with responses discarded.
  - Tools: Istio `mirror`, the NGINX `mirror` module, or a service mesh tap (see the mirror sketch after the weighted example below).
  - Compare response codes and latency distributions.
- Canary 1% → 5% → 25% → 50% → 100%, with pause gates at each step.
- Sticky sessions: turn them off or account for them. If that’s not possible, cut by cohort (e.g., device IDs, tenants).
- Circuit breakers: set conservative `maxConnections`, timeouts, and outlier detection.
Istio example (weighted split):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc
spec:
  hosts: ["payments.internal"]
  http:
    - route:
        - destination: { host: payments-old, subset: v1 }
          weight: 95
        - destination: { host: payments-new, subset: v2 }
          weight: 5
      retries:
        attempts: 2
        perTryTimeout: 500ms
      timeout: 2s
      fault:
        abort:
          percentage: { value: 0 }  # keep at 0; raise only for failure-injection drills
          httpStatus: 503
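For the shadow step, a minimal mirror sketch on the same hosts and subsets; `mirrorPercentage` assumes Istio 1.7 or newer:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc-shadow
spec:
  hosts: ["payments.internal"]
  http:
    - route:
        - destination: { host: payments-old, subset: v1 }
          weight: 100
      mirror:
        host: payments-new
        subset: v2
      mirrorPercentage:
        value: 5.0   # shadow 5% of requests; mirrored responses are discarded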
NGINX example (weighted upstream):
upstream payments {
server old:8080 weight=95 max_fails=1 fail_timeout=2s;
server new:8080 weight=5 max_fails=1 fail_timeout=2s;
}
Call out: if you depend on DNS, prep it a day early. Set TTL to 60s, validate resolvers honor it, and expect some long-lived clients to ignore you.
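A Route 53 sketch for that TTL prep; the zone ID, record name, and target are placeholders:

# Drop the TTL to 60s a day ahead so resolvers pick it up before cutover
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "payments.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "old-lb.example.com"}]
      }
    }]
  }'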
4) Observability: dashboards, SLOs, and hard stops
If you can’t see it, you can’t cut it over.
- Dashboards (Prometheus/Grafana):
  - Request rate, error rate, p50/p95/p99 latency, split by old vs new path.
  - DB metrics: replication lag, lock waits, deadlocks, connection pool saturation.
  - Infra: CPU, memory, GC pauses, thread pool queue length.
  - Queue/Kafka: consumer lag, DLQ rate.
- Log correlation: tag requests by route (`x-cutover-route: old|new|shadow`). Sample traces in Jaeger/Tempo with that tag.
- Synthetic checks: Blackbox Exporter hitting both old and new endpoints with a canary token.
- Alerts tuned for cutover (a Prometheus sketch follows this list):
  - “Cutover error rate > 0.5% for 3m” → auto-trigger the rollback playbook.
  - “DB replication lag > 2s for 2m” → pause the traffic shift.
- Business metrics: auth success rate, checkout conversion, payment success by PSP. Don’t ship if the business graph tanks.
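A minimal Prometheus rule sketch for the two hard gates; the metric and label names (`http_requests_total`, `route`, `pg_replication_lag_seconds`) depend on your exporters and are assumptions here:

groups:
  - name: cutover-gates
    rules:
      - alert: CutoverErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="payments", route="new", status=~"5.."}[1m]))
            / sum(rate(http_requests_total{job="payments", route="new"}[1m])) > 0.005
        for: 3m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Cutover gate tripped: 5xx > 0.5% for 3m on the new path"
      - alert: CutoverReplicationLagHigh
        expr: pg_replication_lag_seconds{instance="target-db"} > 2
        for: 2m
        labels: { severity: page, action: pause-shift }
        annotations:
          summary: "Replication lag > 2s for 2m; pause the traffic shift"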
We’ve shut down cutovers when p95 was clean but payment success dipped 1.2%. The culprit: the new egress IPs weren’t on the PSP’s allowlist. Metrics saved revenue.
5) Dry runs and game day
Rehearsals are where embarrassment is cheap.
- Clone prod shape: same schemas, masked data, similar cardinalities. Rehydrate caches. Replay last week’s traffic with GoReplay (formerly Gor) or mizu (replay sketch after this list).
- Run the full migration: backfill, CDC, canary, rollback. Twice.
- Inject failure (chaos, but surgical): kill a replica, spike latency, force a DNS cache miss. Measure MTTR.
- Time it: document each phase’s duration, expected lag, and the slowest step. This becomes your day-of timeline, with buffers.
- Fix toil: every manual step becomes a script (`make cutover`, `script/canary.sh`).
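For the traffic replay, a minimal GoReplay sketch; the port, file name, and rehearsal host are placeholders:

# Capture live traffic on a production edge node
gor --input-raw :8080 --output-file requests.gor

# Replay the capture against the rehearsal environment at 2x speed
gor --input-file "requests.gor|200%" --output-http "https://payments-rehearsal.internal"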
Deliverables: a runbook PR with updated timings, screenshots of dashboards at each gate, and one recorded rollback drill.
6) Day-of runbook (minute-by-minute)
This is the exact sequence we hand to teams. Tailor numbers to your SLOs.
- T‑30m: confirm on-calls, change window opened, dashboards green, feature flags prepared but off.
- T‑25m: enable shadow traffic to the target for 5% of requests; compare status codes and p95 (delta < 10%).
- T‑20m: start the canary with 1% of user traffic to the target.
- Hold 5m: monitor error rate (< 0.5%), p95 (< 2x), replication lag (< 500ms), CPU (< 70%).
- Ramp to 5% → hold 5m; fix any headroom issues (autoscale up if HPA lags).
- Ramp to 25% → hold 10m; warm caches explicitly (`/warmup` if you have it) and pre-prime the JIT.
- Ramp to 50% → hold 10m; verify business KPIs (auth, payments, conversion).
- Ramp to 100%; keep shadowing to the old path for 10m for parity checks.
- Flip writers to the target DB or confirm dual-writes are consistent; verify replication direction and lag < 500ms.
- Run post-cutover data diffs on hot tables; reconcile the DLQ.
- Announce “provisional success”; hold for 30–60m with owners watching dashboards.
- If any hard gate trips, execute the rollback immediately:
  - Set traffic weight back to 0% on the target (single command).
  - Swap DB writers back to the source.
  - Drain in-flight queues; preserve audit logs.
Commands we typically prepare:
# Istio traffic shift (ArgoCD-managed)
kubectl apply -f vs-weights-5.yaml
kubectl apply -f vs-weights-25.yaml
# NGINX weight flip (via GitOps)
git commit -m "canary 25%" && git push; argocd app sync payments-edge
# Feature flag kill switch
ld toggle payments.new_path off --reason "error_rate>0.5%"
# Read-only toggle (if needed for safety)
psql -c "ALTER SYSTEM SET default_transaction_read_only=on; SELECT pg_reload_conf();"
7) Aftercare: verify, decommission, and cost sanity
You’re not done at 100% traffic.
- Extended hold: 24–48 hours with elevated alerts. Nightly checksums on top tables (a CronJob sketch follows this list).
- Decommission in phases:
  - Keep the old path hot for 24h as a warm standby.
  - Remove dual-writes after reconciliation is clean for 48h.
  - Turn off CDC, then tear down old infra via Terraform with a change review.
- Cost and performance:
  - Compare p95/p99 and error rate week-over-week.
  - Validate autoscaling targets and right-size instances (don’t strand that 2x capacity forever).
- Postmortem (even if green): document surprises, timings vs expectations, and update the runbook.
- Security and compliance: update data flow diagrams, DPIA, and vendor scopes; notify auditors if required.
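For the nightly checksums, a minimal Kubernetes CronJob sketch; the image, script path, Secret, and ConfigMap names are placeholders, and the script is the checksum diff shown earlier:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: post-cutover-checksums
spec:
  schedule: "0 3 * * *"               # nightly, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: checksums
              image: postgres:16       # any image with psql
              command: ["/bin/sh", "-c", "/scripts/checksum-diff.sh"]
              env:
                - { name: SOURCE_DSN, valueFrom: { secretKeyRef: { name: cutover-dsns, key: source } } }
                - { name: TARGET_DSN, valueFrom: { secretKeyRef: { name: cutover-dsns, key: target } } }
              volumeMounts:
                - { name: scripts, mountPath: /scripts }
          volumes:
            - name: scripts
              configMap: { name: cutover-scripts, defaultMode: 0755 }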
At a subscription SaaS, this phase alone cut monthly infra spend by ~22%, mostly from right-sizing after the adrenaline wore off.
Tooling menu (pick what fits your stack)
- Infra and deployment: Terraform, ArgoCD, Spinnaker, GitHub Actions, Flux.
- Traffic: Istio, Envoy, NGINX, HAProxy, AWS ALB/NLB, Cloudflare, Route 53.
- Feature flags: LaunchDarkly, OpenFeature, Flipt.
- Data migration:
  - Postgres: `pglogical`, AWS DMS, `pg_dump`/`COPY`, `pg_checksums`.
  - MySQL: `gh-ost`, `pt-online-schema-change`, binlog replication.
  - MongoDB: replica set promotions.
  - Kafka: MirrorMaker 2, Confluent Replicator.
- Observability: Prometheus, Grafana, Jaeger/Tempo, Blackbox Exporter, Loki/ELK.
- Load/replay: k6, Vegeta, GoReplay, mizu.
If you need a second set of eyes, GitPlumbers can run a rehearsal, build the runbook, and sit in your war room so you sleep at night.
Key takeaways
- Zero downtime isn’t magic — it’s a checklist, SLOs, and rehearsals.
- Traffic and data are separate problems; treat them with different playbooks.
- Define hard abort criteria upfront; automate rollbacks and don’t negotiate with dashboards.
- Measure replication lag, error rate, p95 latency, and saturation; everything else is noise during cutover.
- Practice the migration on production-like data; game days beat slide decks.
Implementation checklist
- Freeze changes: code, schema, and infra, with an explicit change window and dry-run signed off by owners.
- Document SLOs and hard abort criteria (e.g., 5xx > 0.5% for 3m, p95 > 2x baseline for 5m, DB lag > 2s for 2m).
- Implement dual-run or shadow traffic to validate the target path before shifting user traffic.
- Set up data replication: logical replication (Postgres `pglogical`), `gh-ost` for MySQL, or DMS/CDC for cloud hops.
- Backfill and verify checksums/row counts; build a repeatable diff report with pass/fail gates.
- Wire traffic control: `Istio VirtualService` weights or LB weighted backends; set DNS TTL to 30–60s if needed.
- Pre-provision capacity to 1.5–2x expected peak; smoke test concurrency and failure modes.
- Automate rollback with a single switch (feature flag/LB weight/DNS); dry run a full rollback twice.
- Create a minute-by-minute runbook with owners, commands, dashboards, and comms channels.
- Run the cutover, monitor 4 golden signals, hold for stability, then decommission the old path in phases.
Questions we hear from teams
- Do I really need dual-writes for zero downtime?
- Not always, but you need continuous consistency. If app-level dual-writes are risky, use CDC (Debezium, DMS) to mirror changes. If you can guarantee idempotency and have a brief write freeze of 30–120 seconds, you can sometimes avoid dual-writes—just be honest about the SLO impact.
- What if the schema changes aren’t backward compatible?
- Use an expand/contract pattern (sketched after these questions). Deploy code that can read both versions, expand the schema (add new fields/tables), backfill, flip writers, then contract (drop old fields) only after traffic is 100% on the new path and retention windows close.
- Can we do this with DNS only?
- You can, but it’s the spiciest option. Prep TTLs 24h in advance, expect stragglers to cache for minutes or hours, and have a fast rollback path that doesn’t depend on DNS propagation. Prefer LB or service mesh weight flips when possible.
- How do queues and async jobs affect cutover?
- Drain or duplicate them. Pause consumers, snapshot offsets, mirror topics/queues, then resume consumers pointing at the target. Monitor consumer lag and DLQ rate as first-class cutover metrics.
- What’s a realistic rehearsal success signal?
- Two clean end-to-end rehearsals with max replication lag < 500ms, p95 latency within 20% of baseline under replayed load, and a rollback that takes under 3 minutes from decision to steady state.
- What if we can’t double capacity due to budget?
- Prioritize prewarming the target and cut at off-peak. Use tighter canary steps (1%→2%→5%). Keep cache hit rates high and push heavier reports/batch to a maintenance window. But don’t skip abort gates—budget cuts don’t change physics.
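For the expand/contract question above, a minimal SQL sketch; the table and column names are illustrative, and on a large table the backfill would run in batches:

-- Expand: additive, backward-compatible change
ALTER TABLE payments ADD COLUMN processor_ref text;   -- old code ignores it, new code can read it
UPDATE payments SET processor_ref = legacy_ref
 WHERE processor_ref IS NULL;                          -- backfill (batch this in practice)

-- Contract: only after 100% of traffic reads the new column and retention windows close
ALTER TABLE payments DROP COLUMN legacy_ref;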