The Zero‑Downtime Cutover Checklist We Actually Use in Production
A hands-on, step-by-step runbook for migrating a critical workload without a blip — with the guardrails, metrics, and tooling that keep you out of pager hell.
Zero downtime isn’t luck. It’s a boring, rehearsed checklist with a rollback faster than the forward path.
The scenario you’ve seen before
You’ve got a revenue-path service doing 3k TPS, backed by a tired MySQL primary in a region you want to exit. Or you’re moving from an ancient VM stack to Kubernetes with Istio. The VP says “zero downtime,” finance says “don’t double-spend,” and you remember the last time DNS bit you. We’ve run this movie at fintechs, SaaS unicorns, and marketplaces. The zero-downtime migrations that worked all followed the same playbook.
The trick isn’t heroics at 3am. It’s a boring checklist, rehearsed twice, with a rollback that’s easier than the forward path.
1) Preflight: freeze, SLOs, and abort criteria
Lock down the blast radius before you touch anything.
- Change freeze: lock deploys for the service(s), schema, and infra. Create a change window in ServiceNow/Jira.
- Owners on-call: app, DB, SRE, networking, observability. One decision-maker, one comms channel (`#cutover-warroom`).
- SLOs + abort gates (write them down):
  - Error rate: 5xx < 0.5% sustained; abort if > 0.5% for 3 minutes.
  - Latency: p95 < 2x baseline; abort if > 2x for 5 minutes.
  - Saturation: CPU < 70%, DB connections < 80% of max.
  - Replication lag: < 500ms; abort if > 2s for 2 minutes.
- Capacity: pre-provision the target to 1.5–2x expected peak QPS. Do a 15-minute synthetic load test (k6, Vegeta) with production-like headers and auth (see the Vegeta sketch after this list).
- Observability: create a cutover dashboard in Grafana:
  - 4 golden signals per service, plus DB lag, queue depth, Kafka consumer lag, LB error counts.
  - Separate panels for source vs target.
- Toggle strategy: choose your primary switch:
  - Istio `VirtualService` weighted routing.
  - LB weighted backends (NGINX, HAProxy, AWS ALB target weights).
  - Feature flag (LaunchDarkly, OpenFeature) at the edge or in the app.
- DNS plan if applicable: drop Route 53/Cloudflare TTLs to 30–60s 24 hours before cutover. Don’t rely on instant TTL changes.
- Runbook: minute-by-minute script with commands, dashboards, and “who does what if X happens.” Store it in git, versioned, reviewed.
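For the capacity check, a minimal Vegeta sketch; the endpoint, auth header, and rate are placeholders (size the rate to 1.5–2x your observed peak):

# targets.txt holds a production-like request (real path, real auth header)
cat > targets.txt <<EOF
GET https://payments-new.internal/api/v1/quotes
Authorization: Bearer ${CANARY_TOKEN}
EOF

# 15-minute attack at roughly 2x a 3k TPS peak, then a latency/error report
vegeta attack -targets=targets.txt -rate=6000/s -duration=15m | tee results.bin | vegeta report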
Proof point: at a payments client (≈3.5k TPS), these gates kept us honest. We aborted once on a rehearsal when p95 doubled due to a missing Redis warmup. Saved the real cutover.
2) Data: replicate, backfill, cutover, verify
Traffic is easy. Data is where migrations die. Treat reads and writes separately.
- Choose your replication:
  - Postgres: logical replication (`pglogical`), or AWS DMS for cross-cloud. Schema drift tracked with Flyway/Liquibase.
  - MySQL: `gh-ost` or `pt-online-schema-change` for live schema changes, plus binlog replication.
  - MongoDB: add the target as a hidden secondary, promote later.
  - Kafka topics: mirror with MirrorMaker 2 or Confluent Replicator; monitor consumer lag.
- Backfill:
  - Bulk load historical data first (snapshots). For Postgres: `COPY` from S3; for MySQL: `mysqldump` → import, or `mydumper`.
  - Verify row counts and checksums per table. Use `pg_checksums`, `CHECKSUM TABLE`, or a custom `hash(id||updated_at)` pass (sketched below).
- Dual-write or CDC:
  - Prefer app-level dual-write behind a feature flag for idempotent operations; include a retry queue.
  - If app changes are risky, use CDC (Debezium, AWS DMS) to stream changes (a Debezium sketch follows this list).
- Read routing:
  - Keep reads on the source until lag on the target is < 500ms and read-only checks pass.
- Cutover:
  - Freeze writes for 30–120 seconds (only if absolutely needed) or ensure idempotency guarantees.
  - Point writers to the target; validate the replication direction if you need a rollback path.
- Post-cutover verification:
  - Run diff reports on hot tables: row counts, sums of key columns, sample checksums.
  - Reconcile dead-letter queues and retry topics.
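If you go the CDC route, a minimal Debezium registration sketch; the connector name, hosts, credentials, and table list are placeholders, and the field names assume the Debezium 2.x Postgres connector:

# Register a Postgres CDC connector with Kafka Connect
curl -s -X POST http://kafka-connect:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ledger-cdc",
    "config": {
      "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
      "plugin.name": "pgoutput",
      "database.hostname": "source-db.internal",
      "database.port": "5432",
      "database.user": "debezium",
      "database.password": "changeme",
      "database.dbname": "payments",
      "topic.prefix": "ledger",
      "table.include.list": "public.ledger_entries"
    }
  }'

Treat the connector’s lag and snapshot progress the same way you treat replication lag: a hard gate, not a nice-to-have.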
Example: we moved a 1.2B-row ledger from RDS to Aurora with `pglogical` and a Debezium sidecar. Backfill took 9 hours; max lag during cutover: 220ms; write freeze: 0s; total user-visible impact: 0.
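For the checksum pass, a minimal sketch, assuming Postgres on both sides and an `updated_at` column; the table name and DSNs are illustrative:

# Run the same sampled checksum on source and target, then diff the output.
# The WHERE clause skips rows still inside the replication-lag window.
QUERY="SELECT count(*) AS row_count,
              sum(hashtext(id::text || updated_at::text)) AS checksum
       FROM ledger_entries
       WHERE updated_at < now() - interval '5 minutes';"

psql "$SOURCE_DSN" -Atc "$QUERY" > /tmp/source.chk
psql "$TARGET_DSN" -Atc "$QUERY" > /tmp/target.chk
diff /tmp/source.chk /tmp/target.chk && echo "PASS" || echo "FAIL: do not cut over"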
3) Traffic control: canary first, then weighted
Do not big-bang your traffic. Prove the new path under real load, then roll forward.
- Shadow traffic (optional but powerful): duplicate a slice of requests to the target with responses discarded.
  - Tools: Istio `mirror`, the NGINX `mirror` module, or a service mesh tap (see the mirror sketch after the weighted example below).
  - Compare response codes and latency distributions.
- Canary 1% → 5% → 25% → 50% → 100%, with pause gates at each step.
- Sticky sessions: turn them off or account for them. If that’s not possible, cut by cohort (e.g., device IDs, tenants).
- Circuit breakers: set conservative `maxConnections`, timeouts, and outlier detection.
Istio example (weighted split):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc
spec:
  hosts: ["payments.internal"]
  http:
    - route:
        - destination: { host: payments-old, subset: v1 }
          weight: 95
        - destination: { host: payments-new, subset: v2 }
          weight: 5
      retries:
        attempts: 2
        perTryTimeout: 500ms
      timeout: 2s
      fault:
        abort:
          percentage: { value: 0 }  # keep at 0; raise only for failure-injection drills
          httpStatus: 503
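For the shadow step, a minimal mirror sketch on the same hosts and subsets; `mirrorPercentage` assumes Istio 1.7 or newer:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-svc-shadow
spec:
  hosts: ["payments.internal"]
  http:
    - route:
        - destination: { host: payments-old, subset: v1 }
          weight: 100
      mirror:
        host: payments-new
        subset: v2
      mirrorPercentage:
        value: 5.0   # shadow 5% of requests; mirrored responses are discarded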
NGINX example (weighted upstream):
upstream payments {
server old:8080 weight=95 max_fails=1 fail_timeout=2s;
server new:8080 weight=5 max_fails=1 fail_timeout=2s;
}
Call out: if you depend on DNS, prep it a day early. Set TTL to 60s, validate resolvers honor it, and expect some long-lived clients to ignore you.
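A Route 53 sketch for that TTL prep; the zone ID, record name, and target are placeholders:

# Drop the TTL to 60s a day ahead so resolvers pick it up before cutover
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "payments.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "old-lb.example.com"}]
      }
    }]
  }'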
4) Observability: dashboards, SLOs, and hard stops
If you can’t see it, you can’t cut it over.
- Dashboards (Prometheus/Grafana):
  - Request rate, error rate, p50/p95/p99 latency, split by old vs new path.
  - DB metrics: replication lag, lock waits, deadlocks, connection pool saturation.
  - Infra: CPU, memory, GC pauses, thread pool queue length.
  - Queue/Kafka: consumer lag, DLQ rate.
- Log correlation: tag requests by route (`x-cutover-route: old|new|shadow`). Sample traces in Jaeger/Tempo with that tag.
- Synthetic checks: Blackbox Exporter hitting both old and new endpoints with a canary token.
- Alerts tuned for cutover (a Prometheus sketch follows this list):
  - “Cutover error rate > 0.5% for 3m” → auto-trigger the rollback playbook.
  - “DB replication lag > 2s for 2m” → pause the traffic shift.
- Business metrics: auth success rate, checkout conversion, payment success by PSP. Don’t ship if the business graph tanks.
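A minimal Prometheus rule sketch for the two hard gates; the metric and label names (`http_requests_total`, `route`, `pg_replication_lag_seconds`) depend on your exporters and are assumptions here:

groups:
  - name: cutover-gates
    rules:
      - alert: CutoverErrorRateHigh
        expr: |
          sum(rate(http_requests_total{job="payments", route="new", status=~"5.."}[1m]))
            / sum(rate(http_requests_total{job="payments", route="new"}[1m])) > 0.005
        for: 3m
        labels: { severity: page, action: rollback }
        annotations:
          summary: "Cutover gate tripped: 5xx > 0.5% for 3m on the new path"
      - alert: CutoverReplicationLagHigh
        expr: pg_replication_lag_seconds{instance="target-db"} > 2
        for: 2m
        labels: { severity: page, action: pause-shift }
        annotations:
          summary: "Replication lag > 2s for 2m; pause the traffic shift"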
We’ve shut down cutovers when p95 was clean but payment success dipped 1.2%. The culprit: the new egress IPs weren’t on the PSP’s allowlist. Metrics saved revenue.
5) Dry runs and game day
Rehearsals are where embarrassment is cheap.
- Clone prod shape: same schemas, masked data, similar cardinalities. Rehydrate caches. Replay last week’s traffic with GoReplay (formerly Gor) or mizu (replay sketch after this list).
- Run the full migration: backfill, CDC, canary, rollback. Twice.
- Inject failure (chaos, but surgical): kill a replica, spike latency, force a DNS cache miss. Measure MTTR.
- Time it: document each phase’s duration, expected lag, and the slowest step. This becomes your day-of timeline, with buffers.
- Fix toil: every manual step becomes a script (`make cutover`, `script/canary.sh`).
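For the traffic replay, a minimal GoReplay sketch; the port, file name, and rehearsal host are placeholders:

# Capture live traffic on a production edge node
gor --input-raw :8080 --output-file requests.gor

# Replay the capture against the rehearsal environment at 2x speed
gor --input-file "requests.gor|200%" --output-http "https://payments-rehearsal.internal"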
Deliverables: a runbook PR with updated timings, screenshots of dashboards at each gate, and one recorded rollback drill.
6) Day-of runbook (minute-by-minute)
This is the exact sequence we hand to teams. Tailor numbers to your SLOs.
- T‑30m: confirm on-calls, change window opened, dashboards green, feature flags prepared but off.
- T‑25m: enable shadow traffic to the target for 5% of requests; compare status codes and p95 (delta < 10%).
- T‑20m: start the canary with 1% of user traffic to the target.
- Hold 5m: monitor error rate (< 0.5%), p95 (< 2x), replication lag (< 500ms), CPU (< 70%).
- Ramp to 5% → hold 5m; fix any headroom issues (autoscale up if HPA lags).
- Ramp to 25% → hold 10m; warm caches explicitly (`/warmup` if you have it) and pre-prime the JIT.
- Ramp to 50% → hold 10m; verify business KPIs (auth, payments, conversion).
- Ramp to 100%; keep shadowing to the old path for 10m for parity checks.
- Flip writers to the target DB or confirm dual-writes are consistent; verify replication direction and lag < 500ms.
- Run post-cutover data diffs on hot tables; reconcile the DLQ.
- Announce “provisional success”; hold for 30–60m with owners watching dashboards.
- If any hard gate trips, execute the rollback immediately:
  - Set traffic weight back to 0% on the target (single command).
  - Swap DB writers back to the source.
  - Drain in-flight queues; preserve audit logs.
Commands we typically prepare:
# Istio traffic shift (ArgoCD-managed)
kubectl apply -f vs-weights-5.yaml
kubectl apply -f vs-weights-25.yaml
# NGINX weight flip (via GitOps)
git commit -m "canary 25%" && git push; argocd app sync payments-edge
# Feature flag kill switch
ld toggle payments.new_path off --reason "error_rate>0.5%"
# Read-only toggle (if needed for safety)
psql -c "ALTER SYSTEM SET default_transaction_read_only=on; SELECT pg_reload_conf();"
7) Aftercare: verify, decommission, and cost sanity
You’re not done at 100% traffic.
- Extended hold: 24–48 hours with elevated alerts. Nightly checksums on top tables (a CronJob sketch follows this list).
- Decommission in phases:
  - Keep the old path hot for 24h as a warm standby.
  - Remove dual-writes after reconciliation is clean for 48h.
  - Turn off CDC, then tear down old infra via Terraform with a change review.
- Cost and performance:
  - Compare p95/p99 and error rate week-over-week.
  - Validate autoscaling targets and right-size instances (don’t strand that 2x capacity forever).
- Postmortem (even if green): document surprises, timings vs expectations, and update the runbook.
- Security and compliance: update data flow diagrams, DPIA, and vendor scopes; notify auditors if required.
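For the nightly checksums, a minimal Kubernetes CronJob sketch; the image, script path, Secret, and ConfigMap names are placeholders, and the script is the checksum diff shown earlier:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: post-cutover-checksums
spec:
  schedule: "0 3 * * *"               # nightly, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: checksums
              image: postgres:16       # any image with psql
              command: ["/bin/sh", "-c", "/scripts/checksum-diff.sh"]
              env:
                - { name: SOURCE_DSN, valueFrom: { secretKeyRef: { name: cutover-dsns, key: source } } }
                - { name: TARGET_DSN, valueFrom: { secretKeyRef: { name: cutover-dsns, key: target } } }
              volumeMounts:
                - { name: scripts, mountPath: /scripts }
          volumes:
            - name: scripts
              configMap: { name: cutover-scripts, defaultMode: 0755 }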
At a subscription SaaS, this phase alone cut monthly infra spend by ~22%, mostly from right-sizing after the adrenaline wore off.
Tooling menu (pick what fits your stack)
- Infra and deployment: Terraform, ArgoCD, Spinnaker, GitHub Actions, Flux.
- Traffic: Istio, Envoy, NGINX, HAProxy, AWS ALB/NLB, Cloudflare, Route 53.
- Feature flags: LaunchDarkly, OpenFeature, Flipt.
- Data migration:
  - Postgres: `pglogical`, AWS DMS, `pg_dump`/`COPY`, `pg_checksums`.
  - MySQL: `gh-ost`, `pt-online-schema-change`, binlog replication.
  - MongoDB: replica set promotions.
  - Kafka: MirrorMaker 2, Confluent Replicator.
- Observability: Prometheus, Grafana, Jaeger/Tempo, Blackbox Exporter, Loki/ELK.
- Load/replay: k6, Vegeta, GoReplay, mizu.
If you need a second set of eyes, GitPlumbers can run a rehearsal, build the runbook, and sit in your war room so you sleep at night.
Key takeaways
- Zero downtime isn’t magic — it’s a checklist, SLOs, and rehearsals.
- Traffic and data are separate problems; treat them with different playbooks.
- Define hard abort criteria upfront; automate rollbacks and don’t negotiate with dashboards.
- Measure replication lag, error rate, p95 latency, and saturation; everything else is noise during cutover.
- Practice the migration on production-like data; game days beat slide decks.
Implementation checklist
- Freeze changes: code, schema, and infra, with an explicit change window and dry-run signed off by owners.
- Document SLOs and hard abort criteria (e.g., 5xx > 0.5% for 3m, p95 > 2x baseline for 5m, DB lag > 2s for 2m).
- Implement dual-run or shadow traffic to validate the target path before shifting user traffic.
- Set up data replication: logical replication (Postgres `pglogical`), `gh-ost` for MySQL, or DMS/CDC for cloud hops.
- Backfill and verify checksums/row counts; build a repeatable diff report with pass/fail gates.
- Wire traffic control: `Istio VirtualService` weights or LB weighted backends; set DNS TTL to 30–60s if needed.
- Pre-provision capacity to 1.5–2x expected peak; smoke test concurrency and failure modes.
- Automate rollback with a single switch (feature flag/LB weight/DNS); dry run a full rollback twice.
- Create a minute-by-minute runbook with owners, commands, dashboards, and comms channels.
- Run the cutover, monitor 4 golden signals, hold for stability, then decommission the old path in phases.
Questions we hear from teams
- Do I really need dual-writes for zero downtime?
- Not always, but you need continuous consistency. If app-level dual-writes are risky, use CDC (Debezium, DMS) to mirror changes. If you can guarantee idempotency and have a brief write freeze of 30–120 seconds, you can sometimes avoid dual-writes—just be honest about the SLO impact.
- What if the schema changes aren’t backward compatible?
- Use an expand/contract pattern (sketched after these questions). Deploy code that can read both versions, expand the schema (add new fields/tables), backfill, flip writers, then contract (drop old fields) only after traffic is 100% on the new path and retention windows close.
- Can we do this with DNS only?
- You can, but it’s the spiciest option. Prep TTLs 24h in advance, expect stragglers to cache for minutes or hours, and have a fast rollback path that doesn’t depend on DNS propagation. Prefer LB or service mesh weight flips when possible.
- How do queues and async jobs affect cutover?
- Drain or duplicate them. Pause consumers, snapshot offsets, mirror topics/queues, then resume consumers pointing at the target. Monitor consumer lag and DLQ rate as first-class cutover metrics.
- What’s a realistic rehearsal success signal?
- Two clean end-to-end rehearsals with max replication lag < 500ms, p95 latency within 20% of baseline under replayed load, and a rollback that takes under 3 minutes from decision to steady state.
- What if we can’t double capacity due to budget?
- Prioritize prewarming the target and cut at off-peak. Use tighter canary steps (1%→2%→5%). Keep cache hit rates high and push heavier reports/batch to a maintenance window. But don’t skip abort gates—budget cuts don’t change physics.
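For the expand/contract question above, a minimal SQL sketch; the table and column names are illustrative, and on a large table the backfill would run in batches:

-- Expand: additive, backward-compatible change
ALTER TABLE payments ADD COLUMN processor_ref text;   -- old code ignores it, new code can read it
UPDATE payments SET processor_ref = legacy_ref
 WHERE processor_ref IS NULL;                          -- backfill (batch this in practice)

-- Contract: only after 100% of traffic reads the new column and retention windows close
ALTER TABLE payments DROP COLUMN legacy_ref;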