The Zero-Downtime Cutover Checklist We Use When Failure Isn’t an Option
A field-tested, step-by-step plan to move a critical workload without users noticing—and without praying to the DNS gods.
You can’t engineer luck into a migration. You can engineer brakes, breadcrumbs, and a one-liner rollback.
The zero-downtime move: what it actually takes
You’ve done the dance: a “simple” cutover that turned into a 2 a.m. Slack war room and a retro full of sad graphs. I’ve been there—payments at a unicorn, ad serving at scale, and a monolith-to-K8s move that nearly bricked us because someone forgot DNS TTLs. Here’s the checklist we use at GitPlumbers when a critical workload has to move—with zero downtime and zero heroics.
If your rollback plan isn’t one command (or one flag flip), you don’t have a rollback plan.
This is a pragmatic, hands-on sequence with concrete tools. Use it as-is, or adapt to your stack. The shape is always the same: prove parity, move data, shift traffic gradually, and automate the brakes.
1) Establish guardrails and success criteria
Before touching production traffic, lock in the guardrails. You’re defining what “safe” looks like—and when to pull back.
- SLOs and rollback thresholds
  - Targets: `p95 < 200ms`, error rate `< 0.5%`, saturation `< 70%` on the target path.
  - Rollback triggers: any threshold breach for `> 5m` or `>= 3` consecutive alert firings.
- Golden signals instrumentation (Prometheus + Grafana)
  - Error rate: `sum(rate(http_requests_total{app="orders",status=~"5.."}[5m])) / sum(rate(http_requests_total{app="orders"}[5m]))`
  - Latency: `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{app="orders"}[5m])))`
- Health probes and backpressure
  - Kubernetes probes must reflect dependency health and readiness.

    ```yaml
    readinessProbe:
      httpGet: { path: /healthz/ready, port: 8080 }
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:
      httpGet: { path: /healthz/live, port: 8080 }
    ```

  - Add a circuit breaker (Envoy/Istio) to avoid cascading failure:

    ```yaml
    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata: { name: orders }
    spec:
      host: orders
      trafficPolicy:
        connectionPool:
          tcp: { maxConnections: 100 }
          http: { http1MaxPendingRequests: 1000, maxRequestsPerConnection: 100 }
        outlierDetection:
          consecutive5xxErrors: 5
          interval: 5s
          baseEjectionTime: 30s
          maxEjectionPercent: 50
    ```
- DNS prep (if any endpoint moves)
  - Reduce TTL to `60s` at least 48h ahead so caches age out.
  - Verify propagation:

    ```bash
    dig +nocmd api.example.com any +multiline +noall +answer @1.1.1.1
    dig +nocmd api.example.com any +multiline +noall +answer @8.8.8.8
    ```
- Runbook and single-command rollback
  - Example with Argo Rollouts:

    ```bash
    # Abort the canary and shift traffic back to stable
    kubectl argo rollouts abort orders
    # Or roll back to a known-good revision
    kubectl argo rollouts undo orders --to-revision=<N>
    ```
2) Prove parity with shadow traffic
No one should be your guinea pig. Mirror production requests to the target, but keep responses dark until you’re confident.
Istio mirroring (20% mirrored payloads, 0% user impact):
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: orders }
spec:
  hosts: ["orders.svc.cluster.local"]
  http:
    - route:
        - destination: { host: orders-v1, subset: stable }
          weight: 100
      mirror: { host: orders-v2, subset: candidate }
      mirrorPercentage: { value: 20.0 }
```

NGINX equivalent (if you’re not on a mesh):

```nginx
location /orders {
  proxy_pass http://orders_v1;
  mirror /_mirror;
}

location = /_mirror {
  internal;
  # $request_uri forwards the original request path/query to the mirror target
  proxy_pass http://orders_v2$request_uri;
}
```

Response diffing (budgeted mismatch allowed, e.g., formatting):
- Sample approach with `k6` to capture responses and compare:

  ```javascript
  // k6 script (scripts/parity.js)
  import http from 'k6/http';
  import crypto from 'k6/crypto';
  import { check } from 'k6';

  export default function () {
    const res = http.get(`${__ENV.BASE_URL}/orders/123`);
    check(res, { 'status 200': r => r.status === 200 });
    // Write a body hash to logs for diffing downstream
    console.log(JSON.stringify({ path: '/orders/123', sha: crypto.sha256(res.body, 'hex') }));
  }
  ```

- Track mismatch rate: `mismatches / total <= 0.1%`, or roll back and investigate.
Data read parity
- Mirror reads against both old and new DBs (read-only on target) and compare aggregates: row counts, sums, last-updated timestamps.
- Automate with a small comparer job that runs every minute and posts to Slack on drift > threshold.
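Here’s a minimal sketch of that comparer job in Python. It assumes both stores speak Postgres; the DSNs, table name, and Slack webhook are placeholders for whatever your stack actually uses:

```python
# parity_check.py: compare aggregate read parity between the old and new stores.
# Hypothetical sketch; adjust the queries to your schema.
import json
import urllib.request

import psycopg2  # swap drivers if the target store isn't Postgres

CHECKS = [
    ("row_count", "SELECT COUNT(*) FROM orders"),
    ("last_updated", "SELECT COALESCE(EXTRACT(EPOCH FROM MAX(updated_at)), 0) FROM orders"),
    ("total_cents", "SELECT COALESCE(SUM(total_cents), 0) FROM orders"),
]

def run_checks(dsn):
    """Run each aggregate query and return {check_name: numeric_value}."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        results = {}
        for name, sql in CHECKS:
            cur.execute(sql)
            results[name] = float(cur.fetchone()[0] or 0)
        return results

def post_to_slack(webhook_url, text):
    """Send a plain incoming-webhook message."""
    body = json.dumps({"text": text}).encode()
    req = urllib.request.Request(webhook_url, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

def compare(old_dsn, new_dsn, webhook_url, max_drift=0.001):
    old, new = run_checks(old_dsn), run_checks(new_dsn)
    for name, old_val in old.items():
        drift = abs(old_val - new[name]) / (abs(old_val) or 1.0)
        if drift > max_drift:  # 0.1% budget, matching the threshold above
            post_to_slack(webhook_url, f"parity drift on {name}: old={old_val} new={new[name]} ({drift:.2%})")
```

Run it as a CronJob (or plain cron) every minute during the shadow and hold phases.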
3) Move data safely: CDC, dual-writes, and backfill
Data is where “zero downtime” dies if you wing it. Treat the data path like a distributed system.
- Choose your migration path
  - MySQL: `gh-ost` or `pt-online-schema-change` for schema changes without locks.
  - Postgres: logical replication or `pglogical` for CDC; Debezium for cross-system.
  - Event/log-centric: Debezium + Kafka + target consumers.
- CDC connector (Debezium for Postgres) example
{ "name": "orders-connector", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "pg-primary", "database.port": "5432", "database.user": "debezium", "database.password": "*****", "database.dbname": "orders", "plugin.name": "pgoutput", "table.include.list": "public.orders,public.order_items", "tombstones.on.delete": "false" } } - Dual-writes with idempotency
- Write to both old and new stores behind a feature flag (
LaunchDarklyorUnleash). - Generate request IDs; make writes idempotent (
PUTsemantics or dedupe table with uniqueexternal_id). - Version your event schemas (
orders.v2) and accept both versions during the transition.
- Write to both old and new stores behind a feature flag (
- Backfill and verification
  - Batch backfill in small windows; throttle to keep replica lag < `500ms`.
  - Verification SQL example:

    ```sql
    SELECT COUNT(*) FROM orders;         -- equality
    SELECT MAX(updated_at) FROM orders;  -- recency
    SELECT SUM(total_cents) FROM orders; -- aggregates
    ```
- Schema management
  - Use `Flyway`/`Liquibase` migrations checked into Git. No snowflake DDL.
  - CI gate: migrations must run clean against a prod snapshot.
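To make the dual-write bullet concrete, here is a hedged sketch. The flag client, store objects, and `upsert` helper are illustrative stand-ins, not a specific library API:

```python
# dual_write.py: hypothetical sketch of flag-gated dual-writes with idempotency.
# `flags`, `old_store`, and `new_store` represent your feature-flag client and
# data-access layers; names and signatures are illustrative only.
import logging
import uuid

log = logging.getLogger("dual-write")

def write_order(order: dict, flags, old_store, new_store) -> str:
    # One idempotency key shared by both writes, so retries and replays dedupe.
    external_id = order.get("external_id") or str(uuid.uuid4())
    record = {**order, "external_id": external_id, "schema_version": "orders.v2"}

    # The old store stays the source of truth until cutover completes.
    old_store.upsert("orders", record, key="external_id")

    # Write the new store only when the flag is on. A failure here must not break
    # the user-facing path; log it and let CDC/backfill reconcile the gap.
    if flags.is_enabled("orders-dual-write"):
        try:
            new_store.upsert("orders", record, key="external_id")
        except Exception:
            log.exception("dual-write to new store failed for %s", external_id)

    return external_id
```

The properties that matter: one idempotency key for both writes, the old store stays authoritative, and a failed write to the new store degrades to reconciliation instead of a user-facing error.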
4) Progressive cutover: shift traffic with brakes on
Don’t “flip.” Bleed traffic gradually and let the system tell you when to proceed.
- Service-mesh weighted routing (Istio example)

  ```yaml
  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata: { name: orders }
  spec:
    hosts: ["orders.svc.cluster.local"]
    http:
      - route:
          - destination: { host: orders-v1 }
            weight: 90
          - destination: { host: orders-v2 }
            weight: 10
  ```

- Argo Rollouts canary with automated analysis

  ```yaml
  apiVersion: argoproj.io/v1alpha1
  kind: Rollout
  metadata: { name: orders }
  spec:
    replicas: 10
    strategy:
      canary:
        canaryService: orders-canary
        stableService: orders-stable
        steps:
          - setWeight: 10
          - pause: { duration: 300 }
          - analysis:
              templates:
                - templateName: error-rate-check
          - setWeight: 25
          - pause: { duration: 600 }
          - setWeight: 50
  ---
  apiVersion: argoproj.io/v1alpha1
  kind: AnalysisTemplate
  metadata: { name: error-rate-check }
  spec:
    metrics:
      - name: error-rate
        interval: 1m
        successCondition: result[0] < 0.005
        failureLimit: 3
        provider:
          prometheus:
            address: http://prometheus:9090
            query: |
              sum(rate(http_requests_total{app="orders",status=~"5..",role="canary"}[5m]))
              /
              sum(rate(http_requests_total{app="orders",role="canary"}[5m]))
  ```

- DNS weighted routing (if not on mesh)
  - Use `Route 53` weighted records with Terraform:

    ```hcl
    resource "aws_route53_record" "orders_v2" {
      zone_id        = var.zone_id
      name           = "api.example.com"
      type           = "A"
      set_identifier = "v2"

      weighted_routing_policy {
        weight = 10
      }

      # Alias records don't take a ttl; keep the 60s TTL on non-alias records instead.
      alias {
        name                   = aws_lb.v2.dns_name
        zone_id                = aws_lb.v2.zone_id
        evaluate_target_health = true
      }
    }
    ```
- Success gates (don’t hand-wave this)
  - Hold each step for 5–10 minutes. Advance only if:
    - `p95` within 10% of baseline
    - error rate below budget and stable
    - saturation not growing
  - Automatic rollback if any check fails for 2–5 consecutive intervals; a gate-check sketch follows this list.
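If you are not using Argo Rollouts' AnalysisTemplate, the same gate can run as a small script in your pipeline. A minimal sketch, reusing the error-rate query from above; the Prometheus URL and thresholds are placeholders:

```python
# gate_check.py: hypothetical canary gate that queries Prometheus and exits
# non-zero so the pipeline aborts and rolls back. Query and thresholds mirror
# the examples in this post; adjust to your environment.
import json
import sys
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090/api/v1/query"
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{app="orders",status=~"5..",role="canary"}[5m])) / '
    'sum(rate(http_requests_total{app="orders",role="canary"}[5m]))'
)

def query_prometheus(promql: str) -> float:
    url = f"{PROM_URL}?{urllib.parse.urlencode({'query': promql})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def gate(hold_intervals=10, error_budget=0.005, max_breaches=3, interval_s=60) -> bool:
    """Hold the current weight; fail fast on sustained SLO breaches."""
    breaches = 0
    for _ in range(hold_intervals):
        error_rate = query_prometheus(ERROR_RATE_QUERY)
        breaches = breaches + 1 if error_rate > error_budget else 0
        print(f"canary error rate={error_rate:.4f} consecutive_breaches={breaches}")
        if breaches >= max_breaches:
            return False  # trip the brakes
        time.sleep(interval_s)
    return True  # held the full window without tripping

if __name__ == "__main__":
    sys.exit(0 if gate() else 1)  # non-zero exit triggers the rollback one-liner
```

Wire the non-zero exit to the rollback command from section 1 so the brakes stay automatic.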
5) Day-of migration runbook (make it boring)
Here’s the short version we actually run. Make it a Slack checklist the team can tick.
- Change freeze + comms: Announce window, page on-call rotations, and open a dedicated Slack channel.
- Reduce DNS TTLs (if used): Confirm low TTLs are active in all zones.
- Pre-warm capacity: Scale v2 to 2× expected peak for the first hour; warm caches.
- Feature flags staged: Dual-writes/reads flagged off → ready to toggle.
- Dashboards: p95, error rate, saturation, and queue depths pinned on a wallboard.
- Start shadow traffic (if not already running) and confirm parity metrics.
- Enable dual-writes: Confirm idempotency counters and dead-letter queues are empty and alerting.
- Begin canary: 10% → 25% → 50% with analysis at each step.
- Hold at 50%: Run synthetic checks (login, checkout, refunds) every 60s (see the probe sketch after this checklist).
- Go 100% once stable for 15–30 minutes.
- Post-cut verification: Compare order counts, revenue, and key business KPIs across systems.
- Celebrate quietly: Don’t decommission anything yet.
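For the synthetic checks during the hold at 50%, a bare-bones probe loop looks like this. The journey endpoints are hypothetical stand-ins for real login/checkout/refund flows; swap the prints for your paging integration:

```python
# synthetic_checks.py: minimal probe loop for the hold period.
# Endpoint paths below are hypothetical; point them at real critical flows.
import time
import urllib.request

JOURNEYS = {
    "login": "https://api.example.com/healthz/login",
    "checkout": "https://api.example.com/healthz/checkout",
    "refunds": "https://api.example.com/healthz/refunds",
}

def probe(name: str, url: str, timeout_s: int = 5) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception as exc:  # timeouts, 5xx, connection errors
        print(f"synthetic {name} failed: {exc}")
        return False

def run(interval_s: int = 60) -> None:
    while True:
        failed = [name for name, url in JOURNEYS.items() if not probe(name, url)]
        if failed:
            print(f"ALERT: synthetic journeys failing: {failed}")  # page the war room
        time.sleep(interval_s)
```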
Useful one-liners we actually use:
```bash
# Traffic weight change (Istio via kubectl and kustomize variables)
kubectl -n prod apply -k overlays/prod/weights/25-75

# Quick parity smoke (vegeta)
echo "GET https://api.example.com/orders/123" | vegeta attack -duration=60s | vegeta report

# Verify Route 53 record weights
aws route53 list-resource-record-sets --hosted-zone-id Z123 | jq '.ResourceRecordSets[] | select(.Name=="api.example.com.")'
```

6) Post-cut validation and safe decommission
Don’t rip out the old path the second the graphs look green. Hold, verify, then retire.
- Hold period: Minimum of 24–72 hours at 100% traffic with elevated alerting.
- Synthetic user journeys: `k6`/Synthetics/`Locust` scheduled to run critical flows every minute.
- Data reconciliation: Run diff jobs hourly; alert on >0.1% drift.
- Remove dual-writes after hold; keep CDC tailing for an extra day to catch stragglers.
- Rightsize: Drop v2 capacity back to normal; capture new baseline SLOs and costs.
- Decommission: Archive infra as code PR removing v1. Tag artifacts, snapshot DB, and set a 7-day restore window.
7) Common failure modes (seen in the wild) and fixes
- Hidden client timeouts: Mobile SDKs at 10s while the server is at 5s → retries dogpile. Fix: align timeouts; set `Retry-After` and circuit breakers.
- Sticky sessions with layer-7 load balancers. Fix: externalize session state (Redis) before cutover.
- DNS caching gremlins (Java `networkaddress.cache.ttl`, CDNs). Fix: lower client TTLs; use mesh or ALB cutover when possible.
- Inconsistent JSON (whitespace, field order). Fix: canonicalize before diffing; allow a mismatch budget.
- Background jobs double-processing during dual-writes. Fix: idempotency keys and dedupe tables.
- Queue draining: Messages in-flight to old consumer. Fix: pause producers; drain old queue; resume to new consumer.
- AI-generated code paths with subtle differences (e.g., default timezones, float rounding). Fix: add contract tests and property checks; run shadow traffic longer. Call us if you’re in vibe-code hell.
What “good” looks like (numbers that matter)
- Time-to-100%: 45–120 minutes for most services with 3–4 steps.
- Error budget burn: <5% of monthly budget during migration.
- Backfill + CDC caught up: lag <500ms sustained during cut.
- Business KPIs: revenue/order volume/sign-ups within 1–3% of baseline during hold.
If you want a deeper dive into our playbook, we’ve documented variations for ALB-to-Envoy, monolith-to-K8s, and regional failovers here: GitPlumbers Zero-Downtime Migration Playbook and a real-world case study: Payment Pipeline Cutover with Argo Rollouts.
Key takeaways
- Your migration SLO is binary: either users notice or they don’t. Build guardrails (probes, budgets, rollbacks) before you touch traffic.
- Prove parity with shadow traffic before you move a single user. Use mirroring plus response-diffing and error budget math.
- Move data safely with CDC and dual-writes. Idempotency and versioned contracts are not optional.
- Cut over progressively with weighted routing and automatic rollback triggers wired to SLOs.
- Make the day-of boring: freeze, reduce TTLs, pre-warm capacity, staff a war room, and script the steps.
Implementation checklist
- Define SLOs and success criteria (p95 latency, error rate, saturation) with rollback thresholds.
- Instrument golden signals in Prometheus/Grafana; set alerts and dashboards before cutover.
- Reduce DNS TTLs (if DNS is in play); verify from multiple resolvers.
- Add readiness/liveness probes and circuit breakers; verify backpressure works.
- Set up traffic shadowing to the target (Istio mirror or NGINX mirror); compare responses.
- Prepare data migration: CDC pipeline (Debezium/Kafka), backfill, dual-writes with idempotency.
- Dry-run the entire plan in staging with prod-like load and data; document rollback commands.
- Execute progressive canary (Argo Rollouts or Istio weights); gate increases on SLO health checks.
- Run day-of checklist: comms, freezes, capacity, feature flags, dashboards, and war room staffing.
- Post-cut: hold period, synthetic checks, remove dual-writes, right-size capacity, and decommission safely.
Questions we hear from teams
- Do I need a service mesh for zero-downtime?
- No. A mesh (Istio/Linkerd) makes mirroring and weighted routing easier, but you can do zero-downtime with NGINX/Envoy, DNS weights (Route 53), and good discipline. The key is progressive traffic shifting and automated SLO gates.
- How do I handle schema changes without downtime?
- Use online schema tools (`gh-ost`, `pt-online-schema-change`) and versioned contracts. Apply backward-compatible changes first (expand), run dual-writes, backfill, then switch reads, then contract.
- What’s the minimum I need to automate?
- Automate rollbacks, traffic weight changes, and SLO checks. Humans can watch dashboards, but the brakes must be automatic. If you can’t test the rollback in staging, you don’t have it.
- What about cost and performance during the window?
- Expect 1.5–2× capacity for an hour to absorb mirroring and canary overhead. Plan it, get sign-off, and right-size after the hold period.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
