The Zero-Downtime Cutover Checklist We Use When Failure Isn’t an Option

A field-tested, step-by-step plan to move a critical workload without users noticing—and without praying to the DNS gods.

You can’t engineer luck into a migration. You can engineer brakes, breadcrumbs, and a one-liner rollback.

The zero-downtime move: what it actually takes

You’ve done the dance: a “simple” cutover that turned into a 2 a.m. Slack war room and a retro full of sad graphs. I’ve been there—payments at a unicorn, ad serving at scale, and a monolith-to-K8s move that nearly bricked us because someone forgot DNS TTLs. Here’s the checklist we use at GitPlumbers when a critical workload has to move—with zero downtime and zero heroics.

If your rollback plan isn’t one command (or one flag flip), you don’t have a rollback plan.

This is a pragmatic, hands-on sequence with concrete tools. Use it as-is, or adapt to your stack. The shape is always the same: prove parity, move data, shift traffic gradually, and automate the brakes.

1) Establish guardrails and success criteria

Before touching production traffic, lock in the guardrails. You’re defining what “safe” looks like—and when to pull back.

  • SLOs and rollback thresholds
    • Targets: p95 < 200ms, error rate < 0.5%, saturation < 70% on the target path.
    • Rollback triggers: any threshold breach for > 5m or >= 3 consecutive alert firings (see the alert-rule sketch at the end of this step).
  • Golden signals instrumentation (Prometheus + Grafana)
    • Error rate:
      sum(rate(http_requests_total{app="orders",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{app="orders"}[5m]))
    • Latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{app="orders"}[5m])))
  • Health probes and backpressure
    • Kubernetes probes must reflect dependency health and readiness.
      readinessProbe:
        httpGet: { path: /healthz/ready, port: 8080 }
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet: { path: /healthz/live, port: 8080 }
    • Add a circuit breaker (Envoy/Istio) to avoid cascading failure:
      apiVersion: networking.istio.io/v1beta1
      kind: DestinationRule
      metadata: { name: orders }
      spec:
        host: orders
        trafficPolicy:
          connectionPool:
            tcp: { maxConnections: 100 }
            http: { http1MaxPendingRequests: 1000, maxRequestsPerConnection: 100 }
          outlierDetection:
            consecutive5xxErrors: 5
            interval: 5s
            baseEjectionTime: 30s
            maxEjectionPercent: 50
  • DNS prep (if any endpoint moves)
    • Reduce TTL to 60s at least 48h ahead so caches age out.
    • Verify propagation and the new TTL from multiple resolvers (most public resolvers refuse ANY queries, so ask for the record type you actually serve):
      dig +nocmd api.example.com A +multiline +noall +answer @1.1.1.1
      dig +nocmd api.example.com A +multiline +noall +answer @8.8.8.8
  • Runbook and single-command rollback
    • Example with Argo Rollouts:
      kubectl argo rollouts abort orders      # stop the canary; traffic returns to stable
      kubectl argo rollouts undo orders       # roll the workload back to the previous revision
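
Wire that rollback trigger to a real alert so nobody has to eyeball a dashboard. A minimal sketch using the Prometheus Operator's PrometheusRule CRD, assuming the same http_requests_total labels as the queries above:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata: { name: orders-cutover-guardrails }
  spec:
    groups:
      - name: cutover
        rules:
          - alert: OrdersErrorBudgetBreach
            # Error rate on the orders path above 0.5% for 5 minutes
            expr: |
              sum(rate(http_requests_total{app="orders",status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{app="orders"}[5m])) > 0.005
            for: 5m
            labels: { severity: page }
            annotations:
              summary: "orders error rate breached the cutover budget; abort the rollout"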

2) Prove parity with shadow traffic

No one should be your guinea pig. Mirror production requests to the target, but keep responses dark until you’re confident.

  • Istio mirroring (20% mirrored payloads, 0% user impact):

    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata: { name: orders }
    spec:
      hosts: ["orders.svc.cluster.local"]
      http:
        - route:
            - destination: { host: orders-v1 }
              weight: 100
          mirror: { host: orders-v2 }
          mirrorPercentage: { value: 20.0 }
  • NGINX equivalent (if you’re not on a mesh):

    location /orders {
      proxy_pass http://orders_v1;
      mirror /_mirror;
    }
    location = /_mirror {
      internal;
      proxy_pass http://orders_v2$request_uri;
    }
  • Response diffing (budgeted mismatch allowed, e.g., formatting)

    • Sample approach with k6 to capture responses and compare:
      // k6 script (scripts/parity.js)
      import http from 'k6/http';
      import { check } from 'k6';
      import crypto from 'k6/crypto';

      export default function () {
        const res = http.get(`${__ENV.BASE_URL}/orders/123`);
        check(res, { 'status 200': r => r.status === 200 });
        // Log a body hash so a downstream job can diff old vs. new responses
        console.log(JSON.stringify({ path: '/orders/123', sha: crypto.sha256(res.body, 'hex') }));
      }
    • Track the mismatch rate: keep mismatches / total <= 0.1%; anything above that budget means stop and investigate before shifting real users.
  • Data read parity

    • Mirror reads against both old and new DBs (read-only on target) and compare aggregates: row counts, sums, last-updated timestamps.
    • Automate with a small comparer job that runs every minute and posts to Slack on drift > threshold.
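
Here's a minimal comparer sketch, assuming both stores speak Postgres and are reachable via psql; the OLD_DSN/NEW_DSN connection strings are placeholders:

  #!/usr/bin/env bash
  # Compare row count, revenue sum, and recency between the old and new orders DBs.
  # OLD_DSN / NEW_DSN are hypothetical, e.g. postgres://reader@pg-old/orders
  set -euo pipefail
  QUERY="SELECT count(*), coalesce(sum(total_cents),0), max(updated_at) FROM orders"
  old=$(psql "$OLD_DSN" -tA -c "$QUERY")
  new=$(psql "$NEW_DSN" -tA -c "$QUERY")
  if [ "$old" != "$new" ]; then
    echo "DRIFT old=[$old] new=[$new]"   # post this to Slack / page on drift
    exit 1
  fi
  echo "parity OK: $old"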

3) Move data safely: CDC, dual-writes, and backfill

Data is where “zero downtime” dies if you wing it. Treat the data path like a distributed system.

  • Choose your migration path
    • MySQL: gh-ost or pt-online-schema-change for schema changes without locks.
    • Postgres: logical replication or pglogical for CDC; Debezium for cross-system.
    • Event/log-centric: Debezium + Kafka + target consumers.
  • CDC connector (Debezium for Postgres) example
    {
      "name": "orders-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "pg-primary",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "*****",
        "database.dbname": "orders",
        "plugin.name": "pgoutput",
        "table.include.list": "public.orders,public.order_items",
        "tombstones.on.delete": "false"
      }
    }
  • Dual-writes with idempotency
    • Write to both old and new stores behind a feature flag (LaunchDarkly or Unleash).
    • Generate request IDs; make writes idempotent (PUT semantics or a dedupe constraint on a unique external_id); see the SQL sketch at the end of this step.
    • Version your event schemas (orders.v2) and accept both versions during the transition.
  • Backfill and verification
    • Batch backfill in small windows; throttle to keep replica lag < 500ms.
    • Verification SQL example:
      SELECT COUNT(*) FROM orders;              -- equality
      SELECT MAX(updated_at) FROM orders;       -- recency
      SELECT SUM(total_cents) FROM orders;      -- aggregates
  • Schema management
    • Use Flyway/Liquibase migrations checked into Git. No snowflake DDL.
    • CI gate: migrations must run clean against a prod snapshot.
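
To make the dual-write idempotency concrete, here's a minimal Postgres sketch; the external_id column and values are illustrative, not your schema:

  -- A unique constraint on the client-supplied request ID makes retries safe
  ALTER TABLE orders ADD COLUMN IF NOT EXISTS external_id uuid;
  CREATE UNIQUE INDEX IF NOT EXISTS orders_external_id_key ON orders (external_id);

  -- A retried or double-delivered write with the same external_id becomes a no-op
  INSERT INTO orders (external_id, customer_id, total_cents)
  VALUES ('5f0c1c2e-9b7a-4c1d-8e2f-3a4b5c6d7e8f', 42, 1999)
  ON CONFLICT (external_id) DO NOTHING;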

4) Progressive cutover: shift traffic with brakes on

Don’t “flip.” Bleed traffic gradually and let the system tell you when to proceed.

  • Service-mesh weighted routing (Istio example)
    apiVersion: networking.istio.io/v1beta1
    kind: VirtualService
    metadata: { name: orders }
    spec:
      hosts: ["orders.svc.cluster.local"]
      http:
        - route:
            - destination: { host: orders-v1 }
              weight: 90
            - destination: { host: orders-v2 }
              weight: 10
  • Argo Rollouts canary with automated analysis
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    metadata: { name: orders }
    spec:
      replicas: 10
      strategy:
        canary:
          canaryService: orders-canary
          stableService: orders-stable
          steps:
            - setWeight: 10
            - pause: { duration: 300 }
            - analysis:
                templates:
                  - templateName: error-rate-check
            - setWeight: 25
            - pause: { duration: 600 }
            - setWeight: 50
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata: { name: error-rate-check }
    spec:
      metrics:
        - name: error-rate
          interval: 1m
          successCondition: result[0] < 0.005
          failureLimit: 3
          provider:
            prometheus:
              address: http://prometheus:9090
              query: |
                sum(rate(http_requests_total{app="orders",status=~"5..",role="canary"}[5m]))
                /
                sum(rate(http_requests_total{app="orders",role="canary"}[5m]))
  • DNS weighted routing (if not on mesh)
    • Use Route 53 weighted records with Terraform:
      resource "aws_route53_record" "orders_v2" {
        zone_id = var.zone_id
        name    = "api.example.com"
        type    = "A"
        set_identifier = "v2"
        weighted_routing_policy { weight = 10 }
        alias {
          name                   = aws_lb.v2.dns_name
          zone_id                = aws_lb.v2.zone_id
          evaluate_target_health = true
        }
        # No ttl here: alias records can't set one; they inherit the alias target's TTL
      }
  • Success gates (don’t hand-wave this)
    • Hold each step for 5–10 minutes. Advance only if:
      • p95 within 10% of baseline
      • error rate below budget and stable
      • saturation not growing
    • Automatic rollback if any check fails for 2–5 consecutive intervals.
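
Drive each step from the CLI instead of hand-editing YAML under pressure. These are standard Argo Rollouts plugin commands:

  # Watch canary weight, analysis runs, and pod health in one place
  kubectl argo rollouts get rollout orders --watch

  # Advance to the next step manually (for pauses without a duration)
  kubectl argo rollouts promote orders

  # Hit the brakes: send all traffic back to stable
  kubectl argo rollouts abort orders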

5) Day-of migration runbook (make it boring)

Here’s the short version we actually run. Make it a Slack checklist the team can tick.

  1. Change freeze + comms: Announce window, page on-call rotations, and open a dedicated Slack channel.
  2. Reduce DNS TTLs (if used): Confirm low TTLs are active in all zones.
  3. Pre-warm capacity: Scale v2 to 2× expected peak for the first hour; warm caches.
  4. Feature flags staged: Dual-writes/reads flagged off → ready to toggle.
  5. Dashboards: p95, error rate, saturation, and queue depths pinned on a wallboard.
  6. Start shadow traffic (if not already running) and confirm parity metrics.
  7. Enable dual-writes: Confirm idempotency counters and dead-letter queues are empty and alerting.
  8. Begin canary: 10% → 25% → 50% with analysis at each step.
  9. Hold at 50%: Run synthetic checks (login, checkout, refunds) every 60s.
  10. Go 100% once stable for 15–30 minutes.
  11. Post-cut verification: Compare order counts, revenue, and key business KPIs across systems.
  12. Celebrate quietly: Don’t decommission anything yet.

Useful one-liners we actually use:

# Traffic weight change (Istio via kubectl and kustomize variables)
kubectl -n prod apply -k overlays/prod/weights/25-75

# Quick parity smoke (vegeta)
echo "GET https://api.example.com/orders/123" | vegeta attack -duration=60s | vegeta report

# Verify Route53 record weights
aws route53 list-resource-record-sets --hosted-zone-id Z123 | jq '.ResourceRecordSets[] | select(.Name=="api.example.com.")'
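
For the synthetic checks in step 9, here's a minimal k6 sketch of one critical journey; the endpoints and thresholds are placeholders, not your real flows:

  // scripts/synthetic.js -- run with: k6 run --vus 1 --duration 60s scripts/synthetic.js
  import http from 'k6/http';
  import { check, sleep } from 'k6';

  export const options = {
    thresholds: { http_req_failed: ['rate<0.005'], http_req_duration: ['p(95)<200'] },
  };

  export default function () {
    const login = http.post(`${__ENV.BASE_URL}/login`, JSON.stringify({ user: 'synthetic' }),
      { headers: { 'Content-Type': 'application/json' } });
    check(login, { 'login 200': r => r.status === 200 });

    const order = http.get(`${__ENV.BASE_URL}/orders/123`);
    check(order, { 'order 200': r => r.status === 200 });
    sleep(1);
  }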

6) Post-cut validation and safe decommission

Don’t rip out the old path the second the graphs look green. Hold, verify, then retire.

  • Hold period: Minimum of 24–72 hours at 100% traffic with elevated alerting.
  • Synthetic user journeys: k6/Synthetics/Locust scheduled to run critical flows every minute.
  • Data reconciliation: Run diff jobs hourly; alert on >0.1% drift.
  • Remove dual-writes after hold; keep CDC tailing for an extra day to catch stragglers.
  • Rightsize: Drop v2 capacity back to normal; capture new baseline SLOs and costs.
  • Decommission: Archive infra as code PR removing v1. Tag artifacts, snapshot DB, and set a 7-day restore window.
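
One way to schedule the hourly diff job is a Kubernetes CronJob wrapping the parity script from step 2; the image and secret names below are placeholders:

  apiVersion: batch/v1
  kind: CronJob
  metadata: { name: orders-reconcile }
  spec:
    schedule: "0 * * * *"   # hourly
    jobTemplate:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: reconcile
                image: registry.internal/orders-parity:latest   # hypothetical image with the psql comparer
                envFrom:
                  - secretRef: { name: orders-db-dsns }          # provides OLD_DSN / NEW_DSN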

7) Common failure modes (seen in the wild) and fixes

  • Hidden client timeouts: Mobile SDKs at 10s while server is 5s → retries dogpile. Fix: align timeouts; set Retry-After and circuit breakers.
  • Sticky sessions with layer-7 load balancers. Fix: externalize session state (Redis) before cutover.
  • DNS caching gremlins (Java networkaddress.cache.ttl, CDNs). Fix: lower client TTLs; use mesh or ALB cutover when possible.
  • Inconsistent JSON (whitespace, field order). Fix: canonicalize before diffing (see the sketch after this list); allow a mismatch budget.
  • Background jobs double-processing during dual-writes. Fix: idempotency keys and dedupe tables.
  • Queue draining: Messages in-flight to old consumer. Fix: pause producers; drain old queue; resume to new consumer.
  • AI-generated code paths with subtle differences (e.g., default timezones, float rounding). Fix: add contract tests and property checks; run shadow traffic longer. Call us if you’re in vibe-code hell.
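
For the JSON canonicalization fix, a small JavaScript sketch that sorts keys recursively so field order and whitespace never count as drift:

  // oldBody / newBody: response bodies captured from v1 and v2 during shadowing
  function canonicalize(value) {
    if (Array.isArray(value)) return value.map(canonicalize);
    if (value !== null && typeof value === 'object') {
      return Object.keys(value).sort().reduce((acc, key) => {
        acc[key] = canonicalize(value[key]);
        return acc;
      }, {});
    }
    return value;
  }

  // Compare canonical forms, not raw bytes
  const same = JSON.stringify(canonicalize(JSON.parse(oldBody))) ===
               JSON.stringify(canonicalize(JSON.parse(newBody)));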

What “good” looks like (numbers that matter)

  • Time-to-100%: 45–120 minutes for most services with 3–4 steps.
  • Error budget burn: <5% of monthly budget during migration.
  • Backfill + CDC caught up: lag <500ms sustained during cut.
  • Business KPIs: revenue/order volume/sign-ups within 1–3% of baseline during hold.

If you want a deeper dive into our playbook, we’ve documented variations for ALB-to-Envoy, monolith-to-K8s, and regional failovers here: GitPlumbers Zero-Downtime Migration Playbook and a real-world case study: Payment Pipeline Cutover with Argo Rollouts.


Key takeaways

  • Your migration SLO is binary: either users notice or they don’t. Build guardrails (probes, budgets, rollbacks) before you touch traffic.
  • Prove parity with shadow traffic before you move a single user. Use mirroring plus response-diffing and error budget math.
  • Move data safely with CDC and dual-writes. Idempotency and versioned contracts are not optional.
  • Cut over progressively with weighted routing and automatic rollback triggers wired to SLOs.
  • Make the day-of boring: freeze, reduce TTLs, pre-warm capacity, staff a war room, and script the steps.

Implementation checklist

  • Define SLOs and success criteria (p95 latency, error rate, saturation) with rollback thresholds.
  • Instrument golden signals in Prometheus/Grafana; set alerts and dashboards before cutover.
  • Reduce DNS TTLs (if DNS is in play); verify from multiple resolvers.
  • Add readiness/liveness probes and circuit breakers; verify backpressure works.
  • Set up traffic shadowing to the target (Istio mirror or NGINX mirror); compare responses.
  • Prepare data migration: CDC pipeline (Debezium/Kafka), backfill, dual-writes with idempotency.
  • Dry-run the entire plan in staging with prod-like load and data; document rollback commands.
  • Execute progressive canary (Argo Rollouts or Istio weights); gate increases on SLO health checks.
  • Run day-of checklist: comms, freezes, capacity, feature flags, dashboards, and war room staffing.
  • Post-cut: hold period, synthetic checks, remove dual-writes, right-size capacity, and decommission safely.

Questions we hear from teams

Do I need a service mesh for zero-downtime?
No. A mesh (Istio/Linkerd) makes mirroring and weighted routing easier, but you can do zero-downtime with NGINX/Envoy, DNS weights (Route 53), and good discipline. The key is progressive traffic shifting and automated SLO gates.
How do I handle schema changes without downtime?
Use online schema tools (`gh-ost`, `pt-online-schema-change`) and versioned contracts. Apply backward-compatible changes first (expand), run dual-writes, backfill, then switch reads, then contract.
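For MySQL, the expand step with gh-ost looks roughly like this (host, schema, and column are placeholders; check the flags against your gh-ost version):

  gh-ost \
    --host=mysql-primary --database=orders --table=orders \
    --alter="ADD COLUMN external_id VARCHAR(36) NULL" \
    --chunk-size=1000 --max-load=Threads_running=25 \
    --cut-over=default --execute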
What’s the minimum I need to automate?
Automate rollbacks, traffic weight changes, and SLO checks. Humans can watch dashboards, but the brakes must be automatic. If you can’t test the rollback in staging, you don’t have it.
What about cost and performance during the window?
Expect 1.5–2× capacity for an hour to absorb mirroring and canary overhead. Plan it, get sign-off, and right-size after the hold period.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about your upcoming cutover
Download the Zero-Downtime Migration Checklist (PDF)
