The Zero‑Downtime Migration Checklist We Actually Use in Production

A pragmatic, step-by-step playbook to cut over critical workloads without waking up PagerDuty or your CFO.


You don’t get points for being brave. You get points for never paging the on-call. Here’s the exact checklist we’ve used to move payments, auth, and serving paths with no customer-visible downtime. No fairy dust—just controlled blast radius, boring automation, and rollback you can actually trust.

The only migrations I regret were the ones we couldn’t roll back in one command.

We’ll assume Kubernetes + GitOps, but the patterns apply to VMs too. Tools referenced: Terraform, ArgoCD, Argo Rollouts, Istio, Prometheus, Debezium, PostgreSQL logical replication, k6, gor, OpenFeature/LaunchDarkly.


1) Baseline, Blast Radius, and Success Criteria

Before touching manifests, lock your targets. I’ve seen teams skip this and argue mid-cutover about whether a 1% error spike is “fine.” Don’t do that.

  • Inventory: upstream callers, downstream stores, queues, third-party APIs, cron jobs, webhooks, feature flag dependencies.
  • Traffic profile: peak RPS, p95/99 latency, payload sizes, read/write split, long-lived connections.
  • SLOs and error budget: what’s acceptable during migration? Example: maintain 99.9% availability, p95 latency +15% max, 5xx < 0.3%.
  • Kill switch: a single flag to route all user traffic back to the old path.

Sample PromQL you can paste into Grafana panels and alert rules:

# Error rate (5xx) over 5m per service
sum(rate(http_requests_total{service="orders",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="orders"}[5m]))

# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="orders"}[5m])) by (le))

# Saturation (CPU) for target deployment
sum(rate(container_cpu_usage_seconds_total{namespace="prod",pod=~"orders-.*"}[5m]))
/
sum(kube_pod_container_resource_limits{namespace="prod",pod=~"orders-.*",resource="cpu"})

Checkpoint: Document SLOs, KPIs, and rollback criteria in the runbook. If a VP asks, you should be able to point to a line that says, “Rollback if 5xx > 0.3% for 2 minutes.”
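
To make that rollback line executable rather than aspirational, encode it as an alert. A minimal Prometheus rule sketch, reusing the error-rate query above (group name, labels, and exact thresholds are placeholders you would align with your runbook):

groups:
- name: orders-migration
  rules:
  - alert: OrdersMigration5xxBudgetBreach
    expr: |
      sum(rate(http_requests_total{service="orders",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="orders"}[5m])) > 0.003
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "orders 5xx above 0.3% for 2 minutes: execute the rollback runbook"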


2) Rehearsal Environment and Traffic Replay

Zero-downtime happens because you already did it yesterday in rehearsal.

  1. Provision a prod-like env with Terraform and sync apps with ArgoCD.
terraform workspace select staging
terraform apply -var-file=staging.tfvars
argocd app create orders --repo https://github.com/acme/ops --path apps/orders --dest-namespace staging --dest-server https://kubernetes.default.svc
argocd app sync orders
  2. Shadow real traffic from prod into staging. With Istio:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts: ["orders.svc.cluster.local"]
  http:
  - route:
    - destination: { host: orders.prod }
    mirror: { host: orders.staging }
    mirrorPercentage: { value: 100.0 }

Or use gor to mirror from an ingress:

sudo gor --input-raw :80 --output-http http://orders.staging.svc.cluster.local
  3. Synthetic load with k6 to hit edge cases:
// k6 script (orders.js)
import http from 'k6/http';
import { sleep, check } from 'k6';
export let options = { vus: 100, duration: '10m' };
export default function () {
  const res = http.post('https://api.example.com/orders', JSON.stringify({sku: 'ABC', qty: 1}), { headers: { 'Content-Type': 'application/json' }});
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(0.1);
}
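
To make the rehearsal gate self-enforcing, k6 thresholds can mark the run failed when the budget is blown. A minimal sketch of the expanded options, assuming a 250 ms p95 budget (roughly prod baseline plus 10%) and the 0.1% error target from the checkpoint below:

// Expanded options for orders.js: k6 fails the run if either budget is blown
export let options = {
  vus: 100,
  duration: '10m',
  thresholds: {
    http_req_duration: ['p(95)<250'], // assumed p95 budget in ms
    http_req_failed: ['rate<0.001'],  // error rate under 0.1%
  },
};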

Checkpoint: staging handles peak RPS at p95 within +10% of prod; error rate under 0.1%. If not, fix before you touch prod.


3) Data Plan: Expand/Contract, Backfill, Dual Writes

Downtime almost always hides in the data layer. The move that saves you: expand/contract + CDC.

  • Expand: add new schema without breaking old code.
  • Backfill: move data incrementally.
  • Dual-write: write to both old and new behind a flag.
  • Flip reads: switch read path.
  • Contract: remove old fields/tables after a quiet period.

Example: adding shipping_zone to orders and moving to a new service.

Expand DDL:

-- Expand phase: add nullable/optional structures first
ALTER TABLE orders ADD COLUMN shipping_zone TEXT NULL;
CREATE TABLE orders_v2 (
  id BIGINT PRIMARY KEY,
  user_id BIGINT NOT NULL,
  sku TEXT NOT NULL,
  qty INT NOT NULL,
  shipping_zone TEXT,
  created_at TIMESTAMP NOT NULL
);

CDC backfill with Debezium (Kafka → consumer):

{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "*****",
    "database.dbname": "app",
    "plugin.name": "pgoutput",
    "table.include.list": "public.orders",
    "slot.name": "orders_slot"
  }
}
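
The consumer half of that pipeline isn’t shown above. Here’s a minimal sketch of the Kafka-to-orders_v2 leg using kafkajs and pg; the topic name and the upsert are assumptions based on the connector config, not a prescribed implementation:

// CDC consumer sketch: apply Debezium change events to orders_v2 idempotently
import { Kafka } from 'kafkajs';
import { Pool } from 'pg';

const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'orders-backfill' });
const pg = new Pool({ connectionString: process.env.ORDERS_V2_DSN });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'app.public.orders', fromBeginning: true }); // assumed topic name
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return; // tombstone
      const event = JSON.parse(message.value.toString());
      const row = event.payload?.after;
      if (!row) return; // deletes handled by a separate path
      // Debezium's default time handling emits created_at as epoch microseconds; adjust if your connector differs.
      // The upsert keeps the backfill idempotent across retries and replays.
      await pg.query(
        `INSERT INTO orders_v2 (id, user_id, sku, qty, shipping_zone, created_at)
         VALUES ($1, $2, $3, $4, $5, to_timestamp($6::bigint / 1000000.0))
         ON CONFLICT (id) DO UPDATE SET
           user_id = EXCLUDED.user_id, sku = EXCLUDED.sku, qty = EXCLUDED.qty,
           shipping_zone = EXCLUDED.shipping_zone, created_at = EXCLUDED.created_at`,
        [row.id, row.user_id, row.sku, row.qty, row.shipping_zone, row.created_at]
      );
    },
  });
}

run().catch((err) => { console.error(err); process.exit(1); });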

App dual-write guarded by a flag (OpenFeature shown):

// Node/TypeScript snippet (OpenFeature server SDK)
import { OpenFeature } from '@openfeature/server-sdk';
const client = OpenFeature.getClient();

const dualWrite = await client.getBooleanValue('orders.dual_write_v2', false);
await oldRepo.save(order);                   // old path stays the source of truth
if (dualWrite) await newRepo.save(orderV2);  // new path only while the flag is on

Flip reads behind a flag (sketch after the queries below), then let both systems run in parallel for days. Verify row counts and sampled deep diffs:

SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM orders_v2;

-- Sampled consistency check
SELECT o.id FROM orders o
LEFT JOIN orders_v2 v ON o.id = v.id
WHERE (v.id IS NULL OR v.shipping_zone <> o.shipping_zone)
AND random() < 0.001;
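
The read flip itself is the same flag pattern applied to the query path. A sketch, reusing the client and repos from the dual-write snippet (findById is a hypothetical repository method) and the orders.read_from_v2 flag used later in the cutover:

// Read path behind a flag: the old store stays the fallback until parity is proven
const readFromV2 = await client.getBooleanValue('orders.read_from_v2', false);
const order = readFromV2
  ? await newRepo.findById(orderId)
  : await oldRepo.findById(orderId);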

Checkpoint: backfill lag < 5s, consistency mismatches < 0.1% sampled, dual-write latency impact < 5ms p95.


4) Traffic Control: Progressive Delivery That Rolls Back Itself

Use the mesh to steer traffic and a rollout controller to automagically revert if SLOs degrade.

Argo Rollouts canary with Prometheus analysis:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders
spec:
  strategy:
    canary:
      canaryService: orders-canary
      stableService: orders-stable
      trafficRouting:
        istio:
          virtualService:
            name: orders
            routes: [ primary ]
      steps:
      - setWeight: 1
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 5
      - pause: { duration: 180 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 20
      - pause: { duration: 300 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 50
      - pause: { duration: 300 }
      - setWeight: 100

AnalysisTemplate gating on 5xx and p95:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: orders-slo
spec:
  metrics:
  - name: errors
    interval: 60s
    count: 2
    failureLimit: 1
    successCondition: result[0] <= 0.003
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="orders-canary",status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="orders-canary"}[2m]))
  - name: latency
    interval: 60s
    count: 3
    failureLimit: 1
    successCondition: result[0] <= 0.25
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="orders-canary"}[2m])) by (le))

If a metric breaches its condition, Rollouts aborts the canary and shifts traffic back to stable. No heroics required.

Checkpoint: automation can move traffic 1% → 100% with zero manual edits; rollback proven in rehearsal.


5) Observability, Runbooks, and Circuit Breakers

You can’t fix what you can’t see. Make failure boring and reversible.

  • Dashboards: one panel per KPI; overlay rollout weights on graphs.
  • Alerts: only on migration KPIs with page-to-ack < 5 min. Everything else is muted during the window.
  • Golden signals: latency, traffic, errors, saturation. Add business KPIs (checkout success rate, auth success).
  • Circuit breakers: set upstream timeouts/retries and fallback paths.

Envoy/Istio circuit breaker example:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
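
The same bullet calls for upstream timeouts and retries; in Istio those live on the VirtualService route rather than the DestinationRule. A sketch, assuming the same orders host (tune the budgets to your own p95):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts: ["orders"]
  http:
  - route:
    - destination: { host: orders }
    timeout: 2s            # overall cap per request, including retries
    retries:
      attempts: 2
      perTryTimeout: 500ms
      retryOn: 5xx,reset,connect-failure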

Runbook skeleton:

  • Preconditions checked (SLOs, backfill, flags, alerts muted, paging rotation aware)
  • argocd app sync orders
  • kubectl argo rollouts get rollout orders -w
  • Hold points after 5%, 20%, 50% with sign-off in Slack change thread
  • Rollback command documented and tested

Checkpoint: a new SRE can run the play without a senior whispering over Zoom.


6) Cutover Procedure (The Hour You’ll Remember)

This is the sequence we used moving a card authorization path at a fintech. We kept 99.95% availability; the only spike was a transient +12% p95 for 3 minutes at 50% canary—well within budget.

  1. Freeze: change freeze for non-migration repos; mute noisy alerts.
  2. Dark launch: deploy new stack at 0% traffic; warm caches.
  3. Shadow traffic on: compare responses for sampled requests; log diffs only (see the diff-logging sketch after this list).
  4. Dual writes on: enable orders.dual_write_v2 for a 10% cohort (e.g., internal users) for 30 minutes.
  5. Start rollout: 1% → 5% → 20% → 50% with automated analysis gates. Hold at 50% for 10–15 minutes; run queries:
-- Hot checks
SELECT COUNT(*) FROM orders WHERE created_at > now() - interval '10 minutes';
SELECT COUNT(*) FROM orders_v2 WHERE created_at > now() - interval '10 minutes';
  6. Flip reads: enable the orders.read_from_v2 flag for canary pods only; verify parity metrics.
  7. Go 100%: complete the rollout; keep dual writes for 24–72 hours depending on risk.
  8. Watch like a hawk: error budgets, business KPIs; have a human approve after 30 minutes to declare success.
  9. Rollback triggers: any KPI breach beyond thresholds for >2 minutes or on-call gut feel. Rollback commands:
# Traffic rollback: abort the in-progress canary and shift traffic back to stable
kubectl argo rollouts abort orders
# (already promoted? `kubectl argo rollouts undo orders` rolls back to the previous revision)

# Feature flag rollback (pseudocode: use your flag provider's CLI or API)
openfeature set orders.read_from_v2=false
openfeature set orders.dual_write_v2=false
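
The diff-logging from step 3 is just “call both, compare, serve the old answer.” A sketch with hypothetical internal endpoints, a 1% sample rate, and Node 18+ global fetch:

// Shadow-compare a sampled request against the new stack; log diffs, never affect the caller
async function getOrderWithShadow(orderId: string): Promise<unknown> {
  const oldResp = await fetch(`https://orders.internal/v1/orders/${orderId}`).then((r) => r.json());
  if (Math.random() < 0.01) { // sample to respect downstream rate limits
    fetch(`https://orders-v2.internal/v1/orders/${orderId}`)
      .then((r) => r.json())
      .then((newResp) => {
        if (JSON.stringify(newResp) !== JSON.stringify(oldResp)) {
          console.warn('shadow_diff', { orderId, oldResp, newResp });
        }
      })
      .catch((err) => console.warn('shadow_error', { orderId, err: String(err) }));
  }
  return oldResp; // the old path remains the source of truth
}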

Checkpoint: time from “breach detected” to “traffic stable on old path” < 2 minutes.


7) Post-Cutover Hardening and Cleanup

You’re not done until the old path is boring again.

  • Soak period: 24–72 hours. Keep dual writes, sample diffs nightly.
  • Contract: remove old schema after N days without diffs.
-- Contract phase after soak
ALTER TABLE orders DROP COLUMN shipping_zone; -- if only the column moved fully to v2
DROP TABLE orders_legacy; -- if the whole table moved: rename the old table to orders_legacy at read-flip, drop it after N quiet days
  • Kill flags: migrate from dynamic to static config; delete flags to avoid accidental toggles.
  • Cost and perf: scale down old infra; right-size new autoscaling based on observed p95/99.
  • Postmortem: even if it went fine. Capture “we got lucky” items; add chaos tests.

Checkpoint: no dual-writes, no dangling infra, and a PR that removes every migration flag.


8) Traps I’ve Seen (So You Don’t)

  • Long-lived connections: gRPC streams or WebSockets pin to old pods. Use drain policies and connection max age (see the drain sketch after this list).
  • Sticky sessions: ELB/NLB stickiness keeps sending users to the old stack. Disable it or shorten the TTL during cutover.
  • Third-party rate limits: shadow traffic can double call volume—throttle or stub.
  • Idempotency: dual-writes can duplicate side effects. Require idempotency keys.
  • Token scopes and JWKs: auth migrations fail on unnoticed JWK rotation. Pre-distribute keys and cache TTLs.
  • Cross-DC latency: backfills across regions will surprise you. Compress payloads, batch, and run close to source.
  • Clock skew: CDC and dedupe logic break with skew. Enforce NTP across fleets.
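
For the long-lived connection trap, the drain knobs are mostly pod lifecycle settings plus a server-side connection max age (for example gRPC's MaxConnectionAge). A sketch of the Kubernetes side, with values you would tune to your traffic:

# Deployment pod template: give in-flight streams time to drain before shutdown
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: orders
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"] # delay SIGTERM so LBs and the mesh stop sending new connections first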

If two or more of these smell like your stack, make the rehearsal nastier: chaos test the exact failure you fear.


What “Good” Looks Like (Receipts)

  • Rollout duration: 45–90 minutes with three hold points.
  • Max p95 delta: < 15% during canary; returns to baseline post cutover.
  • 5xx rate: < 0.2% during entire window.
  • Rollback test: executed at least once in rehearsal and once in production dry-run at 1%.
  • Cleanup PR merged within 7 days.

We’ve done this at retailers on Black Friday traffic and fintechs during market open. The secret isn’t a magic tool; it’s respecting the data, rehearsing like you mean it, and making rollback the easiest button in the room.

If you want a second set of eyes or someone to run the game day, GitPlumbers lives for this stuff.


Key takeaways

  • Define SLOs and guardrail metrics before you touch a single kube manifest.
  • Rehearse with production-like traffic via shadowing and synthetic load; automate the rollback path.
  • Use expand/contract for data, dual-writes behind flags, and backfill with change data capture.
  • Gate cutovers with Prometheus-based analysis and progressive delivery (Argo Rollouts/Istio).
  • Treat the migration as a product release: runbooks, comms plan, rollback checkpoint, and postmortem.

Implementation checklist

  • Inventory dependencies and define blast radius; set success criteria and error budget.
  • Create prod-like rehearsal env via Terraform/ArgoCD; shadow traffic with gor or mesh traffic mirroring.
  • Design data plan: expand/contract schema, backfill with CDC (Debezium/DMS), dual-writes behind flags.
  • Instrument KPIs: p95 latency, 5xx rate, saturation; wire PromQL into automated analysis.
  • Implement progressive delivery: 1%-5%-20%-50%-100% canary with Argo Rollouts or Istio routing.
  • Prepare rollback: versioned artifacts, database compatibility window, traffic switch back.
  • Execute cutover runbook with hold points and sign-offs; verify data consistency.
  • Clean up: kill flags, contract schema, remove old infra, and run a blameless review.

Questions we hear from teams

How do I handle stateful sessions during migration?
Eliminate stateful sessions by externalizing session state (Redis/Memcached) and using short TTLs. During cutover, reduce stickiness TTL to seconds or disable it. Drain old pods with connection max age and preStop hooks to close sessions gracefully.
What if my database can’t support CDC?
Use native logical replication where possible (Postgres, MySQL). If not, schedule chunked backfills with application-level version markers and strictly idempotent writes. Expect a longer soak period and consider a brief read-only window for truly hard edges (rare).
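
A chunked backfill can be as plain as a keyed range insert run on a schedule; Postgres syntax shown for illustration, with :last_id as a placeholder your job advances after each chunk:

-- One chunk of the backfill; safe to re-run thanks to ON CONFLICT
INSERT INTO orders_v2 (id, user_id, sku, qty, shipping_zone, created_at)
SELECT id, user_id, sku, qty, shipping_zone, created_at
FROM orders
WHERE id > :last_id AND id <= :last_id + 10000
ON CONFLICT (id) DO NOTHING;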
Can I skip feature flags and just deploy the new version?
You can, but then rollback is slow and blunt. Flags make read/write paths and behaviors independently switchable, which shortens MTTR dramatically. If you hate vendor lock-in, use OpenFeature with your provider of choice.
How do I prove readiness to leadership?
Share rehearsal metrics: peak RPS sustained, p95/99 latencies within budget, automated rollback firing in staging, data parity reports with mismatch rates, and the exact runbook with decision points and owners.
What’s the minimum toolset to pull this off?
Kubernetes or a reliable orchestrator, GitOps (ArgoCD/Flux), traffic shaping (Istio/NGINX), observability (Prometheus/Grafana), and a feature flag system. Everything else is optimization.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Run a Migration Game Day with GitPlumbers
  • Read the Black Friday cutover case study
