The Zero‑Downtime Migration Checklist That Survives Real Traffic (and Real Humans)

A hands-on, step-by-step runbook for moving a critical workload without paging the team into oblivion. Includes checkpoints, exit criteria, and battle-tested tooling.

Zero downtime isn’t a promise—it’s an SLO plus a rollback button you’re willing to press.

The uncomfortable truth about “zero downtime”

I’ve watched “zero downtime” migrations go sideways for reasons that had nothing to do with Kubernetes YAML. The usual killers are unknown dependencies, silent data drift, and no objective rollback trigger—so you end up doing “hero debugging” at 2am while your CFO refreshes a revenue dashboard.

What actually works is boring:

  • Treat the migration like an SRE exercise: SLOs, error budgets, and a rollback plan.
  • Run old and new in parallel until metrics (not vibes) say it’s safe.
  • Make rollback a button, not a meeting.

This checklist assumes a critical workload (payments, auth, order pipeline) and a move like one of these:

  • On-prem → cloud
  • VM/ASG → Kubernetes
  • Legacy service → rewritten service
  • RDS → Aurora / Postgres major version / sharded cluster

Pre-flight: define success, rollback, and observability (before you migrate anything)

If you don’t measure it, you can’t prove it. And if you can’t prove it, you won’t sleep.

Checklist (pre-flight exit criteria):

  1. SLOs are written down (user-impacting, not “CPU is fine”). Examples:
    • Availability: 99.95% over 30 days
    • Latency: p95 < 250ms, p99 < 800ms
    • Error rate: 5xx < 0.2% over any 5-minute window
  2. Rollback triggers are explicit:
    • “If p95 increases by >20% for 10 minutes”
    • “If checkout conversion drops >2% for 15 minutes”
  3. Golden signals exist for both old and new: latency, traffic, errors, saturation.
  4. Distributed tracing is end-to-end with consistent sampling.

Minimum viable tooling that doesn’t lie:

  • OpenTelemetry + Prometheus + Grafana
  • Loki (or your log stack) with correlation IDs
  • A synthetics runner (Grafana k6 or Datadog Synthetics)

Example: Prometheus alert that reflects user pain (not node trivia):

# prometheus/alerts.yaml
- alert: Service5xxTooHigh
  expr: |
    sum(rate(http_server_requests_total{job="checkout",status=~"5.."}[5m]))
    /
    sum(rate(http_server_requests_total{job="checkout"}[5m]))
    > 0.005
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "Checkout 5xx rate > 0.5% for 5m"
    runbook: "https://runbooks.yourco/checkout/rollback"
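The latency trigger from the checklist ("p95 up >20% for 10 minutes") can be encoded the same way, comparing the current p95 against the same window an hour earlier. A sketch, assuming a standard histogram metric name (`http_server_request_duration_seconds_bucket`) that may differ in your stack:

```yaml
- alert: CheckoutP95Regression
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m])) by (le))
    >
    1.20 *
    histogram_quantile(0.95,
      sum(rate(http_server_request_duration_seconds_bucket{job="checkout"}[5m] offset 1h)) by (le))
  for: 10m
  labels:
    severity: page
```

The `offset 1h` baseline is a simple choice; compare against the stable deployment's p95 instead once both stacks emit the same metric.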

If you can’t answer “what number makes us roll back?” in under 10 seconds, you don’t have a rollback plan.

Build the parallel path: infrastructure and routing you can control

Zero downtime requires running two production-capable paths at the same time: old and new. That’s the whole game.

Checklist (parallel path exit criteria):

  • New stack is deployed via Terraform/Helm/ArgoCD with reproducible environments.
  • New stack can handle at least peak traffic × 1.2 (headroom matters during incidents).
  • Routing supports gradual traffic shifting and instant rollback.

Routing options (pick one you can operate):

  • Kubernetes: Argo Rollouts (works with NGINX Ingress, Istio, AWS ALB)
  • Service mesh: Istio / Linkerd traffic splits
  • Edge: Cloudflare, Fastly, or LB weighted target groups

Concrete example: Argo Rollouts canary with automated pause/analysis:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 20
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout
      steps:
        - setWeight: 1
        - pause: {duration: 5m}
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 25
        - pause: {duration: 15m}
        - setWeight: 50
        - pause: {duration: 20m}
        - setWeight: 100
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/yourco/checkout:2.3.7
          ports:
            - containerPort: 8080

Checkpoint: before any user traffic moves, confirm:

  • p95 latency under synthetic load on the new stack
  • No error spikes when you run shadow traffic (more on that below)
  • Dependency parity: secrets, TLS, outbound allowlists, queue topics, IAM

Data migration without downtime: expand/contract + backfill + proof

I’ve seen teams nail canary routing and still crater because the database plan was “run the migration and pray.” For critical workloads, you want expand/contract and compatibility.

Rule of thumb: during the migration window, old and new must both survive:

  • reading the “old” shape
  • writing the “new” shape (or vice versa)

1) Expand: add new structures without breaking old code

Examples:

  • Add nullable columns
  • Add new tables
  • Add new indexes concurrently (Postgres) to avoid lock pain
-- postgres expand: add column safely
ALTER TABLE payments ADD COLUMN processor_trace_id text;

-- avoid blocking writes with a long lock (Postgres)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_created_at ON payments(created_at);

2) Backfill: migrate data in chunks with idempotency

Backfill jobs fail. Plan for it.

  • Chunk by primary key or timestamp
  • Keep it idempotent
  • Emit progress metrics (rows_processed, lag_seconds)
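The three bullets above boil down to a keyset-paginated driver loop. A minimal sketch in Python, with in-memory stand-ins for the old and new stores (in production, `fetch_chunk` is a `SELECT ... WHERE id > $1 ORDER BY id LIMIT n` and `write_rows` is an UPSERT or a `WHERE new_column IS NULL` update):

```python
def backfill(fetch_chunk, write_rows, start_after=0):
    """Chunked, idempotent backfill driver.

    fetch_chunk(after_id) returns the next batch of rows ordered by id
    (empty list when done); write_rows must be idempotent (UPSERT-style)
    so a crashed job can restart from the returned cursor without
    double-applying. Returns (rows_processed, last_id) so you can emit
    progress metrics and resume.
    """
    processed, last_id = 0, start_after
    while True:
        rows = fetch_chunk(last_id)
        if not rows:
            return processed, last_id
        write_rows(rows)                 # idempotent write to the new shape
        processed += len(rows)
        last_id = rows[-1]["id"]         # keyset cursor doubles as checkpoint


# In-memory stand-ins for the old and new stores:
source = [{"id": i, "legacy_trace": f"t{i}"} for i in range(1, 11)]
migrated = {}

def fetch_chunk(after_id, size=4):
    return [r for r in source if r["id"] > after_id][:size]

def write_rows(rows):
    for r in rows:
        migrated[r["id"]] = r["legacy_trace"]  # dict write: naturally idempotent

done, cursor = backfill(fetch_chunk, write_rows)
```

Keyset pagination (id > cursor) beats OFFSET because it stays fast on large tables and the cursor is a natural resume point after a crash.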

3) Verify: prove the backfill worked

Verification ideas that catch real bugs:

  • Row counts per partition/date
  • Checksums on canonical fields
  • Sampled read comparisons (old vs new)
# crude but effective: compare counts by day
psql $OLD -c "select date(created_at), count(*) from orders group by 1 order by 1" > old.txt
psql $NEW -c "select date(created_at), count(*) from orders group by 1 order by 1" > new.txt
diff -u old.txt new.txt
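For the sampled read comparisons, hashing canonical fields on both sides is enough to catch drift. A sketch in Python (the key and field names are assumptions; feed it the same sampled rows from old and new):

```python
import hashlib

def row_fingerprint(row, fields):
    """Stable checksum over canonical fields, so old/new rows compare cheaply."""
    canonical = "|".join(str(row[f]) for f in fields)
    return hashlib.sha256(canonical.encode()).hexdigest()

def compare_samples(old_rows, new_rows, key, fields):
    """Return keys whose canonical fields differ in (or are missing from) new."""
    new_by_key = {r[key]: r for r in new_rows}
    return [
        old[key]
        for old in old_rows
        if (match := new_by_key.get(old[key])) is None
        or row_fingerprint(old, fields) != row_fingerprint(match, fields)
    ]
```

Run it over a random sample per partition or day; a non-empty result is a backfill bug, not noise.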

4) Contract: remove old structures after you’re truly done

Only after:

  • traffic is 100% on new
  • you’ve cleared a full business cycle (end-of-month, peak day, billing run)
  • you’ve validated no old consumers exist

If you must keep two systems in sync during the transition, use CDC:

  • Debezium → Kafka to stream DB changes
  • Or dual writes (but dual writes are where consistency goes to die unless you’re careful)
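For the Debezium route, a minimal Postgres connector registration looks roughly like this (hostnames, table list, and topic prefix are assumptions; exact property names vary by Debezium version):

```json
{
  "name": "payments-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "old-db.internal",
    "database.port": "5432",
    "database.user": "cdc",
    "database.dbname": "payments",
    "table.include.list": "public.payments,public.orders",
    "slot.name": "migration_cdc",
    "topic.prefix": "migration"
  }
}
```

POST it to your Kafka Connect REST endpoint; the replication slot (`slot.name`) is the thing to watch for lag and to drop when you're done.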

Checkpoint: define acceptable replication lag (e.g., CDC lag < 5s p95) and page on it.

Progressive delivery runbook: shift traffic like you mean it

This is where “we’ll just do a canary” becomes an actual operational practice.

Checklist (progressive delivery exit criteria):

  1. Canary steps defined (1% → 5% → 25% → 50% → 100%) with time-boxed pauses.
  2. Automated analysis gates on error rate + latency + saturation.
  3. Rollback is automatic (or at least a single command).
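With Argo Rollouts, the analysis gate from item 2 can be an AnalysisTemplate backed by Prometheus. A sketch, assuming the Prometheus address, a `role` label distinguishing canary pods, and a 0.5% error threshold:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-canary-gate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 3          # abort the rollout after three failed measurements
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_server_requests_total{job="checkout",role="canary",status=~"5.."}[5m]))
            /
            sum(rate(http_server_requests_total{job="checkout",role="canary"}[5m]))
      successCondition: result[0] < 0.005
```

Reference the template from the Rollout's canary steps and the "rollback is automatic" requirement falls out for free.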

Shadow traffic (highly recommended for critical paths):

  • Mirror a copy of requests to the new service
  • Don’t return its response to the user
  • Compare results asynchronously (careful with side effects)

Tooling patterns:

  • Istio request mirroring
  • App-level tee with a queue (safer for side effects)
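For the Istio option, request mirroring is a few lines of VirtualService config. A sketch with assumed service names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-shadow
spec:
  hosts:
    - checkout
  http:
    - route:
        - destination:
            host: checkout-stable
          weight: 100            # users still get the stable response
      mirror:
        host: checkout-new       # receives a fire-and-forget copy
      mirrorPercentage:
        value: 10.0              # start small; mirrored load is real load
```

Remember the mirrored requests still execute: writes, outbound calls, and queue publishes on the new service need to be no-ops or sandboxed.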

Load testing example (k6) that you can run against canary/stable:

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },
    { duration: '5m', target: 200 },
    { duration: '2m', target: 0 },
  ],
  thresholds: {
    http_req_failed: ['rate<0.002'],
    http_req_duration: ['p(95)<250', 'p(99)<800'],
  },
};

export default function () {
  const res = http.post(`${__ENV.BASE_URL}/checkout`, JSON.stringify({ cartId: 'abc' }), {
    headers: { 'Content-Type': 'application/json' },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}

Checkpoint: before increasing weight, confirm:

  • p95/p99 delta between stable and canary is within bounds (e.g., <10%)
  • 5xx and 4xx anomalies aren’t increasing
  • DB metrics: connection pool saturation, slow queries, replication lag
  • Business KPI sanity: conversion, auth success, order completion

Cutover day: the boring, disciplined version (that actually works)

If you want zero downtime, cutover day should feel anticlimactic. When it feels “exciting,” it’s usually because you skipped rehearsal.

Cutover checklist (print this):

  1. Change freeze on dependent systems (or at least “no schema changes”).
  2. Open an ops bridge with roles:
    • Decision owner (can say “rollback”)
    • Driver (executes steps)
    • Observer (watches dashboards, calls anomalies)
  3. Confirm dashboards for stable vs canary are side-by-side:
    • RED (Rate, Errors, Duration)
    • saturation (CPU, memory, threadpools, DB)
  4. Confirm rollback command and permissions:
# Argo Rollouts: abort the canary (instant rollback to stable)
kubectl argo rollouts abort checkout

# Or promote if all green
kubectl argo rollouts promote checkout
  5. Shift traffic in planned increments; don’t “jump to 50%” because someone’s impatient.
  6. If a rollback trigger fires, roll back first, analyze second.

Exit criteria for 100%:

  • At least 30–60 minutes at peak-ish traffic (or a representative synthetic peak)
  • No SLO violations
  • No “unknown unknowns” in logs (timeouts to that one legacy SOAP service you forgot)

Aftercare: keep the parachute packed, then decommission deliberately

The most expensive downtime I’ve seen happened after a “successful” cutover—because someone tore down the old system immediately and discovered a hidden consumer two days later.

Aftercare checklist:

  • Keep old stack running (warm) for one full business cycle.
  • Monitor for stragglers:
    • old endpoints still receiving requests
    • old DB still being written to
  • Remove dual writes / CDC only after verification is clean.
  • Decommission in layers:
    • routing → app → queues → DB replicas → data retention/export
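Straggler detection works well as a plain alert on any traffic still reaching deprecated routes (the legacy job label is an assumption):

```yaml
- alert: DeprecatedStackStillInUse
  expr: sum(rate(http_server_requests_total{job="checkout-legacy"}[15m])) > 0
  for: 30m
  labels:
    severity: ticket
  annotations:
    summary: "Legacy checkout path still receiving traffic; find the consumer before decommission"
```

Pair it with the DB equivalent (writes against old tables) and you'll catch the hidden consumer before the teardown does.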

Metrics you should report to leadership (proof points that aren’t hand-wavy):

  • Customer-visible downtime: 0 minutes
  • Error budget consumed during migration: e.g., <5% of monthly budget
  • Performance delta: p95 latency -12% (or whatever reality is)
  • MTTR readiness: rollback tested in staging and executed once in prod drill (<5 minutes)

Where GitPlumbers fits (when you’re done doing this alone)

If you’re staring at a critical workload with a pile of legacy constraints (or AI-generated “helpful” changes sprinkled throughout) and you need a migration plan that won’t blow up your quarter, GitPlumbers does this kind of code rescue + production cutover work for teams that can’t afford drama.

  • Zero-downtime migration help: /services/zero-downtime-migrations
  • Vibe code cleanup / AI code refactoring before cutover: /services/vibe-code-cleanup
  • Example case study (payments cutover): /case-studies/payments-cutover

CTA: If you want, share your current architecture (diagram + top 5 dependencies + SLOs). We’ll tell you what’s missing from your rollback plan in one call.


Key takeaways

  • “Zero downtime” is an SLO + rollback plan, not a vibe.
  • You don’t cut over once—you run old and new side-by-side until metrics prove safety.
  • Data is the long pole: use expand/contract, backfill + verification, and CDC where needed.
  • Automate progressive delivery and rollback (Argo Rollouts/Flagger) so humans don’t fat-finger prod.
  • Define exit criteria for every phase: latency, error rate, saturation, and business KPIs.

Implementation checklist

  • Define SLOs and a hard rollback trigger (e.g., 5xx > 0.5% for 5m, p95 latency +20%).
  • Instrument both stacks with the same OpenTelemetry semantic conventions and dashboards.
  • Stand up parallel infrastructure via IaC (Terraform) and confirm capacity headroom (CPU, memory, DB connections).
  • Implement routing control (service mesh / ingress / LB) that supports canary + instant rollback.
  • Execute database expand/contract migrations; avoid breaking schema changes during the cutover window.
  • If data must move: backfill + verify (row counts, checksums) and consider CDC (Debezium/Kafka).
  • Run synthetic tests and load tests against the new path (k6) before shifting users.
  • Progressively shift traffic (1% → 5% → 25% → 50% → 100%) with automated analysis.
  • Freeze risky changes, staff an on-call bridge, and pre-stage comms + decision owners.
  • After cutover: keep the old path warm until you’ve cleared a full business cycle; then decommission deliberately.

Questions we hear from teams

Is “zero downtime” realistic for database migrations?
Yes, but not with big-bang breaking schema changes. Use expand/contract, keep old+new compatible during the transition, backfill in chunks, and verify. For cross-system moves, CDC (e.g., Debezium→Kafka) is usually safer than ad-hoc dual writes.
What’s the minimum set of metrics to gate a canary?
At minimum: request rate, error rate (5xx and key 4xx), p95/p99 latency, and saturation (CPU/memory plus DB connections/replication lag). If you can add one business KPI (conversion/auth success), do it.
How do we avoid “unknown dependencies” during cutover?
Inventory outbound calls (service mesh telemetry helps), scan configs for endpoints/queues/topics, and run shadow traffic. Keep the old system warm post-cutover to catch stragglers, and alert on any traffic to deprecated routes.
When should we choose blue/green vs canary?
Blue/green is simpler when state is stable and rollback is easy (stateless services, minimal data coupling). Canary is better when you need to observe behavior under real traffic gradually—especially for performance-sensitive or dependency-heavy workloads.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Get a zero-downtime migration review
  • Fix risky AI-generated changes before cutover
