The Cutover Checklist We Use When Moving Money: Zero-Downtime Migration, Step by Step

You don’t get a second chance with critical workloads. Here’s the exact, battle-tested checklist we use to move prod with no customer-facing blips—and fast rollbacks when reality bites.

If your rollback isn’t one command and under a minute, it’s not a rollback—it’s a second migration.

The situation you’ve lived through

You’ve got a payments API doing 2k rps, 99.9% SLO, and a CFO who will personally call you if auth declines spike 0.1%. You’re moving it from a crusty K8s 1.20 cluster to a new hardened 1.29 setup, plus a Postgres upgrade. There’s no maintenance window. I’ve seen this fail when teams treat DNS TTLs as a rollout plan and hope CloudFront caches behave. Don’t. Here’s the cutover checklist we use at GitPlumbers when the workload is critical and downtime is not an option.

1) Pre-flight: align on SLOs, guardrails, and rollback

If you can’t say what “safe” looks like, you can’t ship safely.

  • Define SLOs and hard rollback criteria
    • p95 latency within +10% of baseline for 15 minutes
    • Error rate (5xx + timeouts) < 0.5% over 5-minute windows
    • Saturation: CPU < 70%, DB CPU < 60%, queue lag < baseline + 20%
    • Data: CDC/replication lag < 5s; no write amplification causing throttling
  • Freeze surface area
    • Feature freeze on related services; no schema changes unrelated to this migration
    • Lock IaC versions (e.g., terraform v1.7.x), cluster versions, operator charts
  • Baseline and record
    • Golden signals in Prometheus/Datadog/Honeycomb; capture 24h baseline
    • Traffic profile: peak RPS, request mix, top N endpoints, read/write ratio
    • DB stats: pg_stat_statements heavy hitters, buffer hit ratio, autovacuum activity
  • Dry run in staging with prod-like load
    • Generate traffic using k6 or Locust with prod distributions
    • Rehearse rollback (more than once) under load

If rollback takes longer than 60 seconds, it’s not a rollback—it’s a second migration.

2) Build the parallel lane (infra + edge)

Run the new stack in parallel without touching prod users yet.

  • Stand up the target environment
    • New cluster (Kubernetes 1.29), autoscaler tuned, same node classes
    • Same secrets via sealed-secrets/External Secrets
    • Same runtime deps (Redis version match, JVM flags, pgbouncer settings)
  • Duplicate ingress and wire for dual routing
    • If you use Envoy/Istio: create a VirtualService that mirrors traffic to the new service
    • If you use Nginx/HAProxy/ALB: configure weighted backends and health checks
# Istio VirtualService with shadow traffic
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments
spec:
  hosts: ["payments.prod.svc.cluster.local"]
  http:
    - route:
        - destination: { host: payments-old, subset: v1, port: { number: 8080 } }
      mirror: { host: payments-new, subset: v2, port: { number: 8080 } }
      mirrorPercentage: { value: 100.0 }
      timeout: 5s
  • Observability parity
    • OpenTelemetry exporters configured identically
    • Prometheus scraping both old and new with consistent labels
    • Log correlation IDs preserved across both paths
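For the Nginx path, weighted backends look roughly like this (hostnames and ports are assumptions; note that open-source Nginx only does passive health checks via max_fails/fail_timeout, so pair it with external synthetics):

```nginx
# Weighted backends: ~99% old, ~1% new (weights are relative)
upstream payments_backend {
    server payments-old.prod.internal:8080 weight=99 max_fails=3 fail_timeout=10s;
    server payments-new.prod.internal:8080 weight=1  max_fails=3 fail_timeout=10s;
}
server {
    listen 443 ssl;
    location / {
        proxy_pass http://payments_backend;
        # skip a backend that errors or times out instead of surfacing the failure
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```

Shifting weight is then a one-line config change plus a reload, which is what makes the sub-60s rollback realistic without a mesh.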

3) Data plan: expand/migrate/contract without drama

Code is easy to roll back. Data isn’t. Do additive changes and keep both worlds speaking the same dialect until you’re sure.

  • Expand (additive, backwards-compatible)
-- Add new column/index without breaking old code
ALTER TABLE payments ADD COLUMN txn_ref text NULL;
CREATE INDEX CONCURRENTLY idx_payments_txn_ref ON payments (txn_ref);
  • Backfill in chunks, idempotently
# example: chunked, idempotent backfill runner
set -euo pipefail
for start in $(seq 1 50000 5000000); do
  end=$((start + 49999))
  psql "$PGURL" -v ON_ERROR_STOP=1 \
    -c "UPDATE payments SET txn_ref = legacy_ref WHERE txn_ref IS NULL AND id BETWEEN $start AND $end;"
  sleep 0.2 # throttle so autovacuum and replication keep up
done
  • Compatibility triggers or views (keep old writers working)
CREATE OR REPLACE FUNCTION payments_compat_trigger() RETURNS trigger AS $$
BEGIN
  NEW.legacy_ref := COALESCE(NEW.legacy_ref, NEW.txn_ref);
  RETURN NEW;
END; $$ LANGUAGE plpgsql;

CREATE TRIGGER payments_compat BEFORE INSERT OR UPDATE ON payments
FOR EACH ROW EXECUTE FUNCTION payments_compat_trigger();
  • Move state safely: CDC/logical replication
    • Postgres: pglogical or AWS DMS; Kafka: MirrorMaker 2; Mongo: changeStreams
    • Debezium for validation and drift detection
{
  "name": "payments-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "legacy-db",
    "database.port": "5432",
    "database.user": "repl",
    "database.password": "***",
    "database.dbname": "payments",
    "topic.prefix": "payments",
    "plugin.name": "pgoutput",
    "table.include.list": "public.payments",
    "slot.name": "payments_slot",
    "tombstones.on.delete": "false"
  }
}
  • Contract (only after cutover + soak)
    • Remove old columns, triggers, and dual-write code when metrics are green for 72h
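Before you contract, run a cheap consistency probe on both sides. A sketch: the same query against old and new, then diff the two result rows (hashtext is Postgres-internal but stable within a version; the updated_at column and the 10s in-flight window are assumptions about your schema):

```sql
-- Run on BOTH old and new, compare row_count and checksum.
SELECT count(*)                             AS row_count,
       coalesce(sum(hashtext(p::text)), 0)  AS checksum,
       max(updated_at)                      AS last_write
FROM payments p
WHERE updated_at < now() - interval '10 seconds';  -- ignore in-flight writes
```

If the checksums disagree, narrow by id range (the chunked-backfill boundaries work well) until you find the drifted rows.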

4) Progressive delivery: canary, gates, and instant rollback

We don’t “flip.” We ratchet forward with automated analysis and an always-on reverse gear.

  • Use rollout tooling
    • Argo Rollouts, Flagger, or Spinnaker with Prometheus/Datadog webhooks
    • Feature flags (LaunchDarkly/OpenFeature) for expensive code paths
# Argo Rollouts: progressive + Prometheus analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: { duration: 180 }
        - analysis:
            templates:
              - templateName: error-rate-check
        - setWeight: 10
        - pause: { duration: 300 }
        - analysis:
            templates:
              - templateName: latency-check
        - setWeight: 50
        - pause: { duration: 600 }
        - setWeight: 100
  selector: { matchLabels: { app: payments } }
  template:
    metadata: { labels: { app: payments } }
    spec:
      containers:
        - name: svc
          image: registry.example.com/payments:v2.3.1
  • Automated analysis (examples)

    • Error rate: sum(rate(http_requests_total{app="payments",status=~"5.."}[5m])) / sum(rate(http_requests_total{app="payments"}[5m])) < 0.005
    • Latency: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{app="payments"}[5m])) by (le)) < baseline*1.1
    • Saturation: sum(rate(container_cpu_usage_seconds_total{pod=~"payments.*"}[5m])) by (pod) < cpu_limit*0.7
  • Rollback must be one command and under 60s

    • Argo: kubectl argo rollouts undo payments
    • Istio: shift route weights back to 100% old
    • DB: keep replication running so rollback doesn’t strand writes
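The error-rate-check referenced in the Rollout steps above is an Argo Rollouts AnalysisTemplate along these lines (the Prometheus address and metric names are assumptions; the query mirrors the error-rate gate):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      failureLimit: 1            # one bad sample aborts the rollout and triggers rollback
      successCondition: result[0] < 0.005
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{app="payments",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="payments"}[5m]))
```

A latency-check template is the same shape with the histogram_quantile query and your baseline-derived threshold in successCondition.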

5) Shadow traffic, then real traffic

Before you test with real users, test with real traffic—safely.

  • Shadow (mirror) traffic
    • Mirror 100% of requests to new stack, responses ignored
    • Check: Request volume parity, no 500s on shadow path, no DB saturation
    • Validate business metrics on shadow path (idempotency, queue sizes)
  • Diff responses safely
    • For pure reads, sample 1% and diff JSON bodies (tolerate ordering/format differences)
    • For writes, do not double-commit; use idempotency keys to avoid duplication if any bleed occurs
  • Ramp real user traffic with weights
    • 1% → 10% → 50% → 100% with pauses and auto-analysis between
    • If at any step a hard threshold breaks, auto-rollback and open an incident (SRE rules, not vibes)
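For the 1% read-diff sample, derive the sampling decision from the request ID rather than a random roll, so both paths agree on which responses to compare. A minimal sketch (the helper name is ours):

```shell
# Deterministically sample ~1% of requests by hashing the request ID.
# The same ID always gets the same decision, so old and new paths agree.
in_diff_sample() {
  local request_id=$1
  local bucket
  bucket=$(printf '%s' "$request_id" | cksum | awk '{ print $1 % 100 }')
  [ "$bucket" -eq 0 ]   # roughly 1 in 100 IDs land in bucket 0
}

if in_diff_sample "req-8f2a91"; then
  echo "diff this response"
else
  echo "skip"
fi
```

The same trick works for gradual exposure by user ID when you want sticky assignment instead of per-request weights.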

6) Observability-driven gates (block on signals, not hope)

You should be blocked by facts.

  • Dashboards to gate on
    • Golden signals per environment (old vs new) side-by-side
    • Business KPIs: auth success, conversion, cart completes, payment capture latency
    • Data plane: replication/CDC lag, Kafka consumer lag, dead-letter queue rate
    • Error tracking: Sentry/Honeycomb trace waterfalls for new path only
  • Synthetic checks
    • Checkly/Pingdom hitting the new path directly (bypassing caches)
    • k6 load test during canary steps to ensure p95 stays in-budget
k6 run --vus 500 --duration 10m --tag env=new ./payment-flow.js
  • Alert policies (examples)
    • Page on error_rate_new > 0.5% for 5m OR p95_new > baseline*1.2 for 5m
    • Ticket on replication_lag > 5s for 2m or backfill throughput < target for 10m
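Those policies map directly onto Prometheus alerting rules. A sketch, with the caveat that the metric names are assumptions (pg_replication_lag here presumes postgres_exporter; substitute whatever your exporters emit):

```yaml
groups:
  - name: payments-cutover
    rules:
      - alert: CutoverErrorRateHigh
        expr: |
          sum(rate(http_requests_total{app="payments",env="new",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{app="payments",env="new"}[5m])) > 0.005
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "New-path error rate above 0.5% for 5m -- roll back"
      - alert: ReplicationLagHigh
        expr: pg_replication_lag > 5
        for: 2m
        labels: { severity: ticket }
        annotations:
          summary: "CDC/replication lag above 5s -- pause the canary"
```

Ship these rules in the same PR as the cutover runbook so the gates exist before the first canary step, not after the first incident.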

7) The cutover timeline (60-minute play)

This is the hour that matters. Assume you’ve rehearsed and your runbook is in Git.

  1. T-10m: Confirm freezes, paging on-call, warm caches, confirm backups and WAL archiving
  2. T-8m: Enable 100% shadow traffic; verify no errors on the new path
  3. T-6m: Start 1% canary; Argo/Flagger begins automated analysis
    • Checkpoints: error rate < 0.5%, p95 < +10%, CPU < 70%, DB lag < 5s
  4. T-3m: Move to 10%; keep analysis running; synthetic checks stay green
  5. T-0m: Move to 50%; run a targeted k6 burst matching peak profile; verify no regression
  6. T+10m: Move to 100% new; keep old stack hot and replication flowing
  7. T+20m: Confirm business KPIs (auth success, capture success) are within baseline
  8. T+30m: Announce provisional success; keep rollback lever hot for 24-72h
  9. Rollback trigger (if needed): any hard threshold breach for 5m → route weights to 100% old, keep CDC running, open incident, diff and fix
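The rollback trigger in step 9 should already exist as a single script, not a sequence pasted from the runbook. A sketch, assuming the Argo Rollouts kubectl plugin and the Istio setup from earlier (resource names are examples; DRY_RUN=1, the default here, prints commands instead of executing them):

```shell
# One-command rollback (target: under 60 seconds). Resource names are examples.
set -u
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "DRY-RUN: $*"; else "$@"; fi; }

rollback() {
  # 1) Snap traffic back to the old stack
  run kubectl argo rollouts abort payments -n payments
  run kubectl argo rollouts undo payments -n payments
  # 2) Belt and braces: pin the Istio route weight to the old subset
  run kubectl -n payments patch virtualservice payments --type merge \
    -p '{"spec":{"http":[{"route":[{"destination":{"host":"payments-old"},"weight":100}]}]}}'
  # 3) Leave CDC/replication running -- never stop it mid-rollback
  echo "rollback issued"
}

DRY_RUN=1 rollback
```

Time this under load in staging; if it drifts past 60 seconds, that is a blocker, not a note.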

Notes I’ve learned the hard way:

  • Never change DNS during the hour. Use layer-7 weights or ALB target weights.
  • Keep idempotency keys intact end-to-end. Payments will double-charge otherwise.
  • Don’t let an AI-generated IaC “optimization” slip in; lock Terraform plans and run kubectl diff before go-time.

8) Post-cutover: contract and clean up

After a clean soak, make it boring again.

  • 24-72h after cutover (no anomalies):
    • Stop shadowing, pin routes at 100% new
    • Freeze old writers; stop replication after confirming zero lag and consistent row counts
    • Contract schema: drop compat triggers, old columns, obsolete indexes
    • Decommission old infra via GitOps PRs; tag AMIs/images for 14d fallback if policy allows
  • Close the loop
    • Postmortem even if success: what paging rules or alerts were noisy, what runbook steps were unclear
    • Capture timings: each step duration, any rollbacks, impact on error budget
    • Update the migration RFC with the real metrics and PR links
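The schema contract from section 3 finally lands here. A sketch of the cleanup DDL, following the column and trigger names from the earlier examples (the obsolete-index name is an assumption):

```sql
-- Only after 72h of green metrics and a final row-count/checksum check.
DROP TRIGGER IF EXISTS payments_compat ON payments;
DROP FUNCTION IF EXISTS payments_compat_trigger();
ALTER TABLE payments DROP COLUMN IF EXISTS legacy_ref;
DROP INDEX CONCURRENTLY IF EXISTS idx_payments_legacy_ref;  -- obsolete index, if present
```

Ship the contract as its own reviewed PR; bundling it with the cutover change is how "reversible" quietly stops being true.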

If you made it here with no customer tickets, congratulations—you did the unglamorous, disciplined work that actually prevents executive escalations.


Key takeaways

  • Zero downtime isn’t magic—it’s disciplined gating, progressive delivery, and reversible data changes.
  • Define objective cutover criteria (latency, error rate, saturation) and block on signals, not calendar time.
  • Mirror traffic first, then canary with weighted routing and automated rollback on hard thresholds.
  • Use expand/migrate/contract for the schema and CDC or replication for safe state moves.
  • Automate the cutover timeline with IaC, GitOps, and rollout tooling (Argo Rollouts, Flagger).
  • Have a single-command rollback plan and rehearse it under load before the real day.

Implementation checklist

  • Freeze non-essential changes and publish a migration RFC with SLOs and rollback criteria.
  • Baseline golden metrics (latency, error rate, saturation), capacity, and data drift.
  • Stand up parallel stack and wire dual ingress with shadow traffic.
  • Implement additive schema changes and backfill with idempotent, chunked jobs.
  • Enable CDC or logical replication; validate lag/consistency and reconcile drift.
  • Run canaries with automated analysis and guardrails; rehearse rollback under load.
  • Execute final cutover with a 1%-10%-50%-100% progression and hard stops.
  • Decommission only after 24-72h of clean signals and postmortem learning captured.

Questions we hear from teams

How do we handle database changes with zero downtime?
Use expand/migrate/contract. Additive schema first (columns, indexes), backfill idempotently in chunks, introduce compatibility triggers or views, and keep old code paths working. Only remove old fields once the new path has soaked cleanly for 24–72 hours.
Do we need service mesh to do this?
No, but it helps. You can pull this off with Nginx/HAProxy or an ALB with weighted target groups. A mesh like Istio gives you clean mirroring, retries, circuit breakers, and telemetry consistency.
What if we discover data drift during the cutover?
Don’t proceed. Keep replication running, pause the canary, and run a targeted reconcile using CDC offsets or row-level diffs. Only resume when drift is zero and lag is back under your threshold.
How do we test rollback safely?
Rehearse in staging with prod-like load using k6/Locust. Time the command sequence and make sure traffic weights snap back within 60s, state replication continues, and business KPIs recover to baseline.
What about AI-generated code/configs in the migration?
Treat them as untrusted drafts. Run static checks, policy-as-code (Open Policy Agent/Conftest), and peer reviews. We’ve seen AI-written Helm charts misconfigure liveness probes and take clusters down. Lock versions and diff everything before go-time.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Plan your zero-downtime migration with us · Read the payments cutover case study
