The Zero‑Downtime Migration Checklist You Actually Use at 2 A.M.

A battle-tested, step-by-step runbook with checkpoints, metrics, and tooling that keeps revenue flowing while you move a critical workload.

“Rollback is a product. Treat it like one.”

Scope, SLOs, and a Hard Abort Plan

You don’t get zero-downtime by accident. The last migration I watched go sideways at a unicorn wasn’t because the new stack was bad — it was because nobody wrote down what “good” meant. Fix that first.

  • Define SLOs: e.g., 99.95% availability, p95 latency < 250ms during migration, error budget <= 2 minutes for the window.

  • Business KPIs: auth success rate, checkout conversion, ingestion throughput. These matter more than CPU graphs.

  • Abort criteria: e.g., 5-minute rolling 5xx > 0.3% or p95 > 500ms for two consecutive intervals. Write it down and encode it (see the checker sketch at the end of this section).

  • Roles and comms: one DRI, one comms lead, Slack war room, Zoom bridge, pager rotation ready.

  • Freeze windows: no schema or API changes unless they’re in this plan.

  • Dry run: rehearse in staging with recorded prod traffic and a synthetic spike.

If you can’t crisply answer “When do we abort?” you’re not ready.

Tools: Prometheus, Grafana, SLO burn alerts; PagerDuty; feature flags (LaunchDarkly, Unleash, Flipt).
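
Those abort criteria are worth encoding, not just writing down. Here is a minimal checker sketch that polls Prometheus over its HTTP query API; the metric names (http_requests_total, http_request_duration_seconds_bucket) and the endpoint are assumptions you would swap for your own.

import requests

PROM = "http://prometheus:9090/api/v1/query"  # assumed in-cluster Prometheus endpoint

# rule name -> (PromQL expression, threshold); mirrors the abort criteria above
ABORT_RULES = {
    "rolling_5xx_ratio": (
        'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
        0.003,
    ),
    "p95_latency_seconds": (
        'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
        0.5,
    ),
}

def should_abort() -> bool:
    """True if any abort criterion is breached right now; the runbook calls for two consecutive breaches."""
    for name, (expr, threshold) in ABORT_RULES.items():
        result = requests.get(PROM, params={"query": expr}, timeout=5).json()["data"]["result"]
        value = float(result[0]["value"][1]) if result else 0.0
        if value > threshold:
            print(f"ABORT gate tripped: {name}={value:.4f} > {threshold}")
            return True
    return False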

Inventory and Architecture: Map the Blast Radius

You can’t migrate what you can’t see. Every outage I’ve debugged had a “forgotten” dependency.

  • Dependencies: DBs (Postgres, MySQL), caches (Redis, Memcached), queues (Kafka, SQS), object stores, third parties (payments, auth), cron/batch jobs.

  • Session model: sticky sessions? JWT? Redis-backed? Plan for session migration or statelessness.

  • Data flow: reads vs writes, idempotency, eventual consistency tolerance.

  • Networking: ALB/Ingress, TLS termination, WAF, IP allowlists, egress NAT.

  • Back-pressure: circuit breakers, queues, retry policies.

  1. Draw the current and target architectures. Keep a one-page diagram in the runbook.

  2. Trace hot paths with OpenTelemetry + Jaeger for the top 5 revenue flows (instrumentation sketch after the tools list).

  3. Record 24h traffic shape and peak QPS. Capture seasonality.

Tools: OpenTelemetry (OTLP), Jaeger, Tempo, flow logs, terraform graph, kubectl top, redis-cli, pg_stat_activity.
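
Tracing a hot path does not require a platform project. A minimal sketch with the OpenTelemetry Python SDK, assuming an OTLP collector at otel-collector:4317 and a hypothetical checkout flow:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans to the collector, which fans out to Jaeger/Tempo.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

def checkout(order_id: str) -> None:
    # One span per revenue-critical hop; attributes make Jaeger searches useful.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("order.id", order_id)
        # ... payment, inventory, and notification calls go here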

Build the Parallel Stack: Blue/Green with Safe Defaults

Stand up the target stack fully, side-by-side. Blue is current; Green is target. Your job is to make Green boring.

  • Infra: use Terraform or Pulumi to build VPCs, subnets, security groups, ALB/NLB, EKS/GKE/AKS, and databases with replicas.

  • GitOps: ship app manifests via ArgoCD or Flux. Freeze manual kubectl edits.

  • Config parity: same env vars, secrets, feature flags. No surprises (a parity-diff sketch follows the ArgoCD example).

  • Session strategy: move to stateless or shared session store.

  • Connection draining: enable on ALB/NLB and ingress.

Example ALB target group draining (AWS):

resource "aws_lb_target_group" "app" {
  name                 = "app-green"
  port                 = 80
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  deregistration_delay = 60 # seconds
  health_check { path = "/healthz" }
}

ArgoCD app for the new stack:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-green
spec:
  project: default
  source:
    repoURL: https://github.com/acme/app
    path: deploy/overlays/green
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
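
Config parity is worth asserting mechanically instead of eyeballing two dashboards. A minimal sketch that diffs env vars between the two Deployments via kubectl; the names app-blue and app-green and the app namespace are assumptions matching the examples here.

import json
import subprocess

def env_of(deployment: str, namespace: str = "app") -> dict:
    """Return the env var map of the Deployment's first container."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "deploy", deployment, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    container = json.loads(out)["spec"]["template"]["spec"]["containers"][0]
    return {e["name"]: e.get("value", "<from-ref>") for e in container.get("env", [])}

blue, green = env_of("app-blue"), env_of("app-green")
for key in sorted(set(blue) | set(green)):
    if blue.get(key) != green.get(key):
        print(f"DRIFT {key}: blue={blue.get(key)!r} green={green.get(key)!r}")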

Checkpoints

  • Green passes health checks, can run full e2e tests, and serves shadow traffic (no user impact).

  • Dashboards for green exist and match blue within ±10% for shadowed load.

Tools: Terraform, ArgoCD, ExternalDNS, cert-manager, AWS ALB, Istio/Linkerd, Envoy, HAProxy.

Data: CDC, Backfill, Dual Writes, Shadow Reads

Data is where zero-downtime migrations go to die. Don’t wing it. Use CDC and prove correctness.

  • Choose CDC: Debezium + Kafka, AWS DMS, pglogical, MySQL binlog. Set it up from Blue->Green DB.

  • Backfill: bulk copy historical data first, then stream deltas via CDC until lag ~0.

  • Dual writes: app writes to both stores behind a flag. Make writes idempotent with keys.

  • Shadow reads: read from Blue but shadow read from Green; compare results in background.

  • Consistency checks: row counts, checksums, sampled field equality (see the fingerprint sketch below).

Postgres logical replication status (lag):

-- run on the Blue primary; replay_lag shows how far the Green replica trails
SELECT application_name, state, sync_state, replay_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;

Idempotent write example:

import requests

# include an idempotency key per business object so retries and dual writes de-duplicate
headers = {"Idempotency-Key": order_id}
resp = requests.post(url, json=payload, headers=headers, timeout=2)
resp.raise_for_status()

Simple shadow read compare:

// Compare only stable fields; project() strips timestamps and replica-local IDs.
blue := blueClient.Get(id)
green := greenClient.Get(id)
if !reflect.DeepEqual(project(blue), project(green)) {
  metrics.Counter("shadow_mismatch").Inc()
}
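
The consistency checks above can run on a schedule rather than by hand. A minimal fingerprint sketch against Postgres, assuming a hypothetical orders table with id and updated_at columns and the psycopg2 driver:

import psycopg2

# hashtext() is a built-in Postgres hash; summing it over stable columns gives a cheap fingerprint.
FINGERPRINT = """
    SELECT count(*), coalesce(sum(hashtext(id::text || updated_at::text)), 0)
    FROM orders
"""

def fingerprint(dsn: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(FINGERPRINT)
        return cur.fetchone()

blue = fingerprint("postgresql://blue-db/app")
green = fingerprint("postgresql://green-db/app")
if blue != green:
    print(f"MISMATCH blue={blue} green={green}; fall back to a sampled row-by-row diff")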

Checkpoints

  • Backfill complete; CDC lag < 2s for 95% of time during peak.

  • Dual writes enabled in production for a subset of traffic; no increase in write errors.

  • Shadow read mismatch rate < 0.1% on sampled keys for 24h.

Tools: Debezium, Kafka, AWS DMS, pglogical, Flyway/Liquibase, pgBouncer, Vitess (MySQL), gh-ost (schema changes).

Traffic: Shadow, Canary, and Gradual Cutover

Move traffic like you’re defusing a bomb: gently, with a timer in hand.

  • Shadow first: mirror requests to Green without affecting responses. Validate latency and error shape.

  • Canary: shift 1%, 5%, 10%, 25%, 50%, 100% with automated rollback (a stepping sketch follows the Istio example).

  • Sticky sessions: if you must keep them, scope canaries to session boundary or migrate to stateless first.

  • DNS: keep TTL <= 60s for the window; use weighted records if helpful.

  • Connection draining: never hard flip at the LB; let keepalives die naturally.

Istio traffic split:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts: ["app.internal"]
  http:
  - route:
    - destination: { host: app-blue, subset: v1 }
      weight: 90
    - destination: { host: app-green, subset: v2 }
      weight: 10
    retries: { attempts: 3, perTryTimeout: 500ms }
    timeout: 3s
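
Those weights should be stepped by a script, not a human running on adrenaline. A minimal sketch that patches the VirtualService above through kubectl; the namespace, hostnames, and the healthy() stub (wire it to your abort criteria) are assumptions.

import json
import subprocess
import time

STEPS = [1, 5, 10, 25, 50, 100]   # canary percentages
HOLD_SECONDS = 600                # hold each step for ~10 minutes

def set_green_weight(pct: int) -> None:
    # Merge patch replaces the whole http list, so re-state retries/timeouts here if you use them.
    patch = {"spec": {"http": [{"route": [
        {"destination": {"host": "app-blue", "subset": "v1"}, "weight": 100 - pct},
        {"destination": {"host": "app-green", "subset": "v2"}, "weight": pct},
    ]}]}}
    subprocess.run(
        ["kubectl", "-n", "app", "patch", "virtualservice", "app",
         "--type", "merge", "-p", json.dumps(patch)],
        check=True,
    )

def healthy() -> bool:
    return True  # swap in your abort-criteria check (invert should_abort() from the SLO section)

for pct in STEPS:
    set_green_weight(pct)
    deadline = time.time() + HOLD_SECONDS
    while time.time() < deadline:
        if not healthy():
            set_green_weight(0)   # automated rollback: 100% back to Blue
            raise SystemExit(f"rolled back at {pct}% canary")
        time.sleep(30)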

Nginx mirroring (shadow):

location /api/ {
  proxy_pass http://blue;
  mirror /api_shadow;
  mirror_request_body on;
}
location = /api_shadow {
  internal;
  # replay the original URI against Green, not /api_shadow itself
  proxy_pass http://green$request_uri;
}

Weighted Route53 record:

app.example.com  A  weight=90 -> blue-ALB
app.example.com  A  weight=10 -> green-ALB
TTL=60
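
The same weights can be driven through the Route 53 API instead of the console. A minimal boto3 sketch; the hosted zone ID and record values are placeholders.

import boto3

route53 = boto3.client("route53")

def set_weight(zone_id: str, name: str, set_id: str, target: str, weight: int) -> None:
    """UPSERT one weighted CNAME; call once for blue and once for green."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",               # use alias A records instead at a zone apex
                "SetIdentifier": set_id,       # distinguishes the blue and green records
                "Weight": weight,
                "TTL": 60,                     # keep TTL low for the window
                "ResourceRecords": [{"Value": target}],
            },
        }]},
    )

set_weight("Z123EXAMPLE", "app.example.com", "blue", "blue-alb.example.com", 90)
set_weight("Z123EXAMPLE", "app.example.com", "green", "green-alb.example.com", 10)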

Checkpoints

  • Shadowed latency within ±10% p95; error shape matches.

  • At 10% canary, business KPIs stable (±1% conversion, ±2% auth success).

  • Automated rollback verified (flip back within 2 minutes).

Tools: Istio, Envoy, Nginx, AWS ALB/NLB, Route53, GCP Traffic Director, Cloudflare or Akamai edges if applicable.

Observability, Load, and Chaos Before Prod Flip

Trust charts, not vibes. If it isn’t graphed, it doesn’t exist.

  • Dashboards: golden signals (latency, traffic, errors, saturation) for Blue and Green side-by-side.

  • SLOs: burn-rate alerts (e.g., 2x and 14x) wired to Slack + PagerDuty; the math is worked out after this list.

  • DB metrics: replication/CDC lag, deadlocks, queue depth, cache hit rate.

  • Business metrics: add them to Grafana from Snowflake/BigQuery or stream via Kafka -> Prometheus exporter.

  • Load test: replay prod traces with k6/Vegeta/Locust at 1.2x peak. Watch p99 and tail lat.

  • Chaos: kill one AZ, throttle network with tc, kill pods, fail a replica. Confirm circuit breakers and retries behave.
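
The 2x and 14x burn-rate thresholds fall straight out of the error budget arithmetic. A quick worked example, assuming the 99.95% availability SLO from the scope section over a 30-day window:

SLO = 0.9995
ERROR_BUDGET = 1 - SLO           # 0.05% of requests may fail over the 30-day window

def burn_rate(observed_error_ratio: float) -> float:
    """1.0 means the budget is exhausted exactly at the end of the window."""
    return observed_error_ratio / ERROR_BUDGET

# 14x sustained for 1 hour eats ~2% of a 30-day budget: page immediately.
print(burn_rate(0.007))   # 14.0
# 2x sustained for 6 hours is a slow leak: warn in Slack and watch the trend.
print(burn_rate(0.001))   # 2.0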

k6 replay stub:

import http from 'k6/http';
import { check, sleep } from 'k6';

// Each VU issues ~1 request/second, so VU count approximates QPS.
export const options = {
  stages: [{ duration: '10m', target: Math.round(1.2 * Number(__ENV.PEAK_QPS)) }],
};

export default function () {
  const res = http.get(`${__ENV.TARGET}/healthz`);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1); // ~1 rps per VU
}

Checkpoints

  • Tail latency p99 within SLO under 1.2x peak; GC/CPU stable; no persistent error budget burn.

  • CDC lag remains < 2s under load; queue depth bounded.

  • Chaos tests pass; MTTR under 5m for the injected failures.

Tools: Prometheus, Grafana, OpenTelemetry, Jaeger, Loki/ELK, k6, Vegeta, tc, chaos-mesh/Litmus.

Cutover Day Runbook: 30-Minute Increments with Go/No-Go Gates

Here’s the cadence I’ve used at fintechs and marketplaces without waking up the CFO.

  1. T-30m: Announce start, verify comms, freeze deploys. Validate that dashboards are green, on-call is present, and the rollback plan is open.

  2. Lower DNS TTL to 30–60s if you haven't already. Verify propagation at the authoritative nameservers.

  3. Enable shadow traffic (100%). Watch 10 minutes. Check: 5xx < 0.1%, p95 < 250ms, CDC lag < 2s.

  4. Start canary at 1% via mesh/LB weight. Hold 10m. Watch business KPIs. Go/No-Go #1.

  5. Increase to 5% then 10%. Hold 10–15m each. Run a targeted k6 burst at 1.2x normal. Go/No-Go #2.

  6. Flip 25% then 50%. Verify no sustained burn-rate alerts. Check DB write amplification and queue depth.

  7. At 50%+, flip read path to Green if you were shadow-reading. Keep dual writes on.

  8. Push to 100%. Keep Blue draining; do not kill it. Hold 30–60m.

  9. Turn off dual writes only after 24h of stable KPIs and zero mismatches in shadow read audits.

  10. Archive logs/metrics and snapshot DBs. Announce completion. Keep Blue warm for 24–48h as cold standby.

Go/No-Go Gates (examples)

  • Any 5xx > 0.3% over 5m? No-Go.

  • p95 delta > +25% vs. baseline for 10m? No-Go.

  • CDC lag > 5s sustained > 3m? No-Go.

  • Checkout/auth KPI delta worse than -1% for 10m? No-Go.

Rollback Plan (rehearsed)

  • Set LB/mesh weights to 100% Blue, 0% Green.

  • Re-enable single-write to Blue in the app flag.

  • Keep CDC flowing from Blue->Green to avoid split-brain.

  • Page DBA to verify replication health and reconcile any in-flight dual writes by idempotency key.
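
Reconciling in-flight dual writes is mostly a set difference over idempotency keys. A minimal sketch, assuming a hypothetical orders table that stores the key and the psycopg2 driver:

import psycopg2

RECENT_KEYS = """
    SELECT idempotency_key FROM orders
    WHERE created_at > now() - interval '1 hour'
"""

def keys(dsn: str) -> set:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(RECENT_KEYS)
        return {row[0] for row in cur.fetchall()}

blue = keys("postgresql://blue-db/app")
green = keys("postgresql://green-db/app")
print(f"replay into Green: {sorted(blue - green)}")
print(f"review writes only Green saw: {sorted(green - blue)}")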

Rollback is a product. Treat it like one: code, test, runbook, owner.

Tools: Istio/Envoy weight flips, Route53 weighted records, AWS ALB target group stickiness, feature flags, runbooks in Backstage/Confluence.

Aftercare: Remove Training Wheels and Pay Down the Debt

You’re not done until the old stack is boring to delete.

  • Remove dual writes and shadow reads behind flags. Delete dead code. Kill the toggle debt.

  • Decommission Blue: DB replicas, instances, LBs, DNS, firewall rules. Run an orphaned-resource scan against your IaC state and tags (sketch after this list).

  • Cost check: right-size Green now that it’s proven. Turn off overprovisioned nodes and over-replicated storage.

  • Observability upkeep: archive Blue's dashboards and alerts. Keep Green's SLOs wired into the on-call rotation.

  • Retro: blameless, with concrete changes to the checklist and runbook. Update runbooks and architecture docs.

  • Security: rotate secrets used during migration, revoke temporary access, update threat models.
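
The orphan scan doesn't need a product, just a loop over tags. A minimal boto3 sketch that flags running EC2 instances missing an IaC ownership tag; the tag key and value are assumptions.

import boto3

ec2 = boto3.client("ec2")

# Flag running instances that aren't claimed by Terraform (or your IaC of choice).
paginator = ec2.get_paginator("describe_instances")
filters = [{"Name": "instance-state-name", "Values": ["running"]}]
for page in paginator.paginate(Filters=filters):
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if tags.get("managed-by") != "terraform":
                print(f"orphan candidate: {inst['InstanceId']} tags={tags}")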

Tools: Terraform drift detection, cost tooling (Infracost, CloudZero), OPA policies, AWS Config, kube-downscaler.

If this sounds like the way you want to operate, this is literally what we do at GitPlumbers. We come in, map the blast radius, build the parallel track, wire CDC, and run the cutover with your team so no one has to play hero at 3 a.m.


Key takeaways

  • Define SLOs and a hard abort plan before touching anything.
  • Stand up a parallel stack with config parity and traffic shadowing.
  • Handle data with CDC + backfill + dual writes; verify with checksums and shadow reads.
  • Shift traffic gradually with canaries; keep DNS TTL low and use connection draining.
  • Gate every phase with measurable thresholds and a rollback you’ve already rehearsed.
  • Instrument everything: golden signals, lag, error budgets, and business KPIs.
  • Finish the job: remove dual writes, decommission safely, and cut hardware/cloud waste.

Implementation checklist

  • Lock SLOs and success/abort criteria with business stakeholders.
  • Map dependencies (DBs, caches, queues, cron, third parties).
  • Stand up blue/green infra with IaC and GitOps, freeze config drift.
  • Wire CDC, backfill, and dual writes; add idempotency keys.
  • Build dashboards with golden signals + lag + business KPIs.
  • Shadow traffic, then canary with automated rollback.
  • Execute cutover runbook with 15–30 min Go/No-Go gates.
  • Aftercare: remove toggles, decommission old stack, and run a blameless retro.

Questions we hear from teams

How do we avoid split-brain during dual writes?
Keep one authoritative write path (Blue) until Green proves itself. Enable dual writes but never enable dual reads that alter state. Use idempotency keys for writes, keep CDC Blue->Green, and only flip write authority when you are ready to decommission Blue.
What if our workload uses sticky sessions?
Either migrate to stateless sessions first or scope canaries by session boundary and drain old sessions before increasing weights. A shared Redis session store can bridge the gap. Enable connection draining at the LB and set short session TTLs during the window.
Is DNS switching reliable enough?
Use DNS for coarse weighting and mesh/LB for fine-grained routing. Keep TTL low (30–60s), but rely on ALB/Istio weights and connection draining for precision. Test client caching assumptions ahead of time.
Can we do this without a service mesh?
Yes. Use Nginx/HAProxy/Envoy at the edge, ALB weighted target groups, and application flags. Mesh just gives you nicer knobs and telemetry.
How long should we keep dual writes on?
At least 24 hours of peak traffic with shadow reads and zero mismatches, plus a full business cycle if your data has daily/weekly quirks. Only then turn off dual writes and decommission the old store.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Run your next cutover with GitPlumbers, or download the detailed migration runbook template.
