The Zero‑Downtime Migration Checklist You Actually Use at 2 A.M.
A battle-tested, step-by-step runbook with checkpoints, metrics, and tooling that keeps revenue flowing while you move a critical workload.
“Rollback is a product. Treat it like one.”
Scope, SLOs, and a Hard Abort Plan
You don’t get zero-downtime by accident. The last migration I watched go sideways at a unicorn wasn’t because the new stack was bad — it was because nobody wrote down what “good” meant. Fix that first.
Define SLOs: e.g., 99.95% availability, p95 latency < 250ms during migration, error budget <= 2 minutes for the window.
Business KPIs: auth success rate, checkout conversion, ingestion throughput. These matter more than CPU graphs.
Abort criteria: e.g., 5-minute rolling 5xx > 0.3% or p95 > 500ms for two consecutive intervals. Write it down.
Roles and comms: one DRI, one comms lead, Slack war room, Zoom bridge, pager rotation ready.
Freeze windows: no schema or API changes unless they’re in this plan.
Dry run: rehearse in staging with recorded prod traffic and a synthetic spike.
If you can’t crisply answer “When do we abort?” you’re not ready.
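The abort criteria can be encoded so the 2 a.m. call is mechanical, not a debate. A minimal sketch, assuming your metrics pipeline hands you rolling 5-minute windows; the `Interval` shape and defaults mirror the example thresholds above and are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """One rolling 5-minute window of traffic stats."""
    requests: int
    errors_5xx: int
    p95_ms: float

def should_abort(windows: list[Interval],
                 max_5xx_rate: float = 0.003,   # 0.3%
                 max_p95_ms: float = 500.0,
                 consecutive: int = 2) -> bool:
    """Abort when the last `consecutive` windows each breach either threshold."""
    if len(windows) < consecutive:
        return False
    return all(
        (w.errors_5xx / max(w.requests, 1)) > max_5xx_rate or w.p95_ms > max_p95_ms
        for w in windows[-consecutive:]
    )

# Two consecutive windows at 0.4-0.5% 5xx -> abort
print(should_abort([Interval(10_000, 40, 260.0), Interval(10_000, 50, 270.0)]))  # True
```

One bad window is noise; two in a row is a decision. That's the whole point of writing it down as code.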
Tools: Prometheus, Grafana, SLO burn alerts; PagerDuty; feature flags (LaunchDarkly, Unleash, Flipt).
Inventory and Architecture: Map the Blast Radius
You can’t migrate what you can’t see. Every outage I’ve debugged had a “forgotten” dependency.
Dependencies: DBs (Postgres, MySQL), caches (Redis, Memcached), queues (Kafka, SQS), object stores, third parties (payments, auth), cron/batch jobs.
Session model: sticky sessions? JWT? Redis-backed? Plan for session migration or statelessness.
Data flow: reads vs writes, idempotency, eventual consistency tolerance.
Networking: ALB/Ingress, TLS termination, WAF, IP allowlists, egress NAT.
Back-pressure: circuit breakers, queues, retry policies.
Draw the current and target architectures. Keep a one-page diagram in the runbook.
Trace hot paths with OpenTelemetry + Jaeger for the top 5 revenue flows.
Record 24h traffic shape and peak QPS. Capture seasonality.
Tools: OTLP/OpenTelemetry, Jaeger, Tempo, VPC flow logs, terraform graph, kubectl top, redis-cli, pg_stat_activity.
Build the Parallel Stack: Blue/Green with Safe Defaults
Stand up the target stack fully, side-by-side. Blue is current; Green is target. Your job is to make Green boring.
Infra: use Terraform or Pulumi to build VPCs, subnets, security groups, ALB/NLB, EKS/GKE/AKS, and databases with replicas.
GitOps: ship app manifests via ArgoCD or Flux. Freeze manual kubectl edits.
Config parity: same env vars, secrets, feature flags. No surprises.
Session strategy: move to stateless or shared session store.
Connection draining: enable on ALB/NLB and ingress.
Example ALB target group draining (AWS):
resource "aws_lb_target_group" "app" {
name = "app-green"
port = 80
protocol = "HTTP"
vpc_id = var.vpc_id
deregistration_delay = 60 # seconds
health_check { path = "/healthz" }
}ArgoCD app for the new stack:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-green
spec:
  source:
    repoURL: https://github.com/acme/app
    path: deploy/overlays/green
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Checkpoints
Green passes health checks, can run full e2e tests, and serves shadow traffic (no user impact).
Dashboards for green exist and match blue within ±10% for shadowed load.
Tools: Terraform, ArgoCD, ExternalDNS, cert-manager, AWS ALB, Istio/Linkerd, Envoy, HAProxy.
Data: CDC, Backfill, Dual Writes, Shadow Reads
Data is where zero-downtime migrations go to die. Don’t wing it. Use CDC and prove correctness.
Choose CDC: Debezium + Kafka, AWS DMS, pglogical, MySQL binlog. Set it up from Blue -> Green DB.
Backfill: bulk copy historical data first, then stream deltas via CDC until lag ~0.
Dual writes: app writes to both stores behind a flag. Make writes idempotent with keys.
Shadow reads: read from Blue but shadow read from Green; compare results in background.
Consistency checks: row counts, checksums, sampled field equality.
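Chunked checksums over primary-key ranges make the "did the backfill actually work?" question cheap to answer. A sketch under the assumption that `fetch_rows(store, lo, hi)` is your own helper returning ordered `(id, payload)` tuples from either store:

```python
import hashlib

def chunk_checksum(rows):
    """Stable digest over ordered (id, payload) tuples from one store."""
    h = hashlib.sha256()
    for pk, payload in rows:
        h.update(f"{pk}:{payload}".encode())
    return h.hexdigest()

def compare_range(fetch_rows, lo, hi, chunk=10_000):
    """Yield (lo, hi) key ranges whose checksums differ between Blue and Green."""
    for start in range(lo, hi, chunk):
        end = min(start + chunk, hi)
        blue = chunk_checksum(fetch_rows("blue", start, end))
        green = chunk_checksum(fetch_rows("green", start, end))
        if blue != green:
            yield (start, end)  # drill into this range with per-row comparison
```

Row counts catch the cheap mismatches; checksums catch silent drift inside ranges where the counts agree.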
Postgres logical replication status (lag):

```sql
-- Run on the primary: per-replica replay lag in human-readable form
SELECT application_name, state, sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```

Idempotent write example:

```python
import requests

# include an idempotency key per business object
headers = {"Idempotency-Key": order_id}
requests.post(url, json=payload, headers=headers, timeout=2)
```

Simple shadow read compare (Go):

```go
blue := blueClient.Get(id)
green := greenClient.Get(id)
if !reflect.DeepEqual(project(blue), project(green)) {
	metrics.Counter("shadow_mismatch").Inc()
}
```

Checkpoints
Backfill complete; CDC lag < 2s for 95% of time during peak.
Dual writes enabled in production for a subset of traffic; no increase in write errors.
Shadow read mismatch rate < 0.1% on sampled keys for 24h.
Tools: Debezium, Kafka, AWS DMS, pglogical, Flyway/Liquibase, pgBouncer, Vitess (MySQL), gh-ost (schema changes).
Traffic: Shadow, Canary, and Gradual Cutover
Move traffic like you’re defusing a bomb: gently, with a timer in hand.
Shadow first: mirror requests to Green without affecting responses. Validate latency and error shape.
Canary: shift 1%, 5%, 10%, 25%, 50%, 100% with automated rollback.
Sticky sessions: if you must keep them, scope canaries to session boundary or migrate to stateless first.
DNS: keep TTL <= 60s for the window; use weighted records if helpful.
Connection draining: never hard flip at the LB; let keepalives die naturally.
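The canary ladder is worth automating rather than eyeballing. A sketch of the control loop, where `set_weight` and `healthy` are hypothetical wrappers around whatever your mesh/LB and metrics stack expose:

```python
import time

STEPS = [1, 5, 10, 25, 50, 100]  # percent of traffic shifted to Green

def run_canary(set_weight, healthy, hold_s=600, poll_s=30):
    """Walk the weight ladder; roll back to 0% on the first failed health check."""
    for pct in STEPS:
        set_weight(pct)
        deadline = time.time() + hold_s
        while time.time() < deadline:
            if not healthy():
                set_weight(0)  # automated rollback: all traffic back to Blue
                return False
            time.sleep(poll_s)
    return True
```

The important property: rollback is the default path out of the loop, not a human remembering a command.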
Istio traffic split:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts: ["app.internal"]
  http:
  - route:
    - destination: { host: app-blue, subset: v1 }
      weight: 90
    - destination: { host: app-green, subset: v2 }
      weight: 10
    retries: { attempts: 3, perTryTimeout: 500ms }
    timeout: 3s
```

Nginx mirroring (shadow):

```nginx
location /api/ {
  proxy_pass http://blue;
  mirror /api_shadow;
}

location = /api_shadow {
  internal;
  proxy_pass http://green;
}
```

Weighted Route53 record:

```text
app.example.com A weight=90 -> blue-ALB
app.example.com A weight=10 -> green-ALB
TTL=60
```

Checkpoints
Shadowed latency within ±10% p95; error shape matches.
At 10% canary, business KPIs stable (±1% conversion, ±2% auth success).
Automated rollback verified (flip back within 2 minutes).
Tools: Istio, Envoy, Nginx, AWS ALB/NLB, Route53, GCP Traffic Director, Cloudflare, or Akamai-fronted edges if applicable.
Observability, Load, and Chaos Before Prod Flip
Trust charts, not vibes. If it isn’t graphed, it doesn’t exist.
Dashboards: golden signals (latency, traffic, errors, saturation) for Blue and Green side-by-side.
SLOs: burn-rate alerts (e.g., 2x and 14x) wired to Slack + PagerDuty.
DB metrics: replication/CDC lag, deadlocks, queue depth, cache hit rate.
Business metrics: add them to Grafana from Snowflake/BigQuery or stream via a Kafka -> Prometheus exporter.
Load test: replay prod traces with k6/Vegeta/Locust at 1.2x peak. Watch p99 and tail latency.
Chaos: kill one AZ, throttle the network with tc, kill pods, fail a replica. Confirm circuit breakers and retries behave.
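Burn-rate alerting is just arithmetic over the error budget. A sketch of the multi-window idea behind the 2x/14x alerts; the exact multipliers and window pairing here follow common SRE practice and are illustrative:

```python
def burn_rate(error_ratio: float, slo: float = 0.9995) -> float:
    """How fast we consume error budget: 1.0 = exactly on budget."""
    budget = 1.0 - slo  # 0.05% allowed errors for a 99.95% SLO
    return error_ratio / budget

def alert_level(ratio_5m: float, ratio_1h: float) -> str:
    """Page on fast burn in both windows; ticket on sustained slow burn."""
    if burn_rate(ratio_5m) >= 14 and burn_rate(ratio_1h) >= 14:
        return "page"
    if burn_rate(ratio_1h) >= 2:
        return "ticket"
    return "ok"

print(alert_level(0.01, 0.008))  # page: both windows burn >= 14x budget
```

Requiring both windows to breach keeps a single bad scrape from paging the war room.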
k6 replay stub:

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

// __ENV values are strings; coerce once up front
const peak = Number(__ENV.PEAK_QPS || 100);

export const options = {
  stages: [{ duration: '10m', target: Math.ceil(1.2 * peak) }],
};

export default function () {
  const res = http.get(`${__ENV.TARGET}/healthz`);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1 / peak);
}
```

Checkpoints
Tail latency p99 within SLO under 1.2x peak; GC/CPU stable; no persistent error budget burn.
CDC lag remains < 2s under load; queue depth bounded.
Chaos tests pass; MTTR under 5m for the injected failures.
Tools: Prometheus, Grafana, OpenTelemetry, Jaeger, Loki/ELK, k6, Vegeta, tc, chaos-mesh/Litmus.
Cutover Day Runbook: 30-Minute Increments with Go/No-Go Gates
Here’s the cadence I’ve used at fintechs and marketplaces without waking up the CFO.
T-30m: Announce start, verify comms, freeze deploys. Validate dashboards green, on-call present, rollback plan open.
Lower DNS TTL to 30–60s if not already. Verify propagation on authoritative DNS.
Enable shadow traffic (100%). Watch 10 minutes. Check: 5xx < 0.1%, p95 < 250ms, CDC lag < 2s.
Start canary at 1% via mesh/LB weight. Hold 10m. Watch business KPIs. Go/No-Go #1.
Increase to 5% then 10%. Hold 10–15m each. Run a targeted k6 burst at 1.2x normal. Go/No-Go #2.
Flip 25% then 50%. Verify no sustained burn-rate alerts. Check DB write amplification and queue depth.
At 50%+, flip the read path to Green if you were shadow-reading. Keep dual writes on.
Push to 100%. Keep Blue draining; do not kill it. Hold 30–60m.
Turn off dual writes only after 24h of stable KPIs and zero mismatches in shadow read audits.
Archive logs/metrics and snapshot DBs. Announce completion. Keep Blue warm for 24–48h as cold standby.
Go/No-Go Gates (examples)
Any 5xx > 0.3% over 5m? No-Go.
p95 delta > +25% vs. baseline for 10m? No-Go.
CDC lag > 5s sustained > 3m? No-Go.
Checkout/auth KPI delta worse than -1% for 10m? No-Go.
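The gates above can be encoded so the call is mechanical under pressure. A sketch using the thresholds from this list; the metric snapshot shape is illustrative:

```python
def go_no_go(m: dict) -> list[str]:
    """Return the list of tripped gates; an empty list means Go."""
    trips = []
    if m["rate_5xx_5m"] > 0.003:
        trips.append("5xx > 0.3% over 5m")
    if m["p95_delta_10m"] > 0.25:
        trips.append("p95 delta > +25% vs. baseline for 10m")
    if m["cdc_lag_s"] > 5 and m["cdc_lag_sustained_s"] > 180:
        trips.append("CDC lag > 5s sustained > 3m")
    if m["kpi_delta_10m"] < -0.01:
        trips.append("KPI delta worse than -1% for 10m")
    return trips

snapshot = {"rate_5xx_5m": 0.001, "p95_delta_10m": 0.30,
            "cdc_lag_s": 1.0, "cdc_lag_sustained_s": 0, "kpi_delta_10m": 0.002}
print(go_no_go(snapshot))  # ['p95 delta > +25% vs. baseline for 10m'] -> No-Go
```

Any non-empty list is a No-Go; the DRI gets the tripped gate names, not a judgment call.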
Rollback Plan (rehearsed)
Set LB/mesh weights to 100% Blue, 0% Green.
Re-enable single-write to Blue in the app flag.
Keep CDC flowing from Blue->Green to avoid split-brain.
Page DBA to verify replication health and reconcile any in-flight dual writes by idempotency key.
Rollback is a product. Treat it like one: code, test, runbook, owner.
Tools: Istio/Envoy weight flips, Route53 weighted records, AWS ALB target group stickiness, feature flags, runbooks in Backstage/Confluence.
Aftercare: Remove Training Wheels and Pay Down the Debt
You’re not done until the old stack is boring to delete.
Remove dual writes and shadow reads behind flags. Delete dead code. Kill the toggle debt.
Decommission Blue: DB replicas, instances, LBs, DNS, firewall rules. Tag-orphan scan with IaC.
Cost check: right-size Green now that it’s proven. Turn off overprovisioned nodes and over-replicated storage.
Observability upkeep: archive dashboards and alerts for Blue. Keep Green SLOs living in on-call.
Retro: blameless, with concrete changes to the checklist and runbook. Update runbooks and architecture docs.
Security: rotate secrets used during migration, revoke temporary access, update threat models.
Tools: Terraform drift detection, cost (Infracost, CloudZero), OPA policies, AWS Config, kube-downscaler.
If this sounds like the way you want to operate, this is literally what we do at GitPlumbers. We come in, map the blast radius, build the parallel track, wire CDC, and run the cutover with your team so no one has to play hero at 3 a.m.
Key takeaways
- Define SLOs and a hard abort plan before touching anything.
- Stand up a parallel stack with config parity and traffic shadowing.
- Handle data with CDC + backfill + dual writes; verify with checksums and shadow reads.
- Shift traffic gradually with canaries; keep DNS TTL low and use connection draining.
- Gate every phase with measurable thresholds and a rollback you’ve already rehearsed.
- Instrument everything: golden signals, lag, error budgets, and business KPIs.
- Finish the job: remove dual writes, decommission safely, and cut hardware/cloud waste.
Implementation checklist
- Lock SLOs and success/abort criteria with business stakeholders.
- Map dependencies (DBs, caches, queues, cron, third parties).
- Stand up blue/green infra with IaC and GitOps, freeze config drift.
- Wire CDC, backfill, and dual writes; add idempotency keys.
- Build dashboards with golden signals + lag + business KPIs.
- Shadow traffic, then canary with automated rollback.
- Execute cutover runbook with 15–30 min Go/No-Go gates.
- Aftercare: remove toggles, decommission old stack, and run a blameless retro.
Questions we hear from teams
- How do we avoid split-brain during dual writes?
- Keep one authoritative write path (Blue) until Green proves itself. Enable dual writes but never enable dual reads that alter state. Use idempotency keys for writes, keep CDC Blue->Green, and only flip write authority when you are ready to decommission Blue.
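In code, that policy looks something like this; a minimal sketch where the store clients and the failure queue are hypothetical stand-ins for yours:

```python
def dual_write(blue, green, key: str, record: dict, failures: list) -> None:
    """Write to the authoritative store first; mirror to Green without risking the request."""
    blue.put(key, record)       # authoritative: an error here fails the request
    try:
        green.put(key, record)  # best-effort: same idempotency key, safe to replay
    except Exception as exc:
        failures.append((key, str(exc)))  # reconcile later via CDC / replay queue
```

Green failures never bubble up to the user; they land in a queue keyed for idempotent replay.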
- What if our workload uses sticky sessions?
- Either migrate to stateless sessions first or scope canaries by session boundary and drain old sessions before increasing weights. A shared Redis session store can bridge the gap. Enable connection draining at the LB and set short session TTLs during the window.
- Is DNS switching reliable enough?
- Use DNS for coarse weighting and mesh/LB for fine-grained routing. Keep TTL low (30–60s), but rely on ALB/Istio weights and connection draining for precision. Test client caching assumptions ahead of time.
- Can we do this without a service mesh?
- Yes. Use Nginx/HAProxy/Envoy at the edge, ALB weighted target groups, and application flags. Mesh just gives you nicer knobs and telemetry.
- How long should we keep dual writes on?
- At least 24 hours of peak traffic with shadow reads and zero mismatches, plus a full business cycle if your data has daily/weekly quirks. Only then turn off dual writes and decommission the old store.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
