The Zero‑Downtime Migration Checklist You Actually Use at 2 A.M.
A battle-tested, step-by-step runbook with checkpoints, metrics, and tooling that keeps revenue flowing while you move a critical workload.
“Rollback is a product. Treat it like one.”
Scope, SLOs, and a Hard Abort Plan
You don’t get zero-downtime by accident. The last migration I watched go sideways at a unicorn wasn’t because the new stack was bad — it was because nobody wrote down what “good” meant. Fix that first.
Define SLOs: e.g., 99.95% availability, p95 latency < 250ms during migration, error budget <= 2 minutes for the window.
Business KPIs: auth success rate, checkout conversion, ingestion throughput. These matter more than CPU graphs.
Abort criteria: e.g., 5-minute rolling 5xx > 0.3% or p95 > 500ms for two consecutive intervals. Write it down (a minimal gate check is sketched below).
Roles and comms: one DRI, one comms lead, Slack war room, Zoom bridge, pager rotation ready.
Freeze windows: no schema or API changes unless they’re in this plan.
Dry run: rehearse in staging with recorded prod traffic and a synthetic spike.
If you can’t crisply answer “When do we abort?” you’re not ready.
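To keep the 2 a.m. debate short, encode the abort thresholds as code and evaluate them mechanically. Here’s a minimal sketch in Python; the Window values would come from Prometheus or your APM, and the thresholds mirror the example criteria above:
from dataclasses import dataclass

# Abort thresholds from the plan: 5-minute rolling 5xx > 0.3% or p95 > 500ms,
# sustained for two consecutive intervals.
MAX_5XX_RATIO = 0.003
MAX_P95_MS = 500.0

@dataclass
class Window:
    five_xx_ratio: float   # e.g. 0.0012 == 0.12%
    p95_ms: float

def should_abort(last_two: list[Window]) -> bool:
    """Abort only when both consecutive windows breach a threshold."""
    breaches = [
        w.five_xx_ratio > MAX_5XX_RATIO or w.p95_ms > MAX_P95_MS
        for w in last_two
    ]
    return len(breaches) == 2 and all(breaches)

# Example: first window is noisy, second recovers -> keep going.
print(should_abort([Window(0.004, 310.0), Window(0.001, 240.0)]))  # False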
Tools: Prometheus, Grafana, SLO burn alerts; PagerDuty; feature flags (LaunchDarkly, Unleash, Flipt).
Inventory and Architecture: Map the Blast Radius
You can’t migrate what you can’t see. Every outage I’ve debugged had a “forgotten” dependency.
Dependencies: DBs (Postgres, MySQL), caches (Redis, Memcached), queues (Kafka, SQS), object stores, third parties (payments, auth), cron/batch jobs.
Session model: sticky sessions? JWT? Redis-backed? Plan for session migration or statelessness.
Data flow: reads vs writes, idempotency, eventual consistency tolerance.
Networking: ALB/Ingress, TLS termination, WAF, IP allowlists, egress NAT.
Back-pressure: circuit breakers, queues, retry policies.
Draw the current and target architectures. Keep a one-page diagram in the runbook.
Trace hot paths with OpenTelemetry + Jaeger for the top 5 revenue flows (instrumentation sketch below).
Record 24h traffic shape and peak QPS. Capture seasonality.
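Here’s a minimal tracing sketch for one hot path, assuming the standard OpenTelemetry Python SDK exporting OTLP to a collector that feeds Jaeger or Tempo; the endpoint, service name, and span names are placeholders for your own revenue flows:
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Ship spans via OTLP to a collector that Jaeger/Tempo can read from.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("migration.hotpaths")

def place_order(order_id: str) -> None:
    # Wrap the top revenue flows so Blue vs. Green latency is comparable span by span.
    with tracer.start_as_current_span("checkout.place_order") as span:
        span.set_attribute("order.id", order_id)
        # ... call payment, inventory, etc.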
Tools: OTLP, Jaeger, Tempo, flow logs, terraform graph, kubectl top, redis-cli, pg_stat_activity.
Build the Parallel Stack: Blue/Green with Safe Defaults
Stand up the target stack fully, side-by-side. Blue is current; Green is target. Your job is to make Green boring.
Infra: use Terraform or Pulumi to build VPCs, subnets, security groups, ALB/NLB, EKS/GKE/AKS, and databases with replicas.
GitOps: ship app manifests via ArgoCD or Flux. Freeze manual kubectl edits.
Config parity: same env vars, secrets, feature flags. No surprises (parity-check sketch after this list).
Session strategy: move to stateless or shared session store.
Connection draining: enable on ALB/NLB and ingress.
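Parity is easier to prove with a diff than with eyeballs. A rough sketch, assuming Blue and Green each render their non-secret config into a ConfigMap called app-config in their own namespaces (adjust names to your layout); do the equivalent for secrets through your secrets manager, not kubectl output:
import json
import subprocess

def config_data(namespace: str, name: str = "app-config") -> dict:
    # Hypothetical layout: Blue and Green live in their own namespaces.
    out = subprocess.run(
        ["kubectl", "get", "configmap", name, "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout).get("data", {})

blue, green = config_data("app-blue"), config_data("app-green")
for key in sorted(set(blue) | set(green)):
    if blue.get(key) != green.get(key):
        print(f"DRIFT {key}: blue={blue.get(key)!r} green={green.get(key)!r}")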
Example ALB target group draining (AWS):
resource "aws_lb_target_group" "app" {
name = "app-green"
port = 80
protocol = "HTTP"
vpc_id = var.vpc_id
deregistration_delay = 60 # seconds
health_check { path = "/healthz" }
}
ArgoCD app for the new stack:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-green
spec:
  project: default
  source:
    repoURL: https://github.com/acme/app
    path: deploy/overlays/green
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
Checkpoints
Green passes health checks, can run full e2e tests, and serves shadow traffic (no user impact).
Dashboards for green exist and match blue within ±10% for shadowed load.
Tools: Terraform, ArgoCD, ExternalDNS, cert-manager, AWS ALB, Istio/Linkerd, Envoy, HAProxy.
Data: CDC, Backfill, Dual Writes, Shadow Reads
Data is where zero-downtime migrations go to die. Don’t wing it. Use CDC and prove correctness.
Choose CDC: Debezium + Kafka, AWS DMS, pglogical, MySQL binlog. Set it up from Blue -> Green DB (connector sketch after this list).
Backfill: bulk copy historical data first, then stream deltas via CDC until lag ~0.
Dual writes: app writes to both stores behind a flag. Make writes idempotent with keys.
Shadow reads: read from Blue but shadow read from Green; compare results in background.
Consistency checks: row counts, checksums, sampled field equality (checksum sketch below, after the shadow-read compare).
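If you go the Debezium route, the Blue -> Green leg is a connector registered with Kafka Connect that streams the Blue primary’s changes into topics the Green side consumes. A sketch against the Kafka Connect REST API; hostnames, credentials, slot, and table list are placeholders, and the key names assume Debezium 2.x (older releases use database.server.name instead of topic.prefix):
import requests

connector = {
    "name": "blue-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                # logical decoding plugin on Blue
        "database.hostname": "blue-db.internal",  # placeholder hosts/creds
        "database.port": "5432",
        "database.user": "cdc",
        "database.password": "secret",
        "database.dbname": "app",
        "slot.name": "green_migration",
        "table.include.list": "public.orders,public.payments",
        "topic.prefix": "blue",                   # Debezium 2.x naming
    },
}

# Register the connector with Kafka Connect; 201 on create, 409 if it already exists.
resp = requests.post("http://connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()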
Postgres logical replication status (lag):
-- Run on the Blue primary; replay_lag shows how far each replica/CDC client is behind.
SELECT application_name, state, sync_state, replay_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
Idempotent write example:
import requests

# Include an idempotency key per business object so retries don't double-write.
headers = {"Idempotency-Key": str(order_id)}
requests.post(url, json=payload, headers=headers, timeout=2)
Simple shadow read compare:
// Compare only the fields that matter; project() should strip volatile fields
// (updated_at, generated IDs) before the comparison.
blue := blueClient.Get(id)
green := greenClient.Get(id)
if !reflect.DeepEqual(project(blue), project(green)) {
    metrics.Counter("shadow_mismatch").Inc()
}
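For the consistency checks, row counts plus an md5 over a sampled key range catch most drift cheaply. A rough sketch with psycopg2 run against both sides; the table, key column, DSNs, and id range are placeholders:
import psycopg2

# md5 over each row's text form, aggregated in key order, over a sampled id range.
QUERY = """
SELECT count(*) AS rows,
       md5(string_agg(md5(o::text), '' ORDER BY o.id)) AS checksum
FROM orders o
WHERE o.id BETWEEN %s AND %s
"""

def summarize(dsn: str, lo: int, hi: int):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (lo, hi))
        return cur.fetchone()

lo, hi = 1_000_000, 1_010_000  # sample a hot key range
blue = summarize("postgresql://audit@blue-db/app", lo, hi)
green = summarize("postgresql://audit@green-db/app", lo, hi)
print("match" if blue == green else f"MISMATCH blue={blue} green={green}")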
Checkpoints
Backfill complete; CDC lag < 2s for 95% of time during peak.
Dual writes enabled in production for a subset of traffic; no increase in write errors.
Shadow read mismatch rate < 0.1% on sampled keys for 24h.
Tools: Debezium, Kafka, AWS DMS, pglogical, Flyway/Liquibase, pgBouncer, Vitess (MySQL), gh-ost (schema changes).
Traffic: Shadow, Canary, and Gradual Cutover
Move traffic like you’re defusing a bomb: gently, with a timer in hand.
Shadow first: mirror requests to Green without affecting responses. Validate latency and error shape.
Canary: shift 1%, 5%, 10%, 25%, 50%, 100% with automated rollback (a stepper sketch follows the examples below).
Sticky sessions: if you must keep them, scope canaries to session boundary or migrate to stateless first.
DNS: keep TTL <= 60s for the window; use weighted records if helpful.
Connection draining: never hard flip at the LB; let keepalives die naturally.
Istio traffic split:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app
spec:
  hosts: ["app.internal"]
  http:
  - route:
    - destination: { host: app-blue, subset: v1 }
      weight: 90
    - destination: { host: app-green, subset: v2 }
      weight: 10
    retries: { attempts: 3, perTryTimeout: 500ms }
    timeout: 3s
Nginx mirroring (shadow):
location /api/ {
  proxy_pass http://blue;
  mirror /api_shadow;
}

location = /api_shadow {
  internal;
  # Preserve the original URI when replaying to Green; without $request_uri the
  # mirrored request would hit /api_shadow on the Green upstream.
  proxy_pass http://green$request_uri;
}
Weighted Route53 record:
app.example.com A weight=90 -> blue-ALB
app.example.com A weight=10 -> green-ALB
TTL=60
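If you’d rather not babysit the weight changes, a small loop can run the canary steps and the rollback for you. Here’s a sketch that patches the VirtualService above and polls Prometheus for Green’s 5xx ratio; the Prometheus endpoint, Istio metric labels, thresholds, and hold times are assumptions you should adapt to your environment:
import subprocess
import time
import requests

PROM = "http://prometheus:9090/api/v1/query"   # assumed in-cluster Prometheus
ERROR_QUERY = (
    'sum(rate(istio_requests_total{destination_service="app-green",response_code=~"5.."}[5m]))'
    ' / sum(rate(istio_requests_total{destination_service="app-green"}[5m]))'
)

def set_weights(blue: int, green: int) -> None:
    # Patch the VirtualService shown above: route[0] is Blue, route[1] is Green.
    patch = (
        f'[{{"op":"replace","path":"/spec/http/0/route/0/weight","value":{blue}}},'
        f'{{"op":"replace","path":"/spec/http/0/route/1/weight","value":{green}}}]'
    )
    subprocess.run(
        ["kubectl", "-n", "app", "patch", "virtualservice", "app", "--type=json", "-p", patch],
        check=True,
    )

def green_error_ratio() -> float:
    result = requests.get(PROM, params={"query": ERROR_QUERY}, timeout=5).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

for pct in (1, 5, 10, 25, 50, 100):
    set_weights(100 - pct, pct)
    time.sleep(600)                      # hold each step ~10 minutes
    if green_error_ratio() > 0.003:      # runbook abort line: 5xx > 0.3%
        set_weights(100, 0)              # automated rollback to Blue
        raise SystemExit(f"rolled back at {pct}% canary")
print("canary complete: 100% on Green")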
Checkpoints
Shadowed latency within ±10% p95; error shape matches.
At 10% canary, business KPIs stable (±1% conversion, ±2% auth success).
Automated rollback verified (flip back within 2 minutes).
Tools: Istio, Envoy, Nginx, AWS ALB/NLB, Route53, GCP Traffic Director, Cloudflare or Akamai frontends if applicable.
Observability, Load, and Chaos Before Prod Flip
Trust charts, not vibes. If it isn’t graphed, it doesn’t exist.
Dashboards: golden signals (latency, traffic, errors, saturation) for Blue and Green side-by-side.
SLOs: burn-rate alerts (e.g., 2x and 14x) wired to Slack + PagerDuty (burn-rate math sketched after the k6 stub).
DB metrics: replication/CDC lag, deadlocks, queue depth, cache hit rate.
Business metrics: add them to Grafana from Snowflake/BigQuery or stream via Kafka -> Prometheus exporter.
Load test: replay prod traces with k6/Vegeta/Locust at 1.2x peak. Watch p99 and tail latency.
Chaos: kill one AZ, throttle network with tc, kill pods, fail a replica. Confirm circuit breakers and retries behave.
k6 replay stub:
import http from 'k6/http';
import { check, sleep } from 'k6';

// Each VU drives roughly 1 request/second (sleep(1)), so target VUs ~= 1.2x peak QPS.
const peak = Number(__ENV.PEAK_QPS);
export const options = { stages: [{ duration: '10m', target: Math.ceil(1.2 * peak) }] };

export default function () {
  const res = http.get(`${__ENV.TARGET}/healthz`);
  check(res, { 'status 200': (r) => r.status === 200 });
  sleep(1);
}
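For context on those burn-rate multiples: burn rate is just the observed error ratio divided by the error budget the SLO allows. A tiny helper, assuming the 99.95% availability SLO from the first section; wire its output into Slack/PagerDuty however you already page:
# Burn rate = observed error ratio / allowed error ratio (1 - SLO target).
SLO_TARGET = 0.9995            # 99.95% availability
ERROR_BUDGET = 1 - SLO_TARGET  # 0.05% of requests may fail

def burn_rate(error_ratio: float) -> float:
    return error_ratio / ERROR_BUDGET

def alert_level(error_ratio: float) -> str:
    rate = burn_rate(error_ratio)
    if rate >= 14:   # fast burn: page immediately
        return "page"
    if rate >= 2:    # slow burn: ticket and keep watching
        return "ticket"
    return "ok"

# 0.9% errors against a 0.05% budget is an 18x burn -> page.
print(alert_level(0.009))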
Checkpoints
Tail latency p99 within SLO under 1.2x peak; GC/CPU stable; no persistent error budget burn.
CDC lag remains < 2s under load; queue depth bounded.
Chaos tests pass; MTTR under 5m for the injected failures.
Tools: Prometheus, Grafana, OpenTelemetry, Jaeger, Loki/ELK, k6, Vegeta, tc, chaos-mesh/Litmus.
Cutover Day Runbook: 30-Minute Increments with Go/No-Go Gates
Here’s the cadence I’ve used at fintechs and marketplaces without waking up the CFO.
T-30m: Announce start, verify comms, freeze deploys. Validate dashboards green, on-call present, rollback plan open.
Lower DNS TTL to 30–60s if not already. Verify propagation on authoritative DNS.
Enable shadow traffic (100%). Watch 10 minutes. Check: 5xx < 0.1%, p95 < 250ms, CDC lag < 2s.
Start canary at 1% via mesh/LB weight. Hold 10m. Watch business KPIs. Go/No-Go #1.
Increase to 5% then 10%. Hold 10–15m each. Run a targeted k6 burst at 1.2x normal. Go/No-Go #2.
Flip 25% then 50%. Verify no sustained burn-rate alerts. Check DB write amplification and queue depth.
At 50%+, flip the read path to Green if you were shadow-reading. Keep dual writes on.
Push to 100%. Keep Blue draining; do not kill it. Hold 30–60m.
Turn off dual writes only after 24h of stable KPIs and zero mismatches in shadow read audits.
Archive logs/metrics and snapshot DBs. Announce completion. Keep Blue warm for 24–48h as cold standby.
Go/No-Go Gates (examples)
Any 5xx > 0.3% over 5m? No-Go.
p95 delta > +25% vs. baseline for 10m? No-Go.
CDC lag > 5s sustained > 3m? No-Go.
Checkout/auth KPI delta worse than -1% for 10m? No-Go.
Rollback Plan (rehearsed)
Set LB/mesh weights to 100% Blue, 0% Green (a Route53 weight-flip sketch follows this list).
Re-enable single-write to Blue in the app flag.
Keep CDC flowing from Blue -> Green to avoid split-brain.
Page the DBA to verify replication health and reconcile any in-flight dual writes by idempotency key.
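If Route53 weights are one of your routing knobs, the flip back to Blue should be a script you have already rehearsed, not a console click. A sketch with boto3 against weighted alias records like the ones shown earlier; the zone IDs and ALB DNS names are placeholders:
import boto3

route53 = boto3.client("route53")

def weighted_alias(identifier: str, weight: int, alb_dns: str) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": identifier,       # "blue" / "green"
            "Weight": weight,
            "AliasTarget": {
                "HostedZoneId": "Z_ALB_ZONE",  # the ALB's canonical hosted zone (placeholder)
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,
            },
        },
    }

# Rollback: everything back to Blue, nothing to Green.
route53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",  # your zone ID (placeholder)
    ChangeBatch={"Changes": [
        weighted_alias("blue", 100, "blue-alb.us-east-1.elb.amazonaws.com"),
        weighted_alias("green", 0, "green-alb.us-east-1.elb.amazonaws.com"),
    ]},
)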
Rollback is a product. Treat it like one: code, test, runbook, owner.
Tools: Istio/Envoy weight flips, Route53 weighted records, AWS ALB target group stickiness, feature flags, runbooks in Backstage/Confluence.
Aftercare: Remove Training Wheels and Pay Down the Debt
You’re not done until the old stack is boring to delete.
Remove dual writes and shadow reads behind flags. Delete dead code. Kill the toggle debt.
Decommission Blue: DB replicas, instances, LBs, DNS, firewall rules. Tag-orphan scan with IaC (sketch after this list).
Cost check: right-size Green now that it’s proven. Turn off overprovisioned nodes and over-replicated storage.
Observability upkeep: archive dashboards and alerts for Blue. Keep Green SLOs living in on-call.
Retro: blameless, with concrete changes to the checklist and runbook. Update runbooks and architecture docs.
Security: rotate secrets used during migration, revoke temporary access, update threat models.
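Before deleting anything, list what still carries the old stack’s tags so nothing orphaned survives the teardown. A sketch using the AWS Resource Groups Tagging API, assuming Blue resources are tagged stack=blue (swap in your own tagging scheme):
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

# Page through everything still tagged for the Blue stack.
paginator = tagging.get_paginator("get_resources")
pages = paginator.paginate(TagFilters=[{"Key": "stack", "Values": ["blue"]}])

leftovers = [
    resource["ResourceARN"]
    for page in pages
    for resource in page["ResourceTagMappingList"]
]
print(f"{len(leftovers)} Blue-tagged resources still exist:")
for arn in leftovers:
    print(" ", arn)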
Tools: Terraform drift detection, cost (Infracost, CloudZero), OPA policies, AWS Config, kube-downscaler.
If this sounds like the way you want to operate, this is literally what we do at GitPlumbers. We come in, map the blast radius, build the parallel track, wire CDC, and run the cutover with your team so no one has to play hero at 3 a.m.
Key takeaways
- Define SLOs and a hard abort plan before touching anything.
- Stand up a parallel stack with config parity and traffic shadowing.
- Handle data with CDC + backfill + dual writes; verify with checksums and shadow reads.
- Shift traffic gradually with canaries; keep DNS TTL low and use connection draining.
- Gate every phase with measurable thresholds and a rollback you’ve already rehearsed.
- Instrument everything: golden signals, lag, error budgets, and business KPIs.
- Finish the job: remove dual writes, decommission safely, and cut hardware/cloud waste.
Implementation checklist
- Lock SLOs and success/abort criteria with business stakeholders.
- Map dependencies (DBs, caches, queues, cron, third parties).
- Stand up blue/green infra with IaC and GitOps, freeze config drift.
- Wire CDC, backfill, and dual writes; add idempotency keys.
- Build dashboards with golden signals + lag + business KPIs.
- Shadow traffic, then canary with automated rollback.
- Execute cutover runbook with 15–30 min Go/No-Go gates.
- Aftercare: remove toggles, decommission old stack, and run a blameless retro.
Questions we hear from teams
- How do we avoid split-brain during dual writes?
- Keep one authoritative write path (Blue) until Green proves itself. Enable dual writes but never enable dual reads that alter state. Use idempotency keys for writes, keep CDC Blue->Green, and only flip write authority when you are ready to decommission Blue.
- What if our workload uses sticky sessions?
- Either migrate to stateless sessions first or scope canaries by session boundary and drain old sessions before increasing weights. A shared Redis session store can bridge the gap. Enable connection draining at the LB and set short session TTLs during the window.
- Is DNS switching reliable enough?
- Use DNS for coarse weighting and mesh/LB for fine-grained routing. Keep TTL low (30–60s), but rely on ALB/Istio weights and connection draining for precision. Test client caching assumptions ahead of time.
- Can we do this without a service mesh?
- Yes. Use Nginx/HAProxy/Envoy at the edge, ALB weighted target groups, and application flags. Mesh just gives you nicer knobs and telemetry.
- How long should we keep dual writes on?
- At least 24 hours of peak traffic with shadow reads and zero mismatches, plus a full business cycle if your data has daily/weekly quirks. Only then turn off dual writes and decommission the old store.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.