The Zero‑Downtime Migration Checklist We Actually Use in Production
A pragmatic, step-by-step playbook to cut over critical workloads without waking up pager duty or your CFO.
You don’t get points for being brave. You get points for never paging the on-call. Here’s the exact checklist we’ve used to move payments, auth, and serving paths with no customer-visible downtime. No fairy dust—just controlled blast radius, boring automation, and rollback you can actually trust.
The only migrations I regret were the ones we couldn’t roll back in one command.
We’ll assume Kubernetes + GitOps, but the patterns apply to VMs too. Tools referenced: Terraform, ArgoCD, Argo Rollouts, Istio, Prometheus, Debezium, PostgreSQL logical replication, k6, gor, OpenFeature/LaunchDarkly.
1) Baseline, Blast Radius, and Success Criteria
Before touching manifests, lock your targets. I’ve seen teams skip this and argue during the cutover whether a 1% error spike is “fine.” Don’t do that.
- Inventory: upstream callers, downstream stores, queues, third-party APIs, cron jobs, webhooks, feature flag dependencies.
- Traffic profile: peak RPS, p95/99 latency, payload sizes, read/write split, long-lived connections.
- SLOs and error budget: what’s acceptable during migration? Example: maintain 99.9% availability, p95 latency +15% max, 5xx < 0.3%.
- Kill switch: a single flag to route all user traffic back to the old path.
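The kill switch itself should be the most boring code in the migration: one flag, checked at the top of the routing decision. A minimal sketch, assuming the OpenFeature Node server SDK (the same client style used for dual-writes later); the orders.route_to_new_stack flag name and the two path functions are illustrative:
// Kill switch: one boolean flag decides which path serves a request.
// Flag name and path wiring are illustrative, not from the runbook.
import { OpenFeature } from '@openfeature/server-sdk';

type Order = { sku: string; qty: number };

const client = OpenFeature.getClient();

async function createViaOldStack(order: Order) {
  // existing path — call the current orders service
  return { path: 'old', order };
}

async function createViaNewStack(order: Order) {
  // new path — call the migrated service
  return { path: 'new', order };
}

export async function createOrder(order: Order) {
  // Defaulting to false means "old path" even if the flag provider is unreachable.
  const useNewStack = await client.getBooleanValue('orders.route_to_new_stack', false);
  return useNewStack ? createViaNewStack(order) : createViaOldStack(order);
}
Turning that one flag off is the whole rollback story for the serving path, which is exactly what you want during an incident.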
Sample PromQL you can paste into Grafana and alerts:
# Error rate (5xx) over 5m per service
sum(rate(http_requests_total{service="orders",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="orders"}[5m]))
# p95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="orders"}[5m])) by (le))
# Saturation (CPU) for target deployment
sum(rate(container_cpu_usage_seconds_total{namespace="prod",pod=~"orders-.*"}[5m]))
/
sum(kube_pod_container_resource_limits{namespace="prod",pod=~"orders-.*",resource="cpu"})

Checkpoint: Document SLOs, KPIs, and rollback criteria in the runbook. If a VP asks, you should be able to point to a line that says, “Rollback if 5xx > 0.3% for 2 minutes.”
2) Rehearsal Environment and Traffic Replay
Zero-downtime happens because you already did it yesterday in rehearsal.
- Provision a prod-like env with Terraform and sync apps with ArgoCD.
terraform workspace select staging
terraform apply -var-file=staging.tfvars
argocd app create orders --repo https://github.com/acme/ops --path apps/orders --dest-namespace staging --dest-server https://kubernetes.default.svc
argocd app sync orders

- Shadow real traffic from prod into staging. With Istio:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts: ["orders.svc.cluster.local"]
  http:
    - route:
        - destination: { host: orders.prod }
      mirror: { host: orders.staging }
      mirrorPercentage: { value: 100.0 }

Or use gor to mirror from an ingress:
sudo gor --input-raw :80 --output-http http://orders.staging.svc.cluster.local

- Synthetic load with k6 to hit edge cases:
// k6 script (orders.js)
import http from 'k6/http';
import { sleep, check } from 'k6';
export let options = { vus: 100, duration: '10m' };
export default function () {
const res = http.post('https://api.example.com/orders', JSON.stringify({sku: 'ABC', qty: 1}), { headers: { 'Content-Type': 'application/json' }});
check(res, { 'status is 200': (r) => r.status === 200 });
sleep(0.1);
}

Checkpoint: staging handles peak RPS at p95 within +10% of prod; error rate under 0.1%. If not, fix before you touch prod.
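One way to make that checkpoint self-enforcing: encode it as k6 thresholds so the rehearsal run fails on its own instead of someone eyeballing Grafana. A sketch extending the options block above; the 250ms p95 number is a placeholder for whatever “within +10% of prod” works out to for you:
// k6 thresholds: fail the rehearsal run when the checkpoint is breached
export let options = {
  vus: 100,
  duration: '10m',
  thresholds: {
    http_req_failed: ['rate<0.001'],   // error rate under 0.1%
    http_req_duration: ['p(95)<250'],  // p95 budget in ms; placeholder value
  },
};
A non-zero exit code from k6 is easy to wire into CI, which keeps “rehearsal passed” from being a matter of opinion.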
3) Data Plan: Expand/Contract, Backfill, Dual Writes
Downtime almost always hides in the data layer. The move that saves you: expand/contract + CDC.
- Expand: add new schema without breaking old code.
- Backfill: move data incrementally.
- Dual-write: write to both old and new behind a flag.
- Flip reads: switch read path.
- Contract: remove old fields/tables after a quiet period.
Example: adding shipping_zone to orders and moving to a new service.
Expand DDL:
-- Expand phase: add nullable/optional structures first
ALTER TABLE orders ADD COLUMN shipping_zone TEXT NULL;
CREATE TABLE orders_v2 (
id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
sku TEXT NOT NULL,
qty INT NOT NULL,
shipping_zone TEXT,
created_at TIMESTAMP NOT NULL
);

CDC backfill with Debezium (Kafka → consumer):
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "pg-primary",
"database.port": "5432",
"database.user": "debezium",
"database.password": "*****",
"database.dbname": "app",
"plugin.name": "pgoutput",
"table.include.list": "public.orders",
"slot.name": "orders_slot"
}
}

App dual-write guarded by a flag (OpenFeature shown):
// Node/TypeScript snippet
const flag = await client.getBooleanValue('orders.dual_write_v2', false);
await oldRepo.save(order);
if (flag) await newRepo.save(orderV2);

Flip reads behind a flag, then let both systems run in parallel for days. Verify row counts and sampled deep diffs:
SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM orders_v2;
-- Sampled consistency check
SELECT o.id FROM orders o
LEFT JOIN orders_v2 v ON o.id = v.id
WHERE (v.id IS NULL OR v.shipping_zone IS DISTINCT FROM o.shipping_zone)
AND random() < 0.001;

Checkpoint: backfill lag < 5s, consistency mismatches < 0.1% sampled, dual-write latency impact < 5ms p95.
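Flipping reads is the mirror image of the dual-write. A minimal sketch, assuming the same OpenFeature client and repositories as the snippet above and the orders.read_from_v2 flag that shows up later in the cutover; the fall-through-on-miss behavior and the findById method name are assumptions, not part of the original plan:
// Read-path flip behind a flag. Reuses `client`, `oldRepo`, `newRepo`
// from the dual-write snippet; method names are illustrative.
async function getOrder(id: number) {
  const readFromV2 = await client.getBooleanValue('orders.read_from_v2', false);
  if (readFromV2) {
    const order = await newRepo.findById(id);
    if (order) return order; // fall through to the old store while backfill catches up
  }
  return oldRepo.findById(id);
}
Because the flag is evaluated per request, switching reads back is instant and doesn’t require a deploy.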
4) Traffic Control: Progressive Delivery That Rolls Back Itself
Use the mesh to steer traffic and a rollout controller to automagically revert if SLOs degrade.
Argo Rollouts canary with Prometheus analysis:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: orders
spec:
  strategy:
    canary:
      canaryService: orders-canary
      stableService: orders-stable
      trafficRouting:
        istio:
          virtualService:
            name: orders
            routes: [ primary ]
      steps:
      - setWeight: 1
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 5
      - pause: { duration: 180 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 20
      - pause: { duration: 300 }
      - analysis:
          templates:
          - templateName: orders-slo
      - setWeight: 50
      - pause: { duration: 300 }
      - setWeight: 100

AnalysisTemplate gating on 5xx and p95:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: orders-slo
spec:
  metrics:
  - name: errors
    interval: 60s
    count: 2
    failureLimit: 1
    # Fail the step if the canary 5xx ratio exceeds 0.3%
    successCondition: result[0] <= 0.003
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{service="orders-canary",status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="orders-canary"}[2m]))
  - name: latency
    interval: 60s
    count: 3
    failureLimit: 1
    # Fail the step if canary p95 latency exceeds 250ms
    successCondition: result[0] <= 0.25
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="orders-canary"}[2m])) by (le))

If metrics breach, Rollouts sets it back to stable. No heroics required.
Checkpoint: automation can move traffic 1% → 100% with zero manual edits; rollback proven in rehearsal.
5) Observability, Runbooks, and Circuit Breakers
You can’t fix what you can’t see. Make failure boring and reversible.
- Dashboards: one panel per KPI; overlay rollout weights on graphs.
- Alerts: only on migration KPIs with page-to-ack < 5 min. Everything else is muted during the window.
- Golden signals: latency, traffic, errors, saturation. Add business KPIs (checkout success rate, auth success).
- Circuit breakers: set upstream timeouts/retries and fallback paths.
Envoy/Istio circuit breaker example:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        http1MaxPendingRequests: 1000
        maxRequestsPerConnection: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s

Runbook skeleton:
- Preconditions checked (SLOs, backfill, flags, alerts muted, paging rotation aware)
- Sync and watch: argocd app sync orders and kubectl argo rollouts get rollout orders -w
- Hold points after 5%, 20%, 50% with sign-off in Slack change thread
- Rollback command documented and tested
Checkpoint: a new SRE can run the play without a senior whispering over Zoom.
6) Cutover Procedure (The Hour You’ll Remember)
This is the sequence we used moving a card authorization path at a fintech. We kept 99.95% availability; the only spike was a transient +12% p95 for 3 minutes at 50% canary—well within budget.
- Freeze: change freeze for non-migration repos; mute noisy alerts.
- Dark launch: deploy new stack at 0% traffic; warm caches.
- Shadow traffic on: compare responses for sampled requests; log diffs only (a diff-logging sketch follows the rollback commands below).
- Dual writes on: enable orders.dual_write_v2 at a 10% cohort (e.g., internal users) for 30 minutes.
- Start rollout: 1% → 5% → 20% → 50% with automated analysis gates. Hold at 50% for 10–15 minutes; run queries:
-- Hot checks
SELECT COUNT(*) FROM orders WHERE created_at > now() - interval '10 minutes';
SELECT COUNT(*) FROM orders_v2 WHERE created_at > now() - interval '10 minutes';

- Flip reads: enable the orders.read_from_v2 flag for canary pods only; verify parity metrics.
- Go 100%: complete rollout; keep dual writes for 24–72 hours depending on risk.
- Watch like a hawk: error budgets, business KPIs; have a human approve after 30 minutes to declare success.
- Rollback triggers: any KPI breach beyond thresholds for >2 minutes or on-call gut feel. Rollback commands:
# Traffic rollback: abort the canary and shift traffic back to stable
kubectl argo rollouts abort orders
# Feature flag rollback (pseudo-commands; use your flag provider's CLI/API)
openfeature set orders.read_from_v2=false
openfeature set orders.dual_write_v2=false

Checkpoint: time from “breach detected” to “traffic stable on old path” < 2 minutes.
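For the “log diffs only” step above, here is a minimal sketch of an app-level response comparator. The URLs, the 1% sample rate, and the field-by-field comparison are all illustrative; if you mirror at the mesh instead, the same diff logic lives in the receiving service or an offline job over logged responses:
// Shadow comparison: serve from the old path, mirror a sample of requests
// to the new path, and log field-level diffs only. All names are assumptions.
const OLD_URL = 'http://orders-stable.prod.svc.cluster.local';
const NEW_URL = 'http://orders-canary.prod.svc.cluster.local';
const SAMPLE_RATE = 0.01;

async function createOrderShadowed(order: { sku: string; qty: number }) {
  const init = {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(order),
  };
  const oldResp = await fetch(`${OLD_URL}/orders`, init);
  const oldBody = await oldResp.clone().json();

  if (Math.random() < SAMPLE_RATE) {
    // Fire-and-forget so the shadow call can never slow down the real path.
    fetch(`${NEW_URL}/orders`, init)
      .then((newResp) => newResp.json())
      .then((newBody) => {
        const diffs = Object.keys(oldBody).filter(
          (k) => JSON.stringify(oldBody[k]) !== JSON.stringify(newBody[k]),
        );
        if (diffs.length > 0) console.warn('shadow diff', { fields: diffs });
      })
      .catch((err) => console.warn('shadow call failed', err));
  }
  return oldResp; // the customer only ever sees the old path's response
}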
7) Post-Cutover Hardening and Cleanup
You’re not done until the old path is boring again.
- Soak period: 24–72 hours. Keep dual writes, sample diffs nightly.
- Contract: remove old schema after N days without diffs.
-- Contract phase after soak
ALTER TABLE orders DROP COLUMN shipping_zone; -- if moved fully to v2
DROP TABLE orders_legacy;

- Kill flags: migrate from dynamic to static config; delete flags to avoid accidental toggles.
- Cost and perf: scale down old infra; right-size new autoscaling based on observed p95/99.
- Postmortem: even if it went fine. Capture “we got lucky” items; add chaos tests.
Checkpoint: no dual-writes, no dangling infra, and a PR that removes every migration flag.
8) Traps I’ve Seen (So You Don’t)
- Long-lived connections: gRPC streams or WebSockets pin to old pods. Use drain policies and connection max age.
- Sticky sessions: ELB/NLB with stickiness keep sending users to old stack. Disable or shorten TTL during cutover.
- Third-party rate limits: shadow traffic can double call volume—throttle or stub.
- Idempotency: dual-writes can duplicate side effects. Require idempotency keys (see the sketch after this list).
- Token scopes and JWKs: auth migrations fail on unnoticed JWK rotation. Pre-distribute keys and cache TTLs.
- Cross-DC latency: backfills across regions will surprise you. Compress payloads, batch, and run close to source.
- Clock skew: CDC and dedupe logic break with skew. Enforce NTP across fleets.
If two or more of these smell like your stack, make the rehearsal nastier: chaos test the exact failure you fear.
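On the idempotency trap, a minimal sketch of a key-guarded write on the new path, assuming node-postgres and an idempotency_key column with a UNIQUE constraint — that column is not in the orders_v2 DDL shown earlier, so treat it as part of your expand phase:
// Idempotent write on the new path: the same idempotency key can be retried
// (or dual-written twice) without duplicating side effects.
// Assumes an idempotency_key column with a UNIQUE constraint.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function saveOrderV2Idempotent(order: {
  id: number;
  userId: number;
  sku: string;
  qty: number;
  shippingZone: string | null;
  idempotencyKey: string;
}) {
  // ON CONFLICT DO NOTHING turns a duplicate write into a no-op.
  await pool.query(
    `INSERT INTO orders_v2 (id, user_id, sku, qty, shipping_zone, created_at, idempotency_key)
     VALUES ($1, $2, $3, $4, $5, now(), $6)
     ON CONFLICT (idempotency_key) DO NOTHING`,
    [order.id, order.userId, order.sku, order.qty, order.shippingZone, order.idempotencyKey],
  );
}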
What “Good” Looks Like (Receipts)
- Rollout duration: 45–90 minutes with three hold points.
- Max p95 delta: < 15% during canary; returns to baseline post cutover.
- 5xx rate: < 0.2% during entire window.
- Rollback test: executed at least once in rehearsal and once in production dry-run at 1%.
- Cleanup PR merged within 7 days.
We’ve done this at retailers on Black Friday traffic and fintechs during market open. The secret isn’t a magic tool; it’s respecting the data, rehearsing like you mean it, and making rollback the easiest button in the room.
If you want a second set of eyes or someone to run the game day, GitPlumbers lives for this stuff.
Key takeaways
- Define SLOs and guardrail metrics before you touch a single kube manifest.
- Rehearse with production-like traffic via shadowing and synthetic load; automate the rollback path.
- Use expand/contract for data, dual-writes behind flags, and backfill with change data capture.
- Gate cutovers with Prometheus-based analysis and progressive delivery (Argo Rollouts/Istio).
- Treat the migration as a product release: runbooks, comms plan, rollback checkpoint, and postmortem.
Implementation checklist
- Inventory dependencies and define blast radius; set success criteria and error budget.
- Create prod-like rehearsal env via Terraform/ArgoCD; shadow traffic with gor or mesh traffic mirroring.
- Design data plan: expand/contract schema, backfill with CDC (Debezium/DMS), dual-writes behind flags.
- Instrument KPIs: p95 latency, 5xx rate, saturation; wire PromQL into automated analysis.
- Implement progressive delivery: 1%-5%-20%-50%-100% canary with Argo Rollouts or Istio routing.
- Prepare rollback: versioned artifacts, database compatibility window, traffic switch back.
- Execute cutover runbook with hold points and sign-offs; verify data consistency.
- Clean up: kill flags, contract schema, remove old infra, and run a blameless review.
Questions we hear from teams
- How do I handle stateful sessions during migration?
- Eliminate stateful sessions by externalizing session state (Redis/Memcached) and using short TTLs. During cutover, reduce stickiness TTL to seconds or disable it. Drain old pods with connection max age and preStop hooks to close sessions gracefully.
- What if my database can’t support CDC?
- Use native logical replication where possible (Postgres, MySQL). If not, schedule chunked backfills with application-level version markers and strictly idempotent writes. Expect a longer soak period and consider a brief read-only window for truly hard edges (rare).
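A minimal sketch of that kind of chunked, idempotent backfill, assuming node-postgres and the orders/orders_v2 tables from earlier; the batch size, throttle, and upsert policy are illustrative:
// Chunked backfill: walk the old table by primary key, upsert into the new
// table, and keep a cursor so the job can resume after interruption.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const BATCH_SIZE = 5000;

async function backfillOrdersV2(startAfterId = 0) {
  let cursor = startAfterId;
  for (;;) {
    const { rows } = await pool.query(
      `SELECT id, user_id, sku, qty, shipping_zone, created_at
         FROM orders
        WHERE id > $1
        ORDER BY id
        LIMIT $2`,
      [cursor, BATCH_SIZE],
    );
    if (rows.length === 0) break; // caught up

    // Idempotent upsert: re-running a chunk is safe.
    for (const r of rows) {
      await pool.query(
        `INSERT INTO orders_v2 (id, user_id, sku, qty, shipping_zone, created_at)
         VALUES ($1, $2, $3, $4, $5, $6)
         ON CONFLICT (id) DO UPDATE SET shipping_zone = EXCLUDED.shipping_zone`,
        [r.id, r.user_id, r.sku, r.qty, r.shipping_zone, r.created_at],
      );
    }
    cursor = rows[rows.length - 1].id;
    // Throttle so the backfill never competes with production traffic.
    await new Promise((resolve) => setTimeout(resolve, 200));
  }
}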
- Can I skip feature flags and just deploy the new version?
- You can, but then rollback is slow and blunt. Flags make read/write paths and behaviors independently switchable, which shortens MTTR dramatically. If you hate vendor lock-in, use OpenFeature with your provider of choice.
- How do I prove readiness to leadership?
- Share rehearsal metrics: peak RPS sustained, p95/99 latencies within budget, automated rollback firing in staging, data parity reports with mismatch rates, and the exact runbook with decision points and owners.
- What’s the minimum toolset to pull this off?
- Kubernetes or a reliable orchestrator, GitOps (ArgoCD/Flux), traffic shaping (Istio/NGINX), observability (Prometheus/Grafana), and a feature flag system. Everything else is optimization.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
