Zero-Downtime or Bust: The Migration Checklist I Trust for Payments, Search, and Auth
A pragmatic, step-by-step runbook with gates, metrics, and tooling to move critical workloads without a blip.
I don’t ship ‘no-risk’ migrations. I ship migrations with tiny, reversible risks and hard gates you can bet your SLOs on.
The situation you’ve lived through
You’ve got a payments service or auth gateway that hasn’t had a minute of downtime in two years. The CFO says “no maintenance window.” You’re moving to a new cluster, a new DB version, or a new region. I’ve done these at fintechs and marketplaces where a 5‑minute blip meant a PR crisis and a chargeback wave. This is the checklist I use when failure isn’t an option.
Zero downtime means SLO-neutral. You can have a handful of 5xxs. You cannot breach the SLO or burn the error budget for the window.
Non‑negotiables and guardrails
- Written SLOs: e.g., 99.9% availability, p95 < 300ms, 5xx rate < 0.2%. Define acceptable regression during migration: p95 +10% max, 5xx delta ≤ 0.1% vs baseline.
- Rollback must be one action: argo rollouts abort, kubectl rollout undo, or feature flag off. If rollback takes a war room, you don’t have rollback.
- Data is king: migrate data safely before traffic. Use expand/contract, idempotent backfills, dual-writes behind flags, and CDC validation.
- Automate gates: Promotions should be blocked by Prometheus queries, not gut feel.
- Practice: Rehearse with synthetic load and chaos. If you haven’t run it cold, you haven’t run it.
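A quick sanity check before you negotiate gates: know how many minutes of full outage the budget actually contains, and decide up front what slice the migration may burn. A back-of-the-envelope sketch; the 99.9%/30-day figures come from the SLO example above, and the 25% allowance is an assumption to adjust.
# Error-budget arithmetic for the migration window (assumed numbers)
SLO=99.9
WINDOW_MINUTES=$((30 * 24 * 60))                       # 30-day window
awk -v slo="$SLO" -v win="$WINDOW_MINUTES" 'BEGIN {
  budget = win * (100 - slo) / 100                     # ~43.2 min at 99.9%
  printf "total budget: %.1f min, migration allowance (25%%): %.1f min\n", budget, budget * 0.25
}'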
Pre‑flight: inventory, SLOs, blast radius
- Inventory dependencies: external APIs, DBs, caches, queues, secrets, feature flags, webhooks. Map timeouts and retries.
- Set SLO guardrails just for the migration window. Capture a 7‑day baseline for p95, p99, 5xx, Apdex, queue lag, and GC time.
- Reduce DNS/ingress TTL to 30s at least 48h before. For CloudFront/ALB, set connection draining: deregistration_delay = 30.
- Freeze risky deploys. Only migration PRs allowed.
- Dry‑run in staging with realistic data. Use k6 and seed with prod-shaped payloads.
# Example k6 smoke for a payment API
k6 run -e BASE_URL=https://staging.payments.example.com load.js

// load.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = { vus: 50, duration: '10m', thresholds: {
http_req_failed: ['rate<0.01'], http_req_duration: ['p(95)<300'] } };
export default function () {
const res = http.post(`${__ENV.BASE_URL}/charge`, JSON.stringify({ amount: 4200, currency: 'USD' }), { headers: { 'Content-Type': 'application/json' } });
check(res, { 'status is 200/202': (r) => [200,202].includes(r.status) });
sleep(0.5);
}

Checkpoint: Baselines recorded, TTL reduced, smoke tests passing, dry run completed with SLOs intact.
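Two of those checkpoint items are cheap to verify from a shell before you tick them. A minimal sketch, assuming Prometheus is reachable at $PROM_URL, jq is installed, and payments.example.com stands in for your real hostname:
# Pre-flight spot checks: recorded p95 baseline and lowered DNS TTL
BASELINE_QUERY='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments"}[7d])) by (le))'
curl -fsS --data-urlencode "query=$BASELINE_QUERY" "$PROM_URL/api/v1/query" \
  | jq -r '"7d p95 baseline: \(.data.result[0].value[1])s"'
# Second column of the answer section is the TTL resolvers actually hand out
dig +noall +answer payments.example.com A | awk '{print "DNS TTL:", $2 "s"}'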
Data‑first: schema, backfills, dual‑writes, CDC
- Expand migration: Add new columns/tables/indexes first, non‑breaking.
-- Flyway/Liquibase expand step
ALTER TABLE payments ADD COLUMN processor_ref TEXT NULL; -- new world needs this
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_processor_ref ON payments(processor_ref);

- Idempotent backfill: Run in small batches, resume on failure, track checkpoints.
#!/usr/bin/env bash
# backfill.sh
set -euo pipefail
BATCH=1000
LAST_ID=${LAST_ID:-0}
while true; do
  # Postgres UPDATE has no ORDER BY/LIMIT, so select the batch in a subquery
  psql "$DATABASE_URL" -tA -c "UPDATE payments SET processor_ref = 'legacy:'||id WHERE id IN (SELECT id FROM payments WHERE id > $LAST_ID AND processor_ref IS NULL ORDER BY id ASC LIMIT $BATCH) RETURNING id" > ids.txt
[ ! -s ids.txt ] && break
LAST_ID=$(tail -n1 ids.txt)
echo "checkpoint=$LAST_ID"
done

- Dual‑writes behind a flag: Use LaunchDarkly or OpenFeature to guard the second write.
// pseudocode in TypeScript (NestJS/Express)
if (flags.isEnabled('dual_write_processor')) {
  const [legacy, fresh] = await Promise.allSettled([
    legacyClient.charge(req),
    newProcessor.charge(req),
  ]);
  // Legacy stays the source of truth: its failure still fails the request.
  if (legacy.status === 'rejected') throw legacy.reason;
  // New-path failures only emit a drift signal, never a 5xx to the caller.
  if (fresh.status === 'rejected') logger.warn('dual_write_failed', fresh.reason);
} else {
  await legacyClient.charge(req);
}

- CDC validation: Use Debezium to stream both sources into Kafka; compare record counts, hashes, or key invariants.
# Debezium connector (Postgres) snippet
name: payments-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: db
  database.user: debezium
  table.include.list: public.payments

Checkpoint: Expand done, backfill completed within error budget, CDC parity ≥ 99.99%, dual‑writes on for 24–72h with no drift alerts.
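The parity number in that checkpoint has to come from somewhere. Besides comparing Debezium topics, a blunt but effective cross-check is to hash both stores directly; a sketch assuming both sides are Postgres and reachable via the hypothetical $OLD_DB and $NEW_DB URLs (for very large tables, run the hash over keyed ranges rather than the whole table):
# Row-count and content-hash parity between the legacy and new payment stores
set -euo pipefail
for q in \
  "SELECT count(*) FROM payments" \
  "SELECT md5(string_agg(id::text || ':' || coalesce(processor_ref, ''), ',' ORDER BY id)) FROM payments"
do
  old=$(psql "$OLD_DB" -tA -c "$q")
  new=$(psql "$NEW_DB" -tA -c "$q")
  if [ "$old" = "$new" ]; then echo "OK    $q"; else echo "DRIFT $q -> $old vs $new"; fi
done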
Traffic strategy: shadow → canary → blue/green
- Shadow (traffic mirroring): Send read‑only copies to the new stack; discard responses. Watch p95, p99, 5xx, GC, and resource saturation.
# Istio VirtualService mirror example
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: payments }
spec:
  hosts: ["payments.svc.cluster.local"]
  http:
  - route:
    - destination: { host: payments-v1 }
    mirror: { host: payments-v2 }
    mirrorPercentage: { value: 100.0 }

- Canary with gates: Use Argo Rollouts or Flagger to increment traffic only when Prometheus queries pass.
# Argo Rollouts canary with Prometheus analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: payments }
spec:
  strategy:
    canary:
      steps:
      - setWeight: 1
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: p95-latency
          args:
          - name: threshold
            value: "0.10" # +10%
      - setWeight: 10
      - pause: { duration: 300 }
      - setWeight: 30
      - pause: { duration: 600 }
      - setWeight: 50
  selector: { matchLabels: { app: payments } }
  template:
    metadata: { labels: { app: payments } }
    spec:
      containers:
      - name: svc
        image: registry/payments:v2.3.1
        readinessProbe: { httpGet: { path: /healthz, port: 8080 }, periodSeconds: 5 }

- Blue/green: Keep the old stack as hot standby; the final switch is instantaneous and reversible.
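While the analysis steps run, it helps to watch the rollout from a terminal rather than refreshing dashboards. If you drive Rollouts from the CLI, the kubectl argo rollouts plugin gives you a live view and a manual promote:
# Follow weights and analysis runs live, then promote past a pause once gates look good
kubectl argo rollouts get rollout payments --watch
kubectl argo rollouts promote payments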
Checkpoint: Shadow stable, canary to 50% with no SLO gate violations, blue/green ready.
The cutover runbook (T‑60 to T+30)
Roles: TL (Decider), SRE (Observability), App Lead (Release), DB Lead (Data), Comms (Status updates).
- T‑60: Announce start in Slack/Statuspage. Silence non‑critical alerts. Confirm error budget available.
- T‑50: Enable circuit breakers/timeouts in the mesh (Istio, Envoy).
# Envoy circuit breaker snippet via Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: { name: payments }
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s

- T‑45: Freeze batch jobs and pause noisy consumers to reduce churn.
- T‑40: Start shadow at 100%, verify resource headroom ≥ 30% CPU/mem (see the spot checks after this list).
- T‑30: Start canary at 1% → 10% with Argo Rollouts. Gates: p95 +10%, 5xx delta ≤ 0.1%, queue lag < 2x baseline.
argo rollouts promote payments # if analysis passes

- T‑15: Canary to 50%, keep dual‑writes on. Monitor DB replica lag (pg_stat_replication or aws rds describe-db-instances; see the spot checks after this list).
- T‑10: Flip to green (100%). Keep blue hot for rollback.
- T‑5: Drain connections on old target.
# ALB connection draining (Terraform)
resource "aws_lb_target_group" "payments" {
  deregistration_delay = 30
}

- T‑0: Verify business KPIs (auth success rate, charge approval rate). If any gate trips, execute rollback immediately.
- T+10: Rotate on-call back to normal alerts. Keep dual‑writes and CDC checks running.
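A few spot checks that back the T‑45/T‑40/T‑15 items above. The job, deployment, and namespace names are placeholders; the Postgres query assumes version 10+ for replay_lag, and on RDS the lag number itself lives in the CloudWatch ReplicaLag metric:
# T-45: suspend batch jobs and park noisy consumers (hypothetical names)
kubectl patch cronjob nightly-reconciliation -p '{"spec":{"suspend":true}}'
kubectl scale deployment payments-reconciler --replicas=0
# T-40: confirm >= 30% headroom on the target nodes (requires metrics-server)
kubectl top nodes
kubectl top pods -l app=payments
# T-15: replica lag on the Postgres primary, in seconds
psql "$DATABASE_URL" -tA -c \
  "SELECT application_name, coalesce(extract(epoch from replay_lag), 0) FROM pg_stat_replication"
# RDS flavour: confirm replicas are present and healthy
aws rds describe-db-instances \
  --query 'DBInstances[].{id:DBInstanceIdentifier,status:DBInstanceStatus,replicaOf:ReadReplicaSourceDBInstanceIdentifier}' \
  --output table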
Checkpoint: Green at 100%, blue idling, SLOs within thresholds for 30 minutes, no backlog growth.
Observability gates, rollback, and kill switches
- PromQL gates you can automate in Argo/Flagger:
# Canary 5xx ratio stays under 0.1%
(sum(rate(http_requests_total{status=~"5..",app="payments",version="canary"}[5m]))
/
sum(rate(http_requests_total{app="payments",version="canary"}[5m])))
< 0.001
# p95 latency regression (<10%)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="canary"}[5m])) by (le))
/
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="stable"}[1h])) by (le)) < 1.10
# Kafka consumer lag
max(kafka_consumergroup_lag{consumergroup="payments"}) < 1000

- Kill switches: Feature flags to disable dual‑writes, circuit breaker to eject bad upstreams, traffic weight back to 0.
# Rollback in one move
argo rollouts abort payments && ldctl flag set dual_write_processor false

- Tracing: Ensure OpenTelemetry spans include version labels; compare v1 vs v2 service maps in Grafana Tempo/Jaeger.
- Audit: Record decision timestamps and metrics for postmortem.
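Before handing the PromQL gates above to Argo or Flagger, fire them at Prometheus from the runbook shell so a typo doesn't stall the rollout mid-canary. A sketch assuming $PROM_URL and jq; real automation should also handle an empty result set:
# Evaluate a gate by hand and exit non-zero if it would block promotion
QUERY='histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="canary"}[5m])) by (le))'
P95=$(curl -fsS --data-urlencode "query=$QUERY" "$PROM_URL/api/v1/query" | jq -r '.data.result[0].value[1]')
echo "canary p95: ${P95}s"
awk -v v="$P95" 'BEGIN { exit (v < 0.300) ? 0 : 1 }' && echo "gate PASS" || echo "gate FAIL"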
Checkpoint: Rollback tested in staging; kill switches documented; on-call knows the single rollback command.
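"Rollback tested in staging" should mean someone has actually typed the commands under load, not just read them. A minimal drill, using the kubectl plugin spelling:
# Rollback drill: abort mid-canary, confirm traffic is back on stable, then re-arm
kubectl argo rollouts abort payments
kubectl argo rollouts get rollout payments        # expect 100% of the weight on the stable ReplicaSet
kubectl argo rollouts retry rollout payments      # clear the aborted state once the drill is done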
Aftercare, cleanup, and costs
- Hold dual‑writes and CDC for 24–48h. Alert on any divergence.
- Run the contract migration to drop old columns/indices only after the stability window.
-- Contract step (post-stability)
ALTER TABLE payments DROP COLUMN legacy_processor_id;
DROP INDEX CONCURRENTLY IF EXISTS idx_payments_legacy_processor_id;

- Turn off shadowing; scale down blue. Keep snapshots and backups per retention.
- Remove feature flags and dead code paths; update runbooks/CMDB.
- Review costs: unused capacity, cross‑AZ data transfer, oversized nodes. Rightsize autoscaling based on the new baseline.
- Lessons learned: what gates were too tight/loose, what runbook steps were unclear.
Checkpoint: Contract safely applied, blue torn down, flags removed, costs optimized, learnings shared.
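Before ticking "blue torn down", prove the old stack is actually idle. A sketch assuming the v1/v2 version labels from the tracing bullet and a hypothetical $BLUE_TG_ARN pointing at the old target group:
# Confirm blue is idle before teardown
QUERY='sum(rate(http_requests_total{app="payments",version="v1"}[30m]))'
curl -fsS --data-urlencode "query=$QUERY" "$PROM_URL/api/v1/query" \
  | jq -r '"requests/s still hitting v1: \(.data.result[0].value[1] // "0")"'
# Old ALB target group should be fully drained as well
aws elbv2 describe-target-health --target-group-arn "$BLUE_TG_ARN" \
  --query 'TargetHealthDescriptions[].TargetHealth.State'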
Tooling that’s earned its keep
- Infra: Terraform, ArgoCD (GitOps), Argo Rollouts or Flagger for progressive delivery.
- Mesh/ingress: Istio/Envoy, Nginx Ingress, AWS ALB/GCLB.
- Data: Flyway/Liquibase, Debezium for CDC, Kafka/Kinesis.
- Observability: Prometheus/Grafana, Alertmanager, OpenTelemetry, Jaeger/Tempo.
- Flags: LaunchDarkly, OpenFeature.
- Testing: k6 for load, Gremlin/Litmus for chaos.
What actually works: small, reversible steps gated by metrics, with data correctness proven before traffic moves. We’ve run this playbook at fintechs, marketplaces, and B2B SaaS without burning error budgets.
Key takeaways
- Zero downtime means SLO-neutral, not zero errors—set explicit gates and rollback criteria.
- Data moves first: use expand/contract, idempotent backfills, dual-writes behind a flag, and CDC validation.
- Shadow traffic → canary → blue/green is the safest progression; automate gates with Prometheus and Argo Rollouts.
- Write the cutover timeline like a pre-flight checklist and rehearse it with synthetic load.
- Rollback must be one command or one flag flip—practice it until it’s boring.
Implementation checklist
- Freeze risky changes; tag and branch the release to be migrated.
- Define SLOs and error budget gates for the migration window.
- Reduce DNS/ingress TTL to 30s at least 48h before cutover.
- Provision target infra with IaC (Terraform) and validate with smoke tests.
- Add schema via expand migration (non-breaking), deploy read-only first.
- Run idempotent backfill and verify consistency via CDC checks.
- Enable dual-writes behind a feature flag; validate row-level diffs for 24–72h.
- Mirror traffic (shadow) and measure p95 latency/5xx deltas <1%.
- Configure canary with automated rollback on SLO breach.
- Gate promotions on Prometheus queries; no manual eyeballing.
- Plan connection draining and session pinning (sticky sessions off or centralized).
- Freeze queues/batch jobs that could amplify risk during cutover.
- Run a dry run with synthetic load (k6) and chaos faults (Gremlin) in staging.
- Execute the T-60 to T+30 playbook with roles and timestamps.
- Have a one-click rollback (Argo Rollouts abort or feature flag off).
- Post-cutover monitor for 24–48h; keep dual-writes on with alerts.
- Contract migration (drop old columns/paths) only after stability window.
- Remove temporary flags and infra; update runbooks and CMDB.
- Audit cost, SLO impact, and lessons learned; share internally.
Questions we hear from teams
- What if my workload maintains sticky sessions?
- Centralize session state (Redis, DynamoDB) or temporarily reduce stickiness during cutover. Ensure both blue and green can read the same session store and validate serialization format compatibility.
- Can I skip dual-writes if I trust my backfill?
- You can, but I wouldn’t for critical paths. Dual-writes behind a flag buy you proof over time and a fast rollback if the new path corrupts data. Turn it off after 24–48h of clean CDC parity.
- How do I handle long-lived connections (WebSockets, gRPC)?
- Introduce max connection age and aggressive readiness probes. Use connection draining on LBs and graceful shutdown hooks. For gRPC, set max connection age/timeouts in the server and enforce retryable idempotent methods.
- We don’t use Kubernetes. Does this still apply?
- Yes. Replace Rollouts with Spinnaker or AWS CodeDeploy blue/green, Istio with ALB/Nginx canaries, and Prometheus gates with CloudWatch/Datadog monitors. The principles—data-first, progressive traffic, automated gates—stay the same.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
