Zero-Downtime or Bust: The Migration Checklist I Trust for Payments, Search, and Auth

A pragmatic, step-by-step runbook with gates, metrics, and tooling to move critical workloads without a blip.

I don’t ship ‘no-risk’ migrations. I ship migrations with tiny, reversible risks and hard gates you can bet your SLOs on.

The situation you’ve lived through

You’ve got a payments service or auth gateway that hasn’t had a minute of downtime in two years. The CFO says “no maintenance window.” You’re moving to a new cluster, a new DB version, or a new region. I’ve done these at fintechs and marketplaces where a 5‑minute blip meant a PR crisis and a chargeback wave. This is the checklist I use when failure isn’t an option.

Zero downtime means SLO-neutral. You can have a handful of 5xxs. You cannot breach the SLO or burn the error budget for the window.

Non‑negotiables and guardrails

  • Written SLOs: e.g., 99.9% availability, p95 < 300ms, 5xx rate < 0.2%. Define acceptable regression during migration: p95 +10% max, 5xx delta ≤ 0.1% vs baseline (guardrail alert sketched after this list).
  • Rollback must be one action: kubectl argo rollouts abort, kubectl rollout undo, or a feature flag flipped off. If rollback takes a war room, you don’t have rollback.
  • Data is king: migrate data safely before traffic. Use expand/contract, idempotent backfills, dual-writes behind flags, and CDC validation.
  • Automate gates: Promotions should be blocked by Prometheus queries, not gut feel.
  • Practice: Rehearse with synthetic load and chaos. If you haven’t run it cold, you haven’t run it.
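
The 5xx guardrail above can be encoded as an alert before the window opens, so a breach pages someone instead of waiting to be noticed. A minimal sketch, assuming the Prometheus Operator's PrometheusRule CRD and an http_requests_total metric labeled by app and status:
# Guardrail alert for the migration window (metric names assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: payments-migration-guardrails }
spec:
  groups:
  - name: migration-window
    rules:
    - alert: Migration5xxGuardrail
      # fire if the 5xx ratio exceeds the 0.2% guardrail for 2 minutes
      expr: |
        sum(rate(http_requests_total{app="payments",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{app="payments"}[5m])) > 0.002
      for: 2m
      labels: { severity: page }
      annotations:
        summary: "payments 5xx rate breached the migration guardrail"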

Pre‑flight: inventory, SLOs, blast radius

  1. Inventory dependencies: external APIs, DBs, caches, queues, secrets, feature flags, webhooks. Map timeouts and retries.
  2. Set SLO guardrails just for the migration window. Capture 7‑day baseline for p95, p99, 5xx, Apdex, queue lag, and GC time.
  3. Reduce DNS/ingress TTL to 30s at least 48h before (Route53 sketch after the k6 example below). For ALB target groups, set deregistration_delay = 30 so connections drain cleanly on deregistration.
  4. Freeze risky deploys. Only migration PRs allowed.
  5. Dry‑run in staging with realistic data. Use k6 and seed with prod-shaped payloads.
# Example k6 smoke for a payment API
k6 run -e BASE_URL=https://staging.payments.example.com load.js
// load.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = { vus: 50, duration: '10m', thresholds: {
  http_req_failed: ['rate<0.01'], http_req_duration: ['p(95)<300'] } };
export default function () {
  const res = http.post(`${__ENV.BASE_URL}/charge`, JSON.stringify({ amount: 4200, currency: 'USD' }), { headers: { 'Content-Type': 'application/json' } });
  check(res, { 'status is 200/202': (r) => [200,202].includes(r.status) });
  sleep(0.5);
}
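
For the TTL step in the pre‑flight list, a minimal Terraform sketch, assuming Route53 and a CNAME pointing at the load balancer (the zone variable, record name, and aws_lb reference are illustrative):
# Short TTL ahead of cutover so a DNS flip propagates quickly
resource "aws_route53_record" "payments" {
  zone_id = var.zone_id                  # illustrative
  name    = "payments.example.com"
  type    = "CNAME"
  ttl     = 30
  records = [aws_lb.payments.dns_name]   # illustrative
}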

Checkpoint: Baselines recorded, TTL reduced, smoke tests passing, dry run completed with SLOs intact.

Data‑first: schema, backfills, dual‑writes, CDC

  • Expand migration: Add new columns/tables/indexes first, non‑breaking.
-- Flyway/Liquibase expand step
ALTER TABLE payments ADD COLUMN processor_ref TEXT NULL; -- new world needs this
-- CONCURRENTLY cannot run inside a transaction; mark this migration non-transactional
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_payments_processor_ref ON payments(processor_ref);
  • Idempotent backfill: Run in small batches, resume on failure, track checkpoints.
#!/usr/bin/env bash
# backfill.sh
set -euo pipefail
BATCH=1000
LAST_ID=${LAST_ID:-0}
while true; do
  # Postgres UPDATE has no ORDER BY/LIMIT, so pick the batch in a subquery
  psql "$DATABASE_URL" -qtA -c "UPDATE payments SET processor_ref = 'legacy:'||id
    WHERE id IN (SELECT id FROM payments WHERE id > $LAST_ID AND processor_ref IS NULL ORDER BY id LIMIT $BATCH)
    RETURNING id" > ids.txt
  [ ! -s ids.txt ] && break
  LAST_ID=$(sort -n ids.txt | tail -n1)  # RETURNING rows are unordered; checkpoint on the max id
  echo "checkpoint=$LAST_ID"
done
  • Dual‑writes behind a flag: Use LaunchDarkly or OpenFeature to guard second write.
// pseudocode in TypeScript (NestJS/Express): the legacy write stays authoritative
const legacyCharge = legacyClient.charge(req);
if (flags.isEnabled('dual_write_processor')) {
  // new-path failures are logged/alerted, never surfaced to the caller
  newProcessor.charge(req).catch((err) => logger.warn('dual_write_failed', err));
}
return await legacyCharge; // only the legacy result decides the response
  • CDC validation: Use Debezium to stream both sources into Kafka; compare record counts, hashes, or key invariants.
# Debezium connector (Postgres) snippet
name: payments-cdc
config:
  connector.class: io.debezium.connector.postgresql.PostgresConnector
  database.hostname: db
  database.user: debezium
  table.include.list: public.payments
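
For the comparison itself, one option is bucketed digests computed on both sides and diffed; a minimal Postgres sketch (the bucket size is arbitrary):
-- Run against source and target, then diff the outputs; a mismatched bucket pinpoints drift
SELECT id / 1000000 AS bucket,
       count(*) AS row_count,
       md5(string_agg(id::text || ':' || coalesce(processor_ref, ''), ',' ORDER BY id)) AS digest
FROM payments
GROUP BY 1
ORDER BY 1;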

Checkpoint: Expand done, backfill completed within error budget, CDC parity ≥ 99.99%, dual‑writes on for 24–72h with no drift alerts.

Traffic strategy: shadow → canary → blue/green

  • Shadow (traffic mirroring): Send copies of live requests to the new stack and discard the responses. Mirror only idempotent/read routes, or point mirrored writes at a sandboxed processor, so shadow traffic can never double‑charge. Watch p95, p99, 5xx, GC, and resource saturation.
# Istio VirtualService mirror example
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: { name: payments }
spec:
  hosts: ["payments.svc.cluster.local"]
  http:
  - route:
    - destination: { host: payments-v1 }
    mirror: { host: payments-v2 }
    mirrorPercentage: { value: 100.0 }
  • Canary with gates: Use Argo Rollouts or Flagger to increment traffic only when Prometheus queries pass.
# Argo Rollouts canary with Prometheus analysis
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: payments }
spec:
  strategy:
    canary:
      steps:
      - setWeight: 1
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: p95-latency
          args:
          - name: threshold
            value: "0.10" # +10%
      - setWeight: 10
      - pause: { duration: 300 }
      - setWeight: 30
      - pause: { duration: 600 }
      - setWeight: 50
  selector: { matchLabels: { app: payments } }
  template:
    metadata: { labels: { app: payments } }
    spec:
      containers:
      - name: svc
        image: registry/payments:v2.3.1
        readinessProbe: { httpGet: { path: /healthz, port: 8080 }, periodSeconds: 5 }
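The rollout above references a p95-latency AnalysisTemplate; a minimal sketch of what it could look like, reusing the latency gate from the observability section below (the Prometheus address is illustrative):
# Hypothetical AnalysisTemplate backing the canary gate
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: p95-latency }
spec:
  args:
  - name: threshold                      # "0.10" = +10%, supplied by the Rollout
  metrics:
  - name: p95-regression
    interval: 60s
    failureLimit: 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # illustrative
        query: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="canary"}[5m])) by (le))
          /
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="stable"}[1h])) by (le))
    successCondition: "result[0] < 1 + {{args.threshold}}"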
  • Blue/green: Keep the old stack as hot standby; final switch is instantaneous and reversible.

Checkpoint: Shadow stable, canary to 50% with no SLO gate violations, blue/green ready.

The cutover runbook (T‑60 to T+30)

Roles: TL (Decider), SRE (Observability), App Lead (Release), DB Lead (Data), Comms (Status updates).

  1. T‑60: Announce start in Slack/Statuspage. Silence non‑critical alerts. Confirm error budget available.
  2. T‑50: Enable circuit breakers/timeouts in mesh (Istio, Envoy).
# Envoy circuit breaker snippet via Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: { name: payments }
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
  3. T‑45: Freeze batch jobs and pause noisy consumers to reduce churn.
  4. T‑40: Start shadow at 100%, verify resource headroom ≥ 30% CPU/mem.
  5. T‑30: Start canary at 1% → 10% with Argo Rollouts. Gates: p95 +10%, 5xx delta ≤ 0.1%, queue_lag < 2x baseline.
kubectl argo rollouts promote payments  # if analysis passes
  6. T‑15: Canary to 50%, keep dual‑writes on. Monitor DB replica lag (pg_stat_replication, or the CloudWatch ReplicaLag metric for RDS).
  7. T‑10: Flip to green (100%). Keep blue hot for rollback.
  8. T‑5: Drain connections on the old target (pod‑level drain sketch after this runbook).
# ALB connection draining (Terraform)
resource "aws_lb_target_group" "payments" {
  deregistration_delay = 30
}
  9. T‑0: Verify business KPIs (auth success rate, charge approval rate). If any gate trips, execute rollback immediately.
  10. T+10: Rotate on-call back to normal alerts. Keep dual‑writes and CDC checks running.
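To make the drain in step 8 stick at the pod level too, a minimal sketch of the termination settings (the 20s sleep is an assumption; size it to your deregistration delay):
# Keep serving while the LB deregisters, then shut down gracefully
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: svc
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 20"]   # delay SIGTERM until traffic stops arriving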

Checkpoint: Green at 100%, blue idling, SLOs within thresholds for 30 minutes, no backlog growth.

Observability gates, rollback, and kill switches

  • PromQL gates you can automate in Argo/Flagger:
# canary 5xx ratio (< 0.1% absolute; tighten to a delta vs stable if your baseline is noisy)
(sum(rate(http_requests_total{status=~"5..",app="payments",version="canary"}[5m]))
 /
 sum(rate(http_requests_total{app="payments",version="canary"}[5m])))
< 0.001

# p95 latency regression (<10%)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="canary"}[5m])) by (le))
/
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="payments",version="stable"}[1h])) by (le)) < 1.10

# Kafka consumer lag
max(kafka_consumergroup_lag{consumergroup="payments"}) < 1000
  • Kill switches: Feature flags to disable dual‑writes, circuit breaker to eject bad upstreams, traffic weight back to 0.
# Rollback in one move (then flip dual_write_processor off via your flag tool's UI/API)
kubectl argo rollouts abort payments
  • Tracing: Ensure OpenTelemetry spans include version labels; compare v1 vs v2 service maps in Grafana Tempo/Jaeger.
  • Audit: Record decision timestamps and metrics for postmortem.

Checkpoint: Rollback tested in staging; kill switches documented; on-call knows the single rollback command.

Aftercare, cleanup, and costs

  • Hold dual‑writes and CDC for 24–48h. Alert on any divergence.
  • Run the contract migration to drop old columns/indices only after the stability window.
-- Contract step (post-stability)
ALTER TABLE payments DROP COLUMN legacy_processor_id;
DROP INDEX CONCURRENTLY IF EXISTS idx_payments_legacy_processor_id;
  • Turn off shadowing; scale down blue. Keep snapshots and backups per retention.
  • Remove feature flags and dead code paths; update runbooks/CMDB.
  • Review costs: unused capacity, cross‑AZ data transfer, oversized nodes. Rightsize autoscaling based on the new baseline.
  • Lessons learned: what gates were too tight/loose, what runbook steps were unclear.

Checkpoint: Contract safely applied, blue torn down, flags removed, costs optimized, learnings shared.

Tooling that’s earned its keep

  • Infra: Terraform, ArgoCD (GitOps), Argo Rollouts or Flagger for progressive delivery.
  • Mesh/ingress: Istio/Envoy, Nginx Ingress, AWS ALB/GCLB.
  • Data: Flyway/Liquibase, Debezium for CDC, Kafka/Kinesis.
  • Observability: Prometheus/Grafana, Alertmanager, OpenTelemetry, Jaeger/Tempo.
  • Flags: LaunchDarkly, OpenFeature.
  • Testing: k6 for load, Gremlin/Litmus for chaos.

What actually works: small, reversible steps gated by metrics, with data correctness proven before traffic moves. We’ve run this playbook at fintechs, marketplaces, and B2B SaaS without burning error budgets.

Key takeaways

  • Zero downtime means SLO-neutral, not zero errors—set explicit gates and rollback criteria.
  • Data moves first: use expand/contract, idempotent backfills, dual-writes behind a flag, and CDC validation.
  • Shadow traffic → canary → blue/green is the safest progression; automate gates with Prometheus and Argo Rollouts.
  • Write the cutover timeline like a pre-flight checklist and rehearse it with synthetic load.
  • Rollback must be one command or one flag flip—practice it until it’s boring.

Implementation checklist

  • Freeze risky changes; tag and branch the release to be migrated.
  • Define SLOs and error budget gates for the migration window.
  • Reduce DNS/ingress TTL to 30s at least 48h before cutover.
  • Provision target infra with IaC (Terraform) and validate with smoke tests.
  • Add schema via expand migration (non-breaking), deploy read-only first.
  • Run idempotent backfill and verify consistency via CDC checks.
  • Enable dual-writes behind a feature flag; validate row-level diffs for 24–72h.
  • Mirror traffic (shadow) and confirm p95 latency regression ≤ 10% and 5xx delta ≤ 0.1% vs baseline.
  • Configure canary with automated rollback on SLO breach.
  • Gate promotions on Prometheus queries; no manual eyeballing.
  • Plan connection draining and session pinning (sticky sessions off or centralized).
  • Freeze queues/batch jobs that could amplify risk during cutover.
  • Run a dry run with synthetic load (k6) and chaos faults (Gremlin) in staging.
  • Execute the T-60 to T+30 playbook with roles and timestamps.
  • Have a one-click rollback (Argo Rollouts abort or feature flag off).
  • Post-cutover monitor for 24–48h; keep dual-writes on with alerts.
  • Contract migration (drop old columns/paths) only after stability window.
  • Remove temporary flags and infra; update runbooks and CMDB.
  • Audit cost, SLO impact, and lessons learned; share internally.

Questions we hear from teams

What if my workload maintains sticky sessions?
Centralize session state (Redis, DynamoDB) or temporarily reduce stickiness during cutover. Ensure both blue and green can read the same session store and validate serialization format compatibility.
Can I skip dual-writes if I trust my backfill?
You can, but I wouldn’t for critical paths. Dual-writes behind a flag buy you proof over time and a fast rollback if the new path corrupts data. Turn it off after 24–48h of clean CDC parity.
How do I handle long-lived connections (WebSockets, gRPC)?
Introduce a max connection age and aggressive readiness probes. Use connection draining on LBs and graceful shutdown hooks. For gRPC, set max connection age/grace timeouts on the server and make retried methods idempotent.
We don’t use Kubernetes. Does this still apply?
Yes. Replace Rollouts with Spinnaker or AWS CodeDeploy blue/green, Istio with ALB/Nginx canaries, and Prometheus gates with CloudWatch/Datadog monitors. The principles—data-first, progressive traffic, automated gates—stay the same.
