Rollback-First: The Boring Friday Deploy Playbook

If you can’t reverse a bad change in five minutes, you don’t have continuous delivery—you have continuous roulette.


The Friday deploy that didn’t page anyone

It’s 4:50 p.m. on a Friday. We ship a payments service to 5% of traffic with Argo Rollouts. Five minutes later, the error-burn check trips: 2.4x our SLO budget over the evaluation window. The rollout auto-aborts, traffic shifts back, and Slack posts “Canary failed, rollback complete.” No PagerDuty page. No war room. The only action item is a Monday fix.

That’s not luck. It’s design. We built the rollback path first, then the deploy path. And we measure success with three north-star metrics: change failure rate, lead time, and recovery time.

Rollback is a design choice, not a panic button

I’ve watched smart teams ship beautiful pipelines that crumble when a change misbehaves. The patterns are consistent:

  • Rollback steps live in someone’s head or a stale wiki
  • Databases can’t go backward without data loss
  • Canary is “logs look fine” instead of SLO-driven gates
  • Artifacts aren’t immutable; the “latest” tag points who-knows-where

What actually works:

  • Treat rollback as a product requirement with acceptance criteria
  • Make rollbacks operationally trivial (one command or toggle)
  • Instrument deployments with SLO-aware guards that auto-abort on risk

Why it matters to your P&L:

  • Lower change failure rate shrinks incident volume and support load
  • Faster lead time increases feature throughput and learning cycles
  • Shorter recovery time reduces revenue leakage and brand damage

Set aggressive but achievable targets:

  • Change failure rate: <5% (from DORA)
  • Lead time (commit-to-prod): <60 minutes for normal changes
  • Recovery time (bad change to user impact resolved): <10 minutes

Patterns that make rollbacks trivial

If rollback isn’t easy, you won’t do it fast. Bake these in from day one.

  • Immutable, versioned artifacts

    • Publish containers by SHA, not latest
    • Track deploy provenance (git SHA, build ID) in Helm annotations or Deployment labels
    • Keep N-2 artifacts hot in the registry and cache warm
  • Canary with SLO gates

    • Argo Rollouts, Flagger, or Spinnaker + Kayenta
    • Weights: 5% → 20% → 50% → 100% with pauses, analysis, and auto-abort (see the sketch after this list)
  • Blue/green or shadow traffic

    • Two target groups or two Kubernetes services; flip DNS/weights in seconds
    • Mirror a slice of traffic to the new stack for read-only validation
  • Feature flags for business risk

    • LaunchDarkly or Unleash for logic toggles and kill switches
    • Separate release toggles from experiment flags; assign owners and TTLs
  • Database expand/contract

    • Add columns/tables first, backfill, dual-write, cut over, then drop later
    • Use gh-ost or pt-online-schema-change for MySQL; Liquibase or Flyway for migration versioning
    • Design for roll-forward; rollback means reverting code path, not dropping data
  • Schema-compatible events

    • Enforce backward-compat in Avro or Protobuf registries
    • Keep consumers tolerant to unknown fields
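
If you run Kubernetes, the canary-with-SLO-gates pattern above (plus provenance labels) can be sketched as a single Argo Rollouts spec. Treat this as a minimal illustration, not a drop-in manifest: the service name, image registry, label keys, weights, and pause durations are all placeholder assumptions, and it references the http-5xx-burn AnalysisTemplate shown in the Kubernetes playbook below.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
  labels:
    # Provenance stamped at build time; these label keys are placeholders.
    app.kubernetes.io/version: "1.42.0"
    example.com/git-sha: "3f9c2ab"
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        # Immutable reference: tag by git SHA (or pin by digest), never latest.
        image: registry.example.com/api:3f9c2ab
  strategy:
    canary:
      # Background analysis gates every step; a failed run auto-aborts the rollout.
      analysis:
        templates:
        - templateName: http-5xx-burn
        startingStep: 1
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}

When the analysis fails, Argo Rollouts aborts and shifts traffic back to the stable ReplicaSet on its own; nobody has to find the right kubectl incantation at 4:55 p.m.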

Concrete playbooks by stack

Here’s the stuff you put in runbooks, not in slide decks.

  • Kubernetes (GKE/EKS/AKS)
    • Roll back a deployment:
      • kubectl rollout undo deployment/api --to-revision=3
      • helm rollback api 23
      • kubectl argo rollouts undo api
    • Canary with Prometheus guard (snippet):
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-5xx-burn
spec:
  metrics:
  - name: error-burn
    interval: 1m
    count: 5
    # Fail the analysis (and abort the canary) when burn hits 1.5x or worse.
    successCondition: result[0] < 1.5
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        # 5xx ratio divided by the SLO error budget (1% here) = burn multiple.
        query: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
          / 0.01
  • ECS/EC2 with ALB

    • Blue/green using two target groups; rollback = flip weights:
      • aws elbv2 modify-listener --listener-arn ... --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"blue","Weight":100},{"TargetGroupArn":"green","Weight":0}]}}]'
  • Serverless (AWS Lambda)

    • Use aliases and weighted routing; rollback = move alias back:
      • aws lambda update-alias --function-name checkout --name prod --function-version 42 --routing-config AdditionalVersionWeights={}
    • With CodeDeploy: aws deploy stop-deployment --deployment-id d-123 --auto-rollback-enabled
  • Edge configs (Fastly/Cloudflare)

    • Versioned configs; rollback = activate prior version:
      • fastly service-version activate --service-id $ID --version 67
  • Terraform-managed infra

    • Keep launch_template_version pinned; rollback by applying the prior version:
      • terraform apply -var lt_version=42
    • Use state locks and plan files; never hotfix in the console
  • Databases

    • Phase 1: Add nullable column, dual-write, backfill (idempotent)
    • Phase 2: Switch reads, monitor, keep old path intact
    • Phase 3: Remove old column in a separate, reversible change window
    • If you must revert: toggle code path off; never drop recovered data
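
For the expand phase, a versioned migration keeps the database step reviewable and repeatable. Here’s a minimal Liquibase changelog sketch (table, column, and author names are made up for illustration); consistent with the roll-forward stance above, “rolling back” this phase means toggling the new code path off and leaving the column in place.

databaseChangeLog:
  - changeSet:
      id: expand-add-amount-minor-units
      author: payments-team
      changes:
        # Phase 1 (expand): purely additive, safe to ship ahead of the code change.
        - addColumn:
            tableName: payments
            columns:
              - column:
                  name: amount_minor_units
                  type: BIGINT
                  constraints:
                    nullable: true
        # Backfill and dual-write happen in application code or a separate
        # idempotent job; the contract (drop) phase ships much later on its own.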

Guardrails that auto-abort the bad stuff

Make the safe path the default path.

  • SLO-driven automated rollback

    • Argo Rollouts with Prometheus queries; abort on error-rate, p95 latency, or saturation spikes
    • Spinnaker + Kayenta runs automated canary analysis across entire metric sets
  • Circuit breakers and timeouts

    • Envoy/Istio outlier_detection, resilience4j for JVMs, and exponential backoff
    • Prevent a bad deploy from melting neighbors while your pipeline rolls back
  • Policy as code and GitOps

    • OPA/Kyverno reject non-rollbackable manifests (e.g., latest tags, missing probes); see the policy sketch after this list
    • ArgoCD enforces desired state; revert = git revert + sync
  • Observability you can trust

    • Golden signals are pre-wired into dashboards and alerts: error rate, latency, saturation, correctness
    • Track deploy IDs in logs and traces (X-Deploy-Id) so you can correlate impact quickly
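
Here’s what the “reject non-rollbackable manifests” guardrail can look like as policy as code. This is a sketch modeled on Kyverno’s stock disallow-latest-tag sample policy; the message text is ours, and a probe requirement would be a second rule in the same spirit.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "A mutable image reference (latest) makes rollback ambiguous; pin by SHA or version."
        pattern:
          spec:
            containers:
              - image: "!*:latest"

Because the rule matches Pods, Kyverno’s autogen applies it to Deployments, StatefulSets, and the other pod controllers as well.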

Checklists that scale with team size

Print these. Tape them to the wall. Automate them next.

  • Pre-deploy (every change)

    1. Does this change have a one-step rollback? List the exact command/toggle/weight.
    2. Is the database step backward compatible? If not, split the change.
    3. Are SLO gates enabled for the canary? Which metrics?
    4. Are feature flags and kill switches wired and owned?
    5. Announce scope and rollback plan in #deploys with links to dashboards (see the pipeline sketch after these checklists).
  • During deploy

    1. Start at 5% traffic; pause 5–10 minutes.
    2. Watch error burn, p95 latency, and key business events (checkout success, login rate).
    3. Promote only if metrics pass; otherwise let automation abort.
  • Rollback (execute within 5 minutes)

    • Kubernetes: helm rollback <release> <rev> or kubectl rollout undo ...
    • Lambda: move prod alias back to prior version
    • ALB/NGINX: shift weights to stable target group
    • Feature flags: kill switch OFF for new path
    • Verify recovery via dashboards; post the “resolved” note in Slack
  • Monthly drills

    • Randomly select a service; break a synthetic canary; time rollback to stable
    • Track MTTR; create issues for any manual steps
    • Rotate who runs the drill so the knowledge scales
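
Most of the pre-deploy checklist can be enforced by the pipeline instead of by memory. A rough GitHub Actions sketch covering items 1 and 5 might look like this; the hard-coded rollback declaration and the DEPLOYS_SLACK_WEBHOOK secret are assumptions about your setup, not a prescribed convention.

name: deploy-api
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      # Assumption: each service declares its one-step rollback alongside the code.
      ROLLBACK_COMMAND: "helm rollback api"
    steps:
      - name: Require a one-step rollback path
        run: |
          test -n "$ROLLBACK_COMMAND" || { echo "No rollback command declared"; exit 1; }
      - name: "Announce scope and rollback plan in #deploys"
        env:
          SLACK_WEBHOOK: ${{ secrets.DEPLOYS_SLACK_WEBHOOK }}
        run: |
          curl -sf -X POST -H 'Content-type: application/json' \
            --data "{\"text\":\"Deploying ${GITHUB_SHA}. Rollback: ${ROLLBACK_COMMAND}\"}" \
            "$SLACK_WEBHOOK"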

Metrics that prove it’s working

If you can’t measure it, you’ll argue about it in the retro.

  • Change Failure Rate (CFR)

    • Numerator: deployments that require rollback, hotfix, or flag kill
    • Denominator: total production deployments
    • Goal: trend down toward <5%
  • Lead Time for Changes

    • From merge to production traffic at 100%
    • Instrument via pipeline events + rollout promotion logs
    • Goal: <60 minutes for routine changes
  • Recovery Time (MTTR)

    • From detection (alert/gate fail) to user impact resolved
    • Goal: <10 minutes for rollback-capable services
  • Dashboards

    • Grafana: CFR by service, deployment frequency, MTTR P50/P90, error budget burn
    • Tag traces/logs with deploy_id so queries like “errors by deploy” are one click
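
To make “errors by deploy” genuinely one click, precompute the ratio. A Prometheus recording rule along these lines works, assuming your request counter carries service and deploy_id labels (the rule and metric names here are placeholders):

groups:
  - name: deploy-health
    rules:
      # Error ratio per service and deploy, refreshed on the normal rule interval.
      - record: service_deploy:http_error_ratio:rate5m
        expr: |
          sum by (service, deploy_id) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service, deploy_id) (rate(http_requests_total[5m]))

Grafana panels and alert rules can then reference the recorded series directly instead of re-deriving it per dashboard.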

I’ve seen teams cut CFR by half in a quarter just by shipping SLO-gated canaries and practicing rollback drills. The engineering hours you save stop going into firefighting and start going into features.

Case notes: what changed and what we got back

  • Fintech on GKE (payments + auth)

    • Before: CFR 22%, MTTR 97m, informal Friday freeze
    • After GitPlumbers’ 4-week rollback-first push (Argo Rollouts + Prometheus gates, Helm provenance, LaunchDarkly kill switches, expand/contract DB): CFR 6%, MTTR 9m, freeze removed. Lead time from 2 days to 45 minutes.
  • E-comm on ECS/ALB (search + checkout)

    • Before: Blue/green by hand in the console, database changes tied to deploys, no flags
    • After: Terraform-managed target weights, CodeDeploy canary with auto-rollback, Unleash for risky logic. Result: 3 consecutive holiday Fridays shipped with zero pages; on-call costs down ~30%.
  • SaaS analytics on Lambda + CloudFront

    • Before: Alias drift, hard rollbacks, broken dashboards
    • After: Aliased, versioned deploys with weighted canary, Fastly rollback runbook, OPA policy to ban latest images. MTTR from 40m → 6m.

None of this is flashy. It’s plumbing. But boring plumbing is why Friday deploys become boring, too.


Key takeaways

  • Design rollback first. Make it a product requirement, not an afterthought.
  • Use DORA metrics as north stars: change failure rate, lead time, and recovery time.
  • Prefer canary + SLO-driven automated aborts over manual guesswork.
  • Keep rollbacks operationally simple: one command, one toggle, or one traffic weight change.
  • Make databases rollback-safe with expand/contract and roll-forward mindset.
  • Codify checklists and drills so any engineer can safely revert at 2 a.m. or 4:55 p.m. Friday.

Implementation checklist

  • Every deploy must have a documented rollback path (command, toggle, or traffic shift).
  • Artifacts are immutable, versioned, and quickly addressable (image SHA, Helm revision, Lambda alias).
  • All database changes follow expand/contract; destructive steps are isolated and slow.
  • SLO/metric gates enforce auto-abort on canaries; humans approve promotions, not rescues.
  • Run monthly rollback drills; track MTTR from “bad deploy detected” to “user impact resolved.”
  • Feature flag debt has owners, TTLs, and cleanup tasks on the board.

Questions we hear from teams

Should we stop Friday deploys?
No. You should stop unsafe deploys. If you can auto-abort canaries, flip traffic, and toggle features within five minutes, Friday is just another day. If you can’t, the day isn’t the problem—your rollback design is.
How do we handle database rollbacks?
Don’t. Design for roll-forward. Use expand/contract: add new structures, dual-write, backfill, then switch reads. Rollback means toggling code paths off. Only perform destructive drops in a separate, low-risk change after stability is proven.
What about feature flag debt?
Treat flags like code. Each flag has an owner, a TTL, and a cleanup ticket. Separate kill switches (operational) from experiments (product). Instrument flags in logs and dashboards so you can correlate behavior with toggle states.
Who owns rollback in the org?
Platform owns the tooling and guardrails. Service teams own their rollback runbooks and drills. SRE validates SLOs and gates. Execs track CFR, lead time, and MTTR. Everybody practices.
How do we test rollbacks without scaring customers?
Shadow traffic, synthetic transactions, and monthly drills on non-critical windows. Use canary releases with tiny weights (1–5%), real SLO gates, and automatic aborts. Make practice boring so production is boring.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Ship with rollback-first confidence. See how we cut CFR and MTTR for teams like yours.
