Rollback-First: The Boring Friday Deploy Playbook
If you can’t reverse a bad change in five minutes, you don’t have continuous delivery—you have continuous roulette.
The Friday deploy that didn’t page anyone
It’s 4:50 p.m. Friday. We ship a payments service to 5% with Argo Rollouts. Five minutes later, the error-burn check spikes—2.4x over our SLO window. The rollout auto-aborts, traffic shifts back, Slack posts “Canary failed, rollback complete.” No PagerDuty page. No war room. The only action item? Monday fix.
That’s not luck. It’s design. We built the rollback path first, then the deploy path. And we measure success with three north-star metrics: change failure rate, lead time, and recovery time.
Rollback is a design choice, not a panic button
I’ve watched smart teams ship beautiful pipelines that crumble when a change misbehaves. The patterns are consistent:
- Rollback steps live in someone’s head or a stale wiki
- Databases can’t go backward without data loss
- Canary is “logs look fine” instead of SLO-driven gates
- Artifacts aren’t immutable; the “latest” tag points who-knows-where
What actually works:
- Treat rollback as a product requirement with acceptance criteria
- Make rollbacks operationally trivial (one command or toggle)
- Instrument deployments with SLO-aware guards that auto-abort on risk
Why it matters to your P&L:
- Lower change failure rate shrinks incident volume and support load
- Faster lead time increases feature throughput and learning cycles
- Shorter recovery time reduces revenue leakage and brand damage
Set aggressive but achievable targets:
- Change failure rate: <5% (from DORA)
- Lead time (commit-to-prod): <60 minutes for normal changes
- Recovery time (bad change to user impact resolved): <10 minutes
Patterns that make rollbacks trivial
If rollback isn’t easy, you won’t do it fast. Bake these in from day one.
Immutable, versioned artifacts
- Publish containers by SHA, not latest
- Track deploy provenance (git SHA, build ID) in Helm annotations or Deployment labels (sketch below)
- Keep N-2 artifacts hot in the registry and cache warm
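A minimal sketch of what that can look like on a Kubernetes Deployment; the annotation keys, label values, and registry path are illustrative, not a standard:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  annotations:
    # Illustrative provenance keys; use whatever your CI already emits
    deploy.example.com/git-sha: 9f2c1d7
    deploy.example.com/build-id: "8412"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        deploy-id: 9f2c1d7
    spec:
      containers:
        - name: api
          # Pinned by git SHA tag, never :latest
          image: registry.example.com/api:9f2c1d7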
Canary with SLO gates
- Argo Rollouts, Flagger, or Spinnaker + Kayenta
- Weights: 5% → 20% → 50% → 100% with pauses, analysis, and auto-abort (see the sketch below)
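Here’s a minimal sketch of those weights as an Argo Rollouts canary strategy, assuming an AnalysisTemplate named http-5xx-burn like the one shown later in the Kubernetes playbook; names and durations are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:9f2c1d7
  strategy:
    canary:
      # Background analysis; any failed measurement aborts and shifts traffic back
      analysis:
        templates:
          - templateName: http-5xx-burn
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}

Promotion past the last pause takes the new version to 100%; an aborted analysis run does the traffic shift back for you.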
Blue/green or shadow traffic
- Two target groups or two Kubernetes services; flip DNS/weights in seconds
- Mirror a slice of traffic to the new stack for read-only validation (sketch below)
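If you run a mesh, one way to do the mirroring; a sketch assuming Istio with stable and canary subsets already defined in a DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api
  http:
    - route:
        - destination:
            host: api
            subset: stable
          weight: 100
      # Copy 10% of live requests to the new stack; mirrored responses are discarded
      mirror:
        host: api
        subset: canary
      mirrorPercentage:
        value: 10.0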
Feature flags for business risk
- LaunchDarkly or Unleash for logic toggles and kill switches
- Separate release toggles from experiment flags; assign owners and TTLs
Database expand/contract
- Add columns/tables first, backfill, dual-write, cut over, then drop later
- Use gh-ost or pt-online-schema-change for MySQL; Liquibase/Flyway for versioning (expand-step sketch below)
- Design for roll-forward; rollback means reverting the code path, not dropping data
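A sketch of the expand step as a Liquibase changeset in YAML format; the table, column, and changeset names are illustrative:

databaseChangeLog:
  - changeSet:
      id: expand-add-currency-code
      author: platform
      changes:
        # Expand: additive and nullable, so the old code path keeps working
        - addColumn:
            tableName: payments
            columns:
              - column:
                  name: currency_code
                  type: varchar(3)
                  constraints:
                    nullable: true
        # The contract step (dropColumn) ships much later as its own changeset,
        # after reads have switched and the old path is dead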
Schema-compatible events
- Enforce backward-compat in Avro or Protobuf registries (one CI-side option sketched below)
- Keep consumers tolerant to unknown fields
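For Protobuf, one CI-side option is buf’s breaking-change check, a complement to a registry rather than a replacement; a minimal buf.yaml sketch, assuming the v1 config format:

version: v1
breaking:
  use:
    # Fail CI when a change would break wire or JSON compatibility for consumers
    - WIRE_JSON
lint:
  use:
    - DEFAULT

CI then runs buf breaking --against a ref of the last released schemas; Avro shops get the equivalent by setting BACKWARD compatibility on the registry subject.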
Concrete playbooks by stack
Here’s the stuff you put in runbooks, not in slide decks.
Kubernetes (GKE/EKS/AKS)
- Roll back a deployment:
kubectl rollout undo deployment/api --to-revision=3
helm rollback api 23
kubectl argo rollouts undo api
- Canary with Prometheus guard (snippet):
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-5xx-burn
spec:
  metrics:
    - name: error-burn
      interval: 1m
      count: 5
      successCondition: result[0] < 1.5
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          # Error ratio divided by the error budget of an assumed 99.9% SLO (0.001),
          # i.e., the burn-rate multiple the success condition checks
          query: |
            (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) / 0.001
ECS/EC2 with ALB
- Blue/green using two target groups; rollback = flip weights:
aws elbv2 modify-listener --listener-arn ... --default-actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"blue","Weight":100},{"TargetGroupArn":"green","Weight":0}]}}]'
Serverless (AWS Lambda)
- Use aliases and weighted routing; rollback = move alias back:
aws lambda update-alias --function-name checkout --name prod --function-version 42 --routing-config AdditionalVersionWeights={}
- With CodeDeploy:
aws deploy stop-deployment --deployment-id d-123 --auto-rollback-enabled
Edge configs (Fastly/Cloudflare)
- Versioned configs; rollback = activate prior version:
fastly service-version activate --service-id $ID --version 67
Terraform-managed infra
- Keep launch_template_version pinned; rollback by applying the prior version:
terraform apply -var lt_version=42
- Use state locks and plan files; never hotfix in the console
Databases
- Phase 1: Add nullable column, dual-write, backfill (idempotent)
- Phase 2: Switch reads, monitor, keep old path intact
- Phase 3: Remove old column in a separate, reversible change window
- If you must revert: toggle the code path off; never drop the newly written data
Guardrails that auto-abort the bad stuff
Make the safe path the default path.
SLO-driven automated rollback
- Argo Rollouts with Prometheus queries; abort on error rate, p95 latency, or saturation spikes
- Spinnaker + Kayenta does automated canary analysis across metric sets
Circuit breakers and timeouts
- Envoy/Istio outlier_detection, resilience4j for JVMs, and exponential backoff (sketch below)
- Prevent a bad deploy from melting neighbors while your pipeline rolls back
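A sketch of the Istio flavor; the host and thresholds are illustrative assumptions, tune them to your traffic:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api.default.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      # Eject an endpoint after 5 consecutive 5xx responses...
      consecutive5xxErrors: 5
      interval: 30s
      # ...for at least 60s, but never more than half the endpoints at once
      baseEjectionTime: 60s
      maxEjectionPercent: 50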
Policy as code and GitOps
- OPA/Kyverno reject non-rollbackable manifests (e.g., latest tags, missing probes; policy sketch below)
- ArgoCD enforces desired state; revert = git revert + sync
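One concrete version of that guardrail, sketched as a Kyverno policy that rejects Deployments shipping a latest image; adjust the kinds and scope to your cluster:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Pin images by SHA or version; :latest is not rollbackable."
        pattern:
          spec:
            template:
              spec:
                containers:
                  - image: "!*:latest"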
Observability you can trust
- Golden signals are pre-wired into dashboards and alerts: error rate, latency, saturation, correctness
- Track deploy IDs in logs/traces (X-Deploy-Id) so you can correlate impact quickly; see the sketch below
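One way to wire that on Kubernetes: stamp a deploy-id label at deploy time, surface it via the downward API, and have the app attach it to every log line, span, and the X-Deploy-Id response header. The label key and values are illustrative, and the same env fragment works inside a Deployment’s pod template:

apiVersion: v1
kind: Pod
metadata:
  name: api
  labels:
    deploy-id: 9f2c1d7          # set by CI to the git SHA being deployed
spec:
  containers:
    - name: api
      image: registry.example.com/api:9f2c1d7
      env:
        # Downward API: expose the label so the app can stamp logs, traces,
        # and the X-Deploy-Id response header
        - name: DEPLOY_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['deploy-id']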
Checklists that scale with team size
Print these. Tape them to the wall. Automate them next.
Pre-deploy (every change)
- Does this change have a one-step rollback? List the exact command/toggle/weight.
- Is the database step backward compatible? If not, split the change.
- Are SLO gates enabled for the canary? Which metrics?
- Are feature flags and kill switches wired and owned?
- Announce scope and rollback plan in #deploys with links to dashboards.
During deploy
- Start at 5% traffic; pause 5–10 minutes.
- Watch error burn, p95 latency, and key business events (checkout success, login rate).
- Promote only if metrics pass; otherwise let automation abort.
Rollback (execute within 5 minutes)
- Kubernetes: helm rollback <release> <rev> or kubectl rollout undo ...
- Lambda: move prod alias back to prior version
- ALB/NGINX: shift weights to stable target group
- Feature flags: kill switch OFF for new path
- Verify recovery via dashboards; post the “resolved” note in Slack
Monthly drills
- Randomly select a service; break a synthetic canary; time rollback to stable
- Track MTTR; create issues for any manual steps
- Rotate who runs the drill so the knowledge scales
Metrics that prove it’s working
If you can’t measure it, you’ll argue about it in the retro.
Change Failure Rate (CFR)
- Numerator: deployments that require rollback, hotfix, or flag kill
- Denominator: total production deployments
- Goal: trend down toward <5%
Lead Time for Changes
- From merge to production traffic at 100%
- Instrument via pipeline events + rollout promotion logs
- Goal: <60 minutes for routine changes
Recovery Time (MTTR)
- From detection (alert/gate fail) to user impact resolved
- Goal: <10 minutes for rollback-capable services
Dashboards
- Grafana: CFR by service, deployment frequency, MTTR P50/P90, error budget burn
- Tag traces/logs with deploy_id so queries like “errors by deploy” are one click
I’ve seen teams cut CFR by half in a quarter just by shipping SLO-gated canaries and practicing rollback drills. The engineering hours you save stop going into firefighting and start going into features.
Case notes: what changed and what we got back
Fintech on GKE (payments + auth)
- Before: CFR 22%, MTTR 97m, informal Friday freeze
- After GitPlumbers’ 4-week rollback-first push (Argo Rollouts + Prometheus gates, Helm provenance, LD kill switches, expand/contract DB): CFR 6%, MTTR 9m, freeze removed. Lead time from 2 days to 45 minutes.
E-comm on ECS/ALB (search + checkout)
- Before: Blue/green by hand in the console, database changes tied to deploys, no flags
- After: Terraform-managed target weights, CodeDeploy canary with auto-rollback, Unleash for risky logic. Result: 3 consecutive holiday Fridays shipped with zero pages; on-call costs down ~30%.
SaaS analytics on Lambda + CloudFront
- Before: Alias drift, hard rollbacks, broken dashboards
- After: Aliased, versioned deploys with weighted canary, Fastly rollback runbook, OPA policy to ban latest images. MTTR from 40m → 6m.
None of this is flashy. It’s plumbing. But boring plumbing is why Friday deploys become boring, too.
Key takeaways
- Design rollback first. Make it a product requirement, not an afterthought.
- Use DORA metrics as north stars: change failure rate, lead time, and recovery time.
- Prefer canary + SLO-driven automated aborts over manual guesswork.
- Keep rollbacks operationally simple: one command, one toggle, or one traffic weight change.
- Make databases rollback-safe with expand/contract and roll-forward mindset.
- Codify checklists and drills so any engineer can safely revert at 2 a.m. or 4:55 p.m. Friday.
Implementation checklist
- Every deploy must have a documented rollback path (command, toggle, or traffic shift).
- Artifacts are immutable, versioned, and quickly addressable (image SHA, Helm revision, Lambda alias).
- All database changes follow expand/contract; destructive steps are isolated and slow.
- SLO/metric gates enforce auto-abort on canaries; humans approve promotions, not rescues.
- Run monthly rollback drills; track MTTR from “bad deploy detected” to “user impact resolved.”
- Feature flag debt has owners, TTLs, and cleanup tasks on the board.
Questions we hear from teams
- Should we stop Friday deploys?
- No. You should stop unsafe deploys. If you can auto-abort canaries, flip traffic, and toggle features within five minutes, Friday is just another day. If you can’t, the day isn’t the problem—your rollback design is.
- How do we handle database rollbacks?
- Don’t. Design for roll-forward. Use expand/contract: add new structures, dual-write, backfill, then switch reads. Rollback means toggling code paths off. Only perform destructive drops in a separate, low-risk change after stability is proven.
- What about feature flag debt?
- Treat flags like code. Each flag has an owner, a TTL, and a cleanup ticket. Separate kill switches (operational) from experiments (product). Instrument flags in logs and dashboards so you can correlate behavior with toggle states.
- Who owns rollback in the org?
- Platform owns the tooling and guardrails. Service teams own their rollback runbooks and drills. SRE validates SLOs and gates. Execs track CFR, lead time, and MTTR. Everybody practices.
- How do we test rollbacks without scaring customers?
- Shadow traffic, synthetic transactions, and monthly drills on non-critical windows. Use canary releases with tiny weights (1–5%), real SLO gates, and automatic aborts. Make practice boring so production is boring.