The Rollback Plan That Makes Friday Deploys Feel Like Tuesday
Rollbacks shouldn’t be a Slack firefight. If your team can’t undo a bad change in minutes, you’re not “moving fast”—you’re just borrowing stress at compound interest.
If rollback isn’t executable in minutes by the on-call with least privilege, you don’t have a rollback plan—you have a bedtime story.
Friday deploys are only scary when rollback is improv
I’ve watched the same movie at a dozen companies: someone pushes a “small” change at 4:30pm Friday, the error budget catches fire at 4:38pm, and suddenly the whole org is doing distributed archaeology in Slack. The killer detail is always the same—nobody is sure what “rollback” even means in that system.
- Is rollback `git revert` and redeploy?
- Is it `helm rollback`?
- Is it “turn the feature flag off”?
- Is it “restore the database snapshot and pray”?
If your answer depends on who’s online, you don’t have a rollback strategy—you have tribal knowledge.
Here’s what actually works: design rollback as a product feature of your delivery system, tied to the metrics your exec staff will care about when the incident review hits: change failure rate, lead time, and recovery time (MTTR). Make rollback boring, and Friday becomes just another deploy window.
“Fast teams don’t avoid failure. They reduce the blast radius and shorten the apology.”
The only three numbers that matter (for rollback)
You can add all the DORA dashboards you want, but for rollback design, I anchor on:
- Change failure rate (CFR): what percentage of deploys trigger a rollback, hotfix, or incident.
- Lead time: how long from merged code to running in prod.
- Recovery time (MTTR): how long to restore service when a change goes bad.
Rollback strategy directly influences all three:
- If rollback is painful, teams delay shipping (lead time gets worse) and “power through” broken releases (CFR becomes a lie because you don’t classify failures consistently).
- If rollback is fast and safe, teams ship smaller batches (CFR down) and restore service quickly (MTTR down).
Concrete targets I’ve used successfully:
- Rollback-to-stable in < 5 minutes for stateless services.
- Rollback decision in < 10 minutes (time-box the debate).
- CFR < 15% for teams modernizing a legacy estate; < 5% for mature services with progressive delivery.
If those sound aggressive, good. Friday deploys are boring when rollback is a reflex, not a research project.
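If you don’t know your baseline yet, it’s usually already sitting in your deploy history. A minimal sketch, assuming a hypothetical `deploys` table with one row per production deploy (`merged_at`, `failed`, `failed_at`, and `restored_at` are illustrative columns, not a standard schema):

```sql
-- Weekly CFR, lead time, and MTTR from a hypothetical deploy-history table.
-- "failed" marks deploys that triggered a rollback, hotfix, or incident.
SELECT
  date_trunc('week', deployed_at)                              AS week,
  count(*)                                                     AS deploys,
  round(100.0 * count(*) FILTER (WHERE failed) / count(*), 1)  AS change_failure_rate_pct,
  avg(deployed_at - merged_at)                                 AS lead_time,
  avg(restored_at - failed_at) FILTER (WHERE failed)           AS mttr
FROM deploys
GROUP BY 1
ORDER BY 1;
```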
Pick your rollback type like you pick your data store: intentionally
Teams get in trouble when they treat rollback as one technique. In practice you want a menu, and you want each service to declare which one it uses.
Revert + rebuild + redeploy
- Works everywhere.
- Slowest MTTR unless you’ve optimized the pipeline.
- Best when you have immutable artifacts and tight CI.
Redeploy previous artifact (preferred when possible)
- MTTR friendly.
- Requires artifact retention and deterministic config.
Traffic shift rollback (canary/blue-green)
- Often the fastest: shift traffic back to known-good without changing images.
- Requires load balancer/ingress + metrics gates.
Feature flag kill-switch
- Fastest for “business logic” changes.
- Dangerous when flags are abused as long-lived forks.
A sane baseline for Kubernetes shops:
- Stateless services: traffic shift (Argo Rollouts) + redeploy previous image as fallback.
- Stateful services: feature flag + expand/contract DB patterns; avoid “restore snapshot” as a routine mechanism.
Here’s a minimal Argo Rollouts canary with automated abort on error-rate regression:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 120s }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 50
        - pause: { duration: 300s }
        - setWeight: 100
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: app
          image: ghcr.io/acme/payments-api:1.42.0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: http-success-rate
      interval: 30s
      successCondition: result[0] >= 0.995
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{app="payments-api",code=~"2.."}[2m]))
            /
            sum(rate(http_requests_total{app="payments-api"}[2m]))
```

That’s not “fancy.” That’s buying down MTTR.
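If the analysis gate misses something and a human has to step in, the same Rollout can be aborted or rolled back by hand with the Argo Rollouts kubectl plugin; a quick sketch (the `prod` namespace is assumed, matching the Helm examples later):

```bash
# Abort the in-flight canary and send traffic back to the stable ReplicaSet
kubectl argo rollouts abort payments-api -n prod

# Roll the Rollout back to its previous revision entirely
kubectl argo rollouts undo payments-api -n prod

# Watch weights and analysis status while you do it
kubectl argo rollouts get rollout payments-api -n prod --watch
```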
Databases: where “just roll back” goes to die
Most rollback horror stories aren’t about the container image. They’re about state.
I’ve seen this fail in the wild:
- Deploy includes a migration that drops/renames a column.
- New app version starts writing to the new shape.
- You roll back the app… and the old version can’t read the new schema.
- Now rollback is “restore prod DB,” and your MTTR is measured in hours and executive blood pressure.
The pattern that avoids this is boring and effective: expand/contract.
- Expand: Add new columns/tables/indexes in a backward-compatible way.
- Deploy code that writes both old and new (or writes new while still supporting reads).
- Validate.
- Contract: Remove old fields in a later deploy after the rollback window.
Example: adding `customer_tier` safely.

```sql
-- Expand (safe to roll back app code)
ALTER TABLE customers ADD COLUMN customer_tier TEXT;

-- Optional: backfill without locking the table forever
UPDATE customers SET customer_tier = 'standard' WHERE customer_tier IS NULL;
```

Then in code, treat the new field as optional until you’ve passed the “we can roll back” window.
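Once that window closes, the contract step ships on its own; a minimal sketch (the default and NOT NULL constraints here are illustrative):

```sql
-- Contract (a later deploy, after the rollback window has passed):
-- destructive or tightening changes never ride along with app logic changes.
ALTER TABLE customers ALTER COLUMN customer_tier SET DEFAULT 'standard';
ALTER TABLE customers ALTER COLUMN customer_tier SET NOT NULL;
```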
Also: if you’re doing Kafka-style eventing, schema discipline matters. Use protobuf/avro compatibility rules and enforce them in CI. Rolling back a consumer that can’t read the producer’s new message shape is the distributed systems version of stepping on a rake.
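If you’re on protobuf with buf, for example, that CI enforcement can be a single gate (assuming your buf module lives at the repo root):

```bash
# Fail the build if the proposed schema breaks consumers of what's on main
buf breaking --against '.git#branch=main'
```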
Make rollback a button, not a debate (CI/CD + GitOps)
Rollback fails under pressure for two reasons:
- It requires too many humans.
- It requires elevated privileges (“only Pat has prod access”).
Here’s what I push teams toward:
- Immutable artifacts: tag and retain images for at least N days.
- One-command rollback: the on-call can do it with least privilege (a sample RBAC sketch is below).
- GitOps-driven state: rollback is a `git revert` of the desired state, not a series of manual kubectl edits.
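On the least-privilege point, here’s one way it can look for the Argo Rollouts setup above; a sketch only (the Role name and namespace are illustrative, and a Helm-based flow would need access to Helm’s release Secrets instead):

```yaml
# On-call can abort/undo Rollouts and read pods/events, but not edit arbitrary workloads.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oncall-rollback
  namespace: prod
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["rollouts", "rollouts/status"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["get", "list", "watch"]
```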
If you’re Helm-based, keep it stupid-simple:
```bash
# See rollout history
helm history payments-api -n prod

# Roll back to previous revision
helm rollback payments-api 41 -n prod

# Verify the deployed image tag
kubectl -n prod get deploy payments-api -o jsonpath='{.spec.template.spec.containers[0].image}'
```

If you’re ArgoCD-based GitOps:
```bash
# Roll back by reverting the Git commit that introduced the change
git revert <bad-commit-sha>
git push origin main

# ArgoCD will sync back to known-good automatically
argocd app sync payments-api-prod
```

And if you want rollback to be “boring,” wire it into your pipeline so you can redeploy a known-good artifact without rebuilding:
```yaml
# .github/workflows/deploy.yml
name: deploy

on:
  workflow_dispatch:
    inputs:
      image_tag:
        description: "Image tag to deploy (supports rollback)"
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set image tag in manifests
        run: |
          yq -i '.spec.template.spec.containers[0].image = "ghcr.io/acme/payments-api:${{ inputs.image_tag }}"' k8s/deploy.yaml
      - name: Commit desired state
        run: |
          git config user.name "release-bot"
          git config user.email "release-bot@acme.com"
          git commit -am "Deploy payments-api ${{ inputs.image_tag }}"
          git push
```

That `workflow_dispatch` input turns rollback from “rebuild and hope” into “deploy the last good tag.” MTTR drops fast.
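Triggering a rollback is then just re-running the workflow with the last good tag, for example via the GitHub CLI (the tag value is illustrative):

```bash
# Redeploy a known-good artifact without rebuilding anything
gh workflow run deploy.yml -f image_tag=1.41.0
```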
Checklists that scale with team size (and don’t become theater)
I’m allergic to process cosplay. But repeatable checklists are how you keep CFR and MTTR from degrading as headcount grows.
Before merge (PR checklist)
- Rollback mechanism declared in the PR description: `flag`, `traffic shift`, `helm rollback`, or `revert`.
- Blast radius stated: which endpoints/queues/tables are affected.
- Migration plan: expand/contract noted, destructive step scheduled later.
- Observability: dashboard link + at least one SLI: `5xx rate`, `p95 latency`, `queue lag`, `error budget burn`.
Before deploy (release checklist)
- Confirm known-good version is available (image tag / chart revision).
- Confirm rollback command works in staging (yes, actually run it).
- Confirm alerting is sane (paging on symptom, not cause): 5xx/latency/saturation. An example rule is sketched below.
- Time-box the verification window (e.g., 10 minutes). No “we’ll keep an eye on it.”
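A minimal sketch of “paging on symptom” as a Prometheus alerting rule, reusing the labels from the AnalysisTemplate above (the 1% threshold and 5-minute window are illustrative):

```yaml
groups:
  - name: payments-api-slo
    rules:
      - alert: PaymentsApiErrorRateHigh
        # Symptom users feel: the ratio of 5xx responses, not a CPU or pod-count cause
        expr: |
          sum(rate(http_requests_total{app="payments-api",code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{app="payments-api"}[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payments-api 5xx ratio above 1% for 5 minutes"
```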
During deploy (ops checklist)
- Deploy using progressive delivery (10% → 50% → 100%).
- Watch one golden dashboard (not 12): error rate, latency, saturation.
- If SLO signal breaches, abort automatically or execute rollback immediately.
Scaling by org size
- 1–2 teams: keep it lightweight; one dashboard, one runbook, one on-call.
- 3–10 teams: standardize templates (`Rollout`, dashboards, alerts), create a shared “release guild,” and enforce the PR checklist via CI (one way is sketched below).
- 10+ teams: central platform provides paved-road deploy + rollback; product teams own service SLOs and runbooks. This is where GitPlumbers often gets called—because bespoke snowflake pipelines quietly murder MTTR.
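For the CI-enforcement piece, a hypothetical gate that fails any PR whose description doesn’t declare its rollback mechanism (the `Rollback:` convention, the regex, and the `PR_NUMBER` variable are assumptions, not a standard):

```bash
# Run inside a pull_request workflow; PR_NUMBER comes from the event payload
gh pr view "$PR_NUMBER" --json body --jq '.body' \
  | grep -qiE 'rollback: *(flag|traffic shift|helm rollback|revert)' \
  || { echo "PR description must declare a rollback mechanism"; exit 1; }
```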
Make it boring on purpose: practice, measure, and close the loop
The fastest way to make Friday deploys boring is to practice rollback when nothing is on fire.
Run a monthly rollback game day:
- Pick one service.
- Deploy a known-bad change in a canary (a minimal drill is sketched after this list).
- Validate the system aborts or the on-call rolls back in < 5 minutes.
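The drill itself can be as mechanical as pushing a deliberately bad tag through the same canary and starting a timer (the `known-bad` tag is whatever you’ve prepared to fail analysis):

```bash
# Push a known-bad image through the canary and start the clock
kubectl argo rollouts set image payments-api app=ghcr.io/acme/payments-api:known-bad -n prod

# Confirm the analysis run fails and the rollout auto-aborts;
# if it doesn't, the on-call aborts/undoes by hand as shown earlier.
kubectl argo rollouts get rollout payments-api -n prod --watch
```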
Treat rollback like a product KPI:
- Track CFR, lead time, MTTR per service.
- After every rollback, do a 30-minute review:
- Did we detect fast?
- Did we have a single obvious rollback path?
- Did data/schema prevent rollback?
Where I’ve seen teams stumble lately is with AI-generated changes (“vibe coding” in the release pipeline or migration scripts). The code compiles, the PR looks plausible, and then it fails in production in a way that’s hard to reason about. Your rollback discipline is the safety net.
If your Friday deploys still feel like roulette, GitPlumbers can help you turn rollback into a paved road: one-command, measurable, and practiced—so your recovery time stops depending on who’s awake.
- Talk to us about release engineering and rollback runbooks: https://gitplumbers.com/services/release-engineering
- If AI-generated code has polluted your delivery system, we do code rescue too: https://gitplumbers.com/services/vibe-coding-help
Key takeaways
- If rollback isn’t a one-command, low-permission operation, it will fail when you need it most.
- Design rollback around three metrics: **change failure rate**, **lead time**, and **recovery time (MTTR)**—not vibes.
- Your rollback strategy must include **data** and **state**, not just app binaries/containers.
- Use progressive delivery (canary/blue-green) + feature flags to turn “rollback” into “shift traffic back”.
- Scale with checklists: small teams need fewer gates, but they need the same repeatability and muscle memory.
Implementation checklist
- Every deploy has an explicit rollback mechanism (revert/redeploy/traffic shift) documented in the PR.
- Rollback is executable in < 5 minutes by the on-call with least privilege.
- DB changes follow expand/contract; destructive changes are delayed until after confirmation window.
- Progressive delivery is wired to SLO signals (5xx, latency, saturation) with auto-abort.
- Post-deploy verification is automated and time-boxed; humans only investigate deltas.
- Rollback runbook is tested monthly (game day) and updated after every real incident.
Questions we hear from teams
- Is rollback always better than roll-forward?
- No. For simple stateless changes, a roll-forward hotfix can be fine. But you still need a fast rollback path because roll-forward assumes you can diagnose and fix under pressure. The practical approach is: optimize for rollback to restore service, then decide whether to roll forward once you’re stable.
- What’s the fastest rollback mechanism in practice?
- Traffic shifting (canary/blue-green) and feature-flag kill switches are typically fastest because they avoid rebuilds. The caveat is data/schema: if your new version wrote incompatible state, traffic shifting won’t save you.
- How do we prevent database migrations from blocking rollback?
- Use expand/contract. Make the schema change backward-compatible first, deploy application changes second, and only remove old fields after the rollback window. Avoid destructive steps in the same deploy as app logic changes.
- What should we automate first to improve MTTR?
- Make “deploy previous known-good” a first-class operation: retain artifacts, standardize a rollback command (`helm rollback`, ArgoCD sync to previous commit, or Argo Rollouts abort), and ensure the on-call can run it without needing admin-level access.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
