The Progressive Delivery Stack That Survives Audit: Flags, Canaries, Blue/Green—Without Slowing You Down
Ship faster without playing roulette. Feature flags, canaries, and blue/green done with governance, so CFR drops, lead time shrinks, and MTTR stays honest.
The mess we’ve all shipped
I’ve watched teams add feature flags and ‘quick canaries’ at 5 p.m. Friday, only to wake up to a spike in support tickets and a Monday morning audit question: who flipped this to 100% in prod? No one knows, because it wasn’t through Git, there’s no analysis run, and the flag defaulted open in a retry path. Change failure rate (CFR) climbs, lead time creeps, MTTR looks heroic but only because you revert everything.
I’ve seen this fail in fintech under SOX, in gaming under massive load, and yes—even at unicorns with all the stickers. Here’s what actually works when you need speed and governance to coexist.
North-star metrics that drive the stack
If it doesn’t improve these, it’s theater:
- Change Failure Rate (CFR): % of changes that degrade SLOs or require rollback. Target <15% to start, <5% mature.
- Lead Time for Changes: code commit to prod. Target hours, not days. Measure median and p90.
- MTTR: time to restore normal SLOs after a bad change. Target <30 minutes for tier-1.
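As a quick illustration, all three metrics fall out of a simple change log. This is a sketch only: the Change shape and field names are assumptions for the example, not a standard schema.

```typescript
// Sketch: CFR, lead time, and MTTR from a change log (illustrative shapes).
interface Change {
  committedAt: number;   // ms epoch: commit time
  deployedAt: number;    // ms epoch: prod deploy time
  failed: boolean;       // degraded SLOs or required rollback
  restoredAt?: number;   // ms epoch: SLOs back to normal (failed changes only)
}

// Nearest-rank percentile over a pre-sorted ascending array.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

export function doraMetrics(changes: Change[]) {
  const leadTimes = changes.map(c => c.deployedAt - c.committedAt).sort((a, b) => a - b);
  const failures = changes.filter(c => c.failed);
  const restores = failures
    .filter(c => c.restoredAt !== undefined)
    .map(c => c.restoredAt! - c.deployedAt);
  return {
    cfr: failures.length / changes.length,           // change failure rate
    leadTimeP50Ms: percentile(leadTimes, 0.5),
    leadTimeP90Ms: percentile(leadTimes, 0.9),
    mttrMs: restores.reduce((a, b) => a + b, 0) / Math.max(1, restores.length),
  };
}
```

Wire the output into one dashboard per team; the point is a single source of truth, not a new reporting tool.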
Tie these to gates:
- Block prod deploys when the error budget burn exceeds threshold.
- Require automated analysis for any change touching tier-1 paths.
- Fast path (blue/green n→n+1) only when CFR < target for 4 weeks and SLO burn <1x.
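The gate logic above fits in a few lines. A minimal sketch, assuming the inputs come from your SLO queries and change metadata; the field names and thresholds are illustrative:

```typescript
// Sketch: route a change to block / canary / fast path (illustrative inputs).
interface GateInput {
  errorBudgetBurn: number;  // burn multiple over the SLO window (1x = on budget)
  touchesTier1: boolean;    // change touches a tier-1 path
  cfr4w: number;            // change failure rate over the last 4 weeks
  cfrTarget: number;        // e.g. 0.15 to start, 0.05 mature
}

type Decision = 'block' | 'canary-with-analysis' | 'fast-path-blue-green';

export function deployGate(g: GateInput): Decision {
  if (g.errorBudgetBurn > 1) return 'block';                 // burning budget: stop the line
  if (g.touchesTier1) return 'canary-with-analysis';         // tier-1 always gets automated analysis
  if (g.cfr4w < g.cfrTarget) return 'fast-path-blue-green';  // earned the fast path
  return 'canary-with-analysis';                             // default to the safe path
}
```

Run this in CI before the deploy job, not in a wiki page nobody reads.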
The reference architecture that doesn’t fight you
Use primitives that compose cleanly and leave an audit trail:
- Flags: OpenFeature SDK with a provider (LaunchDarkly, Unleash, or Flipt). One SDK across languages minimizes footguns.
- Canary/Blue-Green: Argo Rollouts (Kubernetes) or Flagger, with Istio/NGINX Ingress/ALB for traffic shaping.
- GitOps: ArgoCD managing all manifests. No direct kubectl to prod except break-glass.
- Policy as Code: OPA/Conftest or Kyverno to enforce approvals, analysis templates, and env protections.
- Observability: Prometheus/Grafana or Datadog/New Relic with OpenTelemetry. Canary analysis reads SLO-aligned queries.
- Incident tooling: PagerDuty/incident.io with pre-approved rollback runbooks.
Progressive delivery without governance is just faster incident creation.
Flags with guardrails (typed, fail-closed, and auditable)
Flags should reduce risk, not shift it around. Use OpenFeature to standardize and force safe defaults.
// src/checkout/feature-flags.ts
import { OpenFeature, EvaluationContext } from '@openfeature/server-sdk';
import { LaunchDarklyProvider } from '@launchdarkly/openfeature-node-server';

// Register the provider once at startup.
OpenFeature.setProvider(new LaunchDarklyProvider(process.env.LD_SDK_KEY!));

// Base context; OpenFeature custom attributes live at the top level.
const ctx: EvaluationContext = {
  targetingKey: process.env.USER_ID || 'system',
  env: process.env.NODE_ENV || 'unknown',
  region: process.env.AWS_REGION || 'us-east-1',
};

export async function isNewCheckoutEnabled(accountId: string): Promise<boolean> {
  const client = OpenFeature.getClient('checkout');
  // Fail-closed default: false
  return client.getBooleanValue('checkout.v2.enabled', false, { ...ctx, targetingKey: accountId });
}

Principles that keep CFR low:
- Typed flags and explicit defaults (false) everywhere. No silent ‘true’.
- Kill switches for risky features: checkout.v2.kill checked first in the code path.
- Context discipline: target by account/region; no broad user segments without a canary.
- Auditability: require change tickets for prod flag updates via webhook to ArgoCD or the ITSM tool.
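The kill-first principle can be sketched like this. The flag keys reuse the article's names; the FlagClient interface is a stand-in for an OpenFeature client so the pattern stays SDK-agnostic:

```typescript
// Sketch: evaluate the kill switch before the enable flag, failing closed.
// FlagClient is an illustrative stand-in for an OpenFeature client.
interface FlagClient {
  getBooleanValue(key: string, defaultValue: boolean): boolean;
}

export function newCheckoutEnabled(flags: FlagClient): boolean {
  // Kill switch wins: if checkout.v2.kill is true, the feature is off
  // regardless of what the enable flag says.
  if (flags.getBooleanValue('checkout.v2.kill', false)) return false;
  // Fail-closed default: an absent or erroring flag means "off".
  return flags.getBooleanValue('checkout.v2.enabled', false);
}
```

The same shape works for any risky flow: evaluate kill, then enable, never the reverse.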
A simple governance rule with OPA (pseudocode) to catch unsafe flags on PRs:
package flags

deny[msg] {
  input.path == "prod"
  f := input.flags[_]
  f.key == "checkout.v2.enabled"
  not f.has_kill_switch
  msg := "prod flags must define a kill switch"
}

Canaries and blue/green with automated analysis
Stop eyeballing dashboards. Use Argo Rollouts with analysis templates tied to SLOs.
# rollouts/checkout-canary.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 20
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        istio:
          virtualService:
            name: checkout-vs
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: http-errors
              - templateName: p90-latency
            args:
              - name: service
                value: checkout
        - setWeight: 20
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: http-errors
              - templateName: error-budget
            args:
              - name: service
                value: checkout
        - setWeight: 50
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: http-errors
              - templateName: p90-latency
            args:
              - name: service
                value: checkout
        - setWeight: 100
      # Background analysis: runs continuously from step 1 onward
      analysis:
        templates:
          - templateName: error-budget
        startingStep: 1
        args:
          - name: service
            value: checkout
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-errors
spec:
  args:
    - name: service
  metrics:
    - name: 5xx-rate
      interval: 1m
      successCondition: result[0] < 0.01
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service}}",response_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service}}"}[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: p90-latency
spec:
  args:
    - name: service
  metrics:
    - name: p90
      interval: 1m
      successCondition: result[0] < 0.3
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket{service="{{args.service}}"}[5m])) by (le))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-budget
spec:
  args:
    - name: service
  metrics:
    - name: burn-rate-1h
      interval: 1m
      successCondition: result[0] < 2
      failureLimit: 0
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(slo_errors_total{service="{{args.service}}"}[1h]))
            /
            sum(rate(slo_requests_total{service="{{args.service}}"}[1h]))
            / (1 - 0.995)

For blue/green, keep it boring: pre-warm green to full capacity, smoke-test with synthetic traffic, flip via Istio route or ALB target group, and keep blue hot for at least 30 minutes.
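The error-budget analysis computes a burn-rate multiple: observed error ratio divided by the budget (1 minus the SLO target). The 99.5% target comes from the query; the function below is just the arithmetic, with illustrative numbers:

```typescript
// Sketch: burn-rate multiple for an availability SLO.
// 1x means you are burning exactly your error budget; the canary analysis
// in this article fails the release when the 1h burn rate reaches 2x.
export function burnRate(errors: number, requests: number, sloTarget: number): number {
  const errorRatio = errors / requests;
  const budget = 1 - sloTarget;  // e.g. 0.005 for a 99.5% SLO
  return errorRatio / budget;
}
```

For example, 20 errors in 1,000 requests against a 99.5% SLO is a 4x burn: well past the 2x gate, so the rollout aborts.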
GitOps + policy: approvals, audit, and speed
All env changes flow through Git. ArgoCD enforces drift-free desired state.
# apps/prod/checkout-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-prod
  annotations:
    compliance.gitplumbers.io/change-ticket: 'CHG-12345'
    compliance.gitplumbers.io/owner: 'payments-sre'
spec:
  project: prod
  source:
    repoURL: 'git@github.com:org/checkout-infra.git'
    path: 'k8s/prod'
    targetRevision: main
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: checkout
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Enforce guardrails with OPA/Kyverno:
package deploy

# Require canary strategy and analysis in prod rollouts
violation[msg] {
  input.kind == "Rollout"
  input.metadata.namespace == "checkout"
  not input.spec.strategy.canary
  msg := "prod rollouts must use canary strategy"
}

violation[msg] {
  input.kind == "Rollout"
  input.metadata.namespace == "checkout"
  not input.spec.strategy.canary.analysis
  msg := "canary must define automated analysis"
}

CI gate with conftest before merge:
# .github/workflows/policy.yml
name: policy
on: [pull_request]
jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install conftest
        run: |
          curl -sSL https://github.com/open-policy-agent/conftest/releases/download/v0.45.0/conftest_0.45.0_Linux_x86_64.tar.gz | tar -xz conftest
          sudo mv conftest /usr/local/bin/
      - name: Run conftest
        run: conftest test k8s/ --policy policy/ --output table

Approvals: require two code owners for prod paths, SSO + RBAC on flag consoles, and a pre-approved rollback workflow (no CAB meeting to revert).
Runbooks and checklists that scale
When the pager goes off, checklists beat heroics. Print these or embed in your runbook tool.
Feature flag rollout (risky user flows)
- Create a kill switch flag: checkout.v2.kill, default false.
- Roll out checkout.v2.enabled to 1% of a low-risk cohort.
- Watch SLOs and the user funnel; if 5xx > 1% or p90 > 300 ms, auto-disable.
- Ramp: 1% → 5% → 20% → 50% → 100% with 10–30 min pauses and automated analysis at each step.
- Document the intent and expiry date; add a ticket to remove the flag in 14 days.
Canary release (service change)
- Pre-checks:
  - Build is green; unit/integration tests passed; security scan clean.
  - Observability dashboards exist for 5xx rate, p90 latency, and saturation.
  - Rollback tested in staging within the last week.
- Execution:
  - Apply the rollout manifest via PR; ArgoCD syncs.
  - Automated analysis runs at each weight; page on failure.
  - Manual approval allowed only between 20% and 50% for tier-0.
- Rollback:
  - kubectl argo rollouts undo checkout, or set the weight to 0.
  - Flip the kill switch if user-impacting.
Blue/green cutover
- Warm green to 100% capacity.
- Synthetic checks (login, checkout, refunds) pass twice.
- Flip traffic route; watch error budget for 30 minutes.
- Keep blue hot; schedule retirement after 24 hours.
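The "synthetic checks pass twice" rule is easy to automate. A sketch under assumed shapes: each run is a map of check name to pass/fail, newest run last:

```typescript
// Sketch: only flip traffic after every synthetic check (login, checkout,
// refunds, ...) passes on two consecutive runs. Shapes are illustrative.
export function readyToFlip(runs: Array<Record<string, boolean>>): boolean {
  if (runs.length < 2) return false;             // need two completed runs
  const lastTwo = runs.slice(-2);
  // Every check must pass in both of the most recent runs.
  return lastTwo.every(run => Object.values(run).every(ok => ok));
}
```

A single flaky pass should never trigger a cutover; requiring two consecutive green runs filters out transient luck.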
Incident quick-restore
# Pre-approved rollback: abort the canary, then roll back to the stable version
kubectl argo rollouts abort checkout
kubectl argo rollouts undo checkout
# Verify
kubectl argo rollouts get rollout checkout
kubectl -n checkout get pods -o wide | grep Running

What good looks like (and common faceplants)
Results from a recent GitPlumbers engagement (payments, ~60 engineers, SOC2/SOX):
- Lead time: 3 days → 90 minutes median, p90 under 4 hours.
- CFR: ~18% → 6% in 6 weeks.
- MTTR: 120 minutes → 12–18 minutes, largely due to pre-approved rollbacks and kill switches.
- Audit: zero findings on change management; every prod flip traceable to a PR and a person.
Pitfalls I keep seeing:
- Stale flags accumulating: every flag needs an expiry. Weekly cleanup or your code becomes a museum.
- DIY canaries in shell scripts: no analysis, no gates, high CFR. Use Rollouts/Flagger.
- Observability mismatch: canary analyzing 2xx rate while SLOs are latency-based. Align metrics.
- Drift between flag segments and rollout cohorts: define segments as code, not in the UI only.
- AI-generated vibe code around flags: clean up early. We do targeted vibe code cleanup and AI code refactoring so defaults and kill switches aren’t “TODOs.”
Final notes
Progressive delivery isn’t a tool; it’s a contract between speed and safety. If your CFR, lead time, and MTTR aren’t improving, the system isn’t working—no matter how pretty the dashboards are. If you need a neutral third party to cut through the noise, GitPlumbers has done this for banks, adtech, and B2B SaaS without turning deploys into committee meetings.
Key takeaways
- Pick CFR, lead time, and MTTR as the north-star metrics and wire them into gates, not slide decks.
- Treat flags, canaries, and blue/green as one system with GitOps + policy-as-code. Audit trails or it didn’t happen.
- Automate analysis with real SLOs. No manual canaries unless you like 2 a.m. rollbacks.
- Use typed flags and fail-closed defaults. Stale flags and silent fallbacks will wreck your CFR.
- Build runbooks and checklists that a new hire can follow at 3 a.m. and a SOX auditor can love at 3 p.m.
Implementation checklist
- Standardize on one flag SDK via OpenFeature and enforce typed, fail-closed defaults.
- Adopt Argo Rollouts or Flagger for canaries; ban DIY scripts in prod.
- Enforce GitOps (ArgoCD) for env changes. No kubectl to prod outside break-glass.
- Write OPA/Kyverno policies: require analysis for prod, two approvals, and audit annotations.
- Define SLOs per service; wire PromQL/Datadog queries into canary analysis templates.
- Create kill switches for risky features; pre-approve rollback workflows with change management.
- Run weekly cleanup: retire stale flags, verify drift-free manifests, test rollbacks in staging.
- Track CFR, lead time, MTTR in one dashboard; set budgets and stick to them.
Questions we hear from teams
- Do we need LaunchDarkly to do this, or can we stay open-source?
- You can stay OSS: use OpenFeature + Unleash or Flipt for flags and Argo Rollouts or Flagger for canaries. The key is standardizing the SDK and enforcing typed, fail-closed defaults. The governance bits (OPA, GitOps) are tool-agnostic.
- How do we pass SOX/SOC2 while releasing daily?
- GitOps provides the audit trail (who/what/when). Policy as code enforces approvals and safe strategies. Analysis templates tie changes to SLOs. That checks the boxes without a CAB meeting for every deploy—pre-approved, low-risk changes flow continuously.
- What’s the fastest way to cut CFR by half?
- Add automated analysis to canaries, enforce kill switches on risky flags, and pre-approve rollbacks. Those three changes alone usually cut CFR 30–60% in a month.
- Our AI-generated code added dozens of flags. What now?
- Run a flag hygiene sprint: catalog, classify, add expiry dates, enforce typed defaults, and remove dead code paths. We’ve done ‘vibe code cleanup’ and targeted refactors to put flags behind safe patterns without pausing delivery.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
