The Feature Flag Playbook That Halved Our Change Failure Rate
Flags aren’t magic. Done right, they turn scary releases into small, reversible bets. Here’s the design and the checklists that actually scale.
“Flags without guardrails are just distributed if statements. Treat them like infra.”
The outage that converted a skeptic
I used to think flags were lipstick on a pig. Then a Friday promo banner took down cart in under a minute at a retailer I worked with. The code passed CI, the rollout “looked” fine, but the banner’s image service 500’d under load and hung the page. No easy rollback—just revert the whole release and pray. MTTR: 74 minutes.
We rebuilt the release model around feature flags: progressive exposure, kill switches, and obvious runbooks. Six weeks later, the same team shipped a much riskier search rewrite, lit it up for 1% of traffic, saw P95 jump, toggled OFF, and fixed in an hour. Change failure rate dropped from ~18% to ~9% in a quarter. That’s the point: flags aren’t about faster code; they’re about faster decisions.
The only metrics that matter
If you can’t show movement on these, your flag program is theater:
- Change Failure Rate (CFR): percentage of deploys causing a customer-impacting incident. Flag goal: isolate changes and make failure reversible.
- Lead Time: from commit to production exposure. Flag goal: merge earlier, ship dark, expose later.
- Recovery Time (MTTR): time to mitigate an incident. Flag goal: instant kill switch; config rollbacks beat code rollbacks.
Instrument them:

```promql
# CFR proxy: failed_rollouts / total_rollouts over the last 7 days.
# Label rollback events with reason="flag-rollback" to prove flags changed the outcome.
sum(rate(rollouts_failed_total{service="checkout"}[7d]))
/
sum(rate(rollouts_total{service="checkout"}[7d]))

# MTTR: average time between incident start and mitigation.
avg_over_time(incident_mitigation_seconds_sum[30d])
/
avg_over_time(incident_mitigation_seconds_count[30d])
```

Tie rollouts to flags in events. If an incident is mitigated by `flag=search_v2 set=OFF`, count it.
System design that doesn’t rot
I’ve seen too many “flag systems” devolve into distributed if statements. What works is a clear split:
- Control plane: who can create/modify flags, audit logs, approvals, TTLs, and policies.
- Data plane: ultra-fast flag evaluation at runtime with sane defaults and local fallbacks.
- Guardrails: automated checks, SLO gates, and one-click kill.
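To make the split concrete, here's a minimal data-plane sketch (hypothetical names, no vendor SDK): evaluation is a local lookup against a last-known-good snapshot, with a caller-supplied safe default for unknown flags:

```typescript
// Hypothetical data-plane evaluator: fast local lookup, safe fallback.
type FlagSnapshot = Record<string, boolean>;

class FlagEvaluator {
  // Last-known-good snapshot, refreshed out-of-band by the control plane.
  private snapshot: FlagSnapshot = {};

  updateSnapshot(next: FlagSnapshot) {
    this.snapshot = next;
  }

  // Evaluation never throws and never blocks: unknown flags get the
  // caller-supplied default, which should always be the safe (OFF) path.
  isEnabled(key: string, defaultValue = false): boolean {
    const value = this.snapshot[key];
    return typeof value === 'boolean' ? value : defaultValue;
  }
}

const evaluator = new FlagEvaluator();
evaluator.updateSnapshot({ promo_banner_v2: true });
const on = evaluator.isEnabled('promo_banner_v2');       // present in snapshot
const missing = evaluator.isEnabled('search_v2', false); // falls back to default
```

The point of the split is that the data plane stays dumb and fast; all policy lives in the control plane.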
Tools I trust:
- OpenFeature SDKs to decouple your app from a vendor.
- Providers: LaunchDarkly, Unleash, or flagd for simple, self-hosted setups.
- GitOps via ArgoCD or Flux for flag manifest drift control.
- Policy via OPA/Conftest.
A minimal TypeScript integration with OpenFeature and flagd looks like this:

```typescript
import { OpenFeature, Client } from '@openfeature/js-sdk';
import { FlagdProvider } from '@openfeature/flagd-provider';

OpenFeature.setProvider(new FlagdProvider({ host: 'flagd', port: 8013 }));
const client: Client = OpenFeature.getClient('checkout');

export async function renderPromo(userId: string) {
  // The evaluation context is flat: targetingKey plus arbitrary attributes.
  const enabled = await client.getBooleanValue('promo_banner_v2', false, {
    targetingKey: userId,
    segment: 'internal',
  });
  if (!enabled) return null;
  return '<PromoBanner />';
}
```

And the corresponding flagd config you track in Git:
```yaml
# flags/promo_banner_v2.flagd.yaml
flags:
  promo_banner_v2:
    state: "OFF"  # default OFF; never ON by default
    variants:
      on: true
      off: false
    defaultVariant: off
    targeting:
      - if: "context.segment == 'internal'"
        then: on
      - if: "percentage(context.targetingKey, 1)"  # 1% rollout
        then: on
      - else: off
    metadata:
      owner: team-growth
      ttl: "2025-01-15"  # hard stop to force cleanup
      ticket: "GROW-1245"
      risk: "low"
```
GitOps all the things (including flags)
If your flags live only in a SaaS UI, you’ll drift and lose auditability. Put definitions in Git, sync with ArgoCD, and allow emergency overrides via the provider UI—but reconcile back to Git.
- Store flags per service in `flags/` with owners, TTLs, and risk.
- Use ArgoCD to sync to flagd/Unleash/LD relays.
- Lock prod with PR approvals and policy checks.
Example ArgoCD Application for a flags repo:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: flags-prod
spec:
  project: default
  source:
    repoURL: 'https://github.com/acme/flags'
    targetRevision: main
    path: envs/prod
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: flagd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Policy gate with OPA/Conftest to reject risky flags without TTLs or owners:
```rego
package flags

violation[msg] {
  input.kind == "flag"
  not input.metadata.owner
  msg := sprintf("flag %s missing owner", [input.name])
}

violation[msg] {
  input.kind == "flag"
  not input.metadata.ttl
  msg := sprintf("flag %s missing ttl", [input.name])
}
```

Progressive delivery that doesn’t page you at 2 a.m.
You want rollouts where the default failure mode is “toggle OFF, recover in seconds.” Use flags with traffic splitting and SLO checks.
- Start with internal cohorts, then 1% random users, then 5%, 25%, 50%, 100%.
- Gate promotions on Prometheus queries for error rate and latency.
- Keep a hard kill switch: a single flag that disables new code paths entirely.
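The kill-switch idea can be sketched as a wrapper (a hypothetical helper, not a vendor API) that routes to the stable path whenever the flag is OFF or the new path throws:

```typescript
// Hypothetical kill-switch wrapper: the new code path only runs while the
// flag is ON, and any failure degrades to the stable path for this request.
async function withKillSwitch<T>(
  flagOn: boolean,
  newPath: () => Promise<T>,
  stablePath: () => Promise<T>,
): Promise<T> {
  if (!flagOn) return stablePath();
  try {
    return await newPath();
  } catch (err) {
    console.error('new path failed, falling back', err); // feed this to metrics
    return stablePath();
  }
}

// Usage: search_v2 serving with a guaranteed fallback to search_v1.
withKillSwitch(
  true,
  async () => { throw new Error('search_v2 blew up'); },
  async () => 'search_v1 results',
).then(result => console.log(result)); // prints 'search_v1 results'
```

This keeps the blast radius of a bad new path to a single degraded request instead of an outage.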
Argo Rollouts + flag cohorts example:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: search
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: slos-healthy
        - setWeight: 5
        - pause: {duration: 600}
        - analysis:
            templates:
              - templateName: slos-healthy
        - setWeight: 25
        - pause: {duration: 900}
  selector:
    matchLabels:
      app: search
```

Guardrail template using Prometheus:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: slos-healthy
spec:
  metrics:
    - name: errors
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            rate(http_requests_total{app="search",status=~"5..",flag="search_v2",variant="on"}[5m])
            /
            rate(http_requests_total{app="search",flag="search_v2"}[5m])
    - name: latency
      successCondition: result[0] < 0.3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{app="search",flag="search_v2",variant="on"}[5m])))
```

Operationally, teach teams: if the analysis fails, the rollout pauses or aborts automatically, and the first response is to toggle OFF, not to merge a revert.
The runbooks and checklists that scale
These are the steps I copy-paste into every team’s onboarding. Keep them in your repo.
- Define the flag with TTL, owner, risk, and ticket in `flags/`.
- Implement with OpenFeature in code. Default OFF. Handle both variants.
- Add metrics: `exposures_total{flag,variant}`, `errors_total{flag,variant}`, latency histograms.
- Create a rollout plan: cohorts, promotion criteria, and kill switch.
- Write a one-line mitigation action in the incident runbook: `curl` the provider or flip it in the UI.
- Run a staging chaos drill: flip OFF under load, verify graceful fallback.
- Roll out progressive exposure; promote only on green SLOs.
- Within SLA (e.g., 14 days), delete dead code paths and remove the flag.
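The "add metrics" step can be sketched with a minimal in-process counter. The metric names follow the checklist above; in production you'd back them with your Prometheus client's `Counter` instead:

```typescript
// Minimal in-process counters keyed by flag and variant; swap in a real
// Prometheus client Counter in production.
const counters = new Map<string, number>();

function inc(metric: string, labels: { flag: string; variant: string }) {
  const key = `${metric}{flag="${labels.flag}",variant="${labels.variant}"}`;
  counters.set(key, (counters.get(key) ?? 0) + 1);
}

// Record an exposure on every evaluation, and an error on every failure
// inside the flagged code path.
function recordExposure(flag: string, variant: 'on' | 'off') {
  inc('exposures_total', { flag, variant });
}

function recordError(flag: string, variant: 'on' | 'off') {
  inc('errors_total', { flag, variant });
}

recordExposure('search_v2', 'on');
recordExposure('search_v2', 'on');
recordError('search_v2', 'on');
```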
A one-click kill via LaunchDarkly’s API looks like this:
```bash
curl -X PATCH \
  'https://app.launchdarkly.com/api/v2/flags/acme/search_v2' \
  -H 'Authorization: api-123' \
  -H 'Content-Type: application/json' \
  -d '[{"op":"replace","path":"/environments/prod/on","value":false}]'
```

And a basic Grafana dashboard to watch rollouts:
- Error budget burn rate for `flag=search_v2, variant=on`
- Exposure count per cohort (internal, 1%, 5%, ...)
- Time-to-mitigation from incident open to flag OFF
Governance: prevent flag debt before it buries you
Flags are temporary scaffolding. Without governance, they become archaeology.
- TTL enforcement: fail PRs if TTL > 90 days; escalate owners when TTL is near.
- Tagging: `owner`, `risk`, `ticket`, `service`, `experiment`.
- RBAC: only TLs/EMs can enable to >25% without approval.
- Audit: every toggle emits a structured event with user, timestamp, reason.
- Cleanup: bots open PRs to delete code paths once exposure is 100% for N days.
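The audit requirement boils down to a structured toggle event emitted on every change; the event shape here is a hypothetical example:

```typescript
// Hypothetical structured audit event: every toggle records who, what,
// when, and why, so incident timelines can be rebuilt from flag history.
interface ToggleEvent {
  flag: string;
  from: boolean;
  to: boolean;
  user: string;
  reason: string;
  timestamp: string;
}

const auditLog: ToggleEvent[] = [];

function toggleFlag(flag: string, from: boolean, to: boolean, user: string, reason: string) {
  auditLog.push({ flag, from, to, user, reason, timestamp: new Date().toISOString() });
  // In production, also ship the event to your event bus or SIEM.
}

toggleFlag('search_v2', true, false, 'oncall@acme.com', 'P95 regression, killing rollout');
```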
Simple detector for stale flags (Node.js):
```typescript
import fs from 'fs';
import { globSync } from 'glob';

// Flag keys defined in the control plane.
const flags = JSON.parse(fs.readFileSync('flags/index.json', 'utf8')) as string[];
const files = globSync('src/**/*.{ts,tsx,js,jsx}');

// A flag is orphaned if no source file references its key.
const unused = flags.filter(key => !files.some(f => fs.readFileSync(f, 'utf8').includes(key)));
if (unused.length) {
  console.log('Orphaned flags:', unused);
  process.exitCode = 1; // fail CI
}
```

Use Conftest to block merges for missing owners/TTLs. Have a weekly job that posts a Slack report of:
- Flags past TTL
- Flags enabled 100% for >14 days
- Flags with no exposure in the last 30 days
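Those three report buckets are just predicates over flag metadata. A sketch, where the `FlagStatus` shape is hypothetical and should match whatever your control plane exports:

```typescript
// Hypothetical flag metadata as the weekly hygiene job sees it.
interface FlagStatus {
  key: string;
  ttl: string;                 // ISO date, set at creation
  exposurePct: number;         // current exposure, 0-100
  daysAtFullExposure: number;  // days spent at 100%
  daysSinceLastExposure: number;
}

// Classify flags into the three Slack report buckets.
function weeklyReport(flags: FlagStatus[], now: Date) {
  return {
    pastTtl: flags.filter(f => new Date(f.ttl) < now),
    fullyRolledOut: flags.filter(f => f.exposurePct === 100 && f.daysAtFullExposure > 14),
    dormant: flags.filter(f => f.daysSinceLastExposure > 30),
  };
}
```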
Results you can actually defend in a QBR
What success looks like after 1–2 quarters of doing this right:
- CFR: down 30–60% because failures are isolated and reversible.
- Lead time: down 25–50% as engineers merge earlier and ship behind flags.
- MTTR: down 50–80% because mitigation is a toggle, not a deploy.
- Incident fatigue: high-severity tickets hitting on-call drop by 20–40%.
You’ll also see cultural changes: product trusts smaller bets; engineers don’t fear Friday deploys; and your execs stop asking “why are we so slow?” because you can show the burn-down on CFR and MTTR.
If you want a sanity check on your current setup, GitPlumbers has helped teams on LaunchDarkly, Unleash, and homegrown stacks move from ad hoc toggles to a governed program in 4–6 weeks. We’ll bring the playbooks, the policy, and the dashboards. You keep shipping.
Key takeaways
- Design flags as a system: control plane, data plane, and guardrails—not ad hoc if-statements.
- Use OpenFeature to decouple your SDK from vendors; treat flag definitions as code via GitOps.
- Measure north-star metrics: change failure rate, lead time, and recovery time (MTTR).
- Standardize on rollouts with preflight checks, staged exposure, automated kill switches, and cleanup SLAs.
- Bake in governance: TTLs, tagging, RBAC, audit logs, and policy (OPA/Conftest) to avoid flag debt.
- Make reversibility cheap: one-click kill, instant config rollback, and clear runbooks.
Implementation checklist
- Pre-merge: add a unique flag key, TTL, owner, and tag (team, risk, ticket) to the flag manifest.
- Pre-merge: add telemetry counters for exposure, errors, and latency scoped to the flag and cohort.
- Pre-prod: link the flag to a rollout plan (1%→5%→25%→50%→100%) and SLO guardrails.
- Pre-prod: verify kill switch path (simulate toggle to OFF in staging with production-like traffic).
- Deploy: start with `off` by default, enable only for internal or synthetic users first.
- Rollout: use canary + flag cohorting; promote only if error rate and latency stay within SLOs for N minutes.
- Incident: first move is toggle flag OFF; second move is roll back config via GitOps.
- Post-release: delete code paths within SLA (e.g., 14 days) and remove the flag from manifests.
- Weekly hygiene: report on stale flags, orphaned flags, and flags without owners; assign remediation.
- Quarterly: run chaos drills that randomly kill a high-risk flag and validate MTTR.
Questions we hear from teams
- How do I prevent flags from leaking into performance-critical paths?
- Evaluate flags once per request (or per session) and cache the result in context. Avoid flag checks inside tight loops. For JVM/Node, use provider-side streaming for near-zero latency and fall back to last-known values on provider timeouts.
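The once-per-request pattern looks like this in a sketch; the request context and the `evaluate` callback are hypothetical stand-ins for your SDK:

```typescript
// Evaluate each flag at most once per request and reuse the result,
// so hot loops never touch the SDK.
type Evaluate = (flag: string) => boolean;

class RequestFlagContext {
  private cache = new Map<string, boolean>();
  constructor(private evaluate: Evaluate) {}

  isEnabled(flag: string): boolean {
    if (!this.cache.has(flag)) {
      this.cache.set(flag, this.evaluate(flag));
    }
    return this.cache.get(flag)!;
  }
}

// Even checked 10,000 times in a tight loop, the SDK is hit once.
let sdkCalls = 0;
const ctx = new RequestFlagContext(() => { sdkCalls++; return true; });
for (let i = 0; i < 10_000; i++) ctx.isEnabled('search_v2');
// sdkCalls === 1
```

A side benefit: the flag value is consistent for the whole request, so users never see a half-on page.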
- Should flags live in the same repo as code?
- Treat flags as configuration with their own lifecycle. In practice: a dedicated flags repo synced via ArgoCD to your providers, referenced from service repos via submodules or automation. The critical part is PR-based review, policy checks, and audit trails.
- When do I delete a flag?
- Set TTLs on creation. Once exposure is 100% and the feature is stable for N days (7–14 is common), open an automated PR to remove the dead code and the flag definition. Aging flags increase cognitive load and risk.
- Do I still need canary deploys if I have flags?
- Yes. Flags protect feature-level behavior. Canaries protect infrastructure and dependency changes. Use both: canary the binary, flag the behavior.
- What if a vendor (e.g., LaunchDarkly) is down?
- Use OpenFeature with streaming + local evaluation. Providers like `flagd` or LD SDK cache flag values and default to last-known-good on outages. Always design a safe OFF default and idempotent fallbacks.
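The last-known-good pattern can be sketched as a race between the provider and a deadline; this is a hypothetical helper, not a vendor SDK:

```typescript
// Hypothetical resilient evaluation: race the provider against a deadline,
// fall back to last-known-good on timeout/error, then to the safe default.
const lastKnownGood = new Map<string, boolean>();

async function resolveFlag(
  flag: string,
  provider: () => Promise<boolean>,
  safeDefault: boolean,
  timeoutMs = 50,
): Promise<boolean> {
  const deadline = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error('provider timeout')), timeoutMs),
  );
  try {
    const value = await Promise.race([provider(), deadline]);
    lastKnownGood.set(flag, value); // refresh the cache on every success
    return value;
  } catch {
    return lastKnownGood.get(flag) ?? safeDefault;
  }
}
```

With this shape, a provider outage can only pin you to the last value you served; it can never flip a flag ON.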
