The Feature Flag System That Cut MTTR to 6 Minutes (Without Spiking CFR)
Feature flags are a safety system, not a growth hack. Here’s how to design them so experimentation moves the business without waking you at 2 a.m.
> Ship behind flags, but measure behind flags. Otherwise you’re just rolling the dice in production.
The 2 a.m. flip that cratered checkout
We had a Fortune 500 retail client whose checkout v2 looked great in staging. A PM flipped the checkout_v2 flag to 50% at 10 p.m. Traffic spiked, p95 latency crept from 180ms to 900ms, and one of the downstream payment gateways started 429-ing. Observability showed nothing unusual—until we realized none of the metrics were labeled by flag variant. We were debugging blind. It took three hours, a full rollback, and an executive Slack thread to unwind.
I’ve seen this movie a dozen times. Feature flags are sold as “move fast” tools. Used wrong, they’re silent landmines. Used right, they reduce change failure rate, shorten lead time, and give you a kill switch that drops MTTR to single-digit minutes. Here’s what actually works.
What “safe experimentation” means in numbers
Your feature flag system should move three north-star metrics:
- Change Failure Rate (CFR): Percentage of changes that cause degraded service or require hotfix/rollback. Flags should let you test changes in production with tiny blast radius, so CFR goes down.
- Lead Time: Time from code committed to value in users’ hands. Flags let you merge incomplete work behind safeties and decouple deploy from release.
- MTTR: Time to recover from an incident. A proper kill switch beats a redeploy every time.
Tie flags to these with observable gates:
- Define SLOs by service (e.g., 99.9% availability, p95 < 300ms).
- Gate rollouts with canaries that automatically pause or roll back if SLO burn increases with the flag on.
- Emit metrics labeled by `flag` and `variant` so you can compare on/off cohorts in real time (query sketch below).
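For example, a quick on/off delta in PromQL, using the request counters from the canary gate later in this post (metric names are illustrative; match them to whatever your services actually emit):

```promql
# Error-rate delta between cohorts: positive means the flag is hurting
  sum(rate(http_requests_errors_total{flag="checkout_v2",variant="on"}[5m]))
/ sum(rate(http_requests_total{flag="checkout_v2",variant="on"}[5m]))
-
  sum(rate(http_requests_errors_total{flag="checkout_v2",variant="off"}[5m]))
/ sum(rate(http_requests_total{flag="checkout_v2",variant="off"}[5m]))
```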
I like using Argo Rollouts for automated canary + analysis, and Prometheus for the guardrail metrics. If you’re on LaunchDarkly, their experiment stats are decent, but I still push raw timeseries into Prom for the control loop.
Architecture that holds under real traffic
The pattern I’ve seen scale to dozens of teams:
- Evaluation SDK via OpenFeature: Use `OpenFeature` so you can swap providers (LaunchDarkly, Unleash, Flipt) without rewriting app code.
- Flags-as-code: Manage flags, segments, and environments via `Terraform` and Git PRs. Sync with `ArgoCD` if self-hosted.
- Progressive delivery: Use `Argo Rollouts` or your platform equivalent to do cohort-based exposure (internal → 1% → 10% …) with automated analysis.
- Kill switches close to runtime: The SDK must fetch updates in seconds. Don’t require deploys to flip a flag. For high-impact features, pair flags with an `Istio`/`Envoy` circuit breaker that can shed load instantly.
- Observability wired in: Emit `flag`, `variant`, and `user_cohort` labels on success/error/latency metrics. Alert on budget burn deltas when the flag is on.
Minimal example (TypeScript + OpenFeature)
```ts
import { OpenFeature } from '@openfeature/js-sdk';
import client from 'prom-client';

const registry = new client.Registry();

const httpLatency = new client.Histogram({
  name: 'http_request_latency_seconds',
  help: 'HTTP latency',
  labelNames: ['route', 'flag', 'variant'],
});
const httpErrors = new client.Counter({
  name: 'http_requests_errors_total',
  help: 'HTTP errors',
  labelNames: ['route', 'flag', 'variant'],
});
registry.registerMetric(httpLatency);
registry.registerMetric(httpErrors);

const of = OpenFeature.getClient('checkout');

export async function handleCheckout(req, res) {
  const ctx = { userId: req.user.id, plan: req.user.plan };
  const enabled = await of.getBooleanValue('checkout_v2', false, ctx);
  const variant = enabled ? 'on' : 'off';
  const end = httpLatency.startTimer({ route: 'POST /checkout', flag: 'checkout_v2', variant });
  try {
    const result = enabled ? await checkoutV2(req) : await checkoutV1(req);
    res.json(result);
  } catch (err) {
    // errors carry the same `flag`/`variant` labels so cohorts stay comparable
    httpErrors.inc({ route: 'POST /checkout', flag: 'checkout_v2', variant });
    throw err;
  } finally {
    end();
  }
}
```

Flags as code (Terraform with LaunchDarkly)
provider "launchdarkly" {
access_token = var.ld_token
}
resource "launchdarkly_feature_flag" "checkout_v2" {
project_key = "retail-web"
key = "checkout_v2"
name = "Checkout v2"
description = "New payment orchestration path with idempotency."
tags = ["checkout", "risk:high", "owner:payments"]
variation_type = "boolean"
variations {
value = true
name = "on"
description = "Enable v2"
}
variations {
value = false
name = "off"
description = "Fallback to v1"
}
defaults {
on_variation = 1 # false
off_variation = 1
}
}
resource "launchdarkly_environment_flag" "checkout_v2_prod" {
env_key = "production"
flag_id = launchdarkly_feature_flag.checkout_v2.id
targets {
variation = 1 # off by default
values = ["*"]
}
}Self-hosting? Unleash or Flipt plus OpenFeature works well under GitOps. We’ve shipped this with ArgoCD syncing the Unleash config and Istio doing network-level circuit breaking as the last-resort kill switch.
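If you go that route, the GitOps wiring can be as small as one ArgoCD `Application` pointing at the repo that holds your flag config. A minimal sketch; the repo URL, path, and namespaces are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: unleash-flags
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/flag-config # hypothetical repo
    targetRevision: main
    path: unleash
  destination:
    server: https://kubernetes.default.svc
    namespace: unleash
  syncPolicy:
    automated:
      prune: true    # remove flags deleted from Git
      selfHeal: true # revert out-of-band edits
```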
Rollout workflow that scales with team count
The teams that avoid pager duty have boring, repeatable steps.
- Define the flag: owner, description, safe default, TTL, and blast radius.
- Bake a kill switch: boolean master guard that routes to v1 instantly. Keep it one network hop away (sketch after this list).
- Route internal-only: dogfood with staff first. 0% external traffic.
- Canary to 1%, watch SLOs: error rate, p95 latency, saturation. Use automated analysis.
- Ramp to 10%, 25%, 50%, 100% with gates—not calendar time.
- Rollback automatically if analysis fails; flip the kill switch manually if users are hurting.
- Clean up: remove dead code once 100% is stable. PR or it didn’t happen.
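The kill switch from step 2, as a minimal sketch on top of the OpenFeature client from earlier. The `checkout_v2_kill` flag name is illustrative; the point is that the guard is a separate boolean evaluated before any cohort logic:

```ts
import { OpenFeature } from '@openfeature/js-sdk';

const of = OpenFeature.getClient('checkout');

// One flip of `checkout_v2_kill` (hypothetical name) routes 100% of traffic
// to v1, regardless of how `checkout_v2` itself is targeted or ramped.
export async function checkoutV2Enabled(ctx: { userId: string; plan: string }): Promise<boolean> {
  const killed = await of.getBooleanValue('checkout_v2_kill', false, ctx);
  if (killed) return false;
  return of.getBooleanValue('checkout_v2', false, ctx);
}
```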
Argo Rollouts + Prometheus canary gate
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - pause: {duration: 300}
        - analysis:
            templates:
              - templateName: err-rate
            args:
              - name: flag
                value: checkout_v2
        - setWeight: 10
        - pause: {duration: 600}
        - analysis:
            templates:
              - templateName: latency # same shape as err-rate; sketched below
      trafficRouting:
        istio:
          virtualService: { name: checkout-vs, routes: [ primary ] }
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: err-rate
spec:
  args:
    - name: flag # declared so the Rollout's args above resolve
  metrics:
    - name: error-rate-flagged
      interval: 60s
      successCondition: result[0] < 0.02
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_errors_total{flag="{{args.flag}}",variant="on"}[5m]))
            /
            sum(rate(http_requests_total{flag="{{args.flag}}",variant="on"}[5m]))
```

This is the piece folks skip; then they wonder why CFR doesn’t budge. Gate on the metrics that matter while the flag is on.
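The `latency` template referenced at the 10% step follows the same shape. A sketch against the `http_request_latency_seconds` histogram from the TypeScript example; the 0.3s threshold mirrors the p95 < 300ms SLO from earlier, and the template shape is an assumption to tune for your own SLOs:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency
spec:
  args:
    - name: flag
  metrics:
    - name: p95-latency-flagged
      interval: 60s
      # fail the canary if p95 with the flag on breaches the 300ms SLO
      successCondition: result[0] < 0.3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.95,
              sum(rate(http_request_latency_seconds_bucket{flag="{{args.flag}}",variant="on"}[5m])) by (le))
```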
Observability: label everything and alert on deltas
If you can’t see the impact of a flag, you’re gambling. Minimum instrumentation:
- Emit `flag` and `variant` labels on latency, error rate, and throughput.
- Export a `feature_flag_state` gauge for critical flags so SRE can alert when someone turns on a high-risk path during traffic spikes.
- Compare cohorts (on/off) for SLO burn. Alert on deltas, not absolutes (rule sketch below).
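Here’s what that guardrail looks like as a Prometheus alerting rule sketch. It assumes `slo:error_budget_burn_rate` is a recording rule you already compute, and the label names are illustrative:

```yaml
groups:
  - name: flag-guardrails
    rules:
      - alert: FlagOnBudgetBurn
        # page only when the risky path is on AND the budget is burning hot
        expr: |
          (slo:error_budget_burn_rate{service="checkout"} > 2.0)
          and on ()
          (max(feature_flag_state{flag="checkout_v2"}) == 1)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "checkout_v2 is on while error budget burn exceeds 2x"
```

The `feature_flag_state` gauge that feeds it is a few lines of `prom-client`: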
```ts
import client from 'prom-client';

const flagState = new client.Gauge({
  name: 'feature_flag_state',
  help: '1 if flag enabled for this process/user cohort',
  labelNames: ['flag', 'cohort'],
});

function setFlagGauge(flag: string, enabled: boolean, cohort: string) {
  flagState.set({ flag, cohort }, enabled ? 1 : 0);
}
```

Tie these into your incident playbooks: “If `feature_flag_state{flag="checkout_v2"}` flips and `slo:error_budget_burn_rate > 2.0`, page on-call and auto-disable.” I’ve also paired this with an Istio destination rule circuit breaker to cap concurrent requests to the risky backend.
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
```

Hygiene: TTLs, cleanup, and avoiding flag graveyards
Flags are technical debt with a credit card APR. Without hygiene, your code turns into a haunted house.
- Time-to-live (TTL): Every flag gets an expiration date. Default 30–60 days. If it’s longer, it’s a config, not a flag.
- Ownership: Put the team and Slack channel in the flag metadata. If your provider supports it, tag with `owner:team-name`.
- Automation: A weekly job opens PRs to remove code paths for flags at 100% for >14 days and files issues for expired flags set at <100%.
- Block merges without cleanup: Use `Conftest`/`OPA` in CI to reject PRs introducing a flag without TTL/owner, and to block deploys if expired flags exist (a lightweight variant is sketched below).
- Beware AI-generated toggles: We keep finding “temporary” flags sprinkled by vibe coding or AI-generated code. They never get removed. Run a `vibe code cleanup` pass monthly.
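If a full Rego policy is more than you want to start with, even a jq gate in CI catches the basics. This sketch assumes a hypothetical `flags.json` manifest where each flag declares `ttl` and `owner`; it is not a provider format:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Fail the build if any flag is declared without a ttl or an owner.
# The flags.json schema here is illustrative.
missing=$(jq -r '.flags[] | select(.ttl == null or .owner == null) | .key' flags.json)
if [[ -n "$missing" ]]; then
  echo "Flags missing ttl/owner:" >&2
  echo "$missing" >&2
  exit 1
fi
```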
Example: scan LaunchDarkly for stale flags and open GitHub issues.
```bash
#!/usr/bin/env bash
set -euo pipefail

LD_TOKEN="$LD_TOKEN" # export in CI

curl -s -H "Authorization: $LD_TOKEN" \
  https://app.launchdarkly.com/api/v2/flags/my-project \
  | jq -r '.items[] | select(.environments.production.on == true and (.maintainer | not)) | .key' \
  | while read -r flag; do
      gh issue create --title "Cleanup stale flag: $flag" \
        --body "Flag $flag at 100% for 14d+. Remove dead code. Owner?" \
        --label flags,cleanup
    done
```

We’ve built internal bots at GitPlumbers that post “flag debt” dashboards next to error budgets. Nothing like a little sunlight to keep things tidy.
Results we’ve actually seen (and what we’d do differently)
- A consumer fintech moved to OpenFeature + LaunchDarkly + Argo Rollouts; MTTR on feature incidents dropped from 42 minutes to 6 minutes. CFR fell from 21% to 9% over two quarters.
- A B2B SaaS on self-hosted Unleash cut lead time from code merge to user exposure from 5 days to same-day by merging behind flags and gating exposures.
- A marketplace with `Istio` circuit breakers tied to kill switches avoided a full brownout when a partner API degraded; the feature stayed on for unaffected cohorts and auto-paused for others.
What we’d change sooner every time:
- Wire metrics by flag on day one. Retrofitting labels across services later is painful.
- Standardize the rollout checklist in a repo; don’t make teams reinvent it.
- Make cleanup visible: a weekly flag-debt report. If everything is a flag, nothing is a flag.
If you’re sitting on a pile of legacy flags or AI-generated toggles from a “move fast” phase, do a two-week code rescue: catalog flags, add ownership/TTL, wire metrics, kill dead ones, refactor long-lived configs. We do this regularly for clients modernizing monoliths and microservices alike.
Key takeaways
- Treat flags as a safety system with owners, SLAs, and TTLs—not as dev candy.
- Prioritize three metrics: change failure rate, lead time, and MTTR. Design flags to directly influence them.
- Use OpenFeature + a managed or self-hosted provider (LaunchDarkly, Unleash, Flipt) and manage flags via GitOps/Terraform.
- Wire flags into observability: emit variant labels to Prometheus and gate rollouts with Argo Rollouts or equivalent.
- Codify checklists for creation, rollout, incident response, and cleanup—automation or it won’t happen.
- Kill switches and circuit breakers must be one hop away at runtime—no redeploy required.
Implementation checklist
- Every flag has an owner, description, and time-to-live (TTL).
- Change plan includes blast radius, kill-switch path, and rollback criteria.
- Flag defaults safe-off; exposure cohorts defined (internal, beta, 1%, 10%, 50%, 100%).
- Observability wired: Prometheus metrics labeled by `flag` and `variant`; alert when error budget burn > threshold with flag on.
- Runbook includes flip-to-safe sequence and data backfill steps.
- Cleanup automation opens PRs to remove dead code once flag hits 100% or is decommissioned.
Questions we hear from teams
- OpenFeature vs. vendor SDKs—why bother?
- OpenFeature lets you swap providers (LaunchDarkly, Unleash, Flipt) without rewriting app code. In practice, it’s insurance. We’ve migrated a client from a homegrown flag service to LaunchDarkly in a week because the app code stayed the same.
- How do we prevent flag sprawl and config drift?
- Treat flags as code via Terraform and PRs, enforce TTL and owner with OPA/Conftest in CI, and run weekly automation that opens cleanup PRs. Make a dashboard that shames stale flags next to your SLOs.
- What if we’re mostly legacy monoliths?
- Flags shine in monoliths. Start with a small SDK footprint, wire metrics by flag, and use a kill switch at the ingress or service mesh. We’ve done legacy modernization where flags gated risky refactors with near-zero downtime.
- Can we use flags for experiments and still keep CFR low?
- Yes—if experiments are gated by SLO-based analysis and limited blast radius. The problem isn’t experimentation; it’s flipping to 50% with no guardrails. Canary + Prometheus + automatic rollback keeps CFR in check.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
