The Feature Flag System That Cut Our MTTR to Minutes (Without Torching CFR)

Feature flags aren’t magic. Design them like safety gear, measure them like an SRE, and wire them like infra—then you can ship risky changes safely, fast.

Feature flags don’t replace release engineering—they are release engineering. Treat them like safety gear and they’ll pay you back in MTTR and sleep.

The Friday Night Save (and the Tuesday Morning Postmortem)

We flipped on a new pricing engine at 5% for EMEA traffic. Two minutes later, checkout p95 latency spiked and error rates doubled. We didn’t roll back the deploy; we yanked a kill switch. MTTR was 3 minutes, CFR didn’t move, and we shipped a fix the next morning. That felt great—until Tuesday, when we realized half the org had write access to flags, no flag had a TTL, and our dashboards didn’t connect flags to SLOs. We got lucky.

I’ve seen flags save companies and I’ve seen them cause planet-scale outages (ask anyone who lived through a vendor control-plane incident). The difference is design. Feature flags must be treated as release engineering and SRE tooling, not a product toy. Here’s what actually works.

Measure What Matters: CFR, Lead Time, MTTR

If flags don’t move these, they’re noise:

  • Change Failure Rate (CFR): percent of changes causing incidents. Flags should let you separate deploy from release, shrinking blast radius via targeting and progressive rollout.
  • Lead Time: time from code commit to production release. Flags enable dark shipping and let you validate in prod with tiny cohorts.
  • MTTR: time to restore service. A proper kill switch beats a rollback when your bug is entangled with data or dependent services.

Practical wiring:

  • Tag every deploy and flag change with the same trace_id/deployment_id metadata in logs and spans (OpenTelemetry span attributes like feature_flag.key and feature_flag.variant).
  • Treat flag flips as change events in your observability tools.
# Prometheus: error-rate with flag dimension
sum by (service, feature_flag) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service, feature_flag) (rate(http_requests_total[5m]))
# Datadog: mark flag change events (for MTTR timeline correlation)
curl -X POST "https://api.datadoghq.com/api/v1/events" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "flag:pricing_engine -> OFF",
    "text": "owner=payments risk=high reason=error_budget_burn",
    "tags": ["service:checkout","feature_flag:pricing_engine"]
  }'

Architecture That Doesn’t Flake: Control Plane, Data Plane, Escape Hatches

You need three pieces:

  • Control plane: where flags live (LaunchDarkly, Unleash, Flipt). Manage via API and Terraform for auditability.
  • Data plane: client-side evaluation with local cache and offline defaults via OpenFeature SDKs. Avoid network lookups on hot paths.
  • Escape hatches: out-of-band kill switches when the control plane or SDK goes sideways.

Key design calls:

  • Use OpenFeature to vendor-neutralize. You’ll switch at least once; don’t rewrite every service.
  • Prefer local evaluation with a relay/edge cache (Unleash proxy, LD relay). Timeouts < 50ms, failure mode = safe default.
  • Bake a global kill switch per service with an env var, ConfigMap, or an Istio header route. It must work when your provider is down (see the env-var sketch after the code below).
# Istio: emergency kill via header (ops can inject header at ingress)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-vs
spec:
  hosts: ["checkout.svc.cluster.local"]
  http:
  - match: [{ headers: { X-Flag-Pricing-Engine: { exact: "off" } } }]
    route: [{ destination: { host: checkout, subset: stable } }]
  - route: [{ destination: { host: checkout, subset: experimental } }]
// Node/TS with OpenFeature: safe default and offline bootstrap
import { OpenFeature } from '@openfeature/js-sdk';

OpenFeature.setProvider(new MyUnleashProvider({ // your Unleash-backed OpenFeature provider
  url: process.env.UNLEASH_PROXY!,
  token: process.env.UNLEASH_TOKEN!,
  bootstrap: { pricing_engine: false }, // offline default
  timeout: 40
}));

const client = OpenFeature.getClient('checkout');
export async function priceOrder(ctx, order) {
  const enabled = await client.getBooleanValue('pricing_engine', false, { userId: ctx.userId });
  if (!enabled) return legacyPrice(order);
  return newPricingEngine(order);
}
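
The Istio route above is an ingress-level escape hatch; in-process, the kill switch should short-circuit before any provider call. A minimal sketch, assuming a hypothetical FLAGS_KILL_ALL environment variable (or a ConfigMap-mounted equivalent) and reusing the OpenFeature client from the snippet above:
// Env-var kill switch checked before the SDK call; FLAGS_KILL_ALL is an assumed
// name (e.g. "pricing_engine,new_search"), set via env var or ConfigMap by ops.
const KILLED = new Set(
  (process.env.FLAGS_KILL_ALL ?? '').split(',').map((f) => f.trim()).filter(Boolean)
);

export async function isEnabled(flag: string, fallback: boolean, userId: string): Promise<boolean> {
  if (KILLED.has(flag)) return false; // ops kill: the provider is never consulted
  try {
    return await client.getBooleanValue(flag, fallback, { userId });
  } catch {
    return fallback; // provider down or timed out => safe default
  }
}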

Implementation Blueprint: OpenFeature + Unleash (or LaunchDarkly) with GitOps

Pick your stack; here are two battle-tested paths.

  1. OpenFeature + Unleash (self-hosted, no SaaS dependency)
  • Run Unleash with a proxy in Kubernetes; manage config with ArgoCD.
  • Evaluate locally via OpenFeature provider.
# ArgoCD app-of-apps to deploy Unleash + proxy
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-unleash
spec:
  project: platform
  source:
    repoURL: https://github.com/acme/platform-infra
    path: charts/unleash
    targetRevision: main
  destination: { namespace: platform, server: https://kubernetes.default.svc }
  syncPolicy: { automated: { prune: true, selfHeal: true } }
  2. OpenFeature + LaunchDarkly (managed, fast to start)
  • Define flags/environments in Terraform; restrict UI flips with RBAC + change approvals.
# Terraform: LaunchDarkly flag and environment wiring
provider "launchdarkly" { access_token = var.ld_token }

resource "launchdarkly_environment" "prod" {
  project_key = "checkout"
  key         = "prod"
  name        = "Production"
}

resource "launchdarkly_feature_flag" "pricing_engine" {
  project_key = "checkout"
  key         = "pricing_engine"
  name        = "Pricing Engine"
  description = "Enables new pricing engine"
  variations  = [{ value = true }, { value = false }]
  tags        = ["risk:high","owner:payments"]
}

Rollout Playbooks You Can Run at 2 a.m.

Ship playbooks, not vibes. Codify rollouts as steps everyone follows.

  • Flag creation checklist

    1. Define owner (Slack, pager), risk level, TTL, cleanup criteria.
    2. Set safe default = OFF, define global kill switch.
    3. Target internal cohort first; add canary segment (1–5%).
    4. Add dashboards and alerts keyed on feature_flag dimension.
    5. Write runbook entry: how to disable, who to page, rollback plan.
  • Progressive rollout

    1. 1% internal → watch p95, error rate, and SLO burn for 10–15 minutes.
    2. 5% random cohort; verify idempotency, data integrity, cost impact.
    3. 25%, 50%, 100% with guardrails. Abort if error budget burn > 2%/h.
# Example: scripted rollout with Unleash API
set -euo pipefail
FLAG=pricing_engine
for pct in 1 5 25 50 100; do
  echo "Rollout ${pct}%"
  # NOTE: this adds a new strategy per step; on newer Unleash versions, update the
  # existing gradualRollout/flexibleRollout strategy in place instead.
  curl -s -X POST "$UNLEASH_URL/api/admin/projects/default/features/$FLAG/strategies" \
    -H "Authorization: $UNLEASH_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"name\":\"gradualRolloutRandom\",\"parameters\":{\"percentage\":$pct}}" >/dev/null
  sleep 900  # observe for 15 minutes before advancing
  ./guardrail-check.sh || { echo "Abort at ${pct}%"; exit 1; }
done
  • App guard code (don’t trust humans)
// Go: fast path guard with circuit-breaker default
func PriceOrder(ctx context.Context, order Order) (Price, error) {
  enabled, err := flags.Bool("pricing_engine").Get(ctx)
  if err != nil || !enabled {
    return legacy.Price(order)
  }
  p, err := newengine.Price(order)
  if err != nil {
    // Soft-fail to legacy once the new engine's error rate crosses the threshold;
    // below it, return the error so individual failures stay visible.
    if metrics.Rate("pricing_engine_errors", 5*time.Minute) > 0.05 {
      return legacy.Price(order)
    }
  }
  return p, err
}

Observability and Guardrails: Tie Flags to SLOs

If you can’t see it, you can’t save it.

  • SLOs per surface: checkout availability (99.9%), latency p95, and correctness (refund rate). Track by feature_flag.
  • Error budget burn as a kill condition: automate flips when burn rate spikes.
# 1h error rate for the availability SLO (divide by the error budget to get burn rate), segmented by feature_flag
sum by (feature_flag) (rate(errors_total{service="checkout"}[1h]))
  / sum by (feature_flag) (rate(requests_total{service="checkout"}[1h]))
# Auto-disable via webhook when burn rate exceeds threshold
burn=$(./calc_burn_rate.sh checkout pricing_engine)
if awk -v b="$burn" 'BEGIN { exit !(b > 2.0) }'; then
  curl -X POST "$UNLEASH_URL/api/admin/feature-toggles/pricing_engine/disable" \
    -H "Authorization: $UNLEASH_TOKEN"
fi
  • Tracing with attributes
// OpenTelemetry: attach flag context to spans
span.setAttributes({
  'feature_flag.key': 'pricing_engine',
  'feature_flag.variant': enabled ? 'on' : 'off',
});
  • Dashboards
    • Grafana: panels grouped by feature_flag for latency and errors.
    • Honeycomb: breakdown by feature_flag.variant to catch tail latencies.
    • Datadog: monitor on error-budget burn; event stream for flips.

Lifecycle: TTLs, Code Refs, and Debt Budgets

Flags rot. They calcify complexity and block refactors. Bake removal into the system.

  • TTL and reminders: add expires_at on creation; page owner when overdue.
  • Code references: use ld-find-code-refs or a ripgrep-based scanner in CI.
  • Debt budget: limit max active release flags per service (e.g., ≤10). Over budget? You delete before adding; one way to enforce both the TTL and the budget in CI is sketched below.
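
One way to implement the budget/TTL step the CI job below calls, sketched in TypeScript; the flags.json registry, its field names, and the 10-flag budget are assumptions you’d adapt to your own tooling:
// Fail CI when a flag is past its TTL or a service is over its flag budget.
// The flags.json registry format is an assumed convention, not a standard.
import { readFileSync } from 'node:fs';

interface FlagMeta {
  key: string;
  owner: string;
  service: string;
  expires_at: string; // ISO 8601
}

const MAX_ACTIVE_FLAGS = 10;
const flags: FlagMeta[] = JSON.parse(readFileSync('flags.json', 'utf8'));

const now = Date.now();
const stale = flags.filter((f) => Date.parse(f.expires_at) < now);

const perService = new Map<string, number>();
for (const f of flags) perService.set(f.service, (perService.get(f.service) ?? 0) + 1);
const overBudget = [...perService].filter(([, count]) => count > MAX_ACTIVE_FLAGS);

if (stale.length > 0 || overBudget.length > 0) {
  console.error('stale flags:', stale.map((f) => `${f.key} (owner ${f.owner})`).join(', '));
  console.error('over budget:', overBudget.map(([svc, n]) => `${svc}=${n}`).join(', '));
  process.exit(1); // fails the PR, same intent as enforce_flag_budget.sh
}
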
# GitHub Actions: detect stale flags and fail PRs over budget
name: flag-hygiene
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: ld-find-code-refs
      uses: launchdarkly/find-code-references@v2
      with:
        access-token: ${{ secrets.LD_TOKEN }}
        proj-key: checkout
    - name: enforce-budget
      run: ./scripts/enforce_flag_budget.sh 10

AI-generated “vibe code” loves to scatter if (flag) all over the place. Centralize evaluation, wrap it behind small interfaces, and schedule “vibe code cleanup” alongside feature removal. GitPlumbers has rescued more than one team buried under zombie flags and AI hallucinations.
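
Centralizing evaluation can be as small as one module that owns the flag keys and exposes intent-named functions; a minimal sketch (the keys and function names are illustrative):
// flags.ts: the only module that knows flag keys; feature code asks questions.
import { OpenFeature } from '@openfeature/js-sdk';

const client = OpenFeature.getClient('checkout');

const KEYS = {
  pricingEngine: 'pricing_engine',
  newSearchRanking: 'new_search_ranking', // illustrative second flag
} as const;

export async function useNewPricingEngine(userId: string): Promise<boolean> {
  return client.getBooleanValue(KEYS.pricingEngine, false, { userId });
}

export async function useNewSearchRanking(userId: string): Promise<boolean> {
  return client.getBooleanValue(KEYS.newSearchRanking, false, { userId });
}
// Removing a flag is now one function plus its call sites, which the code-ref
// scanners above can find mechanically.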

Data and Migrations: Double-Write, Read-Switch, Cleanup

Flags around data changes are trickier than UI toggles.

  • Double-write: write to old and new storage behind a flag while reads stay on old (app-side sketch after the SQL below).
  • Read-switch: flip reads to new once parity checks pass; keep double-write until confidence is high.
  • Cleanup: drop old writes, then remove the flag and schema.
-- Postgres: gated column rollout
ALTER TABLE orders ADD COLUMN price_v2 NUMERIC;
-- app: IF flag(pricing_engine) THEN write price_v2 ELSE write price;
-- validation job compares price vs price_v2 until diff < 0.1% for N days
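
On the application side, each phase can be gated by its own flag; a minimal sketch assuming hypothetical write_price_v2 and read_price_v2 flags and an illustrative repository interface:
// Double-write / read-switch sketch; the repository methods are illustrative.
import { OpenFeature } from '@openfeature/js-sdk';

const client = OpenFeature.getClient('checkout');

interface OrdersRepo {
  writePrice(orderId: string, price: number): Promise<void>;
  writePriceV2(orderId: string, price: number): Promise<void>;
  readPrice(orderId: string): Promise<number>;
  readPriceV2(orderId: string): Promise<number>;
}

export async function savePrice(repo: OrdersRepo, orderId: string, price: number): Promise<void> {
  await repo.writePrice(orderId, price); // old column always written
  if (await client.getBooleanValue('write_price_v2', false, {})) {
    await repo.writePriceV2(orderId, price); // shadow write to price_v2
  }
}

export async function loadPrice(repo: OrdersRepo, orderId: string): Promise<number> {
  if (await client.getBooleanValue('read_price_v2', false, {})) {
    return repo.readPriceV2(orderId); // flip reads only after parity checks pass
  }
  return repo.readPrice(orderId);
}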

Practice the reversal. Nothing tanks MTTR like a one-way migration hidden behind a flag.

What Good Looks Like (Real Outcomes)

At a fintech we worked with in 2024:

  • CFR dropped from 18% to 7% in 90 days after adopting OpenFeature + Unleash with progressive rollouts.
  • Lead time (PR merge → prod exposure to 5%) fell from 2 days to 45 minutes (GitOps + scripted rollout).
  • MTTR on flag-induced incidents averaged 6 minutes due to global kill switches and auto-disable on burn rate.
  • Flag debt kept under 8 per service with TTLs and code-ref enforcement, reducing cognitive load during on-call by ~30%.

This wasn’t about a tool. It was design, discipline, and checklists.

Key takeaways

  • Flags reduce CFR only when they’re engineered like safety gear: defaults safe, kill switches fast, and guardrails enforced in CI/CD.
  • Track the trifecta: change failure rate, lead time, and MTTR—with flag-aware dashboards that tie toggles to SLOs.
  • Use OpenFeature to decouple app code from vendors; run Unleash/Flipt self-hosted or LaunchDarkly with Terraform + GitOps.
  • Codify rollout playbooks and checklists; practice them. If it’s not repeatable at 2 a.m., it’s not done.
  • Automate cleanup with TTLs, code-references, and debt budgets—or flags become permanent landmines.

Implementation checklist

  • Choose a control plane (LaunchDarkly, Unleash, Flipt) and a client evaluation model with offline defaults.
  • Wire OpenFeature SDKs with a safe default and a one-click global kill switch per service.
  • Store flag definitions in Git with Terraform; sync via ArgoCD for auditable changes.
  • Define flag risk levels, owners, TTLs, and removal criteria at creation time.
  • Implement progressive rollout scripts (1%→5%→25%→50%→100%) gated by error budget burn.
  • Instrument flag usage and outcomes; correlate to SLOs and CFR in Grafana/Datadog/Honeycomb.
  • Automate stale-flag detection with ld-find-code-refs or custom scanners in CI.
  • Rehearse failure: drill the rollback, kill switch, and incident comms quarterly.

Questions we hear from teams

Do I need a vendor, or can I self-host?
Both work. If you have strict data boundaries or want zero SaaS runtime dependency, run Unleash/Flipt behind a proxy. If you need speed-to-value, LaunchDarkly with Terraform control is solid. Use OpenFeature either way to keep your app code vendor-neutral.
How do I prevent a flag provider outage from taking me down?
Local evaluation with cached configs, short timeouts, and safe defaults. Add a service-level kill switch (env/ConfigMap) and an ingress-based header override. Treat the provider as optional, never on the hot path.
Aren’t canaries enough? Why flags too?
Canaries protect deploy risk; flags protect release risk and allow targeting by cohort, geography, or account. Use both: Argo Rollouts/Spinnaker for canary deployments and feature flags for behavior gating.
What about performance overhead in hot paths?
Use SDKs with in-memory caches and evaluate once per request/session. Avoid per-call network fetches. Pre-warm caches and measure tail latency impact; it should be sub-millisecond.
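A minimal way to keep evaluation off the per-call path is a request-scoped snapshot; a sketch (names are illustrative):
// Evaluate each flag once per request; handlers read the snapshot map.
import { OpenFeature } from '@openfeature/js-sdk';

const client = OpenFeature.getClient('checkout');

export async function snapshotFlags(userId: string, keys: string[]): Promise<Map<string, boolean>> {
  const snapshot = new Map<string, boolean>();
  for (const key of keys) {
    snapshot.set(key, await client.getBooleanValue(key, false, { userId }));
  }
  return snapshot;
}
// e.g. middleware: req.flags = await snapshotFlags(req.userId, ['pricing_engine']);
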
How do I handle database migrations safely with flags?
Use double-write → read-switch → cleanup. Validate parity with jobs, not anecdotes. Keep the path reversible until you’ve proved parity over time windows under real load.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
