Feature Flags Without Regret: The Design That Halved Change Failures and Shrunk MTTR

Stop playing roulette with flag flips. Here’s the design, guardrails, and checklists that make experimentation safe—and measurable—for real engineering orgs.

Flags are not if-statements; they’re safety equipment. Treat them like seatbelts with sensors, not duct tape on the dashboard.

The 2 a.m. flip that took down checkout (and how we stopped doing that)

I’ve watched teams ship a slick new checkout behind a flag, “just for staff.” Then someone tweaks targeting in a vendor UI on Friday night, cache misses spike, and suddenly you’re doing incident comms while Stripe is fine but your session store isn’t. I’ve seen this fail at a big-box retailer (Firebase Remote Config, React, no kill switch) and at a unicorn fintech (ad‑hoc if (env.NEW_FLOW) everywhere). The pattern is the same: flags treated as toggles, not as safety equipment.

What actually works: design flags as a release system with guardrails, not an afterthought. Track the DORA metrics that matter—change failure rate, lead time, and MTTR—per flag flip. Put targeting in code you can review. Tie rollouts to SLOs. Automate rollback. And make the runbook boring enough your on-call can do it half-asleep.

We’ve implemented this at GitPlumbers for teams running LaunchDarkly, Unleash, Flagsmith, and homegrown stacks. The results are repeatable when the system is designed right.

What you should measure (and why execs will care)

If experimentation doesn’t move your DORA needle, it’s theater.

  • Change failure rate (CFR): Percent of flag flips that require a rollback or cause an SLO violation within a defined window (typically 24 hours). Track per flag and per team.
  • Lead time for changes: From PR merge that adds targeting rules to the first user seeing the feature. With Git-managed flags and pre-approved runbooks, we see this drop from days to minutes.
  • MTTR: Time from SLO breach detection to flag disable. With a real kill switch and automation, target <15 minutes. Under 5 is achievable.

Make these visible in your release dashboard. Tie bonuses to CFR going down and MTTR staying green. When leadership sees lead time shrink without incident volume going up, you’ll get budget for the boring work that keeps you safe.
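As a concrete sketch, the per-flip CFR and MTTR numbers above can be computed from flip and incident records. The record shapes here are hypothetical; adapt them to whatever your release dashboard actually stores.

```typescript
// Hypothetical record shapes -- adapt to your release dashboard's schema.
interface FlagFlip {
  flagKey: string;
  flippedAt: number;   // epoch ms
  rolledBack: boolean; // rollback or SLO violation within the defined window
}

interface Incident {
  flagKey: string;
  detectedAt: number;     // SLO breach detected (epoch ms)
  flagDisabledAt: number; // flag turned off (epoch ms)
}

// Change failure rate: share of flips that needed a rollback.
function changeFailureRate(flips: FlagFlip[]): number {
  if (flips.length === 0) return 0;
  const failures = flips.filter((f) => f.rolledBack).length;
  return failures / flips.length;
}

// MTTR in minutes: mean time from breach detection to flag disable.
function mttrMinutes(incidents: Incident[]): number {
  if (incidents.length === 0) return 0;
  const totalMs = incidents.reduce(
    (sum, i) => sum + (i.flagDisabledAt - i.detectedAt), 0);
  return totalMs / incidents.length / 60000;
}
```

Run this over a trailing 30- or 90-day window, grouped by flag and team, and the trends become impossible to argue with.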

The design: Flags as a first-class release system

The happy path uses existing tools but adds discipline.

  • Vendor-agnostic SDKs: Use OpenFeature in apps/services. Swap providers without rewriting call sites.
  • Policy as code: Manage flag definitions and rollout rules in Git (Terraform for LaunchDarkly; CRDs for Unleash/Flipt). Use CODEOWNERS for approvals.
  • Kill switches everywhere: Every flag gets an immediate off-ramp. Validate in prod.
  • Progressive delivery: Target rings (staff → beta → slices by geo/plan) and percentages with stickiness (user.id).
  • Observability hooks: Emit feature_evaluated events and tag downstream metrics/traces with feature and variation.
  • SLO guardrails: Auto-disable when error rate/latency/regression crosses thresholds.
  • Expiry + ownership: Tag flags with owner, ticket, expiresAfter. CI blocks merges if expired flags aren’t removed.

Example YAML for Unleash/Flipt-style GitOps:

# flags/payments.new_checkout.yaml
apiVersion: flags.gitops/v1
kind: FeatureFlag
metadata:
  name: payments.new_checkout
  labels:
    owner: payments-team
    ticket: PLAT-2312
    expiresAfter: 2025-12-31
spec:
  type: boolean
  state: off
  rules:
    - name: kill-switch
      match: { any: true }
      action: off
    - name: staff-only
      match:
        attributes:
          user.role: ["staff", "qa"]
      action: on
    - name: canary-usca
      action: percentage
      percentage: 5
      stickiness: user.id
      constraints:
        attributes:
          country: ["US", "CA"]

Key gotchas I’ve seen:

  • Don’t evaluate flags over slow network calls on the hot path. SDKs should cache locally with short pollingInterval and strict timeouts.
  • Mobile/desktop need offline evaluation and a safe default; treat config updates like content delivery with ETags and TTLs.
  • Never ship PII in targeting contexts; hash user IDs or use bucketing keys.
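On that last point, a minimal sketch of a hashed bucketing key, using Node's crypto module. The salt name and value here are hypothetical; keep the real salt in config, not source.

```typescript
import { createHash } from 'node:crypto';

// Derive a stable, non-reversible bucketing key from a user ID so raw
// identifiers never reach the flag vendor. A per-app salt (assumed here
// as BUCKETING_SALT) prevents cross-service correlation of hashes.
const BUCKETING_SALT = 'checkout-v1'; // hypothetical; load from config

function bucketingKey(userId: string): string {
  return createHash('sha256')
    .update(`${BUCKETING_SALT}:${userId}`)
    .digest('hex')
    .slice(0, 16); // 64 bits is plenty for percentage bucketing
}
```

Pass `bucketingKey(user.id)` as the targeting key instead of the raw ID; stickiness still works because the hash is deterministic.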

Implementation details that survive on-call

Use OpenFeature so your app code doesn’t care which vendor you picked this quarter.

// frontend/src/feature.ts
import { OpenFeature } from '@openfeature/js-sdk';
import { LaunchDarklyProvider } from '@openfeature/launchdarkly-provider';

OpenFeature.setProvider(new LaunchDarklyProvider({
  pollingInterval: 30,  // seconds
  flushInterval: 2,     // seconds
  timeout: 1000         // ms for init/eval
}));

const client = OpenFeature.getClient('checkout');

export async function renderCheckout(user: { id: string; country: string; plan: string }) {
  // OpenFeature evaluation contexts are flat: targetingKey plus top-level attributes.
  const context = { targetingKey: user.id, country: user.country, plan: user.plan };
  const enabled = await client.getBooleanValue('payments.new_checkout', false, context);
  return enabled ? renderNewCheckout() : renderOldCheckout();
}

Server-side Go with telemetry tags:

// checkout/handler.go
var flagClient = openfeature.NewClient("checkout")

func (h *Handler) Handle(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    user := getUser(r)
    evalCtx := openfeature.NewEvaluationContext(user.ID, map[string]interface{}{
        "plan": user.Plan, "country": user.Country,
    })

    enabled, _ := flagClient.BooleanValue(ctx, "payments.new_checkout", false, evalCtx)
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(
        attribute.String("feature", "payments.new_checkout"),
        attribute.Bool("enabled", enabled),
    )
    r = r.WithContext(contextWithFeature(ctx, "payments.new_checkout", enabled))

    if enabled {
        h.newFlow(w, r)
        return
    }
    h.oldFlow(w, r)
}

Manage flags as code. Example with Terraform + LaunchDarkly:

# infra/flags.tf
provider "launchdarkly" {
  access_token = var.ld_token
}

resource "launchdarkly_feature_flag" "new_checkout" {
  project_key   = "prod"
  key           = "payments.new_checkout"
  name          = "New checkout"
  description   = "Gradually roll out the new checkout"
  tags          = ["owner:payments", "expires:2025-12-31"]
  variation_type = "boolean"

  variations {
    value       = true
    name        = "On"
    description = "Enable new checkout"
  }

  variations {
    value       = false
    name        = "Off"
    description = "Disable new checkout"
  }
}

Observability hooks that pay for themselves:

  • Add feature and variation labels to request metrics/traces.
  • Emit an event on flag change; correlate with incidents.
  • Slice SLIs by feature: latency{feature="payments.new_checkout",enabled="true"}.
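One way to get those feature_evaluated events without touching every call site is a thin wrapper around evaluation. This is a sketch; the `EventSink` type and `withEvaluationEvent` helper are names I'm inventing here, not a vendor API.

```typescript
// Shape of the event we want in the analytics/incident-correlation pipeline.
interface FeatureEvaluatedEvent {
  event: 'feature_evaluated';
  flagKey: string;
  variation: string;
  targetingKey: string;
  evaluatedAt: number; // epoch ms
}

type EventSink = (e: FeatureEvaluatedEvent) => void;

// Wrap any evaluation call so every decision also emits an event.
function withEvaluationEvent<T>(
  sink: EventSink,
  flagKey: string,
  targetingKey: string,
  evaluate: () => T,
): T {
  const value = evaluate();
  sink({
    event: 'feature_evaluated',
    flagKey,
    variation: String(value),
    targetingKey,
    evaluatedAt: Date.now(),
  });
  return value;
}
```

If your SDK supports hooks (OpenFeature does), register the same logic there once instead of wrapping each call.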

Prometheus guardrail query example (2% 5xx over 5m for enabled traffic):

(
  increase(http_requests_total{app="checkout",feature="payments.new_checkout",enabled="true",status=~"5.."}[5m])
/
  increase(http_requests_total{app="checkout",feature="payments.new_checkout",enabled="true"}[5m])
) > 0.02

Auto-disable via API when threshold trips:

#!/usr/bin/env bash
set -euo pipefail
FLAG_KEY="payments.new_checkout"

# LaunchDarkly API tokens go in the Authorization header directly (no "Bearer"),
# and flag updates use semantic patch instructions scoped to an environment.
curl -sS -X PATCH \
  -H "Authorization: $LD_TOKEN" \
  -H "Content-Type: application/json; domain-model=launchdarkly.semanticpatch" \
  "https://app.launchdarkly.com/api/v2/flags/default/$FLAG_KEY" \
  -d '{"environmentKey": "production", "comment": "Auto-disable via SLO guardrail", "instructions": [{"kind": "turnFlagOff"}]}'

I’ve seen this cut MTTR from ~90 minutes to under 10 without waking a human.

Progressive rollout that won’t page you

Stop “YOLO to 50%.” Use rings and stickiness.

  1. Staff-only in prod: validate the kill switch, telemetry, and privacy handling.
  2. 1% canary across low-risk segments (e.g., country in [US, CA], plan = free).
  3. 5% → 25% → 50% based on SLO deltas vs. control. Bake 30–60 minutes per step or a minimum N requests.
  4. 100% only when error budget impact is negligible for 24 hours.
  5. Remove the flag (or convert to a permanent config) within 2 sprints.

Note: for high-risk features (billing, auth), use Argo Rollouts or Flagger with webhooks that consult SLOs before promoting. Flags + rollouts play well together; flags gate behavior while rollouts gate binaries.
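The promotion decision at each step can be mechanical rather than a judgment call. A sketch of that gate, assuming you can query enabled-vs-control SLO samples (the type names and thresholds here are illustrative, mirroring the rollback conditions in the checklist below):

```typescript
type Decision = 'promote' | 'hold' | 'rollback';

interface SloSample {
  errorRate: number;    // e.g. 0.012 = 1.2% 5xx
  p95LatencyMs: number;
  requests: number;     // sample size for this bake window
}

function promotionDecision(
  enabled: SloSample,
  control: SloSample,
  minRequests = 1000,
): Decision {
  // Not enough traffic yet: keep baking rather than promote on noise.
  if (enabled.requests < minRequests) return 'hold';
  // Absolute 5xx ceiling for enabled traffic.
  if (enabled.errorRate > 0.02) return 'rollback';
  // Relative regression vs control: >10% p95 increase trips rollback.
  if (enabled.p95LatencyMs > control.p95LatencyMs * 1.1) return 'rollback';
  return 'promote';
}
```

Wire this into whatever runs your rollout steps (a cron, Argo Rollouts analysis webhook, or a bot) so "bake 30–60 minutes" is enforced by code, not memory.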

Targeting rule example with stickiness:

spec:
  rules:
    - name: beta-cohort
      action: percentage
      percentage: 10
      stickiness: user.id
      constraints:
        attributes:
          plan: ["beta", "staff"]

Stickiness ensures the same user’s experience doesn’t flap between sessions—one of those subtle UX killers I’ve seen tank conversion during experiments.
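Under the hood, stickiness is just deterministic hashing. A minimal sketch of how most SDKs bucket (the exact hash and bucket count vary by vendor; this is illustrative, not any vendor's algorithm):

```typescript
import { createHash } from 'node:crypto';

// Sticky percentage rollout: hash(flagKey + targetingKey) into a bucket
// 0..9999 and compare against percentage * 100. The same user + flag always
// lands in the same bucket, so raising the percentage only ever adds users --
// nobody who had the feature loses it mid-rollout.
function inRollout(flagKey: string, targetingKey: string, percentage: number): boolean {
  const digest = createHash('sha256')
    .update(`${flagKey}:${targetingKey}`)
    .digest();
  const bucket = digest.readUInt32BE(0) % 10000;
  return bucket < percentage * 100;
}
```

Including the flag key in the hash matters: it decorrelates rollouts, so the same 5% of users aren't the guinea pigs for every experiment.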

Repeatable checklists your org will actually use

This is the boring part that saves quarters.

Preflight (before any user sees it)

  • Confirm flag defined in Git with owner, ticket, expiresAfter.
  • Implement kill switch and test in staging and prod (non-customer path).
  • Add metrics/traces tags: feature, enabled, variation.
  • Define SLO guardrails and rollback conditions in a ticket:
    • Latency p95 increase > 10% over control
    • 5xx rate > 2% for enabled traffic
    • Conversion drop > 3% (if measurable in near-real time)
  • Dry-run the disable script/automation.

Rollout steps

  1. Merge targeting rule PR; CI validates schema and expiry.
  2. Deploy provider config; ensure SDK init time < 1s.
  3. Flip to staff; verify dashboards and logs for tagged traffic.
  4. Progressively increase: 1% → 5% → 25% → 50% with bake times.
  5. Hold 100% for 24h while monitoring SLOs and business KPIs.

Rollback (MTTR < 15m target)

  • Disable flag via API or UI; confirm enabled=false telemetry within 1 minute.
  • Page on-call only if SLOs don’t recover within 10 minutes.
  • Create a post-incident note with flag_key, change ID, and metrics deltas.

Cleanup

  • Convert success to permanent code path. Remove the old path.
  • Delete the flag and targeting config in Git.
  • CI job fails builds when expiresAfter < now().
  • Weekly chore: list flags by expiry and owner; nudge in Slack.

Example CI check (shell, assuming mikefarah's yq to read the YAML):

for f in flags/*.yaml; do
  expiry=$(yq '.metadata.labels.expiresAfter' "$f")
  # ISO dates (YYYY-MM-DD) compare correctly as strings
  if [[ "$expiry" < "$(date +%Y-%m-%d)" ]]; then
    echo "Expired flag: $f" >&2
    exit 1
  fi
done

We ship a simple flag-linter for clients; ping us if you want it.

Results we’ve seen and what we’d do differently

At a fintech with LaunchDarkly + OpenFeature + Terraform, we:

  • Dropped CFR for flag flips from 22% to 9% in 90 days by enforcing preflight and SLO guardrails.
  • Cut MTTR from ~2h to 12m (p50) with API auto-disable tied to Prometheus alerts.
  • Reduced lead time from PR-to-first-user from ~3 days to 30 minutes by treating targeting rules as code and pre-approving rollout runbooks.
  • Eliminated ~140 stale flags in 6 weeks using expiry tags and a weekly cleanup ritual.

What I’d do sooner next time:

  • Push OpenTelemetry propagation everywhere before flags—you want clean feature slices in traces.
  • Standardize “flag categories”: experiment, kill switch, ops toggle. Different defaults and expiry rules.
  • Add a dedicated “flag debt” metric to the platform dashboard and review it in ops review.

If you’re still gating mission-critical behavior via environment variables, you’ll keep paying for that choice on every incident. Move targeting into a system—then make the system observable, testable, and governed.

For more detail on how we wire this up, see our write-up: GitPlumbers Release Engineering and the case study where a flag system saved Black Friday: Case Study: Flags That Saved Checkout.

Key takeaways

  • Design flags as first-class, observable release controls—not if-statements.
  • Use OpenFeature SDKs to avoid lock-in; manage flags via Git/Terraform for auditability.
  • Tie rollouts to SLOs and automate rollback using metrics, not vibes.
  • Track change failure rate, lead time, and MTTR for flag flips explicitly.
  • Standardize runbooks and checklists; make them repeatable and boring.
  • Tag every flag with owner, expiry, and ticket; clean them up like you mean it.

Implementation checklist

  • Create a kill switch for every flag and test it in prod (on/off in <60s).
  • Use OpenFeature across services; configure SDK timeouts, caching, and fallbacks.
  • Manage targeting rules as code (Terraform/GitOps) with CODEOWNERS approval.
  • Instrument flag evaluation and impact: events, traces, and per-flag SLI slices.
  • Roll out progressively: staff → 1% → 5% → 25% → 50% → 100%; gate on SLOs.
  • Define failure and auto-rollback conditions before you flip.
  • Expire flags with metadata and CI checks; remove dead flags weekly.

Questions we hear from teams

Open-source or vendor?
If you want fast adoption with strong audit and targeting, vendors like LaunchDarkly or Split work well. If you want Git-first and self-hosted, Unleash or Flipt are solid. Use OpenFeature to keep your options open.
How many flags are too many?
The wrong number is the number you can’t clean up. Budget time to delete. Tag every flag with owner and expiry, and enforce removal in CI. A healthy org can carry dozens of active flags if they’re visible and expiring.
Client-side or server-side evaluation?
Server-side by default for risky features. Client-side for UX experiments with strict caching and safe defaults. Mobile needs offline bundles and long TTLs. Never send PII in contexts; use a hashed key.
Can flags replace staged rollouts?
No. Use flags to gate behavior and feature exposure. Use canary/blue-green for binary changes. Together they’re safer than either alone.
How do we connect flags to business outcomes?
Tag events, traces, and metrics with `feature` and `variation`. Maintain a control cohort. Your experimentation platform (or a simple warehouse query) can compute deltas. Track CFR and MTTR per flip in your release dashboard.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Talk to a GitPlumbers release engineer
  • Download the Feature Flag Runbook Template
