Progressive Delivery With a Spine: Feature Flags, Canaries, and Blue/Green With Real Governance

Stop betting the business on a Friday night deploy. Stand up progressive delivery that moves DORA metrics in the right direction—and survives audits.

Ship more, page less. Progressive delivery with policy beats heroics every day ending in y.

The release that didn’t page us at midnight

I’ll never forget the quarter a B2B SaaS we worked with finally stopped gambling. Before GitPlumbers came in, every deploy was Russian roulette: weekly batch releases, 15% change failure rate, and a two-hour MTTR if you were lucky and your best on-call wasn’t on a plane. We put in feature flags, canaries, and a boring blue/green path—and added governance that didn’t rely on calendar approvals. Three months later: 3% CFR, lead time down from days to hours, MTTR under 20 minutes. Nobody got a hero cookie; they just shipped safely and slept.

Progressive delivery without governance is chaos. Governance without progressive delivery is bureaucracy. You need both.

The metrics that matter (and how progressive delivery moves them)

If you’re not measuring, you’re just adding tech you’ll rip out next reorg. Align everything to three of the four DORA metrics (deployment frequency tends to follow once these improve):

  • Change failure rate (CFR): failed changes / total changes. Target <5%.

  • Lead time: first commit to running in production. Target hours, not days.

  • Recovery time (MTTR): time to restore service. Target <30 minutes.

Progressive delivery moves these by:

  • Feature flags: decouple deploy from release. Smaller, reversible changes drop CFR and lead time. Kill switches slash MTTR.

  • Canary/blue-green: incremental exposure with automated rollback on SLO burn. CFR goes down; MTTR becomes a button, not a bridge call.

  • Governance: policy-as-code and audit replace committee meetings. Less waiting, more control.

Reference architecture that won’t rot under audit

Here’s a stack we’ve stood up at fintech, healthtech, and adtech—works on EKS/GKE/AKS and even on-prem if you’ve got a pulse:

  • Flags: OpenFeature SDK + provider (LaunchDarkly or Unleash).

  • Delivery: Argo Rollouts for canary/blue-green; ArgoCD for GitOps.

  • Traffic: Istio or NGINX Ingress with stable/canary services.

  • Observability: Prometheus + Grafana (or Datadog/New Relic) with SLOs; OpenTelemetry tracing.

  • Policy: OPA Gatekeeper or Kyverno to enforce rollout/analysis and signed images; cosign for signature verification.

  • CI: GitHub Actions/GitLab CI/Buildkite; merge queue and protected branches.

  • Data layer: migrations via expand/migrate/contract with flags guarding code paths.

  • Audit: CODEOWNERS, required reviews, and GitOps audit trail. No shadow changes.

Flow: dev merges to main → CI builds, signs the image, and attaches an SBOM → GitOps PR updates desired state → policy checks block bad configs → Argo Rollouts canaries with SLO-driven analysis → automatic rollback or promotion → flags gate user exposure.
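That flow can be sketched in CI. This GitHub Actions excerpt is illustrative only—the image name, registry, and SBOM tooling (syft) are assumptions, not prescriptions:

```yaml
# Illustrative excerpt: build, push, sign, and attest before the GitOps PR
jobs:
  build-sign:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # keyless cosign signing via OIDC
      packages: write
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ghcr.io/acme/checkout:${{ github.sha }} .
      - run: docker push ghcr.io/acme/checkout:${{ github.sha }}
      - run: cosign sign --yes ghcr.io/acme/checkout:${{ github.sha }}
      - run: syft ghcr.io/acme/checkout:${{ github.sha }} -o spdx-json > sbom.json
      - run: cosign attest --yes --predicate sbom.json --type spdxjson ghcr.io/acme/checkout:${{ github.sha }}
```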

Feature flags that don’t become technical debt

Flags are power tools. Used well, they decouple deploy from release. Used badly, they become an archaeological dig of dead code and security risk.

  • Implement with OpenFeature: avoids SDK lock-in. Example in Node.js:
import { OpenFeature } from '@openfeature/js-sdk';

const client = OpenFeature.getClient('checkout');

// Evaluate inside an async path; the second argument is the safe default
// returned if the flag provider is unreachable.
async function applyDiscounts(userId, plan) {
  const enabled = await client.getBooleanValue('discounts-v2', false, { userId, plan });
  return enabled ? runNewFlow() : runOldFlow();
}
  • Non-negotiables:

    • Every flag has an owner, expiry date, and Jira link in description.

    • Default safe value set; include a kill switch flag per critical feature.

    • Targeting is auditable (segments by tenantId, region, beta tag).

    • PII never embedded in flag keys; pass only hashed or non-PII context.

  • Governance: keep a flags/registry.yaml and enforce via CI:

- key: discounts-v2
  owner: payments-team
  expires: 2025-02-01
  ticket: PAY-1427
  type: release
  killSwitch: true
  • Policy check (Conftest/OPA) blocking missing metadata:
package flags

import future.keywords.in

violation[msg] {
  some f in input.flags
  f.owner == ""
  msg := sprintf("flag %s missing owner", [f.key])
}
  • Lifecycle: plan removal at creation. We require flags to be deleted within two sprints of 100% rollout. CI fails if expires < today.
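The expiry gate is a few lines of CI scripting. Here is a hypothetical Node.js sketch—in practice you would parse flags/registry.yaml, but the entries are inlined to keep it self-contained:

```javascript
// Minimal sketch of the CI gate over flags/registry.yaml (entries inlined;
// flag names are illustrative).
function validateFlags(flags, today = new Date()) {
  const errors = [];
  for (const f of flags) {
    if (!f.owner) errors.push(`flag ${f.key} missing owner`);
    if (!f.ticket) errors.push(`flag ${f.key} missing ticket link`);
    if (!f.expires || new Date(f.expires) < today) {
      errors.push(`flag ${f.key} expired or missing expiry`);
    }
  }
  return errors; // non-empty → fail the CI job
}

// A stale flag trips the gate:
const errors = validateFlags(
  [{ key: 'old-checkout', owner: 'payments-team', expires: '2020-01-01', ticket: 'PAY-901' }],
  new Date('2024-06-01')
);
console.log(errors);
```

Wire `process.exitCode = 1` on a non-empty result and the expired flag blocks the merge, not a quarterly cleanup ticket.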

Canary and blue/green with Argo Rollouts that your SRE will trust

If your canary isn’t wired to real SLOs, it’s theater. Use Argo Rollouts with AnalysisTemplates hooked to Prometheus (or Datadog).

  • Canary Rollout (excerpt):
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      steps:
      - setWeight: 5
      - pause: { duration: 120 }
      - analysis:
          templates:
          - templateName: http-5xx
          - templateName: latency-p95
      - setWeight: 25
      - pause: { duration: 300 }
      - setWeight: 50
      - pause: { duration: 600 }
      - setWeight: 100
  # ... selector & template omitted
  • AnalysisTemplate with Prometheus:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-5xx
spec:
  metrics:
  - name: 5xx
    interval: 30s
    failureLimit: 1
    successCondition: result[0] < 1
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(rate(http_requests_total{service="checkout",status=~"5.."}[1m]))
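The latency-p95 template referenced above follows the same shape. A sketch—the 500 ms threshold is illustrative; set it from your SLO:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p95
spec:
  metrics:
  - name: p95
    interval: 30s
    failureLimit: 1
    # 0.5s is an illustrative threshold, not a recommendation
    successCondition: result[0] < 0.5
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc.cluster.local:9090
        query: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[1m])) by (le))
```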
  • Blue/Green when you need the big switch:
strategy:
  blueGreen:
    activeService: checkout-stable
    previewService: checkout-preview
    autoPromotionEnabled: false # human gate
  • Governance gates:

    • Require at least one analysis step. Policy blocks rollouts without it.

    • Error budget aware: abort if burn rate > 2x over last hour.

    • abortScaleDownDelaySeconds to keep canary pods warm for fast rollback.

    • Manual promotion in office hours only; blocked by calendar freeze during peak events.

Governance that scales (without a CAB meeting on every deploy)

The trick is to shift approvals left into code and automate the boring parts. What’s worked repeatedly:

  • Policy-as-code:

    • Kyverno/Gatekeeper rules to require: labeled owners on Rollouts, AnalysisTemplates present, signed images (cosign), SBOM attached (attestations).

    • Example Kyverno snippet:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-rollout-analysis
spec:
  validationFailureAction: Enforce # Audit is the default; Enforce actually blocks
  rules:
  - name: enforce-analysis
    match:
      resources:
        kinds: ["Rollout"]
    validate:
      message: "Rollouts must define analysis templates."
      pattern:
        spec:
          strategy:
            canary:
              analysis:
                templates:
                - templateName: "*"
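A companion policy can enforce the cosign requirement at admission. This is a sketch—the registry pattern and public key are placeholders you'd replace with your own:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
  - name: verify-cosign-signature
    match:
      any:
      - resources:
          kinds: ["Pod"]
    verifyImages:
    - imageReferences: ["ghcr.io/acme/*"]   # placeholder registry pattern
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              <your cosign public key>
              -----END PUBLIC KEY-----
```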
  • Git hygiene:

    • Protected branches, merge queue, status checks for policy and tests.

    • CODEOWNERS requiring sre-team approval for k8s/rollouts/** and flags/**.

    • Signed commits and container images; block unsigned at admission.

  • GitOps: ArgoCD keeps prod declarative; every change shows up as a PR with history. No “kubectl oops” on prod.

  • Audit: pipeline posts deployment summaries to Slack, tags incidents with change IDs, and links to the exact PR and Argo Rollout. Auditors love clickable trails.

Operationalizing DORA: instruments, not vibes

Don’t eyeball this—instrument it. Our go-to is the Four Keys pipeline wired into your stack.

  • Lead time: timestamp first commit in PR → timestamp of production Rollout reaching 100%.

  • CFR: number of rollbacks or hotfixes tagged to a change / total changes. We label incidents in PagerDuty with change_id.

  • MTTR: incident opened → service SLO restored. Pull from StatusPage/PagerDuty.

  • How to wire quickly:

    1. Emit a deployment event from CI with change_id, git SHA, service, environment.

    2. Emit a rollout_promoted event from Argo Rollouts via webhook or Argo Event.

    3. Store events in a warehouse (BigQuery/Postgres). Use Looker/Grafana to chart DORA.

    4. Alert if CFR over 7-day window > target, or lead time regresses >2x baseline.

  • SLO-aware rollouts: connect AnalysisTemplates to SLO burn rate queries (e.g., 2x/14x multi-window). If burn breaches, auto-abort and flip kill switch.
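Once those events land in the warehouse, the metrics reduce to simple arithmetic. A hypothetical Node.js sketch—the event shapes are illustrative, not a standard schema:

```javascript
// Compute the three DORA metrics from deployment and incident events
// (shapes are illustrative; in practice these rows come from the warehouse).
const deploys = [
  { changeId: 'c1', firstCommitAt: '2024-05-01T09:00:00Z', promotedAt: '2024-05-01T13:00:00Z', outcome: 'success' },
  { changeId: 'c2', firstCommitAt: '2024-05-02T10:00:00Z', promotedAt: '2024-05-02T12:00:00Z', outcome: 'rollback' },
];
const incidents = [
  { changeId: 'c2', openedAt: '2024-05-02T12:05:00Z', restoredAt: '2024-05-02T12:20:00Z' },
];

const hoursBetween = (a, b) => (new Date(b) - new Date(a)) / 36e5;

// Lead time: first commit → rollout at 100%, averaged per change
const leadTimeHours =
  deploys.reduce((sum, d) => sum + hoursBetween(d.firstCommitAt, d.promotedAt), 0) / deploys.length;
// CFR: rollbacks or hotfixes / total changes
const cfr = deploys.filter((d) => d.outcome !== 'success').length / deploys.length;
// MTTR: incident opened → SLO restored
const mttrMinutes =
  incidents.reduce((sum, i) => sum + hoursBetween(i.openedAt, i.restoredAt) * 60, 0) / incidents.length;

console.log({ leadTimeHours, cfr, mttrMinutes }); // { leadTimeHours: 3, cfr: 0.5, mttrMinutes: 15 }
```

The same query runs in SQL over the warehouse tables; the point is that every number traces back to a change_id, not a spreadsheet.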

Checklists you can actually copy

Print these. Make them boring. Boring scales.

  • Org-level setup

    • Pick OpenFeature + provider, Argo Rollouts, ArgoCD, Prometheus/Datadog.

    • Enable protected branches, merge queue, and signed images with cosign.

    • Install Kyverno/Gatekeeper policies for rollouts, signatures, owners, and flag metadata.

    • Baseline DORA; agree targets with product and SRE; publish dashboards.

  • Service onboarding

    1. Add Rollout manifests and services (stable, canary).

    2. Add AnalysisTemplates for 5xx, latency p95, and error budget burn.

    3. Integrate OpenFeature SDK and register flags in flags/registry.yaml.

    4. Add synthetic checks per critical path; tag metrics with change_id.

  • Release checklist (per change)

    • DB changes follow expand/migrate/contract; guarded by flag.

    • Canary weights: 5% → 25% → 50% with analysis at each step.

    • Manual promotion during office hours; PagerDuty on-call aware.

    • Rollback plan documented: kubectl argo rollouts abort checkout + kill switch flag.

  • Post-deploy

    • Delete or 100%-on flags within 2 sprints; remove dead code paths.

    • Record change outcome in the PR (success/rollback).

    • Review DORA deltas weekly. If CFR creeps up, tune analysis or test coverage.

Results you can expect (and what breaks)

What we’ve seen after 6–12 weeks in orgs from 20 to 300 engineers:

  • CFR: 10–20% → 2–5%.

  • Lead time: days → hours (often same-morning).

  • MTTR: hours → <20 minutes with automated rollback + kill switches.

  • Release frequency: weekly batch → daily or on-merge.

And the things that still break if you ignore them:

  • DB migrations: flags don’t fix unsafe schema changes. Use expand/contract and backfills.

  • Observability gaps: if you measure the wrong thing (like pod restarts only), your canary lies.

  • Flag debt: expired flags rot. Enforce TTL in CI and delete aggressively.

  • HPA thrash: mis-tuned autoscaling can mask canary signals. Pin or account for it during canary.

If this sounds like the spine your delivery needs, we’ve done this dance before. GitPlumbers can stand up the stack, wire the policies, and leave you with dashboards that survive audits and reorgs. No silver bullets—just fewer pages and faster, safer ships.


Key takeaways

  • Progressive delivery only works when tied to DORA metrics: change failure rate, lead time, and recovery time.
  • Governance is a feature: bake policy, audit, and approval into the pipeline, not into calendar meetings.
  • Use OpenFeature with a flags provider plus Argo Rollouts for canary/blue-green, wired to Prometheus and SLOs.
  • Standardize checklists that scale across teams—service onboarding, release, and post-deploy cleanup.
  • Automate rollback on SLO burn, and require flag TTL/owners to avoid flag debt.

Implementation checklist

  • Define DORA baselines (CFR, lead time, MTTR) and agree on targets with product/SRE.
  • Adopt OpenFeature + a flag provider (LaunchDarkly/Unleash) with flag TTL, owner, and kill switch.
  • Deploy Argo Rollouts with canary or blue/green, wired to Prometheus AnalysisTemplates and manual gates.
  • Enforce policy-as-code (OPA/Kyverno) for rollout presence, analysis, and cosign verification.
  • Instrument pipelines to compute DORA and tag deploys; alert on error budget burn during canary.
  • Run documented release and post-deploy checklists; close or delete stale flags within 2 sprints.

Questions we hear from teams

Which feature flag provider should we choose?
Use OpenFeature to avoid lock-in. If you need enterprise targeting and audit now, LaunchDarkly is solid. If you want open-source and can operate it, Unleash works well. The key is flag metadata (owner/TTL/ticket), kill switches, and SDK availability. Governance matters more than the vendor.
How do we handle database changes with canaries?
Use expand/migrate/contract. Add columns/tables in a backward-compatible way, deploy the new code path behind a flag, dual-write if needed, backfill, then flip reads via the flag. Only drop old columns once the flag is 100% and logs show no access.
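A sketch of that dual-write step in Node.js—the store interfaces and the migration flag name are hypothetical:

```javascript
// Expand phase: old schema stays authoritative; a migration flag gates the
// dual write to the new, backward-compatible table.
async function saveDiscount({ oldStore, newStore, flags }, order) {
  await oldStore.save(order); // always write the old, authoritative schema
  const dualWrite = await flags.getBooleanValue('discounts-v2-dual-write', false, {
    tenantId: order.tenantId,
  });
  if (dualWrite) {
    await newStore.save(order); // new table; backfill covers historical rows
  }
  return dualWrite;
}
```

Flip reads with a separate flag once the backfill is verified, then contract.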
What if we don’t have Istio?
Argo Rollouts works with NGINX Ingress and service meshes. Start with NGINX stable/canary services and traffic splits. You can add mesh later if you need per-request routing or mTLS.
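For reference, the knob Argo Rollouts drives under the hood is the NGINX canary annotation set. A sketch with an illustrative host and a hardcoded 5% weight—in practice the controller manages the weight for you:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"  # Argo Rollouts updates this
spec:
  rules:
  - host: checkout.example.com   # illustrative host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: checkout-canary
            port:
              number: 80
```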
How fast can we see DORA improvements?
Teams typically see CFR and MTTR drop within 2–4 weeks once canaries and kill switches are enforced. Lead time improves as soon as you stop batching—often within the first sprint.

