The CI Flake Diet: 10‑Minute Pipelines, Lower CFR, Faster Recovery

Your CI isn’t slow—it’s noisy. Kill flake, shrink pipeline time, and move your DORA metrics in the right direction without hiring an army.


The symptom: red builds you can’t reproduce

If your Slack lights up with “CI failed, works on my machine,” I’ve been there. At one unicorn-scale marketplace (150+ devs, GitHub Actions + Buildkite), PRs took 30–40 minutes to return a red or green signal. The merge queue froze daily because e2e flaked on login. CFR hovered around 22%. MTTR was measured in hours because reverting was manual and scary.

We fixed it by treating CI like production. The goal wasn’t a sexy pipeline poster. It was better DORA: lower change failure rate, shorter lead time, and faster recovery time. The side effect? First-signal in under 10 minutes, and the merge queue stopped feeling like Friday deploys in 2012.

Make the metrics the boss

Your pipeline serves three business outcomes. Measure them, weekly, from the pipeline—not anecdotes.

  • Change Failure Rate (CFR): Percentage of deploys causing a hotfix/rollback/flag-off within 24 hours.
  • Lead Time for Changes: Commit to production (or to customer impact if flags); aim for hours, not days.
  • MTTR: From detection to restoration (rollback or flag flip).

Practical instrumentation:

  • Emit JUnit/Allure results and pipeline timings; store in a warehouse (BigQuery/Snowflake) or push to Prometheus.
  • Label post-merge failures that trigger rollback or feature-flag kill as failures for CFR.
  • Track flake rate = “tests failing then passing without code changes ÷ total runs.”
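
That flake-rate definition is easy to compute from stored run history. A minimal Python sketch, assuming a flat, chronologically ordered list of run records (the field names are illustrative):

```python
def flake_rate(runs):
    """runs: chronological list of dicts with 'test', 'sha', 'outcome'.
    A fail followed by a pass for the same test on the same commit
    (i.e. no code change in between) counts as one flake."""
    last = {}    # (test, sha) -> most recent outcome
    flakes = 0
    for r in runs:
        key = (r["test"], r["sha"])
        if last.get(key) == "fail" and r["outcome"] == "pass":
            flakes += 1
        last[key] = r["outcome"]
    return flakes / len(runs) if runs else 0.0

# illustrative history: one flake on test_login at commit abc
history = [
    {"test": "test_login", "sha": "abc", "outcome": "fail"},
    {"test": "test_login", "sha": "abc", "outcome": "pass"},  # flake
    {"test": "test_cart",  "sha": "abc", "outcome": "pass"},
    {"test": "test_cart",  "sha": "def", "outcome": "pass"},
]
rate = flake_rate(history)  # 1 flake / 4 runs = 0.25
```

Feed it from your warehouse query or parsed JUnit XML; the per-test version (group by `test` before dividing) is what drives the quarantine list.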

Quick-and-dirty extraction from GitHub with gh and jq:

# Rough MTTR proxy: p50/p95 duration (seconds) of failed runs
# from the last 200 deploy workflow executions
gh run list --workflow deploy.yml --limit 200 --json databaseId,startedAt,updatedAt,conclusion \
  | jq '[.[] | select(.conclusion=="failure")
         | (.updatedAt | fromdateiso8601) - (.startedAt | fromdateiso8601)]
        | sort
        | {mttr_p50_s: .[length/2|floor], mttr_p95_s: .[(length*0.95)|floor]}'

Set budgets and review weekly:

  • p95 PR first-signal: <= 10 minutes
  • Main branch full suite: <= 25 minutes
  • Flake rate: < 1% sustained
  • CFR: trending down; if it rises, slow merges until stability returns
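
These budgets are only real if something fails when they're blown. A sketch of a budget gate you could run weekly, assuming you already export first-signal durations in seconds (the numbers below are illustrative):

```python
def p95(samples):
    """Nearest-rank p95 of a list of durations in seconds."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * 0.95))
    return ordered[idx]

BUDGET_FIRST_SIGNAL_S = 10 * 60  # the 10-minute budget above

def within_budget(first_signal_durations):
    """True when p95 first-signal time is inside the budget."""
    return p95(first_signal_durations) <= BUDGET_FIRST_SIGNAL_S

# last week's PR first-signal times, seconds (illustrative)
timings = [420, 480, 510, 530, 560, 590, 610, 640, 700, 1200]
```

Wire the same shape of check to flake rate and CFR, and have the weekly review start from whichever gate is red.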

Kill flake at the source (not with blind retries)

Retries hide rot. Quarantine and fix. The fastest way to move CFR is to stop blocking the world on nondeterminism.

What actually works:

  • Quarantine flow

    1. Fail a test twice without code change? Mark quarantine, auto-exclude from blocking, open a ticket, assign an owner, and set a 14‑day expiry.
    2. Run quarantined tests in a non-blocking job and publish a separate report.
  • Determinism

    • Force UTC, fixed random seeds, and hard timeouts.
    • Kill external dependencies. Use Testcontainers or in‑process fakes; never rely on a shared dev DB.
    • Pin versions for browsers, Docker images, and language tools.
  • Resource isolation

    • Cap parallelism to CPU cores; E2E browsers get their own machine type.
    • One service per test when possible; for suites, spin ephemeral compose stacks.
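
The quarantine flow above boils down to a small piece of bookkeeping. A sketch with the ticket-filing side effects stubbed out (class and field names are illustrative, not a real tool's API):

```python
from datetime import date, timedelta

QUARANTINE_DAYS = 14  # the 14-day expiry from the flow above

class FlakeTracker:
    """Quarantines a test after two failures on the same commit,
    i.e. failures with no code change in between."""
    def __init__(self):
        self.failures = {}     # (test, sha) -> failure count
        self.quarantined = {}  # test -> expiry date

    def record_failure(self, test, sha, today=None):
        today = today or date.today()
        key = (test, sha)
        self.failures[key] = self.failures.get(key, 0) + 1
        if self.failures[key] >= 2 and test not in self.quarantined:
            # a real flow would also open a ticket and assign an owner here
            self.quarantined[test] = today + timedelta(days=QUARANTINE_DAYS)

    def is_blocking(self, test, today=None):
        """Quarantined tests stop blocking merges until their expiry passes."""
        today = today or date.today()
        expiry = self.quarantined.get(test)
        return expiry is None or today > expiry
```

The expiry is the teeth: once it passes, the test blocks merges again, so "quarantine" can't quietly become "deleted."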

Pytest setup (quarantine + determinism):

# pytest.ini
[pytest]
# -n / --dist require the pytest-xdist plugin
addopts = -q -ra -n auto --dist=loadscope --durations=20 --maxfail=1
markers =
    quarantine: flaky test quarantined; does not block merges

# conftest.py
import os, random, time

os.environ["TZ"] = "UTC"
if hasattr(time, "tzset"):  # tzset is POSIX-only; no-op on Windows
    time.tzset()            # without this, setting TZ alone has no effect

SEED = int(os.getenv("TEST_SEED", "42"))
random.seed(SEED)

try:
    import numpy as np
    np.random.seed(SEED)
except ImportError:
    pass  # numpy not installed; plain random is still seeded

GitHub Actions: run quarantined tests separately and don’t block merges:

jobs:
  unit:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-dev.txt
      - run: pytest -m 'not quarantine' --junitxml=report.xml
  quarantined:
    runs-on: ubuntu-22.04
    continue-on-error: true  # red here never blocks the merge
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements-dev.txt
      - run: pytest -m quarantine --junitxml=quarantine.xml

Architect for a 10‑minute first signal

You don’t need more runners—you need less work per PR.

  • Pre-merge vs post-merge

    • Pre-merge: lint, typecheck, unit, a tiny smoke e2e on ephemeral infra. Target < 10 minutes p95.
    • Post-merge: full e2e, fuzz, cross-browser, load. Failures here block promotion, not developer feedback.
  • Merge queue + required checks

    • Turn on GitHub’s merge_queue or Mergify/Bors. Require green pre-merge checks. No “YOLO main” pushes.
  • Path filters

    • Don’t run e2e when only docs changed. Use paths in GitHub Actions/CircleCI filters.
  • Cancel in-progress

    • New commits should cancel old pipelines. Use concurrency or Buildkite’s cancel_intermediate_builds.
  • Parallelism + sharding

    • Shard tests by file/time. Persist historical timings to split evenly.
  • Cache like you mean it

    • Language cache (npm/pnpm, pip, Gradle, Maven).
    • Docker BuildKit cache-to/from.
    • Remote build caches (Bazel, Nx, Gradle build cache).
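
"Persist historical timings to split evenly" is a classic greedy partition: assign each test, slowest first, to the currently lightest shard. A sketch, assuming you keep per-test durations between runs:

```python
def split_shards(timings, n_shards):
    """Greedy longest-processing-time partition: assign each test
    (slowest first) to the lightest shard so wall-times stay even."""
    shards = [[] for _ in range(n_shards)]
    loads = [0.0] * n_shards
    for test, secs in sorted(timings.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))  # index of the lightest shard
        shards[i].append(test)
        loads[i] += secs
    return shards, loads

# historical durations in seconds (illustrative)
timings = {"a": 120, "b": 90, "c": 60, "d": 60, "e": 30, "f": 30}
shards, loads = split_shards(timings, 2)
```

Naive splitting by file count often leaves one shard holding all the slow integration tests; splitting by time is what actually caps the critical path.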

Concrete GitHub Actions example:

name: ci
on:
  pull_request:
    paths:
      - 'src/**'
      - 'package.json'
      - '!docs/**'
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  unit:
    runs-on: ubuntu-22.04
    strategy:
      fail-fast: false
      matrix:
        shard: [1,2,3,4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci --prefer-offline --no-audit
      - name: Run shard
        run: npx jest --ci --reporters=default --reporters=jest-junit --shard=${{ matrix.shard }}/4 --maxWorkers=50%
  build_image:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: false
          tags: ghcr.io/acme/app:pr-${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

Bazel/Nx/Gradle remote caching pays off fast for monorepos:

# .bazelrc
build --remote_http_cache=https://bazel-cache.internal
build --experimental_repository_cache_hardlinks

E2E without the pain (hermetic and minimal)

E2E is where flakes breed. Keep it short, hermetic, and meaningful.

  • Run against ephemeral infra

    • Use docker compose or Testcontainers to stand up DB/queues locally per job.
    • No shared staging DB for tests—ever.
  • Stabilize browsers

    • Pin Playwright version; ship its browsers via cache. Use retries: 1 only for truly transient browser startup.
  • Test the contract, not the internet

    • Mock third parties (Stripe/SNS/etc.) at the boundary. Only one happy path per dependency in e2e.
  • Limit suite length

    • 5–10 critical flows. The rest are integration/unit with contract tests.
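
"Mock third parties at the boundary" can be as small as an in-process fake behind your own interface. A Python sketch (FakeStripe and checkout are illustrative names, not Stripe's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class FakeStripe:
    """In-process fake for the payment boundary: deterministic, no network.
    Records calls so tests assert on the contract, not HTTP details."""
    charges: list = field(default_factory=list)

    def charge(self, amount_cents, currency="usd"):
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        charge_id = f"ch_test_{len(self.charges) + 1}"
        self.charges.append(
            {"id": charge_id, "amount": amount_cents, "currency": currency}
        )
        return {"id": charge_id, "status": "succeeded"}

def checkout(gateway, amount_cents):
    """App code depends on the gateway interface, not on a vendor SDK."""
    result = gateway.charge(amount_cents)
    return result["status"] == "succeeded"
```

Pair it with a contract test against the real sandbox post-merge, so the fake and the vendor can't silently drift apart.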

Playwright config example:

// playwright.config.ts
import { defineConfig } from '@playwright/test';
export default defineConfig({
  timeout: 30_000,
  retries: 1,
  use: { baseURL: process.env.APP_URL, trace: 'on-first-retry' },
  reporter: [['junit', { outputFile: 'playwright.xml' }]],
  forbidOnly: true,
  grepInvert: /@quarantine/,
  projects: [{ name: 'chromium' }],
});

Ephemeral stack via compose in CI:

- name: Start stack
  run: |
    docker compose -f docker-compose.test.yml up -d --wait
    # `export` inside a step does not reach later steps; persist via GITHUB_ENV
    echo "APP_URL=http://localhost:8080" >> "$GITHUB_ENV"
- name: Run smoke
  run: npx playwright test tests/smoke --reporter=junit
- name: Teardown
  if: always()
  run: docker compose -f docker-compose.test.yml down -v

Faster recovery beats perfect (canary + flags + auto‑rollback)

Perfect tests don’t exist. Design for fast, safe recovery.

  • Progressive delivery with Argo Rollouts or Spinnaker

    • Canary 5% → 25% → 50% with pauses and analysis. Roll back on SLO breach.
  • Feature flags (LaunchDarkly/Unleash) reduce blast radius

    • Ship dark; expose to 1% of traffic. Flags are change control.
  • Automated rollback

    • Roll back on elevated error rate/latency for N minutes. Wire to metrics, not humans.
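
The "wire to metrics, not humans" rule is a sliding window over per-minute error rates. A sketch (thresholds illustrative; in production this logic lives in your analysis template or alerting rules, not app code):

```python
from collections import deque

class RollbackGate:
    """Fires rollback when the error rate stays above a threshold for
    `window` consecutive one-minute samples."""
    def __init__(self, threshold=0.05, window=5):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # rolling error-rate window

    def observe(self, errors, requests):
        """Feed one minute of traffic; returns True when rollback should fire."""
        rate = errors / requests if requests else 0.0
        self.samples.append(rate)
        return (len(self.samples) == self.samples.maxlen
                and all(r > self.threshold for r in self.samples))
```

Requiring N consecutive bad samples is what keeps a single noisy scrape from rolling back a healthy canary.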

Argo Rollouts skeleton:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - setWeight: 25
        - pause: { duration: 5m }
      analysis:
        templates:
          - templateName: error-rate
        startingStep: 1

Measure MTTR from “alert” to “traffic restored.” Practice monthly. If it’s manual, it’s slow.

Checklists that scale with team size

Run these rituals so you don’t regress when headcount doubles.

  • Daily flake triage (15 minutes)

    • Review non-deterministic failures from the last 24 hours.
    • Mark quarantine, file a ticket with owner + expiry date, post to a shared board.
    • If flake rate > 1%, freeze non-critical merges until addressed.
  • Weekly pipeline budget review (30 minutes)

    • p95 PR first-signal under 10 minutes? If not: add shards, trim work, increase cache hits.
    • Top 10 slow tests/modules—optimize or split.
    • Invalidations: cache keys still correct? Are we over-invoking e2e via path filters?
  • Monthly hygiene (60 minutes)

    • Expire quarantined tests older than 14 days; escalate to owners’ manager.
    • Bump toolchains, re-pin versions, verify Docker image provenance.
    • Chaos test the rollback path; verify Argo Rollouts and flags actually abort.

Copy-paste starter:

CI Keeper Rotation
- Rotate weekly; on-call for CI flake and pipeline incidents
- Owns daily triage, weekly budget review, and monthly hygiene
- Reports CFR, lead time, MTTR to Eng Leads every Friday

Results we’ve seen (and you can replicate)

At a fintech client on GitHub Actions + Argo Rollouts:

  • p95 PR first-signal: 38 minutes → 9 minutes
  • Main branch full suite: 62 minutes → 21 minutes
  • Flake rate: 6.2% → 0.8%
  • CFR: 19% → 9% in six weeks
  • MTTR: 2h 4m → 24m (flags + auto-rollback)
  • Lead time: 2.5 days → 0.9 days

What changed:

  • Quarantine + deterministic tests via Testcontainers
  • Merge queue with pre-merge smoke only; moved heavy e2e post-merge
  • Sharded unit tests (Jest --shard) and aggressive BuildKit + pnpm cache
  • Argo canaries + LaunchDarkly; rollback on error budget burn

If you only do three things this quarter: quarantine flakes, enforce a <10 minute first signal, and automate rollback. Your DORA metrics will move. Your team will feel it in code review pace and fewer “try rerunning?” pings.


Key takeaways

  • Make change failure rate, lead time, and MTTR the north-star metrics and wire CI/CD data to measure them weekly.
  • Quarantine flaky tests fast; treat them like production incidents with an owner and an expiry date.
  • Design pipelines for a <10 minute first signal: parallelize, split tests, cache aggressively, and cancel in-progress runs.
  • Stabilize dependencies with hermetic tests, ephemeral infra via Testcontainers, and strict timeouts.
  • Push risky tests to post-merge with a merge queue and progressive delivery; use feature flags to reduce blast radius.
  • Automate rollback via Argo Rollouts and SLO-driven gates; measure recovery time in minutes, not hours.
  • Run the checklists—daily flake triage, weekly pipeline budget review, and monthly hygiene—to keep it from regressing.

Implementation checklist

  • Create a CI dashboard tracking CFR, lead time, MTTR, p95 pipeline time, and flake rate.
  • Enable merge queue and required checks; separate pre-merge smoke from post-merge exhaustive suites.
  • Implement test quarantine: mark, exclude from blocking, track, and expire within 14 days.
  • Adopt deterministic tests: fixed seeds, UTC timezone, hermetic dependencies via Testcontainers.
  • Turn on aggressive caching: language caches, Docker BuildKit cache-to/from, remote build caches (Bazel/Nx/Gradle).
  • Shard and parallelize tests; cancel in-progress runs on new commits; fail fast on first red check.
  • Introduce canary/feature flags and SLO-based automated rollback; practice it monthly.

Questions we hear from teams

How do I measure flake rate reliably?
Tag test runs with commit SHA and runner ID; a failure followed by a pass with no code change counts as a flake. Aggregate from JUnit across 7 days and compute flake_count / total_runs per test. Anything >2% goes to quarantine immediately.
Should I add retries to tests?
Prefer zero retries. If you must, cap at 1 and only around known-transient boundaries (browser startup, network handshake). Never use retries to mask data races, timeouts, or order-dependence.
Do I need Bazel to get fast pipelines?
No. Bazel helps at scale, but you can get 80% with path filters, sharding, aggressive caches (pnpm/pip/Gradle), BuildKit cache-to/from, and moving heavy suites post-merge behind a merge queue.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to a GitPlumber about stabilizing your CI, or read the CI stabilization case study.
