Guides · May 30, 2026 · 9 minute read

Your CI Is Lying to You: Deflake Tests and Cut Pipeline Time Without Slowing Delivery

A pragmatic playbook to reduce flaky tests and pipeline latency with measurable checkpoints, concrete configs, and tooling that works in the real world.

GitPlumbers Editorial Team

Legacy + AI Code Rescue, Release Engineering, and Reliability

We’re the folks teams call when CI is flaky, pipelines are slow, and “works on my machine” is becoming a business strategy. We’ve rebuilt release pipelines after outages, untangled brittle test suites, and cleaned up AI-assisted code that shipped faster than it could be trusted.

If your team’s CI ritual includes “just rerun it,” you’re not shipping faster—you’re rolling dice with better branding.

The two numbers that decide whether you ship: flake rate and pipeline lead time

If you’ve ever watched a release train get derailed by a red build that “goes green on rerun,” you already know the problem: CI becomes noise. Engineers stop trusting it, PRs stack up, and risk leaks into production.

Focus on two metrics that correlate strongly with delivery speed and safety:

Flake rate: percent of CI runs that fail and then pass on immediate rerun with no code changes.
- Target (healthy): < 0.5% of runs on main.
- Danger zone: > 2% and trending up.
Pipeline lead time: end-to-end time from commit to signal (green/red) on main.
- Track p50/p95 duration and queue time separately.
- A lot of teams think they have a 12-minute pipeline; in reality it’s 12 minutes + 18 minutes waiting for runners.

Checkpoint: Write these down for the last 14 days:

% green-first-try on main
flake_rate = flaky_failures / total_runs
p95_pipeline_duration (excluding queue)
p95_queue_time

If you can’t calculate them reliably, your first job is instrumentation—not “fixing tests.”
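
As a rough illustration of the checkpoint arithmetic, here is a minimal Python sketch. The `Run` record and its field names are hypothetical; in practice the data would come from your CI provider's API or an export. Flake rate is taken over all runs on main here so it lines up with the "< 0.5% of runs" target; divide by failed runs instead if that is how your team defines it.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One CI run on main. Field names are hypothetical."""
    passed_first_try: bool
    passed_on_rerun: bool  # only meaningful when passed_first_try is False
    duration_min: float    # execution time, excluding queue
    queue_min: float       # time spent waiting for a runner


def p95(values):
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def baseline(runs):
    """Compute the four checkpoint numbers for a non-empty list of runs."""
    total = len(runs)
    green_first = sum(r.passed_first_try for r in runs)
    flaky = sum((not r.passed_first_try) and r.passed_on_rerun for r in runs)
    return {
        "green_first_try_pct": 100 * green_first / total,
        "flake_rate_pct": 100 * flaky / total,
        "p95_duration_min": p95([r.duration_min for r in runs]),
        "p95_queue_min": p95([r.queue_min for r in runs]),
    }
```

Keeping duration and queue time as separate fields is what lets you tell a slow suite apart from a runner shortage.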

Baseline like an adult: instrument CI so you can prove progress

I’ve seen teams “deflake” for weeks and get nowhere because they were optimizing vibes. You need artifacts and trendlines.

Publish test results every run (even on failure). If you don’t have JUnit XML (or similar), add it.
Attach timing: suite duration, slowest tests, and runner wait time.
Centralize visibility: GitHub Actions summary is fine to start; Datadog CI Visibility / Buildkite Analytics / CircleCI Insights are better if you already pay for them.

Here’s a practical GitHub Actions pattern: always upload JUnit and keep logs searchable.

# .github/workflows/ci.yml
name: ci
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm

      - name: Install
        run: npm ci

      - name: Test (Jest)
        # jest-junit must be installed as a devDependency for the reporter to load.
        run: |
          npm test -- --ci --reporters=default --reporters=jest-junit
        env:
          JEST_JUNIT_OUTPUT_DIR: test-results/jest
          JEST_JUNIT_OUTPUT_NAME: junit.xml

      - name: Upload test results (always)
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: junit
          path: test-results/**

Checkpoint: After this lands, verify you can answer:

What are the top 20 failing tests by count on main?
What are the top 20 slowest tests/suites?
How much time is queue vs execution?

No dashboards yet? A crude weekly export is still better than blind firefighting.
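
A crude export can be as small as the sketch below, which assumes you have downloaded the JUnit XML artifacts from several runs and read them into strings. The function name `summarize` is made up for illustration, and real JUnit files vary by reporter, so treat the attribute handling as a starting point.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize(junit_xml_docs, top=20):
    """Count failures per test and keep each test's slowest observed time."""
    failures = Counter()
    slowest = {}
    for doc in junit_xml_docs:
        root = ET.fromstring(doc)
        # iter() finds <testcase> whether the root is <testsuites> or <testsuite>.
        for case in root.iter("testcase"):
            name = f'{case.get("classname", "")}::{case.get("name", "")}'
            seconds = float(case.get("time") or 0)
            slowest[name] = max(slowest.get(name, 0.0), seconds)
            if case.find("failure") is not None or case.find("error") is not None:
                failures[name] += 1
    top_slow = sorted(slowest.items(), key=lambda kv: kv[1], reverse=True)[:top]
    return failures.most_common(top), top_slow
```

A test that shows up in the failure list across runs with no related code change is a flake candidate.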

Deflake at the source: determinism beats retries

Flaky tests are rarely “mystical.” They’re usually one of these four:

Time: tests depend on wall clock, timezones, DST, or async timing.
Randomness: unseeded RNG, randomized test order, or data generation.
Shared state: database rows leaking, global singletons, parallel tests fighting.
External dependencies: network calls, real queues, third-party APIs, eventual consistency.

Time: freeze it, don’t fight it

JavaScript/TypeScript example (Jest):

// example.test.ts
beforeEach(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2025-01-01T00:00:00Z'));
});

afterEach(() => {
  jest.useRealTimers();
});

Python example (freezegun):

from freezegun import freeze_time

# compute_due_date() stands in for your own code under test; it is not defined here.
def test_invoice_due_date():
    with freeze_time("2025-01-01"):
        assert compute_due_date() == "2025-01-31"

Randomness: seed it and log the seed

Seed generators (Faker, random UUIDs, property-based tests) so failures reproduce.
When a randomized test fails, print the seed in the failure output.
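
One way to do both, sketched in Python: read the seed from an environment variable when it is set, otherwise pick one, and print it either way. The `TEST_SEED` variable name and both helper functions are assumptions for illustration, not a convention of any particular test framework.

```python
import os
import random


def make_rng(env_var="TEST_SEED"):
    """Return (seed, rng). Reuses the seed from the environment when set."""
    raw = os.environ.get(env_var)
    seed = int(raw) if raw else random.SystemRandom().randrange(2**32)
    return seed, random.Random(seed)


def sample_order_ids(rng, count=3):
    """Example of test data drawn from the seeded generator."""
    return [rng.randrange(1_000_000) for _ in range(count)]


seed, rng = make_rng()
print(f"TEST_SEED={seed}")  # lands in the CI log next to any failure
ids = sample_order_ids(rng)
```

To reproduce a failure, copy the printed value into `TEST_SEED` and rerun; the generated data will be identical.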

Shared state: isolate aggressively

Prefer per-test transactions with rollback.
Use unique schemas/databases per parallel worker (or namespaced keys).
Delete the “helpful” global cache in tests; it’s never helpful.
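
The per-test transaction idea can be sketched with Python's built-in `sqlite3` module. The `rolled_back` helper and the `invoices` table are hypothetical; with a real database you would typically get the same effect from your framework's transactional test fixture.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def rolled_back(conn):
    """Yield the connection, then discard whatever the test wrote."""
    try:
        yield conn
    finally:
        conn.rollback()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, total REAL)")
conn.commit()  # schema is committed once, outside any test


def count_invoices(conn):
    return conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]


with rolled_back(conn) as db:
    db.execute("INSERT INTO invoices (total) VALUES (42.0)")  # opens a transaction
    inside = count_invoices(db)   # the test sees its own row

after = count_invoices(conn)      # the next test starts from a clean table
```

Because nothing the test writes is ever committed, there is nothing to clean up and nothing for the next test to trip over.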

External dependencies: stub the boundary

Here’s what actually works:

Contract tests at the edge (e.g., Pact, Schemathesis) + fast unit tests inside.
For integration tests, spin dependencies locally (Docker Compose / Testcontainers) and pin versions.
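
For the fast unit tests inside the boundary, a minimal sketch of the stubbing idea: the code under test depends on a narrow interface, and tests pass in an in-memory fake instead of a network client. `RatesClient`, `FakeRatesClient`, and `to_usd` are all invented names for illustration; a contract test would then be responsible for checking that the fake and the real API agree.

```python
from typing import Protocol


class RatesClient(Protocol):
    """The only surface the application needs from the external service."""

    def usd_rate(self, currency: str) -> float: ...


class FakeRatesClient:
    """In-memory stand-in for a third-party rates API (hypothetical)."""

    def __init__(self, rates):
        self._rates = rates

    def usd_rate(self, currency: str) -> float:
        return self._rates[currency]


def to_usd(amount: float, currency: str, client: RatesClient) -> float:
    """Code under test: no network access, so no network flakiness."""
    return round(amount * client.usd_rate(currency), 2)
```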

Checkpoint: For your top 10 flakes, tag the root cause category (time/random/state/external). If you can’t categorize a test in 15 minutes, it’s usually shared state or external I/O hiding behind a helper.

Contain the blast radius: quarantine + guardrailed retries (not “rerun until green”)

Sometimes you need shipping safety today while you pay down flake debt. That’s where quarantine and limited retries come in.

The anti-pattern I’ve seen fail: unlimited retries across the whole suite. It hides real failures, inflates duration, and makes CI less trustworthy.

A workable policy

Quarantine a test only when:
- It’s proven flaky (passes on rerun) and
- You have an owner + a ticket + a deadline.
Quarantined tests run:
- On main in a separate job (non-blocking), or
- Nightly, with alerts when they fail.
Retries are:
- Scoped (E2E only, or specific tagged tests)
- Limited (1–2 retries max)
- Temporary (expires in 2 weeks)
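
A policy like this is easier to keep honest when CI checks it. The sketch below assumes the quarantine list is kept as plain records with `test`, `owner`, `ticket`, `deadline`, and `retries` fields; that schema and the `policy_violations` helper are hypothetical, so adapt them to wherever your list actually lives.

```python
from datetime import date

MAX_RETRIES = 2  # matches the "1-2 retries max" rule above


def policy_violations(entries, today):
    """Return human-readable problems with a quarantine list."""
    problems = []
    for entry in entries:
        test = entry.get("test", "<unnamed>")
        for field in ("owner", "ticket", "deadline"):
            if not entry.get(field):
                problems.append(f"{test}: missing {field}")
        deadline = entry.get("deadline")
        if deadline and deadline < today:
            problems.append(f"{test}: quarantine expired on {deadline.isoformat()}")
        if entry.get("retries", 0) > MAX_RETRIES:
            problems.append(f"{test}: retries above {MAX_RETRIES}")
    return problems
```

Failing a small, non-blocking job when this list is non-empty is one way to make the deadline real rather than aspirational.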

Playwright example (retries only on CI):

// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 1 : 0,
  workers: process.env.CI ? 4 : 1,
  reporter: [['list'], ['junit', { outputFile: 'test-results/playwright/junit.xml' }]],
});

Jest example (use sparingly):

// jest.config.js
module.exports = {
  testEnvironment: 'node',
  testTimeout: 30000,
  // No suite-wide retries here on purpose; don’t blanket this.
  // For a specific, identified flaky suite, call jest.retryTimes(1) at the top
  // of that test file (needs the jest-circus runner, the default since Jest 27).
};

Checkpoint: After quarantine + retries land, your goal is:

main becomes green-first-try again (trust restored)
Quarantine list trends down weekly (not up)

If quarantine grows forever, you’ve created a junk drawer.

Cut pipeline time without buying more runners: shard, cache, and select

Pipeline latency is usually death by a thousand cuts: cold dependency installs, no caching, serial suites, and “run everything” for every PR.

1) Shard test suites (and make shards stable)

Stable sharding (a given test always lands on the same shard) keeps runs comparable and avoids the “shard 3 is always slow” mystery.

GitHub Actions matrix sharding pattern:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - name: Run shard
        run: |
          node ./scripts/run-tests-sharded.js --shard ${{ matrix.shard }} --total 4
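
The `run-tests-sharded.js` script is not shown here. As a sketch of what a stable assignment could look like, written in Python rather than Node for brevity, the idea is to hash each test path so the same file always maps to the same shard. `shard_for` and `select` are illustrative names, not part of any tool.

```python
import hashlib


def shard_for(test_path, total):
    """Map a test file to a 1-based shard.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would reshuffle tests on every run.
    """
    digest = hashlib.sha256(test_path.encode("utf-8")).hexdigest()
    return int(digest, 16) % total + 1


def select(test_paths, shard, total):
    """Return the tests this shard should run, in a deterministic order."""
    return sorted(p for p in test_paths if shard_for(p, total) == shard)
```

Hashing balances shards by file count, not by runtime. If one shard stays slow, the usual next step is to assign tests using recorded durations instead.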

If you’re Python-heavy, pytest-xdist is still the quickest win:

pytest -n auto --dist=loadscope --junitxml=test-results/pytest/junit.xml

2) Cache the expensive parts (and measure hit rate)

Dependency caches: npm, pip, bundler, maven, gradle.
Build caches: Gradle build cache, Bazel remote cache, Nx/Turborepo caching.

Gradle example:

# gradle.properties
org.gradle.caching=true
org.gradle.parallel=true

Checkpoint: Track cache hit rate. If you can’t, you’ll “add caching” and still be slow.
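
The calculation itself is trivial once you record a hit-or-miss flag per job. In GitHub Actions, for example, `actions/cache` exposes a `cache-hit` output you can log. The function below is a sketch that assumes you have already collected those flags as booleans.

```python
def cache_hit_rate(events):
    """events: iterable of booleans, True when a cache restore was a hit."""
    events = list(events)
    if not events:
        return None  # no data is different from a 0% hit rate
    return sum(events) / len(events)
```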

3) Use test selection (with guardrails)

Running the full suite on every PR is comforting—and often unnecessary.

Pragmatic ladder:

PRs: unit + fast integration + lint/typecheck
main: full suite
Nightly: heavy E2E/perf/security

Tools that help:

Monorepos: Nx affected, Bazel, Pants, Buck2
JVM: Gradle incremental tasks + test filtering
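
If none of those tools fit, a hand-rolled selector with guardrails might look like the sketch below. The path patterns, test targets, and the `"ALL"` sentinel are hypothetical. The point is the fallback behavior: anything the rules do not recognize runs the full suite rather than guessing.

```python
from fnmatch import fnmatch

# Hypothetical mapping from source globs to test targets.
RULES = [
    ("services/billing/*", "tests/billing"),
    ("services/auth/*", "tests/auth"),
]
# Changes here can affect anything, so they force the full suite.
RUN_EVERYTHING = ["package-lock.json", ".github/*", "shared/*"]


def select_tests(changed_files):
    """Return the test targets for a change set, or ["ALL"] when unsure."""
    targets = set()
    for path in changed_files:
        if any(fnmatch(path, glob) for glob in RUN_EVERYTHING):
            return ["ALL"]
        matched = [target for glob, target in RULES if fnmatch(path, glob)]
        if not matched:
            return ["ALL"]  # guardrail: unknown file, do not guess
        targets.update(matched)
    return sorted(targets)
```

Running the full suite on main, as in the ladder above, is what catches anything the selector misses on a PR.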

Proof point: On a fintech CI I worked on, stable sharding + caching took p95 pipeline from ~42 minutes to ~14 minutes in a week, without touching app code. The next week was deflaking and data isolation.

Make it stick: CI SLOs and automatic regression alarms

Treat CI like a production system with an SLO (Service Level Objective: a reliability target with an error budget). Otherwise, flakiness slowly creeps back in—especially after a big refactor or a wave of AI-generated changes.

Good CI SLOs:

Reliability SLO: >= 99.5% of main runs are green on first attempt (excluding known quarantined tests).
Speed SLO: p95 pipeline duration <= 12 minutes (or whatever supports your release cadence).
Signal SLO: p95 time-to-first-failure <= 5 minutes (fail fast when it’s going to fail).
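
To make the error budget concrete: under the 99.5% reliability SLO, a window of 400 runs on main can absorb 2 runs that are not green on first attempt. A small sketch of that arithmetic follows (the function name is illustrative, and the SLO is passed as a string so the fraction is exact):

```python
from fractions import Fraction


def error_budget(total_runs, non_green_runs, slo="0.995"):
    """Return (allowed, remaining) non-green runs for the window."""
    allowed = int(total_runs * (1 - Fraction(slo)))
    return allowed, allowed - non_green_runs
```

A negative remaining budget is the signal that reliability work should take priority over new pipeline features.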

Enforcement patterns that work:

Weekly report: top flakes, top slow suites, quarantine list, cache hit rate.
“Two-way door” policy: if a PR adds > X minutes to p95 or introduces a new flake, it needs a fix or a plan.
Separate lane for experimental/AI-assisted code changes until they prove stable.

Checkpoint: Add a lightweight gate:

If p95 increases by >15% week-over-week → investigate.
If flake rate exceeds 1% → schedule a deflake sprint or stop-the-line for the top offenders.
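
Expressed as code, the gate is only a few lines. This sketch assumes you already have last week's and this week's p95 plus the current flake rate as a percentage. The function name and message wording are made up.

```python
def ci_gate(p95_last_week, p95_this_week, flake_rate_pct):
    """Return a list of alerts; an empty list means the gate passes."""
    alerts = []
    if p95_last_week > 0:
        change = (p95_this_week - p95_last_week) / p95_last_week
        if change > 0.15:
            alerts.append(f"p95 up {change:.0%} week-over-week: investigate")
    if flake_rate_pct > 1.0:
        alerts.append("flake rate above 1%: deflake sprint or stop-the-line")
    return alerts
```

Posting the returned alerts to the team channel from a scheduled job is usually enough to start; blocking merges can come later if the alerts get ignored.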

When you want this fixed fast (and provably): how GitPlumbers helps

I’ve seen teams lose months to CI weirdness because nobody owns the full pipeline end-to-end. That’s where an outside pair of seasoned eyes pays for itself.

At GitPlumbers, we typically tackle this in three steps:

Run Automated Insights (GitHub-integrated) to quickly surface structural issues that correlate with flakes and slow CI: risky test patterns, dependency churn, hotspots, and reliability smells.
Book a code audit focused on test determinism + pipeline architecture: what’s flaky, why, and the shortest path to stable green builds and a faster p95.
Assemble a fractional team for remediation if you need parallel workstreams (E2E deflake + build caching + CI refactor) without derailing product delivery.

If you’re stuck in rerun culture, the fastest win is usually: baseline metrics → quarantine policy → top 10 deflake → shard + cache → CI SLOs. We can help you execute that in weeks, not quarters.

Next step: run GitPlumbers Automated Insights, then book a code audit to turn the findings into a prioritized remediation plan you can ship against.

Key takeaways

Track **flake rate** and **p95 pipeline duration** before changing anything; otherwise you’re guessing.
Most flakes come from **time**, **randomness**, **shared state**, and **external dependencies**—make tests deterministic first.
Use **quarantine + guardrailed retries** to protect mainline while you pay down flake debt.
Reduce latency with **sharding, caching, and test selection** before you buy more runners.
Set CI **SLOs** (yes, for CI) so regressions trigger action automatically instead of becoming culture.

Implementation checklist

Define baseline metrics: flake rate, p95 pipeline duration, queue time, % green-first-try on `main`
Ensure JUnit (or equivalent) artifacts are always uploaded, even on failure
Identify top 10 flaky tests by rerun frequency and isolate root causes (time, randomness, shared state, external calls)
Implement quarantine workflow and a strict policy for retries (limited scope, short expiry)
Shard tests and add dependency/build caches; validate cache hit rates
Add CI SLOs + regression alerts (flake rate and p95 duration budgets)
Schedule weekly deflake work until flake rate target is met

Questions we hear from teams

Should we just add retries to make the pipeline green?

Only as a temporary containment measure. Retries should be limited (1–2), scoped (usually E2E), and paired with quarantine + an owner and deadline. Blanket retries hide real failures and increase pipeline duration.

What’s the fastest way to identify flaky tests?

Start by consistently exporting JUnit XML and correlating failures across runs. Flakes typically show up as the same test failing intermittently with no related code change. If you already have GitHub history, you can usually identify the top offenders within a day once artifacts are reliable.

How do we know if we should optimize speed or reliability first?

If `main` is not green-first-try, prioritize reliability first: speed doesn’t matter if the signal is untrustworthy. If reliability is solid but p95 pipeline time is high, prioritize sharding/caching/test selection. Most teams can do both in parallel once they’ve instrumented the baseline.
