Stop Letting CI Flake Run Your Roadmap: How We Cut Pipeline Time by 60% Without Burning the Team

CI flake isn’t just annoying—it corrupts your feedback loop and poisons your DORA metrics. Here’s the playbook we use to make builds deterministic, pipelines fast, and failures trustworthy.

Flaky CI turns every red build into a negotiation. Make failures trustworthy and the rest gets easier.

The CI you don’t trust is the CI you ignore

I walked into a fintech where every broken build was a coin flip. PRs waited behind a 42-minute pipeline that failed ~7% of the time for reasons nobody could reproduce. Engineers were rerunning jobs, not fixing code. Change failure rate (CFR) drifted over 20%, lead time bloated to days, and mean time to recovery (MTTR) crept past 2 hours because nobody could tell signal from noise.

We didn’t buy a bigger runner. We removed nondeterminism, cut the critical path, and made failures trustworthy. Six weeks later: pipelines ran in 12–16 minutes, flake was below 1%, CFR had halved, and roll-forwards were normal again. Here’s the playbook.


Key takeaways

  • Flake is a systems problem—eliminate nondeterminism first, don’t paper over it with retries.
  • Optimize for Change Failure Rate, Lead Time, and MTTR; pipeline minute-counts are secondary.
  • Design pipelines for the critical path: hermetic builds, targeted testing, and aggressive caching.
  • Quarantine flaky tests to restore trust; track them like incidents with owners and deadlines.
  • Speed recovery with merge queues, feature flags, and progressive delivery (canaries).
  • Instrument your pipeline and set SLOs for duration and pass rate; alert on the feedback loop, not vanity metrics.
  • Codify checklists so fixes scale with team growth.

Implementation checklist

  • Pin toolchains and dependencies (container images, language runtimes, build tools).
  • Make tests hermetic: fixed seeds, frozen clocks, isolated network/filesystem, ephemeral services via Testcontainers (see the test sketch after this checklist).
  • Cache the right layers: deps, build artifacts, and test results where deterministic.
  • Target work: run only affected tests using Nx/Bazel/Gradle test selection.
  • Quarantine known-flaky tests; prevent them from blocking main; file tickets with owners and kill dates.
  • Implement merge queue and required checks; trunk-based development to reduce long-lived branch drift.
  • Use progressive delivery (Argo Rollouts/Flagger) and feature flags to limit blast radius; prefer roll-forward.
  • Instrument DORA metrics and CI SLOs; alert when the 95th-percentile pipeline duration or pass-rate SLO is breached (see the SLO sketch after this checklist).
  • Run weekly flake triage and monthly pipeline cost reviews; delete or refactor the slowest 1% of tests each cycle.
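
Here is a minimal sketch of the hermetic-test item above, assuming Jest and the testcontainers npm package. The mulberry32 helper, the redis:7 image, the frozen date, and the test names are illustrative choices, not a prescribed stack:

```typescript
// hermetic.test.ts - sketch of a hermetic test: fixed seed, frozen clock,
// and an ephemeral Redis started per suite via testcontainers.
import { GenericContainer, StartedTestContainer } from "testcontainers";

// Tiny deterministic PRNG (mulberry32) so "random" data is reproducible across runs.
function seededRandom(seed: number): () => number {
  return () => {
    seed |= 0; seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

describe("order pricing", () => {
  let redis: StartedTestContainer;

  beforeAll(async () => {
    // Ephemeral service: no shared staging Redis, no port collisions between CI jobs.
    redis = await new GenericContainer("redis:7").withExposedPorts(6379).start();
  }, 60_000);

  afterAll(async () => {
    await redis.stop();
  });

  beforeEach(() => {
    // Freeze the clock so "today", TTLs, and date math never drift with wall time.
    jest.useFakeTimers();
    jest.setSystemTime(new Date("2024-01-15T00:00:00Z"));
  });

  afterEach(() => jest.useRealTimers());

  it("prices a randomized cart deterministically", () => {
    // Same seed => same "random" cart => same result, run after run, machine after machine.
    const buildCart = (rand: () => number) =>
      Array.from({ length: 5 }, () => Math.floor(rand() * 10) + 1);
    expect(buildCart(seededRandom(42))).toEqual(buildCart(seededRandom(42)));

    // The frozen clock means date math can be asserted exactly.
    expect(new Date().toISOString()).toBe("2024-01-15T00:00:00.000Z");

    // The system under test reads the ephemeral service's address, not a hard-coded host.
    const redisUrl = `redis://${redis.getHost()}:${redis.getMappedPort(6379)}`;
    expect(redisUrl).toContain("redis://");
  });
});
```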
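And a sketch of the SLO gate from the instrumentation item, assuming you can pull recent default-branch runs from your CI provider's API. The thresholds and the inline example data are placeholders to replace with your own numbers:

```typescript
// ci-slo-check.ts - fail (and page) when p95 pipeline duration or pass rate
// breaches its SLO. Run data here is inlined; in practice a scheduled job would
// pull the last ~200 default-branch runs from your CI provider.
interface PipelineRun {
  durationMinutes: number;
  passed: boolean;
}

const P95_DURATION_SLO_MIN = 15; // target p95 pipeline duration
const PASS_RATE_SLO = 0.95;      // target pass rate on the default branch

function p95(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1)];
}

function checkSlos(runs: PipelineRun[]): string[] {
  const breaches: string[] = [];
  const p95Duration = p95(runs.map((r) => r.durationMinutes));
  const passRate = runs.filter((r) => r.passed).length / runs.length;
  if (p95Duration > P95_DURATION_SLO_MIN) {
    breaches.push(`p95 duration ${p95Duration.toFixed(1)}m > ${P95_DURATION_SLO_MIN}m SLO`);
  }
  if (passRate < PASS_RATE_SLO) {
    breaches.push(`pass rate ${(passRate * 100).toFixed(1)}% < ${PASS_RATE_SLO * 100}% SLO`);
  }
  return breaches; // non-empty => alert on the feedback loop, not a vanity dashboard
}

const exampleRuns: PipelineRun[] = [
  { durationMinutes: 14, passed: true },
  { durationMinutes: 19, passed: false },
  { durationMinutes: 13, passed: true },
];
console.log(checkSlos(exampleRuns));
```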

Questions we hear from teams

Isn’t retrying tests good enough?
Retries are a circuit breaker, not a cure. They mask nondeterminism and inflate lead time. Use limited retries for known-transient issues, but quarantine flakes and fix root causes (time, randomness, shared services, version drift).
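A sketch of what quarantine can look like in practice, assuming Jest. The flakyAwareIt wrapper, the RUN_QUARANTINED variable, and the example entry are illustrative; the point is that the list lives in version control with an owner and a kill date per entry:

```typescript
// quarantine.ts - route quarantined tests to a separate, non-blocking CI job
// (which sets RUN_QUARANTINED=1) so one flaky test can't hold main hostage,
// while keeping each entry tracked like an incident.
interface QuarantineEntry {
  name: string;     // test name as passed to it()
  owner: string;    // team accountable for the fix
  killDate: string; // ISO date; fail the quarantine job if this has passed
  issue: string;    // tracking ticket
}

export const quarantined: QuarantineEntry[] = [
  { name: "checkout retries on 503", owner: "payments", killDate: "2024-03-01", issue: "PAY-1234" },
];

const names = new Set(quarantined.map((q) => q.name));
const runQuarantined = process.env.RUN_QUARANTINED === "1";

// Drop-in replacement for `it`: quarantined tests are skipped on the blocking path
// but still executed in the non-blocking lane so fixes can be verified.
export function flakyAwareIt(name: string, fn: () => void | Promise<void>): void {
  if (names.has(name) && !runQuarantined) {
    it.skip(name, fn);
  } else {
    it(name, fn);
  }
}
```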
How do I justify the investment to product?
Tie the work to DORA metrics and delivery speed. Cutting p95 pipeline time from 40 to 15 minutes recovers hours of engineer time weekly and reduces CFR. We’ve seen teams ship 2–3x more, and more safely, when feedback loops are fast and trustworthy.
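A rough back-of-the-envelope you can adapt; every input below is an assumption to swap for your own numbers:

```typescript
// Engineer-hours recovered per week by cutting p95 pipeline time from 40 to 15 minutes.
const engineers = 20;
const blockingBuildsPerEngineerPerDay = 4; // PR pushes and merge-queue runs they wait on
const minutesSavedPerBuild = 40 - 15;      // old p95 minus new p95
const workingDaysPerWeek = 5;
const attentionCostFactor = 0.3;           // fraction of wait time that is truly lost to context switching

const hoursRecoveredPerWeek =
  (engineers * blockingBuildsPerEngineerPerDay * minutesSavedPerBuild *
    workingDaysPerWeek * attentionCostFactor) / 60;

console.log(`~${hoursRecoveredPerWeek} engineer-hours/week`); // ~50 hours for these inputs
```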
What about AI-generated tests that seem flaky?
A lot of AI-generated code and tests are “vibe-coded” with hidden nondeterminism. Treat them like any other code: pin inputs, isolate dependencies, and quarantine until they earn trust. We do targeted AI code refactoring and vibe code cleanup as part of CI hardening.
Do we need to migrate to Bazel to get benefits?
No. Bazel is great for large monorepos, but you can get 80% of the benefit with Nx (JS), Gradle test selection (JVM), and smart caching. Start with test targeting and hermetic tests; adopt Bazel when your polyglot dependency graph justifies it.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Fix CI flake with GitPlumbers
Get a 60-minute release engineering assessment
