Stop Timing Standups. Start Timing Waits: Measuring Friction and Killing Hand‑Offs with a Paved Road
If your PRs age like cheese, you don’t need another platform — you need a stopwatch, a paved road, and the courage to kill bespoke snowflakes.
The Wednesday PR That Died in the Queue
I watched a team at a unicorn fintech push a two-line config change at 10:12 a.m. on a Wednesday. It merged Friday afternoon. Nothing was complex. The code was fine. The problem? Waits.
- CI sat in a Jenkins queue behind a nightly job that never ended.
- Environments required a ticket to a gatekeeper who provisioned a shared namespace “when they had time.”
- Reviews bounced between teams because nobody owned the folder.
I’ve seen this movie a hundred times. You don’t fix it with a developer portal or a motivational poster. You fix it by measuring the friction and killing the top waits with a paved road.
Measure Friction Like an SRE: Instrument the Wait States
Feelings are valid; metrics get budget. Start with a simple, shared definition of flow metrics:
- PR cycle time: `merged_at - created_at`.
- Time to first review: `first_review_at - created_at`.
- CI duration and queue time: sum of job runtimes and time spent waiting for runners.
- Environment lead time: from PR creation to a deployable preview env.
- Flake rate: % of CI runs that fail, then pass without code changes.
Pull this from GitHub/GitLab and your CI. Don’t over-engineer — a bash script and jq gets you 80%.
```bash
# Quick-and-dirty PR cycle time (hours) for the last 200 merged PRs
gh pr list --state merged --limit 200 \
  --json number,createdAt,mergedAt,reviews \
  | jq -r '.[] | [.number, .createdAt, .mergedAt,
      ((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600] | @csv'
```

Expose CI timing to Prometheus and graph 50th/95th percentiles in Grafana. If your CI can’t export metrics, that’s a smell.
```text
# HELP ci_stage_duration_seconds Duration of CI stages
# TYPE ci_stage_duration_seconds gauge
ci_stage_duration_seconds{stage="build"} 540
ci_stage_duration_seconds{stage="test"} 1200
ci_stage_duration_seconds{stage="publish"} 180
```

Set explicit SLOs for your platform:
- p50 CI < 10m, p95 CI < 20m
- First review < 2h during working hours
- Preview env ready < 10m for PRs
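To make those SLOs bite, alert on them. A minimal Prometheus rule sketch, assuming you export stage durations as a histogram (`ci_stage_duration_seconds_bucket`) rather than the gauge snapshot above; the group and alert names are illustrative:

```yaml
# Hypothetical alert: fire when CI p95 has been over the 20m SLO for 30 minutes.
# Assumes ci_stage_duration_seconds is exported as a histogram (_bucket series).
groups:
  - name: platform-slos
    rules:
      - alert: CiP95OverBudget
        expr: |
          histogram_quantile(0.95,
            sum(rate(ci_stage_duration_seconds_bucket[1h])) by (le)
          ) > 1200
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "CI p95 over the 20m SLO for 30m"
```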
Publish a weekly scorecard. You’ll get alignment without a single OKR meeting.
The Usual Suspects: Where the Time Actually Goes
Across GitPlumbers clients, the same culprits eat 60–80% of developer time:
- Long builds/test suites: un-cached dependencies, no Docker layer caching (see the Dockerfile sketch below), integration tests running on every commit.
- Queueing for runners: shared Jenkins with overloaded agents or under-provisioned GitHub-hosted runners.
- Serial approvals: change-advisory-board theater, or `CODEOWNERS` misconfig that pings the wrong team.
- Environment bottlenecks: shared QA clusters, manual Terraform apply, or a single ArgoCD app for everything.
- Flaky tests: retry culture masking problems and burning hours.
- Bespoke toolchains: every repo a snowflake tool stack; platform team can’t support any of them well.
When we baseline, we often see: median PR cycle ~ 2.5 days, time-to-first-review ~ 6h, CI p95 ~ 45m, flake rate 5–10%, preview env lead time 1–2 days (or none). That’s the hill you need to flatten.
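Fixing the first culprit starts in the Dockerfile. A minimal sketch for a Node service using BuildKit cache mounts (base image and paths are assumptions; pair it with the GHA cache config in the workflow later in this post):

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-slim AS build
WORKDIR /app

# Copy lockfiles alone so the dependency layer caches across code-only changes
COPY package.json package-lock.json ./

# BuildKit cache mount keeps the npm download cache warm between builds
RUN --mount=type=cache,target=/root/.npm npm ci

# Source changes invalidate only the layers below this line
COPY . .
RUN npm run build
```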
Build the Paved Road, Not a Tool Zoo
Pick defaults and make them great. Make the paved road the fastest path; let teams opt out with evidence.
- CI: GitHub Actions (or GitLab CI) with reusable workflows and aggressive caching.
- Builds: language-native caches plus Docker BuildKit with GHA cache backend.
- Preview envs: Vercel/Netlify for SPA/Next.js; `ApplicationSet` + namespace-per-PR for Kubernetes services.
- GitOps: ArgoCD with app-per-service; no giant monolithic app.
- Policy: OPA/Conftest in CI and protected branches with `CODEOWNERS`.
Here’s a minimal paved-road ci.yml that cuts Node+Docker CI from 30m to under 10m on most stacks:
```yaml
name: ci
on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build-test-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test -- --ci
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: ./Dockerfile
          push: false
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

For Kubernetes preview environments, stop ticketing. Generate them on PRs with an ArgoCD ApplicationSet:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-envs
spec:
  generators:
    - pullRequest:
        github:
          owner: your-org
          repo: your-repo
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 60
  template:
    metadata:
      name: 'pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/your-repo.git
        targetRevision: '{{head_sha}}'
        path: deploy/overlays/preview
        kustomize:
          namePrefix: 'pr-{{number}}-'
          images:
            - 'your-img:sha-{{head_sha}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

This buys you consistent, 5–10 minute envs without humans in the loop. For frontends, Vercel/Netlify does this out of the box — use it unless you truly need K8s.
Kill Hand-Offs with Policy-as-Code and Fast Approvals
The fastest review is the one routed to the right humans with clear rules. Use CODEOWNERS and a bot to enforce SLOs.
```text
# CODEOWNERS
/services/payments/    @payments-team
/infrastructure/       @platform-team
/terraform/modules/    @platform-team @secops
```

- Set branch protection: 1–2 required reviews max, status checks required, and auto-merge on green.
- Use OPA/Conftest to block risky changes; reserve human reviews for material risk.
- Instrument a Slack reminder bot: “PR #123 waiting 2h for first review.”
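A minimal sketch of that reminder bot, assuming an authenticated `gh` CLI and a `SLACK_WEBHOOK_URL` environment variable pointing at an incoming webhook; the 2h threshold matches the review SLO:

```bash
#!/usr/bin/env bash
# Hypothetical reminder bot: nag Slack about PRs past the 2h first-review SLO.
set -euo pipefail

now=$(date +%s)

gh pr list --state open --json number,title,createdAt,reviews \
  | jq -r --argjson now "$now" '
      .[]
      | select((.reviews | length) == 0)                        # no review yet
      | select(($now - (.createdAt | fromdateiso8601)) > 7200)  # older than 2h
      | "PR #\(.number) waiting \(($now - (.createdAt | fromdateiso8601)) / 3600 | floor)h for first review: \(.title)"' \
  | while IFS= read -r line; do
      # jq builds the JSON payload so titles containing quotes do not break it
      payload=$(jq -n --arg text "$line" '{text: $text}')
      curl -fsS -X POST -H 'Content-type: application/json' \
        --data "$payload" "$SLACK_WEBHOOK_URL"
    done
```

Run it on a 30-minute schedule during working hours so the nag lands before the SLO breaches, not after.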
This combo cuts “review pinball” without compromising risk. I’ve seen PCI shops ship daily with this setup — the trick is moving policy to code and making the paved road audit-friendly.
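Here is what moving policy to code looks like with Conftest: a minimal Rego sketch that blocks privileged containers so humans only review material risk (the package layout and rule are illustrative, not a full policy suite):

```rego
# policy/deploy.rego: hypothetical Conftest policy, run as `conftest test deploy/`
package main

deny[msg] {
  input.kind == "Deployment"
  some i
  container := input.spec.template.spec.containers[i]
  container.securityContext.privileged
  msg := sprintf("container %q must not run privileged", [container.name])
}
```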
Case Study: From Jenkins Sprawl to 8-Hour Cycle Time
At a payments client, we inherited:
- Jenkins with 120 freestyle jobs, 25m median build, p95 70m
- No preview environments; QA shared cluster with a weekly reset
- Two approvals on every PR, no `CODEOWNERS`
- Flake rate 7–9% across services
We shipped a paved road in 6 weeks:
- Baseline metrics in Grafana; published SLOs.
- Migrated the top 10 repos to a reusable Actions workflow with Node and Docker caching.
- Introduced ArgoCD `ApplicationSet` preview envs; teardown on merge.
- Added `CODEOWNERS`, review auto-assignment, and a 2h first-response SLO.
- Quarantined flaky tests and added a “flake sheriff” rotation for a month.
Results after 60 days:
- PR cycle time: 3.2 days → 8 hours median
- CI p95: 70m → 18m (median 9m)
- Flake rate: 8% → 0.8%
- Preview env lead time: N/A → 7m median
- Saved ~1.1 engineer-weeks per squad per sprint. Finance noticed before engineering did.
A 90-Day Plan You Can Actually Execute
Weeks 1–2: Baseline and align
- Collect 90 days of PR/CI data. Publish current p50/p95 for CI, cycle, review, env lead time.
- Agree on 3 SLOs and make them public.
Weeks 3–6: Ship the paved road
- Build one reusable CI workflow per language stack with caches and parallelization (caller sketch after this list).
- Stand up preview envs for one service path (frontend on Vercel; K8s via `ApplicationSet`).
- Add `CODEOWNERS`, required checks, and auto-merge on green.
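Once that workflow exists, each repo’s CI file shrinks to a caller. A minimal sketch, assuming a paved-road repo at `your-org/paved-road` exposing a `node-ci.yml` reusable workflow (both names are hypothetical):

```yaml
name: ci
on:
  pull_request:
jobs:
  paved-road:
    # Reuse the centrally maintained workflow instead of copy-pasting pipelines
    uses: your-org/paved-road/.github/workflows/node-ci.yml@v1
    with:
      node-version: "20"
    secrets: inherit
```

The reusable workflow itself declares `on: workflow_call` with typed inputs, so a cache fix or action bump there propagates to every caller on its next run.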
Weeks 7–10: Eliminate the top waits
- Tackle the slowest stage (often Docker build). Add BuildKit cache; split unit vs integration and shard what remains (see the sketch after this list).
- Add a small runner pool to kill CI queue time; autoscale if you can.
- Quarantine flaky tests and badge repos with flake rate until it’s <1%.
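A sketch of a sharded test job for the paved-road workflow, assuming a Jest suite (Jest 28+ ships `--shard`; the four-way split is arbitrary):

```yaml
test:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      shard: [1, 2, 3, 4]  # four parallel runners instead of one serial suite
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: 20, cache: npm }
    - run: npm ci
    - run: npx jest --ci --shard=${{ matrix.shard }}/4
```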
Weeks 11–13: Lock in and expand
- Roll paved road to the next 10 repos; delete bespoke pipelines.
- Add a Slack reminder bot for review SLOs.
- Publish a weekly platform scorecard; celebrate teams that hit SLOs.
Trade-Offs and Guardrails (So You Don’t Build a Platform for Its Own Sake)
I’ve seen Backstage rollouts that cost a quarter and moved zero metrics. Choose boring tech and measure impact.
- Bespoke vs paved: Only allow snowflakes with a written, measurable reason (e.g., special compliance or latency). Time-box re-evaluation.
- Monorepo vs polyrepo: The repo topology doesn’t matter if your CI and ownership are clear. Don’t reorg to avoid doing the hard CI/env work.
- Preview env sprawl: Enforce TTL and quotas (a quota sketch follows this list). Namespace-per-PR with network policies is safer than a shared “dev” cluster.
- Policy: Keep the policy code short and versioned. If reviewers don’t know why a check failed, you created a new wait state.
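For the sprawl point above, a minimal ResourceQuota sketch to drop into each PR namespace; the limits are illustrative starting points, not recommendations:

```yaml
# Hypothetical per-PR namespace quota: keeps preview sprawl from eating the cluster
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    pods: "10"
```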
You don’t need a developer portal. You need a stopwatch.
Ship the paved road, publish the SLOs, and remove what doesn’t earn its keep. That’s how you buy back developer time in weeks, not quarters.
Key takeaways
- Measure the waits, not the feelings: track PR cycle time, review latency, CI queue time, and environment provisioning time.
- Fix 3 waits first: CI/build time, review hand-off latency, and ephemeral env creation. They usually account for 60–80% of friction.
- Favor paved-road defaults with reusable workflows, caches, and preview environments. Kill bespoke snowflake tooling.
- Publish explicit SLOs for the platform (e.g., first review in <2h; CI median <10m) and instrument them.
- Automate approvals with policy-as-code and `CODEOWNERS` to cut review hand-offs without compromising risk.
Implementation checklist
- Baseline the top 5 wait states: PR cycle time, time-to-first-review, CI duration/queue, flaky rate, env create time.
- Create a paved-road CI template with language/runtime caches and Docker BuildKit cache.
- Add preview environments per PR via ArgoCD ApplicationSet (or Vercel/Fly.io for simple stacks).
- Set `CODEOWNERS`, auto-assign reviewers, and define review SLOs; enforce with a bot.
- Quarantine flaky tests and fail fast; badge repos with flake rate until it’s <1%.
- Publish a platform scorecard weekly; fix the longest wait first each sprint.
- Delete or sunset bespoke tools that duplicate the paved road — measure the reclaimed time.
Questions we hear from teams
- Do we need Backstage to improve DevEx?
- Not to start. Backstage can be useful as an index over your paved road, but it won’t cut wait times by itself. Ship fast CI, preview envs, and policy-as-code first; add a portal after the metrics move.
- Can we do this in a regulated (PCI/SOC2/HIPAA) environment?
- Yes. Put policy in code (OPA), require reviews where risk is real (infra, secrets, PII), and make every change auditable via GitOps. Automated checks usually strengthen compliance and shorten audits.
- Do we need a monorepo to get these gains?
- No. The biggest gains come from CI caching, preview envs, and clear ownership. Monorepos can help with dependency control, but they are neither necessary nor sufficient for speed.
- What if our tests are the bottleneck?
- Split unit vs integration, parallelize, and cache. Quarantine flakes and fail fast. Tools like `pytest-xdist`, `jest --runInBand` vs `--maxWorkers`, and test sharding in CI cut time without rewriting tests.
- How do we keep teams from ignoring the paved road?
- Make it the fastest option. Measure adoption, deprecate snowflakes, and require a written exception with SLO impact. Celebrate teams that hit the SLOs on the paved road.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
