Stop Paying the Wait Tax: Measure Dev Friction and Kill Hand‑Offs with a Paved Road
If your engineers spend more time waiting than shipping, you don’t have a talent problem—you have a platform problem. Instrument the friction, fix the top waits, and give teams a paved road.
“Developer productivity is a latency problem wearing a people problem’s clothes.”
The wait tax you don’t see
I walked into a unicorn’s platform review where engineering was “slow.” The data said otherwise: typing time was fine. But PRs waited 1.9 days for first review, CI queued for 20–40 minutes during peak, and staging environments took a human hand‑off to provision. Lead time looked like a highway: short driving, long traffic.
I’ve seen this movie at banks, adtech, and FAANG-adjacent shops. The pattern is consistent: we optimize code and talent, then drown both in queueing theory. The fix isn’t a motivational poster—it’s measuring friction like latency and then paving the road so teams don’t have to ask for permission to ship.
Measure friction like an SRE, not vibes
You can’t fix what you can’t see. Treat developer flow like you treat p99 latency: instrument, segment, and attack the biggest tail.
- Break down lead time (DORA) into observable phases: `coding` (first commit → PR open), `PR-open→first-review`, `review→merge`, `merge→deploy`.
- Add supporting signals: `CI queue time` vs `CI duration` (separate waiting from running), `flaky test rate` (reruns / total runs), `preview env wait` (PR open → URL available), `change failure rate` and `MTTR` from incidents.
Quick-and-dirty collection beats six months of a data warehouse project. Use `gh` + `jq`, your CI API, and a spreadsheet for the first week. Then productionize.
Example: GitHub GraphQL to measure PR-open→first-review for the last 100 merged PRs:

```bash
REPO="org/repo"
GH_QUERY='query($owner:String!, $name:String!){
  repository(owner:$owner, name:$name){
    pullRequests(last:100, states:MERGED){
      nodes{ number createdAt mergedAt reviews(first:1){nodes{createdAt}} }
    }
  }
}'
# gh api graphql maps each -f field (other than query) to a GraphQL variable
gh api graphql -f query="${GH_QUERY}" -f owner="${REPO%/*}" -f name="${REPO#*/}" |
  jq -r '.data.repository.pullRequests.nodes[]
         | [.number, .createdAt, (.reviews.nodes[0].createdAt // null)]
         | @csv'
```
Feed that into a quick script to compute medians and p95s (a sketch follows). Do the same for CI with the Buildkite GraphQL API or GitHub Actions run logs. Publish a single dashboard in Grafana or Metabase with weekly deltas and owners by repo/team. No vanity metrics—if a team can’t act on it next sprint, it doesn’t go on the wall.
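For the medians/p95 step you don’t even need a script at first; `jq` can do the date math itself. A minimal sketch, reusing `REPO` and `GH_QUERY` from above and skipping PRs that never got a review:

```bash
# hours from PR open to first review; quick-and-dirty median and p95
gh api graphql -f query="${GH_QUERY}" -f owner="${REPO%/*}" -f name="${REPO#*/}" |
  jq '[.data.repository.pullRequests.nodes[]
       | select(.reviews.nodes[0] != null)
       | ((.reviews.nodes[0].createdAt | fromdateiso8601)
          - (.createdAt | fromdateiso8601)) / 3600]
      | sort
      | {median_h: .[length / 2 | floor], p95_h: .[length * 0.95 | floor]}'
```

The same `fromdateiso8601` arithmetic gives you CI queue time from GitHub Actions run metadata: `run_started_at` minus `created_at` on the workflow-runs endpoint is a serviceable queue proxy.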
Kill hand‑offs with a paved road, not a fleet of bespoke tools
Every time a dev has to remember a snowflake command or file a ticket, you’re paying a wait tax. The antidote is a paved road: opinionated defaults that 80% of services follow, with escape hatches for the rest.
What good looks like:
- Templates: Backstage Software Templates or `cookiecutter` that generate repos with `Makefile`, `Dockerfile`, `helm/`, `opa/`, `CODEOWNERS`, and CI pre-wired (a cookiecutter sketch follows this list).
- One workflow: `make bootstrap`, `make test`, `make build`, `make deploy`. No bespoke CLIs.
- GitOps: manifests in-repo; Argo CD applies them to clusters automatically per branch/env. Infra changes via Terraform + Atlantis or Spacelift.
- Policy-as-code: OPA/Conftest in CI so security/compliance is a green check, not a meeting.
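No Backstage yet? A plain cookiecutter template covers the first two bullets on day one. A minimal sketch, where `gh:org/service-template` and its variables are hypothetical:

```bash
# scaffold a repo with Makefile, Dockerfile, CODEOWNERS, helm/, opa/, and CI pre-wired
pipx install cookiecutter
cookiecutter gh:org/service-template --no-input \
  service_name=payments-router language=node
```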
I’ve seen teams shave weeks off migrations just by deleting three internal CLIs and aligning on `make` and two YAML templates.
Before/after: eliminate the top three waits
Let’s hit the usual suspects. These numbers are from real client engagements (rounded) and are repeatable if you hold the line on defaults.
1) PR review latency
- Before: median 1.8 days to first review; reviewers manually picked; no ownership.
- Moves:
  - Use `CODEOWNERS` with `path → team` mappings; avoid “everyone owns everything.” (A sketch follows this list.)
  - Auto-assign reviewers (GitHub native or Mergify).
  - Slack nudge at T+2h and T+24h via webhook; rotate the review buddy if the owner is OOO.
  - Set an SLA: first review in under 4 business hours for normal changes, under 1 hour for hotfixes.
- After: 0.4 days median; p90 under a day. Lead time down 30–45% without touching code.
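The `CODEOWNERS` mapping is the highest-leverage ten minutes in that list. A sketch with hypothetical team names; the rule is paths map to teams, never to individuals:

```
# .github/CODEOWNERS: GitHub auto-requests these teams on matching paths
/services/payments/  @org/payments-team
/services/search/    @org/search-team
/helm/               @org/platform-team
*.tf                 @org/infra-team
```

The nudge bot can be as small as a scheduled CI job that `curl`s a Slack incoming webhook with the list of PRs past SLA.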
2) CI queue + build time
- Before: 25m average queue during peak; 18m build; 12m tests. Devs context-switch, PRs pile up.
- Moves:
  - Add a dedicated autoscaling runner fleet; set `concurrency` to cancel superseded runs.
  - Enable caching/remote cache: `actions/cache`, Bazel/Nx/Gradle remote cache, Docker BuildKit with `--cache-to`/`--cache-from` (see the BuildKit sketch below).
  - Parallelize tests with a timing file; shard by package.
  - Fail fast: smoke tests before heavy integration.
- Example GitHub Actions snippet:

```yaml
name: ci
on:
  pull_request:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - uses: actions/cache@v4
        with:
          path: |
            ~/.npm
            .nx/cache
          key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
      - run: npm ci
      - run: npx nx run-many -t test --parallel=4 --ci --shard=${{ matrix.shard }}/4
```
- After: queue 0–5m; build 8–10m; tests 6–8m. PR cycle shrinks by ~30–45 minutes.
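When the build step dominates, registry-backed BuildKit caching is usually the cheapest win. A sketch, assuming you can push a `buildcache` tag to your registry:

```bash
# reuse layers across ephemeral runners via a registry-backed cache tag
docker buildx build \
  --cache-from type=registry,ref=ghcr.io/org/app:buildcache \
  --cache-to type=registry,ref=ghcr.io/org/app:buildcache,mode=max \
  -t "ghcr.io/org/app:$(git rev-parse --short HEAD)" \
  --push .
```

`mode=max` also caches intermediate stages, which matters for multi-stage Dockerfiles.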
3) Environment provisioning / preview URLs
- Before: ticket to SRE to create staging namespace; once a day release window; QA blocked.
- Moves:
  - Namespace-per-PR with Argo CD and Kustomize (an ApplicationSet sketch follows this section), or use managed previews (Vercel, Render, Qovery) for web apps.
  - Generate a per-PR `values.yaml` with image tag `pr-<number>`; auto-destroy on merge/close.
- Minimal Kustomize overlay:

```yaml
# k8s/overlays/pr/kustomization.yaml
# CI substitutes PR_NUMBER (e.g. with sed/envsubst) before `kustomize build`
resources:
  - ../../base
patches:
  - target: { kind: Deployment, name: web }
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/image
        value: ghcr.io/org/app:pr-$(PR_NUMBER)
```
- After: preview URL in 3–7 minutes after PR; QA self-serves; product reviews async. Hand‑offs evaporate.
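If you run Argo CD, the ApplicationSet pull-request generator handles both creation and teardown: when the PR closes, the generated Application disappears and its resources are pruned. A minimal sketch; owner, repo, and the token secret are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: app-previews
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: app
          tokenRef: { secretName: github-token, key: token }
        requeueAfterSeconds: 120 # how often to poll for opened/closed PRs
  template:
    metadata:
      name: 'app-pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/app.git
        targetRevision: '{{head_sha}}'
        path: k8s/overlays/pr
      destination:
        server: https://kubernetes.default.svc
        namespace: 'app-pr-{{number}}'
      syncPolicy:
        automated: { prune: true }
        syncOptions: [CreateNamespace=true]
```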
Favor simplification over bespoke tooling
The quickest way to reintroduce friction is to “solve” it with a new custom tool. Fewer levers, clearer docs, more repetition.
- Standard Make targets everyone can memorize:

```make
.PHONY: bootstrap test build deploy
SHA := $(shell git rev-parse --short HEAD)

bootstrap:
	asdf install || true
	npm ci

build:
	docker buildx build --target app --build-arg SHA=$(SHA) -t ghcr.io/org/app:$(SHA) .

test:
	npx nx run-many -t test --ci

deploy:
	argocd app sync app-$(ENV)
```
- One manifest format per layer: `helm` for apps, `Terraform` for infra. Don’t mix Pulumi/CDKTF/custom wrappers unless you’re staffed to maintain them.
- Backstage or die-by-a-thousand-confluences: one catalog, one software template per language. Close the blank-page problem.
- OPA rules that read like guardrails, not riddles. Example Conftest policy to block `:latest` images (the one-liner to run it in CI follows this list):

```rego
package ci

# fail any Deployment that ships a mutable :latest tag
violation[msg] {
  input.kind == "Deployment"
  c := input.spec.template.spec.containers[_]
  endswith(c.image, ":latest")
  msg := sprintf("Disallow :latest for %s", [c.name])
}
```
- Sunset the old path. Leave an escape hatch via an RFC process, but make the paved road the default in templates and docs. If it’s optional, it’s dead.
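Wiring the policy above into CI is one pipeline line; the paths here are assumptions from the template layout:

```bash
# fail the build if any rendered manifest violates a policy in opa/ (package ci)
kustomize build k8s/overlays/pr | conftest test --policy opa/ --namespace ci -
```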
A 90‑day rollout that actually sticks
Don’t boil the ocean. Pick two waits, fix them hard, and publish the gains.
- Weeks 0–2: Instrument and baseline
- Capture DORA lead time segmented by phase for top 10 repos.
- Add CI queue/duration, flaky rate, preview env wait. Publish a simple Grafana board.
- Weeks 3–6: Ship paved-road v1
  - Backstage template with `Makefile`, CI, `CODEOWNERS`, `helm/`, and OPA policies.
  - Enable `concurrency`, caching, and test sharding in CI. Add Slack nudges for review SLAs.
- Weeks 7–10: Roll out preview environments
- Namespace-per-PR via Argo CD/Kustomize or managed previews. Auto-destroy on merge.
- Weeks 11–12: Enforce and retire
- Make the template the default. Freeze new repos unless created via template.
- Deprecate old CI configs; add a migration playbook. Celebrate with before/after graphs.
Success looks like: lead time down 30–50%, PR-first-review < 4h, CI under 15m median, preview URLs in < 10m. If you don’t see it, you didn’t remove enough choice or you left the bespoke path open.
Dashboards that matter (and the ones that don’t)
Useful KPIs:
- Lead time segmented into the four phases (show medians and p90s).
- PR-first-review SLA attainment by team (stack-ranked).
- CI queue vs run time over the day (spot capacity holes).
- Flaky test rate by suite (action: quarantine or fix).
- Preview env availability time and failure rate.
- Change failure rate and MTTR tied to deploys per day.
Bad KPIs: lines of code, number of PRs, story points. Optimize flow, not theater.
Keep ownership obvious: each graph has an on-call Slack channel and a doc link to “how to fix it.” If a metric has no owner, either delete it or assign one.
Lessons learned the hard way
- If everything is a special case, nothing ships. Mandate templates for new services. Your exceptions policy should be a page, not a book.
- Be ruthless about old paths. We’ve cut org lead time in half just by deleting the second deploy button.
- Policy must be paired with examples. OPA without human-readable remediation is just another gate.
- Preview envs beat staging. Most bugs are spotted in-context by PMs/QA reviewing a URL linked in the PR.
- Buy where undifferentiated. Autoscaling runners, preview environments, secret managers—don’t DIY unless it’s your core business.
GitPlumbers has done this at fintechs under SOC2, adtech shops with 1K+ services, and seed-stage teams drowning in early tooling debt. We’ll instrument, cut your top waits, and leave you with a paved road your teams actually use.
Key takeaways
- Developer friction is mostly wait states and hand‑offs, not keystrokes. Measure it like SREs measure latency.
- Break lead time into phases (coding, PR-open→first-review, review→merge, merge→deploy) and attack the biggest waits first.
- Favor paved-road defaults (templates, Make targets, GitOps) over custom CLIs and snowflake pipelines.
- Three high-ROI fixes: enforce PR review SLAs, speed CI with caching/parallelism, auto-provision preview envs per PR.
- Use policy-as-code to shift security/compliance left without adding more gates or human hand‑offs.
- Do a 90-day rollout: instrument, pick top two waits, ship golden paths, enforce by default, retire the bespoke path.
Implementation checklist
- Capture DORA lead time and segment wait states per repo/service.
- Instrument CI queue time, build duration, and flaky test rate.
- Set PR review SLAs; auto-assign reviewers via CODEOWNERS; send nudges, not nagging.
- Enable CI caching/remote cache; parallelize tests; fail-fast with build matrices.
- Provision per-PR preview environments via Argo CD/Kustomize or a managed preview service.
- Standardize Make targets (bootstrap, test, build, deploy) and ship Backstage templates.
- Shift-left security: Trivy/Snyk in CI; OPA/Conftest for policy gates with clear messages.
- Publish a single dashboard in Grafana/Looker with weekly deltas and ownership.
Questions we hear from teams
- What should we tackle first: PR latency, CI, or environments?
- Measure first, but 8 out of 10 times PR-first-review latency is the biggest lever because it compounds everything else. Fix ownership with CODEOWNERS and SLAs, then cut CI queue time. Preview environments usually land next.
- How do we avoid metric theater and surveillance vibes?
- Only track metrics teams can act on in the next sprint, publish them publicly, and pair every graph with a documented action. No individual leaderboards. Focus on system waits and hand-offs, not keystrokes or time-at-keyboard.
- Won’t paved-road defaults slow down our experts?
- Done right, the paved road includes escape hatches via RFCs and overrides. The trick is making the default path so good that most teams choose it voluntarily. Keep the surface small and the docs excellent.
- Do we need Backstage to do this?
- No, but it helps. You can start with a Git template repo and a Makefile standard. Backstage becomes valuable when you want a catalog and self-service templates in one place.
- How do we bring security/compliance along?
- Shift-left with OPA/Conftest and scanners (Trivy/Snyk) in CI, paired with clear messages and remediation. Map policies to tickets and audits. When security sees fewer production surprises, they become your champions.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.