Stop Timing Standups. Start Timing Waits: Measuring Friction and Killing Hand‑Offs with a Paved Road
If your PRs age like cheese, you don’t need another platform — you need a stopwatch, a paved road, and the courage to kill bespoke snowflakes.
The Wednesday PR That Died in the Queue
I watched a team at a unicorn fintech push a two-line config change at 10:12 a.m. on a Wednesday. It merged Friday afternoon. Nothing was complex. The code was fine. The problem? Waits.
- CI sat in a Jenkins queue behind a nightly job that never ended.
- Environments required a ticket to a gatekeeper who provisioned a shared namespace “when they had time.”
- Reviews bounced between teams because nobody owned the folder.
I’ve seen this movie a hundred times. You don’t fix it with a developer portal or a motivational poster. You fix it by measuring the friction and killing the top waits with a paved road.
Measure Friction Like an SRE: Instrument the Wait States
Feelings are valid; metrics get budget. Start with a simple, shared definition of flow metrics:
- PR cycle time: `merged_at - created_at`.
- Time to first review: `first_review_at - created_at`.
- CI duration and queue time: sum of job runtimes and time spent waiting for runners.
- Environment lead time: from PR creation to a deployable preview env.
- Flake rate: % of CI runs that fail, then pass without code changes.
Pull this from GitHub/GitLab and your CI. Don’t over-engineer — a bash script and jq gets you 80%.
```bash
# Quick-and-dirty PR cycle time (hours) for the last 200 merged PRs
gh pr list --state merged --limit 200 \
  --json number,createdAt,mergedAt,reviews \
  | jq -r '.[] | [.number, .createdAt, .mergedAt,
      ((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600] | @csv'
```

Expose CI timing to Prometheus and graph 50th/95th percentiles in Grafana. If your CI can’t export metrics, that’s a smell.
```text
# HELP ci_stage_duration_seconds Duration of CI stages
# TYPE ci_stage_duration_seconds gauge
ci_stage_duration_seconds{stage="build"} 540
ci_stage_duration_seconds{stage="test"} 1200
ci_stage_duration_seconds{stage="publish"} 180
```

Set explicit SLOs for your platform:
- p50 CI < 10m, p95 CI < 20m
- First review < 2h during working hours
- Preview env ready < 10m for PRs
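To make those SLOs bite, alert on them. A minimal Prometheus rule sketch, assuming you export stage durations as a histogram (`ci_stage_duration_seconds_bucket`) rather than the gauge snapshot above; the group and alert names are illustrative:

```yaml
# Hypothetical alert: fire when CI p95 has been over the 20m SLO for 30 minutes.
# Assumes ci_stage_duration_seconds is exported as a histogram (_bucket series).
groups:
  - name: platform-slos
    rules:
      - alert: CiP95OverBudget
        expr: |
          histogram_quantile(0.95,
            sum(rate(ci_stage_duration_seconds_bucket[1h])) by (le)
          ) > 1200
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "CI p95 over the 20m SLO for 30m"
```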
Publish a weekly scorecard. You’ll get alignment without a single OKR meeting.
The Usual Suspects: Where the Time Actually Goes
Across GitPlumbers clients, the same culprits eat 60–80% of developer time:
- Long builds/test suites: un-cached dependencies, no Docker layer caching (see the Dockerfile sketch below), integration tests running on every commit.
- Queueing for runners: shared Jenkins with overloaded agents or under-provisioned GitHub-hosted runners.
- Serial approvals: change-advisory-board theater, or `CODEOWNERS` misconfig that pings the wrong team.
- Environment bottlenecks: shared QA clusters, manual Terraform apply, or a single ArgoCD app for everything.
- Flaky tests: retry culture masking problems and burning hours.
- Bespoke toolchains: every repo a snowflake tool stack; platform team can’t support any of them well.
When we baseline, we often see: median PR cycle ~ 2.5 days, time-to-first-review ~ 6h, CI p95 ~ 45m, flake rate 5–10%, preview env lead time 1–2 days (or none). That’s the hill you need to flatten.
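Fixing the first culprit starts in the Dockerfile. A minimal sketch for a Node service using BuildKit cache mounts (base image and paths are assumptions; pair it with the GHA cache config in the workflow later in this post):

```dockerfile
# syntax=docker/dockerfile:1
FROM node:20-slim AS build
WORKDIR /app

# Copy lockfiles alone so the dependency layer caches across code-only changes
COPY package.json package-lock.json ./

# BuildKit cache mount keeps the npm download cache warm between builds
RUN --mount=type=cache,target=/root/.npm npm ci

# Source changes invalidate only the layers below this line
COPY . .
RUN npm run build
```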
Build the Paved Road, Not a Tool Zoo
Pick defaults and make them great. Make the paved road the fastest path; let teams opt out with evidence.
- CI: GitHub Actions (or GitLab CI) with reusable workflows and aggressive caching.
- Builds: language-native caches plus Docker BuildKit with GHA cache backend.
- Preview envs: Vercel/Netlify for SPA/Next.js; `ApplicationSet` + namespace-per-PR for Kubernetes services.
- GitOps: ArgoCD with app-per-service; no giant monolithic app.
- Policy: OPA/Conftest in CI and protected branches with `CODEOWNERS`.
Here’s a minimal paved-road ci.yml that cuts Node+Docker CI from 30m to under 10m on most stacks:
```yaml
name: ci
on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build-test-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm test -- --ci
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: ./Dockerfile
          push: false
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

For Kubernetes preview environments, stop ticketing. Generate them on PRs with an ArgoCD ApplicationSet:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-envs
spec:
  generators:
    - pullRequest:
        github:
          owner: your-org
          repo: your-repo
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 60
  template:
    metadata:
      name: 'pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/your-org/your-repo.git
        targetRevision: '{{head_sha}}'
        path: deploy/overlays/preview
        kustomize:
          namePrefix: 'pr-{{number}}-'
          images:
            - 'your-img:sha-{{head_sha}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

This buys you consistent, 5–10 minute envs without humans in the loop. For frontends, Vercel/Netlify does this out of the box — use it unless you truly need K8s.
Kill Hand-Offs with Policy-as-Code and Fast Approvals
The fastest review is the one routed to the right humans with clear rules. Use CODEOWNERS and a bot to enforce SLOs.
```text
# CODEOWNERS
/services/payments/    @payments-team
/infrastructure/       @platform-team
/terraform/modules/    @platform-team @secops
```

- Set branch protection: 1–2 required reviews max, status checks required, and auto-merge on green.
- Use OPA/Conftest to block risky changes; reserve human reviews for material risk.
- Instrument a Slack reminder bot: “PR #123 waiting 2h for first review.”
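A minimal sketch of that reminder bot, assuming an authenticated `gh` CLI and a `SLACK_WEBHOOK_URL` environment variable pointing at an incoming webhook; the 2h threshold matches the review SLO:

```bash
#!/usr/bin/env bash
# Hypothetical reminder bot: nag Slack about PRs past the 2h first-review SLO.
set -euo pipefail

now=$(date +%s)

gh pr list --state open --json number,title,createdAt,reviews \
  | jq -r --argjson now "$now" '
      .[]
      | select((.reviews | length) == 0)                        # no review yet
      | select(($now - (.createdAt | fromdateiso8601)) > 7200)  # older than 2h
      | "PR #\(.number) waiting \(($now - (.createdAt | fromdateiso8601)) / 3600 | floor)h for first review: \(.title)"' \
  | while IFS= read -r line; do
      # jq builds the JSON payload so titles containing quotes do not break it
      payload=$(jq -n --arg text "$line" '{text: $text}')
      curl -fsS -X POST -H 'Content-type: application/json' \
        --data "$payload" "$SLACK_WEBHOOK_URL"
    done
```

Run it on a 30-minute schedule during working hours so the nag lands before the SLO breaches, not after.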
This combo cuts “review pinball” without compromising risk. I’ve seen PCI shops ship daily with this setup — the trick is moving policy to code and making the paved road audit-friendly.
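Here is what moving policy to code looks like with Conftest: a minimal Rego sketch that blocks privileged containers so humans only review material risk (the package layout and rule are illustrative, not a full policy suite):

```rego
# policy/deploy.rego: hypothetical Conftest policy, run as `conftest test deploy/`
package main

deny[msg] {
  input.kind == "Deployment"
  some i
  container := input.spec.template.spec.containers[i]
  container.securityContext.privileged
  msg := sprintf("container %q must not run privileged", [container.name])
}
```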
Case Study: From Jenkins Sprawl to 8-Hour Cycle Time
At a payments client, we inherited:
- Jenkins with 120 freestyle jobs, 25m median build, p95 70m
- No preview environments; QA shared cluster with a weekly reset
- Two approvals on every PR, no `CODEOWNERS`
- Flake rate 7–9% across services
We shipped a paved road in 6 weeks:
- Baseline metrics in Grafana; published SLOs.
- Migrated the top 10 repos to a reusable Actions workflow with Node and Docker caching.
- Introduced ArgoCD `ApplicationSet` preview envs; teardown on merge.
- Added `CODEOWNERS`, review auto-assignment, and a 2h first-response SLO.
- Quarantined flaky tests and added a “flake sheriff” rotation for a month.
Results after 60 days:
- PR cycle time: 3.2 days → 8 hours median
- CI p95: 70m → 18m (median 9m)
- Flake rate: 8% → 0.8%
- Preview env lead time: N/A → 7m median
- Saved ~1.1 engineer-weeks per squad per sprint. Finance noticed before engineering did.
A 90-Day Plan You Can Actually Execute
Weeks 1–2: Baseline and align
- Collect 90 days of PR/CI data. Publish current p50/p95 for CI, cycle, review, env lead time.
- Agree on 3 SLOs and make them public.
Weeks 3–6: Ship the paved road
- Build one reusable CI workflow per language stack with caches and parallelization (caller sketch after this list).
- Stand up preview envs for one service path (frontend on Vercel; K8s via `ApplicationSet`).
- Add `CODEOWNERS`, required checks, and auto-merge on green.
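Once that workflow exists, each repo’s CI file shrinks to a caller. A minimal sketch, assuming a paved-road repo at `your-org/paved-road` exposing a `node-ci.yml` reusable workflow (both names are hypothetical):

```yaml
name: ci
on:
  pull_request:
jobs:
  paved-road:
    # Reuse the centrally maintained workflow instead of copy-pasting pipelines
    uses: your-org/paved-road/.github/workflows/node-ci.yml@v1
    with:
      node-version: "20"
    secrets: inherit
```

The reusable workflow itself declares `on: workflow_call` with typed inputs, so a cache fix or action bump there propagates to every caller on its next run.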
Weeks 7–10: Eliminate the top waits
- Tackle the slowest stage (often Docker build). Add BuildKit cache; split unit vs integration and shard what remains (see the sketch after this list).
- Add a small runner pool to kill CI queue time; autoscale if you can.
- Quarantine flaky tests and badge repos with flake rate until it’s <1%.
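A sketch of a sharded test job for the paved-road workflow, assuming a Jest suite (Jest 28+ ships `--shard`; the four-way split is arbitrary):

```yaml
test:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      shard: [1, 2, 3, 4]  # four parallel runners instead of one serial suite
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with: { node-version: 20, cache: npm }
    - run: npm ci
    - run: npx jest --ci --shard=${{ matrix.shard }}/4
```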
Weeks 11–13: Lock in and expand
- Roll paved road to the next 10 repos; delete bespoke pipelines.
- Add a Slack reminder bot for review SLOs.
- Publish a weekly platform scorecard; celebrate teams that hit SLOs.
Trade-Offs and Guardrails (So You Don’t Build a Platform for Its Own Sake)
I’ve seen Backstage rollouts that cost a quarter and moved zero metrics. Choose boring tech and measure impact.
- Bespoke vs paved: Only allow snowflakes with a written, measurable reason (e.g., special compliance or latency). Time-box re-evaluation.
- Monorepo vs polyrepo: The repo topology doesn’t matter if your CI and ownership are clear. Don’t reorg to avoid doing the hard CI/env work.
- Preview env sprawl: Enforce TTL and quotas (a quota sketch follows this list). Namespace-per-PR with network policies is safer than a shared “dev” cluster.
- Policy: Keep the policy code short and versioned. If reviewers don’t know why a check failed, you created a new wait state.
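For the sprawl point above, a minimal ResourceQuota sketch to drop into each PR namespace; the limits are illustrative starting points, not recommendations:

```yaml
# Hypothetical per-PR namespace quota: keeps preview sprawl from eating the cluster
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    pods: "10"
```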
You don’t need a developer portal. You need a stopwatch.
Ship the paved road, publish the SLOs, and remove what doesn’t earn its keep. That’s how you buy back developer time in weeks, not quarters.
Key takeaways
- Measure the waits, not the feelings: track PR cycle time, review latency, CI queue time, and environment provisioning time.
- Fix 3 waits first: CI/build time, review hand-off latency, and ephemeral env creation. They usually account for 60–80% of friction.
- Favor paved-road defaults with reusable workflows, caches, and preview environments. Kill bespoke snowflake tooling.
- Publish explicit SLOs for the platform (e.g., first review in <2h; CI median <10m) and instrument them.
- Automate approvals with policy-as-code and `CODEOWNERS` to cut review hand-offs without compromising risk.
Implementation checklist
- Baseline the top 5 wait states: PR cycle time, time-to-first-review, CI duration/queue, flaky rate, env create time.
- Create a paved-road CI template with language/runtime caches and Docker BuildKit cache.
- Add preview environments per PR via ArgoCD ApplicationSet (or Vercel/Fly.io for simple stacks).
- Set `CODEOWNERS`, auto-assign reviewers, and define review SLOs; enforce with a bot.
- Quarantine flaky tests and fail fast; badge repos with flake rate until it’s <1%.
- Publish a platform scorecard weekly; fix the longest wait first each sprint.
- Delete or sunset bespoke tools that duplicate the paved road — measure the reclaimed time.
Questions we hear from teams
- Do we need Backstage to improve DevEx?
- Not to start. Backstage can be useful as an index over your paved road, but it won’t cut wait times by itself. Ship fast CI, preview envs, and policy-as-code first; add a portal after the metrics move.
- Can we do this in a regulated (PCI/SOC2/HIPAA) environment?
- Yes. Put policy in code (OPA), require reviews where risk is real (infra, secrets, PII), and make every change auditable via GitOps. Automated checks usually strengthen compliance and shorten audits.
- Do we need a monorepo to get these gains?
- No. The biggest gains come from CI caching, preview envs, and clear ownership. Monorepos can help with dependency control, but they are neither necessary nor sufficient for speed.
- What if our tests are the bottleneck?
- Split unit vs integration, parallelize, and cache. Quarantine flakes and fail fast. Tools like `pytest-xdist`, `jest --runInBand` vs `--maxWorkers`, and test sharding in CI cut time without rewriting tests.
- How do we keep teams from ignoring the paved road?
- Make it the fastest option. Measure adoption, deprecate snowflakes, and require a written exception with SLO impact. Celebrate teams that hit the SLOs on the paved road.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
