Stop Making Everyone an SRE: The Paved Road That Turned 90% of Infra Tickets Into Pull Requests
Your engineers don’t want to babysit IAM roles or Helm values. Give them a paved road with sane defaults, escape hatches, and guardrails—then watch cycle time drop and infra tickets disappear.
Platforms win when they are boring, fast, and safe. Shiny is optional; paved is not.
The incident that made us stop training product engineers as SREs
A Tuesday 9:14 AM page. Half the site throwing 502s. Someone “just flipped a flag” and accidentally rolled a new NLB with a default SG that blocked health checks. The engineer followed our wiki. The wiki was wrong for EKS 1.26. I’ve seen this movie at three different companies: great people, heroic intent, and a platform that expects everyone to understand IAM, VPCs, Helm, and Argo app-of-apps. That’s not a platform—that’s a scavenger hunt.
We fixed it the boring way: we stopped making everyone an SRE. We built a paved road with opinionated defaults, a tiny CLI, and GitOps. Infra ticket volume dropped 90% in a quarter. The infra team stopped being a concierge desk and started shipping road improvements.
The rule: abstract infra, default everything, document the escape hatch
Three constraints that have actually worked across orgs from 20 to 2,000 engineers:
- One way to do the common thing. New service? There’s a single `create-service` path. Deploy? There’s one reusable workflow.
- Defaults over options. 80% of decisions are pre-decided: runtime, base image, observability, SLO template, service account, network policy.
- Guardrails, not gates. Policies run in CI with clear messages. If you need to go off-road, it’s a PR + RFC, not a Slack DM.
The paved road is just a small interface over your stack: Backstage templates or cookiecutter for repos, Terraform/OpenTofu modules for infra, GitHub Actions reusable workflows, and ArgoCD for GitOps. Keep it thin; the goal is fewer knobs, not a new bespoke platform to maintain.
Paved road blueprint: one CLI, one template, one pipeline
Here’s the model we’ve rolled out at multiple clients:
- CLI: `gp` (GitPlumbers) wraps common tasks. It shells out to real tools (`terraform`, `gh`, `kubectl`) but hides flags and enforces conventions.
- Template: Backstage software template (or `cookiecutter`) that creates repo + service with pre-wired observability, SLOs, and delivery.
- Pipeline: a single reusable workflow for build/test/deploy that 90% of services can call.
- GitOps: ArgoCD deploys from a central `env/` repo. No direct `kubectl` in CI.
- Policy-as-code: OPA/Conftest checks for things like public S3, missing budgets, or broad IAM.
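Policy-as-code sounds abstract until you see the invariant. Real Conftest policies are written in Rego; purely as an illustration of what they enforce, here is the same kind of check sketched in TypeScript over resources pulled from `terraform show -json` (the record shape is simplified, not the full plan schema):

```typescript
// policyCheck.ts — illustrative sketch only; production checks live in Rego/Conftest.
type PlanResource = {
  type: string;
  values?: { acl?: string; policy?: string };
};

// Flag S3 buckets with a public ACL and IAM policies granting wildcard actions.
export function violations(resources: PlanResource[]): string[] {
  const out: string[] = [];
  for (const r of resources) {
    if (
      r.type === 'aws_s3_bucket' &&
      ['public-read', 'public-read-write'].includes(r.values?.acl ?? '')
    ) {
      out.push(`public S3 bucket ACL: ${r.values?.acl}`);
    }
    if (r.type === 'aws_iam_policy' && (r.values?.policy ?? '').includes('"Action":"*"')) {
      out.push('IAM policy allows all actions');
    }
  }
  return out;
}
```

The point is that the rule lives in code with a clear failure message, not in a reviewer’s head.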
A thin CLI example (TypeScript via tsx):

```typescript
// tools/gp/src/commands/createService.ts
import { execSync } from 'node:child_process';
import fs from 'node:fs';

export async function createService(opts: { name: string; runtime: 'node' | 'python' | 'go' }) {
  const { name, runtime } = opts;
  // 1) Scaffold from Backstage template
  execSync(`npx @backstage/create-app --no-private-registries --scope ${name}`);
  // 2) Register in catalog and create baseline SLO
  execSync(`gh repo create org/${name} --private --source ./${name} --push`);
  execSync(`gh api repos/org/${name}/dispatches -f event_type=create-slo`);
  // 3) Provision minimal infra via Terraform module
  fs.writeFileSync(`./${name}/infra/main.tf`, `
module "service" {
  source       = "git::ssh://git@github.com/org/tf-mod-service.git//base?ref=v3.2.0"
  service_name = "${name}"
  runtime      = "${runtime}"
}
`);
  execSync(`(cd ./${name}/infra && terraform init && terraform apply -auto-approve)`, { stdio: 'inherit' });
}
```

GitHub Actions reusable workflow everyone calls:
```yaml
# .github/workflows/reuse-build-deploy.yaml in platform-infra repo
name: build-test-deploy
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string
      env:
        required: true
        type: string
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test && npm run build
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/org/${{ inputs.service }}:${{ github.sha }}
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create PR to env repo
        run: |
          gh repo clone org/env
          cd env
          yq -i \
            '.spec.source.helm.values.image.tag = "${{ github.sha }}"' \
            apps/${{ inputs.service }}/${{ inputs.env }}/values.yaml
          git checkout -b bump-${{ inputs.service }}-${{ github.sha }}
          git commit -am "bump ${{ inputs.service }} -> ${{ github.sha }}"
          git push origin HEAD
          gh pr create --fill
```

Before/After: standing up a new service
Before (what I actually found in a fintech last year):
- Wiki with 22 steps across four pages.
- Copy a service repo, grep/replace names.
- Open three Jira tickets for DNS, IAM, and monitoring.
- Ping a platform engineer to fix a `ServiceAccount` and a Helm `values.yaml` you don’t understand.
- Wait 3–5 days.
After (paved road):
```bash
# 7 minutes, one command, one PR
npx gp create service --name billing-api --runtime node
```

What the CLI generates:
- `billing-api/` repo with `Dockerfile`, `helm/`, `observability/` (Grafana dashboard + Prometheus alerts), `slo.yaml` (99.9% latency budget), and a `catalog-info.yaml` for Backstage.
- `infra/` folder that pins Terraform modules to platform-approved versions (no surprise upgrades).
- GitHub Actions workflow that calls the reusable `build-test-deploy`.
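That generated caller workflow stays tiny because all the logic lives in the reusable one. A sketch of what it might look like (the `org/platform-infra` path and ref are placeholders for your platform repo):

```yaml
# billing-api/.github/workflows/deploy.yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  release:
    # all build/test/deploy logic lives in the platform repo
    uses: org/platform-infra/.github/workflows/reuse-build-deploy.yaml@main
    with:
      service: billing-api
      env: prod
```

Upgrading every service’s pipeline is then one change in `platform-infra`, not a hundred PRs.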
Backstage template snippet that encodes default choices so no one bikesheds:
```yaml
# templates/service-template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-rest-service
  title: Node REST Service (paved road)
spec:
  owner: platform
  parameters:
    - title: Service name
      required:
        - name
      properties:
        name:
          type: string
          pattern: '^[a-z][a-z0-9-]+$'
  steps:
    - id: fetch
      name: Fetch base template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          runtime: node
          tracing: enabled
          metrics: enabled
          featureFlags: launchdarkly
```

Concrete result: time-to-first-PR went from ~3 days to <1 hour; new-service lead time dropped by 80%.
Before/After: shipping to prod
Before:
- CI runs `kubectl apply` with hand-crafted K8s manifests per service.
- Rollbacks require another manual `kubectl` or digging for the last working YAML.
- Drift everywhere: what’s in the cluster doesn’t match what’s in Git.
After (GitOps): CI only opens a PR in the `env/` repo; ArgoCD does the rest. You can read prod state in Git and roll back with `git revert`.
ArgoCD Application per service:
```yaml
# env/apps/billing-api/prod/app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: billing-api-prod
spec:
  project: default
  source:
    repoURL: https://github.com/org/billing-api
    targetRevision: HEAD
    path: helm
    helm:
      valueFiles:
        - values/prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: billing
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Minimal Helm values developers touch (sane defaults baked into the chart):
```yaml
# env/apps/billing-api/prod/values.yaml
image:
  repository: ghcr.io/org/billing-api
  tag: 7b3c1f6
resources:
  requests: { cpu: "200m", memory: "256Mi" }
  limits: { cpu: "500m", memory: "512Mi" }
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
ingress:
  enabled: true
  host: billing.prod.example.com
```

Now rollbacks look like:

```bash
git revert <merge_commit_sha> && git push
# ArgoCD syncs back to the last good image.
```

MTTR went from “hope someone remembers the kubectl incantation” to “merge the revert.”
The costs you actually pay (and why it’s worth it)
Be honest about trade-offs:
- You will say no to creative snowflake infra. That’s the point. If someone needs `NATS` instead of your default `Kafka` or `SQS`, make it an RFC with SLOs and an owning team.
- You’re maintaining a platform product. Release notes, versioned modules, deprecation policy. Boring, necessary.
- Abstraction leaks. Once a quarter, someone will need to edit a SecurityGroup or a PodDisruptionBudget. Document the escape hatch:

```bash
# Off-road with eyes open
gp open escape-hatch --service billing-api --policy p99-latency --rfc 123
```

Benefits that show up in the CFO’s spreadsheet:
- Infra tickets/engineer/month drop. We’ve seen 7.4 -> 0.6 in one quarter.
- Change failure rate falls because you deploy the same way every time.
- Onboarding time shrinks: a new engineer ships on day one rather than in week two.
- Cloud bill variance stabilizes because you eliminate the weird one-off stacks.
30–60 day rollout plan (no yak shaving)
- Pick your runtime(s) and IaC. E.g., `Node 20 + Go 1.22` and `OpenTofu 1.7`. Freeze versions.
- Baseline modules. Create `tf-mod-service` and `tf-mod-database` with pinned providers and sane defaults (budget alarms, encryption, tags).
- Golden-path template. Backstage or `cookiecutter` for one runtime. Wire in logging (OpenTelemetry), metrics (Prometheus), tracing (Jaeger), and an SLO scaffold.
- Reusable workflow. Publish `build-test-deploy` in a platform repo. Block `kubectl` from CI runners.
- GitOps. Stand up ArgoCD, create the `env/` repo with `app-of-apps` or per-service apps.
- Policy-as-code. Add Conftest checks to the reusable workflow:

```bash
conftest test infra/ --policy policies/ && conftest test helm/ --policy policies/
```

- CLI wrapper. Even a Bash MVP is fine if it encodes conventions:

```bash
#!/usr/bin/env bash
set -euo pipefail
cmd=$1; shift
case "$cmd" in
  create-service) npx @backstage/create-app "$@" ;;
  tf) (cd infra && tofu "$@") ;;
  deploy) gh workflow run reuse-build-deploy.yaml -f service=$(basename "$PWD") -f env=${1:-dev} ;;
  *) echo "unknown command"; exit 1 ;;
esac
```

- Docs and office hours. 2-page quickstart, weekly “pave requests” triage, and a public backlog.
What to measure and when to iterate
Track platform like a product. Dashboards we install day one:
- DORA metrics: lead time for changes, deployment frequency, change failure rate, MTTR.
- Platform SLOs: template success rate, pipeline success rate, Argo sync lag.
- Support load: infra tickets per engineer per month, median time-to-first-response.
- Adoption: % services on the golden path, % deploys via reusable workflow, % drift-free apps in Argo.
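None of these metrics require a vendor; they fall out of deploy events you already have. A minimal sketch (the event shape and field names are assumptions, not a standard schema):

```typescript
// platform/metrics.ts — sketch of computing three dashboard numbers from deploy events.
type Deploy = {
  service: string;
  mergedAt: number;   // commit merged to main (ms epoch)
  deployedAt: number; // live in prod (ms epoch)
  failed: boolean;    // caused an incident or rollback
  pavedRoad: boolean; // went through the reusable workflow
};

// DORA: change failure rate = failed deploys / total deploys.
export function changeFailureRate(deploys: Deploy[]): number {
  return deploys.length === 0 ? 0 : deploys.filter(d => d.failed).length / deploys.length;
}

// DORA: lead time for changes, here the median merge-to-prod gap in hours.
export function medianLeadTimeHours(deploys: Deploy[]): number {
  const hrs = deploys
    .map(d => (d.deployedAt - d.mergedAt) / 3_600_000)
    .sort((a, b) => a - b);
  return hrs.length === 0 ? 0 : hrs[Math.floor(hrs.length / 2)];
}

// Adoption: fraction of deploys going through the golden path.
export function pavedRoadAdoption(deploys: Deploy[]): number {
  return deploys.length === 0 ? 0 : deploys.filter(d => d.pavedRoad).length / deploys.length;
}
```

Start with a cron that appends one record per deploy; graph the three numbers weekly.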
If adoption stalls, the usual culprits:
- The template doesn’t match reality (missing `grpc`, no cronjob support).
- The CLI makes off-road too hard or too easy.
- Hidden toil in the pipeline (Docker layer cache not working; 12-minute builds). Fix it or people will bypass you.
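On that last culprit: if your builds use `docker/build-push-action`, its GitHub Actions cache backend is often the cheapest fix. A sketch of the step in the reusable workflow (tune `mode` to your image structure):

```yaml
- uses: docker/build-push-action@v6
  with:
    push: true
    tags: ghcr.io/org/${{ inputs.service }}:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

A 12-minute build that drops to 3 does more for adoption than any launch email.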
Platforms win when they are boring, fast, and safe. Shiny is optional; paved is not.
Key takeaways
- Abstract infrastructure behind a small, boring interface that defaults 80% of decisions.
- Standardize on one CLI, one template, and one delivery pipeline; make escape hatches explicit.
- Move infra support from bespoke tickets to PRs by codifying paved-road modules and reusable workflows.
- Measure success with MTTR, lead time for changes, and infra ticket volume per engineer.
- Adopt GitOps with ArgoCD to reduce drift and make rollbacks predictable.
Implementation checklist
- Pick one IaC tool and version it (Terraform or OpenTofu).
- Create a single golden-path template (Backstage or cookiecutter) per runtime.
- Ship a thin internal CLI wrapper for paved-road operations.
- Adopt GitOps (ArgoCD/Flux) and one reusable CI/CD workflow.
- Codify policy-as-code with OPA/Conftest or Sentinel.
- Document escape hatches and an RFC process for going off-road.
- Instrument the platform: deploy dashboards for DORA + SLOs.
Questions we hear from teams
- How do we avoid building a bespoke platform that becomes legacy?
- Keep your layer thin and boring. Reuse open tools (Backstage, ArgoCD, Terraform/OpenTofu, GitHub Actions) and avoid custom controllers until the use case hurts repeatedly. Version everything, publish deprecation notices, and keep escape hatches documented. If your CLI is 10,000 lines, you’re re-implementing the cloud—stop.
- What if a team truly needs to go off the paved road?
- Make it a formal RFC with success criteria and SLO/ownership. Provide extension points: custom Helm chart under a `chart/experimental/` path, terraform `extra_*` variables, and a `policy-exception.yaml` reviewed by platform + security. Time-box exceptions and revisit quarterly.
- Can we do this without Kubernetes?
- Yes. The same pattern works with ECS, Nomad, or even serverless. Swap ArgoCD with CodeDeploy or Spinnaker; use Terraform modules and a reusable CI workflow. The key is defaults + GitOps-style desired state.
- How do we roll out without blocking current delivery?
- Start with one team and one runtime. Migrate net-new services first. For existing services, offer a migration path with tooling (`gp migrate service`) and support office hours. Make the paved road faster than the old way and adoption will follow.
- What should the platform team own vs. product teams?
- Platform owns the golden paths, shared modules, reusable workflows, cluster/runtime health, and policy. Product teams own their service code, SLOs, on-call, and any off-road choices they opt into via RFC.
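On the off-road question above: what a `policy-exception.yaml` might contain. The schema here is entirely hypothetical — shape it to whatever your policy engine reads:

```yaml
# policy-exception.yaml (hypothetical schema)
service: billing-api
rule: no-public-s3            # which guardrail is being waived
reason: "static marketing assets served directly from S3"
rfc: 123                      # the RFC that approved going off-road
approvers: [platform, security]
expires: 2025-09-30           # time-boxed; revisited quarterly
```

The `expires` field is the important part: exceptions without expiry dates become the new default.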
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
