Stop Making Everyone an SRE: The Paved Road That Turned 90% of Infra Tickets Into Pull Requests
Your engineers don’t want to babysit IAM roles or Helm values. Give them a paved road with sane defaults, escape hatches, and guardrails—then watch cycle time drop and infra tickets disappear.
Platforms win when they are boring, fast, and safe. Shiny is optional; paved is not.
The incident that made us stop training product engineers as SREs
A Tuesday 9:14 AM page. Half the site throwing 502s. Someone “just flipped a flag” and accidentally rolled a new NLB with a default SG that blocked health checks. The engineer followed our wiki. The wiki was wrong for EKS 1.26. I’ve seen this movie at three different companies: great people, heroic intent, and a platform that expects everyone to understand IAM, VPCs, Helm, and Argo app-of-apps. That’s not a platform—that’s a scavenger hunt.
We fixed it the boring way: we stopped making everyone an SRE. We built a paved road with opinionated defaults, a tiny CLI, and GitOps. Infra ticket volume dropped 90% in a quarter. The infra team stopped being a concierge desk and started shipping road improvements.
The rule: abstract infra, default everything, document the escape hatch
Three constraints that have actually worked across orgs from 20 to 2,000 engineers:
- One way to do the common thing. New service? There’s a single `create-service` path. Deploy? There’s one reusable workflow.
- Defaults over options. 80% of decisions are pre-decided: runtime, base image, observability, SLO template, service account, network policy.
- Guardrails, not gates. Policies run in CI with clear messages. If you need to go off-road, it’s a PR + RFC, not a Slack DM.
The paved road is just a small interface over your stack: Backstage templates or cookiecutter for repos, Terraform/OpenTofu modules for infra, GitHub Actions reusable workflows, and ArgoCD for GitOps. Keep it thin; the goal is fewer knobs, not a new bespoke platform to maintain.
Paved road blueprint: one CLI, one template, one pipeline
Here’s the model we’ve rolled out at multiple clients:
- CLI: `gp` (GitPlumbers) wraps common tasks. It shells out to real tools (`terraform`, `gh`, `kubectl`) but hides flags and enforces conventions.
- Template: Backstage software template (or `cookiecutter`) that creates repo + service with pre-wired observability, SLOs, and delivery.
- Pipeline: a single reusable workflow for build/test/deploy that 90% of services can call.
- GitOps: ArgoCD deploys from a central `env/` repo. No direct `kubectl` in CI.
- Policy-as-code: OPA/Conftest checks for things like public S3, missing budgets, or broad IAM.
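Policy-as-code sounds abstract until you see the invariant. Real Conftest policies are written in Rego; purely as an illustration of what they enforce, here is the same kind of check sketched in TypeScript over resources pulled from `terraform show -json` (the record shape is simplified, not the full plan schema):

```typescript
// policyCheck.ts — illustrative sketch only; production checks live in Rego/Conftest.
type PlanResource = {
  type: string;
  values?: { acl?: string; policy?: string };
};

// Flag S3 buckets with a public ACL and IAM policies granting wildcard actions.
export function violations(resources: PlanResource[]): string[] {
  const out: string[] = [];
  for (const r of resources) {
    if (
      r.type === 'aws_s3_bucket' &&
      ['public-read', 'public-read-write'].includes(r.values?.acl ?? '')
    ) {
      out.push(`public S3 bucket ACL: ${r.values?.acl}`);
    }
    if (r.type === 'aws_iam_policy' && (r.values?.policy ?? '').includes('"Action":"*"')) {
      out.push('IAM policy allows all actions');
    }
  }
  return out;
}
```

The point is that the rule lives in code with a clear failure message, not in a reviewer’s head.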
A thin CLI example (TypeScript via tsx):

```typescript
// tools/gp/src/commands/createService.ts
import { execSync } from 'node:child_process';
import fs from 'node:fs';

export async function createService(opts: { name: string; runtime: 'node' | 'python' | 'go' }) {
  const { name, runtime } = opts;
  // 1) Scaffold from Backstage template
  execSync(`npx @backstage/create-app --no-private-registries --scope ${name}`);
  // 2) Register in catalog and create baseline SLO
  execSync(`gh repo create org/${name} --private --source ./${name} --push`);
  execSync(`gh api repos/org/${name}/dispatches -f event_type=create-slo`);
  // 3) Provision minimal infra via Terraform module
  fs.writeFileSync(`./${name}/infra/main.tf`, `
module "service" {
  source       = "git::ssh://git@github.com/org/tf-mod-service.git//base?ref=v3.2.0"
  service_name = "${name}"
  runtime      = "${runtime}"
}
`);
  execSync(`(cd ./${name}/infra && terraform init && terraform apply -auto-approve)`, { stdio: 'inherit' });
}
```

GitHub Actions reusable workflow everyone calls:
```yaml
# .github/workflows/reuse-build-deploy.yaml in platform-infra repo
name: build-test-deploy
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string
      env:
        required: true
        type: string
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci && npm test && npm run build
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/org/${{ inputs.service }}:${{ github.sha }}
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create PR to env repo
        run: |
          gh repo clone org/env
          cd env
          yq -i \
            '.spec.source.helm.values.image.tag = "${{ github.sha }}"' \
            apps/${{ inputs.service }}/${{ inputs.env }}/values.yaml
          git checkout -b bump-${{ inputs.service }}-${{ github.sha }}
          git commit -am "bump ${{ inputs.service }} -> ${{ github.sha }}"
          git push origin HEAD
          gh pr create --fill
```

Before/After: standing up a new service
Before (what I actually found in a fintech last year):
- Wiki with 22 steps across four pages.
- Copy a service repo, grep/replace names.
- Open three Jira tickets for DNS, IAM, and monitoring.
- Ping a platform engineer to fix a `ServiceAccount` and a Helm `values.yaml` you don’t understand.
- Wait 3–5 days.
After (paved road):
```bash
# 7 minutes, one command, one PR
npx gp create service --name billing-api --runtime node
```

What the CLI generates:
- `billing-api/` repo with `Dockerfile`, `helm/`, `observability/` (Grafana dashboard + Prometheus alerts), `slo.yaml` (99.9% latency budget), and a `catalog-info.yaml` for Backstage.
- `infra/` folder that pins Terraform modules to platform-approved versions (no surprise upgrades).
- GitHub Actions workflow that calls the reusable `build-test-deploy`.
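That generated caller workflow stays tiny because all the logic lives in the reusable one. A sketch of what it might look like (the `org/platform-infra` path and ref are placeholders for your platform repo):

```yaml
# billing-api/.github/workflows/deploy.yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  release:
    # all build/test/deploy logic lives in the platform repo
    uses: org/platform-infra/.github/workflows/reuse-build-deploy.yaml@main
    with:
      service: billing-api
      env: prod
```

Upgrading every service’s pipeline is then one change in `platform-infra`, not a hundred PRs.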
Backstage template snippet that encodes default choices so no one bikesheds:
```yaml
# templates/service-template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-rest-service
  title: Node REST Service (paved road)
spec:
  owner: platform
  parameters:
    - title: Service name
      required:
        - name
      properties:
        name:
          type: string
          pattern: '^[a-z][a-z0-9-]+$'
  steps:
    - id: fetch
      name: Fetch base template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          runtime: node
          tracing: enabled
          metrics: enabled
          featureFlags: launchdarkly
```

Concrete result: time-to-first-PR went from ~3 days to <1 hour; new-service lead time dropped by 80%.
Before/After: shipping to prod
Before:
- CI runs `kubectl apply` with hand-crafted K8s manifests per service.
- Rollbacks require another manual `kubectl` or digging for the last working YAML.
- Drift everywhere: what’s in the cluster doesn’t match what’s in Git.
After (GitOps): CI only opens a PR in the `env/` repo; ArgoCD does the rest. You can read prod state in Git and roll back with `git revert`.
ArgoCD Application per service:
```yaml
# env/apps/billing-api/prod/app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: billing-api-prod
spec:
  project: default
  source:
    repoURL: https://github.com/org/billing-api
    targetRevision: HEAD
    path: helm
    helm:
      valueFiles:
        - values/prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: billing
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Minimal Helm values developers touch (sane defaults baked into the chart):
```yaml
# env/apps/billing-api/prod/values.yaml
image:
  repository: ghcr.io/org/billing-api
  tag: 7b3c1f6
resources:
  requests: { cpu: "200m", memory: "256Mi" }
  limits: { cpu: "500m", memory: "512Mi" }
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
ingress:
  enabled: true
  host: billing.prod.example.com
```

Now rollbacks look like:

```bash
git revert <merge_commit_sha> && git push
# ArgoCD syncs back to the last good image.
```

MTTR went from “hope someone remembers the kubectl incantation” to “merge the revert.”
The costs you actually pay (and why it’s worth it)
Be honest about trade-offs:
- You will say no to creative snowflake infra. That’s the point. If someone needs `NATS` instead of your default `Kafka` or `SQS`, make it an RFC with SLOs and an owning team.
- You’re maintaining a platform product. Release notes, versioned modules, deprecation policy. Boring, necessary.
- Abstraction leaks. Once a quarter, someone will need to edit a SecurityGroup or a PodDisruptionBudget. Document the escape hatch:

```bash
# Off-road with eyes open
gp open escape-hatch --service billing-api --policy p99-latency --rfc 123
```

Benefits that show up in the CFO’s spreadsheet:
- Infra tickets/engineer/month drop. We’ve seen 7.4 -> 0.6 in one quarter.
- Change failure rate falls because you deploy the same way every time.
- Onboarding time shrinks: a new engineer ships on day one rather than in week two.
- Cloud bill variance stabilizes because you eliminate the weird one-off stacks.
30–60 day rollout plan (no yak shaving)
- Pick your runtime(s) and IaC. E.g., `Node 20 + Go 1.22` and `OpenTofu 1.7`. Freeze versions.
- Baseline modules. Create `tf-mod-service` and `tf-mod-database` with pinned providers and sane defaults (budget alarms, encryption, tags).
- Golden-path template. Backstage or `cookiecutter` for one runtime. Wire in logging (OpenTelemetry), metrics (Prometheus), tracing (Jaeger), and an SLO scaffold.
- Reusable workflow. Publish `build-test-deploy` in a platform repo. Block `kubectl` from CI runners.
- GitOps. Stand up ArgoCD, create the `env/` repo with `app-of-apps` or per-service apps.
- Policy-as-code. Add Conftest checks to the reusable workflow:

```bash
conftest test infra/ --policy policies/ && conftest test helm/ --policy policies/
```

- CLI wrapper. Even a Bash MVP is fine if it encodes conventions:

```bash
#!/usr/bin/env bash
set -euo pipefail
cmd=$1; shift
case "$cmd" in
  create-service) npx @backstage/create-app "$@" ;;
  tf) (cd infra && tofu "$@") ;;
  deploy) gh workflow run reuse-build-deploy.yaml -f service=$(basename "$PWD") -f env=${1:-dev} ;;
  *) echo "unknown command"; exit 1 ;;
esac
```

- Docs and office hours. 2-page quickstart, weekly “pave requests” triage, and a public backlog.
What to measure and when to iterate
Track platform like a product. Dashboards we install day one:
- DORA metrics: lead time for changes, deployment frequency, change failure rate, MTTR.
- Platform SLOs: template success rate, pipeline success rate, Argo sync lag.
- Support load: infra tickets per engineer per month, median time-to-first-response.
- Adoption: % services on the golden path, % deploys via reusable workflow, % drift-free apps in Argo.
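None of these metrics require a vendor; they fall out of deploy events you already have. A minimal sketch (the event shape and field names are assumptions, not a standard schema):

```typescript
// platform/metrics.ts — sketch of computing three dashboard numbers from deploy events.
type Deploy = {
  service: string;
  mergedAt: number;   // commit merged to main (ms epoch)
  deployedAt: number; // live in prod (ms epoch)
  failed: boolean;    // caused an incident or rollback
  pavedRoad: boolean; // went through the reusable workflow
};

// DORA: change failure rate = failed deploys / total deploys.
export function changeFailureRate(deploys: Deploy[]): number {
  return deploys.length === 0 ? 0 : deploys.filter(d => d.failed).length / deploys.length;
}

// DORA: lead time for changes, here the median merge-to-prod gap in hours.
export function medianLeadTimeHours(deploys: Deploy[]): number {
  const hrs = deploys
    .map(d => (d.deployedAt - d.mergedAt) / 3_600_000)
    .sort((a, b) => a - b);
  return hrs.length === 0 ? 0 : hrs[Math.floor(hrs.length / 2)];
}

// Adoption: fraction of deploys going through the golden path.
export function pavedRoadAdoption(deploys: Deploy[]): number {
  return deploys.length === 0 ? 0 : deploys.filter(d => d.pavedRoad).length / deploys.length;
}
```

Start with a cron that appends one record per deploy; graph the three numbers weekly.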
If adoption stalls, the usual culprits:
- The template doesn’t match reality (missing `grpc`, no cronjob support).
- The CLI makes off-road too hard or too easy.
- Hidden toil in the pipeline (Docker layer cache not working; 12-minute builds). Fix it or people will bypass you.
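On that last culprit: if your builds use `docker/build-push-action`, its GitHub Actions cache backend is often the cheapest fix. A sketch of the step in the reusable workflow (tune `mode` to your image structure):

```yaml
- uses: docker/build-push-action@v6
  with:
    push: true
    tags: ghcr.io/org/${{ inputs.service }}:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

A 12-minute build that drops to 3 does more for adoption than any launch email.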
Platforms win when they are boring, fast, and safe. Shiny is optional; paved is not.
Key takeaways
- Abstract infrastructure behind a small, boring interface that defaults 80% of decisions.
- Standardize on one CLI, one template, and one delivery pipeline; make escape hatches explicit.
- Move infra support from bespoke tickets to PRs by codifying paved-road modules and reusable workflows.
- Measure success with MTTR, lead time for changes, and infra ticket volume per engineer.
- Adopt GitOps with ArgoCD to reduce drift and make rollbacks predictable.
Implementation checklist
- Pick one IaC tool and version it (Terraform or OpenTofu).
- Create a single golden-path template (Backstage or cookiecutter) per runtime.
- Ship a thin internal CLI wrapper for paved-road operations.
- Adopt GitOps (ArgoCD/Flux) and one reusable CI/CD workflow.
- Codify policy-as-code with OPA/Conftest or Sentinel.
- Document escape hatches and an RFC process for going off-road.
- Instrument the platform: deploy dashboards for DORA + SLOs.
Questions we hear from teams
- How do we avoid building a bespoke platform that becomes legacy?
- Keep your layer thin and boring. Reuse open tools (Backstage, ArgoCD, Terraform/OpenTofu, GitHub Actions) and avoid custom controllers until the use case hurts repeatedly. Version everything, publish deprecation notices, and keep escape hatches documented. If your CLI is 10,000 lines, you’re re-implementing the cloud—stop.
- What if a team truly needs to go off the paved road?
- Make it a formal RFC with success criteria and SLO/ownership. Provide extension points: custom Helm chart under a `chart/experimental/` path, terraform `extra_*` variables, and a `policy-exception.yaml` reviewed by platform + security. Time-box exceptions and revisit quarterly.
- Can we do this without Kubernetes?
- Yes. The same pattern works with ECS, Nomad, or even serverless. Swap ArgoCD with CodeDeploy or Spinnaker; use Terraform modules and a reusable CI workflow. The key is defaults + GitOps-style desired state.
- How do we roll out without blocking current delivery?
- Start with one team and one runtime. Migrate net-new services first. For existing services, offer a migration path with tooling (`gp migrate service`) and support office hours. Make the paved road faster than the old way and adoption will follow.
- What should the platform team own vs. product teams?
- Platform owns the golden paths, shared modules, reusable workflows, cluster/runtime health, and policy. Product teams own their service code, SLOs, on-call, and any off-road choices they opt into via RFC.
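On the off-road question above: what a `policy-exception.yaml` might contain. The schema here is entirely hypothetical — shape it to whatever your policy engine reads:

```yaml
# policy-exception.yaml (hypothetical schema)
service: billing-api
rule: no-public-s3            # which guardrail is being waived
reason: "static marketing assets served directly from S3"
rfc: 123                      # the RFC that approved going off-road
approvers: [platform, security]
expires: 2025-09-30           # time-boxed; revisited quarterly
```

The `expires` field is the important part: exceptions without expiry dates become the new default.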
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
