Your Engineers Didn’t Join to Debug Terraform: Build the Paved Road, Not Another Snowflake Tool
Internal tooling should hide infrastructure complexity, not repackage it. Here’s the playbook I’ve seen actually work: opinionated defaults, thin abstractions, and ruthless avoidance of bespoke platform products.
Build the paved road with thin, versioned defaults. Don’t build a bespoke platform product you’ll be on-call for forever.
The 2 a.m. Slack thread where “self-serve infra” goes to die
I’ve watched this movie at least a dozen times. A team is “moving fast,” someone copies a Terraform snippet from Confluence, tweaks it until `terraform apply` stops screaming, and ships. Two months later you’re in a 2 a.m. Slack thread because:
- the EKS node group autoscaling settings are different in each environment
- half the services log JSON, half log unstructured text
- nobody remembers why one team pinned `provider "aws"` to an ancient version
- security asks why there are 47 S3 buckets with public ACLs “temporarily” enabled
The root problem isn’t that engineers can’t learn infra. It’s that you’re making every team pay the full cognitive cost of infrastructure—over and over—then acting surprised when reliability and cost go sideways.
The fix is not “build more internal tooling.” The fix is build a paved road: a default path that’s so frictionless that DIY feels like doing your own dental work.
When internal tooling becomes your most expensive product
Here’s what I’ve seen fail: platform teams building a bespoke portal/CLI that tries to be everything. It starts as “a thin wrapper over Terraform,” and ends up as:
- a custom state machine
- a plugin ecosystem
- an RBAC model no one understands
- a backlog bigger than the app teams’ backlogs
If your internal tooling requires a dedicated on-call rotation, congratulations—you built a product. And unlike real products, it has the world’s worst customer base: your own engineers, who will route around it the second it slows them down.
What actually works is boring:
- Standardize the 80% with opinionated defaults
- Compose existing tools (Terraform, ArgoCD, GitHub Actions, Backstage) instead of replacing them
- Keep abstractions thin: expose the outcome, not every knob
A good rule of thumb: if your “abstraction” needs its own DSL, you’re probably building a trap.
Paved-road defaults: what to abstract (and what not to)
The goal is to hide infrastructure complexity without hiding reality. Abstract the parts that are repetitive, risky, and easy to mess up.
Prioritize abstractions that:
- reduce time-to-first-deploy for a new service
- reduce change failure rate and MTTR by standardizing deploy/rollback paths
- reduce cloud cost variance by defaulting sane sizing/limits
- enforce baseline security controls (IAM boundaries, network policy, secrets handling)
Avoid abstractions that:
- require you to model every possible edge case
- turn platform work into “ticket-driven infrastructure”
- prevent teams from debugging (no visibility, no escape hatch)
The paved road isn’t about control. It’s about making the safe path the easiest path.
Before/after example: from Terraform copy/paste to a golden-path service scaffold
Before (what I still see in 2026):
- each team has its own repo template
- CI pipelines are forked YAML snowflakes
- Terraform is sprinkled everywhere (with different module versions)
- “creating a service” takes 2–5 days and usually involves asking in Slack
After: one scaffold that creates:
- a service repo
- a standard CI workflow (build/test/scan)
- a GitOps deploy entry (ArgoCD)
- baseline observability (logs/metrics/traces hooks)
You can do this with Backstage scaffolder (or even a simple cookiecutter + repo template), but keep it configuration-first.
Here’s a minimal Backstage template that creates a repo and wires CI via a reusable workflow:
```yaml
# catalog/templates/node-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-service
  title: Node.js Service (Paved Road)
spec:
  owner: platform
  type: service
  parameters:
    - title: Service Info
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: '/catalog-info.yaml'
```

And the reusable GitHub Actions workflow that every service consumes:
```yaml
# .github/workflows/ci.yml in each service
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build:
    uses: acme/platform-workflows/.github/workflows/node-ci.yml@v3
    with:
      node_version: '20'
```

```yaml
# acme/platform-workflows/.github/workflows/node-ci.yml
name: node-ci
on:
  workflow_call:
    inputs:
      node_version:
        required: true
        type: string
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node_version }}
          cache: npm
      - run: npm ci
      - run: npm test
      - run: npm run lint
      - name: SAST
        run: npx semgrep --config=auto
```

Why this is a good abstraction:
- You didn’t build a CI system. You used GitHub Actions the way it was meant to be used.
- You can version it (`@v3`) and roll out changes without repo-by-repo archaeology.
- Teams can still override when needed, but the default is fast.
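“Override” doesn’t have to mean forking the pipeline. One pattern that keeps teams on the paved road is extend-don’t-replace: keep the shared workflow for build/test/scan and bolt an extra job alongside it. A sketch (the `contract-tests` job and `test:contracts` npm script are hypothetical names, not a standard):

```yaml
# .github/workflows/ci.yml in a service that needs one extra step
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build:
    # still the paved road for build/test/scan
    uses: acme/platform-workflows/.github/workflows/node-ci.yml@v3
    with:
      node_version: '20'
  contract-tests:
    # the escape hatch: an additional job, not a forked pipeline
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:contracts
```

The platform team can still roll out `@v4` to this repo with a one-line bump; the team’s extra job rides along untouched.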
Before/after example: “deploy to Kubernetes” without making everyone a Kubernetes expert
Kubernetes isn’t hard in the abstract. It’s hard at 3 a.m. when you’re paging through kubectl describe output and realizing half your services have no resource limits.
Before: each team owns:
- Helm chart structure
- ingress config
- HPA settings
- secrets wiring
- rollout strategy
And every team makes different choices, so the platform team ends up debugging all of them anyway.
After: standardize the deployment contract:
- one base Helm chart (or a small set: `web`, `worker`)
- ArgoCD for GitOps deployments
- sane defaults for requests/limits, probes, and PDBs
Example: an ArgoCD “app of apps” where each service is just values:
```yaml
# environments/prod/services/payments.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
spec:
  project: prod
  source:
    repoURL: https://github.com/acme/platform-gitops
    targetRevision: main
    path: charts/web-service
    helm:
      values: |
        image:
          repository: ghcr.io/acme/payments
          tag: 1.42.0
        ingress:
          host: payments.acme.com
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1
            memory: 1Gi
        rollout:
          strategy: canary
          steps: [10, 50, 100]
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

What teams control: image, host, a few knobs.
What the platform controls: probes, service account patterns, network policy defaults, logging format, baseline alerts.
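To make that split concrete, here’s a sketch of platform-owned defaults in the base chart’s `values.yaml`. All names and numbers are illustrative — the point is that probes, PDBs, and logging format ship as defaults a team never has to think about:

```yaml
# charts/web-service/values.yaml — platform-owned defaults
# Teams typically override image/ingress/resources; the rest ships as-is.
replicaCount: 2
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
probes:
  liveness:
    httpGet: { path: /healthz, port: http }
  readiness:
    httpGet: { path: /readyz, port: http }
podDisruptionBudget:
  minAvailable: 1
logging:
  format: json
networkPolicy:
  enabled: true
```

A service that specifies nothing still gets resource limits, probes, a PDB, JSON logs, and a default-deny network policy — which is exactly the 3 a.m. failure mode the paved road exists to prevent.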
This is the paved road. Not because it’s restrictive, but because it eliminates 90% of the yak shave.
The cost/benefit trade-off (the honest spreadsheet version)
Abstraction is not free. You’re buying speed and consistency with platform maintenance and constraints.
Here’s the trade-off I’ve seen pencil out in real orgs (call it 30–150 engineers):
Costs you will pay:
- 1–3 engineers maintaining golden paths (templates, workflows, base charts)
- versioning and deprecation work (yes, it’s work)
- occasional frustration from teams hitting the edges
Benefits you can actually measure:
- Time-to-first-deploy drops from days to hours (common: 2–5 days → 30–90 minutes)
- Change failure rate drops when rollout/rollback is standardized (I’ve seen 20–30% → single digits)
- MTTR improves because debugging is consistent (same dashboards, same logs, same deploy mechanics)
- Cloud cost variance shrinks when sizing defaults exist (fewer “why is this pod requesting 8 CPUs?” moments)
Concrete “before/after” I’ve personally watched:
- Before: 14 services, 11 different ingress patterns, 6 logging formats, 3 different secret managers “in progress.”
- After: 14 services on one chart + ArgoCD; two approved exceptions; platform team stopped being the human grep for YAML.
The key is keeping the platform thin. If you find yourself implementing a custom orchestrator, you’ve left the paved road and started building a theme park.
What makes paved roads stick: versioning, escape hatches, and brutal clarity
The adoption killer is when the paved road becomes a trap: no way out, unclear ownership, surprise breaking changes.
What’s worked for us (and for clients GitPlumbers has rescued):
Version everything
- Reusable workflows: `@v3`, `@v4` tags
- Helm charts: semver, release notes
- Templates: stamped with versions and migration guides
Publish an exception process
- “Escape hatch” is allowed, but it’s explicit
- Treat exceptions like architecture decisions: time-boxed, reviewed, revisited
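One lightweight way to make exceptions explicit is to record them as data in the platform repo instead of tribal knowledge. A sketch — the file location and field names are whatever your org standardizes on, not a convention from any tool:

```yaml
# platform/exceptions/payments-custom-ingress.yaml
service: payments
deviation: custom ingress annotations (bypasses base chart ingress)
reason: mTLS passthrough required for a partner integration
approved_by: platform-team
review_date: 2026-09-01   # time-boxed: revisit, don't grandfather
tracking_issue: PLAT-1234 # illustrative issue key
```

Now “how many exceptions do we have, and which are overdue for review?” is a one-line query instead of an archaeology project.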
Define your contract (and stick to it)
- Platform guarantees: deploy mechanics, baseline security, observability hooks
- Team guarantees: service SLOs, runbooks, on-call ownership
Instrument the platform itself
- adoption rate (how many services use the golden path)
- DORA metrics by cohort (golden path vs bespoke)
- platform MTTR (yes, your platform has reliability too)
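If you’re already on Backstage, golden-path adoption can be tracked from catalog metadata rather than a bespoke database: have the scaffolder stamp each service at creation time. A sketch using custom annotations (the `acme.io/*` keys are made up — pick your own namespace):

```yaml
# catalog-info.yaml — stamped by the scaffolder at creation time
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments
  annotations:
    acme.io/golden-path: web-service
    acme.io/golden-path-version: "3.2.0"
    acme.io/scaffolded-at: "2026-01-15"
spec:
  type: service
  lifecycle: production
  owner: team-payments
```

A periodic job can then bucket DORA metrics by `golden-path` (or its absence), which is what makes the golden-path-vs-bespoke cohort comparison cheap to produce.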
If you can’t explain the paved road in five minutes to a new senior hire, it’s too complicated.
Where GitPlumbers fits (when you’re already in the ditch)
A lot of teams call GitPlumbers after they’ve tried to “platform their way out” and ended up with:
- half-migrated GitOps
- Terraform module sprawl
- a Backstage instance nobody trusts
- Kubernetes clusters that are running a different reality per namespace
We’re good at the unglamorous part: untangling the current mess, choosing the smallest set of paved roads that will move the metrics, and refactoring your internal tooling so it’s composable instead of bespoke.
If you want a sanity check, start by mapping:
- the top 10 recurring infrastructure tasks engineers do
- which of those tasks cause incidents or security findings
- how many distinct “ways” you currently do each task
That’s your paved-road backlog. Not “build a portal.”
Build the road. Put up guardrails. Let teams drive.
Key takeaways
- A paved road is an opinionated default path that’s faster than DIY—don’t build a platform “menu.”
- Prefer configuration and templates over bespoke code: reuse GitHub Actions workflows, Terraform modules, ArgoCD patterns, and Backstage scaffolding.
- Abstract outcomes ("I need a service with a DB") not implementation details ("here’s 800 lines of Terraform").
- Measure success with lead time, change failure rate, MTTR, and platform adoption—not number of platform features.
- Version your golden paths, publish deprecation windows, and treat the platform like a product (but keep it thin).
Implementation checklist
- Define 2–3 paved roads (web service, async worker, batch job) before you build a portal.
- Create a single service scaffold that includes CI/CD, observability, and baseline security defaults.
- Standardize deployments with GitOps (e.g., ArgoCD) and eliminate per-team snowflake pipelines.
- Ship reusable CI building blocks (GitHub Actions `workflow_call`) instead of copy/paste YAML.
- Make “escape hatches” explicit and reviewed (exceptions are a process, not a fork).
- Track adoption and outcomes: time-to-first-deploy, PR cycle time, MTTR, and cost per environment.
- Assign an owner for each golden path and publish a compatibility/deprecation policy.
Questions we hear from teams
- Should we build a developer portal like Backstage?
- Only if you already have real golden paths to expose. A portal without paved roads is just a prettier Confluence. Start with templates, reusable CI workflows, and a standardized deploy contract; then add Backstage to make them discoverable and self-serve.
- Isn’t this just platform engineering lock-in?
- It’s lock-in to your own standards, not to a vendor. If you implement golden paths using commodity tools (Terraform, ArgoCD, GitHub Actions) and keep abstractions thin, you can evolve without rewriting everything.
- How do we avoid becoming a bottleneck?
- Make the default path self-serve and fast, publish explicit escape hatches, and version your building blocks. The platform team should ship templates and contracts, not manually provision environments for every request.
- What metrics prove this is working?
- Track time-to-first-deploy for new services, DORA metrics (lead time, deployment frequency, change failure rate, MTTR), adoption of golden paths, and cloud cost variance (especially around CPU/memory requests/limits and idle environments).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
