Your Engineers Didn’t Join to Debug Terraform: Build the Paved Road, Not Another Snowflake Tool
Internal tooling should hide infrastructure complexity, not repackage it. Here’s the playbook I’ve seen actually work: opinionated defaults, thin abstractions, and ruthless avoidance of bespoke platform products.
Build the paved road with thin, versioned defaults. Don’t build a bespoke platform product you’ll be on-call for forever.
The 2 a.m. Slack thread where “self-serve infra” goes to die
I’ve watched this movie at least a dozen times. A team is “moving fast,” someone copies a Terraform snippet from Confluence, tweaks it until `terraform apply` stops screaming, and ships. Two months later you’re in a 2 a.m. Slack thread because:
- the EKS node group autoscaling settings are different in each environment
- half the services log JSON, half log unstructured text
- nobody remembers why one team pinned `provider "aws"` to an ancient version
- security asks why there are 47 S3 buckets with public ACLs “temporarily” enabled
The root problem isn’t that engineers can’t learn infra. It’s that you’re making every team pay the full cognitive cost of infrastructure—over and over—then acting surprised when reliability and cost go sideways.
The fix is not “build more internal tooling.” The fix is build a paved road: a default path that’s so frictionless that DIY feels like doing your own dental work.
When internal tooling becomes your most expensive product
Here’s what I’ve seen fail: platform teams building a bespoke portal/CLI that tries to be everything. It starts as “a thin wrapper over Terraform,” and ends up as:
- a custom state machine
- a plugin ecosystem
- an RBAC model no one understands
- a backlog bigger than the app teams’ backlogs
If your internal tooling requires a dedicated on-call rotation, congratulations—you built a product. And unlike real products, it has the world’s worst customer base: your own engineers, who will route around it the second it slows them down.
What actually works is boring:
- Standardize the 80% with opinionated defaults
- Compose existing tools (Terraform, ArgoCD, GitHub Actions, Backstage) instead of replacing them
- Keep abstractions thin: expose the outcome, not every knob
A good rule of thumb: if your “abstraction” needs its own DSL, you’re probably building a trap.
Paved-road defaults: what to abstract (and what not to)
The goal is to hide infrastructure complexity without hiding reality. Abstract the parts that are repetitive, risky, and easy to mess up.
Prioritize abstractions that:
- reduce time-to-first-deploy for a new service
- reduce change failure rate and MTTR by standardizing deploy/rollback paths
- reduce cloud cost variance by defaulting sane sizing/limits
- enforce baseline security controls (IAM boundaries, network policy, secrets handling)
Avoid abstractions that:
- require you to model every possible edge case
- turn platform work into “ticket-driven infrastructure”
- prevent teams from debugging (no visibility, no escape hatch)
The paved road isn’t about control. It’s about making the safe path the easiest path.
Before/after example: from Terraform copy/paste to a golden-path service scaffold
Before (what I still see in 2026):
- each team has its own repo template
- CI pipelines are forked YAML snowflakes
- Terraform is sprinkled everywhere (with different module versions)
- “creating a service” takes 2–5 days and usually involves asking in Slack
After: one scaffold that creates:
- a service repo
- a standard CI workflow (build/test/scan)
- a GitOps deploy entry (ArgoCD)
- baseline observability (logs/metrics/traces hooks)
You can do this with Backstage scaffolder (or even a simple cookiecutter + repo template), but keep it configuration-first.
Here’s a minimal Backstage template that creates a repo and wires CI via a reusable workflow:
```yaml
# catalog/templates/node-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-service
  title: Node.js Service (Paved Road)
spec:
  owner: platform
  type: service
  parameters:
    - title: Service Info
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: '/catalog-info.yaml'
```

And the reusable GitHub Actions workflow that every service consumes:
```yaml
# .github/workflows/ci.yml in each service
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build:
    uses: acme/platform-workflows/.github/workflows/node-ci.yml@v3
    with:
      node_version: '20'
```

```yaml
# acme/platform-workflows/.github/workflows/node-ci.yml
name: node-ci
on:
  workflow_call:
    inputs:
      node_version:
        required: true
        type: string
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node_version }}
          cache: npm
      - run: npm ci
      - run: npm test
      - run: npm run lint
      - name: SAST
        run: npx semgrep --config=auto
```

Why this is a good abstraction:
- You didn’t build a CI system. You used GitHub Actions the way it was meant to be used.
- You can version it (`@v3`) and roll out changes without repo-by-repo archaeology.
- Teams can still override when needed, but the default is fast.
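“Override” doesn’t have to mean forking the pipeline. One pattern that keeps teams on the paved road is extend-don’t-replace: keep the shared workflow for build/test/scan and bolt an extra job alongside it. A sketch (the `contract-tests` job and `test:contracts` npm script are hypothetical names, not a standard):

```yaml
# .github/workflows/ci.yml in a service that needs one extra step
name: ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  build:
    # still the paved road for build/test/scan
    uses: acme/platform-workflows/.github/workflows/node-ci.yml@v3
    with:
      node_version: '20'
  contract-tests:
    # the escape hatch: an additional job, not a forked pipeline
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:contracts
```

The platform team can still roll out `@v4` to this repo with a one-line bump; the team’s extra job rides along untouched.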
Before/after example: “deploy to Kubernetes” without making everyone a Kubernetes expert
Kubernetes isn’t hard in the abstract. It’s hard at 3 a.m. when you’re paging through kubectl describe output and realizing half your services have no resource limits.
Before: each team owns:
- Helm chart structure
- ingress config
- HPA settings
- secrets wiring
- rollout strategy
And every team makes different choices, so the platform team ends up debugging all of them anyway.
After: standardize the deployment contract:
- one base Helm chart (or a small set: `web`, `worker`)
- ArgoCD for GitOps deployments
- sane defaults for requests/limits, probes, and PDBs
Example: an ArgoCD “app of apps” where each service is just values:
```yaml
# environments/prod/services/payments.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
spec:
  project: prod
  source:
    repoURL: https://github.com/acme/platform-gitops
    targetRevision: main
    path: charts/web-service
    helm:
      values: |
        image:
          repository: ghcr.io/acme/payments
          tag: 1.42.0
        ingress:
          host: payments.acme.com
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1
            memory: 1Gi
        rollout:
          strategy: canary
          steps: [10, 50, 100]
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

What teams control: image, host, a few knobs.
What the platform controls: probes, service account patterns, network policy defaults, logging format, baseline alerts.
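To make that split concrete, here’s a sketch of platform-owned defaults in the base chart’s `values.yaml`. All names and numbers are illustrative — the point is that probes, PDBs, and logging format ship as defaults a team never has to think about:

```yaml
# charts/web-service/values.yaml — platform-owned defaults
# Teams typically override image/ingress/resources; the rest ships as-is.
replicaCount: 2
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi
probes:
  liveness:
    httpGet: { path: /healthz, port: http }
  readiness:
    httpGet: { path: /readyz, port: http }
podDisruptionBudget:
  minAvailable: 1
logging:
  format: json
networkPolicy:
  enabled: true
```

A service that specifies nothing still gets resource limits, probes, a PDB, JSON logs, and a default-deny network policy — which is exactly the 3 a.m. failure mode the paved road exists to prevent.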
This is the paved road. Not because it’s restrictive, but because it eliminates 90% of the yak shave.
The cost/benefit trade-off (the honest spreadsheet version)
Abstraction is not free. You’re buying speed and consistency with platform maintenance and constraints.
Here’s the trade-off I’ve seen pencil out in real orgs (call it 30–150 engineers):
Costs you will pay:
- 1–3 engineers maintaining golden paths (templates, workflows, base charts)
- versioning and deprecation work (yes, it’s work)
- occasional frustration from teams hitting the edges
Benefits you can actually measure:
- Time-to-first-deploy drops from days to hours (common: 2–5 days → 30–90 minutes)
- Change failure rate drops when rollout/rollback is standardized (I’ve seen 20–30% → single digits)
- MTTR improves because debugging is consistent (same dashboards, same logs, same deploy mechanics)
- Cloud cost variance shrinks when sizing defaults exist (fewer “why is this pod requesting 8 CPUs?” moments)
Concrete “before/after” I’ve personally watched:
- Before: 14 services, 11 different ingress patterns, 6 logging formats, 3 different secret managers “in progress.”
- After: 14 services on one chart + ArgoCD; two approved exceptions; platform team stopped being the human grep for YAML.
The key is keeping the platform thin. If you find yourself implementing a custom orchestrator, you’ve left the paved road and started building a theme park.
What makes paved roads stick: versioning, escape hatches, and brutal clarity
The adoption killer is when the paved road becomes a trap: no way out, unclear ownership, surprise breaking changes.
What’s worked for us (and for clients GitPlumbers has rescued):
Version everything
- Reusable workflows: `@v3`, `@v4` tags
- Helm charts: semver, release notes
- Templates: stamped with versions and migration guides
Publish an exception process
- “Escape hatch” is allowed, but it’s explicit
- Treat exceptions like architecture decisions: time-boxed, reviewed, revisited
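One lightweight way to make exceptions explicit is to record them as data in the platform repo instead of tribal knowledge. A sketch — the file location and field names are whatever your org standardizes on, not a convention from any tool:

```yaml
# platform/exceptions/payments-custom-ingress.yaml
service: payments
deviation: custom ingress annotations (bypasses base chart ingress)
reason: mTLS passthrough required for a partner integration
approved_by: platform-team
review_date: 2026-09-01   # time-boxed: revisit, don't grandfather
tracking_issue: PLAT-1234 # illustrative issue key
```

Now “how many exceptions do we have, and which are overdue for review?” is a one-line query instead of an archaeology project.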
Define your contract (and stick to it)
- Platform guarantees: deploy mechanics, baseline security, observability hooks
- Team guarantees: service SLOs, runbooks, on-call ownership
Instrument the platform itself
- adoption rate (how many services use the golden path)
- DORA metrics by cohort (golden path vs bespoke)
- platform MTTR (yes, your platform has reliability too)
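If you’re already on Backstage, golden-path adoption can be tracked from catalog metadata rather than a bespoke database: have the scaffolder stamp each service at creation time. A sketch using custom annotations (the `acme.io/*` keys are made up — pick your own namespace):

```yaml
# catalog-info.yaml — stamped by the scaffolder at creation time
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments
  annotations:
    acme.io/golden-path: web-service
    acme.io/golden-path-version: "3.2.0"
    acme.io/scaffolded-at: "2026-01-15"
spec:
  type: service
  lifecycle: production
  owner: team-payments
```

A periodic job can then bucket DORA metrics by `golden-path` (or its absence), which is what makes the golden-path-vs-bespoke cohort comparison cheap to produce.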
If you can’t explain the paved road in five minutes to a new senior hire, it’s too complicated.
Where GitPlumbers fits (when you’re already in the ditch)
A lot of teams call GitPlumbers after they’ve tried to “platform their way out” and ended up with:
- half-migrated GitOps
- Terraform module sprawl
- a Backstage instance nobody trusts
- Kubernetes clusters that are running a different reality per namespace
We’re good at the unglamorous part: untangling the current mess, choosing the smallest set of paved roads that will move the metrics, and refactoring your internal tooling so it’s composable instead of bespoke.
If you want a sanity check, start by mapping:
- the top 10 recurring infrastructure tasks engineers do
- which of those tasks cause incidents or security findings
- how many distinct “ways” you currently do each task
That’s your paved-road backlog. Not “build a portal.”
Build the road. Put up guardrails. Let teams drive.
Key takeaways
- A paved road is an opinionated default path that’s faster than DIY—don’t build a platform “menu.”
- Prefer configuration and templates over bespoke code: reuse GitHub Actions workflows, Terraform modules, ArgoCD patterns, and Backstage scaffolding.
- Abstract outcomes ("I need a service with a DB") not implementation details ("here’s 800 lines of Terraform").
- Measure success with lead time, change failure rate, MTTR, and platform adoption—not number of platform features.
- Version your golden paths, publish deprecation windows, and treat the platform like a product (but keep it thin).
Implementation checklist
- Define 2–3 paved roads (web service, async worker, batch job) before you build a portal.
- Create a single service scaffold that includes CI/CD, observability, and baseline security defaults.
- Standardize deployments with GitOps (e.g., ArgoCD) and eliminate per-team snowflake pipelines.
- Ship reusable CI building blocks (GitHub Actions `workflow_call`) instead of copy/paste YAML.
- Make “escape hatches” explicit and reviewed (exceptions are a process, not a fork).
- Track adoption and outcomes: time-to-first-deploy, PR cycle time, MTTR, and cost per environment.
- Assign an owner for each golden path and publish a compatibility/deprecation policy.
Questions we hear from teams
- Should we build a developer portal like Backstage?
- Only if you already have real golden paths to expose. A portal without paved roads is just a prettier Confluence. Start with templates, reusable CI workflows, and a standardized deploy contract; then add Backstage to make them discoverable and self-serve.
- Isn’t this just platform engineering lock-in?
- It’s lock-in to your own standards, not to a vendor. If you implement golden paths using commodity tools (Terraform, ArgoCD, GitHub Actions) and keep abstractions thin, you can evolve without rewriting everything.
- How do we avoid becoming a bottleneck?
- Make the default path self-serve and fast, publish explicit escape hatches, and version your building blocks. The platform team should ship templates and contracts, not manually provision environments for every request.
- What metrics prove this is working?
- Track time-to-first-deploy for new services, DORA metrics (lead time, deployment frequency, change failure rate, MTTR), adoption of golden paths, and cloud cost variance (especially around CPU/memory requests/limits and idle environments).
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
