The Internal Platform That Stopped Our Infra Death Spiral

Stop asking every squad to be SREs. Give them a paved road that hides the messy bits and ships fast by default.

“Standardize the contracts and pave the road. Everything else is just YAML cosplay.”

You don’t need 40 SREs. You need a paved road

I’ve watched too many orgs confuse freedom with chaos. One recent Series D fintech had 120 engineers, 80 services, and 30 different ways to ship to Kubernetes. Every team ran its own Terraform, hand-rolled Helm charts, and copy-pasted kubectl incantations from a 2019 wiki. Incidents looked like archaeology. Onboarding took two weeks. MTTR hovered near four hours because nobody knew which pipeline mattered.

What saved them wasn’t another shiny platform slide deck. It was a brutally simple internal toolchain: a small CLI, a single Helm chart, versioned Terraform modules, a reusable CI workflow, and GitOps with Argo CD. Opinions over options. Defaults over bespoke. And measurable outcomes in weeks, not quarters.

Why this matters: infra complexity is a tax

When every squad runs their own infra flavor, you pay for it in:

  • Cognitive load: engineers learn Terraform, Helm, Ingress, and IAM just to ship a Flask app.
  • Inconsistent SLOs: prod behavior depends on who last edited values.yaml.
  • Slow delivery: lead time stretches from hours to days as teams reinvent pipelines.
  • Cloud spend drift: five different autoscaling approaches, none monitored.

I’ve seen this fail at startups and at FAANG-scale alike. The fix isn’t to lock everything down; it’s to standardize the contracts and let the platform handle translation. Give teams an obvious, boring path that works 95% of the time, with reviewable escape hatches for the other 5%.

Before vs. after: what actually changes in a week

Here’s the before. A typical service deploy looked like this:

# team-a/service/Makefile
apply:
	terraform -chdir=infra init
	terraform -chdir=infra apply -var env=$(ENV)
	helm upgrade --install svc charts/service -f env/$(ENV).yaml \
	  --set image.tag=$(GIT_SHA)
	kubectl apply -f k8s/ingress-$(ENV).yaml

Each repo had its own snowflake. No two env/*.yaml files matched. Rollbacks required Slack archaeology.

And here’s the after. Same outcome, fewer footguns:

# developer machine (or CI)
# scaffold a service with paved-road defaults
$ gp app create --template=python-api --name=payments

# deploy through a single reusable workflow
$ gp deploy --env=staging

Under the hood:

  • gp app create writes a service.yaml, standard Dockerfile, and the only Helm values we let vary.
  • CI calls a reusable workflow that renders one chart, updates a GitOps repo, and lets Argo CD sync.
  • Terraform comes via a versioned module that encodes our VPC, IAM, KMS, and database opinions.

Result after four weeks at that fintech:

  • Lead time: 3 days → 45 minutes median.
  • Onboarding: 10 days → 1 day to first prod deploy.
  • MTTR: 4h → 70m, thanks to consistent logs/metrics.
  • Cloud bill: 18% savings from sane defaults and autoscaling that wasn’t copy/pasted from 2019.

The contracts: standardize what devs declare, not how it’s provisioned

Start with a minimal but strict service descriptor. This is the API between app teams and your platform.

# service.yaml
name: payments
owner: team-payments
runtime: k8s
language: python
ports:
  - 8080
ingress:
  host: payments.example.com
  path: /
slo:
  availability: 99.9
  latency_p95_ms: 300
database:
  type: postgres
  storage_gb: 20
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
tier: 1
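
The platform should reject a descriptor that breaks the contract before anything gets rendered or deployed. Here's a minimal validation sketch in Go — the struct fields mirror the example above, but the struct name, field set, and bounds are assumptions; a real implementation would unmarshal the YAML with a library like gopkg.in/yaml.v3 rather than populate the struct by hand:

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceDescriptor mirrors service.yaml (hypothetical struct; the field
// set is taken from the example above, trimmed to what validation needs).
type ServiceDescriptor struct {
	Name    string
	Owner   string
	Runtime string
	Ports   []int
	SLO     struct {
		Availability float64
		LatencyP95ms int
	}
	Tier int
}

// Validate enforces the parts of the contract the platform depends on,
// collecting every violation instead of stopping at the first.
func (s ServiceDescriptor) Validate() error {
	var errs []error
	if s.Name == "" {
		errs = append(errs, errors.New("name is required"))
	}
	if s.Owner == "" {
		errs = append(errs, errors.New("owner is required"))
	}
	if s.Runtime != "k8s" {
		errs = append(errs, fmt.Errorf("unsupported runtime %q (paved road is k8s)", s.Runtime))
	}
	if len(s.Ports) == 0 {
		errs = append(errs, errors.New("at least one port is required"))
	}
	if s.SLO.Availability < 99.0 || s.SLO.Availability > 100.0 {
		errs = append(errs, fmt.Errorf("availability %.2f out of range [99, 100]", s.SLO.Availability))
	}
	if s.Tier < 1 || s.Tier > 3 {
		errs = append(errs, fmt.Errorf("tier %d must be 1-3", s.Tier))
	}
	return errors.Join(errs...) // nil when no violations
}

func main() {
	svc := ServiceDescriptor{Name: "payments", Owner: "team-payments", Runtime: "k8s", Ports: []int{8080}, Tier: 1}
	svc.SLO.Availability = 99.9
	svc.SLO.LatencyP95ms = 300
	fmt.Println("valid:", svc.Validate() == nil)
}
```

Run this check in CI and in gp deploy, and a bad descriptor fails in seconds instead of mid-rollout.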

One chart, many services. Developers can override only what you expose:

# helm/values.paved-road.yaml
replicaCount: 2
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70
logging: { level: INFO }
tracing: { enabled: true }
resources:
  requests: { cpu: 250m, memory: 256Mi }
  limits:   { cpu: 1,    memory: 512Mi }
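
The "override only what you expose" behavior is just a deep merge: developer values overlay paved-road defaults key by key, which is the precedence Helm applies across values files. A sketch of the idea in Go (nested `map[string]any` standing in for parsed YAML):

```go
package main

import "fmt"

// mergeValues overlays developer overrides onto paved-road defaults,
// recursing into nested maps so a team can change one knob without
// restating the whole subtree. Later sources win per key.
func mergeValues(defaults, overrides map[string]any) map[string]any {
	out := make(map[string]any, len(defaults))
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range overrides {
		if ov, ok := v.(map[string]any); ok {
			if dv, ok := out[k].(map[string]any); ok {
				out[k] = mergeValues(dv, ov) // both sides are maps: recurse
				continue
			}
		}
		out[k] = v // scalar or type mismatch: override wins outright
	}
	return out
}

func main() {
	defaults := map[string]any{
		"replicaCount": 2,
		"autoscaling":  map[string]any{"enabled": true, "minReplicas": 2, "maxReplicas": 6},
	}
	overrides := map[string]any{
		"autoscaling": map[string]any{"maxReplicas": 10}, // the only knob this team touches
	}
	merged := mergeValues(defaults, overrides)
	fmt.Println(merged["replicaCount"], merged["autoscaling"].(map[string]any)["maxReplicas"])
	// 2 10
}
```

Because the defaults live in one file owned by the platform team, tightening them later changes every service at once.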

Infra is encoded in versioned Terraform modules. Teams don’t author VPCs or KMS keys—they declare intent.

# infra/main.tf
module "service" {
  source       = "git::ssh://git.example.com/platform/terraform-modules//gp-service?ref=v1.4.0"
  name         = var.name
  env          = var.env
  runtime      = "k8s"
  database     = var.database
  observability = {
    logs_bucket = data.terraform_remote_state.obs.outputs.logs_bucket
  }
}

And a single reusable CI job deploys everything consistently:

# .github/workflows/deploy.yaml
name: Deploy
on:
  workflow_dispatch:
  push:
    branches: [ main ]

jobs:
  deploy:
    uses: org-platform/.github/workflows/deploy-paved-road.yaml@v3
    with:
      service_descriptor: service.yaml
      environment: staging
    secrets: inherit

All deploys now look the same in logs, metrics, and Argo CD. That’s how you cut MTTR and review time.

The internal CLI: one verb away from done

Give devs a small binary that wraps the platform decisions. I like Go + Cobra for zero-dep installs.

// cmd/gp/main.go
package main

import (
  "log"

  "github.com/spf13/cobra"
)

func main() {
  root := &cobra.Command{Use: "gp"}

  deploy := &cobra.Command{
    Use:   "deploy",
    Short: "Deploy current service via GitOps",
    RunE: func(cmd *cobra.Command, args []string) error {
      env, _ := cmd.Flags().GetString("env")
      svc, err := LoadServiceDescriptor("service.yaml")
      if err != nil {
        return err
      }
      // 1) Render values from service.yaml
      // 2) Commit to GitOps repo (branch + PR)
      // 3) Wait for Argo CD sync
      return DeployViaGitOps(svc, env)
    },
  }
  deploy.Flags().String("env", "staging", "target environment")

  root.AddCommand(deploy)
  // log.Fatal already exits with status 1, so no os.Exit needed.
  if err := root.Execute(); err != nil {
    log.Fatal(err)
  }
}

Three to five verbs is enough:

  • gp app create – scaffold repo with service.yaml, Dockerfile, CI.
  • gp deploy – update GitOps repo and wait for health.
  • gp db create – provision managed Postgres via Terraform module, emit credentials into Secret Manager.
  • gp logs – stream app logs with sane filters.
  • gp status – show SLOs and Argo health.

Keep the CLI thin; push logic into APIs and reusable workflows so you can upgrade behavior without shipping binaries daily.

Guardrails: escape hatches without chaos

You will have edge cases. Don’t make them the default. Use policy-as-code with documented exceptions.

  • OPA/Rego or Kyverno to enforce minimum resources, required labels, and network policies.
  • Exceptions via a YAML file in the repo with an expiration date and a ticket link.

# policy/resources.rego
package policies

default allow = false

allow {
  input.resources.limits.cpu != null
  input.resources.limits.memory != null
  input.resources.requests.cpu != null
  input.resources.requests.memory != null
}

# .platform/exception.yaml
id: allow-high-memory
reason: "Black Friday load testing"
expires: "2025-12-01"
approved_by: "sre-oncall"
overrides:
  resources:
    limits:
      memory: "4Gi"

CI checks these in a single place and flags expired exceptions. Ops keeps control; teams move fast.

Rollout plan: two sprints to impact

You don’t need a platform team of 20. You need focus.

  1. Sprint 1 – Define contracts and defaults

    • Write service.yaml schema and a minimal Helm chart with paved-road values.
    • Wrap your infra in a gp-service Terraform module (v0.1.0). No features, just the basics.
    • Build a reusable CI workflow (v1) that takes service.yaml and drives GitOps.
  2. Sprint 2 – Ship the CLI and one golden path

    • Publish gp binary with app create and deploy.
    • Onboard 2-3 volunteer teams. Migrate a service end-to-end.
    • Add guardrails (OPA) for resources and labels; wire logs/metrics dashboards.
  3. Phase 2 – Hardening and escape hatches

    • db create, logs, and status commands.
    • Cost controls: default requests/limits and HPA tuned for your workloads.
    • Backstage integration so new services show up automatically in the catalog.

Backstage template example:

# templates/service-template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: paved-road-service
spec:
  owner: platform
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
    - id: gp-init
      action: run:script
      input:
        script: gp app create --template ${{ parameters.language }} --name ${{ parameters.name }}

Pick one runtime (Kubernetes, ECS, or Nomad) for the paved road. Add others later if you must. The goal is fewer ways to be wrong.

What to measure (and what we’ve actually seen)

Track business outcomes, not lines of YAML deleted:

  • Lead time for changes – commit to prod. Target: same day for non-data changes.
  • Deployment frequency – per service per day/week.
  • MTTR – measured via Argo health + SLO burn rate. Target: <90 minutes.
  • Change failure rate – rollbacks/patches per deploy.
  • Onboarding time – new hire to first prod deploy.
  • Infra cost per service – requests vs. usage, idle pods, EBS volumes.

From three recent GitPlumbers engagements:

  • Lead time: 70–90% reduction once teams use the single workflow.
  • Onboarding: 5–10x faster to first prod change.
  • CFR: down 30–50% through consistent rollouts and canaries.
  • Cost: 10–25% savings via right-sized defaults and shared modules.

If these don’t move within six weeks, your paved road is either not actually paved (too many options) or it’s not the default path (no enforcement or incentives).

What I’d do differently next time

  • Start with the CLI and the CI contract, not a platform monolith. You can fake the infra modules and harden later.
  • Don’t over-index on Backstage first. Make the road work from a terminal; catalog comes after.
  • Version everything: gp@v0.3, terraform-modules@v1.5, deploy-workflow@v3.
  • Publish a short “When to leave the paved road” doc. Require a design review for exceptions over 30 days.
  • Keep the platform backlog boring: reliability, speed, and cost. Say no to bespoke features unless 3+ teams need them.

If you’re in the “every repo is special” swamp, the quickest win is shipping the reusable CI workflow and a thin gp deploy. Do that, and a lot of dragons disappear without a reorg. And if you want a crew that’s done this in messy real-world stacks, GitPlumbers lives for this work.


Key takeaways

  • Favor a paved-road platform with opinionated defaults over bespoke tools per team.
  • Standardize the contracts (service descriptor, pipelines, runtime) and let the platform translate to infra.
  • Ship an internal CLI with 3-5 verbs that developers actually use; everything else is an escape hatch.
  • Measure impact with DORA metrics and infra costs, not vanity dashboards.
  • Use policy-as-code to keep guardrails while allowing overrides via reviewable exceptions.

Implementation checklist

  • Define a minimal service descriptor (name, owner, runtime, ports, SLO, data needs).
  • Codify a single deploy workflow as a reusable CI job (GitHub Actions, GitLab CI).
  • Wrap infra in versioned Terraform modules; avoid snowflake plans in each repo.
  • Publish an internal CLI (Go/Cobra or TS/oclif) with create, deploy, db, logs, status.
  • Adopt GitOps (Argo CD/Flux) and a single Helm chart with sane defaults.
  • Enforce guardrails with OPA/Kyverno; allow documented exemptions with TTLs.
  • Track DORA metrics, onboarding time, and cloud spend per service.

Questions we hear from teams

How do we handle teams that need non-Kubernetes runtimes?
Make the paved road one runtime first. Provide a documented escape hatch via a separate workflow (e.g., ECS) and require a short design review. If 3+ teams choose it, consider making it a first-class paved road later.
What if our Terraform is already a mess across repos?
Create a wrapper module that encodes sane defaults and migrate incrementally. Start by moving just one resource (e.g., databases) under the module. Version it and cut a PR template that blocks direct changes outside the module path.
Do we need Backstage to start?
No. Start with a CLI and a reusable CI job. Backstage helps with discoverability and templates, but it’s optional. If you add it, wire it to your service descriptor so metadata stays in one place.
How do we avoid blocking engineers with guardrails?
Use policy-as-code with time-bound exceptions. Make exceptions easy to request, visible in code reviews, and automatically expiring. Alert on exceptions that pass their TTL.
How do we prove ROI to finance?
Track DORA metrics and infra cost by service. When lead time drops and CFR improves, pair that with cost deltas from right-sized resources. We typically see 10–25% savings within a quarter without feature freezes.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

