The Internal Platform That Stopped Our Infra Death Spiral
Stop asking every squad to be SREs. Give them a paved road that hides the messy bits and ships fast by default.
“Standardize the contracts and pave the road. Everything else is just YAML cosplay.”
You don’t need 40 SREs. You need a paved road
I’ve watched too many orgs confuse freedom with chaos. One recent Series D fintech had 120 engineers, 80 services, and 30 different ways to ship to Kubernetes. Every team ran its own Terraform, hand-rolled Helm charts, and copy-pasted kubectl incantations from a 2019 wiki. Incidents looked like archaeology. Onboarding took two weeks. MTTR hovered near four hours because nobody knew which pipeline mattered.
What saved them wasn’t another shiny platform slide deck. It was a brutally simple internal toolchain: a small CLI, a single Helm chart, versioned Terraform modules, a reusable CI workflow, and GitOps with Argo CD. Opinions over options. Defaults over bespoke. And measurable outcomes in weeks, not quarters.
Why this matters: infra complexity is a tax
When every squad runs their own infra flavor, you pay for it in:
- Cognitive load: engineers learn Terraform, Helm, Ingress, and IAM just to ship a Flask app.
- Inconsistent SLOs: prod behaviour depends on who last edited `values.yaml`.
- Slow delivery: lead time stretches from hours to days as teams reinvent pipelines.
- Cloud spend drift: five different autoscaling approaches, none monitored.
I’ve seen this fail at startups and at FAANG-scale alike. The fix isn’t to lock everything down; it’s to standardize the contracts and let the platform handle translation. Give teams an obvious, boring path that works 95% of the time, with reviewable escape hatches for the other 5%.
Before vs. after: what actually changes in a week
Here’s the before. A typical service deploy looked like this:
```makefile
# team-a/service/Makefile
apply:
	terraform -chdir=infra init
	terraform -chdir=infra apply -var env=$(ENV)
	helm upgrade --install svc charts/service -f env/$(ENV).yaml \
		--set image.tag=$(GIT_SHA)
	kubectl apply -f k8s/ingress-$(ENV).yaml
```

Each repo had its own snowflake. No two `env/*.yaml` files matched. Rollbacks required Slack archaeology.
And here’s the after. Same outcome, fewer footguns:
```shell
# developer machine (or CI)
# scaffold a service with paved-road defaults
$ gp app create --template=python-api --name=payments

# deploy through a single reusable workflow
$ gp deploy --env=staging
```

Under the hood:
- `gp app create` writes a `service.yaml`, a standard Dockerfile, and the only Helm values we let vary.
- CI calls a reusable workflow that renders one chart, updates a GitOps repo, and lets Argo CD sync.
- Terraform comes via a versioned module that encodes our VPC, IAM, KMS, and database opinions.
Result after four weeks at that fintech:
- Lead time: 3 days → 45 minutes median.
- Onboarding: 10 days → 1 day to first prod deploy.
- MTTR: 4h → 70m, thanks to consistent logs/metrics.
- Cloud bill: 18% savings from sane defaults and autoscaling that wasn’t copy/pasted from 2019.
The contracts: standardize what devs declare, not how it’s provisioned
Start with a minimal but strict service descriptor. This is the API between app teams and your platform.
```yaml
# service.yaml
name: payments
owner: team-payments
runtime: k8s
language: python
ports:
  - 8080
ingress:
  host: payments.example.com
  path: /
slo:
  availability: 99.9
  latency_p95_ms: 300
database:
  type: postgres
  storage_gb: 20
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
tier: 1
```

One chart, many services. Developers can override only what you expose:
```yaml
# helm/values.paved-road.yaml
replicaCount: 2
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70
logging: { level: INFO }
tracing: { enabled: true }
resources:
  requests: { cpu: 250m, memory: 256Mi }
  limits: { cpu: 1, memory: 512Mi }
```

Infra is encoded in versioned Terraform modules. Teams don’t author VPCs or KMS keys—they declare intent.
```hcl
# infra/main.tf
module "service" {
  source = "git::ssh://git.example.com/platform/terraform-modules//gp-service?ref=v1.4.0"

  name     = var.name
  env      = var.env
  runtime  = "k8s"
  database = var.database
  observability = {
    logs_bucket = data.terraform_remote_state.obs.outputs.logs_bucket
  }
}
```

And a single reusable CI job deploys everything consistently:
```yaml
# .github/workflows/deploy.yaml
name: Deploy
on:
  workflow_dispatch:
  push:
    branches: [ main ]
jobs:
  deploy:
    uses: org-platform/.github/workflows/deploy-paved-road.yaml@v3
    with:
      service_descriptor: service.yaml
      environment: staging
    secrets: inherit
```

All deploys now look the same in logs, metrics, and Argo CD. That’s how you cut MTTR and review time.
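The `slo` block in the descriptor is only useful if the platform turns it into something actionable, like an error budget that dashboards and alerts can burn against. A minimal sketch of that conversion (the function name is mine, not part of `gp`):

```go
package main

import "fmt"

// errorBudgetMinutes converts an availability SLO percentage (e.g. 99.9,
// as declared in service.yaml) into the allowed downtime per 30-day
// month, in minutes.
func errorBudgetMinutes(availabilityPct float64) float64 {
	const monthMinutes = 30 * 24 * 60 // 43200 minutes in a 30-day month
	return monthMinutes * (100 - availabilityPct) / 100
}

func main() {
	// The payments descriptor declares availability: 99.9,
	// which works out to roughly 43 minutes of downtime per month.
	fmt.Printf("monthly error budget: %.1f minutes\n", errorBudgetMinutes(99.9))
}
```

A 99.9% target leaves about 43 minutes a month; that number is what makes an MTTR target of under 90 minutes meaningful.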
The internal CLI: one verb away from done
Give devs a small binary that wraps the platform decisions. I like Go + Cobra for zero-dep installs.
```go
// cmd/gp/main.go
package main

import (
	"log"

	"github.com/spf13/cobra"
)

func main() {
	root := &cobra.Command{Use: "gp"}

	deploy := &cobra.Command{
		Use:   "deploy",
		Short: "Deploy current service via GitOps",
		RunE: func(cmd *cobra.Command, args []string) error {
			env, _ := cmd.Flags().GetString("env")
			svc, err := LoadServiceDescriptor("service.yaml")
			if err != nil {
				return err
			}
			// 1) Render values from service.yaml
			// 2) Commit to GitOps repo (branch + PR)
			// 3) Wait for Argo CD sync
			return DeployViaGitOps(svc, env)
		},
	}
	deploy.Flags().String("env", "staging", "target environment")
	root.AddCommand(deploy)

	if err := root.Execute(); err != nil {
		log.Fatal(err) // log.Fatal already exits with a non-zero status
	}
}
```

Three to five verbs is enough:
- `gp app create` – scaffold a repo with `service.yaml`, Dockerfile, and CI.
- `gp deploy` – update the GitOps repo and wait for health.
- `gp db create` – provision managed Postgres via the Terraform module; emit credentials into Secret Manager.
- `gp logs` – stream app logs with sane filters.
- `gp status` – show SLOs and Argo health.
Keep the CLI thin; push logic into APIs and reusable workflows so you can upgrade behavior without shipping binaries daily.
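To make the thin-CLI idea concrete, here's a minimal sketch of what a loader like the `LoadServiceDescriptor` referenced in the CLI might do. This is illustrative only: it parses flat `key: value` lines by hand to show the contract; a real implementation would use a YAML library and validate the full schema.

```go
package main

import (
	"fmt"
	"strings"
)

// ServiceDescriptor holds a few flat top-level fields of service.yaml.
// The struct and parser here are a sketch, not the real gp types.
type ServiceDescriptor struct {
	Name    string
	Owner   string
	Runtime string
}

// parseDescriptor is a hypothetical stand-in for LoadServiceDescriptor.
// It reads only simple "key: value" lines; nested blocks are skipped.
func parseDescriptor(src string) (ServiceDescriptor, error) {
	var d ServiceDescriptor
	for _, line := range strings.Split(src, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ": ")
		if !ok {
			continue // blank or nested lines are ignored in this sketch
		}
		switch key {
		case "name":
			d.Name = val
		case "owner":
			d.Owner = val
		case "runtime":
			d.Runtime = val
		}
	}
	// The contract is strict: a descriptor without an owner is rejected.
	if d.Name == "" || d.Owner == "" {
		return d, fmt.Errorf("service.yaml must declare name and owner")
	}
	return d, nil
}

func main() {
	d, err := parseDescriptor("name: payments\nowner: team-payments\nruntime: k8s")
	if err != nil {
		panic(err)
	}
	fmt.Println(d.Name, d.Owner, d.Runtime)
}
```

The point is that validation lives in one place: every verb loads the same descriptor, so a bad `service.yaml` fails identically in `deploy`, `status`, and CI.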
Guardrails: escape hatches without chaos
You will have edge cases. Don’t make them the default. Use policy-as-code with documented exceptions.
- OPA/Rego or Kyverno to enforce minimum resources, required labels, and network policies.
- Exceptions via a YAML file in the repo with an expiration date and a ticket link.
```rego
# policy/resources.rego
package policies

default allow = false

allow {
	input.resources.limits.cpu != null
	input.resources.limits.memory != null
	input.resources.requests.cpu != null
	input.resources.requests.memory != null
}
```

```yaml
# .platform/exception.yaml
id: allow-high-memory
reason: "Black Friday load testing"
expires: "2025-12-01"
approved_by: "sre-oncall"
overrides:
  resources:
    limits:
      memory: "4Gi"
```

CI checks these in a single place and flags expired exceptions. Ops keeps control; teams move fast.
Rollout plan: two sprints to impact
You don’t need a platform team of 20. You need focus.
Sprint 1 – Define contracts and defaults
- Write the `service.yaml` schema and a minimal Helm chart with paved-road values.
- Wrap your infra in a `gp-service` Terraform module (v0.1.0). No features, just the basics.
- Build a reusable CI workflow (v1) that takes `service.yaml` and drives GitOps.
Sprint 2 – Ship the CLI and one golden path
- Publish the `gp` binary with `app create` and `deploy`.
- Onboard 2-3 volunteer teams. Migrate a service end-to-end.
- Add guardrails (OPA) for resources and labels; wire logs/metrics dashboards.

Phase 2 – Hardening and escape hatches
- Publish `db create`, `logs`, and `status` commands.
- Cost controls: default requests/limits and HPA tuned for your workloads.
- Backstage integration so new services show up automatically in the catalog.
Backstage template example:
```yaml
# templates/service-template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: paved-road-service
spec:
  owner: platform
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
    - id: gp-init
      action: run:script
      input:
        script: gp app create --template ${{ parameters.language }} --name ${{ parameters.name }}
```

Pick one runtime (Kubernetes, ECS, or Nomad) for the paved road. Add others later if you must. The goal is fewer ways to be wrong.
What to measure (and what we’ve actually seen)
Track business outcomes, not lines of YAML deleted:
- Lead time for changes – commit to prod. Target: same day for non-data changes.
- Deployment frequency – per service per day/week.
- MTTR – measure by Argo health + SLO burn rate. Target: <90 minutes.
- Change failure rate – rollbacks/patches per deploy.
- Onboarding time – new hire to first prod deploy.
- Infra cost per service – requests vs. usage, idle pods, EBS volumes.
From three recent GitPlumbers engagements:
- Lead time: 70–90% reduction once teams use the single workflow.
- Onboarding: 5–10x faster to first prod change.
- CFR: down 30–50% through consistent rollouts and canaries.
- Cost: 10–25% savings via right-sized defaults and shared modules.
If these don’t move within six weeks, your paved road is either not actually paved (too many options) or it’s not the default path (no enforcement or incentives).
What I’d do differently next time
- Start with the CLI and the CI contract, not a platform monolith. You can fake the infra modules and harden later.
- Don’t over-index on Backstage first. Make the road work from a terminal; catalog comes after.
- Version everything: `gp@v0.3`, `terraform-modules@v1.5`, `deploy-workflow@v3`.
- Publish a short “When to leave the paved road” doc. Require a design review for exceptions over 30 days.
- Keep the platform backlog boring: reliability, speed, and cost. Say no to bespoke features unless 3+ teams need them.
If you’re in the “every repo is special” swamp, the quickest win is shipping the reusable CI workflow and a thin `gp deploy`. Do that, and a lot of dragons disappear without a reorg. And if you want a crew that’s done this in messy real-world stacks, GitPlumbers lives for this work.
Key takeaways
- Favor a paved-road platform with opinionated defaults over bespoke tools per team.
- Standardize the contracts (service descriptor, pipelines, runtime) and let the platform translate to infra.
- Ship an internal CLI with 3-5 verbs that developers actually use; everything else is an escape hatch.
- Measure impact with DORA metrics and infra costs, not vanity dashboards.
- Use policy-as-code to keep guardrails while allowing overrides via reviewable exceptions.
Implementation checklist
- Define a minimal service descriptor (name, owner, runtime, ports, SLO, data needs).
- Codify a single deploy workflow as a reusable CI job (GitHub Actions, GitLab CI).
- Wrap infra in versioned Terraform modules; avoid snowflake plans in each repo.
- Publish an internal CLI (Go/Cobra or TS/oclif) with create, deploy, db, logs, status.
- Adopt GitOps (Argo CD/Flux) and a single Helm chart with sane defaults.
- Enforce guardrails with OPA/Kyverno; allow documented exemptions with TTLs.
- Track DORA metrics, onboarding time, and cloud spend per service.
Questions we hear from teams
- How do we handle teams that need non-Kubernetes runtimes?
- Make the paved road one runtime first. Provide a documented escape hatch via a separate workflow (e.g., ECS) and require a short design review. If 3+ teams choose it, consider making it a first-class paved road later.
- What if our Terraform is already a mess across repos?
- Create a wrapper module that encodes sane defaults and migrate incrementally. Start by moving just one resource (e.g., databases) under the module. Version it and cut a PR template that blocks direct changes outside the module path.
- Do we need Backstage to start?
- No. Start with a CLI and a reusable CI job. Backstage helps with discoverability and templates, but it’s optional. If you add it, wire it to your service descriptor so metadata stays in one place.
- How do we avoid blocking engineers with guardrails?
- Use policy-as-code with time-bound exceptions. Make exceptions easy to request, visible in code reviews, and automatically expiring. Alert on exceptions that pass their TTL.
- How do we prove ROI to finance?
- Track DORA metrics and infra cost by service. When lead time drops and CFR improves, pair that with cost deltas from right-sized resources. We typically see 10–25% savings within a quarter without feature freezes.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
