Stop Building a Platform; Build a Paved Road: “Just-Enough” Patterns That Unblock Teams
If your platform feels like a toll booth, you’ve overbuilt it. Here’s the minimal, boring stack that lets product teams ship without filing tickets or learning your bespoke CLI.
Paved roads beat platforms. If engineers can ship without asking permission, you built the right thing.
When the platform became the product (and froze delivery)
I walked into a fintech that had “finished” its internal developer platform. Custom `golang` CLI for everything, homegrown release manager, and a dashboard no one trusted. PRs waited three days for a deployment ticket to clear. Meanwhile, the CFO’s weekly report showed their `EKS` spend doubling while feature throughput halved. MTTR? North of four hours because rollbacks required a platform engineer on Slack.
I’ve seen this movie. We centralize to “help” teams, then bury them under opinions, abstractions, and tickets. The result is a platform that becomes the product—and the actual product stalls.
What worked here wasn’t another rewrite. It was going the other direction: a just-enough platform—boring, paved-road defaults, GitOps, and clear off-ramps. No silver bullets. Just fewer moving parts and predictable paths to prod.
What “just-enough platform” actually means
A just-enough platform is not a suite. It’s a thin layer that:
- Offers an opinionated paved road with sane defaults.
- Makes the basics zero-configuration: build, test, containerize, deploy, observe.
- Stays ejectable: teams can deviate if they accept the blast radius and support.
- Measures outcomes, not tool adoption: DORA metrics, MTTR, and cost per service.
The thin waist looks like this:
- `git` (GitHub/GitLab) → `GitHub Actions` or `GitLab CI` → `ghcr`/`ECR` → `ArgoCD` → `Kubernetes` (`EKS`/`GKE`).
- IaC with `Terraform` for cloud primitives.
- `Helm` or `Kustomize` for app manifests.
- `Backstage` for discovery and golden paths.
- `Prometheus`/`Grafana` + `OpenTelemetry` for SLOs and traces.
Principles:
- Defaults over enforcement. Guardrails with `Kyverno`/`OPA`, not gates via tickets.
- One of each. One CI, one GitOps, one registry. Swap later behind interfaces.
- No bespoke CLIs. If `kubectl`, `helm`, or `argocd` can do it, use them.
- Document off-ramps. If a team needs `Spinnaker` or `Pulumi`, fine—own the on-call and costs.
Paved road, not lock-in: templates you can eject from
Ship a real service template that compiles, tests, traces, and deploys on day one. No workshops required.
Example repo layout:
service/
- app code (`node`, `go`, `python`—use language-specific templates)
- `.github/workflows/ci.yml`: container build + test
- `deploy/` with `helm` chart (or `kustomize` base/overlays)
- `opentelemetry.yaml` with default tracing exporter
- `docs/runbook.md` including SLOs and on-call rotation
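One way to ship the `opentelemetry.yaml` piece is as a small OpenTelemetry Collector config the template carries alongside the app. A minimal sketch, assuming an OTLP-speaking backend; the exporter endpoint is a placeholder:

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlphttp:
    # placeholder: point this at whatever backs your tracing pipeline
    endpoint: http://otel-collector.observability:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]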
A minimal `GitHub Actions` pipeline that most teams won’t need to touch:
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test -- --ci
      - run: docker build -t ghcr.io/acme/payment:${{ github.sha }} .
      - run: echo ${{ secrets.GHCR_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - run: docker push ghcr.io/acme/payment:${{ github.sha }}
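Once a few services copy this verbatim, promote it to a reusable workflow so fixes land everywhere at once. A sketch of the caller side, assuming a shared `platform-workflows` repo; the repo path, workflow name, and input are hypothetical:

name: ci
on: [push]
jobs:
  build:
    # the referenced workflow must declare `on: workflow_call`
    uses: acme/platform-workflows/.github/workflows/node-service.yml@v1
    with:
      service-name: payment
    secrets: inherit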
A simple `Helm` values file that’s safe by default:
replicaCount: 3
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
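Health checks belong in the same defaults. If the paved-road chart exposes probes as values (newer `helm create` scaffolds do), the template can ship them pre-wired; the `/healthz` and `/readyz` paths below are placeholders for whatever your skeleton serves:

livenessProbe:
  httpGet:
    path: /healthz
    port: http
readinessProbe:
  httpGet:
    path: /readyz
    port: http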
And `ArgoCD` tracks the desired state from a single `environments` repo:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment
  namespace: argocd   # Applications live in the Argo CD control-plane namespace
spec:
  project: default    # required field; scope with AppProjects as you grow
  destination:
    namespace: payments
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/acme/infra-environments
    path: overlays/prod/payment
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
If a team needs to eject (say, they want `Kafka Streams` with custom sidecars), they can swap the `deploy/` directory or add an overlay. They still benefit from the paved road for CI, registry, and observability.
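A sketch of what that eject can look like as a `kustomize` overlay, assuming the paved-road manifests live under `deploy/base`; the patch file name is hypothetical:

# deploy/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # team-owned patch that adds the Kafka Streams sidecar
  - path: kafka-streams-sidecar.yaml
    target:
      kind: Deployment
      name: payment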
Before/after: the boring path beat the bespoke one
Two recent engagements where “just-enough” won on both speed and cost.
Fintech (40 services, PCI scope)
- Before: Custom `golang` platform CLI, `Spinnaker`, three bespoke admission controllers, five different CI systems. Lead time 3–5 days. Change failure rate ~20%. MTTR ~4h. Monthly K8s bill: $310k.
- After: One CI (`GitHub Actions`), `ArgoCD` app-of-apps, standard `Helm` charts, `Backstage` catalog + templates, `Kyverno` policies for limits and image signatures, `OpenTelemetry` tracing. Lead time down 70% (to same-day). Change failure rate 8–10%. MTTR ~45m. K8s bill down 28% within 60 days (fewer nodes, right-sized requests).
B2B SaaS (multi-tenant data plane)
- Before: 17 `EKS` clusters (per-customer), snowflake Terraform, manual certificates. Nightly build farm ran hot 24/7. SRE team drowning in tickets.
- After: Consolidated to 4 clusters with namespace-based tenancy, `Kyverno` for quotas and network policies, `cert-manager`, `ExternalDNS`, and `cluster-autoscaler`. `ArgoCD` managed per-tenant overlays. Saved ~$85k/month. On-call pages dropped 43%. New tenant onboarding time from 3 days to 90 minutes.
The pattern: fewer tools, fewer handoffs, fewer exceptions. The paved road isn’t magic; it’s friction removal.
The minimum viable platform (MVP) you can stand up in weeks
You don’t need a platform team of 20. You need a minimal, boring stack and crisp responsibilities.
- Cloud: `EKS 1.29` (or `GKE Autopilot`), `Terraform` modules (`terraform-aws-modules/eks` v19.21.0)
- GitOps: `ArgoCD` (app-of-apps; see the sketch after this list), single `infra-environments` repo, `Helm` for charts
- CI: `GitHub Actions` with reusable workflows
- Discovery: `Backstage` for catalog and templates
- Guardrails: `Kyverno` policies for limits, quotas, image signatures; `OPA`/`Conftest` for Terraform PRs
- Observability: `Prometheus`, `Loki`, `Grafana`, `OpenTelemetry` SDKs in templates
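The app-of-apps root is just another `Application` pointing at a directory of `Application` manifests like the `payment` one earlier. A sketch; the `apps/prod` layout of the `infra-environments` repo is an assumption:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  source:
    repoURL: https://github.com/acme/infra-environments
    path: apps/prod          # one Application manifest per service lives here
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true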
`Terraform` for `EKS` (trimmed for clarity):
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "19.21.0"
cluster_version = "1.29"
cluster_name = "acme-prod"
vpc_id = var.vpc_id
subnet_ids = var.private_subnets
enable_irsa = true
eks_managed_node_groups = {
general = {
desired_size = 3
instance_types = ["m6i.large"]
min_size = 2
max_size = 10
}
}
}
A `Kyverno` policy that forces sane defaults without tickets:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: enforce
  rules:
    - name: require-cpu-mem
      match:
        resources:
          kinds: [Deployment, StatefulSet]
      validate:
        message: "cpu/memory requests and limits are required"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      requests:
                        cpu: "?*"
                        memory: "?*"
                      limits:
                        cpu: "?*"
                        memory: "?*"
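Image signature verification, referenced throughout, follows the same shape. A sketch of a `Kyverno` `verifyImages` rule against a cosign public key; the registry glob and key are placeholders, and the exact schema varies by Kyverno version:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: enforce
  background: false
  rules:
    - name: require-signed-images
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "ghcr.io/acme/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <your cosign public key>
                      -----END PUBLIC KEY-----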
The `Backstage` golden-path template points to the service skeleton, so “new service” means code + CI + deploy in one click. Keep the template opinionated: tracing on, health checks required, `runbook.md` present, `SLO.md` stubbed.
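For reference, a trimmed sketch of that golden-path template in `Backstage`’s scaffolder format; the owner, skeleton path, and GitHub org are assumptions:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: paved-road-service
  title: Paved-Road Service
spec:
  owner: group:default/platform
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton        # the repo layout described earlier
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml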
Operating model: platform is a product, not a police force
You’ll fail if your operating model is “open a ticket.” Instead:
- Publish platform SLOs: e.g., 99.9% for `ArgoCD` sync availability, <10m restore time for the `environments` repo (see the alert sketch after this list).
- Create a public RFC process. Breaking changes require an RFC and a migration timeline.
- Set a clear deprecation policy: 90 days for minor, 180 for major; platform supplies codemods/migration scripts.
- Define ownership boundaries:
  - Platform owns templates, infra repos, cluster ops, guardrails, and docs.
  - Product teams own services, cost, and on-call beyond the paved road.
- Run weekly office hours and publish migration dashboards in `Backstage`.
- Budget for enablement over enforcement: pair on the first two migrations per team.
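A sketch of wiring the GitOps SLO into alerting, assuming the prometheus-operator CRDs and `ArgoCD`’s default `argocd_app_info` metric; the threshold and labels are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: platform-gitops-slo
  namespace: monitoring
spec:
  groups:
    - name: argocd-sync
      rules:
        - alert: ArgoCDAppOutOfSync
          expr: sum by (name) (argocd_app_info{sync_status!="Synced"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.name }} has been out of sync for more than 10 minutes"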
KPIs to track jointly with product:
- Lead time for changes, MTTR, change failure rate (DORA)
- Infra cost per service (+ trend), idle vs. used CPU/mem, node spot/on-demand mix
- Developer sentiment (quarterly survey): “It’s easy to ship on the paved road” score
The 90‑day playbook to get there
You don’t need a two-year platform roadmap. You need a quarter.
- Baseline. Pull 90 days of DORA metrics, incident data, and the cloud bill. Pick 3–5 exemplar services to migrate.
- Choose defaults. One CI, one GitOps, one registry, one IaC framework. Write them down.
- Build the service template. Include CI, Dockerfile, Helm/Kustomize, health checks, `OpenTelemetry`, and `runbook.md`.
- Stand up GitOps. Install `ArgoCD`. Create an `infra-environments` repo with `dev/stage/prod` and app-of-apps.
- Add guardrails. `Kyverno` policies for limits/quotas, image signature verification; `Conftest` in Terraform PRs (see the sketch after this list).
- Backstage. Catalog existing services, publish the golden-path template and a scorecard.
- Migrate two services end-to-end. Pair program with the product teams. Measure before/after metrics.
- Publish the eject path. Document how to deviate and what responsibilities shift to the team.
- Review and iterate. Kill anything unused. Don’t add features unless a metric demands it.
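For the `Conftest` guardrail in the playbook, a minimal sketch run against `terraform show -json plan.out`; the specific rule is illustrative, not a prescribed policy set:

# policy/paved_road.rego, evaluated with: conftest test tfplan.json
package main

import rego.v1

deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_iam_user"
  msg := sprintf("%s: long-lived IAM users are off the paved road; use IRSA roles instead", [rc.address])
}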
Anti‑patterns to avoid (I’ve seen these sink good teams)
- Building a bespoke CLI to hide `kubectl`/`helm`. It will drift, and only two people will understand it.
- Multi-cloud “for portability” on day one. You’ll standardize on the least common denominator and pay double.
- Service mesh everywhere. Start without `Istio`. Add it when you actually need mTLS or traffic policy complexity.
- Per-PR ephemeral environments for everything. Great for UI smoke tests; a money pit for stateful systems.
- Animated dashboards instead of SLOs. Measure user-facing latency, error budgets, and MTTR.
- Ticket-driven deployment gates. Replace with policies and automated checks in CI/GitOps.
- Snowflake clusters per team. Use namespaces and quotas; scale clusters when you outgrow them.
If it feels clever, it probably won’t scale. If it’s boring and well-understood, engineers will ship on it. That’s the point.
Key takeaways
- A just-enough platform prioritizes paved-road defaults and ejectability over custom tooling.
- Treat the platform like a product with SLOs, a public roadmap, and a clear deprecation policy.
- Use a thin-waist architecture: VCS → CI → registry → GitOps → runtime. Swap parts without re-platforming.
- Measure what matters: lead time, MTTR, change failure rate, and infra cost per service.
- Start with service templates, ArgoCD app-of-apps, and guardrails; add complexity only when metrics demand it.
Implementation checklist
- Pick one CI and one GitOps tool; document them as the paved road.
- Ship a service template repo with e2e CI, container build, Helm/Kustomize, and tracing baked in.
- Stand up a single Backstage catalog with golden-path docs and scorecards.
- Adopt ArgoCD app-of-apps to keep environments simple and auditable.
- Enforce minimal guardrails with Kyverno/OPA: resource limits, image provenance, namespace quotas.
- Define platform SLOs (e.g., 99.9% for GitOps sync) and an RFC process for breaking changes.
- Baseline DORA + cost KPIs and review them weekly with product and platform leads.
- Publish an eject path: teams can deviate if they own the on-call and costs.
Questions we hear from teams
- How do we avoid creating a new bottleneck in the platform team?
- Keep the platform surface area small. Offer paved-road defaults and guardrails, not custom workflows. Publish an eject path and require teams that deviate to own on-call and costs. Use GitOps so deployments don’t route through platform engineers.
- What if a team needs something the paved road doesn’t support (e.g., `Kafka` with custom networking)?
- Let them deviate with an RFC. If it’s broadly useful, productize it into the template. Otherwise, they own the runbooks and SLOs. The thin waist (VCS → CI → registry → GitOps → runtime) ensures deviations don’t force a re-platform.
- Is `ArgoCD` mandatory? Can we use `Flux`?
- Pick one. `ArgoCD` and `Flux` both work. The point is GitOps with automated sync and auditable drift, not the specific logo. If you switch later, keep the same repository structure to minimize churn.
- How do we measure success without gaming the numbers?
- Use DORA metrics (lead time, deployment frequency, change failure rate, MTTR) from your VCS/CI/CD events, not self-reporting. Pair with cost per service and error budget burn. Publish a monthly scorecard in Backstage.
- What size team do we need to run this?
- We’ve run this with 3–5 platform engineers supporting 20–40 services. Focus on enablement, automation, and ruthless scope control. Add headcount only when SLOs slip or migrations back up.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.