Stop Building a Platform; Build a Paved Road: “Just-Enough” Patterns That Unblock Teams

If your platform feels like a toll booth, you’ve overbuilt it. Here’s the minimal, boring stack that lets product teams ship without filing tickets or learning your bespoke CLI.

Paved roads beat platforms. If engineers can ship without asking permission, you built the right thing.

When the platform became the product (and froze delivery)

I walked into a fintech that had “finished” its internal developer platform. A custom Go CLI for everything, a homegrown release manager, and a dashboard no one trusted. PRs waited three days for a deployment ticket to clear. Meanwhile, the CFO’s weekly report showed EKS spend doubling while feature throughput halved. MTTR? North of four hours, because rollbacks required a platform engineer on Slack.

I’ve seen this movie. We centralize to “help” teams, then bury them under opinions, abstractions, and tickets. The result is a platform that becomes the product—and the actual product stalls.

What worked here wasn’t another rewrite. It was going the other direction: a just-enough platform—boring, paved-road defaults, GitOps, and clear off-ramps. No silver bullets. Just fewer moving parts and predictable paths to prod.

What “just-enough platform” actually means

A just-enough platform is not a suite. It’s a thin layer that:

  • Offers an opinionated paved road with sane defaults.
  • Makes the basics zero-configuration: build, test, containerize, deploy, observe.
  • Stays ejectable: teams can deviate if they accept the blast radius and support.
  • Measures outcomes, not tool adoption: DORA metrics, MTTR, and cost per service.

The thin waist looks like this:

  • git (GitHub/GitLab) → GitHub Actions or GitLab CI → GHCR/ECR → ArgoCD → Kubernetes (EKS/GKE).
  • IaC with Terraform for cloud primitives. Helm or Kustomize for app manifests.
  • Backstage for discovery and golden paths. Prometheus/Grafana + OpenTelemetry for SLOs and traces.

Principles:

  • Defaults over enforcement. Guardrails with Kyverno/OPA, not gates via tickets.
  • One of each. One CI, one GitOps, one registry. Swap later behind interfaces.
  • No bespoke CLIs. If kubectl, helm, or argocd can do it, use them.
  • Document off-ramps. If a team needs Spinnaker or Pulumi, fine—own the on-call and costs.

Paved road, not lock-in: templates you can eject from

Ship a real service template that compiles, tests, traces, and deploys on day one. No workshops required.

Example repo layout:

  • service/: app code (Node, Go, or Python; use language-specific templates)
  • .github/workflows/ci.yml: container build + test
  • deploy/: Helm chart (or Kustomize base/overlays)
  • opentelemetry.yaml: default tracing exporter config (sketched below)
  • docs/runbook.md: SLOs and on-call rotation

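For the opentelemetry.yaml in the layout above, here is a minimal sketch of what a Collector-style default might contain, assuming traces are shipped over OTLP to a cluster-local collector; the otel-collector.observability endpoint is a placeholder:

receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp:
    # Placeholder endpoint; point this at whatever collector/gateway the platform runs.
    endpoint: otel-collector.observability:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
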
A minimal GitHub Actions pipeline that most teams won’t need to touch:

name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test -- --ci
      - run: docker build -t ghcr.io/acme/payment:${{ github.sha }} .
      - run: echo ${{ secrets.GHCR_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - run: docker push ghcr.io/acme/payment:${{ github.sha }}

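If you want that pipeline maintained in one place, the same steps can live in a reusable workflow that service repos call as a single job. A sketch, assuming a hypothetical acme/platform-workflows repo holds the shared definition:

# Shared definition: .github/workflows/build-push.yml in acme/platform-workflows (hypothetical repo)
name: build-push
on:
  workflow_call:
    inputs:
      image:
        required: true
        type: string
    secrets:
      REGISTRY_TOKEN:
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t ${{ inputs.image }}:${{ github.sha }} .
      - run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - run: docker push ${{ inputs.image }}:${{ github.sha }}

# Caller in a service repo (.github/workflows/ci.yml)
name: ci
on: [push]
jobs:
  build:
    uses: acme/platform-workflows/.github/workflows/build-push.yml@v1
    with:
      image: ghcr.io/acme/payment
    secrets:
      REGISTRY_TOKEN: ${{ secrets.GHCR_TOKEN }}
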
A simple Helm values file that’s safe by default:

replicaCount: 3
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

And ArgoCD tracks the desired state from a single environments repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment
spec:
  project: default
  destination:
    namespace: payments
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/acme/infra-environments
    path: overlays/prod/payment
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

If a team needs to eject (say, they want Kafka Streams with custom sidecars), they can swap the deploy/ directory or add an overlay. They still benefit from the paved road for CI, registry, and observability.
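
For the overlay route, a minimal sketch of what that deviation might look like; the file and service names here are hypothetical:

# deploy/overlays/prod/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # Adds the team's custom sidecar without forking the paved-road base.
  - path: kafka-sidecar-patch.yaml
    target:
      kind: Deployment
      name: payment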

Before/after: the boring path beat the bespoke one

Two recent engagements where “just-enough” won on both speed and cost.

  • Fintech (40 services, PCI scope)

    • Before: Custom Go platform CLI, Spinnaker, three bespoke admission controllers, five different CI systems. Lead time 3–5 days. Change failure rate ~20%. MTTR ~4h. Monthly K8s bill: $310k.
    • After: One CI (GitHub Actions), ArgoCD app-of-apps, standard Helm charts, Backstage catalog + templates, Kyverno policies for limits and image signatures, OpenTelemetry tracing. Lead time down 70% (to same-day). Change failure rate 8–10%. MTTR ~45m. K8s bill down 28% within 60 days (fewer nodes, right-sized requests).
  • B2B SaaS (multi-tenant data plane)

    • Before: 17 EKS clusters (per-customer), snowflake Terraform, manual certificates. Nightly build farm ran hot 24/7. SRE team drowning in tickets.
    • After: Consolidated to 4 clusters with namespace-based tenancy, Kyverno for quotas and network policies, cert-manager, ExternalDNS, and cluster-autoscaler. ArgoCD managed per-tenant overlays. Saved ~$85k/month. On-call pages dropped 43%. New tenant onboarding time from 3 days to 90 minutes.

The pattern: fewer tools, fewer handoffs, fewer exceptions. The paved road isn’t magic; it’s friction removal.

The minimum viable platform (MVP) you can stand up in weeks

You don’t need a platform team of 20. You need a minimal, boring stack and crisp responsibilities.

  • Cloud: EKS 1.29 (or GKE Autopilot), Terraform modules (terraform-aws-modules/eks v19.21.0)
  • GitOps: ArgoCD (app-of-apps), single infra-environments repo, Helm for charts (root app sketched below)
  • CI: GitHub Actions with reusable workflows
  • Discovery: Backstage for catalog and templates
  • Guardrails: Kyverno policies for limits, quotas, image signatures; OPA/Conftest for Terraform PRs
  • Observability: Prometheus, Loki, Grafana, OpenTelemetry SDKs in templates

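The app-of-apps root mentioned above is just one more Application whose source directory contains the child Application manifests. A sketch, assuming an apps/prod folder in the infra-environments repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod-apps
  namespace: argocd
spec:
  project: default
  destination:
    namespace: argocd
    server: https://kubernetes.default.svc
  source:
    repoURL: https://github.com/acme/infra-environments
    path: apps/prod        # each manifest here is itself an Application (payment, billing, ...)
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
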
Terraform for EKS (trimmed for clarity):

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "19.21.0"
  cluster_version = "1.29"
  cluster_name    = "acme-prod"
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnets
  enable_irsa     = true

  eks_managed_node_groups = {
    general = {
      desired_size = 3
      instance_types = ["m6i.large"]
      min_size = 2
      max_size = 10
    }
  }
}

A Kyverno policy that forces sane defaults without tickets:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: enforce
  rules:
    - name: require-cpu-mem
      match:
        resources:
          kinds: [Deployment, StatefulSet]
      validate:
        message: "cpu/memory requests and limits are required"
        pattern:
          spec:
            template:
              spec:
                containers:
                  - resources:
                      requests:
                        cpu: "?*"
                        memory: "?*"
                      limits:
                        cpu: "?*"
                        memory: "?*"

The Backstage golden-path template points to the service skeleton, so “new service” means code + CI + deploy in one click. Keep the template opinionated: tracing on, health checks required, runbook.md present, SLO.md stubbed.
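
For reference, a trimmed sketch of what that golden-path template might look like in Backstage’s scaffolder format; the names, owner, and skeleton path are assumptions, not a drop-in:

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: paved-road-service
  title: Paved-Road Service
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          title: Service name
          type: string
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish repo
      action: publish:github
      input:
        repoUrl: github.com?owner=acme&repo=${{ parameters.name }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml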

Operating model: platform is a product, not a police force

You’ll fail if your operating model is “open a ticket.” Instead:

  • Publish platform SLOs: e.g., 99.9% for ArgoCD sync availability, <10m restore time for the environments repo.
  • Create a public RFC process. Breaking changes require an RFC and a migration timeline.
  • Set a clear deprecation policy: 90 days for minor, 180 for major; platform supplies codemods/migration scripts.
  • Define ownership boundaries:
    • Platform owns templates, infra repos, cluster ops, guardrails, and docs.
    • Product teams own services, cost, and on-call beyond the paved road.
  • Run weekly office hours and publish migration dashboards in Backstage.
  • Budget for enablement over enforcement: pair on the first two migrations per team.

KPIs to track jointly with product:

  • Lead time for changes, MTTR, change failure rate (DORA)
  • Infra cost per service (+ trend), idle vs. used CPU/mem, node spot/on-demand mix
  • Developer sentiment (quarterly survey): “It’s easy to ship on the paved road” score

The 90‑day playbook to get there

You don’t need a two-year platform roadmap. You need a quarter.

  1. Baseline. Pull 90 days of DORA metrics, incident data, and the cloud bill. Pick 3–5 exemplar services to migrate.
  2. Choose defaults. One CI, one GitOps, one registry, one IaC framework. Write them down.
  3. Build the service template. Include CI, Dockerfile, Helm/Kustomize, health checks, OpenTelemetry, and runbook.md.
  4. Stand up GitOps. Install ArgoCD. Create an infra-environments repo with dev/stage/prod and app-of-apps.
  5. Add guardrails. Kyverno policies for limits/quotas, image signature verification; Conftest in Terraform PRs (CI sketch after this list).
  6. Backstage. Catalog existing services, publish the golden-path template and a scorecard.
  7. Migrate two services end-to-end. Pair program with the product teams. Measure before/after metrics.
  8. Publish the eject path. Document how to deviate and what responsibilities shift to the team.
  9. Review and iterate. Kill anything unused. Don’t add features unless a metric demands it.
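
For step 5, the Conftest check is just another job on Terraform PRs: render the plan as JSON and evaluate it against Rego policies. A sketch, assuming backend and cloud credentials are configured elsewhere in the workflow and policies live in policy/:

jobs:
  terraform-policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # so the JSON plan can be redirected cleanly
      - run: terraform init
      - run: terraform plan -out=tfplan
      - run: terraform show -json tfplan > tfplan.json
      # Evaluate the rendered plan against OPA/Rego policies using the official Conftest image.
      - run: docker run --rm -v "$PWD":/project openpolicyagent/conftest test tfplan.json --policy policy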

Anti‑patterns to avoid (I’ve seen these sink good teams)

  • Building a bespoke CLI to hide kubectl/helm. It will drift, and only two people will understand it.
  • Multi-cloud “for portability” on day one. You’ll standardize on least common denominator and pay double.
  • Service mesh everywhere. Start without Istio. Add when you actually need mTLS or traffic policy complexity.
  • Per-PR ephemeral environments for everything. Great for UI smoke tests; a money pit for stateful systems.
  • Animated dashboards instead of SLOs. Measure user-facing latency, error budgets, and MTTR.
  • Ticket-driven deployment gates. Replace with policies and automated checks in CI/GitOps.
  • Snowflake clusters per team. Use namespaces and quotas; scale clusters when you outgrow them.

If it feels clever, it probably won’t scale. If it’s boring and well-understood, engineers will ship on it. That’s the point.

Key takeaways

  • A just-enough platform prioritizes paved-road defaults and ejectability over custom tooling.
  • Treat the platform like a product with SLOs, a public roadmap, and a clear deprecation policy.
  • Use a thin-waist architecture: VCS → CI → registry → GitOps → runtime. Swap parts without re-platforming.
  • Measure what matters: lead time, MTTR, change failure rate, and infra cost per service.
  • Start with service templates, ArgoCD app-of-apps, and guardrails; add complexity only when metrics demand it.

Implementation checklist

  • Pick one CI and one GitOps tool; document them as the paved road.
  • Ship a service template repo with e2e CI, container build, Helm/Kustomize, and tracing baked in.
  • Stand up a single Backstage catalog with golden-path docs and scorecards.
  • Adopt ArgoCD app-of-apps to keep environments simple and auditable.
  • Enforce minimal guardrails with Kyverno/OPA: resource limits, image provenance, namespace quotas.
  • Define platform SLOs (e.g., 99.9% for GitOps sync) and an RFC process for breaking changes.
  • Baseline DORA + cost KPIs and review them weekly with product and platform leads.
  • Publish an eject path: teams can deviate if they own the on-call and costs.

Questions we hear from teams

How do we avoid creating a new bottleneck in the platform team?
Keep the platform surface area small. Offer paved-road defaults and guardrails, not custom workflows. Publish an eject path and require teams that deviate to own on-call and costs. Use GitOps so deployments don’t route through platform engineers.
What if a team needs something the paved road doesn’t support (e.g., `Kafka` with custom networking)?
Let them deviate with an RFC. If it’s broadly useful, productize it into the template. Otherwise, they own the runbooks and SLOs. The thin waist (VCS → CI → registry → GitOps → runtime) ensures deviations don’t force a re-platform.
Is `ArgoCD` mandatory? Can we use `Flux`?
Pick one. `ArgoCD` and `Flux` both work. The point is GitOps with automated sync and auditable drift, not the specific logo. If you switch later, keep the same repository structure to minimize churn.
How do we measure success without gaming the numbers?
Use DORA metrics (lead time, deployment frequency, change failure rate, MTTR) from your VCS/CI/CD events, not self-reporting. Pair with cost per service and error budget burn. Publish a monthly scorecard in Backstage.
What size team do we need to run this?
We’ve run this with 3–5 platform engineers supporting 20–40 services. Focus on enablement, automation, and ruthless scope control. Add headcount only when SLOs slip or migrations back up.
