The ADRs and Paved Roads That Killed Drift and Made Refactors Boring

Your teams don’t need another bespoke tool—they need decisions written down and a default path that makes the right thing the easy thing. Here’s the playbook that’s actually worked at scale.

Make the right thing the default, and drift becomes the exception, not the culture.

The outage that shouldn’t have happened

I’ve watched a Fortune 500 burn a weekend because two teams “solved” ingress differently. One used a bespoke NGINX Ingress with hand-tuned annotations; the other half‑migrated to Istio with Gateway API. Certs renewed on Tuesday, DNS lagged, traffic split went sideways, and on‑call ate glass. The kicker: both solutions were decent—just inconsistent. There was no written decision, no ADR explaining which ingress to use or why, and no paved road that made the right choice obvious.

If any of that feels familiar, you don’t have a tooling problem. You have a decision hygiene problem and a defaults problem. ADRs stop you from re-litigating the same decision every quarter. Paved roads make doing the right thing fast. Together, they kill drift and make refactors boring.

ADRs: make decisions durable (and debatable)

ADRs—Architecture Decision Records—are not paperwork. They’re the cheapest way to capture context before it evaporates.

  • Where they live: In the repo that the decision affects, under docs/adr.
  • Who owns them: The team that owns the code; platform reviews for cross‑cutting decisions.
  • When to write one: Any change that affects many services, requires coordination, or changes a platform contract.

I’ve seen this fail when ADRs become a Confluence graveyard. Keep them in Git, versioned, and referenced in PRs.

Example ADR:

# ADR 0005: Standardize on Gateway API for north-south traffic

Date: 2025-04-15
Status: Accepted
Context:
- We currently run NGINX Ingress v1.8 with per-team annotations and Istio 1.19 for some services.
- TLS renewals and DNS records are handled inconsistently, causing incident #INC-3421.
Decision:
- Adopt Kubernetes Gateway API with Contour 1.29 for all new services.
- Use cert-manager for TLS and ExternalDNS for DNS.
- Provide a Helm chart `platform/ingress-gateway` as the paved road.
Consequences:
- Teams migrating from NGINX must move ingress rules to Gateway API `HTTPRoute`.
- Platform will maintain upgrade schedule; exceptions require `adr-exception` label.
Links:
- RFC #1234
- Runbook: k8s-ingress-gateway.md

Simple, searchable, and it survives reorgs.

Paved roads: cheap velocity, expensive chaos avoided

A paved road (golden path) is the set of defaults that make the 80% case trivial: opinionated templates, one ingress flavor, one CI template, blessed Terraform modules. Not a mandate—just the frictionless path.

  • Make it ergonomic: Template repos, Backstage scaffolder, cookiecutter, or projen. One command to bootstrap.
  • Bake in guardrails: pre-commit hooks, conftest for policy, CODEOWNERS on critical files, and GitOps (ArgoCD/Flux) for drift visibility (sketch below).
  • Document the contract: What the paved road guarantees (SLOs, upgrade cadence) and what it expects (resource quotas, labels).
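
The guardrails don’t need much ceremony. A minimal `.pre-commit-config.yaml` might look like the sketch below; hook revisions and the policy/manifest paths are placeholders for whatever your platform actually pins:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0  # pin the revision your platform has vetted
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.92.0  # illustrative; pin your own
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
  - repo: local
    hooks:
      - id: conftest
        name: conftest policy check
        entry: conftest test -p policy/ manifests/  # assumes policies in policy/, manifests in manifests/
        language: system
        pass_filenames: false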

If paved roads feel like red tape, you built a toll road. If they feel like cheating, you got it right.

Before/after: concrete fixes that stopped drift

Here are three changes we’ve implemented at clients that paid for themselves within a quarter.

1) Ingress standardization

Before:

  • Two ingress stacks: nginx-ingress with ad-hoc annotations and Istio for “special” services.
  • Cert renewal handled per-team; DNS updated by whoever remembered.
  • MTTR for ingress incidents: 3–5 hours; on-call fatigue high.

After (paved road):

  • Gateway API + Contour, cert-manager, ExternalDNS as the default.
  • Helm chart platform/ingress-gateway with sane defaults: HSTS on, HTTP->HTTPS redirect, max-body-size, standard retries.
  • OPA policy to block direct Ingress resources in new namespaces.
  • ADR documents the why and the migration path (target HTTPRoute sketched below).
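
For most services the migration target is small. A minimal `HTTPRoute` under the paved-road chart looks roughly like this; the gateway name, namespace, hostname, and port are placeholders:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders
  namespace: orders
spec:
  parentRefs:
    - name: platform-gateway      # shared Gateway owned by the platform chart
      namespace: platform-ingress
  hostnames:
    - orders.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: orders
          port: 8080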

Snippet: conftest policy to prevent drift:

package k8s.policy

# Block raw Ingress objects so new services land on the Gateway API paved road.
# Evaluate with: conftest test --namespace k8s.policy manifests/
violation[msg] {
  input.kind == "Ingress"
  msg := "Use Gateway API; Ingress is disabled by policy"
}

Results (3 months):

  • Ingress-related incidents down 68%.
  • New service bootstrap time down from 2 days to 3 hours.
  • One-command migration for 70% of services; exceptions ADR’d and tracked.

2) Terraform without homegrown wrappers

Before:

  • Every team had a bespoke Terraform wrapper: shell scripts, terragrunt, or Makefiles that only one person understood.
  • Drift everywhere: untagged resources, inconsistent IAM, surprise bills.

After (paved road):

  • Internal Terraform Registry with versioned modules: aws-vpc, eks-cluster, rds-postgres, each with sensible defaults.
  • tflint, tfsec, and OPA checks in CI; no apply without passing policy.
  • One Makefile and GitHub Actions template across repos.

CI snippet:

- name: Validate Terraform
  run: |
    terraform fmt -check
    # github.ref_name is the short branch name, so this resolves to e.g. env/main.hcl
    terraform init -backend-config=env/${{ github.ref_name }}.hcl
    terraform validate
    tflint --config .tflint.hcl
    tfsec .
    conftest test -p policy/ terraform/

Results (2 quarters):

  • Infra PR lead time reduced 40%.
  • Unused EBS volumes down 72% (tagging and TTL enforced by module defaults).
  • Platform upgrades (EKS minor) completed in 2 weeks fleet-wide, not 8.

3) Feature flags without conditional spaghetti

Before:

  • Teams hand-rolled `if (process.env.FEATURE_X)` checks; flags lingered and broke refactors later.

After (paved road):

  • Standardized on the OpenFeature SDK behind a thin wrapper. Flags expire by default, and evaluations are typed.
  • The ADR clarifies scope: long-lived flags are kill switches only; experiment flags get a 30-day TTL, enforced in CI.

Node wrapper example:

import { OpenFeature } from '@openfeature/server-sdk';

// Thin wrapper: services call boolFlag() instead of the SDK directly,
// so flag usage stays greppable and easy to codemod when flags expire.
export async function boolFlag(key, defaultValue) {
  const client = OpenFeature.getClient();
  // Third argument is the evaluation context handed to the configured provider.
  return client.getBooleanValue(key, defaultValue, { owner: 'team-xyz' });
}

Results (1 quarter):

  • Flag debt cut in half; safe refactors around config paths actually stuck.
  • Incidents due to stale flags: zero.

The 60-day rollout that actually works

You don’t need a platform moonshot. You need a clear decision trail and a default path that’s easier than freelancing.

  1. Week 1–2: Pick one domain and write ADRs

    • Choose a high-churn, high-pain area: ingress, Terraform modules, or CI templates.
    • Create docs/adr in the affected repos and ship 2–3 ADRs with real decisions.
    • Add ADR links to PR templates and the repo README.
  2. Week 3–4: Build the paved road

    • Ship a template repo: service-template-go or tf-template-module.
    • Add pre-commit hooks for linting and conftest policies.
    • Publish your CI pipeline as a reusable GitHub Actions workflow (sketch after this list).
  3. Week 5–6: Pilot with two teams

    • Migrate one service and one infra stack. Pair program. Capture friction.
    • Track exceptions via an adr-exception label and short ADRs explaining why.
  4. Week 7–8: Broadcast and enforce guardrails

    • Present results. Merge the template into Backstage or org-scoped templates.
    • Add light enforcement: block anti-patterns in CI, not by policing Slack.
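
For the reusable workflow in step 2, the shape is roughly the sketch below; the org, repo, and input names are placeholders:

# platform-workflows/.github/workflows/service-ci.yml (the paved road)
on:
  workflow_call:
    inputs:
      service-name:
        type: string
        required: true

# .github/workflows/ci.yml in each service repo (the part teams copy)
name: ci
on: [push, pull_request]
jobs:
  ci:
    uses: your-org/platform-workflows/.github/workflows/service-ci.yml@v1
    with:
      service-name: orders
    secrets: inherit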

Pro tips:

  • Keep exceptions documented and time-bound. If an exception survives 90 days, it earned an ADR.
  • Budget platform time for upgrades. A paved road you can’t maintain becomes a gravel path.

Metrics that prove it’s working

Executives don’t buy stories—they buy deltas.

  • Drift rate: % of repos bypassing paved road (target <10%).
  • Time to first PR on new service: baseline vs after paved road (target: hours, not days).
  • Fleet upgrade cycle: time to roll a library or cluster minor across all services.
  • MTTR/SLOs: incident resolution time for ingress/build/deploy failures.
  • Config variance: number of unique CI templates, ingress controllers, and TF wrappers (trend down).
  • Refactor throughput: number of services auto-migrated by scripted refactor (ts-morph, codemods) without incident.

Instrument it:

  • Emit “off-road” events in CI when policies block builds; aggregate with Prometheus or Datadog (sketch below).
  • Tag resources with provisioner=paved-road vs manual in Terraform; track cost delta.
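
A cheap way to emit the off-road signal from a GitHub Actions job; `pushgateway.internal` is a placeholder, and note this tracks which repos are currently off-road rather than raw event counts:

- name: Report off-road event
  if: failure()  # runs only when a policy or lint step above failed
  run: |
    # Pushgateway keeps the last value per grouping key (job + repo), so
    # count(ci_off_road) in Prometheus approximates "repos currently off-road".
    echo "ci_off_road 1" | curl --silent --data-binary @- \
      http://pushgateway.internal:9091/metrics/job/ci_policy/repo/${{ github.event.repository.name }}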

Common failure modes (and how to dodge them)

I’ve seen these sink otherwise solid efforts:

  • The committee that never decides: ADRs stall if every decision requires a steering committee. Timebox RFCs to one week. Default to the platform owner’s call.
  • Too many exceptions: If half your org is on “temporary” exemptions, you paved the wrong road. Interview teams, tighten defaults, and deprecate exceptions aggressively.
  • Paved road that isn’t maintained: Upgrades fall behind; teams abandon it. Publish a quarterly upgrade plan with dates. Use GitOps to roll changes.
  • Gatekeeping over guardrails: Humans reviewing YAML for “standards” don’t scale. Push policy to automation, keep humans for design.
  • Docs without scaffolding: Telling teams what to do without making it easy is theater. Ship templates and one-liners.

What we’d do again (and what we’d skip)

What worked:

  • ADRs in-repo keep context tight and searchable.
  • A single, boring ingress stack paid off faster than any service mesh tuning we’ve ever done.
  • Terraform module registry plus OPA caught 80% of issues before they hit the cloud.

What we’d skip:

  • Building bespoke CLIs to “simplify everything.” They age like milk. Prefer plain Makefiles and documented gh/az/aws commands.
  • Mandating a mesh for east-west traffic before teams need it. Start with ingress; earn complexity.

If your platform feels like a museum of bespoke art pieces, you’re paying the drift tax every sprint. ADRs and paved roads are the boring, industrial fixtures that stop the bleeding and let you refactor safely—on your terms, on your timeline.

Key takeaways

  • ADRs make decisions durable, discoverable, and debatable—so you stop relearning the same lessons every quarter.
  • Paved roads make defaults ergonomic and safe; exceptions become rare and intentional, not accidental drift.
  • Favor simplification: fewer stacks, one ingress, one CI template, one Terraform module per class of resource.
  • Guardrails beat gatekeeping: enforce with `pre-commit`, `OPA/Conftest`, `CODEOWNERS`, and GitOps.
  • Prove value with metrics: MTTR, time-to-first-PR, fleet upgrade time, drift rate, and incident frequency.

Implementation checklist

  • Create an ADR template and store ADRs in-repo under `docs/adr` with a README index.
  • Adopt a golden path per workload type (service, job, UI) with template repos and scaffolding.
  • Ship a default ingress stack (e.g., `Gateway API` + `Contour`/`Istio` + `cert-manager` + `ExternalDNS`).
  • Publish a blessed Terraform module registry with versioning and OPA policies.
  • Wire CI templates (GitHub Actions/GitLab CI) with sane caching, test, SAST, and deploy jobs.
  • Add `CODEOWNERS`, `pre-commit`, and `conftest` checks to make drift loud in PRs.
  • Roll out with a 60-day pilot, migrate 2–3 real services, and publish wins and lessons.
  • Track metrics: drift exceptions, bootstrap time, incident reduction, and upgrade cycle times.

Questions we hear from teams

Do ADRs slow teams down?
Only if you make them performative. Keep ADRs short, in-repo, and tied to PRs. We aim for 1–2 pages, timeboxed reviews, and decisions landing within a week. The time you save by not re-arguing ingress or CI every quarter dwarfs the writing cost.
How do we handle exceptions to the paved road?
Treat exceptions as first-class but rare. Require a lightweight ADR for the exception, a sunset date, and a plan to converge. Track them with a label like `adr-exception` and report counts monthly.
What tools do we need to start?
Almost none. A Markdown ADR template, `CODEOWNERS`, `pre-commit`, `conftest` for policies, and a template repo. Add `Backstage` scaffolder when you’re ready to scale scaffolding organization-wide.
We already have a platform team—why add ADRs?
Platforms drift without a paper trail. ADRs make platform contracts explicit so product teams can trust them, and so platform upgrades don’t hinge on tribal memory or Slack archaeology.
How do we prove ROI to leadership?
Pick 2–3 metrics before you start: incident reduction in target domains, time-to-first-PR on new services, and fleet upgrade time. Run a 60-day pilot and compare before/after. That narrative plus numbers closes the loop.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

