Stop Drift: ADRs and Paved Roads That Make Safe Refactors Boring
Capture decisions once, codify them into golden paths, and make the default the safe path. ADRs set the rules; paved roads automate them.
Paved roads aren’t constraints; they’re accelerators with guardrails.
The outage that sold me on paved roads
Friday night, Kubernetes 1.25 upgrade. Half the Ingress objects fail because some teams used `extensions/v1beta1` from a copy‑pasted Helm chart circa 2019. Others used `networking.k8s.io/v1` but had bespoke annotations for different ingress controllers. Meanwhile, a few Node services are running on `node:12-alpine` with an old OpenSSL—TLS handshakes start failing to a third‑party API. We didn’t have a single place to fix it. We had 17 places. That’s drift.
I’ve seen this movie at unicorns and at 20‑year‑old enterprises: orgs ship fast for 12–18 months, entropy creeps in, and every migration becomes an archaeology dig. ADRs and paved roads are the boring infrastructure that stops the bleeding. They prevent institutional amnesia and make safe refactors mind‑numbingly routine.
If it isn’t captured in an ADR and codified on a paved road, it doesn’t exist at scale.
Why drift eats your roadmap
Drift isn’t just messy repos. It’s direct drag on revenue and reliability.
- N× migrations: 1 Kubernetes upgrade becomes 12 bespoke projects.
- Variable MTTR: Oncall can’t predict behaviors when health checks, logging, and metrics differ by service.
- SLO erosion: Inconsistent timeouts/retries explode tail latency and error budgets.
- Security holes: Unpinned base images and one‑off IAM policies are where auditors feast.
- Hiring tax: Onboarding stretches from days to weeks because “it depends” on which template you landed in.
What causes it:
- Bespoke by default: Every team picks its own `Dockerfile`, `eslint`, Helm chart, and deploy pattern.
- Decisions lost to Slack: Context leaves with the staff engineer who authored it.
- Docs without force: “Standards” aren’t enforced by code, so they’re ignored under pressure.
The antidote is twofold: ADRs to memorialize decisions and paved roads to turn those decisions into defaults you can’t accidentally deviate from.
ADRs: decision memory that survives reorgs
ADRs aren’t bureaucracy; they’re compression for org memory. Use something lightweight like MADR and keep them in the repo where work happens.
- Scope
  - Org‑wide ADRs (e.g., “We use `Terraform` on AWS; no `CloudFormation` unless exempted”).
  - Domain ADRs (e.g., “Data platform uses `dbt` + `Airflow` with OpenLineage”).
  - Repo ADRs (e.g., “This service uses Node 18, `pnpm`, `pino`, OpenTelemetry”).
- Convention: `docs/adr/0001-title.md`, link from `README.md`, and reference ADR numbers in PRs.
- Tooling: `adr-tools`, `adr-log`, or Backstage TechDocs to render and index.
- Lifecycle: `Proposed → Accepted → Superseded`. Close the loop—when a paved road changes, supersede the ADR.
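The convention above can be bootstrapped in a few lines of shell (`adr-tools` wraps the same steps with `adr init` and `adr new`; the file name and decision text here are illustrative):

```shell
# Create the ADR directory and first record by hand, following the
# docs/adr/NNNN-title.md convention described above.
mkdir -p docs/adr
cat > docs/adr/0001-use-terraform-on-aws.md <<'EOF'
# ADR 0001: Use Terraform on AWS
- Status: Accepted
- Date: 2024-09-10
- Context: Mixed IaC tooling made reviews and drift detection inconsistent.
- Decision: Terraform for all AWS infra; CloudFormation needs an exception ADR.
EOF
# A one-liner index of decision states across all ADRs:
grep -H '^- Status:' docs/adr/*.md
# → docs/adr/0001-use-terraform-on-aws.md:- Status: Accepted
```

Because the records are plain files in the repo, they version with the code and survive reorgs, wiki migrations, and Slack retention policies.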
Example ADR snippet:
```markdown
# ADR 0005: Standard Node Service Baseline
- Status: Accepted
- Date: 2024-09-10
- Context: Inconsistent node runtimes, logging, and OTEL config slowed incidents and upgrades.
- Decision: Adopt Node 18 LTS, `pnpm`, `pino` logging (JSON), OpenTelemetry auto-instrumentation, common `Dockerfile`, and a single Helm chart.
- Consequences: Faster upgrades via template; bespoke stacks require an exception ADR and oncall ownership.
```
ADRs don’t fix anything alone—they just make intent explicit. The win comes when you wire ADRs to paved roads.
Paved roads: from doc to default
A paved road is a golden path that bakes decisions into code, CI, and policy. It turns “please follow the runbook” into “the pipeline won’t let you do the unsafe thing.”
- Templates
  - Service starters via `cookiecutter`, Backstage Scaffolder, or `spring initializr`.
  - Infra via a `terraform` module registry (`vpc`, `rds`, `eks`, `s3-bucket`).
- Reusable CI/CD
  - GitHub Actions reusable workflows: `org/.github` with `build-test-scan.yml`, `deploy.yml`.
  - `pre-commit` with `tflint`, `terraform fmt`, `golangci-lint`, `eslint`.
- Policy & Guardrails
  - Cluster: `OPA Gatekeeper` or `Kyverno` to block deprecated APIs and enforce labels, probes, and resource limits.
  - IaC: `tflint`, `checkov`, `conftest` to catch drift before apply.
- GitOps delivery
  - `ArgoCD` app‑of‑apps so every service manifest is versioned and consistent.
- Discoverability
  - Backstage catalog with `system`/`component` metadata and TechDocs. The paved road is a click away.
The paved road is not the only road—but if you take a side street, you own it. That’s the deal.
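As an example of wiring the lint hooks into every clone, a `.pre-commit-config.yaml` along these lines works (the `rev` pins and repo choices are illustrative; check each hook repo for current tags):

```yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.89.0  # placeholder pin; use the latest release tag
    hooks:
      - id: terraform_fmt
      - id: terraform_tflint
  - repo: https://github.com/golangci/golangci-lint
    rev: v1.59.0  # placeholder pin
    hooks:
      - id: golangci-lint
```

Ship this file inside the service template so new repos get the hooks on day one instead of opting in later.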
Before/after: standardizing Node services and Terraform
Here’s what we changed at a mid‑market fintech (70 microservices, 15 teams). We wrote two ADRs and shipped the road.
- Node service baseline
  - Before
    - Random mix of `winston`, `bunyan`, and console logs; `node:12-alpine` and `node:16-buster` images; ad‑hoc health checks.
    - Unique Helm charts per team; ingress annotations all over the map.
  - After
    - Backstage template `node-service` with pinned `node:18-alpine`, `pino` JSON logs, `/healthz` and `/readyz`, OpenTelemetry set by `OTEL_EXPORTER_OTLP_ENDPOINT`.
    - Single Helm chart with sane defaults and an ArgoCD `ApplicationSet` driven by a reusable manifest.
Example `Dockerfile` from the paved road:
```dockerfile
FROM node:18-alpine
WORKDIR /usr/src/app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && corepack prepare pnpm@9.0.0 --activate \
  && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build && pnpm prune --prod
CMD ["node", "dist/index.js"]
```
GitHub Actions reusable workflow call:
```yaml
name: ci
on: [push]
jobs:
  call-shared:
    uses: org/.github/.github/workflows/build-test-scan.yml@v3
    with:
      node_version: 18
```
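On the other side of that call, the shared workflow declares its inputs with `workflow_call`. A minimal sketch of what `build-test-scan.yml` might contain (steps abbreviated; the actual scan step is whatever scanner your org standardizes on):

```yaml
name: build-test-scan
on:
  workflow_call:
    inputs:
      node_version:
        type: number
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node_version }}
      - run: corepack enable && pnpm install --frozen-lockfile
      - run: pnpm test && pnpm build
```

Bump the workflow tag (`@v3` → `@v4`) in the template and every caller inherits the change on its next run—one place to fix, not 17.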
Helm values excerpt with enforced probes and resources:
```yaml
resources:
  requests: { cpu: "200m", memory: "256Mi" }
  limits: { cpu: "500m", memory: "512Mi" }
readinessProbe:
  httpGet: { path: /readyz, port: 3000 }
livenessProbe:
  httpGet: { path: /healthz, port: 3000 }
```
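A policy can make those probes mandatory rather than merely conventional. A Kyverno sketch, loosely based on the sample require-pod-probes policy from the Kyverno policy library (match scope and failure action are assumptions to adjust):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-pod-probes
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-probes
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Liveness and readiness probes are required."
        pattern:
          spec:
            containers:
              # Applies to every container in the pod spec.
              - livenessProbe:
                  periodSeconds: ">0"
                readinessProbe:
                  periodSeconds: ">0"
```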
- Terraform modules
  - Before
    - Teams hand‑rolled VPCs and S3 buckets; state scattered; IAM inline policies everywhere.
  - After
    - `terraform` module registry with `vpc`, `eks`, `rds`, `s3-bucket` modules; versioned, with `tflint` and `checkov` in CI; remote state in S3 + DynamoDB.
Example module call:
```hcl
module "bucket" {
  source         = "git::ssh://git@git.example.com/infrastructure/tf-modules//s3-bucket?ref=v1.8.0"
  name           = "${var.env}-artifacts"
  versioning     = true
  kms_key_id     = var.kms_key_id
  lifecycle_rule = [{ id = "retention", enabled = true, expiration = { days = 365 } }]
}
```
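The “remote state in S3 + DynamoDB” half of the baseline is a backend block in each root module; a sketch (bucket, key, and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "org-terraform-state"              # placeholder bucket name
    key            = "services/artifacts/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                  # lock table; placeholder name
    encrypt        = true
  }
}
```

Put this in the module template too, so “state scattered” can’t quietly come back.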
Guardrail with Gatekeeper to block deprecated APIs:
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDeprecatedAPI
metadata: { name: disallow-extensions-v1beta1 }
spec:
  match:
    kinds: [{ apiGroups: ["extensions"], kinds: ["Ingress"] }]
```
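`K8sDeprecatedAPI` isn’t a built-in; Gatekeeper constraints are backed by a `ConstraintTemplate` you author yourself. A minimal sketch of what the template behind that constraint could look like (Rego simplified to the Ingress case):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdeprecatedapi
spec:
  crd:
    spec:
      names:
        kind: K8sDeprecatedAPI
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdeprecatedapi

        # Reject any object still using the removed extensions/v1beta1 API.
        violation[{"msg": msg}] {
          input.review.object.apiVersion == "extensions/v1beta1"
          msg := sprintf("%v uses deprecated extensions/v1beta1; use networking.k8s.io/v1", [input.review.object.metadata.name])
        }
```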
Impact (first quarter):
- 70% of services migrated to the template via Backstage scaffolder actions.
- K8s 1.25 upgrade went from “quarter of pain” to a one‑day maintenance window.
- MTTR for HTTP 5xx incidents dropped 28% after log and metrics normalization.
- Time to spin a new service: 3 days → 4 hours, including CI/CD and SLOs.
Make safe refactors boring
Once the baseline is shared, org‑wide changes become find‑and‑replace with guardrails.
- Dependency upgrades: Use `Renovate` to open PRs against the template and all child repos. For Java, pair with `OpenRewrite` recipes; for JS/TS use `jscodeshift` codemods.
- Cross‑repo edits: `Sourcegraph Batch Changes` to migrate imports, config keys, or OTEL versions in one sweep.
- Progressive delivery: `Argo Rollouts` for canaries, with SLO‑aware analysis (Prometheus, `latency_p99`, error rate). Roll back on policy.
- Feature flags: `LaunchDarkly` or `Unleash` to decouple deploy from release; refactors ship dark, then light up gradually.
- Schema changes: Use `atlas`/`liquibase` with backward‑compatible migrations; enforce `expand → migrate → contract` in CI.
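Renovate’s side of this is a shared org preset. A sketch of a `renovate.json` that auto-merges low-risk updates and groups the rest (the preset name is Renovate’s stock one; the grouping rules are illustrative):

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "packageRules": [
    {
      "matchUpdateTypes": ["patch", "pin", "digest"],
      "automerge": true
    },
    {
      "matchPackagePatterns": ["^@opentelemetry/"],
      "groupName": "opentelemetry"
    }
  ]
}
```

Host it as an org preset and every repo’s config shrinks to a one-line `extends`.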
Concrete example: we upgraded OpenTelemetry from `1.9` to `1.16` across 42 services in two days.
- Bump the OTel SDK in the template; Renovate propagates PRs.
- Run a `Sourcegraph` batch change to update env var names and exporter options.
- Canary 10% with `Argo Rollouts`; analyze `error_rate` and `p95_latency` for 15 minutes.
- Auto‑promote if within SLO; otherwise auto‑rollback and open an issue tagged to the team.
No war rooms. No spreadsheets. Just boring, predictable upgrades.
Run this without killing autonomy
You don’t need a platform politburo. You need a contract.
- Default to paved: New work starts on the paved road. Deviations require an exception ADR with an owner and a return‑to‑baseline plan.
- Thin platform team: 4–8 engineers who own the templates, reusable workflows, Gatekeeper/Kyverno policies, and GitOps plumbing. They don’t own feature work.
- Transparent roadmap: Quarterly ADRs for big moves (e.g., “K8s 1.28 by Q3”, “Node 20 by Q4”).
- Metrics that matter
  - Onboarding time to first prod deploy
  - Change failure rate and MTTR (DORA)
  - Time to upgrade a shared dependency (template → 80% of services)
  - % services on paved road; % exceptions with an active plan
- Guardrails, not handcuffs: If a team truly needs `Rust` or `CloudFormation`, fine—own oncall and compliance. Put it in an ADR.
This balance keeps creativity where it matters (business logic) and standardizes the boring bits.
What to do this quarter
You can get meaningful results in 90 days without boiling the ocean.
- Pick two baselines: e.g., `node-service` and `terraform-module`. Write ADRs and get sign‑off.
- Ship templates: Use Backstage Scaffolder or `cookiecutter` to codify them. Add reusable GitHub Actions.
- Enforce the floor: Turn on `tflint`, `eslint`, `golangci-lint`, and one cluster policy (`Gatekeeper`) that blocks the top drift offender.
- Go GitOps: Stand up ArgoCD and move 5 services to app‑of‑apps.
- Automate upgrades: Enable Renovate on all repos; pilot Sourcegraph Batch Changes on one cross‑repo refactor.
- Publish the scorecard: Track paved‑road adoption, MTTR, and upgrade lead time. Celebrate the wins.
If you want a sparring partner, GitPlumbers has paved roads we’ve run in fintech, SaaS, and marketplaces. We’ll tailor them to your stack and leave you with something your team can own.
Key takeaways
- ADRs prevent institutional amnesia and anchor paved-road defaults.
- Paved roads turn “read the doc” into “you can’t screw this up.”
- Favor boring, standardized tooling over bespoke snowflakes—speed follows simplicity.
- Codify defaults into templates, reusable workflows, policies, and GitOps manifests.
- Safe refactors become bulk edits when the baseline is shared (Renovate, OpenRewrite, Batch Changes).
- Measure outcomes: onboarding time, change fail rate, MTTR, time-to-upgrade, and % on paved road.
Implementation checklist
- Adopt a lightweight ADR template (e.g., MADR) and store ADRs in `docs/adr` at org and repo levels.
- Define 3–5 paved-road baselines (e.g., node-service, spring-service, terraform-module, data-pipeline).
- Ship golden templates via `cookiecutter` or Backstage scaffolder; back them with reusable CI workflows.
- Enforce guardrails with `tflint`, `golangci-lint`, `eslint`, and cluster policy (`Gatekeeper`/`Kyverno`).
- Use GitOps (ArgoCD) and an app-of-apps pattern to keep manifests and policies consistent.
- Automate upgrades with Renovate/Dependabot; use Sourcegraph Batch Changes/OpenRewrite for cross-repo refactors.
- Track adoption and outcomes on a scorecard: paved-road % coverage, MTTR, onboarding time, upgrade lead time.
- Create an ADR exception path with a review SLA; require owners for deviations and a plan to return to baseline.
Questions we hear from teams
- Aren’t ADRs just paperwork that slow teams down?
- Bad ADRs are. Good ADRs are lightweight, live next to code, and tie directly to templates and pipelines. They reduce churn by preventing re‑litigation of settled decisions and accelerate changes by clarifying default paths. Keep them short, scoped, and linked to paved roads.
- How do we handle teams that need to deviate for good reasons?
- Create an exception ADR with an owner, clear scope, and an exit plan. Deviations are fine, but the team owns oncall and compliance. Revisit quarterly. Many “exceptions” end up back on the paved road once the experiment’s value is proven—or not.
- What if we’re already deep in drift—where do we start?
- Pick one high‑leverage baseline (e.g., `node-service` or `terraform-module`). Ship a great template, enforce a minimum set of guardrails, and migrate the top 20% of services by change volume. Wins there pay for the rest. Don’t try to standardize everything at once.
- Will paved roads stifle innovation?
- They stifle novelty in the undifferentiated heavy lifting: logging, health checks, deploys, infra. That’s intentional. They free up cycles to innovate where it matters—product features and ML models—not in writing a new `Dockerfile` for the 15th time.
- Which tools do you recommend to operationalize this?
- Backstage for discoverability and scaffolding; GitHub Actions reusable workflows; ArgoCD for GitOps; Gatekeeper or Kyverno for policy; Renovate for upgrades; Sourcegraph/OpenRewrite/jscodeshift for refactors; Terraform with a module registry for IaC. Swap equivalents as needed; the pattern matters more than the brand.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.