ADRs That Change Code: Paved Roads Over PowerPoints
Make architecture decisions enforceable in CI, back them with golden paths, and refactor without breaking the company.
ADRs that don’t touch code are just meeting notes. Wire them to the paved road and they become guardrails.
The refactor that derailed Q3
A client asked us to “just switch everything to gRPC.” They had 70 microservices, three Node runtimes, two Java frameworks, and a Terraform zoo. There were ADRs—beautiful ones—in Confluence. None of them touched code. Six weeks later: broken staging, half-upgraded services, and a production incident when a bespoke gateway didn’t understand the new headers. Classic drift.
I’ve watched this movie at FAANG scale and at seed-stage startups. The fix isn’t another meeting. It’s ADRs that bind to the paved road: defaults, templates, and policies that make the right thing the easy thing—and make drift visible the minute it happens.
Why this matters (and what goes wrong)
When ADRs live as slides, teams improvise. Improvisation is great for jazz, terrible for compliance and refactors. Symptoms you’ve seen:
- Bespoke everything: hand-rolled Helm charts, one-off GitHub Actions, snowflake Dockerfiles.
- Drift: Terraform providers and base images drift months behind; SLOs drift because nobody knows the golden path.
- Refactor fear: org can’t change logging, tracing, or RPC without a quarter of yak shaving.
The business impact:
- PR lead time balloons from decision ambiguity (we’ve measured an extra 2–3 days).
- MTTR suffers when every service is a one-off snowflake.
- Cloud bills creep from duplicated infra patterns and misconfig (e.g., sidecars running latest-with-debug).
ADRs that actually change code
ADRs are useful when they’re small, searchable, and enforceable. Here’s the pattern that works:
- One page, max. Decision, context, date, tags, migration impact.
- Adjacent to code: `docs/adr/` in the repo (or an org-level repo mirrored into templates).
- Tagged and referenced. Use IDs and tags that policies can read.
- Mapped to enforcement. Every ADR has at least one CI check or policy.
Example ADR that picks gRPC + HTTP/2 for inter-service calls:

```markdown
# ADR-0012: gRPC for inter-service communication

Date: 2024-06-12
Status: Accepted
Tags: rpc, networking, performance

## Decision
- Default to gRPC over HTTP/2 for inter-service calls.
- REST remains at the edge (public APIs) via Envoy translation.

## Context
- High P99 latency between Node/Java services.
- Existing tracing is inconsistent.

## Consequences
- New services must use the `svc-grpc` template.
- Existing services migrate when touching RPC (see migration ADR-0012-MIG).
- CI will reject custom HTTP clients between internal services.

## Links
- Template: platform-templates/svc-grpc
- Policy: opa/policies/adr-0012-grpc.rego
```

Make it testable. We add a tiny machine-readable footer so CI can resolve links:

```html
<!-- adr: {"id":"ADR-0012","policy":"opa/policies/adr-0012-grpc.rego"} -->
```

Pave the road: defaults, templates, and policies
If the paved road isn’t faster than bespoke, engineers will route around it. Ship opinionated, boring defaults:
- Service templates with batteries included: tracing, metrics, health endpoints, OpenAPI/Protobuf, standard Dockerfile, Makefile, CI workflow.
- Terraform modules pinned to versions, with built-in guardrails (encryption, least-privilege IAM, budgets).
- Org-level CI via reusable workflows; you shouldn’t copy/paste YAML per repo.
- Policy-as-code (OPA/Conftest) that encodes ADRs.
Example: a GitHub reusable workflow that every service calls. Two lines in each repo; the org owns the logic.
```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  call-org-ci:
    uses: acme-org/.github/.github/workflows/service-ci.yml@v3
    with:
      language: node
      adr_policy_dir: opa/policies
```

An OPA policy that enforces ADR-0012 by rejecting raw axios calls to internal services in Node (a trivial heuristic, but effective when paired with code search):
```rego
package adr.grpc

import future.keywords.in

# Deny when axios is used against internal .svc.local domains.
violation[msg] {
  endswith(input.file.path, ".ts")
  some line in input.file.lines
  contains(line.text, "axios(")
  contains(line.text, ".svc.local")
  msg := sprintf("ADR-0012: Use gRPC client, not axios for service-to-service: %s:%d", [input.file.path, line.number])
}
```

Conftest wired into CI:

```shell
conftest test --policy opa/policies .
```

A Terraform paved-road module with pinned provider versions:
```hcl
# versions.tf
terraform {
  required_version = ">= 1.6.0, < 1.8.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.55.0"
    }
  }
}

module "service_vpc" {
  source = "git::ssh://git@github.com/acme/modules//vpc?ref=v2.3.1"
  cidr   = var.cidr
  tags   = local.common_tags
}
```

Bootstrap command (make it a single command):

```shell
npx @acme/create-service my-service --template=svc-grpc
```

Prevent drift continuously
Drift is a process problem, not a willpower problem. Bake the checks into the platform:
- Renovate for dependency and toolchain updates, with automerge for patch/minor if tests and policies pass.
- ArgoCD + Kustomize with `app-of-apps` so cluster manifests are standardized and drift shows up as OutOfSync, not a Friday surprise.
- Org-level policies that run on every PR and on a schedule (nightly) to catch drift before audits do.
Renovate config that keeps you off “latest” while reducing toil:
```json
{
  "extends": ["config:recommended"],
  "semanticCommits": "enabled",
  "rangeStrategy": "bump",
  "packageRules": [
    {
      "matchManagers": ["npm", "dockerfile", "terraform"],
      "matchUpdateTypes": ["patch", "minor"],
      "automerge": true
    },
    {
      "matchManagers": ["npm"],
      "matchPackageNames": ["pino", "@grpc/grpc-js"],
      "groupName": "observability-stack"
    }
  ]
}
```

A GitHub Action that runs OPA and Terraform checks nightly across all repos:
```yaml
name: Nightly Policy Sweep
on:
  schedule:
    - cron: '17 3 * * *'
  workflow_dispatch: {}
jobs:
  sweep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: acme-org/repo-harvester
      - name: Run policy checks
        run: |
          ./harvest.sh | conftest test --policy opa/policies --all-namespaces -
      - name: Open issues for violations
        run: node scripts/open-issues.js
```

For GitOps, keep ArgoCD boring. No hand-edited manifests on clusters; standard overlays only:
```yaml
# argo/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
spec:
  project: default
  source:
    repoURL: 'git@github.com:acme-org/cluster-manifests.git'
    path: apps
    targetRevision: main
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Safe refactors: make the migration the paved road
This is where most orgs crash. An ADR announces “we’re moving to X,” and then nothing happens safely. You need:
- A migration ADR with scope and exit criteria.
- Codemods for mechanical changes.
- Feature flags or canary-by-default pipelines.
- Service catalog visibility (Backstage) to know who owns what and show progress.
Example: migrating Node services from a homegrown logger to pino.
Migration ADR snippet:
```markdown
# ADR-0044-MIG: Migrate logger to pino

Scope: 61 Node services.
Plan: codemod + template update + canary rollout (5% -> 25% -> 100%).
Exit: 0 services use `logger`, template default switched, dashboards updated.
```

Codemod with jscodeshift:
```javascript
// transforms/logger-to-pino.js
export default function transformer(file, api) {
  const j = api.jscodeshift
  const root = j(file.source)

  // Replace imports of the homegrown logger with pino
  root.find(j.ImportDeclaration, { source: { value: 'logger' } })
    .replaceWith(j.importDeclaration(
      [j.importDefaultSpecifier(j.identifier('pino'))],
      j.literal('pino')
    ))

  // Replace `createLogger(...)` initialization with `const log = pino()`
  root.find(j.VariableDeclaration)
    .filter(p => j(p).find(j.CallExpression, { callee: { name: 'createLogger' } }).size())
    .forEach(p => {
      j(p).replaceWith(j.variableDeclaration('const', [
        j.variableDeclarator(j.identifier('log'), j.callExpression(j.identifier('pino'), []))
      ]))
    })

  // Map logger.<level>(...) calls onto the new `log` instance
  const map = { warn: 'warn', error: 'error', info: 'info', debug: 'debug' }
  root.find(j.CallExpression, { callee: { object: { name: 'logger' } } })
    .forEach(p => {
      const m = p.node.callee.property.name
      if (map[m]) p.node.callee.object.name = 'log'
    })

  return root.toSource()
}
```

Pipeline step to run the codemod and open a PR per service:

```shell
npx jscodeshift -t transforms/logger-to-pino.js src --extensions=ts,tsx,js
npm test && git checkout -b chore/logger-to-pino && git commit -am "chore: migrate logger to pino" && gh pr create -f
```

For Java codebases, use OpenRewrite. Recipe example:
```yaml
type: specs.openrewrite.org/v1beta/recipe
name: com.acme.logging.MigrateSlf4j
recipeList:
  - org.openrewrite.java.dependencies.AddDependency:
      groupId: io.micrometer
      artifactId: context-propagation
      version: 1.x
  - org.openrewrite.java.ChangeType:
      oldFullyQualifiedTypeName: com.acme.logging.Logger
      newFullyQualifiedTypeName: org.slf4j.Logger
```

Roll out safely with canaries and SLOs:

- Use Flagger or your existing Argo Rollouts to ramp traffic.
- Gate promotions on error-rate and latency SLOs from Prometheus.
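The promotion gate lives in an Argo Rollouts `AnalysisTemplate`. A minimal sketch that fails the canary when the 5-minute error rate exceeds 1%; the Prometheus address, metric names, and threshold are assumptions, not part of the original setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1
      provider:
        prometheus:
          # Address is an assumption; point at your Prometheus service.
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{job="svc",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="svc"}[5m]))
      successCondition: result[0] < 0.01
```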
Example Rollouts snippet:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: svc
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate
```

The Backstage service catalog gives you the heatmap of migration progress. No spreadsheets.
```yaml
# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: svc-foo
  tags: [node, grpc]
spec:
  owner: team-foo
  type: service
  lifecycle: production
  providesApis: [internal-grpc]
  adr: ["docs/adr/ADR-0012.md", "docs/adr/ADR-0044-MIG.md"]
```

Before/after: cost and speed
A recent GitPlumbers engagement (70 services, mixed stacks):
- Before: 9 distinct service templates, 14 CI workflows, Terraform provider drift >6 months, PR lead time 2.7 days, 3 refactor freeze incidents in a year.
- After 8 weeks: 2 paved-road templates, 1 org CI, provider drift <2 weeks (Renovate), PR lead time 14 hours median, zero refactor-related incidents during a cross-cut logging migration, 12% infra cost drop from standardized autoscaling and distroless images.
Trade-offs we accepted:
- We deprecated bespoke Go and Node frameworks in favor of boring choices (`chi` + `pino`). Some teams grumbled; shipping got faster.
- We stopped letting teams hand-edit Helm; the paved road is Helm + Kustomize overlays only.
- We chose ArgoCD over homegrown deployers; fewer knobs, fewer surprises.
What I’d do again (and what to avoid)
Do again:
- Tie every ADR to at least one enforcement mechanism (policy, template, CI check).
- Keep paved-road templates updated via Renovate and org-owned workflows.
- Provide codemods and playbooks before you announce the migration.
- Use canary-by-default and promotion gates on SLOs.
Avoid:
- Monorepo-only zealotry or polyrepo sprawl religion. Choose per-org, but keep the paved road consistent.
- “Framework du jour.” Stability beats novelty for platform layers.
- Over-customizing Backstage; start with service catalog + TechDocs, iterate.
- Policy FOMO: pick 5–10 high-value checks first (secrets in images, TLS, base image, RPC choice), expand later.
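That first batch of checks can be tiny. A sketch of a base-image check in plain Node — deliberately a heuristic: it flags `latest` and untagged images, and will false-positive on multi-stage `FROM <stage>` references:

```javascript
// Sketch: flag Dockerfiles whose base images are unpinned or `latest`.
// Heuristic only; multi-stage `FROM <stage>` refs need a real parser.
function checkBaseImages(dockerfile) {
  const violations = [];
  dockerfile.split('\n').forEach((line, i) => {
    const m = line.trim().match(/^FROM\s+(\S+)/i);
    if (!m) return;
    const image = m[1];
    // An image with no tag, or tagged :latest, defeats reproducible builds.
    if (image.endsWith(':latest') || !image.includes(':')) {
      violations.push(`line ${i + 1}: pin base image "${image}" to a version or digest`);
    }
  });
  return violations;
}

module.exports = { checkBaseImages };
```

Ship it as a PR check with an error message that links to the paved-road Dockerfile, then graduate to the OPA version once the rule stabilizes.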
Key takeaways
- ADRs must be short, indexed, and wired into CI so they influence code—not just slides.
- Paved roads (templates, modules, and org-level pipelines) cut cognitive load and enforce defaults without heroics.
- Drift prevention = policy gates + dependency automation + platform baselines checked continuously.
- Safe refactors require migration playbooks, codemods, and canary-by-default rollouts tied to SLOs.
- Favor simplification: buy boring tools, standardize on fewer stacks, avoid bespoke glue.
Implementation checklist
- Keep ADRs under 1 page, tag them, and commit them next to code.
- Map each ADR to at least one CI check or policy.
- Publish paved-road templates (service, Terraform, Helm) with one command to bootstrap.
- Enforce org-level CI (reusable workflows) and policy-as-code (OPA) across repos.
- Automate dependency and toolchain updates with Renovate and pinned versions.
- Provide migration codemods and feature flags for safe rollouts.
- Track refactor progress via service catalog and SLO dashboards.
Questions we hear from teams
- Do we need Backstage to make this work?
- No. Backstage helps with visibility and docs, but the core is ADRs + templates + policies + org-level CI. You can start with a simple repo list and a spreadsheet, then graduate to Backstage as you scale.
- Won’t policies slow teams down?
- Bad ones do. Start with high-signal checks tied to ADRs (base image, TLS, RPC choice, secret scanning). Keep failures actionable with messages and links to the template or doc. We typically see PR lead time decrease once ambiguity drops.
- How do we handle teams that truly need exceptions?
- Add an exception process: a short `EXEMPTIONS.md` with owner and expiry, or policy annotations like `adr/exception: ADR-0012 until 2025-03-01`. Make exceptions visible, time-bound, and expensive to keep.
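A sketch of how a CI step might honor those exemptions, assuming a pipe-delimited `EXEMPTIONS.md` format (the format, field order, and function names here are assumptions):

```javascript
// Sketch: filter policy violations through time-bound exemptions.
// Assumed EXEMPTIONS.md line format:
//   ADR-0012 | team-foo | 2025-03-01
function parseExemptions(text, today = new Date()) {
  const active = new Set();
  for (const line of text.split('\n')) {
    const m = line.match(/^(ADR-\d+)\s*\|\s*\S+\s*\|\s*(\d{4}-\d{2}-\d{2})/);
    // Only exemptions whose expiry date has not passed remain active.
    if (m && new Date(m[2]) >= today) active.add(m[1]);
  }
  return active;
}

function enforceableViolations(violations, exemptions) {
  // violations: [{ adr: 'ADR-0012', msg: '...' }]
  return violations.filter((v) => !exemptions.has(v.adr));
}

module.exports = { parseExemptions, enforceableViolations };
```

Expired exemptions simply stop filtering, so the policy starts failing again on its own — no one has to remember to revoke anything.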
- We’re polyglot. Will this explode our maintenance?
- Only if you over-customize. Keep paved-road patterns per runtime (Node, Java, Go) but share the same CI, policy engine, and GitOps model. Provide codemods/playbooks per language; centralize policy and deployment.
- What does the first week look like?
- Pick one ADR with clear payoff (e.g., base image + distroless). Ship an org-level policy, update templates, and run a Renovate sweep. Celebrate one measurable win before boiling the ocean.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
