Stop Playing Config Whack‑a‑Mole: ADRs + Paved Roads That Make Refactors Boring

Favor boring, paved-road defaults over bespoke snowflakes. Write ADRs, enforce with templates and bots, and make global refactors routine instead of risky.


The refactor that should’ve taken a day but cost us a sprint

I once tried to standardize logging across ~70 services at a unicorn you’ve definitely used. We were shipping a new tracing policy, and I naïvely thought: “Swap in OpenTelemetry middleware, tweak exporters, done by lunch.” Instead, we found four logging libraries, three formats, two trace ID conventions, and a bespoke sidecar from the old “move fast” era. Every team had drifted. The refactor broke staging, then canaries, then our patience.

We didn’t lack talent. We lacked a paved road and a paper trail. No ADRs (Architecture Decision Records) to justify prior choices. No opinionated templates to keep new services standard. Every new repo was a choose-your-own-adventure. That’s how you end up debugging JSON parsing in the middle of a P1.

Here’s the boring combo that finally worked: write ADRs for cross-cutting decisions, build a paved road that encodes those decisions, and let bots keep you honest. After that, refactors got boring—in the good way.

Why drift happens (and how much it costs)

Drift creeps in when you have:

  • Bespoke tooling: Hand-rolled deploy scripts, snowflake Helm charts, custom CLIs no one maintains.
  • Tribal knowledge: Decisions live in Slack threads and memories, not durable records.
  • Docs rot: Confluence pages age out; templates don’t exist or are optional.
  • No enforcement: CI pipelines differ per repo; infra modules aren’t versioned; Renovate is “on the backlog.”

What it costs:

  • MTTR spikes: On-call burns hours correlating logs/traces because formats differ.
  • Refactors stall: Global changes mean N bespoke PRs and N different failure modes.
  • Shadow platforms: Teams fork their own workflows to go faster… until they don’t.
  • Cloud bill creep: Inconsistent autoscaling and retries mean overprovisioning and thundering herds.

If you’re a VP Eng balancing velocity and reliability, this isn’t academic. I’ve seen orgs shave 30–50% off refactor timelines and cut failed deploys by a third by going all-in on ADRs + paved roads. The trick is to keep it light and automated, not bureaucratic.

The boring combo that works: ADRs + paved roads + bots

  • ADRs (Architecture Decision Records): One-page decisions with context and trade-offs. Stored next to code. Immutable once accepted; follow-ups supersede.
  • Paved road (golden path): Opinionated defaults baked into templates, modules, and reusable workflows. It should be easier to do the right thing than deviate.
  • Bots and policy: Renovate for dependency bumps, GitHub Actions reusable workflows for CI parity, ArgoCD for drift detection, OPA/Rego for policy gates.

Put simply: ADRs explain the why, paved roads encode the how, and bots enforce the now.
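
In practice those artifacts live in a handful of repos. A rough sketch of the layout, reusing the names from the examples below:

org/service-template/          # cookiecutter template: lint, test, Dockerfile, otel defaults
org/.github/
  .github/workflows/ci.yaml    # reusable CI that every service calls
org/terraform-modules/         # versioned infra modules, bumped by Renovate
org/platform-charts/           # Helm charts that ArgoCD syncs everywhere
my-boring-service/
  adr/                         # decisions live next to the code they shape
  renovate.json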

Ship it in a week: a practical playbook

  1. Start with a minimal ADR template and a place to put them:
# ADR-0007: Standardize Logging with OpenTelemetry

- Status: Accepted
- Date: 2025-01-14
- Owners: @platform, @observability
- Context: Four logging libraries, inconsistent correlation IDs, high MTTR.
- Decision: Adopt OpenTelemetry SDK v1.x, JSON logs with `trace_id`/`span_id`, exporter: OTLP over gRPC.
- Consequences: Minor perf hit, dependency on collector; codemods for legacy repos.
- Alternatives considered: Keep `winston` + custom middleware; use Datadog SDK directly.

Repo structure:

repo/
  adr/
    ADR-0007-standardize-logging.md
  src/
  .adr.json              # optional index for tooling
  2. Require ADR linkage in PRs that change cross-cutting concerns:
# .github/PULL_REQUEST_TEMPLATE.md

- Related ADR: ADR-____
- Cross-cutting change? [ ] yes  [ ] no
- If yes, link to approved ADR and checklist.
  3. Encode the paved road in templates and reusable CI:
# .github/workflows/ci.yaml (reusable)
name: service-ci
on:
  workflow_call:
    inputs:
      node-version:
        required: false
        type: string
        default: '20'
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm run lint && npm test -- --ci
      - run: npm run otel-verify   # paved-road script validates logging/tracing hooks

Each service consumes it:

# service/.github/workflows/ci.yaml
name: ci
on: [push, pull_request]
jobs:
  call:
    uses: org/.github/.github/workflows/ci.yaml@v2
    with:
      node-version: '20'
  4. Version infra and let bots bump it:
# terraform
module "service" {
  source  = "git::ssh://git@github.com/org/terraform-modules.git//service?ref=v3.2.1"
  replicas = 3
}
// renovate.json
{
  "extends": ["config:recommended"],
  "packageRules": [
    {"matchManagers": ["terraform"], "groupName": "terraform-modules"},
    {"matchDatasources": ["docker"], "groupName": "base-images", "schedule": ["after 10pm on sunday"]}
  ]
}
  5. Scaffold new services from a template repo:
cookiecutter gh:org/service-template \
  service_name=my-boring-service \
  language=typescript \
  deploy=argocd \
  observability=otel
  6. Register services so you can see drift:
# catalog-info.yaml (Backstage)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-boring-service
  annotations:
    github.com/project-slug: org/my-boring-service
    adr.repo: https://github.com/org/my-boring-service/tree/main/adr
spec:
  type: service
  owner: team-platform
  lifecycle: production
  7. Add a light policy gate for deviations:
# policy/adr.rego (OPA/Rego)
package pr

violation[msg] {
  input.cross_cutting == true
  not input.adr_linked
  msg := "Cross-cutting change without linked ADR"
}

Wire it in CI with conftest:

conftest test pr.json -p policy/ --namespace pr
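
That conftest call expects a pr.json describing the change. One way to produce it, a minimal GitHub Actions sketch that assumes a `cross-cutting` label convention and conftest already installed on the runner:

# .github/workflows/adr-gate.yaml (sketch)
name: adr-gate
on: [pull_request]
jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build policy input from the PR
        env:
          LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
          BODY: ${{ github.event.pull_request.body }}
        run: |
          # cross_cutting if the PR carries the label; adr_linked if the body references an ADR id
          jq -n --argjson labels "$LABELS" --arg body "$BODY" \
            '{cross_cutting: ($labels | index("cross-cutting") != null),
              adr_linked: ($body | test("ADR-[0-9]+"))}' > pr.json
      - name: Gate on policy
        run: conftest test pr.json -p policy/ --namespace pr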

Before/after: the paved road payoff

Example 1: unify logging/tracing

  • Before: winston, bunyan, pino, and a homegrown logger; no trace_id propagation; text logs in some services.
  • After: OpenTelemetry middleware, JSON logs with trace_id/span_id, OTLP exporter, standard resource attributes.
// before (Express)
app.use((req, _res, next) => {
  req.logger = createWinston({ level: process.env.LOG_LEVEL || 'info' });
  next();
});

// after
import { trace } from '@opentelemetry/api';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';

registerInstrumentations({ instrumentations: [new ExpressInstrumentation()] });
app.use((req, _res, next) => {
  const span = trace.getActiveSpan();
  req.log = (msg: string, fields = {}) => console.log(JSON.stringify({
    msg,
    trace_id: span?.spanContext().traceId,
    span_id: span?.spanContext().spanId,
    ...fields
  }));
  next();
});

Outcome (3 sprints):

  • MTTR on P1s dropped from 82m to 48m (log/trace correlation worked across services).
  • New service bootstrap time: 3 days → 1 day (template repo + reusable CI).
  • Failed canaries from logging changes: 6 last quarter → 0 this quarter.

Example 2: ingress swap without a death march

  • Before: Hand-rolled nginx-ingress per team; divergent annotations and health checks.
  • After: Standard Helm values + GitOps; one ArgoCD app-of-apps; staged rollout.
# apps/ingress/values.yaml
controller:
  kind: DaemonSet
  metrics:
    enabled: true
  config:
    use-proxy-protocol: "true"
# argo/app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress
spec:
  project: platform
  source:
    repoURL: git@github.com:org/platform-charts.git
    targetRevision: v1.12.0
    path: ingress
  destination:
    server: https://kubernetes.default.svc
    namespace: ingress
  syncPolicy:
    automated: { prune: true, selfHeal: true }

Outcome:

  • 1 PR to the platform-charts repo updated all clusters.
  • Rollout took 2 hours with health gates; no customer-visible errors.

Example 3: runtime bumps without breakage

  • Before: Node versions ranged from 14–20; CI images inconsistent.
  • After: .nvmrc + reusable workflow defaulting to Node 20; Renovate opened synchronized PRs.
# standardize
echo "20" > .nvmrc

# reusable CI drives this
uses: org/.github/.github/workflows/ci.yaml@v2
with:
  node-version: '20'

Outcome: 54 repos updated in 4 days; 3 failures caught by contract tests (fixed same day).

Make global refactors safe (and boring)

Design refactors like releases:

  1. Codify the change
    • Language: jscodeshift for TS/JS, OpenRewrite for JVM, gomodifytags/gofmt for Go.
    • Example codemod adds trace_id fields to log calls (transform sketch after this list).
jscodeshift -t transforms/add-trace.ts src/**/*.ts --dry --print
  2. Protect contracts

    • Add contract tests (OpenAPI/Protobuf) so paved-road changes don’t break wire formats.
    • Gate merges on passing consumer-driven tests (contract-test sketch after this list).
  3. Stage with feature flags

    • LaunchDarkly or Unleash to ramp new behavior per service/cluster.
  4. Gate on SLOs

    • CI/CD blocks promotion if error rate or latency SLI regresses.
# GitHub Actions job gate (pseudo)
- name: Check SLOs
  run: |
    error_rate=$(curl -s "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_errors_total[5m]))' \
      | jq -r '.data.result[0].value[1] // "0"')
    if (( $(echo "$error_rate > 0.01" | bc -l) )); then exit 1; fi
  5. Use GitOps to observe
    • ArgoCD ensures desired state; drift shows up in red immediately.
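
The codemod from step 1 doesn’t have to be clever. A minimal jscodeshift transform sketch, assuming log calls look like `logger.info({ ... })` and that the paved-road logging module exports a `currentTraceId()` helper (both are assumptions; adjust to your codebase):

// transforms/add-trace.ts (sketch)
import type { Transform } from 'jscodeshift';

const transform: Transform = (file, api) => {
  const j = api.jscodeshift;
  const root = j(file.source);

  root
    .find(j.CallExpression, {
      callee: {
        type: 'MemberExpression',
        object: { name: 'logger' },
        property: { name: 'info' },
      },
    })
    .forEach((path) => {
      const [payload] = path.node.arguments;
      // only touch object-literal payloads that don't already carry trace_id
      if (payload?.type !== 'ObjectExpression') return;
      const hasTraceId = payload.properties.some(
        (p) => p.type === 'ObjectProperty' && p.key.type === 'Identifier' && p.key.name === 'trace_id'
      );
      if (hasTraceId) return;
      payload.properties.push(
        j.objectProperty(
          j.identifier('trace_id'),
          j.callExpression(j.identifier('currentTraceId'), []) // hypothetical paved-road helper
        )
      );
    });

  return root.toSource();
};

export default transform;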
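
For step 2, a consumer-driven check can be as small as a schema assertion. A sketch assuming Node 18+, ajv, and a hypothetical /v1/orders endpoint; the consumer pins only the fields it actually depends on:

// contract.test.ts (sketch) -- run against staging before promoting
import { test } from 'node:test';
import assert from 'node:assert/strict';
import Ajv from 'ajv';

const orderSchema = {
  type: 'object',
  required: ['id', 'status', 'trace_id'],
  properties: {
    id: { type: 'string' },
    status: { enum: ['pending', 'shipped', 'cancelled'] },
    trace_id: { type: 'string' },
  },
};

test('orders API still honors the consumer contract', async () => {
  const res = await fetch(`${process.env.ORDERS_URL}/v1/orders/123`); // illustrative endpoint
  assert.equal(res.status, 200);

  const validate = new Ajv().compile(orderSchema);
  assert.ok(validate(await res.json()), JSON.stringify(validate.errors));
});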

When this is muscle memory, a “global refactor” becomes: update module → codemod → open batched PRs → canary → gate on SLO → promote. No herding cats on Slack at 1 a.m.
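
The “open batched PRs” step can be a dumb loop. A rough sketch, assuming an authenticated gh CLI and the transform above checked out locally:

# batch-refactor.sh (sketch)
for repo in $(gh repo list org --limit 200 --json name --jq '.[].name'); do
  gh repo clone "org/$repo" "/tmp/$repo" -- --depth 1
  (
    cd "/tmp/$repo" || exit
    git checkout -b chore/adr-0007-otel-logging
    npx jscodeshift -t "$OLDPWD/transforms/add-trace.ts" src/ --extensions=ts --parser=ts
    if ! git diff --quiet; then
      git commit -am "chore: ADR-0007 standardize log fields"
      git push -u origin chore/adr-0007-otel-logging
      gh pr create --title "ADR-0007: standardize log fields" \
        --body "Automated codemod; see ADR-0007."
    fi
  )
done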

Governance without theater

You don’t need a weekly “architecture council” to police YAML. You need:

  • ADR-linked PRs for cross-cutting changes (lightweight, searchable history).
  • Scorecards: % services on paved road, CI parity, module versions, SLO coverage (adherence sketch below).
  • Policy-as-code: OPA to block changes that bypass the paved road.
  • Exception path: A checkbox in the PR template: “Deviation from paved road.” Signal, not shame—and it triggers a follow-up ADR if it sticks.

Example OPA check: when a platform chart already exists, block direct Helm values changes in app repos unless the PR carries an ADR-approved label.

package helm

adr_approved {
  input.labels[_] == "adr-approved"
}

violation[msg] {
  input.repo_type == "app"
  input.files[_].path == "charts/values.yaml"
  not adr_approved
  msg := "Direct Helm values in app repo require adr-approved label"
}
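
On the scorecard side, adherence can be measured with a few API calls. A rough sketch that counts repos consuming the pinned reusable workflow (assumes the gh CLI and the workflow path used earlier):

# scorecard.sh (sketch): how many repos are on the paved-road CI?
total=0; on_paved_road=0
for repo in $(gh repo list org --limit 200 --json name --jq '.[].name'); do
  total=$((total + 1))
  if gh api -H "Accept: application/vnd.github.raw" \
      "repos/org/$repo/contents/.github/workflows/ci.yaml" 2>/dev/null \
      | grep -q 'org/.github/.github/workflows/ci.yaml@v2'; then
    on_paved_road=$((on_paved_road + 1))
  fi
done
echo "CI parity: $on_paved_road/$total repos on the paved road"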

Pitfalls (ask me how I know)

  • Over-bespoking the paved road: If your “golden path” requires a custom CLI and 12 flags, you’ve just built a framework no one asked for. Prefer stock GitHub Actions, ArgoCD, Terraform.
  • Docs without bots: A Confluence page is not enforcement. Templates + reusable workflows + Renovate are.
  • ADR bloat: One-pagers only. If it reads like a whitepaper, it won’t be read.
  • All carrot, no stick: Scorecards and gentle policy gates nudge behavior. Without them, entropy wins.
  • Skipping SLO gates: If you don’t tie refactors to SLOs, you’ll ship regressions with a straight face.

If you’re starting from chaos, pick one paved-road win (logging/tracing, runtime versions, or CI parity), ship it with an ADR, and let the momentum pull in the next change.

Boring is a feature. When refactors are boring, your weekends stay yours.


Key takeaways

  • ADRs capture decisions and trade-offs; paved roads turn them into default behavior.
  • Drift dies when templates, reusable workflows, and bots enforce the defaults.
  • Safe refactors require versioned modules, contract tests, and staged rollouts tied to SLOs.
  • Favor boring tools: Renovate for deps, ArgoCD for drift, OPA for policy. Avoid bespoke glue.
  • Governance should be lightweight: ADR-linked PRs, scorecards, and paved-road linting.

Implementation checklist

  • Create a lightweight ADR template and repo location (e.g., /adr).
  • Add a PR template that requires linking an ADR for cross-cutting changes.
  • Stand up template repos with paved-road defaults (lint, test, deploy, observability).
  • Adopt reusable CI workflows via GitHub Actions’ workflow_call or similar.
  • Pin infra with versioned Terraform modules; manage bumps via Renovate.
  • Publish a “golden module” for logging/tracing (e.g., OpenTelemetry) and make it the default.
  • Use GitOps (ArgoCD/Flux) to detect and correct config drift automatically.
  • Add policy-as-code (OPA) checks to block non-paved changes without an ADR.
  • Pilot a global refactor (e.g., logger unification) with staged rollout tied to SLOs.
  • Track platform scorecards (adherence %, MTTR, failed deployment rate) and iterate.

Questions we hear from teams

Aren’t ADRs just more process?
Only if you let them be. Keep them to one page, store them in-repo, and link them in PRs. The point isn’t ceremony; it’s creating a durable why behind changes that impact many teams.
What if a team needs to deviate from the paved road?
Allow exceptions with an ADR and an expiration date. If the deviation persists and benefits others, fold it back into the paved road as a new version.
We don’t have Backstage/ArgoCD. Can we still do this?
Yes. Start with template repos, reusable CI workflows, and Renovate. Add GitOps and cataloging later. The key is encoding defaults and enforcing them with automation.
How do we measure success?
Track platform scorecards: adherence to templates, % services on standard modules, time-to-bootstrap, failed deployment rate, MTTR, and number of repos updated per global change. You should see fewer exceptions and faster refactors within 1–2 quarters.
What’s the first paved-road module to build?
Observability. Standardize logging/tracing/metrics (OpenTelemetry + JSON logs + OTLP). It pays back immediately in MTTR and debugging speed.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Get a paved-road assessment, or grab our ADR + paved road starter kit.
