The Cadence That Keeps Modernization From Hijacking Your Roadmap

How to build a decision rhythm that funds modernization without blowing up delivery schedules, trust, or SLOs.

Modernization only works when it ships on the same train as features. Change the calendar, not just the code.

The rewrite that almost ate the roadmap

If you’ve been around long enough, you’ve seen this movie. A retailer I worked with spun up a platform team to “modernize” a creaky Java monolith. Twelve months later they had Kubernetes, Istio sidecars, and a half-finished service graph—and a furious CFO asking why cart conversion dropped 3 points during peak. Product and platform were running on different calendars. Modernization work landed in chunky, high-risk releases while features fought for scraps.

We turned it around by installing a decision cadence—not another architecture committee. A tight operating rhythm, shared artifacts, and automated guardrails aligned modernization to the same drumbeat as delivery. In three quarters, they moved 40% of monolith traffic behind a strangler proxy, kept SLOs green, and shipped the promotional engine that paid for the whole thing.

Design the operating rhythm (lightweight, enforced)

This cadence fits a 10–30 team org without adding meeting bloat. Inputs and outputs are crisp. Decisions move forward or die—no zombie work.

  • Weekly Ops + Product Triage (30 min, Tues)
    • Inputs: last week’s SLOs, incident notes, cost deltas, delivery burndown.
    • Outputs: rebalanced WIP, blocked work escalated, capacity shifts (+/- 5%).
    • Who: EMs, PMs, SRE, platform rep. No execs.
  • Biweekly Architecture Decision Review (ADR) (60 min, Thu)
    • Inputs: PR-based RFCs, ADR drafts, experiment results.
    • Outputs: Approved/Rejected/Timeboxed decisions with owners and expiry dates.
    • Who: Principal Eng, Domain Leads, Platform, Security.
  • Monthly Investment Council (45 min, first Wed)
    • Inputs: scorecard, DORA trend, cost per transaction, roadmap deltas.
    • Outputs: Capacity bands per domain (e.g., 20% modernization for Checkout), kill/continue calls.
    • Who: VP Eng, Director PM, Finance partner.
  • Quarterly North-Star Review (90 min)
    • Inputs: Architecture vision deltas, domain decomposition plan, risk register.
    • Outputs: 1–3 cross-domain bets with success metrics and guardrails.

A simple configuration helps socialize this without slide decks:

cadence:
  weekly_triage:
    time: "Tue 11:00-11:30"
    inputs: ["slo_dashboard", "dora_report", "cost_delta", "burndown"]
    outputs: ["wip_adjustments", "blockers_escalated", "capacity_shift"]
  biweekly_adr:
    time: "Thu 13:00-14:00"
    decisions: ["approve", "reject", "timebox"]
  monthly_investment:
    time: "1st Wed 10:00-10:45"
    knobs: ["capacity_band_per_domain", "kill_or_continue"]
  quarterly_north_star:
    time: "Quarterly 2h"
    bets: 3

One backlog, explicit capacity bands

Separate “modernization” backlogs are where alignment goes to die. Keep a single backlog per domain with clear labels and capacity bands you actually enforce.

  • Labels: modernization, reliability, risk-reduction, feature.
  • Bands: Start at 15–25% per domain for modernization+reliability; adjust monthly.
  • WIP limits: Don’t exceed 2 concurrent modernization epics per domain.
  • Two-way vs one-way doors: Two-way decisions (canarying a new cache) go to weekly triage; one-way decisions (data model splits) go to ADR review.

If you’re on Jira, bake this into queries and boards your directors look at every Friday:

-- Open modernization/reliability work
project = CHECKOUT AND labels in (modernization, reliability) AND statusCategory != Done
-- % of capacity on modernization last 2 weeks
worklogDate >= -14d AND labels in (modernization, reliability)
-- Blockers older than 72h
status = Blocked AND updated < -3d ORDER BY priority DESC
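The bands and WIP limits above can be checked mechanically instead of eyeballed. A minimal sketch over a ticket export—the field names (`epic`, `labels`, `points`, `status`) are illustrative, not a Jira API schema:

```python
# Sketch: enforce one domain's capacity band and modernization WIP limit
# from a backlog export. Field names are hypothetical, not Jira's schema.
def check_bands(tickets, band=(0.15, 0.25), max_mod_epics=2):
    mod_labels = {"modernization", "reliability"}
    total = sum(t["points"] for t in tickets)
    mod = sum(t["points"] for t in tickets if mod_labels & set(t["labels"]))
    share = mod / total if total else 0.0
    # WIP limit counts distinct in-progress modernization epics only.
    wip_epics = {t["epic"] for t in tickets
                 if "modernization" in t["labels"] and t["status"] == "In Progress"}
    violations = []
    if not band[0] <= share <= band[1]:
        violations.append(f"modernization share {share:.0%} outside band {band}")
    if len(wip_epics) > max_mod_epics:
        violations.append(f"{len(wip_epics)} modernization epics in flight (limit {max_mod_epics})")
    return share, violations

backlog = [
    {"epic": "CHECKOUT-1423", "labels": ["modernization"], "points": 5, "status": "In Progress"},
    {"epic": "CHECKOUT-1500", "labels": ["feature"], "points": 15, "status": "To Do"},
]
share, violations = check_bands(backlog)
```

Run it in the same Friday job that refreshes the boards; a non-empty `violations` list is an agenda item for weekly triage, not a build failure.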

Tie capacity to outcomes, not vibes:

  • Checkout domain at 20% until p95_latency_ms < 300 and cart_abandonment stabilizes for 4 weeks.
  • Catalog domain at 15% until monolith read path < 50% of traffic.

For deployments, keep modernization slices shippable behind flags:

  • Use LaunchDarkly or Unleash to gate new services.
  • Drive strangler fig via edge routing (Nginx/Envoy) with traffic percentages.
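One concrete way to drive the strangler fig at the edge is Nginx's `split_clients`. A sketch with hypothetical upstream and route names—adjust the percentage as your canary earns trust:

```nginx
# Hypothetical strangler routing: send 10% of product reads to the new
# service, the rest to the monolith. Upstream names are illustrative.
split_clients "${request_id}" $product_backend {
    10%     new_product_svc;
    *       monolith;
}

upstream new_product_svc { server product-svc:8080; }
upstream monolith        { server monolith:8080; }

server {
    listen 80;
    location /products {
        proxy_pass http://$product_backend;
    }
}
```

Envoy's `weighted_clusters` does the same job if you're already running it as the edge proxy.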

Decision artifacts people actually read

Don’t ship decisions by Slack. Use ADRs for reversible/irreversible architecture moves per domain, and PR-based RFCs for cross-cutting changes.

  • Store ADRs next to the code: services/checkout/docs/adr/.
  • Require a link to SLOs and a kill criterion in every ADR.
  • Timebox experiments with explicit data to collect.

Example ADR template we’ve used at multiple clients:

# ADR 0037: Introduce Redis cache before monolith read path

Date: 2025-10-03
Status: Approved (expires 2026-01-15 unless renewed)
Deciders: checkout-principal, sre-lead, platform-arch

Context
- p95 product page latency is 680ms vs SLO 300ms
- 62% of reads hit monolith DB; DB CPU at 80% during peak

Decision
- Add `Redis` cluster (3 shards, `cache.t3.medium`) fronting product read path
- Route 10% via `Envoy` filter with canary; ramp to 50% if error budget > 80%

Consequences
- +$3.2k/month infra. Offset by 25% DB CPU drop.
- New oncall playbook and dashboard panel.

Measures
- Success: p95 < 350ms within 2 weeks; DB CPU < 60%
- Kill: error_budget_remaining < 50% for 24h or cache miss ratio > 60%

Links
- SLO: /dashboards/checkout-slo
- Runbook: /runbooks/redis-cache.md
- Jira Epic: CHECKOUT-1423

PR-based RFCs keep stakeholders honest. If Finance or Security can’t comment in Git, something’s broken in your culture.

Automate guardrails so the cadence sticks

If your cadence relies on people remembering rules, it will fail during crunch time. Bake rules into pipelines and tooling.

  • SLO budget gate: Block risky ramps when error budget is burning too fast.
  • Cost drift gate: Catch surprise infra jumps with Infracost.
  • Policy-as-code: Enforce architectural constraints (Open Policy Agent, Conftest).

Example GitHub Actions pipeline that fails a PR when the service is burning too hot or cost jumps >10%:

name: guardrails
on:
  pull_request:
    paths:
      - 'services/checkout/**'
jobs:
  guardrails:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check SLO burn rate
        run: |
          # -G + --data-urlencode keeps the PromQL braces out of the raw URL
          BURN=$(curl -sG "$PROM_URL/api/v1/query" \
            --data-urlencode "query=rate(error_budget_burn{service='checkout'}[1h])" \
            | jq -r '.data.result[0].value[1] // "0"')
          echo "burn=$BURN" >> "$GITHUB_OUTPUT"
          awk -v b="$BURN" 'BEGIN{ if (b > 2.0) exit 1 }'
      - name: Infracost
        uses: infracost/actions/setup@v3
      - name: Infracost diff
        run: |
          # Compare against a baseline snapshot generated from main
          infracost diff --path=./infra --compare-to=infracost-base.json \
            --format=json --out-file=ic.json
          jq -r '((.diffTotalMonthlyCost // "0") | tonumber)
                 / (((.pastTotalMonthlyCost // "0") | tonumber) + 0.01) * 100' ic.json \
            | awk '{ exit ($1 > 10) }'
      - name: Policy check (OPA)
        run: |
          conftest test ./k8s --policy ./policy

For release windows, prevent heroics with ArgoCD sync windows and Istio canaries:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: checkout
spec:
  syncWindows:
    - kind: allow
      schedule: "0 9 * * 1-5"   # weekdays at 09:00 (cron syntax)
      duration: "7h"            # i.e., until 16:00
      applications: ["checkout/*"]
    - kind: deny
      schedule: "0 16 * * 5"    # Friday 16:00
      duration: "56h"           # through Sunday night
      applications: ["*"]

If it’s important, automate it. If you can’t automate it, make it explicit and timeboxed.

Leadership behaviors that make or break it

The cadence works only if leaders model the right behaviors. I’ve seen this fail when VPs quietly reprioritize pet projects or when platform teams hoard decisions.

  • No stealth work. Anything over two days must have a ticket with labels and an owner. Random “cleanup” sprints are how regressions happen.
  • Kill bad bets fast. If an ADR’s kill conditions trigger, celebrate the stop. No “but we’re already halfway.”
  • Protect focus. 2 WIP modernization epics per domain. Product asks queue up, not explode WIP.
  • Tie money to outcomes. Monthly investment council adjusts capacity bands only with SLO/cost data.
  • One narrative, many audiences. Publish a Friday scorecard: 1 slide, green/yellow/red with one-line deltas. Same facts for execs and teams.
  • Service ownership > central gatekeeping. Domain teams own their SLOs and ADRs. Platform sets policies and provides paved roads.

If you can’t explain why 20% of capacity is on modernization in a single screenshot, you’re not running a cadence—you’re doing theater.

Metrics that keep you honest

Measure both delivery and modernization. If you only track features, debt will metastasize. If you only track debt, the business will cut you off.

  • DORA: lead time, deployment frequency, change failure rate, MTTR.
  • SLOs: error budget burn rates per service. Tie to release gates.
  • Cost: cost per transaction (not just EC2 bill).
  • Modernization throughput: % traffic off monolith, # of domains with service ownership, age of top-10 debt items.

A couple of pragmatic snippets:

Prometheus SLO burn query we use to gate canaries:

sum(rate(requests_total{service="checkout",response_code!~"5.."}[1m]))
/
sum(rate(requests_total{service="checkout"}[1m])) < 0.99

BigQuery rough-cut PR lead time (if you export GH events):

SELECT
  pr.number,
  TIMESTAMP_DIFF(pr.merged_at, pr.created_at, HOUR) AS lead_time_hours
FROM `org_gh.pull_requests` pr
WHERE pr.base_repo = 'checkout' AND pr.merged_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

Cost per transaction trend from logs + billing export:

SELECT
  DATE(request_ts) AS day,
  SUM(cost_usd) / NULLIF(SUM(requests),0) AS cost_per_tx_usd
FROM `analytics.daily_service_costs`
WHERE service = 'checkout'
GROUP BY day
ORDER BY day

Publish these weekly. If the trend is red for two weeks, the monthly council must rebalance capacity.
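The Friday scorecard itself is a ten-minute script, not a BI project. A toy version of the green/yellow/red roll-up—metric names, thresholds, and the 20% yellow band are all illustrative choices, not a standard:

```python
# Toy scorecard: classify each metric green/yellow/red against a target,
# with a one-line delta. Thresholds and metric names are illustrative.
def score(current, previous, target, lower_is_better=True):
    delta = current - previous
    if lower_is_better:
        ok, warn = current <= target, current <= target * 1.2
    else:
        ok, warn = current >= target, current >= target * 0.8
    status = "green" if ok else ("yellow" if warn else "red")
    return status, f"{delta:+.3f} vs last week"

def scorecard(metrics):
    # metrics: name -> (current, previous, target, lower_is_better)
    return {name: score(*vals) for name, vals in metrics.items()}

weekly = {
    "p95_latency_ms":   (340, 410, 300, True),
    "cost_per_tx_usd":  (0.013, 0.011, 0.010, True),
    "deploy_frequency": (14, 9, 10, False),
}
sc = scorecard(weekly)
```

Dump `sc` into the Friday post as-is; the point is that execs and teams argue about the same three tuples, not competing dashboards.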

What it looks like when it works (and what we’d do again)

At a mid-market fintech (25 teams, PCI scope), we installed this cadence without changing their org chart:

  • Kept modernization at 20% capacity across four domains, adjusted monthly.
  • Moved 55% of auth traffic off the monolith via strangler proxy in two quarters.
  • Improved p95 latency 620ms → 280ms; cut MTTR from 3h → 38m.
  • Held change failure rate at <12% while doubling deployment frequency.
  • Held infra cost per transaction flat while adding Redis and edge cache, thanks to pruning orphaned EBS and RDS instances flagged by Infracost diffs.

What we’d do again:

  • Start with Checkout-like domains where SLOs and business impact are closest to revenue.
  • Use feature flags to keep risk low and confidence high.
  • Put platform policies in code day one—arguing policy in meetings is a timesink.

If this sounds like the rhythm you need, GitPlumbers can help you stand it up in weeks, not quarters.


Key takeaways

  • Modernization succeeds when it rides the same cadence as feature delivery—same backlog, same SLOs, same release train.
  • Establish 4 lightweight but enforced rituals: weekly triage, biweekly ADR review, monthly investment council, quarterly north-star review.
  • Make decisions durable with ADRs and PR-based RFCs that tie to measurable outcomes and kill-switches.
  • Automate guardrails in CI/CD: SLO budget checks, cost drift, and architectural policy—don’t rely on meeting heroics.
  • Measure both product and modernization outcomes: DORA, SLO burn, cost per transaction, and tech-debt burn-down by domain.

Implementation checklist

  • Create a single backlog with explicit labels for modernization and reliability work.
  • Set capacity bands (e.g., 15–25%) and enforce with WIP limits in Jira.
  • Stand up weekly triage, biweekly ADR review, monthly investment council, quarterly north-star review.
  • Adopt ADRs per service/domain and RFCs via PRs; link tickets, SLOs, and kill criteria.
  • Automate gates in CI/CD for SLO budget, cost drift, and policy compliance.
  • Track DORA metrics, SLO burn rates, cost per transaction, and modernization throughput per domain.
  • Use feature flags and strangler fig to ship modernization incrementally.
  • Publish a simple scorecard to execs every Friday—green/yellow/red with one-line deltas.

Questions we hear from teams

How do we pick the initial capacity band for modernization?
Start at 15–20% per domain. Use your last quarter’s incident volume and SLO misses to bias up or down. If MTTR > 2h or you’re burning >30% of your error budget monthly, move to 25% until stability returns. Review monthly in the investment council.
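That rule of thumb fits in a few lines, which makes it auditable at the investment council. The thresholds mirror the answer above; treat this as a sketch, not policy:

```python
# Heuristic from the FAQ answer: MTTR > 2h or >30% monthly error-budget
# burn pushes the modernization band to 25% until stability returns.
def modernization_band(mttr_hours, monthly_budget_burn_pct, baseline=0.20):
    if mttr_hours > 2 or monthly_budget_burn_pct > 30:
        return 0.25
    return baseline
```

Review the inputs (not the function) monthly; the code only makes the current rule explicit.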
Won’t product hate “losing” capacity to tech debt?
They’ll hate outages and missed dates more. Frame modernization as risk-reduction with measurable outcomes (latency, availability, cost). Keep a single backlog and require every modernization epic to attach to a product KPI and SLO. Show a Friday scorecard. Transparency buys trust.
What if Security or Finance won’t work in Git?
Bridge it. Mirror RFCs into their systems (e.g., Confluence export) but make the PR the source of truth. Add a `CODEOWNERS` rule so approvals are required. If approvals block for >48 hours, escalate at weekly triage, not by DM.
We tried ADRs before and they gathered dust. What’s different here?
Two things: expiry dates and kill criteria. Every ADR expires unless renewed with evidence; every experiment has a stop condition. And ADRs live next to code with CI checks that fail if a service has no current ADR for critical areas (e.g., data access).
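That CI check is small enough to live in the repo. A sketch that parses the `Status:` line from the ADR template shown earlier—your header format may differ, so adjust the regex to match it:

```python
import re
from datetime import date

# Sketch of an ADR freshness check: an ADR counts as current only if its
# Status line is Approved with an unexpired "expires YYYY-MM-DD" clause.
# The header format follows the template in this post; yours may differ.
STATUS_RE = re.compile(r"Status:\s*Approved\s*\(expires\s+(\d{4}-\d{2}-\d{2})")

def adr_is_current(adr_text, today):
    m = STATUS_RE.search(adr_text)
    if not m:
        return False  # rejected, superseded, or missing an expiry clause
    return date.fromisoformat(m.group(1)) >= today

adr = "# ADR 0037\nStatus: Approved (expires 2026-01-15 unless renewed)\n"
```

Wire it into CI by globbing `services/*/docs/adr/*.md` and failing the build for any critical area whose newest ADR is stale.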

