Culture · Oct 2, 2025 · 9 minute read

Stop Guessing: A Real Technical Debt Budget (and How to Prove the ROI)

You don’t need another slide deck about “velocity.” You need a debt budget, CFO-grade ROI math, and rituals that survive roadmap pressure.

Alex Grant

Partner, GitPlumbers

20 years building and fixing systems at scale. Led platform teams through the microservices wave, tamed monoliths at a Fortune 100, and made friends with more CFOs than I thought possible.

> The day you can show ROI on technical debt in dollars is the day the CFO becomes your ally, not your blocker.

The on-call tax you pretend isn’t on the balance sheet

I walked into a unicorn’s war room years ago: 14 Sev-2 incidents in 30 days, MTTR north of 6 hours, and a product VP still asking, “Can we ship the promo flow?” I’ve seen this movie at a bank, a games studio, and a logistics platform. The pattern is the same—feature roadmap eats every sprint until the reliability tax shows up as churn, missed OKRs, and burned-out teams.

If you want out, stop treating technical debt as “nice to have” and start budgeting like a CFO. Then measure ROI with real dollars and real time, not vibes.

Set the budget: not vibes, percentages

You don’t need 6 months of committees. Pick a number and publish it.

Start at 20% capacity per team for tech-debt/platform/security work. I’ve seen 15% work in stable orgs; 25–30% for high-incident platforms. If you use SAFe, treat it as enabler capacity. If you use Scrum, that’s roughly one day per engineer per week.
Ring-fence it the same way you protect SLO error budgets. If you blow the error budget, you slow features; if you blow the debt budget, you stop pretending.
Scope to epics, not random chores. Create debt-epics that deliver outcomes: “Reduce checkout CFR from 35% to <15%” or “Cut build time from 40m to 15m.”
Allocate explicitly: each team commits debt points in sprint planning; product leads sign the trade-off in writing.
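
To make the split concrete in sprint planning, here is a minimal Python sketch. It assumes capacity is tracked in story points and a five-day week; `split_capacity` and `debt_days_per_week` are hypothetical helper names, not part of any tracker's API.

```python
from dataclasses import dataclass


@dataclass
class SprintCapacity:
    """Capacity split for one team and one sprint, in story points."""
    total_points: float
    debt_points: float
    feature_points: float


def split_capacity(total_points: float, debt_share: float = 0.20) -> SprintCapacity:
    """Reserve a fixed share of sprint capacity for debt-epics.

    debt_share is the ring-fenced fraction; 0.20 is the starting point above.
    """
    if not 0.0 <= debt_share <= 1.0:
        raise ValueError("debt_share must be between 0 and 1")
    debt = round(total_points * debt_share, 1)
    return SprintCapacity(total_points, debt, round(total_points - debt, 1))


def debt_days_per_week(debt_share: float, working_days: int = 5) -> float:
    """Translate the share into days per engineer per week."""
    return debt_share * working_days


# Hypothetical team with 50 points of sprint capacity.
plan = split_capacity(total_points=50, debt_share=0.20)
print(plan.debt_points, plan.feature_points)  # 10.0 40.0
print(debt_days_per_week(0.20))               # 1.0
```

Product plans against `feature_points` only; the debt points are committed separately so the trade-off stays visible.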

Example policy you can paste in Slack/Confluence:

“Engineering will allocate 20% of capacity to tech-debt and platform epics. Changes require VP Eng approval. We track usage and ROI monthly. Feature roadmaps must assume this capacity is unavailable.”

Make it visible: tags, boards, and dashboards

If the work isn’t visible, it will get eaten by “just one more feature.” Instrument it end-to-end.

- Issue tracking: In Jira/Linear/GitHub Projects, standardize labels: tech-debt, platform, risk, security, observability.
  - Use a dedicated Debt Board per org with swimlanes for Stability, Scale, Security, DevEx.
  - Group into debt-epics with clear acceptance criteria and owners.
- Developer portal: In Backstage, add a Debt tab per service with:
  - SLO status, CFR, MTTR, open security issues, SonarQube code smells, Snyk vulns, build time, test flakiness.
  - A living Ownership doc: escalation path, runbooks, decision log.
- Dashboards: Use Datadog/Grafana to display:
  - DORA: Lead Time, Deployment Frequency, Change Failure Rate (CFR), MTTR.
  - Incident hours per team (from PagerDuty/Opsgenie).
  - Build/test time (CI), flaky test rate, merge queue time.
  - Cloud cost for the service (from CloudWatch/BigQuery FinOps table, tagged by service, team).

Pro tip: tie each debt-epic to an observable metric target. If you can’t graph it, it’s not an epic; it’s a chore.
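
If your dashboards don't expose CFR and MTTR yet, you can compute a rough baseline from exported records. This is a sketch under assumptions: the `Deployment` and `Incident` shapes are hypothetical, and what counts as a "failed" deploy (rollback, hotfix, incident) is a definition you have to settle with your own data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    caused_failure: bool  # assumption: True if the deploy led to a rollback, hotfix, or incident


@dataclass
class Incident:
    service: str
    started_at: datetime
    resolved_at: datetime


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that caused a failure (0.0 when there are no deploys)."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to restore in minutes (0.0 when there are no incidents)."""
    if not incidents:
        return 0.0
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total.total_seconds() / 60 / len(incidents)


# Hypothetical month for one service: 20 deploys, 6 of them bad; two incidents.
deploys = [Deployment("checkout", caused_failure=(n < 6)) for n in range(20)]
t0 = datetime(2025, 9, 1, 2, 0)
incidents = [
    Incident("checkout", t0, t0 + timedelta(minutes=120)),
    Incident("checkout", t0, t0 + timedelta(minutes=240)),
]

print(f"CFR:  {change_failure_rate(deploys):.0%}")     # CFR:  30%
print(f"MTTR: {mttr_minutes(incidents):.0f} min")      # MTTR: 180 min
```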

Measure ROI like a CFO, not a keynote

Debt ROI is measurable in weeks if you track the right units. Use a simple formula and dollarize the inputs.

ROI = (Savings + RiskReduction - Cost) / Cost

Where:

- Savings includes:
  - IncidentHoursReduced * BurdenedRate (include on-call, resolver, incident commander, customer support time).
  - CloudCostReduced (CPU/memory/egress saved via tuning/arch fixes).
  - DevTimeReclaimed from faster builds/tests (MinutesSavedPerBuild / 60 * BuildsPerDay * BurdenedRate, per working day).
  - SupportTicketsReduced * CostPerTicket.
- RiskReduction approximates avoided outages:
  - ProbabilityDrop * ImpactPerIncident for a class of incidents (be conservative; use last-quarter rates).
- Cost includes engineer time (hours * rate), partner/vendor spend, and migration overhead.

Concrete example you can present to a CFO:

Epic: “Rewrite legacy queue consumers to remove duplicate delivery.”
1. Cost: 2 engineers x 3 weeks x 35 hrs x $140/hr = $29,400.
2. Savings:
  - Incidents: From 6/mo to 1/mo (5 fewer); average 4 hrs incident x 4 people x $140/hr = $2,240/incident ⇒ $11,200/mo.
  - Cloud: 20% fewer retries saves $3,000/mo.
  - DevEx: Build flake fix saves 6 min/build x 80 builds/day x $140/hr ÷ 60 = $1,120/day ⇒ $22,400/mo at 20 working days.
3. RiskReduction: Cut the monthly probability of a Sev-1 by 5 percentage points; impact $100k ⇒ $5,000 expected value per month.
4. Total monthly value ≈ $41,600. Payback < 1 month. ROI (90 days) ≈ (3 x 41,600 - 29,400) / 29,400 = (124,800 - 29,400) / 29,400 ≈ 3.24x.
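
The arithmetic above fits in a few lines of Python, which makes it easy to rerun with your own inputs. Treat this as a sketch: the 20 working days per month is an assumption implied by the $1,120/day ⇒ $22,400/mo step, and the variable names are illustrative.

```python
def roi(savings: float, risk_reduction: float, cost: float) -> float:
    """ROI = (Savings + RiskReduction - Cost) / Cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (savings + risk_reduction - cost) / cost


BURDENED_RATE = 140        # $/hr, from the example
WORKDAYS_PER_MONTH = 20    # assumption implied by $1,120/day => $22,400/mo

# Cost: 2 engineers x 3 weeks x 35 hrs x rate
cost = 2 * 3 * 35 * BURDENED_RATE

# Monthly savings
incident_savings = (6 - 1) * (4 * 4 * BURDENED_RATE)   # 5 fewer incidents x $2,240
cloud_savings = 3_000
devex_savings = 6 * 80 / 60 * BURDENED_RATE * WORKDAYS_PER_MONTH

# Monthly risk reduction: probability drop x impact
risk_reduction = 0.05 * 100_000

monthly_savings = incident_savings + cloud_savings + devex_savings
monthly_value = monthly_savings + risk_reduction
payback_months = cost / monthly_value
roi_90d = roi(3 * monthly_savings, 3 * risk_reduction, cost)

print(f"Cost:           ${cost:,.0f}")            # $29,400
print(f"Monthly value:  ${monthly_value:,.0f}")   # $41,600
print(f"Payback:        {payback_months:.2f} months")
print(f"ROI (90 days):  {roi_90d:.2f}x")          # 3.24x
```

Keeping savings and risk reduction as separate inputs lets finance discount the risk number independently if they think it's soft.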

We’ve run this math with CFOs at a fintech and a media company; it changes the conversation from “engineering wants time” to “we’re buying margin and uptime.”

Prioritize with WSJF and Cost of Delay you can defend

Hand-waving dies in steering committees. Use WSJF or RICE with real numbers.

WSJF = (User/Revenue Impact + Time Criticality + Risk Reduction/Opportunity Enablement) / Job Size.
- User/Revenue Impact: tie to CFR-driven churn, conversion, or ad impressions saved.
- Time Criticality: SLO burn rate, regulatory deadlines, end-of-life timelines.
- Risk Reduction: decrease in Sev-1 probability, security posture score.
- Job Size: story points or person-days; be consistent.
RICE = Reach * Impact * Confidence / Effort.

Workflow you can copy:

1. Baseline last quarter’s CFR, MTTR, incident hours, build time, cloud cost.
2. Create a candidate list of debt-epics with metric targets and expected savings.
3. Score with WSJF and rank. Publish the top 10 per org.
4. Lock the top 5 into the quarter; keep the rest as pull-ahead if you finish early.
5. Re-score monthly as new data arrives; don’t reshuffle unless the math changes materially.
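
The score, rank, and lock steps can be sketched in Python. Assumptions: each input is scored on whatever relative scale your org already uses, and `DebtEpic` and `plan_quarter` are hypothetical names. The epic scores below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DebtEpic:
    name: str
    impact: float            # User/Revenue Impact
    time_criticality: float
    risk_reduction: float    # Risk Reduction / Opportunity Enablement
    job_size: float          # story points or person-days; be consistent

    @property
    def wsjf(self) -> float:
        if self.job_size <= 0:
            raise ValueError("job_size must be positive")
        return (self.impact + self.time_criticality + self.risk_reduction) / self.job_size


def rice(reach: float, impact: float, confidence: float, effort: float) -> float:
    """RICE = Reach * Impact * Confidence / Effort."""
    if effort <= 0:
        raise ValueError("effort must be positive")
    return reach * impact * confidence / effort


def plan_quarter(epics: list[DebtEpic], publish: int = 10, lock: int = 5):
    """Rank by WSJF, publish the top N, lock the first few, keep the rest as pull-ahead."""
    ranked = sorted(epics, key=lambda e: e.wsjf, reverse=True)
    published = ranked[:publish]
    return published[:lock], published[lock:]


# Illustrative scores only.
epics = [
    DebtEpic("Reduce checkout CFR from 35% to <15%", 8, 8, 5, job_size=5),
    DebtEpic("Cut build time from 40m to 15m", 5, 3, 2, job_size=2),
    DebtEpic("Rewrite legacy queue consumers", 8, 5, 8, job_size=8),
]

locked, pull_ahead = plan_quarter(epics, publish=10, lock=2)
for e in locked:
    print(f"LOCKED      {e.wsjf:5.2f}  {e.name}")
for e in pull_ahead:
    print(f"PULL-AHEAD  {e.wsjf:5.2f}  {e.name}")
```

Re-running this monthly with refreshed inputs gives you the re-score step; only reshuffle the locked list if the ranking actually changes.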

Rituals that keep the budget alive under roadmap pressure

I’ve seen the best budgets die in week three because someone yelled “holiday promo.” Rituals keep you honest.

- Weekly 30-min Debt Triage per team
  - Review progress on debt-epics, check metric deltas, unblock. Product attends.
- Monthly Exec Review (60 min)
  - VP Eng + Product + Finance. Agenda: budget burn vs. plan, ROI-to-date, SLO/incident deltas, proposed adjustments.
- Quarterly Re-baseline
  - Refresh DORA metrics, incident stats, cloud spend. Adjust capacity by up to ±5 percentage points based on lagging indicators.
- Fix-It Friday or No-Feature Wednesday
  - Reserve a standing block where debt tasks can’t be pre-empted.
- Stop-the-line rule
  - If a service breaches its SLO or its CFR climbs above 30%, freeze feature work on that service until a stability epic lands.

Codify these in runbooks and pin them to your Backstage home page.
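
The stop-the-line rule and the quarterly adjustment are simple enough to write down as code, which removes the argument about when they apply. A minimal sketch: the 30% freeze threshold, the 25% CFR trigger, the two green quarters, and the 15–30% range come from this post, but combining them into one step rule is my assumption, and the function names are hypothetical.

```python
def should_freeze_features(slo_breached: bool, cfr: float, cfr_limit: float = 0.30) -> bool:
    """Stop-the-line: freeze feature work on a service that breaches its SLO or exceeds the CFR limit."""
    return slo_breached or cfr > cfr_limit


def adjust_capacity(
    current_share: float,
    slo_breached: bool,
    cfr: float,
    green_quarters: int,
    step: float = 0.05,
    floor: float = 0.15,
    ceiling: float = 0.30,
) -> float:
    """Quarterly re-baseline: move the debt budget by one step, within [floor, ceiling].

    Raise it when SLOs are breached or CFR is above 25%; lower it after two
    consecutive green quarters; otherwise hold.
    """
    if slo_breached or cfr > 0.25:
        proposed = current_share + step
    elif green_quarters >= 2:
        proposed = current_share - step
    else:
        proposed = current_share
    return round(min(ceiling, max(floor, proposed)), 2)


print(should_freeze_features(slo_breached=False, cfr=0.33))                       # True
print(adjust_capacity(0.20, slo_breached=True, cfr=0.10, green_quarters=0))       # 0.25
print(adjust_capacity(0.20, slo_breached=False, cfr=0.10, green_quarters=2))      # 0.15
```

The floor keeps the ring-fence in place even when everything is green.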

Leadership behaviors that stop the whiplash

Tools don’t solve leadership debt. Behaviors do.

Make the trade-offs explicit: Product and Engineering co-sign a one-pager per quarter: “We’re trading Feature X for Stability Y.” Publish it.
Tie bonuses to outcomes: Include CFR, MTTR, and SLO adherence in manager and staff engineer goals.
Defend the ring-fence: If you need to borrow from the debt budget, log it as debt-budget-variance with a payback date.
Celebrate boring: Give kudos for a month with 0 pages at 2 a.m. Roll out fewer, safer changes; highlight the customer impact.
Speak CFO: In reviews, lead with dollars saved and risk reduced, not just story points burned.

I’ve seen CTOs at a marketplace and a travel platform turn culture by doing one thing: reading the stability scorecard first in every QBR.

What good looks like after two quarters

When this works, you don’t need a cheerleader—your graphs sell it.

CFR drops from ~30% to <15% for top services.
MTTR falls from 180m to <60m.
Incident hours per month cut by 40–60%.
Build time drops from 45m to 15–20m; dev throughput increases 10–20% without adding headcount.
Cloud spend trims 5–10% from right-sizing and retry control.
Feature Lead Time holds steady or improves because the merge queue moves.

Real example: at a public retailer, we carved out 22% of capacity for two quarters. Outcomes: CFR 33% → 14%, MTTR 140m → 55m, build 38m → 17m, cloud -8%. Sales didn’t slip because fewer rollbacks meant marketing actually launched on time.

Common pitfalls and guardrails

I’ve seen these kill good intentions:

“Debt Fridays” with no epics: Random chores disappear under pressure. Guardrail: only debt-epics with metrics.
No CFO buy-in: Finance thinks it’s cost. Guardrail: show ROI in dollars; do a 30-day ROI pilot.
Over-rotating to infra: Teams fix what they control (pipelines) and ignore product stability. Guardrail: balance stability, scale, security, devex epics.
Vanity metrics: Code coverage up, incidents flat. Guardrail: anchor on CFR, MTTR, incident hours, SLO burn.
One-and-done: Budget set once, then forgotten. Guardrail: monthly exec review, quarterly re-baseline.

If you need a neutral to set the baseline and wire up the dashboards, that’s literally what we do at GitPlumbers. We’ve done it in messy, regulated, multi-cloud environments. Bring your constraints; we’ll bring the wrenches.

Key takeaways

Set a fixed, visible technical debt budget (15–30% of team capacity, starting at 20%) and protect it like uptime.
Quantify ROI using incident hours saved, cloud costs reduced, velocity reclaimed, and risk reduction.
Make debt work visible via `debt-epics`, labels, and dashboards that tie to DORA and SLOs.
Institutionalize rituals: weekly triage, monthly exec review, and quarterly re-baselining.
Leaders must trade features for stability explicitly and publish the trade-offs in writing.
Measure outcomes in weeks: MTTR down, CFR down, release lead time steady, on-call hours reduced.

Implementation checklist

Define and publish a quarterly technical debt budget per team (e.g., 20% capacity).
Create `debt-epics` in your tracker with `tech-debt`, `risk`, and `platform` labels.
Baseline DORA metrics, incident hours, and developer time lost to toil.
Score debt epics using `WSJF` or `RICE` with dollarized Cost of Delay.
Track savings: incident-hour reduction, cloud cost reduction, build/test time reduction.
Report monthly: budget used vs. planned, ROI year-to-date, and SLO deltas.
Re-baseline quarterly; adjust budget by lagging indicators (CFR, MTTR, security backlog).

Questions we hear from teams

How much capacity should we allocate to technical debt?

Start at 20% per team. Move to 25–30% if you’re breaching SLOs or CFR > 25% across key services. If incidents are low and DORA trends are green for two quarters, test 15%, but keep the ring-fence.

What if product refuses to give up roadmap capacity?

Make the trade explicit: publish the QBR one-pager with the cost of not doing the work (incident hours, risk, and cloud waste). Tie bonuses to CFR/MTTR. Most orgs align once dollars and goals show up.

How do we measure ROI on ‘intangible’ work like refactors?

Instrument developer time and stability. Track build/test time, flaky test rate, merge queue time, and CFR/MTTR. Dollarize engineer time via a burdened rate. If it doesn’t change a graph, it’s not an epic; add observability first.

Which tools do we actually need?

Use what you have: Jira/Linear for tracking; Datadog/Grafana + PagerDuty for ops metrics; SonarQube/Snyk for code and vulns; Backstage for visibility; Looker/BigQuery/Snowflake for ROI rollups. Don’t buy new tools until you can’t extract the data.

How soon should we see results?

Within 4–6 weeks you should see incident hours down and build times improving. By the end of a quarter, CFR and MTTR should be trending down, with clear ROI examples to share with finance.
