Postmortems That Pay Down Debt: The Feedback Loop That Turns Incidents into a Ruthless Modernization Backlog
If your incident reviews aren’t producing prioritized investments with owners, dates, and budgets, you’re doing storytelling—not engineering. Here’s the loop we install when we’re called in after the third “never again” outage.
Stop treating postmortems as therapy sessions
I’ve sat through the “we’re so sorry” Zooms where everyone nods, someone says “blameless,” and then the action items vanish into a Confluence graveyard. Two months later the same READ_TIMEOUT page wakes up PagerDuty at 3 a.m.
Here’s the uncomfortable truth: if your incident review doesn’t produce a ranked modernization backlog with owners, dates, and funding, you didn’t do a review—you held group therapy. I’ve seen this fail at Fortune 50s running ServiceNow, and at unicorns with immaculate Datadog dashboards. What works is a simple, brutal feedback loop that converts “we learned” into “we shipped and paid down debt.”
Define the conversion funnel: Incident → Insight → Investment
You need a pipeline, not heroics. Make the states explicit and measurable:
- Incident occurs (PagerDuty/Opsgenie/ServiceNow).
- Blameless review produces structured insights: contributing factors, failure modes, cost of impact.
- Insights are tagged to services in your catalog (Backstage/Service Catalog).
- A scoring model ranks modernization candidates.
- Epics are created in Jira/Azure DevOps with budgets and acceptance criteria.
- Work lands in teams' capacity plans and GitOps flows (ArgoCD/Flux).
Key fields to capture every time (no essays—check boxes and enums):
- SLOs affected (availability, latency, error rate), with burn rate from Prometheus/Datadog.
- MTTA/MTTR, and whether paging hit on-call outside business hours.
- Repeat offender flag with `n` in the last 90 days.
- Blast radius (user %, revenue, regulatory exposure like PCI/SOX/PHI).
- Dependency centrality (e.g., API gateway, auth, payments).
- Fix class: config-only (e.g., Istio `DestinationRule`), code refactor, infra modernization (Terraform module), process/automation, or AI code cleanup.
- Estimate in engineer-days and required window (canary, feature flags, maintenance).
Keep it boring, consistent, and machine-readable. Store these fields in the postmortem record and pipe them to your backlog system automatically.
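A minimal, machine-readable shape for that record could look like the sketch below; the field names are ours, so map them to whatever your incident tooling actually exports:

```yaml
# postmortem-record.yaml (illustrative field names, not a standard schema)
incident_id: PD-12345
services: [payments-api, gateway]
slos_affected: [latency, error_rate]
slo_burn_rate: 3.2             # pulled from Prometheus/Datadog
mtta_min: 9
mttr_min: 140
paged_outside_business_hours: true
repeat_count_90d: 3
blast_radius:
  users_pct: 18
  revenue_usd: 420000
  regulatory: [PCI]
dependency_centrality: high    # gateway/auth/payments rank highest
fix_class: config_only         # config_only | refactor | infra | process | ai_cleanup
estimate_engineer_days: 5
rollout_window: canary
```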
The rituals that make it stick
Rituals beat tools. The calendar is your system design.
- Weekly Incident-to-Modernization Review (45 mins): DRI per domain (SRE + product + EM). Review last week’s postmortems, apply the scoring rubric live, and decide: promote to epic, bundle with existing epic, or close as hygiene.
- Monthly Executive Triage (45 mins): CTO/VP Eng plus Finance partner. Approve the top 10 modernization epics. Decisions recorded as PR merges to `modernization/prioritized.yaml` (no slideware-only signoffs); a sample entry appears below.
- Friday Digest in `#incident-modernization`: a bot posts the top 5 candidates with score deltas, links to SLO charts, and owner due dates.
- Quarterly Capacity Carve-out: 15–25% of team capacity reserved for modernization. Don't hide it; put it on the roadmap like any feature.
- Leadership behaviors:
- Ask “what got promoted and funded?” not “how many RCAs did we write?”
- Celebrate deletion and simplification PRs in all-hands.
- Protect the 25% carve-out when the board screams for features.
If it’s not on the calendar and not in the budget, it’s not a priority. Treat reliability and debt paydown like product work, because it is.
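Since triage decisions land as PR merges to `modernization/prioritized.yaml`, give that file a predictable shape too. A minimal sketch, with illustrative field names:

```yaml
# modernization/prioritized.yaml (entry shape is illustrative)
- epic: MOD-123
  service: payments-api
  score: 24
  owner: team-payments
  budget_engineer_days: 10
  target_maturity: silver
  due: 2025-01-15
  approved_by: vp-eng   # approval is recorded by merging this entry
```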
From words to work: wire up the data plumbing
Stop copying bullets from Confluence into Jira by hand. Automate the boring parts.
- Ingest incidents into your warehouse (BigQuery/Snowflake/Redshift) and tag them to services by `service_owner` or `k8s` labels.
```sql
-- Snowflake: find repeat offenders and SLO burn
with incidents as (
select service, incident_id, started_at, resolved_at,
datediff('minute', started_at, resolved_at) as mttr_min,
repeat_flag,
slo_burn_rate,
revenue_impact_usd
from analytics.incidents
where started_at >= dateadd('day', -90, current_timestamp())
)
select service,
count(*) as incident_count,
sum(case when repeat_flag then 1 else 0 end) as repeats,
avg(mttr_min) as avg_mttr_min,
sum(slo_burn_rate) as total_burn,
sum(revenue_impact_usd) as revenue_hit
from incidents
group by service
order by repeats desc, total_burn desc;
```
- Create epics automatically when the score crosses a threshold. Example: PagerDuty + `jq` + Jira REST.
```bash
#!/usr/bin/env bash
set -euo pipefail
PD_TOKEN="$PD_TOKEN" # env var
JIRA_TOKEN="$JIRA_TOKEN"
JIRA_USER="svc-bot@company.com"
JIRA_BASE="https://jira.company.com"
PROJECT="MOD"
curl -s -H "Authorization: Token token=$PD_TOKEN" \
'https://api.pagerduty.com/incidents?since=2025-08-01' \
| jq -c '.incidents[] | {
id: .id,
service: .service.summary,
started: .created_at,
resolved: .last_status_change_at,
urgency: .urgency
}' | while read -r row; do
# jq has no ternary operator; use if/then/else
score=$(echo "$row" | jq -r 'if .urgency == "high" then 5 else 2 end')
if [ "$score" -ge 4 ]; then
summary="Modernize $(echo "$row" | jq -r .service): high-urgency incident $(echo "$row" | jq -r .id)"
payload=$(jq -n --arg summary "$summary" --arg project "$PROJECT" '{
fields: {
project: { key: $project },
issuetype: { name: "Epic" },
summary: $summary,
labels: ["incident-modernization"]
}
}')
curl -s -u "$JIRA_USER:$JIRA_TOKEN" -H 'Content-Type: application/json' \
-d "$payload" "$JIRA_BASE/rest/api/2/issue"
fi
done
```
- Sync decisions to Git for visibility and GitOps.
```yaml
# .github/workflows/modernization-sync.yaml
name: Sync Modernization Decisions
on:
  schedule: [{ cron: '0 13 * * 5' }] # Fridays
  workflow_dispatch:
jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - name: Generate prioritized list from Jira
        run: |
          node scripts/pull-jira.js > modernization/prioritized.yaml
      - name: Create PR
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          git config user.email bot@company.com
          git config user.name modernization-bot
          git checkout -b chore/update-modernization
          git add modernization/prioritized.yaml
          git commit -m "chore: weekly modernization sync"
          git push -u origin chore/update-modernization  # push before opening the PR
          gh pr create --fill --base main
```
- Make it visible in the service catalog (Backstage): link each service to its open modernization epics, SLOs, and debt score.
```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    jira/project-key: MOD
    jira/labels: incident-modernization
    grafana/dashboard-url: https://grafana.company.com/d/abc123
spec:
  type: service
  owner: team-payments
  lifecycle: production
```
Prioritization that survives reality
Scoring models fail when they’re opaque or PhD-level. Keep it simple, auditable, and adjustable via PR.
```yaml
# modernization/scoring.yaml
weights:
  slo_burn: 5
  repeats: 4
  mttr: 3
  blast_radius: 5
  dependency_centrality: 4
  revenue_impact: 5
  effort: -3   # negative weight, easier = higher score
  ai_risk: 2   # presence of AI-generated code w/o tests
bands:
  promote_to_epic:
    min_score: 24
  bundle_with_existing:
    min_score: 16
```
Example math for a gnarly gateway service:
- SLO burn high (5), repeats 3 in 60 days (4), MTTR 140m (3), blast radius 40% traffic (5), dependency centrality high (4), revenue impact medium (3), effort 5 days (−2), AI risk present (2)
- Score = 5+4+3+5+4+3−2+2 = 24 → promote to epic.
Pair the scoring with a maturity rubric so teams know what “modernized” means.
```yaml
# modernization/maturity.yaml
levels:
  bronze:
    - "SLOs defined and alerting wired to PagerDuty"
    - "Dashboards in Grafana with RED/USE metrics"
  silver:
    - "Canary deployment via ArgoCD + feature flags"
    - "Circuit breakers + timeouts configured (Istio DestinationRule)"
    - "Runbooks in repo (/runbooks)"
  gold:
    - "Chaos experiments quarterly"
    - "Load tests in CI before prod (k6/Locust)"
    - "No AI-generated code without tests + lint gates"
```
Governance without theater
Skip the 20-slide RCA deck. Use small, sharp artifacts and Git-based decisions.
- RCA template that produces work, not prose:
```markdown
# Postmortem: <incident-id>
- Services: payments-api, gateway
- SLOs: latency P99 breached (3h), error rate 5xx 2.3%
- Impact: 18% sessions, est $420k revenue
- Contributing factors: missing circuit breaker; AI-generated retry code w/o backoff
- Fix class: Istio config + refactor retry client + Terraform module update
- Owner/DRI: @team-payments EM
- Decision: Promote epic MOD-123; target Silver maturity
- Review date: 2025-01-15
```
Change path that fits enterprise constraints:
- Regulatory? Tie epics to risk register IDs in ServiceNow and record approvals as PR comments.
- Prod safety? Use canary deployment with ArgoCD and bake-time windows.
- Shared infra? Patch once in a Terraform module, roll via app-of-apps (sketch below).
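For the app-of-apps path, the usual shape is a parent ArgoCD `Application` that points at a directory of child app manifests; the repo URL and paths below are placeholders:

```yaml
# argocd parent app sketch (repo URL and paths are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-timeout-defaults
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.company.com/platform/gitops.git
    targetRevision: main
    path: apps/timeout-defaults   # child Applications rendered from the shared Terraform module
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```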
Concrete config example (stop the timeouts):
```yaml
# istio-destinationrule.yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-api
spec:
  host: payments-api.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
# Retries live on the VirtualService, not the DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts:
    - payments-api.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments-api.default.svc.cluster.local
      retries:
        attempts: 2
        perTryTimeout: 800ms
```
- Jira JQL for the board:
```
project = MOD AND labels = incident-modernization AND statusCategory != Done ORDER BY priority DESC, "SLO Burn" DESC
```
Case file: the API gateway that ate our quarter
Real story. At a fintech, the gateway (Envoy via Istio) did what gateways do: aggregate everyone’s sins. Latency spikes from downstream services plus a homegrown client with exponential-ish retries (written via “vibe coding” with AI autocomplete) created a perfect storm. We had three Sev-2s in six weeks, MTTR averaging 140 minutes, and burned the availability SLO twice.
We wired the funnel:
- Weekly review promoted a “Gateway Reliability” epic with three workstreams: circuit breaker config, retry client refactor, and Terraformized timeouts.
- Scoring put it at the top (score 26). Finance approved a 4-engineer, 2-week carve-out.
- We shipped: a `DestinationRule` like the one above, a proper `backoff = jittered_exponential(100ms, max=2s)` in the client, and a shared Terraform module for timeout defaults. Canary with ArgoCD to 10%, then 50%, then 100% over 48 hours (one way to encode that ramp is sketched below).
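One way to encode that 10/50/100 ramp declaratively, assuming Argo Rollouts is running alongside ArgoCD (names, image, and bake durations are illustrative):

```yaml
# rollout sketch (assumes Argo Rollouts; values are illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 6
  selector:
    matchLabels: { app: payments-api }
  template:
    metadata:
      labels: { app: payments-api }
    spec:
      containers:
        - name: payments-api
          image: registry.company.com/payments-api:1.42.0
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 24h }   # bake before widening
        - setWeight: 50
        - pause: { duration: 24h }
```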
Results 45 days later:
- Repeat incidents: 3 → 0.
- P99 latency: down 37%.
- MTTR: 140m → 55m (alerts became actionable).
- Infra egress spend: −12% (less thrash).
- Dev time regained: ~1.5 FTE/week not firefighting.
The kicker: the postmortem also flagged an AI-generated client lib with no tests. We added a repo rule: no AI-generated code merges without tests and lint. That alone prevented two later regressions. Call it vibe code cleanup that paid real dividends.
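One lightweight way to enforce that rule is a required CI check; this sketch assumes AI-authored PRs carry an `ai-generated` label and that test files live under conventional paths (both assumptions are ours):

```yaml
# .github/workflows/ai-code-gate.yaml (sketch; label name and test-path patterns are assumptions)
name: AI Code Gate
on:
  pull_request:
    types: [opened, synchronize, labeled]
jobs:
  require-tests:
    if: contains(github.event.pull_request.labels.*.name, 'ai-generated')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Require test changes on AI-labeled PRs
        run: |
          base="${{ github.event.pull_request.base.sha }}"
          # Fail if the diff touches no test files
          if ! git diff --name-only "$base"...HEAD | grep -Eq '(^|/)tests?/|(_|\.)test\.'; then
            echo "AI-labeled change must include tests"
            exit 1
          fi
```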
What to instrument and what to expect in 90 days
Set targets that force the loop to produce outcomes, not documents.
- Track conversion:
- % postmortems with structured fields completed within 5 business days (target: 95%).
- % action items promoted to prioritized epics (target: 70%+), and time-to-epic (median < 7 days).
- Repeat incident rate per service (target: −50%).
- Reliability metrics: MTTR (−20%), SLO burn rate (−30%), pages outside business hours (−40%).
- Delivery metrics: Planned capacity spent on modernization (15–25%), number of services moved to Silver maturity.
- Cost metrics: Infra cost deltas on hot paths (5–15%) and toil time recovered.
30/60/90 action plan:
- 30 days: pipeline from incidents to Jira working; start weekly review; publish rubric.
- 60 days: first monthly exec triage; ArgoCD canary playbook; Backstage surfacing epics.
- 90 days: show the graph—repeat incidents down, SLOs stable, and 3–5 gold/silver upgrades shipped.
If you want outside help to wire this in, GitPlumbers drops in the bots, scoring, and governance in weeks—not quarters. We’ve done the code rescue after AI hallucinations and the legacy modernization after the audit letter. Happy to show you the playbooks.
Key takeaways
- Postmortems must create backlog items with owners, dates, and a scoring rationale—or they’re theater.
- Automate the funnel: PagerDuty/ServiceNow -> warehouse -> scoring -> Jira/ADO backlog + quarterly funding.
- Use a repeatable scoring model: SLO burn, MTTR, incident frequency, blast radius, dependency centrality, and cost-to-fix.
- Establish weekly, monthly, and quarterly rituals that survive reorgs and budget cycles.
- Make the modernization backlog visible in your service catalog and your GitOps workflows, not buried in Confluence.
- Track conversion and impact: % of postmortem actions promoted to funded epics, incident repeat rate, SLO adherence, MTTR, and infra spend deltas.
Implementation checklist
- Create one Slack/Teams channel for incident-to-modernization, e.g., `#incident-modernization` with weekly digest.
- Stand up a data pipeline from PagerDuty/ServiceNow to your warehouse and Jira/ADO using the examples above.
- Adopt the scoring rubric and publish it as `modernization/scoring.yaml` in a repo everyone can PR to.
- Add a “Modernization Debt” facet to your Backstage/Service Catalog with links to open epics and health scores.
- Run a 45-minute monthly executive triage, 10 slides max, with decisions recorded as PR merges.
- Set 90-day targets: reduce repeat incidents by 50%, convert 70%+ of postmortem actions into prioritized tickets, cut MTTR by 20%.
- Tie one OKR to incident-driven modernization with clear owners and an actual budget line item.
Questions we hear from teams
- What if security/compliance demands slow this down?
- Tie each modernization epic to a ServiceNow risk/control ID and capture approvals as PR comments on the `modernization/prioritized.yaml` changes. Use canaries and change windows to satisfy CAB without blocking the loop.
- Our execs only care about features. How do we get funding?
- Show the math: repeat incidents + SLO burn correlate to churn, revenue loss, and on-call attrition. Bring a 90-day plan with capacity carve-out and projected savings. Monthly triage with Finance turns this into budgeted work, not a plea.
- We have AI-generated code all over. How do we reduce risk without banning it?
- Add guardrails: tests required for AI-authored changes, lint/gate checks, and maturity targets in your rubric. Use the loop to prioritize refactors where AI hallucination caused incidents—this is targeted AI code refactoring, not a crusade.
- Won’t this create duplicate work across teams?
- Centralize shared fixes: patch Terraform modules once, standardize Istio policies, and publish playbooks. Use dependency centrality in scoring to prefer platform-level changes over per-service heroics.
- How do we avoid the backlog becoming a dumping ground?
- Use the scoring threshold to auto-close low-value items or bundle them into hygiene. Enforce owners, dates, and exit criteria. Anything without those is deleted in the weekly review—ruthlessly.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
