Postmortems That Pay Down Debt: The Feedback Loop That Turns Incidents into a Ruthless Modernization Backlog
If your incident reviews aren’t producing prioritized investments with owners, dates, and budgets, you’re doing storytelling—not engineering. Here’s the loop we install when we’re called in after the third “never again” outage.
Stop treating postmortems as therapy sessions
I’ve sat through the “we’re so sorry” Zooms where everyone nods, someone says “blameless,” and then the action items vanish into a Confluence graveyard. Two months later the same READ_TIMEOUT page wakes up PagerDuty at 3 a.m.
Here’s the uncomfortable truth: if your incident review doesn’t produce a ranked modernization backlog with owners, dates, and funding, you didn’t do a review—you held group therapy. I’ve seen this fail at Fortune 50s running ServiceNow, and at unicorns with immaculate Datadog dashboards. What works is a simple, brutal feedback loop that converts “we learned” into “we shipped and paid down debt.”
Define the conversion funnel: Incident → Insight → Investment
You need a pipeline, not heroics. Make the states explicit and measurable:
- Incident occurs (PagerDuty/Opsgenie/ServiceNow).
- Blameless review produces structured insights: contributing factors, failure modes, cost of impact.
- Insights are tagged to services in your catalog (Backstage/Service Catalog).
- A scoring model ranks modernization candidates.
- Epics are created in Jira/Azure DevOps with budgets and acceptance criteria.
- Work lands in teams' capacity plans and GitOps flows (ArgoCD/Flux).
Key fields to capture every time (no essays—check boxes and enums):
- SLOs affected (availability, latency, error rate), with burn rate from Prometheus/Datadog.
- MTTA/MTTR, and whether paging hit on-call outside business hours.
- Repeat offender flag with `n` in the last 90 days.
- Blast radius (user %, revenue, regulatory exposure like PCI/SOX/PHI).
- Dependency centrality (e.g., API gateway, auth, payments).
- Fix class: config-only (e.g., Istio `DestinationRule`), code refactor, infra modernization (Terraform module), process/automation, or AI code cleanup.
- Estimate in engineer-days and required window (canary, feature flags, maintenance).
Keep it boring, consistent, and machine-readable. Store these fields in the postmortem record and pipe them to your backlog system automatically.
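A minimal, machine-readable shape for that record could look like the sketch below; the field names are ours, so map them to whatever your incident tooling actually exports:

```yaml
# postmortem-record.yaml (illustrative field names, not a standard schema)
incident_id: PD-12345
services: [payments-api, gateway]
slos_affected: [latency, error_rate]
slo_burn_rate: 3.2             # pulled from Prometheus/Datadog
mtta_min: 9
mttr_min: 140
paged_outside_business_hours: true
repeat_count_90d: 3
blast_radius:
  users_pct: 18
  revenue_usd: 420000
  regulatory: [PCI]
dependency_centrality: high    # gateway/auth/payments rank highest
fix_class: config_only         # config_only | refactor | infra | process | ai_cleanup
estimate_engineer_days: 5
rollout_window: canary
```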
The rituals that make it stick
Rituals beat tools. The calendar is your system design.
- Weekly Incident-to-Modernization Review (45 mins): DRI per domain (SRE + product + EM). Review last week’s postmortems, apply the scoring rubric live, and decide: promote to epic, bundle with existing epic, or close as hygiene.
- Monthly Executive Triage (45 mins): CTO/VP Eng plus Finance partner. Approve the top 10 modernization epics. Decisions recorded as PR merges to `modernization/prioritized.yaml` (no slideware-only signoffs); a sample entry appears below.
- Friday Digest in `#incident-modernization`: a bot posts the top 5 candidates with score deltas, links to SLO charts, and owner due dates.
- Quarterly Capacity Carve-out: 15–25% of team capacity reserved for modernization. Don't hide it; put it on the roadmap like any feature.
- Leadership behaviors:
- Ask “what got promoted and funded?” not “how many RCAs did we write?”
- Celebrate deletion and simplification PRs in all-hands.
- Protect the 25% carve-out when the board screams for features.
If it’s not on the calendar and not in the budget, it’s not a priority. Treat reliability and debt paydown like product work, because it is.
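Since triage decisions land as PR merges to `modernization/prioritized.yaml`, give that file a predictable shape too. A minimal sketch, with illustrative field names:

```yaml
# modernization/prioritized.yaml (entry shape is illustrative)
- epic: MOD-123
  service: payments-api
  score: 24
  owner: team-payments
  budget_engineer_days: 10
  target_maturity: silver
  due: 2025-01-15
  approved_by: vp-eng   # approval is recorded by merging this entry
```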
From words to work: wire up the data plumbing
Stop copying bullets from Confluence into Jira by hand. Automate the boring parts.
- Ingest incidents into your warehouse (BigQuery/Snowflake/Redshift) and tag them to services by `service_owner` or `k8s` labels.
```sql
-- Snowflake: find repeat offenders and SLO burn
with incidents as (
select service, incident_id, started_at, resolved_at,
datediff('minute', started_at, resolved_at) as mttr_min,
repeat_flag,
slo_burn_rate,
revenue_impact_usd
from analytics.incidents
where started_at >= dateadd('day', -90, current_timestamp())
)
select service,
count(*) as incident_count,
sum(case when repeat_flag then 1 else 0 end) as repeats,
avg(mttr_min) as avg_mttr_min,
sum(slo_burn_rate) as total_burn,
sum(revenue_impact_usd) as revenue_hit
from incidents
group by service
order by repeats desc, total_burn desc;
```
- Create epics automatically when the score crosses a threshold. Example: PagerDuty + `jq` + Jira REST.
```bash
#!/usr/bin/env bash
set -euo pipefail
PD_TOKEN="$PD_TOKEN" # env var
JIRA_TOKEN="$JIRA_TOKEN"
JIRA_USER="svc-bot@company.com"
JIRA_BASE="https://jira.company.com"
PROJECT="MOD"
curl -s -H "Authorization: Token token=$PD_TOKEN" \
'https://api.pagerduty.com/incidents?since=2025-08-01' \
| jq -c '.incidents[] | {
id: .id,
service: .service.summary,
started: .created_at,
resolved: .last_status_change_at,
urgency: .urgency
}' | while read -r row; do
# jq has no ternary operator; use if/then/else
score=$(echo "$row" | jq -r 'if .urgency == "high" then 5 else 2 end')
if [ "$score" -ge 4 ]; then
summary="Modernize $(echo "$row" | jq -r .service): high-urgency incident $(echo "$row" | jq -r .id)"
payload=$(jq -n --arg summary "$summary" --arg project "$PROJECT" '{
fields: {
project: { key: $project },
issuetype: { name: "Epic" },
summary: $summary,
labels: ["incident-modernization"]
}
}')
curl -s -u "$JIRA_USER:$JIRA_TOKEN" -H 'Content-Type: application/json' \
-d "$payload" "$JIRA_BASE/rest/api/2/issue"
fi
done
```
- Sync decisions to Git for visibility and GitOps.
```yaml
# .github/workflows/modernization-sync.yaml
name: Sync Modernization Decisions
on:
  schedule: [{ cron: '0 13 * * 5' }] # Fridays
  workflow_dispatch:
jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm ci
      - name: Generate prioritized list from Jira
        run: |
          node scripts/pull-jira.js > modernization/prioritized.yaml
      - name: Create PR
        env:
          GH_TOKEN: ${{ secrets.GH_TOKEN }}
        run: |
          git config user.email bot@company.com
          git config user.name modernization-bot
          git checkout -b chore/update-modernization
          git add modernization/prioritized.yaml
          git commit -m "chore: weekly modernization sync"
          git push -u origin chore/update-modernization  # push before opening the PR
          gh pr create --fill --base main
```
- Make it visible in the service catalog (Backstage): link each service to its open modernization epics, SLOs, and debt score.
```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    jira/project-key: MOD
    jira/labels: incident-modernization
    grafana/dashboard-url: https://grafana.company.com/d/abc123
spec:
  type: service
  owner: team-payments
  lifecycle: production
```
Prioritization that survives reality
Scoring models fail when they’re opaque or PhD-level. Keep it simple, auditable, and adjustable via PR.
```yaml
# modernization/scoring.yaml
weights:
  slo_burn: 5
  repeats: 4
  mttr: 3
  blast_radius: 5
  dependency_centrality: 4
  revenue_impact: 5
  effort: -3   # negative weight, easier = higher score
  ai_risk: 2   # presence of AI-generated code w/o tests
bands:
  promote_to_epic:
    min_score: 24
  bundle_with_existing:
    min_score: 16
```
Example math for a gnarly gateway service:
- SLO burn high (5), repeats 3 in 60 days (4), MTTR 140m (3), blast radius 40% traffic (5), dependency centrality high (4), revenue impact medium (3), effort 5 days (−2), AI risk present (2)
- Score = 5+4+3+5+4+3−2+2 = 24 → promote to epic.
Pair the scoring with a maturity rubric so teams know what “modernized” means.
```yaml
# modernization/maturity.yaml
levels:
  bronze:
    - "SLOs defined and alerting wired to PagerDuty"
    - "Dashboards in Grafana with RED/USE metrics"
  silver:
    - "Canary deployment via ArgoCD + feature flags"
    - "Circuit breakers + timeouts configured (Istio DestinationRule)"
    - "Runbooks in repo (/runbooks)"
  gold:
    - "Chaos experiments quarterly"
    - "Load tests in CI before prod (k6/Locust)"
    - "No AI-generated code without tests + lint gates"
```
Governance without theater
Skip the 20-slide RCA deck. Use small, sharp artifacts and Git-based decisions.
- RCA template that produces work, not prose:
```markdown
# Postmortem: <incident-id>
- Services: payments-api, gateway
- SLOs: latency P99 breached (3h), error rate 5xx 2.3%
- Impact: 18% sessions, est $420k revenue
- Contributing factors: missing circuit breaker; AI-generated retry code w/o backoff
- Fix class: Istio config + refactor retry client + Terraform module update
- Owner/DRI: @team-payments EM
- Decision: Promote epic MOD-123; target Silver maturity
- Review date: 2025-01-15
```
Change path that fits enterprise constraints:
- Regulatory? Tie epics to risk register IDs in ServiceNow and record approvals as PR comments.
- Prod safety? Use canary deployment with ArgoCD and bake-time windows.
- Shared infra? Patch once in a Terraform module, roll via app-of-apps (sketch below).
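For the app-of-apps path, the usual shape is a parent ArgoCD `Application` that points at a directory of child app manifests; the repo URL and paths below are placeholders:

```yaml
# argocd parent app sketch (repo URL and paths are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-timeout-defaults
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.company.com/platform/gitops.git
    targetRevision: main
    path: apps/timeout-defaults   # child Applications rendered from the shared Terraform module
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```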
Concrete config example (stop the timeouts):
```yaml
# istio-destinationrule.yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments-api
spec:
  host: payments-api.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
---
# Retries live on the VirtualService, not the DestinationRule
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts:
    - payments-api.default.svc.cluster.local
  http:
    - route:
        - destination:
            host: payments-api.default.svc.cluster.local
      retries:
        attempts: 2
        perTryTimeout: 800ms
```
- Jira JQL for the board:
```
project = MOD AND labels = incident-modernization AND statusCategory != Done ORDER BY priority DESC, "SLO Burn" DESC
```
Case file: the API gateway that ate our quarter
Real story. At a fintech, the gateway (Envoy via Istio) did what gateways do: aggregate everyone’s sins. Latency spikes from downstream services plus a homegrown client with exponential-ish retries (written via “vibe coding” with AI autocomplete) created a perfect storm. We had three Sev-2s in six weeks, MTTR averaging 140 minutes, and burned the availability SLO twice.
We wired the funnel:
- Weekly review promoted a “Gateway Reliability” epic with three workstreams: circuit breaker config, retry client refactor, and Terraformized timeouts.
- Scoring put it at the top (score 26). Finance approved a 4-engineer, 2-week carve-out.
- We shipped: a `DestinationRule` like the one above, a proper `backoff = jittered_exponential(100ms, max=2s)` in the client, and a shared Terraform module for timeout defaults. Canary with ArgoCD to 10%, then 50%, then 100% over 48 hours (one way to encode that ramp is sketched below).
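One way to encode that 10/50/100 ramp declaratively, assuming Argo Rollouts is running alongside ArgoCD (names, image, and bake durations are illustrative):

```yaml
# rollout sketch (assumes Argo Rollouts; values are illustrative)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 6
  selector:
    matchLabels: { app: payments-api }
  template:
    metadata:
      labels: { app: payments-api }
    spec:
      containers:
        - name: payments-api
          image: registry.company.com/payments-api:1.42.0
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 24h }   # bake before widening
        - setWeight: 50
        - pause: { duration: 24h }
```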
Results 45 days later:
- Repeat incidents: 3 → 0.
- P99 latency: down 37%.
- MTTR: 140m → 55m (alerts became actionable).
- Infra egress spend: −12% (less thrash).
- Dev time regained: ~1.5 FTE/week not firefighting.
The kicker: the postmortem also flagged an AI-generated client lib with no tests. We added a repo rule: no AI-generated code merges without tests and lint. That alone prevented two later regressions. Call it vibe code cleanup that paid real dividends.
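One lightweight way to enforce that rule is a required CI check; this sketch assumes AI-authored PRs carry an `ai-generated` label and that test files live under conventional paths (both assumptions are ours):

```yaml
# .github/workflows/ai-code-gate.yaml (sketch; label name and test-path patterns are assumptions)
name: AI Code Gate
on:
  pull_request:
    types: [opened, synchronize, labeled]
jobs:
  require-tests:
    if: contains(github.event.pull_request.labels.*.name, 'ai-generated')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - name: Require test changes on AI-labeled PRs
        run: |
          base="${{ github.event.pull_request.base.sha }}"
          # Fail if the diff touches no test files
          if ! git diff --name-only "$base"...HEAD | grep -Eq '(^|/)tests?/|(_|\.)test\.'; then
            echo "AI-labeled change must include tests"
            exit 1
          fi
```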
What to instrument and what to expect in 90 days
Set targets that force the loop to produce outcomes, not documents.
- Track conversion:
- % postmortems with structured fields completed within 5 business days (target: 95%).
- % action items promoted to prioritized epics (target: 70%+), and time-to-epic (median < 7 days).
- Repeat incident rate per service (target: −50%).
- Reliability metrics: MTTR (−20%), SLO burn rate (−30%), pages outside business hours (−40%).
- Delivery metrics: Planned capacity spent on modernization (15–25%), number of services moved to Silver maturity.
- Cost metrics: Infra cost deltas on hot paths (5–15%) and toil time recovered.
30/60/90 action plan:
- 30 days: pipeline from incidents to Jira working; start weekly review; publish rubric.
- 60 days: first monthly exec triage; ArgoCD canary playbook; Backstage surfacing epics.
- 90 days: show the graph—repeat incidents down, SLOs stable, and 3–5 gold/silver upgrades shipped.
If you want outside help to wire this in, GitPlumbers drops in the bots, scoring, and governance in weeks—not quarters. We’ve done the code rescue after AI hallucinations and the legacy modernization after the audit letter. Happy to show you the playbooks.
Key takeaways
- Postmortems must create backlog items with owners, dates, and a scoring rationale—or they’re theater.
- Automate the funnel: PagerDuty/ServiceNow -> warehouse -> scoring -> Jira/ADO backlog + quarterly funding.
- Use a repeatable scoring model: SLO burn, MTTR, incident frequency, blast radius, dependency centrality, and cost-to-fix.
- Establish weekly, monthly, and quarterly rituals that survive reorgs and budget cycles.
- Make the modernization backlog visible in your service catalog and your GitOps workflows, not buried in Confluence.
- Track conversion and impact: % of postmortem actions promoted to funded epics, incident repeat rate, SLO adherence, MTTR, and infra spend deltas.
Implementation checklist
- Create one Slack/Teams channel for incident-to-modernization, e.g., `#incident-modernization` with weekly digest.
- Stand up a data pipeline from PagerDuty/ServiceNow to your warehouse and Jira/ADO using the examples above.
- Adopt the scoring rubric and publish it as `modernization/scoring.yaml` in a repo everyone can PR to.
- Add a “Modernization Debt” facet to your Backstage/Service Catalog with links to open epics and health scores.
- Run a 45-minute monthly executive triage, 10 slides max, with decisions recorded as PR merges.
- Set 90-day targets: reduce repeat incidents by 50%, convert 70%+ of postmortem actions into prioritized tickets, cut MTTR by 20%.
- Tie one OKR to incident-driven modernization with clear owners and an actual budget line item.
Questions we hear from teams
- What if security/compliance demands slow this down?
- Tie each modernization epic to a ServiceNow risk/control ID and capture approvals as PR comments on the `modernization/prioritized.yaml` changes. Use canaries and change windows to satisfy CAB without blocking the loop.
- Our execs only care about features. How do we get funding?
- Show the math: repeat incidents + SLO burn correlate to churn, revenue loss, and on-call attrition. Bring a 90-day plan with capacity carve-out and projected savings. Monthly triage with Finance turns this into budgeted work, not a plea.
- We have AI-generated code all over. How do we reduce risk without banning it?
- Add guardrails: tests required for AI-authored changes, lint/gate checks, and maturity targets in your rubric. Use the loop to prioritize refactors where AI hallucination caused incidents—this is targeted AI code refactoring, not a crusade.
- Won’t this create duplicate work across teams?
- Centralize shared fixes: patch Terraform modules once, standardize Istio policies, and publish playbooks. Use dependency centrality in scoring to prefer platform-level changes over per-service heroics.
- How do we avoid the backlog becoming a dumping ground?
- Use the scoring threshold to auto-close low-value items or bundle them into hygiene. Enforce owners, dates, and exit criteria. Anything without those is deleted in the weekly review—ruthlessly.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
