The Postmortem Ritual That Quieted Our 3 A.M. PagerDuty
Blameless doesn’t mean consequence-free. It means system-first accountability, leadership air cover, and a repeatable ritual that prevents the next page.
“Blameless isn’t about letting people off the hook; it’s about putting the system on the hook.”
The outage you’ve lived through
It’s 3:07 a.m. PagerDuty is screaming. Slack is a stove. Someone blurts “who pushed?” and 45 minutes later you’ve rolled back, patched a feature flag, and promised to “write the postmortem tomorrow.” Tomorrow turns into a week; the doc becomes a tombstone. Three months later, the same class of incident bites you again—new on-call, new names, same root conditions.
I’ve seen this loop at a fintech on Kubernetes 1.23, an ecommerce unicorn running Istio 1.17, and a Fortune 100 stuck in ITIL purgatory. The pattern is universal: the team “does a postmortem,” but nothing in the system actually changes. Or worse, the doc devolves into a search for a human to blame.
Here’s the playbook we use at GitPlumbers that consistently reduces repeat incidents by 30–60% in the first two quarters, without turning engineers into compliance clerks.
What blameless looks like (and what it’s not)
Blameless is not “no accountability.” It’s moving accountability from the individual to the system—and giving leaders a specific job: remove fear so truth can surface, then resource the fixes.
- No-human root cause: Ban phrases like “engineer forgot” or “fat-fingered.” If a single person’s mistake can take down prod, that’s a system design issue (missing guardrails, no canary, weak review, poor tooling).
- Leadership air cover: VP/Director opens the review with: “We are here to understand system failures and invest in prevention. There will be no performance consequences from candid participation.” Say it out loud. Every time.
- System-first framing: Findings must map to concrete system changes: tests, automation, limits, rollbacks, retries, quotas, circuit breakers, feature flags, or process guardrails.
- Timeboxed learning: If it takes 3+ weeks to get to review, the context is gone. Draft in 2 business days, review by 5, actions closed within 30 unless there’s a documented exception.
Blameless isn’t letting people off the hook; it’s putting the system on the hook.
Rituals that hold under pressure
You can’t invent process during an incident. Ritualize it.
Roles
- Incident Commander (IC): single decision-maker; keeps scope tight, uses authority to pause risky fixes.
- Scribe: timestamps decisions, captures evidence and timeline.
- Comms Lead: coordinates stakeholder updates (status page, exec Slack, customer success).
Channels and artifacts
- Slack channel per incident: `#inc-sev1-YYYYMMDD-<short>`; only one, named fast.
- Threading for hypotheses; `@here` only from the IC.
- Zoom or Meet link pinned.
- Timeline bot (we’ve used `incident.io`, `PagerDuty`, or a simple homegrown slash command) to capture `PD`, `Grafana`, and `GitHub` links.
Example channel setup (runbook excerpt):
# invoked by the on-call via a ChatOps bot
export INC_ID=sev1-20251016-checkout
slack channels create "inc-$INC_ID"
slack channels invite "inc-$INC_ID" @oncall-web @oncall-infra @oncall-db
slack chat postMessage "inc-$INC_ID" "IC: @alice | Scribe: @bob | Comms: @carol\nZoom: https://zoom.us/j/12345\nStatus page: https://status.example.com"
slack channels setTopic "inc-$INC_ID" "Customer impact: elevated 5xx on checkout; Start: 03:07 UTC; Next update: 03:30 UTC"
- Designated timeline: The Scribe pins a single message and appends timestamps. Use UTC (see the sketch after this list). Pull from `Prometheus`, `kubectl`, `feature flag` changes, deploy logs, and customer reports.
- Status cadence: Comms Lead posts public/customer updates every 30 minutes (or per contract). Missed updates are a finding.
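Timeline entries can ride the same hypothetical ChatOps wrapper shown above; this is a sketch of the habit, not any particular bot’s API:
# Scribe appends a UTC-stamped entry; the format matches the postmortem timeline below
slack chat postMessage "inc-$INC_ID" "$(date -u +%H:%MZ) Rollback initiated via ArgoCD"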
These rituals sound basic. They’re oxygen at 3 a.m.
A postmortem template that causes change
Kill free-form docs. Use one template, tied to SLOs and system changes. Store it in Git, not scattered Confluence pages. Here’s a starter we’ve rolled out at multiple enterprises:
# file: postmortems/templates/sev1.yaml
incident:
  id: "inc-sev1-2025-10-16-checkout"
  severity: 1
  start: "2025-10-16T03:07:00Z"
  end: "2025-10-16T03:48:00Z"
  duration_minutes: 41
customer_impact:
  users_affected: 180000
  symptoms: ["checkout 5xx", "card auth timeout"]
business_impact:
  revenue_at_risk_usd: 420000
  sla_breach: true
slo:
  target: "99.9% monthly availability"
  burn_rate: 14.2
  error_budget_minutes_consumed: 41
detection:
  first_signal: "Prometheus alert: http_5xx_rate > 2%"
  detection_delta_minutes: 3
response:
  mttr_minutes: 41
  pages: 5
  escalation: "DBA + Payments"
contributing_factors:
  - "Checkout service scaled to 0 in one AZ due to HPA misconfig"
  - "Feature flag rollout lacked canary and circuit breaker"
  - "Liveness probe too aggressive on cold start"
safeguards_missing:
  - "No rate limiting between checkout and card gateway"
  - "No `maxUnavailable` on deployment; surge=0"
what_worked:
  - "Rollback via ArgoCD took 4 minutes"
  - "Feature flag kill switch documented and owned"
what_didnt:
  - "On-call runbook missing card gateway timeouts"
  - "Slack updates missed two cadences"
timeline:
  - "03:07Z Alert fired"
  - "03:10Z IC declared; roles assigned"
  - "03:13Z Rollback initiated"
  - "03:30Z First customer comms sent"
  - "03:44Z 5xx back to baseline"
corrective_actions:
  - id: CA-1234
    type: "guardrail"
    desc: "Add `maxUnavailable: 1` and `maxSurge: 1` to checkout deployment"
    owner: "team-checkout"
    due: "2025-11-15"
    tracking: "Jira SRE-8421"
  - id: CA-1235
    type: "observability"
    desc: "Create burn-rate alerts (2h, 6h windows) for checkout SLO"
    owner: "team-sre"
    due: "2025-10-30"
    tracking: "Jira SRE-8422"
  - id: CA-1236
    type: "resilience"
    desc: "Introduce circuit breaker and retries with jitter toward card gateway"
    owner: "team-payments"
    due: "2025-11-30"
    tracking: "Jira PAY-1123"
attachments:
  - "grafana://d/checkout-5xx"
  - "github://org/repo/commit/abc123"
  - "argocd://applications/checkout"
review:
  date: "2025-10-20"
  attendees: ["VP Eng", "IC", "SRE", "Payments Lead", "Support Lead"]
  notes: "No blame. Focus on system guardrails."
  signoff: "VP Eng"
Two things make this stick:
- Guardrail taxonomy: Every corrective action is one of a few types: `observability`, `guardrail`, `resilience`, `process`, `runbook`, `security`. This prevents “just docs” actions and biases toward system changes.
- Traceability: Each action must point to a Jira/GitHub ticket and evidence of completion (diff, dashboard, or runbook PR).
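That traceability rule is easy to enforce in CI. Here’s a minimal sketch, assuming `yq` v4 and the `sev1.yaml` layout above (the script name is hypothetical):
#!/usr/bin/env bash
# check_actions.sh (hypothetical) — fail CI if any corrective action lacks
# an owner, due date, or tracking link in the postmortem YAML
set -euo pipefail
file="${1:?usage: check_actions.sh postmortems/<incident>.yaml}"
missing=$(yq '.corrective_actions[] | select(.owner == null or .due == null or .tracking == null) | .id' "$file")
if [[ -n "$missing" ]]; then
  echo "Corrective actions missing owner/due/tracking: $missing"
  exit 1
fi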
From words to fixes: wire it into your delivery
If actions don’t hit code, configs, or tools, nothing changes. Wire postmortems into your pipelines.
- GitOps tie-in: Annotate deployments with incident IDs so you can query what changed during or before an outage (query sketch after this list).
# deployment snippet for ArgoCD
metadata:
  annotations:
    gitplumbers.io/last-incident: "inc-sev1-2025-10-16-checkout"
    gitplumbers.io/postmortem: "https://git.example.com/postmortems/inc-sev1-2025-10-16-checkout"
- Block risky deploys when high-sev actions are open: For Sev1/Sev2 corrective actions, require closure (or explicit waiver) before you promote to prod. This isn’t “no deploys”; it’s “no deploys without conscious risk.”
#!/usr/bin/env bash
# gate.sh — run in CI before prod promotion
set -euo pipefail
# Open Sev1/Sev2 actions block promotion unless they carry the waiver label
JIRA_JQL='project = SRE AND labels = postmortem AND labels != waiver AND severity <= 2 AND status != Done'
OPEN=$(curl -s -u "$JIRA_USER:$JIRA_TOKEN" \
  -G "https://jira.example.com/rest/api/2/search" \
  --data-urlencode "jql=$JIRA_JQL" | jq '.issues | length')
if [[ "$OPEN" -gt 0 ]]; then
  echo "Blocking deploy: $OPEN open Sev1/Sev2 postmortem actions. Use the waiver label to override."
  exit 1
fi
- Make it visible: A weekly roll-up to leadership showing action closure rate and exceptions. If teams are always asking for waivers, that’s a signal to adjust timelines or add engineering capacity.
- Runbooks are code: Store runbooks next to the service in Git. PRs for runbook updates are first-class.
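With those annotations in place, “what changed around this incident?” becomes a one-liner. A sketch assuming `kubectl` access and `jq`, using the annotation key from the snippet above:
# list deployments annotated with a given incident ID
kubectl get deployments --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["gitplumbers.io/last-incident"] == "inc-sev1-2025-10-16-checkout")
      | "\(.metadata.namespace)/\(.metadata.name)"'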
This is where most orgs fail: they treat the postmortem as content instead of changes. Changes win outages.
Measure what matters (and show it on one dashboard)
Leaders don’t need 17 KPIs; they need four:
- Recurrence rate: Percent of incidents in a quarter that match a known class from the last four quarters.
- MTTR trend: Median minutes from alert to restore, rolling 90 days.
- Action closure rate: Percent of postmortem actions closed within 30 days, by team.
- Error-budget burn: SLO burn rate windows and cumulative burn.
You can implement these with Prometheus and Grafana, plus a small incidents table (even a BigQuery or Snowflake view).
- Burn-rate alerts (4- and 14-hour windows):
# 2-window SLO burn rate alerts for HTTP 5xx
sum(rate(http_requests_total{status=~"5..", job="checkout"}[5m]))
/
sum(rate(http_requests_total{job="checkout"}[5m]))
> (1 - 0.999) * 14 # 99.9% SLO, 14x burn threshold
- MTTR from incident logs (simple SQL over your incidents table):
SELECT
DATE_TRUNC('week', start_time) AS week,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (end_time - start_time))/60) AS mttr_minutes
FROM incidents
WHERE severity <= 2
GROUP BY 1
ORDER BY 1 DESC;
- Action closure rate by team (Jira export → SQL):
SELECT team,
100.0 * SUM(CASE WHEN closed_at <= due_date THEN 1 ELSE 0 END) / COUNT(*) AS on_time_pct
FROM postmortem_actions
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY team
ORDER BY on_time_pct ASC;
Hold a monthly 30-minute review of this dashboard. If recurrence rate isn’t trending down, you don’t have a postmortem problem—you have a resourcing problem.
Leadership behaviors that make it stick
I’ve watched this succeed at a payments company and stall at an ad-tech scale-up. The difference was leadership.
- Say the words: “No performance consequences from candid participation.” It changes the room.
- Fund the fixes: Earmark 5–15% of each team’s capacity as a “prevention budget.” If you don’t protect it, product work will eat it.
- Reward prevention: Promote engineers for building guardrails, not just features. Tie objectives to SLOs and recurrence reduction.
- Keep review small and senior: IC, service owners, SRE, and a VP/Director. Exec presence signals priority; too many attendees turns it into theater.
- Timebox and enforce: Missed timelines are escalated like missed SLAs. No scolding; just re-prioritization and help.
- Normalize waivers—sparingly: Sometimes 30 days is unrealistic. Waivers require a VP signoff and a new due date. Track waivers like error budget spend.
Compliance without the theater (ITIL, CAPA, SOX/HIPAA)
You can be blameless and still pass audits.
- ITIL mapping: Treat the postmortem as the “Problem Record.” The corrective actions become Known Error DB entries and Changes with proper CAB notes. Keep the language system-first and you’ll satisfy auditors without naming a scapegoat.
- CAPA: When regulated (SOX/HIPAA/PCI), mark certain actions as CAPAs. Same template, but include verification evidence: link to merged PR, updated `Terraform`, new alert in `Prometheus`, screenshot of the `Grafana` panel. Auditors love determinism.
- Evidence automation: Add a checklist item in the action ticket that requires a link to code or config diff. No diff, no done.
- Change control that isn’t theater: Use `ArgoCD` or `Spinnaker` with approvals gated by labels (`postmortem=true`). Approvals are recorded automatically; your CAB will thank you.
Example ArgoCD policy snippet:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-deploy-approver
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["applications"]
    verbs: ["sync"]
    resourceNames: ["checkout"]
    # Requires label postmortem-actions-closed=true set by CI gate
The trick is to make the compliant path the paved path. If it’s harder than cowboy deploys, you’ll get cowboy deploys.
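The label in that comment can be applied by the same CI gate once `gate.sh` passes. A sketch, assuming ArgoCD runs in the `argocd` namespace and the Application is named `checkout`:
# CI marks the ArgoCD Application as eligible for approval after the Jira gate passes
kubectl -n argocd label applications.argoproj.io checkout \
  postmortem-actions-closed=true --overwrite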
A short, real example: the HPA that ate an AZ
At a retail client running EKS and HPA on v2beta2, a misconfigured target CPU of 15% caused aggressive scale-to-zero in one AZ during a traffic dip. When a flash sale hit, pods cold-started, liveness probes killed them, and checkout started throwing 5xx. We rolled back, toggled a LaunchDarkly flag, and stabilized.
Postmortem led to:
- `maxUnavailable: 1` and `maxSurge: 1` on the `Deployment` (patch sketch below)
- Added 2-window burn-rate alerts and per-AZ SLO panels
- Increased liveness `initialDelaySeconds` and added readiness gates
- Circuit breaker on calls to the payment gateway with exponential backoff and jitter
- Runbook update and an `HPA` policy review step in code reviews
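The first item is a two-field change. A sketch assuming a `checkout` Deployment in a hypothetical `shop` namespace; in a GitOps setup this lands as a PR to the manifest, and the patch just shows the exact fields:
# rollout guardrails so a bad revision can never drain the whole service
kubectl -n shop patch deployment checkout --type merge \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1,"maxSurge":1}}}}'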
Results in 60 days:
- Repeat class incidents: down from 3 to 0
- MTTR median: 52 → 24 minutes
- Action closure: 92% on-time
- Error budget burn: back under 25% monthly
No one got thrown under the bus. The system got guardrails. That’s the point.
Key takeaways
- Blameless means system accountability with leadership air cover, not avoiding responsibility.
- Ritualize roles, channels, and timelines so the postmortem runs itself under stress.
- Tie every finding to measurable change: code, config, automation, or guardrails.
- Measure outcomes that matter to the business: recurrence rate, MTTR, SLO burn, action closure rate.
- Integrate with enterprise realities (SOX/HIPAA, ITIL, CAPA) without turning it into theater.
Implementation checklist
- Name a clear Incident Commander, Scribe, and Comms Lead for every Major Incident.
- Create a standard Slack channel pattern and timeline capture bot.
- Use a single postmortem template that ties to SLOs, costs, and specific system changes.
- Timebox: draft in 2 business days, review in 5, actions closed in 30 (with exceptions listed).
- Dashboard the four outcomes: recurrence rate, MTTR, action closure rate, SLO error-budget burn.
- Add a “prevention budget” to each team’s sprint capacity (5–15%).
- Adopt a no-human root cause rule in reviews; system causes only.
- Map corrective actions to auditable CAPAs when compliance requires it.
Questions we hear from teams
- How fast should we complete a postmortem?
- Draft within 2 business days, review by day 5, and close Sev1/Sev2 corrective actions within 30 days. Use waivers for exceptions with VP signoff.
- How do we stay blameless but still hold people accountable?
- Hold the system accountable for failure and leaders accountable for resourcing fixes. If an individual repeatedly bypasses guardrails, that’s a performance issue handled outside the review—not in it.
- What tools do we need to start?
- PagerDuty (or Opsgenie), Slack/Teams, a timeline bot (`incident.io` is great), Jira for tracking actions, Git for templates and runbooks, and Grafana/Prometheus for SLOs. Everything else is optional.
- We’re ITIL-heavy. Will this fly with CAB and auditors?
- Yes. Treat the postmortem as the Problem Record, map actions to Changes with approvals, and attach evidence (diffs, dashboards). Auditors prefer deterministic processes with artifacts.
- Remote-first teams: any special advice?
- Ritualize channels and roles, record the Zoom, and harden timelines. Make the paved path easy with ChatOps: one command creates channel, assigns roles, pins Zoom, and starts the timeline.
