The Postmortem Ritual That Quieted Our 3 A.M. PagerDuty
Blameless doesn’t mean consequence-free. It means system-first accountability, leadership air cover, and a repeatable ritual that prevents the next page.
“Blameless isn’t about letting people off the hook; it’s about putting the system on the hook.”
The outage you’ve lived through
It’s 3:07 a.m. PagerDuty is screaming. Slack is a stove. Someone blurts “who pushed?” and 45 minutes later you’ve rolled back, patched a feature flag, and promised to “write the postmortem tomorrow.” Tomorrow turns into a week; the doc becomes a tombstone. Three months later, the same class of incident bites you again—new on-call, new names, same root conditions.
I’ve seen this loop at a fintech on Kubernetes 1.23, an ecommerce unicorn running Istio 1.17, and a Fortune 100 stuck in ITIL purgatory. The pattern is universal: the team “does a postmortem,” but nothing in the system actually changes. Or worse, the doc devolves into a search for a human to blame.
Here’s the playbook we use at GitPlumbers that consistently reduces repeat incidents by 30–60% in the first two quarters, without turning engineers into compliance clerks.
What blameless looks like (and what it’s not)
Blameless is not “no accountability.” It’s moving accountability from the individual to the system—and giving leaders a specific job: remove fear so truth can surface, then resource the fixes.
- No-human root cause: Ban phrases like “engineer forgot” or “fat-fingered.” If a single person’s mistake can take down prod, that’s a system design issue (missing guardrails, no canary, weak review, poor tooling).
- Leadership air cover: VP/Director opens the review with: “We are here to understand system failures and invest in prevention. There will be no performance consequences from candid participation.” Say it out loud. Every time.
- System-first framing: Findings must map to concrete system changes: tests, automation, limits, rollbacks, retries, quotas, circuit breakers, feature flags, or process guardrails.
- Timeboxed learning: If it takes 3+ weeks to get to review, the context is gone. Draft in 2 business days, review by 5, actions closed within 30 unless there’s a documented exception.
Blameless isn’t letting people off the hook; it’s putting the system on the hook.
Rituals that hold under pressure
You can’t invent process during an incident. Ritualize it.
Roles
- Incident Commander (IC): single decision-maker; keeps scope tight, uses authority to pause risky fixes.
- Scribe: timestamps decisions, captures evidence and timeline.
- Comms Lead: coordinates stakeholder updates (status page, exec Slack, customer success).
Channels and artifacts
- Slack channel per incident: `#inc-sev1-YYYYMMDD-<short>`; only one, named fast.
- Threading for hypotheses; `@here` only from the IC.
- Zoom or Meet link pinned.
- Timeline bot (we’ve used `incident.io`, `PagerDuty`, or a simple homegrown slash command) to capture `PD`, `Grafana`, and `GitHub` links.
Example channel setup (runbook excerpt):
# invoked by the on-call via a ChatOps bot
export INC_ID=sev1-20251016-checkout
slack channels create "inc-$INC_ID"
slack channels invite "inc-$INC_ID" @oncall-web @oncall-infra @oncall-db
slack chat postMessage "inc-$INC_ID" "IC: @alice | Scribe: @bob | Comms: @carol\nZoom: https://zoom.us/j/12345\nStatus page: https://status.example.com"
slack channels setTopic "inc-$INC_ID" "Customer impact: elevated 5xx on checkout; Start: 03:07 UTC; Next update: 03:30 UTC"
- Designated timeline: The Scribe pins a single message and appends timestamps. Use UTC (see the sketch after this list). Pull from `Prometheus`, `kubectl`, `feature flag` changes, deploy logs, and customer reports.
- Status cadence: Comms Lead posts public/customer updates every 30 minutes (or per contract). Missed updates are a finding.
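Timeline entries can ride the same hypothetical ChatOps wrapper shown above; this is a sketch of the habit, not any particular bot’s API:
# Scribe appends a UTC-stamped entry; the format matches the postmortem timeline below
slack chat postMessage "inc-$INC_ID" "$(date -u +%H:%MZ) Rollback initiated via ArgoCD"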
These rituals sound basic. They’re oxygen at 3 a.m.
A postmortem template that causes change
Kill free-form docs. Use one template, tied to SLOs and system changes. Store it in Git, not scattered Confluence pages. Here’s a starter we’ve rolled out at multiple enterprises:
# file: postmortems/templates/sev1.yaml
incident:
  id: "inc-sev1-2025-10-16-checkout"
  severity: 1
  start: "2025-10-16T03:07:00Z"
  end: "2025-10-16T03:48:00Z"
  duration_minutes: 41
customer_impact:
  users_affected: 180000
  symptoms: ["checkout 5xx", "card auth timeout"]
business_impact:
  revenue_at_risk_usd: 420000
  sla_breach: true
slo:
  target: "99.9% monthly availability"
  burn_rate: 14.2
  error_budget_minutes_consumed: 41
detection:
  first_signal: "Prometheus alert: http_5xx_rate > 2%"
  detection_delta_minutes: 3
response:
  mttr_minutes: 41
  pages: 5
  escalation: "DBA + Payments"
contributing_factors:
  - "Checkout service scaled to 0 in one AZ due to HPA misconfig"
  - "Feature flag rollout lacked canary and circuit breaker"
  - "Liveness probe too aggressive on cold start"
safeguards_missing:
  - "No rate limiting between checkout and card gateway"
  - "No `maxUnavailable` on deployment; surge=0"
what_worked:
  - "Rollback via ArgoCD took 4 minutes"
  - "Feature flag kill switch documented and owned"
what_didnt:
  - "On-call runbook missing card gateway timeouts"
  - "Slack updates missed two cadences"
timeline:
  - "03:07Z Alert fired"
  - "03:10Z IC declared; roles assigned"
  - "03:13Z Rollback initiated"
  - "03:30Z First customer comms sent"
  - "03:44Z 5xx back to baseline"
corrective_actions:
  - id: CA-1234
    type: "guardrail"
    desc: "Add `maxUnavailable: 1` and `maxSurge: 1` to checkout deployment"
    owner: "team-checkout"
    due: "2025-11-15"
    tracking: "Jira SRE-8421"
  - id: CA-1235
    type: "observability"
    desc: "Create burn-rate alerts (2h, 6h windows) for checkout SLO"
    owner: "team-sre"
    due: "2025-10-30"
    tracking: "Jira SRE-8422"
  - id: CA-1236
    type: "resilience"
    desc: "Introduce circuit breaker and retries with jitter toward card gateway"
    owner: "team-payments"
    due: "2025-11-30"
    tracking: "Jira PAY-1123"
attachments:
  - "grafana://d/checkout-5xx"
  - "github://org/repo/commit/abc123"
  - "argocd://applications/checkout"
review:
  date: "2025-10-20"
  attendees: ["VP Eng", "IC", "SRE", "Payments Lead", "Support Lead"]
  notes: "No blame. Focus on system guardrails."
  signoff: "VP Eng"
Two things make this stick:
- Guardrail taxonomy: Every corrective action is one of a few types: `observability`, `guardrail`, `resilience`, `process`, `runbook`, `security`. This prevents “just docs” actions and biases toward system changes.
- Traceability: Each action must point to a Jira/GitHub ticket and evidence of completion (diff, dashboard, or runbook PR).
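That traceability rule is easy to enforce in CI. Here’s a minimal sketch, assuming `yq` v4 and the `sev1.yaml` layout above (the script name is hypothetical):
#!/usr/bin/env bash
# check_actions.sh (hypothetical) — fail CI if any corrective action lacks
# an owner, due date, or tracking link in the postmortem YAML
set -euo pipefail
file="${1:?usage: check_actions.sh postmortems/<incident>.yaml}"
missing=$(yq '.corrective_actions[] | select(.owner == null or .due == null or .tracking == null) | .id' "$file")
if [[ -n "$missing" ]]; then
  echo "Corrective actions missing owner/due/tracking: $missing"
  exit 1
fi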
From words to fixes: wire it into your delivery
If actions don’t hit code, configs, or tools, nothing changes. Wire postmortems into your pipelines.
- GitOps tie-in: Annotate deployments with incident IDs so you can query what changed during or before an outage (query sketch after this list).
# deployment snippet for ArgoCD
metadata:
  annotations:
    gitplumbers.io/last-incident: "inc-sev1-2025-10-16-checkout"
    gitplumbers.io/postmortem: "https://git.example.com/postmortems/inc-sev1-2025-10-16-checkout"
- Block risky deploys when high-sev actions are open: For Sev1/Sev2 corrective actions, require closure (or explicit waiver) before you promote to prod. This isn’t “no deploys”; it’s “no deploys without conscious risk.”
#!/usr/bin/env bash
# gate.sh — run in CI before prod promotion
set -euo pipefail
# Open Sev1/Sev2 actions block promotion unless they carry the waiver label
JIRA_JQL='project = SRE AND labels = postmortem AND labels != waiver AND severity <= 2 AND status != Done'
OPEN=$(curl -s -u "$JIRA_USER:$JIRA_TOKEN" \
  -G "https://jira.example.com/rest/api/2/search" \
  --data-urlencode "jql=$JIRA_JQL" | jq '.issues | length')
if [[ "$OPEN" -gt 0 ]]; then
  echo "Blocking deploy: $OPEN open Sev1/Sev2 postmortem actions. Use the waiver label to override."
  exit 1
fi
- Make it visible: A weekly roll-up to leadership showing action closure rate and exceptions. If teams are always asking for waivers, that’s a signal to adjust timelines or add engineering capacity.
- Runbooks are code: Store runbooks next to the service in Git. PRs for runbook updates are first-class.
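With those annotations in place, “what changed around this incident?” becomes a one-liner. A sketch assuming `kubectl` access and `jq`, using the annotation key from the snippet above:
# list deployments annotated with a given incident ID
kubectl get deployments --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["gitplumbers.io/last-incident"] == "inc-sev1-2025-10-16-checkout")
      | "\(.metadata.namespace)/\(.metadata.name)"'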
This is where most orgs fail: they treat the postmortem as content instead of changes. Changes win outages.
Measure what matters (and show it on one dashboard)
Leaders don’t need 17 KPIs; they need four:
- Recurrence rate: Percent of incidents in a quarter that match a known class from the last four quarters.
- MTTR trend: Median minutes from alert to restore, rolling 90 days.
- Action closure rate: Percent of postmortem actions closed within 30 days, by team.
- Error-budget burn: SLO burn rate windows and cumulative burn.
You can implement these with Prometheus and Grafana, plus a small incidents table (even a BigQuery or Snowflake view).
- Burn-rate alerts (4- and 14-hour windows):
# 2-window SLO burn rate alerts for HTTP 5xx
sum(rate(http_requests_total{status=~"5..", job="checkout"}[5m]))
/
sum(rate(http_requests_total{job="checkout"}[5m]))
> (1 - 0.999) * 14 # 99.9% SLO, 14x burn threshold
- MTTR from incident logs (simple SQL over your incidents table):
SELECT
DATE_TRUNC('week', start_time) AS week,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM (end_time - start_time))/60) AS mttr_minutes
FROM incidents
WHERE severity <= 2
GROUP BY 1
ORDER BY 1 DESC;
- Action closure rate by team (Jira export → SQL):
SELECT team,
100.0 * SUM(CASE WHEN closed_at <= due_date THEN 1 ELSE 0 END) / COUNT(*) AS on_time_pct
FROM postmortem_actions
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY team
ORDER BY on_time_pct ASC;
Hold a monthly 30-minute review of this dashboard. If recurrence rate isn’t trending down, you don’t have a postmortem problem—you have a resourcing problem.
Leadership behaviors that make it stick
I’ve watched this succeed at a payments company and stall at an ad-tech scale-up. The difference was leadership.
- Say the words: “No performance consequences from candid participation.” It changes the room.
- Fund the fixes: Earmark 5–15% of each team’s capacity as a “prevention budget.” If you don’t protect it, product work will eat it.
- Reward prevention: Promote engineers for building guardrails, not just features. Tie objectives to SLOs and recurrence reduction.
- Keep review small and senior: IC, service owners, SRE, and a VP/Director. Exec presence signals priority; too many attendees turns it into theater.
- Timebox and enforce: Missed timelines are escalated like missed SLAs. No scolding; just re-prioritization and help.
- Normalize waivers—sparingly: Sometimes 30 days is unrealistic. Waivers require a VP signoff and a new due date. Track waivers like error budget spend.
Compliance without the theater (ITIL, CAPA, SOX/HIPAA)
You can be blameless and still pass audits.
- ITIL mapping: Treat the postmortem as the “Problem Record.” The corrective actions become Known Error DB entries and Changes with proper CAB notes. Keep the language system-first and you’ll satisfy auditors without naming a scapegoat.
- CAPA: When regulated (SOX/HIPAA/PCI), mark certain actions as CAPAs. Same template, but include verification evidence: link to merged PR, updated `Terraform`, new alert in `Prometheus`, screenshot of the `Grafana` panel. Auditors love determinism.
- Evidence automation: Add a checklist item in the action ticket that requires a link to code or config diff. No diff, no done.
- Change control that isn’t theater: Use `ArgoCD` or `Spinnaker` with approvals gated by labels (`postmortem=true`). Approvals are recorded automatically; your CAB will thank you.
Example ArgoCD policy snippet:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prod-deploy-approver
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["applications"]
    verbs: ["sync"]
    resourceNames: ["checkout"]
    # Requires label postmortem-actions-closed=true set by CI gate
The trick is to make the compliant path the paved path. If it’s harder than cowboy deploys, you’ll get cowboy deploys.
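The label in that comment can be applied by the same CI gate once `gate.sh` passes. A sketch, assuming ArgoCD runs in the `argocd` namespace and the Application is named `checkout`:
# CI marks the ArgoCD Application as eligible for approval after the Jira gate passes
kubectl -n argocd label applications.argoproj.io checkout \
  postmortem-actions-closed=true --overwrite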
A short, real example: the HPA that ate an AZ
At a retail client running EKS and HPA on v2beta2, a misconfigured target CPU of 15% caused aggressive scale-to-zero in one AZ during a traffic dip. When a flash sale hit, pods cold-started, liveness probes killed them, and checkout started throwing 5xx. We rolled back, toggled a LaunchDarkly flag, and stabilized.
Postmortem led to:
- `maxUnavailable: 1` and `maxSurge: 1` on the `Deployment` (patch sketch below)
- Added 2-window burn-rate alerts and per-AZ SLO panels
- Increased liveness `initialDelaySeconds` and added readiness gates
- Circuit breaker on calls to the payment gateway with exponential backoff and jitter
- Runbook update and an `HPA` policy review step in code reviews
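The first item is a two-field change. A sketch assuming a `checkout` Deployment in a hypothetical `shop` namespace; in a GitOps setup this lands as a PR to the manifest, and the patch just shows the exact fields:
# rollout guardrails so a bad revision can never drain the whole service
kubectl -n shop patch deployment checkout --type merge \
  -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":1,"maxSurge":1}}}}'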
Results in 60 days:
- Repeat class incidents: down from 3 to 0
- MTTR median: 52 → 24 minutes
- Action closure: 92% on-time
- Error budget burn: back under 25% monthly
No one got thrown under the bus. The system got guardrails. That’s the point.
Key takeaways
- Blameless means system accountability with leadership air cover, not avoiding responsibility.
- Ritualize roles, channels, and timelines so the postmortem runs itself under stress.
- Tie every finding to measurable change: code, config, automation, or guardrails.
- Measure outcomes that matter to the business: recurrence rate, MTTR, SLO burn, action closure rate.
- Integrate with enterprise realities (SOX/HIPAA, ITIL, CAPA) without turning it into theater.
Implementation checklist
- Name a clear Incident Commander, Scribe, and Comms Lead for every Major Incident.
- Create a standard Slack channel pattern and timeline capture bot.
- Use a single postmortem template that ties to SLOs, costs, and specific system changes.
- Timebox: draft in 2 business days, review in 5, actions closed in 30 (with exceptions listed).
- Dashboard the four outcomes: recurrence rate, MTTR, action closure rate, SLO error-budget burn.
- Add a “prevention budget” to each team’s sprint capacity (5–15%).
- Adopt a no-human root cause rule in reviews; system causes only.
- Map corrective actions to auditable CAPAs when compliance requires it.
Questions we hear from teams
- How fast should we complete a postmortem?
- Draft within 2 business days, review by day 5, and close Sev1/Sev2 corrective actions within 30 days. Use waivers for exceptions with VP signoff.
- How do we stay blameless but still hold people accountable?
- Hold the system accountable for failure and leaders accountable for resourcing fixes. If an individual repeatedly bypasses guardrails, that’s a performance issue handled outside the review—not in it.
- What tools do we need to start?
- PagerDuty (or Opsgenie), Slack/Teams, a timeline bot (`incident.io` is great), Jira for tracking actions, Git for templates and runbooks, and Grafana/Prometheus for SLOs. Everything else is optional.
- We’re ITIL-heavy. Will this fly with CAB and auditors?
- Yes. Treat the postmortem as the Problem Record, map actions to Changes with approvals, and attach evidence (diffs, dashboards). Auditors prefer deterministic processes with artifacts.
- Remote-first teams: any special advice?
- Ritualize channels and roles, record the Zoom, and harden timelines. Make the paved path easy with ChatOps: one command creates channel, assigns roles, pins Zoom, and starts the timeline.
