The Career Ladder That Cut MTTR in Half: Promotions That Reward Reliability Work

If your promo packets only celebrate features shipped, you’re paying engineers to create future outages. Here’s the ladder, rituals, and metrics that make reliability count.

Promote the people who make the pager boring.

The silent killer: ladders that tell people to ignore the pager

I’ve watched this movie at three Fortune 500s and a unicorn that should’ve known better. We had 50+ microservices, a shiny ArgoCD GitOps flow, and the promotion narrative was basically: “Delivered migrations X, Y, Z ahead of schedule.” Meanwhile, the on-call rotation was a meat grinder. MTTR hovered at 2h+, and the same engineers who stabilized Kafka consumers or wrote real runbooks got “thanks,” not promotions.

If your ladder celebrates feature velocity and treats reliability as volunteer work, you will keep buying the same outage twice. Reliability work needs to be first-class in your career framework—explicit behaviors, objective signals, and rituals that surface the outcomes.

If people are promoted for shipping, they will ship—even if they ship pain into the pager queue.

Design the ladder: reliability behaviors by level

Stop hand-waving. Write the reliability expectations into the ladder the way you write “owns designs” or “mentors juniors.” Make it scannable, testable, and tied to artifacts.

  • L3 (Engineer)
    • Contributes to runbooks and dashboard panels (Prometheus/Grafana)
    • Fixes flaky alerts; writes at least one SLO proposal with guidance
    • Participates in on-call with supervision; closes postmortem action items on time
  • L4 (Sr. Engineer)
    • Owns SLOs for a service; keeps error budget policy current
    • Leads at least one P2 postmortem per half; improves MTTR via better diagnostics
    • Reduces toil by automating a recurring task (e.g., Terraform module for RDS failover)
  • L5 (Staff)
    • Designs reliability controls across services: circuit breakers (Istio), retries, idempotency
    • Establishes runbook quality bar and drills; cuts pager alert volume by X%
    • Co-owns change failure rate with EM; introduces canary/feature-flag strategy (LaunchDarkly)
  • L6+ (Principal/Architect)
    • Sets reliability strategy and budgets (error budget, blast radius) across domains
    • Champions org rituals (weekly reliability review, QBR error-budget check)
    • Moves org-wide metrics (e.g., MTTR from 90m to <45m in 2 quarters)

Here’s a lightweight representation we’ve used to keep expectations versioned alongside the handbook:

# ladder.reliability.yml
levels:
  L3:
    behaviors:
      - add_runbooks: "Adds or updates runbooks with tested steps and owners"
      - dashboards: "Contributes panels with clear SLO/SLA context"
      - oncall: "Participates in on-call with peer shadowing; closes action items"
  L4:
    behaviors:
      - own_slos: "Defines/maintains 2-3 SLOs; tracks error budget in QBR"
      - lead_postmortem: "Leads P2 postmortems; drives MTTR improvements"
      - reduce_toil: "Automates recurring ops tasks; documents golden path"
  L5:
    behaviors:
      - resilience_arch: "Implements circuit breakers, backpressure, idempotency"
      - alert_quality: "Cuts noisy alerts >30%; enforces alert → runbook link"
      - safe_deploys: "Rolls out canary + feature flags; change fail rate ↓"
  L6:
    behaviors:
      - strategy: "Org-wide reliability strategy and error-budget policy"
      - rituals: "Institutionalizes reliability reviews and drills"
      - kpi_movement: "Improves org MTTR/availability targets across teams"

This file lives next to the broader ladder, gets PRs, and is referenced directly in promotion packets.

Make reliability measurable (and queryable)

If it’s not queryable, it won’t survive promotion committees. Pick metrics that connect to business pain and are feasible with your stack.

  • SLOs & error budgets: 2–3 user-centric SLOs per critical service. Track burn rate.
  • DORA metrics: change failure rate, deployment frequency, lead time, MTTR.
  • On-call signals: pages per week per service, percent noisy alerts, after-hours load.
  • Operational hygiene: runbook coverage, postmortem SLA adherence, chaos drill pass rate.

Concrete snippets you can drop in today:

# prometheus-rules.yml: 99.5% availability SLO with multi-window,
# multi-burn-rate alerting (Google SRE Workbook pattern)
groups:
  - name: checkout-slo
    rules:
      # Recording rules: success ratio over a fast (5m) and a slow (1h) window
      - record: slo:availability:ratio_rate5m
        labels:
          service: checkout
        expr: |
          sum(rate(http_requests_total{job="checkout",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
      - record: slo:availability:ratio_rate1h
        labels:
          service: checkout
        expr: |
          sum(rate(http_requests_total{job="checkout",status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      # Fast burn pages a human; slow burn files a ticket
      - alert: SLOFastBurn
        expr: (1 - slo:availability:ratio_rate5m{service="checkout"}) > 5 * (1 - 0.995)
        for: 5m
        labels:
          severity: page
      - alert: SLOSlowBurn
        expr: (1 - slo:availability:ratio_rate1h{service="checkout"}) > 2 * (1 - 0.995)
        for: 2h
        labels:
          severity: ticket
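Where do multipliers like 5x come from? A burn rate of N means the service is consuming error budget N times faster than the SLO allows, so an N-times burn exhausts a 28-day budget in 28/N days. A quick sanity check you can run anywhere:

```shell
# Days until a 28-day error budget is exhausted at a given burn-rate multiplier
budget_days=28
for burn in 1 2 5 14.4; do
  awk -v b="$burn" -v d="$budget_days" \
    'BEGIN { printf "burn %.1fx -> budget exhausted in %.1f days\n", b, d / b }'
done
# burn 5.0x -> budget exhausted in 5.6 days
```

This is why the workbook pairs a high multiplier on a short window (page someone) with a low multiplier on a long window (file a ticket): you catch both sudden outages and slow leaks.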

Gate risky deploys when you’re burning budget too fast:

# .github/workflows/deploy.yaml
name: deploy
on: [workflow_dispatch]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - name: Check SLO burn rate
        run: |
          # -G + --data-urlencode handles the braces/quotes in the PromQL safely
          BURN=$(curl -sG "$PROM_URL/api/v1/query" \
            --data-urlencode 'query=1 - slo:availability:ratio_rate1h{service="checkout"}')
          VAL=$(echo "$BURN" | jq -r '.data.result[0].value[1]')
          echo "Burn rate: $VAL"
          awk -v v="$VAL" 'BEGIN{ exit (v > 0.01) }'  # >1% error rate -> block deploy
  deploy:
    needs: gate
    uses: org/workflows/.github/workflows/argocd-deploy.yml@v3

Generate weekly reports without a new data warehouse. If you’re on PagerDuty:

# Pages per service last 7 days; useful for on-call debriefs
# Pages per service last 7 days; useful for on-call debriefs
# (no backslashes needed inside the single-quoted jq program)
pd incidents list --since "7d" --json \
 | jq -r '[.[] | {service: .service.summary, urgency: .urgency}]
    | group_by(.service)
    | map({service: .[0].service, pages: length, high: map(select(.urgency == "high")) | length})
    | ("service,pages,high"), (.[] | [.service, (.pages | tostring), (.high | tostring)] | @csv)'

Rituals that keep reliability visible

Changing the ladder without changing the calendar doesn’t work. You need recurring forums where reliability shows up next to roadmap and revenue.

  • Weekly Reliability Review (30m)
    • Attendees: EMs, TLs, SREs, PMs from critical services
    • Agenda: error-budget status, top incidents, action item SLAs, deploy gates
    • Output: 3 prioritized fixes with owners/timeboxes
  • On-call Debrief at handoff (15m)
    • Each rotation ends with a quick retro: What paged? What was noise? What runbook lied?
    • Create tickets then and there; tag reliability and assign.
  • Postmortem Office Hours (1h/week)
    • A Staff+ facilitator reviews P1–P2 postmortems for quality, not blame
    • Enforce “alert → runbook → dashboard” linkage
  • QBR Error-Budget Check
    • Every business unit sees burn-down; if a service burned >X% budget, roadmaps show capacity reserved for fixes.
  • Drills
    • Chaos tests on non-prod using Litmus or Gremlin. Track pass rate as a KPI.
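A drill does not need a platform team to start; one manifest will do. Here is a minimal LitmusChaos pod-delete sketch; the namespace, app label, and service account are assumptions for your cluster:

```yaml
# chaos/checkout-pod-delete.yaml (illustrative names)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: checkout-pod-delete
  namespace: staging
spec:
  appinfo:
    appns: staging
    applabel: app=checkout        # assumed label on the target deployment
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "10"
```

Run it during business hours, watch the SLO dashboards, and record pass/fail in the same tracker as postmortem actions.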

Codify the ticket taxonomy so it’s reportable:

# jira-fields.yaml
customFields:
  reliability: "customfield_12345" # bool
  incidentAction: "customfield_67890" # enum: postmortem|hardening|toil
workflows:
  requireOwnerForIncidentAction: true
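With those fields mapped, the weekly review can pull its queue from one saved filter. A JQL sketch, assuming the field names above (exact syntax varies by field type and Jira setup):

```text
"Reliability" = true AND "Incident Action" in (postmortem, hardening, toil)
AND statusCategory != Done
ORDER BY priority DESC, created ASC
```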

Promotion packets that surface the right work

If the promo template doesn’t ask for reliability outcomes, they won’t show up. Add a section that’s impossible to hand-wave.

# Promotion Packet: Reliability Evidence

1. SLO Ownership
   - Services: checkout, pricing
   - SLOs: 99.5% availability, 300ms p95 latency (error budget tracked in QBR)
   - Outcomes: Reduced burn by 60% after retry + idempotency redesign

2. Incident Leadership
   - Led P2-2025-03-14; MTTR 42m → 18m by adding synthetic checks and better logs
   - Drove postmortem action items to closure in 14 days; deleted 2 flaky alerts

3. Toil Reduction
   - Wrote a Terraform module for RDS failover; cut runbook steps from 12 → 3
   - Migrated Jenkins → ArgoCD with canary deploys; change failure rate 18% → 6%

4. Org Impact (for Staff+)
   - Instituted weekly Reliability Review; pager volume down 35% QoQ

Ask reviewers to confirm with links, not adjectives: Grafana dashboards, Prometheus rules, PRs, Jira tickets, ArgoCD app history.

Leadership behaviors that actually move the needle

I’ve seen this fail when leaders outsource reliability to a “platform team” and keep grading ICs on features. What works:

  • Tie capacity to error budget: If a service burns >25% budget in a month, allocate N sprints to fixes. Publish the policy.
  • Say “no” with receipts: EMs and PMs jointly present the burn data and DORA metrics before deferring reliability work.
  • Protect focus time: 10–20% of engineering time for toil reduction; track it with a Jira label and show the impact in QBRs.
  • Comp committee alignment: Calibrate with reliability exemplars; don’t penalize engineers who slowed feature output to stabilize the platform.
  • Make debt visible: Maintain a reliability.md in each repo with SLOs, alert links, runbooks, and known risks.
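A sketch of that file's skeleton (section names and links are suggestions, not a standard):

```markdown
# reliability.md

## SLOs
- Availability: 99.5% over 28 days (link to Grafana dashboard)
- Latency: p95 under 300ms at the load balancer

## Alerts
- SLOFastBurn: runbooks/burn-rate.md
- KafkaConsumerLag: runbooks/consumer-lag.md

## Runbooks
- runbooks/ (each step verified in the last quarterly drill)

## Known risks
- Single-AZ Redis; failover untested under load
```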

And yes, sometimes you have to stop the world. We froze deploys at a fintech client for 48 hours to fix database connection storms. That “lost” time paid back in three weeks of zero after-hours pages.

Enterprise reality: pilot, measure, then scale

You’ve got CABs, SOX, ITIL, and three monitoring tools because 15-year accretion is a thing. Don’t boil the ocean—pilot.

  1. Pick 2–3 services with real pain (high pager volume, revenue adjacency). Assign a Staff+ sponsor.
  2. Define SLOs and wire the burn-rate gate in CI/CD for those services only.
  3. Run the rituals for a quarter: weekly review, on-call debrief, postmortem office hours.
  4. Update promotion packets for candidates in those teams; calibrate comp committee with the new rubric.
  5. Publish results in the QBR.

What “good” looked like at a healthcare client in 90 days:

  • MTTR: 76m → 34m
  • Change failure rate: 21% → 9%
  • Pages/week for claims service: 22 → 8
  • 80% of P1 postmortem actions closed in 14 days (was 35%)
  • Two Sr. Engineer promotions with strong reliability evidence; both became de facto incident commanders

Once you have numbers, your CFO becomes your advocate, not your blocker.

Tools, snippets, and guardrails you can actually use

  • Feature flags: Gate risky codepaths (LaunchDarkly or Unleash), tie rollout to SLO health.
  • Service mesh: Use Istio for outlier detection and circuit breaking. Example destination rule:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-dr
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 3m
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
  • Runbook checks: Add a CI job that fails if an alert lacks a linked runbook.
#!/usr/bin/env bash
# verify-alert-runbooks.sh
# Assumes standard Prometheus rule files (groups[].rules[]) and mikefarah yq v4
set -euo pipefail
for f in alerts/*.yaml; do
  missing=$(yq '[.groups[].rules[] | select(has("alert"))
                 | select((.annotations.runbook_url // "") == "")] | length' "$f")
  if [ "$missing" -gt 0 ]; then
    echo "$missing alert(s) missing runbook_url in $f" >&2
    exit 1
  fi
done
  • Golden paths: Template ops tasks (backups, dashboards) as reusable Terraform modules. Reward engineers who upstream improvements.
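Consuming a golden path should be a few lines, not a wiki page. An illustrative module call; the source URL, module name, and inputs are assumptions, not a published module:

```hcl
# Hypothetical golden-path module; pin a version so upgrades are deliberate
module "rds_failover" {
  source = "git::https://git.example.com/platform/terraform-rds-failover.git?ref=v1.2.0"

  identifier          = "checkout-db"
  multi_az            = true
  backup_retention    = 7
  alarm_sns_topic_arn = aws_sns_topic.oncall.arn # assumes an existing SNS topic
}
```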

When you reward these behaviors, you get more of them. Funny how that works.

Key takeaways

  • If your ladder rewards only feature delivery, engineers will rationally ignore reliability work.
  • Make reliability legible: track SLOs, MTTR, change failure rate, incident volume per service, and toil hours.
  • Bake reliability into levels with concrete behaviors, not vibes—e.g., “owns error budget policy” at Senior.
  • Use rituals—weekly reliability review, on-call debriefs, QBR error-budget check—to keep it visible.
  • Gate promotions and releases on reliability signals; automate checks in CI/CD where possible.
  • Pilot on 2-3 services for a quarter; publish results and expand. Tie to business outcomes, not slogans.

Implementation checklist

  • Define 2-3 SLOs per critical service with customer-facing semantics.
  • Add reliability behaviors to each ladder level with measurable signals.
  • Stand up weekly Reliability Review and end-of-rotation on-call debriefs.
  • Require postmortems with action item SLAs and ownership for P1–P2 incidents.
  • Automate an SLO burn-rate gate in CI/CD for risky deploys.
  • Update promotion packet templates to require reliability evidence.
  • Allocate focus time (e.g., 10–20%) specifically to toil reduction and debt paydown.
  • Publish quarterly error-budget and DORA metrics in QBRs.

Questions we hear from teams

How do we avoid weaponizing SLOs against teams?
Use SLOs as guardrails, not bludgeons. Pair error-budget policies with support: capacity earmarked for fixes, staff sponsorship, and time to improve observability. Never tie SLO breaches directly to punitive measures; focus on system learning and risk reduction.
We don’t have SREs. Can product engineers own this?
Yes. Plenty of orgs run “SRE-as-a-verb.” Start by giving product teams SLO ownership, a reliability champion per team, and platform support for canary, dashboards, and runbooks. Reward the champion work in the ladder or rotate the role quarterly.
Our stack is Datadog/New Relic, not Prometheus. Does this still work?
Absolutely. The principles are tool-agnostic. Replace PromQL with Datadog Monitor queries or NRQL. You still need SLOs, alert/runbook linkage, and rituals. What matters is consistent measurement and visibility.
How do we measure toil without micromanaging time?
Tag work, not minutes. Use a Jira custom field like `reliability=true` and `incidentAction=toil`. Report on issue counts and outcomes (e.g., reduced pages, faster MTTR), not personal timesheets.
What if PMs push back on allocating time?
Bring data to the roadmap. Show error-budget burn, change failure rate, and the historical cost of incidents (lost hours, refunds, churn). Frame reliability as enabling faster, safer delivery. Also, set an explicit policy: if burn > threshold, capacity shifts automatically.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about building a reliability-centered career ladder.
Download the reliability ladder rubric (YAML + packet template).
