Stop Promoting Pager Tourists: Career Frameworks That Reward Reliability Work

If your promotion packets celebrate launch counts but ignore SLOs, MTTR, and pager load, you’re paying for outages with morale. Here’s the career ladder redesign that actually rewards reliability.

Reliability isn’t a personality trait. It’s a set of behaviors we can observe and reward.

The promotion system that quietly punishes reliability

I watched a unicorn-scale fintech melt down after a “quick” microservice rewrite plus Istio mesh rollout. The feature hero who shipped the new pricing engine got promoted. The SRE who gutted a Friday night to unwind a bad circuit breaker default and stabilize canaries? A pat on the back and comp neutral. Six months later, that SRE left and MTTR doubled. None of this was malicious. The ladder celebrated shiny launches and titled project plans. It never said “keep prod boring” out loud.

If your promotion packets read like launch press releases, you’re subsidizing outages with morale. Let’s fix that with a career framework that rewards reliability work the way CFOs reward revenue: by design.

Write reliability into the ladder — explicitly

Stop hoping managers “consider” reliability. Put it in the rubric at every level with behaviors you can observe and metrics you can verify. Here’s a concrete, boring-on-purpose example you can hand to HR and not get laughed out of comp calibration.

# career-ladder.yaml
levels:
  IC3:
    competencies:
      reliability:
        behaviors:
          - Owns on-call for a service; follows runbooks; writes clear incident notes.
          - Contributes to postmortems; fixes at least one action item per quarter.
        outcomes:
          - Keeps personal MTTR within team’s SLO targets.
          - Reduces toil by ~8 hours/quarter via automation or docs.
  IC4:
    competencies:
      reliability:
        behaviors:
          - Designs features with SLOs and error budgets; adds `Prometheus` metrics.
          - Leads one postmortem/quarter; implements canary + rollback in `ArgoCD`.
        outcomes:
          - Decreases change failure rate on owned service by 20% YoY.
          - Cuts pager alerts during business hours by 30% via alert tuning.
  IC5 (Staff):
    competencies:
      reliability:
        behaviors:
          - Defines SLOs across a domain; negotiates with product on error budgets.
          - Introduces circuit breaker patterns (`Istio`/`Envoy`) and feature flag kill switches.
        outcomes:
          - Improves MTTR across domain by 25%; eliminates a class of incidents.
          - Mentors teams to adopt GitOps; establishes `ArgoCD` health checks.
  M3 (Eng Manager):
    competencies:
      reliability:
        behaviors:
          - Maintains fair on-call; caps alert volume; protects focus time.
          - Runs weekly ops review; ensures postmortems within 72h and actions tracked.
        outcomes:
          - Team meets SLOs for 3 consecutive quarters; change failure rate < 15%.
          - Allocates 15% capacity to toil reduction; proves hours saved.

Note the structure: behaviors are observable; outcomes have numbers. Promotions should cite both.
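Since the rubric lives in a YAML file, you can lint it in CI so "outcomes have numbers" stays true as levels get edited. A minimal sketch in Python; the `LADDER` dict is a hand-built stand-in for the parsed career-ladder.yaml, not a real parser:

```python
import re

# Hand-built stand-in for the parsed career-ladder.yaml above (illustrative).
LADDER = {
    "IC3": {
        "behaviors": ["Owns on-call for a service; follows runbooks."],
        "outcomes": ["Reduces toil by ~8 hours/quarter via automation or docs."],
    },
    "IC4": {
        "behaviors": ["Leads one postmortem/quarter."],
        "outcomes": ["Decreases change failure rate on owned service by 20% YoY."],
    },
}

def lint_ladder(ladder: dict) -> list[str]:
    """Flag levels missing observable behaviors, or outcomes with no number to verify."""
    problems = []
    for level, rubric in ladder.items():
        if not rubric.get("behaviors"):
            problems.append(f"{level}: no observable behaviors")
        for outcome in rubric.get("outcomes", []):
            if not re.search(r"\d", outcome):
                problems.append(f"{level}: outcome lacks a number: {outcome!r}")
    return problems

print(lint_ladder(LADDER))  # → [] when every outcome is measurable
```

Run it in the same pipeline that publishes the ladder; a rubric edit that drops the numbers fails the build instead of surfacing at calibration.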

Make reliability work visible with rituals

You can’t reward what you can’t see. Most enterprises have the data; it’s just siloed. Bake visibility into your operating rhythm.

  • Weekly ops review (30 minutes, cameras on)
    • Review SLO burn, top 5 pages by frequency, open postmortem actions, change failure rate.
    • Celebrate prevention: “X eliminated 200 alerts with a single Prometheus rule change.”
  • Blameless postmortems within 72 hours
    • Require one concrete prevention action item per severity-1 incident.
    • Track actions like features in Jira—sized, prioritized, assigned.
  • SLO office hours
    • Senior ICs teach teams to define SLOs like p95 latency < 250ms and availability >= 99.9%.
    • Product joins to discuss error budgets and launch gates.
  • Change review for risky deploys
    • Use canary deployment and feature flags as standard, not hero moves.
    • Document rollback plans in the PR; tie to ArgoCD rollout steps.

Example: a simple Prometheus alert that rewards prevention, not pager roulette.

# slo-burn.rules.yaml
groups:
- name: slo-burn
  rules:
  - record: service:latency_p95
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
  - alert: ErrorBudgetBurn
    expr: (service:latency_p95 > 0.25) and (sum(rate(http_requests_total{code=~"5.."}[5m])) by (service) > 0)
    for: 15m
    labels:
      severity: page
    annotations:
      summary: High latency and errors indicate SLO burn.
      runbook: https://runbooks.mycorp.com/slo-burn

Postmortems should link to the alert and the runbook. Close the loop.
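The budget math behind that alert is worth keeping explicit, because "SLO burn" only means something relative to the budget. A sketch of the arithmetic, assuming a 99.9% availability SLO over a 30-day window; the ~14x paging threshold is the common multi-window convention, not something your numbers require:

```python
# Error-budget arithmetic: a 99.9% SLO leaves 0.1% of the window as budget.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed bad minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(bad_fraction: float, slo: float) -> float:
    """How fast the budget burns: 1.0 is exactly on-budget pace;
    multi-window alerting setups commonly page around ~14x."""
    return bad_fraction / (1 - slo)

print(round(error_budget_minutes(0.999), 1))      # → 43.2 minutes/month
print(round(burn_rate(0.014, 0.999), 2))          # → 14.0
```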

Tie promotions to measurable reliability outcomes

Launches make nice demos; reliability moves business needles. Pick metrics leaders recognize.

  • SLO attainment and error budget burn
    • Example signal: three consecutive quarters within budget for a service an IC leads.
  • MTTR and time-to-detect (TTD)
    • Shorten MTTR with better runbooks, not heroics.
  • DORA metrics: change failure rate, deployment frequency, lead time for changes
  • Toil hours eliminated
    • Automations, playbooks, and docs that reduce recurring manual work.
  • Pager load fairness and alert quality
    • Lower total pages and shift left to business hours; reduce false positives.
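These metrics are cheap to compute once incidents and deploys are recorded somewhere queryable. A sketch with hand-built records; the field names are illustrative, not a real schema:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident and deploy records (illustrative shapes).
incidents = [
    {"detected": datetime(2024, 3, 1, 9, 0), "resolved": datetime(2024, 3, 1, 9, 30)},
    {"detected": datetime(2024, 3, 8, 14, 0), "resolved": datetime(2024, 3, 8, 14, 50)},
]
deploys = [
    {"status": "ok"}, {"status": "ok"}, {"status": "failed"}, {"status": "ok"},
    {"status": "ok", "rollback": True},
]

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean time to restore, in minutes."""
    return mean((i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents)

def change_failure_rate(deploys: list[dict]) -> float:
    """Fraction of deploys that failed or were rolled back (the DORA definition)."""
    bad = sum(1 for d in deploys if d["status"] == "failed" or d.get("rollback"))
    return bad / len(deploys)

print(mttr_minutes(incidents))        # → 40.0
print(change_failure_rate(deploys))   # → 0.4
```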

Automate the accounting. If your promo packet requires archeology, you’ll bias toward “shiny slide” projects.

# .github/workflows/reliability-rollup.yaml
name: reliability-rollup
on:
  schedule: [{ cron: '0 3 * * 1' }]  # Mondays 3am UTC
jobs:
  rollup:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Query GitHub issues/PRs with reliability labels
        uses: actions/github-script@v7
        with:
          script: |
            const labels = ['slo', 'toil-reduction', 'postmortem', 'oncall-hardening'];
            // Fetch PRs merged in the last quarter, tally by author and label,
            // then write rollup.csv for the upload step below.
            // (full query/tally code omitted for brevity)
            core.info('Aggregated reliability contributions.');
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: reliability-rollup
          path: rollup.csv
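The elided github-script body reduces to a label tally. The same logic as a standalone sketch, with `merged_prs` as a hand-built stand-in for the API response:

```python
from collections import Counter

RELIABILITY_LABELS = {"slo", "toil-reduction", "postmortem", "oncall-hardening"}

# Stand-in for merged PRs fetched from the GitHub API (illustrative shape).
merged_prs = [
    {"author": "riley", "labels": ["slo", "docs"]},
    {"author": "riley", "labels": ["toil-reduction"]},
    {"author": "sam", "labels": ["postmortem"]},
    {"author": "sam", "labels": ["feature"]},
]

def rollup(prs: list[dict]) -> Counter:
    """Count reliability-labeled PRs per author for the promo-packet CSV."""
    tally = Counter()
    for pr in prs:
        if RELIABILITY_LABELS & set(pr["labels"]):
            tally[pr["author"]] += 1
    return tally

for author, count in sorted(rollup(merged_prs).items()):
    print(f"{author},{count}")   # CSV rows: riley,2 then sam,1
```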

Tag your work. Don’t rely on folklore.

# Label hygiene
gh label create slo --description "SLO definition/changes"
gh label create toil-reduction --description "Automation and ops cleanup"
gh label create postmortem --description "Incident analysis and actions"

And store DORA stats where managers can find them.

-- BigQuery example: change failure rate last 90 days
SELECT
  COUNTIF(build_status = 'FAILED' OR rollback = TRUE) / COUNT(*) AS change_failure_rate
FROM `ci.cd_deployments`
WHERE deploy_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY) AND CURRENT_TIMESTAMP();

Calibrate review language so reliability gets credit

Managers default to “launched X” because it’s easy to describe. Give them a template that turns reliability into a crisp narrative.

  • Use outcome statements tied to org goals
    • “Reduced incident page volume by 42% QoQ by consolidating alerts and adding circuit breakers in Istio. Saved ~120 engineer hours.”
  • Cite evidence.
    • Link to Prometheus dashboards, postmortems, rollup.csv, and SLO docs.
  • Avoid hero worship.
    • “Led a boring quarter where SLO burn stayed within budget despite traffic doubling.”

Drop a career-matrix.md in each repo with expectations.

### IC4 Reliability Examples
- Designed kill switches via `feature flags` to gate risky rollouts; validated with canaries in `ArgoCD`.
- Authored runbooks; reduced MTTR from 45m to 18m on auth-service.
- Drove SLO agreement with product for `checkout` and held line on error budgets pre-Black Friday.

Codify ownership so the right people get paged and credited.

# CODEOWNERS
/services/checkout/ @team-checkout @staff-eng-riley
/runbooks/ @sre-guild
/.argo/ @platform-team

Leadership behaviors that prevent perverse incentives

I’ve seen teams create “incident hero points” and promptly spike change failure rates. Don’t gamify outages. Reward prevention.

  • Cap pager load
    • No more than 1 week on-call per 6 weeks; at most 2 off-hours pages per week on average.
  • Protect focus time
    • Schedule on-call backfill; move the roadmap when reality bites.
  • Budget for toil reduction
    • Make 10–20% of team capacity recurring; track hours saved and defects prevented.
  • Gate launches on error budgets
    • If a service is out of budget, slow feature ramps; invest in reliability first.
  • Normalize rolling back
    • “Rollback before root cause” as a leadership mantra; praise early reversions.
  • Pay down AI messes
    • If AI-generated code (“vibe coding”) shipped, schedule AI code refactoring and vibe code cleanup as first-class backlog—don’t pin it on the next on-call victim.
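Caps only work if someone checks them. A sketch of an off-hours pager cap check, with a hand-built page log; the tuple shape and the cap of 2 mirror the guardrail above, not any real export format:

```python
from collections import defaultdict

# Hypothetical page log: (engineer, week_number, during_business_hours)
pages = [
    ("riley", 1, False), ("riley", 1, False), ("riley", 1, False),
    ("sam", 1, True), ("sam", 2, False),
]

OFF_HOURS_CAP = 2  # max off-hours pages per engineer per week (policy above)

def cap_breaches(pages, cap=OFF_HOURS_CAP):
    """Return (engineer, week) pairs that exceeded the off-hours pager cap."""
    counts = defaultdict(int)
    for engineer, week, business_hours in pages:
        if not business_hours:
            counts[(engineer, week)] += 1
    return sorted(k for k, v in counts.items() if v > cap)

print(cap_breaches(pages))  # → [('riley', 1)]
```

Feed it from whatever your paging tool exports and review breaches in the weekly ops review, not at performance time.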

A pragmatic 90‑day rollout plan (enterprise-safe)

You don’t have to boil the ocean. Pick one domain, make it boring, then scale.

  1. Week 0–2: Define the rubric and labels
    • Finalize career-ladder.yaml with HR; align vocabulary with comp bands.
    • Create labels (slo, toil-reduction, postmortem, oncall-hardening); publish career-matrix.md templates.
  2. Week 3–6: Instrument and ritualize
    • Stand up SLO dashboards in Prometheus/Grafana; add basic burn alerts.
    • Run weekly ops review; enforce 72h postmortems and action tracking.
    • Add canary + rollback policies in ArgoCD; require runbooks in repos.
  3. Week 7–10: Automate attribution
    • Deploy the GitHub Action to roll up reliability contributions.
    • Wire DORA queries to a shared dashboard; add MTTR per service.
  4. Week 11–12: Calibrate and commit
    • Run a mock promo cycle using the new artifacts.
    • Adjust language with HR; train managers; publish examples.

Expected outcomes within a quarter (I’ve seen these at an F500 replatform and a high-growth SaaS):

  • 20–40% reduction in alert volume after tuning and runbooks.
  • 15–30% improvement in MTTR with clean handoffs and owned services.
  • Feature velocity stabilizes because rollbacks are routine, not political.

Real-world gnarl: Terraform, traffic, and “quiet” wins

At a large retailer, we tied reliability work to clearly owned infra. A Staff IC rewrote a flaky Terraform module that handled ALB health checks. Before: canary would pass in staging, fall over under Istio mTLS in prod, and trigger thrash. After: correctness by default, plus a one-line ArgoCD health gate. Result: zero rollbacks in peak week and 60% fewer pages. That Staff IC got promoted on a packet that didn’t mention a single feature launch—just hard reliability gains and avoided revenue loss.

# terraform-alb-healthcheck.tf
resource "aws_lb_target_group" "checkout" {
  name     = "checkout-tg"
  protocol = "HTTP"
  port     = 8080
  health_check {
    path                = "/healthz"
    matcher             = "200-299"
    healthy_threshold   = 3
    unhealthy_threshold = 2
    interval            = 10
    timeout             = 5
  }
  stickiness { type = "lb_cookie" }
}
# deployment.yaml (readiness probe; Argo CD derives the resource's health from it)
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 5
  initialDelaySeconds: 10

No cinematic incident. Just fewer pages and more sleep. That should be promotable.

What I wish I’d done sooner

  • Wrote reliability into the ladder years earlier instead of hoping managers “remembered.”
  • Automated the rollup so promo packets weren’t detective work.
  • Celebrated runbooks and rollbacks publicly.
  • Drew a hard line on AI-generated code: no merge without observability and rollback plan.

Reliability isn’t a personality trait. It’s a set of behaviors we can observe and reward. Put it in the ladder and pay the people who keep the lights on.


Key takeaways

  • Write reliability into the ladder at every level with clear, observable behaviors and business outcomes.
  • Make reliability work visible through weekly ops reviews, blameless postmortems, and label-driven PR/issue tracking.
  • Use SLOs, MTTR, change failure rate, and toil hours removed as promotion evidence—not launch counts.
  • Reward prevention, not heroics—limit on-call load, rotate fairly, and value boring success.
  • Automate attribution with labels, dashboards, and simple pipelines so managers aren’t doing forensic accounting during promo season.
  • Roll out in 90 days via a pilot, HR calibration, and manager training—then scale.

Implementation checklist

  • Define reliability competencies per level (IC/M) with concrete behaviors and outcomes.
  • Adopt rituals: weekly ops review, postmortems within 72h, SLO office hours, change review.
  • Instrument metrics: SLOs, MTTR, change failure rate, deployment frequency, toil hours removed.
  • Tag reliability work with labels and CODEOWNERS; automate rollups for promo packets.
  • Set leadership guardrails: cap pager load, protect focus time, budget for toil reduction.
  • Pilot with one org, calibrate with HR, train managers, and iterate quarterly.

Questions we hear from teams

How do we avoid rewarding ‘incident heroes’ instead of prevention?
Define success as fewer, shorter incidents and documented prevention. Cap pager load, require postmortems with prevention actions, and value rollbacks and kill switches. Do not count pages as points.
What if product pushes back on error budgets blocking launches?
Make the trade-off explicit in planning: breaching error budgets increases churn and support costs. Use SLO data to quantify the revenue risk. Negotiate staged rollouts, feature flags, and risk-limiting canaries instead of hard stops.
We’re an enterprise with strict HR bands. How do we fit this in?
Translate reliability into the existing competency language (systems thinking, influence, delivery). Provide example behaviors and outcomes per level—HR-friendly wording plus hard metrics. Pilot with one org to prove it aligns to performance, not heroics.
How do we measure toil reduction credibly?
Track before/after counts of manual steps or time-per-step. Example: rotate secrets via automation saved 10 minutes per deploy x 200 deploys = ~33 hours/quarter. Require links to scripts, PRs, and runbooks.
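That arithmetic generalizes to a one-liner worth keeping next to your rollup so toil claims stay auditable:

```python
def toil_hours_saved(minutes_per_run: float, runs_per_quarter: int) -> float:
    """Hours of recurring manual work removed per quarter by an automation."""
    return minutes_per_run * runs_per_quarter / 60

# The secrets-rotation example above: 10 min/deploy x 200 deploys
print(round(toil_hours_saved(10, 200), 1))  # → 33.3
```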
What about AI-generated code that increases incident risk?
Set a policy: no AI-generated PR merges without observability, rollback, and test coverage. Tag cleanup as `toil-reduction` and reward `AI code refactoring` that reduces incidents. Don’t punish the on-call who cleans up vibe code; promote the prevention.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about a reliability-first career framework
Rescue AI-generated code before it rescues your pager
