Will this penalize feature teams that have to move fast?

No, if you do it right. Error budgets are a throttle, not a brake. When SLOs are healthy, ship. When they’re not, reliability work becomes the most valuable thing you can do to protect revenue and morale. Tie Product OKRs to error budget health so product and engineering stay aligned on when to ship and when to stabilize.
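To make the throttle idea concrete, here is a minimal Python sketch of an error budget calculation. The class name, the `release_posture` helper, the 25% caution threshold, and the three posture labels are all hypothetical; the real policy is whatever you agree on with Product.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float   # e.g. 0.995 means 99.5% of events must be good
    total_events: int   # events observed in the SLO window
    bad_events: int     # events that violated the SLI

    def __post_init__(self) -> None:
        # Assumption: a 100% target leaves no budget, so it is rejected here.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")

    @property
    def allowed_bad(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = untouched budget, 0.0 = fully spent, negative = overspent.
        if self.total_events == 0:
            return 1.0
        return 1.0 - self.bad_events / self.allowed_bad


def release_posture(budget: ErrorBudget, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a hypothetical release policy label."""
    if budget.remaining_fraction <= 0:
        return "prioritize-reliability"
    if budget.remaining_fraction < caution_below:
        return "ship-with-caution"
    return "ship"


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.995, total_events=1_000_000, bad_events=1_000)
    print(release_posture(budget))
```

The point of the graded labels is that the budget slows releases down gradually instead of stopping them outright.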

How do we handle junior engineers who aren’t ready for on-call?

Pair them with a buddy on-call, give them scoped runbooks, and have them shadow incidents. Count reliability work such as dashboard improvements and test hardening toward their growth. Don’t gate promotions on pager time; gate on readiness and documented improvements.

What if we don’t have SLOs yet?

Start with coarse targets for your top 3 customer flows (e.g., p95 checkout latency < 300 ms, availability 99.5%). Instrument those first; refinement can come later. Track them in Datadog or Prometheus and revisit the targets quarterly.
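As a rough illustration of checking one flow against coarse targets like these, the Python sketch below computes a nearest-rank p95 and an availability ratio from raw samples. The function names and the result keys are assumptions for this example; in practice Datadog or Prometheus would compute these from your metrics.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_flow(
    latencies_ms: list[float],
    successes: int,
    total: int,
    p95_target_ms: float = 300.0,       # example target from the text
    availability_target: float = 0.995,  # example target from the text
) -> dict:
    """Compare one customer flow against its latency and availability targets."""
    if total <= 0:
        raise ValueError("total must be positive")
    p95 = percentile(latencies_ms, 95)
    availability = successes / total
    return {
        "p95_ms": p95,
        "availability": availability,
        "latency_ok": p95 < p95_target_ms,
        "availability_ok": availability >= availability_target,
    }


if __name__ == "__main__":
    samples = [120.0] * 18 + [280.0, 900.0]
    print(check_flow(samples, successes=9_960, total=10_000))
```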

How do platform teams show reliability impact?

Use adoption and stability metrics: upgrade lead time, success rate for cluster rollouts, mean time to rollback, node drain safety, reduction in toil tickets. Publish SLOs for the platform itself (control plane uptime, deploy success rate).

Culture · Nov 7, 2025 · 10 minute read

Promotions Shouldn’t Go To Pager Heroes: Career Ladders That Reward Reliability Work

Make reliability contributions first‑class in your engineering career framework, not invisible chores tagged as “nice to have.”

Alex Mercer

Partner, GitPlumbers (ex-Adobe, ex-Stripe, ex-HashiCorp)

20 years shipping and rescuing distributed systems. Led SRE and platform teams through PCI audits, multi-region cutovers, and the occasional 3 a.m. Kafka storm. Now helping enterprises make reliability a career accelerant, not an afterthought.

Promote the engineers who make your systems boring.

Key takeaways

If you can’t measure reliability impact, your ladder will reward features by default.
Bake reliability competencies into every level with concrete, observable behaviors.
Tie promotions to SLOs, change failure rate, MTTR, and documented runbooks—not heroics.
Use rituals (reliability council, incident reviews, change audits) to surface work that’s usually invisible.
Automate tagging and evidence collection so managers don’t become bookkeepers.
Leaders must budget time (10–20%) and protect it during planning—or incentives will drift back to features.
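Two of the signals named above, change failure rate and MTTR, are simple to compute once deploy and incident records exist. The Python sketch below assumes hypothetical `Deploy` and `Incident` record shapes; your deploy pipeline and incident tracker will expose different fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    caused_incident: bool  # assumption: deploys are linked to incidents upstream


@dataclass
class Incident:
    service: str
    started: datetime
    resolved: datetime


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Share of deploys that led to an incident (0.0 when there were no deploys)."""
    if not deploys:
        return 0.0
    return sum(d.caused_incident for d in deploys) / len(deploys)


def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average time from incident start to resolution."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return total / len(incidents)


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 9, 0)
    incidents = [
        Incident("checkout", t0, t0 + timedelta(minutes=30)),
        Incident("checkout", t0, t0 + timedelta(minutes=90)),
    ]
    deploys = [Deploy("checkout", False)] * 3 + [Deploy("checkout", True)]
    print(change_failure_rate(deploys), mean_time_to_restore(incidents))
```

Comparing these numbers before and after an engineer's project gives a promotion packet evidence of an outcome, which pager volume alone does not.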

Implementation checklist

Define 3–5 company-wide reliability competencies mapped to each level (L3–L7).
Instrument SLOs per service and agree on error budget policies with Product.
Launch a monthly Reliability Council with clear agenda and promotion signal.
Automate PR labeling for reliability work and collect evidence in one place.
Add reliability scoring to performance reviews with weightings by level.
Pilot in one org for a quarter, publish results, then roll out incrementally.
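For the PR labeling step in the checklist, one possible approach is to derive labels from the paths a pull request touches. The Python sketch below is illustrative only: the label names and path patterns are hypothetical, and applying the labels through your code host's API is left out.

```python
from fnmatch import fnmatchcase

# Hypothetical label-to-pattern mapping; adjust to your repository layout.
# Note that fnmatch-style "*" also matches "/" so nested paths are covered.
RELIABILITY_PATTERNS = {
    "reliability/runbook": ["runbooks/*", "docs/runbooks/*"],
    "reliability/observability": ["dashboards/*", "alerts/*"],
    "reliability/testing": ["tests/*", "*_test.py"],
}


def reliability_labels(changed_files: list[str]) -> list[str]:
    """Return the sorted reliability labels implied by a PR's changed file paths."""
    labels = set()
    for label, patterns in RELIABILITY_PATTERNS.items():
        if any(fnmatchcase(path, p) for path in changed_files for p in patterns):
            labels.add(label)
    return sorted(labels)


if __name__ == "__main__":
    print(reliability_labels(["runbooks/payments/rollback.md", "src/app.py"]))
```

Path-based rules will miss some work, such as a reliability fix inside application code, so a manual label remains useful as a fallback.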
