From Bus Factor 1 to 3 in 90 Days: A Mentorship Playbook for Critical System Knowledge
Your senior engineer is interviewing. Your on-call is shaky. You’ve got six months of runway and tribal knowledge everywhere. Here’s the mentorship program that actually transfers system knowledge without blowing up your roadmap.
Redundancy isn’t a cost; it’s insurance against your next 3 a.m. incident. Mentorship is how you buy it.
You don’t have a retention problem, you have a transmission problem
I’ve been in too many war rooms where the only person who can decode the 2017 Kafka consumer is on PTO in Bali. The org blames hiring, but the real issue is that core system knowledge lives in two heads, three stale Confluence pages, and a bunch of Slack DMs.
At a fintech I worked with, MTTR was creeping past 6 hours because the only person who understood the payment reconciliation batch was a Staff engineer who “planned to document it” after the next release. Predictably, he resigned. That week we paid $250k in credits and lost two enterprise customers.
You don’t fix that with a wiki dump. You fix it with a mentorship program built like a product—with rituals that transmit tacit knowledge, leadership behaviors that protect the time, and metrics that Finance and the board understand.
Design the program like a product, not an HR initiative
If it’s not staffed, scheduled, and measured, it’s theater. Treat knowledge transfer as a 90-day product with a backlog and SLOs.
Time budget: 10–15% of engineering capacity for 90 days. Yes, you can afford it; no, you can’t afford the outage.
Scope: Only the systems that page you at 3 a.m. or block revenue. Use a risk rank: Critical, High, Medium.
Roles:
- Mentors: Staff/Principal who hold the keys (SREs for infra, senior ICs for services).
- Mentees: Next-line owners (one dev and one SRE per critical system).
- Program Owner: EM or TPM who runs the cadence and reports progress.
Artifacts: MENTORSHIP.md, refreshed runbooks, ADRs, recorded code/dashboard tours, updated CODEOWNERS, on-call shadow schedule.
Tools: Backstage for the service catalog, PagerDuty for on-call, Datadog/Grafana+Prometheus for dashboards, Loom for recordings, Sourcegraph for code tours, Confluence or Notion for runbooks, adr-tools for ADRs.
Success criteria:
- Bus factor ≥ 2 for every Critical service.
- Onboarding time to first prod change ≤ 30 days.
- MTTR down 30–50%.
- Change failure rate trending to < 15%.
If you can’t point to owners, time on the calendar, and a dashboard with MTTR and bus factor, you don’t have a program. You have vibes.
Communication rituals that actually transfer knowledge
I’ve seen “brown bags” flop because they’re passive and optional. These rituals force contact with painful reality and create artifacts along the way.
On-call shadowing (weekly):
- Week 1–2: Mentee shadows, silent.
- Week 3–4: Mentee drives with mentor as safety.
- Deliverables: Updated runbook steps, recorded dashboard tour, list of common pages and playbooks in PagerDuty.
PR office hours (2x/week, 60 min):
- Mentors batch-review PRs from mentees on the target systems.
- Focus: failure modes, rollback strategy (ArgoCD health checks, kubectl rollout undo, feature flag toggles).
- Deliverables: Checklist of “gotchas” added to CONTRIBUTING.md and the runbook.
Design review ride-alongs (bi-weekly):
- Mentees present small changes to the critical system in the actual review board.
- Deliverables: One ADR per change that cements why we did X, not Y. Use adr-tools and link from Backstage.
Code tours (weekly, 20–30 min):
- Record with Loom or Sourcegraph context: package boundaries, init flows, config, retries, circuit breakers, idempotency.
- Deliverables: Link in Backstage + repo README.
Runbook walkthroughs (bi-weekly):
- Practice the top 5 incidents from last year. Unplug a dependency in a staging environment (light chaos), verify rollback.
- Deliverables: “Last verified” timestamp on runbook, alert-to-action mapping.
Slack rituals:
- #oncall-warroom pinned with playbooks and escalation ladder.
- Friday bot post: “Which runbooks were validated this week?” with emoji tallies and links.
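The Friday bot check is easy to automate. A minimal sketch, assuming you can pull each runbook's "last verified" date from your catalog metadata (runbook names and dates below are hypothetical):

```python
from datetime import date, timedelta

def stale_runbooks(runbooks, max_age_days=60, today=None):
    """Return runbook names whose 'last verified' date is older than max_age_days."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, verified in runbooks.items() if verified < cutoff)

# Hypothetical data -- in practice, read 'last verified' from Backstage metadata
# and post the result to Slack via your bot of choice.
runbooks = {
    "payments-recon": date(2025, 1, 10),
    "kafka-consumer-lag": date(2025, 3, 1),
}
print(stale_runbooks(runbooks, today=date(2025, 3, 15)))  # ['payments-recon']
```

Wire the output into the Friday Slack post so staleness is public, not buried in a dashboard.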
These aren’t optional. They’re on the calendar with attendance. Every ritual produces an artifact that outlives the people in the room.
Leadership behaviors that make it stick
Every failed program I’ve seen had good intentions and no air cover. Here’s what actually works:
Protect time in the roadmap: Dedicate 10–15% capacity explicitly. Show it on the Gantt. Don’t hide it under “engineering efficiency.”
Tie mentoring to performance and promotion: Staff+ ladders should require documented mentoring outcomes (artifacts, mentee ownership, reduced MTTR). Recognize mentors in calibration.
Make it visible: Weekly Slack post from the VP Eng with a screenshot of the bus factor dashboard and a shout-out to teams hitting targets.
Staff a program owner: Not a committee. One EM/TPM with authority to move meetings, chase artifacts, and escalate. Treat it like a reliability program.
Fund the platform: If you don’t have a service catalog (Backstage) and enforced CODEOWNERS, you’re fighting entropy by hand.
Model the behavior: Directors/VPs attend the first retro and the first on-call shadow. If leadership doesn’t care, neither will ICs.
A 90-day implementation plan you can actually ship
Keep it boring and relentless.
Days 1–14: Baseline and target
Build a knowledge map in Backstage with risk ratings and current owners.
Baseline metrics: MTTR (from PagerDuty Analytics), DORA (from GitHub/Jenkins/CircleCI), onboarding time (from Jira), bus factor (from repo authorship + CODEOWNERS).
Publish MENTORSHIP.md with goals, rituals, and the calendar.
Days 15–30: Stand up rituals and artifacts
Launch on-call shadowing, PR office hours, and code tours. Record everything.
Create/refresh top 10 runbooks; verify in staging. Add rollback procedures (ArgoCD, helm, feature flags).
Start the ADR habit; link from Backstage service pages.
Days 31–60: Transfer ownership under supervision
Mentees drive changes with mentors reviewing. Ship at least one production change per mentee using GitOps (ArgoCD or Flux).
Expand CODEOWNERS to include mentees on critical paths.
Validate monitoring: add missing SLOs and alert tuning in Prometheus/Datadog.
Days 61–90: Prove redundancy and reduce heroics
Mentees take one low-risk on-call shift with mentor backup.
Run a failure game day for each critical service. Track time-to-detect, time-to-mitigate.
Lock in metrics: bus factor ≥ 2, onboarding time ≤ 30 days, MTTR down 30–50%.
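Game-day scoring is simple arithmetic once you log the timestamps. A sketch (the times below are illustrative, not from a real incident):

```python
from datetime import datetime

def game_day_metrics(injected, detected, mitigated):
    """Minutes from fault injection to detection (TTD) and to mitigation (TTM)."""
    ttd = (detected - injected).total_seconds() / 60
    ttm = (mitigated - injected).total_seconds() / 60
    return ttd, ttm

# Illustrative game day: dependency unplugged at 10:00, page fired at 10:04,
# rollback completed at 10:25.
ttd, ttm = game_day_metrics(
    injected=datetime(2025, 3, 1, 10, 0),
    detected=datetime(2025, 3, 1, 10, 4),
    mitigated=datetime(2025, 3, 1, 10, 25),
)
print(ttd, ttm)  # 4.0 25.0
```

Track these per service per game day; the trend line is what you show leadership.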
Deliverables worth copying into your repos:
# MENTORSHIP.md
Purpose: Transfer critical system knowledge for [Service X] to achieve Bus Factor ≥ 2 in 90 days.
Scope: [Critical/High components, dependencies, SLIs/SLOs, runbooks, dashboards].
Rituals:
- On-call shadowing (weekly)
- PR office hours (Tue/Thu)
- Code tours (weekly)
- Runbook walkthroughs (bi-weekly)
- Design review ride-alongs (bi-weekly)
Artifacts:
- ADRs (docs/adr)
- Runbooks (docs/runbooks)
- Recordings (links in Backstage)
- Updated CODEOWNERS
Owners: Mentor [@handle], Mentee [@handle], Program Owner [@handle]
Metrics: MTTR, CFR, Onboarding time, Bus factor
# CODEOWNERS
/services/payments/ @alice @bob
/infra/terraform/modules/vpc @sre1 @sre2
/helm/charts/reconciliation @carol @dave
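The ≥ 2-owner rule can be enforced in CI with a few lines. A minimal sketch that parses simple CODEOWNERS entries (it ignores glob patterns, sections, and team expansion; the sample paths are from the example above):

```python
def owner_gaps(codeowners_text, min_owners=2):
    """Flag CODEOWNERS paths listing fewer than min_owners owners."""
    gaps = {}
    for line in codeowners_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        path, *owners = line.split()
        if len(owners) < min_owners:
            gaps[path] = owners
    return gaps

sample = """
/services/payments/ @alice @bob
/helm/charts/reconciliation @carol
"""
print(owner_gaps(sample))  # {'/helm/charts/reconciliation': ['@carol']}
```

Fail the build when the dict is non-empty and redundancy stops regressing silently.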
Metrics you can defend to Finance
If you can’t measure it in a board deck, it didn’t happen. Start with baselines and show the trend line every two weeks.
MTTR: Target 30–50% reduction on critical services.
- Source: PagerDuty Analytics, incidents tagged to services. Time from trigger to mitigate.
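If you mirror incidents into a warehouse or pull them via API, the MTTR roll-up itself is trivial. A sketch with illustrative trigger/mitigate pairs:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time-to-restore in minutes from (triggered, mitigated) pairs."""
    durations = [(m - t).total_seconds() / 60 for t, m in incidents]
    return sum(durations) / len(durations)

# Illustrative incidents -- in practice, pull these from PagerDuty per service.
incidents = [
    (datetime(2025, 1, 5, 3, 0), datetime(2025, 1, 5, 7, 30)),      # 270 min
    (datetime(2025, 1, 12, 14, 0), datetime(2025, 1, 12, 15, 30)),  # 90 min
]
print(mttr_minutes(incidents))  # 180.0
```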
Change Failure Rate (CFR): Target < 15%.
- Source: GitHub deployments + rollback events (ArgoCD health degraded, kubectl rollout undo), incidents linked in Jira.
Onboarding time to first prod change: Target ≤ 30 days.
- Source: Jira cycle time for the “new hire” label + first merged PR to protected branch + first deploy.
Bus factor (redundancy): Target ≥ 2 for every critical component.
- Source: CODEOWNERS + last-90-day committers + on-call eligibility in PagerDuty.
Runbook coverage and freshness: Target 100% coverage for critical pages; “last verified” within 60 days.
- Source: Backstage metadata; weekly bot check.
Handy commands/snippets:
# GitHub: top committers on a path for the last 90 days
git log --since="90 days ago" --pretty="%an" -- services/payments | sort | uniq -c | sort -rn | head
# gh cli: PR cycle time in days for the new-hire label
gh pr list --search "label:new-hire merged:>2025-07-01" --json createdAt,mergedAt | jq '.[] | ((.mergedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 86400'
-- Example: incidents mapped to services (if you mirror into a warehouse)
select service, count(*) incidents, avg(mttr_minutes) avg_mttr
from pagerduty_incidents
where severity in ('high','critical') and created_at >= dateadd('day', -30, current_date)
group by 1 order by 3 desc;
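The git log command above feeds a rough bus-factor proxy: the smallest number of distinct authors covering most of the commits on a path. A sketch (the 80% threshold and the author counts are illustrative):

```python
from collections import Counter

def bus_factor(authors, threshold=0.8):
    """Smallest number of authors covering `threshold` of commits on a path."""
    counts = Counter(authors)
    total = sum(counts.values())
    covered = 0
    for k, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= threshold:
            return k
    return len(counts)

# Illustrative: authors from `git log --since="90 days ago" --pretty=%an -- services/payments`
authors = ["alice"] * 38 + ["bob"] * 8 + ["carol"] * 4
print(bus_factor(authors))  # 2 -> alice + bob cover 92% of commits
```

Cross-check the result against CODEOWNERS and on-call eligibility; commit authorship alone overstates redundancy.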
What actually worked—and what failed—in the real world
Worked:
Shadow → drive → own: We cut MTTR from 5h to 2.5h at a healthcare client by forcing mentees to run a live failover (with a safety net) before they were added to on-call.
Artifacts first: Recording a 25-minute Loom tour of the payments recon code saved three weeks of back-and-forth for every new hire. It’s linked from Backstage and the repo README.
Promotion credit: Mentors leaned in once it was explicit on the Staff+ ladder and recognized in calibration. Promotions sell programs.
Service catalog as the spine: Backstage eliminated the “where is the doc?” scavenger hunt. Every ritual artifact is linked to a single service entry.
Didn’t work:
Brown bags with no homework: People nodded; nothing changed. If an activity doesn’t change CODEOWNERS, a runbook, or a dashboard, it’s optional.
Confluence gardens with no gardeners: Docs rot. We added a “last verified” field and a bot that pings the owner monthly. That moved the needle.
Voluntary mentoring: Without a hard 10–15% time allocation, delivery pressure wins every time.
Big-bang documentation sprints: You can’t write your way out of tacit knowledge. You have to practice incidents, rollbacks, and deploys.
If you’re underwater, call GitPlumbers
We’ve walked into orgs mid-resignation and stabilized on-call in weeks. We don’t sell silver bullets. We bring a boring, measurable program your CFO can love and your engineers won’t hate. If your bus factor is 1 on a revenue-critical system, we’ll get you to 3 in a quarter—without pausing delivery.
Key takeaways
- Mentorship must be staffed like a product: clear scope, owners, timeline, and success metrics.
- Rituals beat documents: on-call shadowing, PR office hours, and runbook walkthroughs transfer real context.
- Leadership has to fund and protect time (10–15%) and tie mentoring to promotions and performance.
- Measure what matters: MTTR, change failure rate, onboarding time to first prod change, and component redundancy.
- Aim for redundancy, not heroics: at least two trained maintainers per critical component within 90 days.
Implementation checklist
- Allocate 10–15% of engineering capacity for 90 days—track it explicitly in the plan.
- Publish a `MENTORSHIP.md` with goals, rituals, artifacts, and owners.
- Stand up a service catalog (e.g., Backstage) and a living knowledge map with risk ratings.
- Schedule on-call shadowing and PR office hours with rotating ownership.
- Create or refresh runbooks and ADRs; enforce `CODEOWNERS` coverage.
- Baseline MTTR, onboarding time, change failure rate, and bus factor; set targets.
- Record code tours (Loom) and dashboard tours (Grafana/Datadog); attach to Backstage entries.
- Report weekly on ritual completion, artifact creation, and metric movement to execs.
Questions we hear from teams
- How do we make time for this without blowing our roadmap?
- Budget 10–15% explicitly and show it on the plan. Reduce scope elsewhere for 90 days. The payback shows up in fewer paging incidents, faster onboarding, and fewer hero bottlenecks. If leadership won’t protect the time, the program will fail.
- What if our senior folks don’t want to mentor?
- Tie mentoring outcomes to promotions and performance. Give them credit (and visibility) for artifacts and redundancy created. Also reduce their pager load as mentees ramp; that immediate relief changes attitudes.
- Do we need Backstage or can we use Confluence?
- You can start in Confluence, but you need a source of truth tied to services and ownership. Backstage makes it much easier to align docs, on-call, runbooks, and ADRs to the actual system surface area.
- Can we do this fully remote?
- Yes. Use recorded code tours (Loom), Sourcegraph tours, Zoom for paired sessions, and Slack bots to enforce rituals. The trick is to attach artifacts to the service catalog and enforce freshness with automation.