The Day Your Principal Walked and Your SRE Playbook Went With Them
Tribal knowledge doesn’t scale. Paved-road docs, service catalogs, and lightweight decision records do.
If it’s not in Git with an owner and a check, it didn’t happen.
The problem you’ve lived: expertise walked out at 5pm Friday
Two months after your principal SRE left, a noisy Redis failover woke up three teams. The new on-call burned 90 minutes figuring out which `sentinel.conf` was actually active because the Confluence page was three revisions behind and the Slack thread with the workaround was lost in a private channel. MTTR tripled. Monday’s exec review wasn’t about Redis; it was about credibility.
I’ve watched this movie at unicorns and at banks. The pattern is always the same: tribal knowledge, bespoke docs no one trusts, and zero incentives to keep anything current. The fix isn’t a knowledge graph or another wiki. It’s a boring, paved road that keeps knowledge where engineers live: Git, CI, and the service catalog.
If it’s not in Git with an owner and a check, it didn’t happen.
The principle: pave one road and make it the easy one
Bespoke tooling dies on the vine. Engineers follow gravity:
- Work happens in `git`, CI, and tickets, not in “yet another place.”
- Ownership is enforced by `CODEOWNERS`, not org charts.
- Freshness is a check in CI, not a quarterly initiative.
- Discovery happens via a service catalog, not tribal memory.
So favor simplification:
- Git-native docs: Markdown in `docs/` beside the code. Render with `mkdocs` or `docusaurus` via GitHub Actions.
- Service catalog: Backstage (thin setup) or a minimal internal catalog that lists services, owners, on-call, and links to runbooks and ADRs.
- Decision capture: 1-page `ADR` files, numbered, reviewed in PRs.
- Automation over aspiration: CI lints docs, fails stale runbooks, generates API and Terraform docs from source.
I’ve seen Backstage succeed when it’s a directory, not a platform rebuild. I’ve seen Confluence fail when it’s the only source: copy-paste drift is guaranteed.
The minimum viable knowledge system (MVKS)
Here’s the smallest system that actually sticks and scales across teams without a priesthood:
Repo-level docs
- `README.md`: purpose, how to run, how to deploy.
- `docs/runbook.md`: alerts, common failures, step-by-step fixes.
- `docs/adr/0001-record-architecture-decisions.md`: decisions with context, status, and consequences.
- `api/` with `openapi.yaml` or GraphQL schema; generate HTML docs on merge.
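Numbered ADRs are easier to keep up if creating one takes no thought. Here is a minimal sketch of a helper that picks the next number and renders a one-page skeleton. The function names, the `Status: Proposed` default, and the exact headings are assumptions for illustration, not a standard the post prescribes; the sections simply mirror the "context, status, and consequences" mentioned above.

```python
import re
from typing import List, Tuple

# Hypothetical one-page template; adjust headings to your own conventions.
ADR_TEMPLATE = """\
# {number:04d}. {title}

Status: Proposed

## Context

## Decision

## Consequences
"""


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def next_adr_number(existing_filenames: List[str]) -> int:
    """Return one more than the highest NNNN- prefix found, or 1 if there are none."""
    numbers = []
    for name in existing_filenames:
        match = re.match(r"(\d{4})-", name)
        if match:
            numbers.append(int(match.group(1)))
    return max(numbers, default=0) + 1


def new_adr(title: str, existing_filenames: List[str]) -> Tuple[str, str]:
    """Return (filename, body) for a new ADR; the caller writes it under docs/adr/."""
    number = next_adr_number(existing_filenames)
    filename = f"{number:04d}-{slugify(title)}.md"
    return filename, ADR_TEMPLATE.format(number=number, title=title)
```

Given the two ADRs in the example layout, `new_adr("Use Kafka for order events", [...])` would produce a file named `0003-use-kafka-for-order-events.md`. The helper only returns text; writing the file and opening the PR stay with the engineer, so the decision is still reviewed like code.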
Catalog the fleet
- Backstage with software templates off, just the catalog plugin and entity cards.
- Each service has `catalog-info.yaml` referencing repo, owner, runbook, ADRs, SLOs.
Ownership & checks
- `CODEOWNERS` for docs and `catalog-info.yaml`.
- PR template checklist: "Updated runbook? Added ADR for decisions?"
- CI job: fail if `docs/runbook.md` is older than 90 days or links break.
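The freshness half of that CI job can be a few lines of Python. This is a sketch under assumptions: "age" is taken to mean time since the last commit that touched the file, the path list is hard-coded, and the function names are made up for this example. One caveat worth checking in your own setup: a shallow checkout (the default for `actions/checkout`) may report every file as freshly committed, so the job would likely need full history.

```python
import subprocess
import time
from typing import Optional

MAX_AGE_DAYS = 90                    # threshold from the checklist above
CHECKED_PATHS = ["docs/runbook.md"]  # assumed layout; adjust per repo
SECONDS_PER_DAY = 86400


def last_commit_epoch(path: str) -> Optional[int]:
    """Unix time of the last commit touching `path`, or None if git has no history for it."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%ct", "--", path],
        capture_output=True, text=True, check=True,
    )
    stamp = result.stdout.strip()
    return int(stamp) if stamp else None


def is_stale(commit_epoch: Optional[int], now_epoch: float,
             max_age_days: int = MAX_AGE_DAYS) -> bool:
    """A file with no history counts as stale; otherwise compare its age to the limit."""
    if commit_epoch is None:
        return True
    return (now_epoch - commit_epoch) / SECONDS_PER_DAY > max_age_days


def main() -> int:
    """Print each stale path and return a non-zero exit code if any were found."""
    now = time.time()
    stale = [p for p in CHECKED_PATHS if is_stale(last_commit_epoch(p), now)]
    for path in stale:
        print(f"STALE: {path} has not changed in over {MAX_AGE_DAYS} days (or is missing)")
    return 1 if stale else 0
```

In a pipeline, the script would end with `sys.exit(main())` so a stale runbook fails the build. Keeping `is_stale` separate from the git call makes the threshold logic easy to unit test without a repository.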
Automation hooks
- `terraform-docs` to generate module README tables.
- `spectral` or `openapi-generator` to lint/generate API docs.
- `mkdocs` + `material` theme published to an internal site on every main branch build.
Example layout:
repo/
├─ README.md
├─ CODEOWNERS
├─ catalog-info.yaml
├─ docs/
│  ├─ runbook.md
│  └─ adr/
│     ├─ 0001-record-architecture-decisions.md
│     └─ 0002-choose-postgres-for-orders.md
├─ api/
│  └─ openapi.yaml
└─ infra/
   └─ terraform/
Minimal GH Actions to publish docs:
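name: docs
on: [push]
permissions:
  contents: read
  pages: write      # required by actions/deploy-pages
  id-token: write   # required by actions/deploy-pages
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install mkdocs mkdocs-material mkdocs-monorepo-plugin
      # --strict aborts the build on warnings, so doc problems fail CI
      - run: mkdocs build --strict
      - uses: actions/upload-pages-artifact@v3
        with: { path: site }   # mkdocs writes to site/ by default
      - uses: actions/deploy-pages@v4
        if: github.ref == 'refs/heads/main'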
This isn’t sexy. It is durable.
Before/after: what changes when you pave the road
Real numbers from teams we’ve helped at GitPlumbers:
Fintech (200+ microservices, Notion sprawl → Git + Backstage)
- Before: Onboarding took ~45 days; runbook drift caused 3 incidents with >2h MTTR in a quarter.
- After (90 days): Onboarding down to 22 days; median MTTR dropped 38%; 100% of services had owners, runbooks, and SLO links in Backstage. A docs site built from repos replaced 17 Notion spaces. Engineers liked it because the same PR that updated the docs also shipped the site.
SaaS analytics (Confluence-first → ADRs + doc checks)
- Before: Architecture decisions lived in Slack and memory. Repeated Kafka retention debates every 6 months.
- After (60 days): 47 ADRs captured. Time-to-decision in RFCs down from weeks to days because context was one click away. Less time was burned re-litigating old choices.
Marketplace (on-call chaos → runbook SLO)
- Before: PagerDuty pages routed to the wrong team 15% of the time; runbooks were outdated.
- After (45 days): Added doc freshness check and Slack reminders. Stale runbooks fell from 62% to 8%. Wrong-pager incidents dropped below 2% after catalog ownership fix.
These aren’t vanity metrics. They correlate to DORA lead time and change failure rate because fewer cycles are wasted rediscovering context.
Implementation in 30/60/90 days (no bespoke bloat)
30 days — lay the rails:
- Standardize a doc skeleton: publish templates for `README`, `runbook`, `ADR` in an `engineering-standards` repo.
- Roll out PR templates and `CODEOWNERS` org-wide.
- Stand up a minimal Backstage with your identity provider and catalog plugin only. Import top 20 services.
- Add a docs CI job to 5 pilot repos using `mkdocs` and link results in Backstage.
60 days — automate and migrate:
- Turn on doc freshness checks: fail if `runbook.md` or `catalog-info.yaml` is older than 90 days; notify owners via Slack bot.
- Generate code-adjacent docs: `terraform-docs` on infra repos; `openapi-generator` for APIs.
- Migrate critical pages from Confluence/Notion by link, not by copy: put a banner pointing to the new source in Git.
- Run 3 blameless incident reviews and require runbook/ADR updates as explicit action items.
90 days — make it sticky:
- Make docs part of Definition of Done: shipping without updated runbooks fails CI. No exceptions.
- Publish a lightweight catalog scorecard (Backstage or a script): owner set? on-call configured? runbook fresh? ADR exists?
- Report platform KPIs monthly: onboarding time, MTTR by severity, docs freshness %, catalog coverage.
- Archive the old wiki. Keep a read-only license for legal/compliance, but remove it from new-hire onboarding.
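The "or a script" version of the catalog scorecard can be very small. The sketch below assumes you have already pulled each service's metadata into a simple record; the `ServiceEntry` fields, the check names, and the sample services are hypothetical, and only the four questions (owner set, on-call configured, runbook fresh, ADR exists) and the 90-day limit come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MAX_RUNBOOK_AGE_DAYS = 90  # same threshold as the freshness check


@dataclass
class ServiceEntry:
    """Hypothetical per-service record, e.g. assembled from catalog-info.yaml and git."""
    name: str
    owner: Optional[str]
    on_call: Optional[str]
    runbook_age_days: Optional[float]  # None means no runbook was found
    adr_count: int


def score(entry: ServiceEntry) -> Dict[str, bool]:
    """Answer the four scorecard questions for one service."""
    return {
        "owner_set": bool(entry.owner),
        "on_call_configured": bool(entry.on_call),
        "runbook_fresh": (entry.runbook_age_days is not None
                          and entry.runbook_age_days <= MAX_RUNBOOK_AGE_DAYS),
        "adr_exists": entry.adr_count > 0,
    }


def coverage_percent(entries: List[ServiceEntry], check: str) -> float:
    """Percentage of services passing one named check; 0.0 for an empty fleet."""
    if not entries:
        return 0.0
    passing = sum(1 for e in entries if score(e)[check])
    return 100.0 * passing / len(entries)
```

Running `coverage_percent(fleet, "runbook_fresh")` each month gives the docs freshness % for the KPI report, and the per-service dictionaries show teams exactly which box they are missing.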
None of this requires a platform team rewrite or a six-figure “enterprise knowledge” license. It’s commodity plumbing with fast ROI.
Governance, incentives, and the politics of documentation
Where I’ve seen this die: leadership calls docs a priority but won’t trade velocity for it. Here’s what actually works:
- Make ownership enforceable: `CODEOWNERS` gates merges on doc paths; catalog entries require an owner group.
- Incentivize docs like reliability: track doc freshness like an SLO. If a service’s runbook is >90 days stale, it’s an error budget event. Miss your doc SLO? Prioritize the fix before new feature work.
- Link to compensation indirectly: team scorecards include catalog coverage and doc SLO compliance. No individual heroics; team outcomes.
- Keep the catalog thin: resist Backstage plugin creep. A directory of owners and links beats a half-built platform.
- No doc priests: platform can own templates and automation. Product teams own content. Review via PRs like any code.
Culture hack that works: in incident command, after mitigation, the commander assigns the runbook and ADR update as tasks with owners. If it’s not in the postmortem, it won’t happen.
AI can help retrieval, but Git stays the source of truth
I like AI for retrieval and summarization. I don’t trust it to be the knowledge base.
- Index your Markdown with embeddings (e.g., `OpenSearch` + vector, or a hosted vector DB). Wire a Slack bot to answer “how do I rotate prod certs?” with links to `docs/runbook.md` and relevant ADRs.
- Add a `RAG` layer over docs and the service catalog. Cache answers; log misses to improve templates.
- Keep the contract: if AI answers something not backed by a URL in Git, it’s a hallucination. Treat it as a bug.
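The contract in the last bullet can be enforced mechanically before the bot posts anything. A rough sketch, assuming answers arrive as plain text and that "backed by a URL in Git" means "cites at least one link on a trusted Git host". The host allowlist and function names are hypothetical, and this only checks that a citation exists, not that the cited file actually supports the answer.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical allowlist; replace with your own Git hosts.
TRUSTED_GIT_HOSTS = {"github.com", "gitlab.example.com"}

# Deliberately simple URL matcher: stops at whitespace and common closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")


def cited_git_urls(answer: str) -> List[str]:
    """Return the URLs in `answer` whose host is on the trusted Git allowlist."""
    urls = [u.rstrip(".,;") for u in URL_PATTERN.findall(answer)]
    return [u for u in urls if urlparse(u).hostname in TRUSTED_GIT_HOSTS]


def is_backed_by_git(answer: str) -> bool:
    """True if the answer cites at least one trusted Git URL."""
    return bool(cited_git_urls(answer))
```

An answer that fails `is_backed_by_git` could be withheld and logged as a miss, which feeds the "log misses to improve templates" loop rather than letting an unsourced answer reach the channel.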
This gives you speed without inventing a second source of truth that will drift by Q3.
What to skip (ask me how I know)
- Bespoke wikis with complex permission models. You’ll spend quarters on governance and still have drift.
- All-in knowledge graphs before you have basics. Model the world later; first, make sure every service has an owner and a runbook.
- Mass migrations that “clean up everything.” Redirect with banners and let old content die naturally.
- Platformizing Backstage on day one. Start as a directory. Add templates and scorecards only after adoption.
- Docs-only OKRs. Tie them to MTTR, onboarding, and change failure rate. Business cares about outcomes, not page counts.
The TL;DR playbook you can run next sprint
- Put `docs/` in every repo with templates. Require updates via PR checklists.
- Stand up a basic service catalog; index the 20% of services that cause 80% of your pain.
- Automate freshness checks and doc generation from code.
- Capture decisions with 1-page ADRs reviewed like code.
- Treat doc freshness like an SLO and report it.
- Use AI to surface docs, not to invent them.
If you want help, GitPlumbers has paved-road starters you can fork: doc templates, CI checks, and a Backstage seed that’s hard to mess up.
Key takeaways
- Favor Git-native docs and a service catalog over bespoke wikis and knowledge graphs.
- Make docs part of the deployment pipeline and Definition of Done; otherwise they rot.
- Use ADRs and RFCs with templates to capture decisions and context when they happen.
- Assign ownership with CODEOWNERS and measure doc freshness like an SLO.
- Start small: standard repo docs, runbooks, and an internal map of services (Backstage or a thin equivalent).
- AI can help retrieve docs, but your source of truth must be boring Markdown in Git.
Implementation checklist
- Create a paved-road `docs/` structure in every repo with templates for README, runbook, ADR, and API.
- Stand up a service catalog (Backstage or minimal catalog) with ownership, links, and on-call metadata.
- Add `CODEOWNERS` to every repo; require doc updates in PR templates and CI checks.
- Generate docs from code where possible: `terraform-docs`, `openapi-generator`, `mkdocs` or `docusaurus`.
- Adopt ADRs (`0001-record-architecture-decisions.md`) with a one-page template and a number prefix.
- Automate freshness checks: flag runbooks older than 90 days; post Slack reminders via a bot.
- Bake docs into incident response: postmortems update runbooks, SLOs, and ADRs by default.
- Measure impact: onboarding time, MTTR, and “docs freshness” as a platform KPI.
Questions we hear from teams
- Why not just standardize on Confluence or Notion?
- They’re fine for PM specs and long-form narratives, but they live outside the flow of engineering work. Without CI checks, CODEOWNERS, and proximity to code changes, drift is inevitable. We keep wikis for narratives, but the source of truth for runbooks, ADRs, and API docs sits in Git with automation.
- Do we need Backstage?
- Not always. If you’re small, a thin internal directory (even a static site generated from repo metadata) works. At ~10+ teams, Backstage as a simple catalog pays off. Start with identity and catalog only; avoid plugin sprawl until adoption is real.
- How do we keep docs fresh without killing velocity?
- Automate freshness checks, keep docs short and local to code, and make updates part of PRs. The time you ‘lose’ writing the runbook is gained back many times during incidents and onboarding. Treat freshness like an SLO, not a suggestion.
- What about regulated environments?
- Git-native docs improve traceability. Every change is versioned, reviewed, and linked to tickets. You can snapshot docs at release tags for audits. We’ve shipped this in PCI and HIPAA shops by adding retention and approval workflows in GitHub/GitLab.
- Can AI write our docs?
- It can draft and summarize, but it shouldn’t be the source. Use AI to propose runbook steps, extract API descriptions from code comments, or answer Slack questions by pointing to Git URLs. Humans own accuracy; Git owns truth.