The On‑Call That Exposed Our Bus Factor: Shipping a Paved‑Road Knowledge System in 90 Days
Your tools aren’t the problem—your lack of a paved road for knowledge is. Here’s how to capture institutional expertise without building a bespoke Rube Goldberg machine.
If it isn’t in the repo, it isn’t real. If it isn’t searchable, it doesn’t exist.
The 2 a.m. page that exposed our bus factor
I’ve been on the on-call that goes sideways because the one person who knows the Kafka DLQ dance is asleep in Tokyo. We had Prometheus screaming about a payments-api error budget burn, Grafana red, Istio circuit breakers tripping, and a Terraform plan from last week that quietly moved a security group. We weren’t short on tools—we were short on a paved road that told the next human what to do.
The fix wasn’t a new wiki or another AI bot. It was simplifying where knowledge lives and making updates travel with code. When we did that, our MTTR dropped from “pray and page” to “follow the runbook and roll back.”
Tools don’t save you. Defaults do.
I’ve seen this fail: teams assemble a bespoke stack—Notion for PM notes, Confluence for architecture, Google Docs for RFCs, SharePoint for runbooks, a tribal Slack thread for “the real steps,” and a Backstage catalog nobody updates. Six months later, onboarding takes forever, and your SLOs quietly rot.
Here’s what actually works:
- Few blessed places, boring formats. Markdown in the repo for anything developer-facing. Use a docs site generator to publish. If you need collaboration, use PRs.
- Update knowledge in the same change as code. If you touch `k8s/` or `terraform/`, you touch `runbooks/` or `adr/`. One PR. One reviewer. No doc committee.
- Paved-road templates. ADRs under `adr/`, runbooks under `runbooks/`, service `README.md` with owner/SLO/links. No new formats unless you’re ready to maintain them.
- Discoverability where engineers live. Search across docs, Slack, and the service catalog. Shortlinks. Backstage annotations. Dashboards link to runbooks.
If it isn’t in the repo, it isn’t real. If it isn’t searchable, it doesn’t exist.
The paved-road knowledge stack we ship
You don’t need a platform team to build a CMS. Ship this minimal, durable setup and get out of the way.
- Docs-as-code with `mkdocs-material` (or `Docusaurus` if you already live in Node).
- ADRs with a one-page template (`adr-tools` works, but don’t overthink it).
- Runbooks with YAML front matter so tools can parse owner/SLO/links.
- Service catalog entries in `Backstage` that point at docs and runbooks.
- CI guardrails using `CODEOWNERS` and a tiny PR rule (Danger or Probot).
Example `mkdocs.yml`:

    site_name: Company Engineering Docs
    theme:
      name: material
    plugins:
      - search
      - mermaid2
    nav:
      - Overview: index.md
      - Runbooks: runbooks/index.md
      - ADRs:
          - Overview: adr/index.md
          # Nav paths are relative to the docs root, so keep the adr/ prefix.
          - adr/0001-record-architecture-decisions.md
    markdown_extensions:
      - admonition
      - codehilite
      - toc:
          permalink: true

Build on every PR and publish from main via GitHub Actions:

    # .github/workflows/docs.yml
    name: docs
    on:
      pull_request:
        paths:
          - 'docs/**'
          - 'adr/**'
          - 'runbooks/**'
          - '.github/workflows/docs.yml'
      push:
        branches: [ main ]
        paths:
          - 'docs/**'
          - 'adr/**'
          - 'runbooks/**'
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
          - run: pip install mkdocs-material mkdocs-mermaid2-plugin
          # --strict turns warnings (broken nav entries, bad links) into failures
          - run: mkdocs build --strict
      deploy:
        # Pull requests only build; publishing happens on main.
        if: github.ref == 'refs/heads/main'
        runs-on: ubuntu-latest
        needs: build
        permissions:
          contents: write  # lets GITHUB_TOKEN push to the gh-pages branch
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
          - run: pip install mkdocs-material mkdocs-mermaid2-plugin
          - run: mkdocs build
          - uses: peaceiris/actions-gh-pages@v3
            with:
              github_token: ${{ secrets.GITHUB_TOKEN }}
              publish_dir: ./site

ADR template (keep it boring):

    # ADR 0001: Choose MkDocs for Docs-as-Code
    Date: 2025-01-05
    Status: Accepted
    ## Context
    We need a durable, repo-native way to publish engineering docs.
    ## Decision
    Adopt MkDocs Material. Store content under `docs/`, ADRs under `adr/`.
    ## Consequences
    - Everyone writes Markdown. PR review workflow applies.
    - One publishing pipeline; searchable.
    - Sunsets Confluence pages for engineering runbooks.

Runbook template with YAML front matter for tooling:

    ---
    service: payments-api
    owner: '@payments-team'  # quoted: a plain YAML scalar cannot start with @
    slo: 99.9% monthly
    dashboard: https://grafana.example.com/d/payments
    prometheus_rule: payments-api-error-burn
    pagerduty: https://pagerduty.com/incidents/ABC123
    last_verified: 2025-10-01
    ---
    # Payments API Runbook
    ## Symptoms
    High 5xx rate, `Istio` 503s, error budget burn alert.
    ## Immediate Actions
    1. Check Grafana dashboard.
    2. If the `prometheus_rule` alert is firing and the deploy is less than 30 minutes old, execute a canary rollback.
    3. If DB saturation > 80%, enable circuit breaker (`maxRequests=50`).
    ## Rollback
    - ArgoCD: revert to previous healthy `Application` sync.
    - Terraform: check out the last-known-good commit, run `terraform plan`, then `terraform apply`.
    ## Deep Dive
    Links to SLO, logs, and recent ADRs.

Backstage service catalog entry ties it together:

    # catalog-info.yaml
    apiVersion: backstage.io/v1alpha1
    kind: Component
    metadata:
      name: payments-api
      annotations:
        backstage.io/techdocs-ref: dir:.
        github.com/project-slug: org/payments-api
        runbook/url: https://docs.example.com/runbooks/payments-api
    spec:
      type: service
      owner: payments-team
      lifecycle: production

Before/After: 90 days, measurable deltas
We ran this playbook at a fintech that had great engineers and a graveyard of Confluence pages. The baseline:
- Onboarding for backend engineers: ~45 days to first production on-call.
- MTTR on Tier-1 incidents: 2.1 hours median.
- Docs PRs per week: 2 (stale, scattered).
After 90 days of paved-road defaults:
- Onboarding to on-call: 18 days. New hires shipped safely with runbooks.
- MTTR: 54 minutes median. SREs followed runbooks linked from alerts.
- Docs PRs per week: 28. Most PRs coupled with code changes.
- Bonus: Alert fatigue down ~30% after runbooks clarified noisy thresholds and we fixed them.
No new committees. No “docs sprints.” Just defaults plus guardrails.
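To put those deltas on one scale, here is a quick arithmetic sketch using only the figures quoted above. The helper name `pct_reduction` is ours, not part of any tool.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop going from `before` to `after`."""
    return (before - after) / before * 100


# Figures taken from the before/after lists above.
onboarding = pct_reduction(45, 18)       # days to first production on-call
mttr = pct_reduction(2.1 * 60, 54)       # median minutes (2.1 hours = 126 minutes)
docs_pr_multiple = 28 / 2                # docs PRs per week, after vs. before

print(f"Onboarding time: {onboarding:.0f}% shorter")
print(f"MTTR: {mttr:.0f}% lower")
print(f"Docs PRs per week: {docs_pr_multiple:.0f}x")
```

That works out to roughly a 60% cut in onboarding time, a 57% cut in median MTTR, and a 14x increase in docs PRs per week.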
Guardrails, not gatekeepers
I’ve seen teams stall out by inventing a docs PMO. Don’t. Use code-native nudges.
Add CODEOWNERS:

    # CODEOWNERS
    # On GitHub, teams are written as @org/team-name; the handles below are placeholders.
    docs/** @platform-team
    adr/** @arch-reviewers
    runbooks/** @sre-oncall

Fail PRs that change infra without touching runbooks (or adding a rationale):

    // dangerfile.ts
    import { danger, fail } from 'danger'

    // Count new files too, so a brand-new runbook satisfies the rule.
    const changed = [...danger.git.modified_files, ...danger.git.created_files]
    const infraTouched = changed.some(f => f.startsWith('k8s/') || f.startsWith('terraform/') || f.endsWith('.tf'))
    const runbookUpdated = changed.some(f => f.startsWith('runbooks/'))

    if (infraTouched && !runbookUpdated) {
      fail('Infra change without a runbook update. Link a runbook or explain why not.')
    }

Nightly link checker so the on-call doesn’t hit 404s:
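
    # simple link check
    # lychee is a Rust binary, not a pip package: install it with cargo or your package manager.
    cargo install lychee
    # Quote the globs so lychee expands them; recent releases skip mail addresses by default.
    lychee --accept 200,301 'docs/**/*.md' 'runbooks/**/*.md'

Governance rule of thumb: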
- Block only on safety. Broken links, missing owners, missing rollback.
- Warn on completeness. “Add Grafana link” is a comment, not a blocker.
- Automate the nags. Humans focus on content.
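The block-versus-warn split above can be sketched as a small checker that runs in CI. This is a minimal illustration, not a finished tool: it assumes flat `key: value` front matter like the runbook template, and the choice of which keys and headings are blocking (`REQUIRED_KEYS`, `REQUIRED_SECTIONS`, `RECOMMENDED_KEYS`) is our assumption. Adjust it to your own safety bar.

```python
import re

# Assumed policy: block on safety fields, warn on completeness fields.
REQUIRED_KEYS = {"service", "owner"}
REQUIRED_SECTIONS = {"Rollback"}
RECOMMENDED_KEYS = {"dashboard", "slo"}


def parse_front_matter(text: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            meta[key.strip()] = value.strip().strip("'\"")
    return meta


def check_runbook(text: str) -> tuple[list[str], list[str]]:
    """Return (blocking_errors, warnings) for one runbook's Markdown text."""
    meta = parse_front_matter(text)
    headings = set(re.findall(r"^##\s+(.+?)\s*$", text, flags=re.MULTILINE))
    errors = [f"missing front matter key: {k}"
              for k in sorted(REQUIRED_KEYS) if not meta.get(k)]
    errors += [f"missing section: {s}"
               for s in sorted(REQUIRED_SECTIONS - headings)]
    warnings = [f"consider adding: {k}"
                for k in sorted(RECOMMENDED_KEYS) if not meta.get(k)]
    return errors, warnings


SAMPLE = """---
service: payments-api
owner: '@payments-team'
---
# Payments API Runbook
## Symptoms
"""

errors, warnings = check_runbook(SAMPLE)
print("block:", errors)
print("warn:", warnings)
```

In CI you would fail the job only when `errors` is non-empty and post `warnings` as a PR comment. Real front matter with nested values would need a proper YAML parser rather than this line splitter.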
Make it discoverable where engineers live
If people can’t find the answer in 30 seconds during an incident, it doesn’t matter that the doc exists.
- Search. Use Algolia DocSearch (free for open source; paid otherwise) or your internal equivalent.
- Slack `/kb`. A tiny slash command that returns a docs search link beats 200 pinned threads.
- Backstage. Make the catalog the front door: owner, links, runbook, SLO.
- Dashboards link to runbooks. Add runbook URLs to Grafana panel descriptions and PrometheusRule annotations.
Slack command microservice:
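
    // Node + Bolt example
    const { App } = require('@slack/bolt')

    const app = new App({
      token: process.env.SLACK_BOT_TOKEN,
      signingSecret: process.env.SLACK_SIGNING_SECRET,
    })

    app.command('/kb', async ({ command, ack, respond }) => {
      await ack() // acknowledge right away; Slack expects a reply within 3 seconds
      const q = encodeURIComponent(command.text)
      const url = `https://docs.example.com/search/?q=${q}`
      await respond(`Search results for "${command.text}": ${url}`)
    })

    app.start(process.env.PORT || 3000)

Prometheus alert annotation that points to the runbook: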

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: payments-slos
    spec:
      groups:
        - name: payments.slo
          rules:
            - alert: PaymentsErrorBudgetBurn
              expr: sum(rate(http_requests_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
              annotations:
                summary: High error rate in payments-api
                runbook_url: https://docs.example.com/runbooks/payments-api

Keep it alive when people leave
Institutional knowledge decays. Make updates part of the work, not a quarterly chore.
- After every incident: no closeout until there’s a runbook PR merged. MTTR improves next time.
- ADR discipline: material design/code changes require an ADR (or an update). Keeps “why” searchable.
- Rotating docs duty: 2 hours per week on each team to fix drift found by on-call.
- AI is a drafting assistant, not a source of truth: use AI to summarize Slack threads into a PR, but gate with owner review. Avoid AI hallucination by anchoring on linked sources.
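One way to feed the rotating docs duty is to flag runbooks whose `last_verified` front matter date has aged out. The sketch below assumes you have already collected those dates into a mapping of path to ISO date string. The 90-day threshold and the `runbooks/ledger.md` entry are hypothetical.

```python
from datetime import date, timedelta


def stale_runbooks(last_verified: dict[str, str], today: date,
                   max_age_days: int = 90) -> list[str]:
    """Return runbook paths not verified within `max_age_days` of `today`."""
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for path, stamp in last_verified.items():
        try:
            verified = date.fromisoformat(stamp)
        except ValueError:
            stale.append(path)  # a missing or malformed date counts as stale
            continue
        if verified < cutoff:
            stale.append(path)
    return sorted(stale)


verified_dates = {
    "runbooks/payments-api.md": "2025-10-01",
    "runbooks/ledger.md": "2025-03-15",  # hypothetical second service
}
needs_review = stale_runbooks(verified_dates, today=date(2025, 11, 1))
print(needs_review)
```

Run nightly, the output becomes the week's docs-duty queue instead of a quarterly audit.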
Lightweight summarizer into a PR comment (example idea, keep secrets safe):

    # .github/workflows/summarize-discussion.yml
    name: summarize-rfc
    on:
      issue_comment:
        types: [created]
    jobs:
      summarize:
        if: contains(github.event.comment.body, '/summarize')
        runs-on: ubuntu-latest
        permissions:
          contents: read
          issues: write
          pull-requests: write
        steps:
          - uses: actions/checkout@v4
          - name: Summarize
            # scripts/summarize.py is expected to write summary.md
            run: python scripts/summarize.py "${GITHUB_EVENT_PATH}"
          - name: Post comment
            env:
              GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # the gh CLI reads its token from GH_TOKEN
              NUMBER: ${{ github.event.issue.number }}
            run: gh issue comment "$NUMBER" --body-file summary.md

If you’re swimming in AI-generated code or “vibe coding” artifacts, be explicit: “No auto-generated docs to main without owner sign-off.” We’ve done vibe code cleanup engagements where 30% of runbooks referenced non-existent dashboards. Fixing that paid for itself in the next incident.
What to do Monday
- Create `docs/`, `adr/`, and `runbooks/` in the repo. Add the templates above.
- Add `mkdocs.yml` and the GitHub Actions workflow. Publish a first cut.
- Migrate one Tier‑1 service: service `README.md`, runbook, and an ADR for one decision.
- Wire the alert annotation to the runbook, and add the runbook link to Grafana.
- Add `CODEOWNERS` and the Danger rule. Make it a soft fail for week one.
- Add a Backstage `catalog-info.yaml` entry for that service.
- Pick three metrics: docs PRs/week, onboarding days, and MTTR. Review in 30 days.
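For the docs PRs/week metric, a rough counter is enough to start. The sketch below assumes you can export merged PRs as `(merge_date, changed_paths)` pairs from your Git host. The sample data and the `DOC_PREFIXES` tuple are illustrative, mirroring the folder layout above.

```python
from collections import Counter
from datetime import date

# Folders that count as "docs" under the layout described above.
DOC_PREFIXES = ("docs/", "adr/", "runbooks/")


def docs_prs_per_week(prs: list[tuple[date, list[str]]]) -> dict[str, int]:
    """Count merged PRs per ISO week that touched at least one docs path."""
    counts: Counter[str] = Counter()
    for merged_on, paths in prs:
        if any(p.startswith(DOC_PREFIXES) for p in paths):
            iso = merged_on.isocalendar()  # (ISO year, ISO week, weekday)
            counts[f"{iso[0]}-W{iso[1]:02d}"] += 1
    return dict(counts)


# Hypothetical export of merged PRs.
merged = [
    (date(2025, 1, 6), ["k8s/deploy.yaml", "runbooks/payments-api.md"]),
    (date(2025, 1, 8), ["src/app.py"]),
    (date(2025, 1, 9), ["adr/0002-example.md"]),
    (date(2025, 1, 14), ["docs/index.md"]),
]
weekly = docs_prs_per_week(merged)
print(weekly)
```

A PR that changes both code and a runbook counts once, which matches the goal of measuring docs changes that travel with code.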
If you need a hand paving the road and cleaning up the “docs-ish” sprawl, this is exactly what GitPlumbers does without a six‑month platform rewrite. We’ll leave you with boring defaults that survive reorgs and the next hype cycle.
Key takeaways
- Paved-road defaults beat bespoke wikis—put Markdown in the repo, not in five SaaS tools.
- Make knowledge updateable in the same PR as code; enforce with lightweight checks, not committees.
- Standardize on ADRs, runbooks, and decision logs—few formats, many contributors.
- Optimize for search and on-call ergonomics: link runbooks to alerts, dashboards, and owners.
- Measure the delta: onboarding time, MTTR, and docs PR velocity should improve within a quarter.
Implementation checklist
- Create `docs/`, `runbooks/`, and `adr/` in the repo (or monorepo) with clear owners.
- Adopt a docs site generator (`mkdocs-material`) and publish via GitHub Actions.
- Introduce ADRs for material decisions; one-page, timestamped, link from code.
- Write one gold-standard runbook per Tier-1 service; link from alerts and dashboards.
- Add CODEOWNERS and a lightweight PR rule that nudges doc updates with code changes.
- Wire discovery: Backstage catalog entries, Slack `/kb` search, and Algolia DocSearch.
- Close the loop post-incident: require a merged runbook PR before the incident can be closed.
- Track metrics: doc PRs/week, onboarding days, and MTTR—review monthly.
Questions we hear from teams
- Why not just use Confluence/Notion?
- They’re fine for product specs and general notes. But engineering knowledge decays unless it travels with code. Docs-as-code keeps changes reviewable, versioned, and enforced via CI. Use Confluence/Notion for planning; keep runbooks/ADRs in the repo.
- What if we already invested in Docusaurus or another site generator?
- Keep it. The critical piece is paved-road conventions: folders, templates, owners, CI rules, and discoverability. Swapping MkDocs for Docusaurus won’t change outcomes if you keep the same defaults.
- How do we stop AI hallucinations from polluting docs?
- Treat AI as a drafting tool. Require owner approval for AI-suggested changes, anchor summaries to cited sources (links to ADRs, runbooks), and block merges that add facts without references. Measure drift by link checks and on-call verifications.
- Is Backstage required?
- No. It’s a nice front door, but you can start with a simple `services.md` index and shortlinks. The win comes from linking service → owner → runbook → dashboard, not from the UI you pick.
- What business impact should we expect?
- Within a quarter: faster onboarding (30–60% improvement), MTTR down 25–60%, fewer pager escalations, and less attrition from tribal-knowledge fatigue. The cost is a week of setup and ongoing light-touch hygiene.
