The Day Your Staff Engineer Walks: A Paved‑Road Knowledge System That Keeps Shipping

Stop treating knowledge as a side project. Make it a first-class, versioned dependency that survives reorgs, layoffs, and hero departures.

Documentation that isn’t versioned, owned, and built is just a well-decorated memory leak.

When the pager goes off and the only person who knew leaves

Two summers ago, a fintech client called us at 2 a.m. Payments were stuck in a partial rollback because a Kafka consumer was replaying poison messages. The only person who truly understood the dedupe logic was on parental leave. Confluence had a page from 2019 with a broken diagram. Slack had lore spread across 14 channels. Notion had the PM’s version of the truth. We burned six figures in downtime before someone found the right aws command in a personal gist.

I’ve seen this movie a dozen times. Knowledge isn’t lost in a catastrophe—it seeps away in the cracks of tools, reorgs, and “I’ll document it later”. If you want to preserve institutional expertise, build a paved road where documentation is a runtime dependency, not an afterthought.

Why this matters: knowledge rots by default

Here’s the cost curve I’ve seen repeatedly:

  • Onboarding lead time: 45–60 days with tool sprawl and hero handoffs; 10–20 days when you centralize and version knowledge.
  • MTTR: increases 2–3x when runbooks live in stale wikis; drops 30–50% when runbooks are tested, versioned, and tied to ownership.
  • Innovation tax: every “what does this do?” drains senior bandwidth; one outage can blow a quarterly OKR.
  • AI side note: If you want future RAG/assistants that don’t hallucinate, you need a clean, versioned corpus. Slack history isn’t it.

The pattern behind the pain:

  • Tool sprawl: Confluence + Notion + SharePoint + GDocs + tribal Slack.
  • No owners: Docs have authors but no maintainers; no CODEOWNERS.
  • No CI: Docs don’t build; links rot; diagrams die.
  • Bespoke portals: months of yak-shaving on Backstage plugins without content.

The fix isn’t a new platform. It’s consolidation, ownership, and automation using boring tools that scale.

The paved road: simple, boring, effective

Favor defaults you can roll out in a week:

  • Docs-as-code: Markdown in docs/ folders next to code. Versioned, reviewed, greppable.
  • Org handbook: a single handbook repo for cross-cutting knowledge (on-call, environments, deploy, security, SDLC).
  • Static site: MkDocs Material (v9.x) + GitHub Pages or S3/CloudFront. Fast, searchable, zero bespoke.
  • ADRs: short adr/ entries for decisions; stop rewriting history every six months.
  • Runbooks: runbooks/ with step-by-step commands and rollbacks; treat them like code.
  • Ownership metadata: CODEOWNERS + minimal catalog-info.yaml for owner, tier, and PagerDuty service.
  • Light linting: markdownlint, link-check, mkdocs build --strict in CI.
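
To make "link check" concrete, here is a minimal sketch of what such a check does, assuming only relative Markdown links need verifying. The function name `broken_relative_links` and the choice to resolve root-relative links against the docs directory are assumptions for illustration; in practice you would likely reach for an off-the-shelf link checker.

```python
import re
from pathlib import Path

# Matches inline Markdown links [text](target); image links match too, which is fine here.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)")
# Anything with a URL scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(docs_dir):
    """Return (file, target) pairs whose relative link target does not exist on disk."""
    root = Path(docs_dir)
    broken = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK_RE.findall(md.read_text(encoding="utf-8")):
            # Skip external URLs and same-page anchors.
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue
            path = target.split("#", 1)[0]
            # Assumption: root-relative links are resolved against docs_dir.
            base = root if path.startswith("/") else md.parent
            if not (base / path.lstrip("/")).exists():
                broken.append((md.relative_to(root).as_posix(), target))
    return broken
```

A CI step could call `broken_relative_links("docs")` and exit non-zero when the list is not empty. External URLs are skipped on purpose so the check stays fast and offline.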

Minimal MkDocs config

site_name: Acme Engineering Handbook
theme:
  name: material
  features:
    - navigation.instant
    - content.code.copy
plugins:
  - search
  - mermaid2
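markdown_extensions:
  - admonition
  - codehilite
  - toc
  - pymdownx.superfences:
      # Without a custom fence, superfences renders mermaid blocks as plain code
      custom_fences:
        - name: mermaid
          class: mermaid
          format: !!python/name:mermaid2.fence_mermaid_custom
  - pymdownx.details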
nav:
  - Home: index.md
  - On-call: oncall/index.md
  - Runbooks:
      - Payments: runbooks/payments.md
      - Kafka: runbooks/kafka.md
  - Architecture: architecture/index.md

Publish on push

# .github/workflows/docs.yml
name: docs
on:
  push:
    branches: [main]
    paths: ["docs/**", "mkdocs.yml"]
jobs:
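  build:
    runs-on: ubuntu-latest
    # GITHUB_TOKEN needs write access to push the built site to the gh-pages branch
    permissions:
      contents: write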
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install mkdocs-material mkdocs-mermaid2-plugin
      - run: mkdocs build --strict
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site

Own the docs

# CODEOWNERS
# Teams must be written as @org/team-name; "acme" is a placeholder org
/docs/** @acme/platform-team
/runbooks/** @acme/sre-oncall
/adr/** @acme/architecture-guild

Minimal ownership metadata (Backstage optional)

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    pagerduty.com/service-id: PD12345
spec:
  type: service
  owner: team-payments
  lifecycle: production
  system: money

If you’re allergic to Backstage, use a simple owners.yaml and a script to enforce presence and format in CI. The point is ownership, not the brand.
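
As a rough sketch of that script, assume a flat `owners.yaml` with three keys: `owner`, `tier`, and `pagerduty`. That schema, the `team-<name>` naming rule, and every function name below are hypothetical. The parser handles only top-level `key: value` lines, so swap in a real YAML library if your file grows nested structure.

```python
import re
from pathlib import Path

REQUIRED_KEYS = ("owner", "tier", "pagerduty")  # hypothetical schema


def parse_flat_yaml(text):
    """Parse top-level `key: value` lines. Enough for a flat owners.yaml, not general YAML."""
    data = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        data[key.strip()] = value.strip().strip("'\"")
    return data


def validate_owners(text):
    """Return a list of human-readable problems; an empty list means the file passes."""
    data = parse_flat_yaml(text)
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if not data.get(k)]
    if data.get("owner") and not re.fullmatch(r"team-[a-z0-9-]+", data["owner"]):
        problems.append(f"owner should look like team-<name>: {data['owner']}")
    if data.get("tier") and data["tier"] not in {"1", "2", "3"}:
        problems.append(f"tier must be 1, 2, or 3: {data['tier']}")
    return problems


def check_repo(repo_root):
    """Enforce presence first, then format."""
    path = Path(repo_root) / "owners.yaml"
    if not path.is_file():
        return ["owners.yaml not found"]
    return validate_owners(path.read_text(encoding="utf-8"))
```

In CI, print whatever `check_repo(".")` returns and fail the job if the list is not empty.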

Templates that survive 3 a.m. incidents

Give teams paved-road templates. No blank pages.

ADR template

# ADR 0001: Choose Postgres over MySQL
Date: 2025-05-10
Status: Accepted
Deciders: Jane D, Omar K
Context:
- Need strong consistency for ledger. Team expertise is Postgres.
Decision:
- Use `PostgreSQL 15` with `pg_partman` for partitioning.
Consequences:
- + Strong constraints / indexing
- - Need to manage vacuum and bloat
Links:
- RFC: /docs/rfcs/ledger-db.md
- Alternatives: ADR-0000
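
If ADRs follow the template above, an ADR index can be generated rather than hand-maintained. This is a sketch under assumptions: one ADR per `.md` file in `adr/`, a first heading of the form `# ADR NNNN: Title`, and a `Status:` line. The function name `build_adr_index` is made up for illustration.

```python
import re
from pathlib import Path

TITLE_RE = re.compile(r"^#\s*ADR\s+(\d+):\s*(.+)$", re.MULTILINE)
STATUS_RE = re.compile(r"^Status:\s*(.+)$", re.MULTILINE)


def build_adr_index(adr_dir):
    """Return Markdown list lines, one per ADR file, ordered by ADR number."""
    entries = []
    for path in Path(adr_dir).glob("*.md"):
        text = path.read_text(encoding="utf-8")
        title = TITLE_RE.search(text)
        if title is None:
            continue  # not an ADR (for example, the index page itself)
        status = STATUS_RE.search(text)
        entries.append((
            int(title.group(1)),
            title.group(2).strip(),
            status.group(1).strip() if status else "Unknown",
            path.name,
        ))
    return [
        f"- [ADR {num:04d}: {name}]({fname}) ({state})"
        for num, name, state, fname in sorted(entries)
    ]
```

Writing the returned lines into an index page during the docs build keeps the list from drifting out of sync with the decisions themselves.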

Runbook template

# Runbook: Payments Backlog Drain
Owner: @team-payments | PagerDuty: PD12345 | Tier: 1
Last Verified: 2025-06-01 (CI link)
SLOs: drain backlog < 30 min; error rate < 0.1%

1. Verify impact
   - `kubectl -n payments logs deploy/consumer | tail -n 100`
   - `kafka-consumer-groups.sh --bootstrap-server $BS --describe --group payments`
2. Toggle circuit breaker
   - `kubectl -n payments scale deploy/consumer --replicas=0`
3. Identify poison messages
   - `aws s3 cp s3://payments-dlq/latest ./dlq.json && jq '.[] | select(.error != null)' dlq.json`
4. Patch and canary
   - `kubectl -n payments set image deploy/consumer consumer=ghcr.io/acme/consumer:${SHA}`
   - Verify with 5% canary for 10 minutes (Argo Rollouts)
5. Rollback
   - `kubectl -n payments rollout undo deploy/consumer`
6. Post-incident
   - Link to ADR/issue for root cause

Diagrams that actually render

sequenceDiagram
  participant Client
  participant API
  participant Kafka
  participant Consumer
  Client->>API: POST /payments
  API->>Kafka: publish event
  Kafka->>Consumer: deliver event
  Consumer-->>API: status update

Local docs preview

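.PHONY: docs

# .docs-env is a stamp file, so dependencies install once instead of on every run
.docs-env:
	pip install mkdocs-material mkdocs-mermaid2-plugin
	touch $@

# Serve the site locally with live reload on port 8000
docs: .docs-env
	mkdocs serve -a 0.0.0.0:8000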

No fancy portal. Just Markdown, version control, and CI keeping you honest.

One-week rollout that actually sticks

You don’t need a task force. You need a short, boring project with visible wins.

  1. Day 1: Create handbook repo; add mkdocs.yml, index.md, and CI. Link it in the Slack sidebar and in the GitHub org description.
  2. Day 2: Add templates: ADR, runbook, service README. Create CODEOWNERS. Publish a doc SLO and definition of done for services.
  3. Day 3–4: Pilot two services: payments and auth. Move Confluence/Notion pages that still matter; delete the rest. Add catalog-info.yaml.
  4. Day 5: Wire linting: markdownlint, link check, mkdocs build --strict. Fail PRs missing required pages.
  5. Day 6: On-call drill: pick a past incident and run it with only the new runbooks. Fix gaps same day.
  6. Day 7: Announce paved road; freeze new tools. If it’s not Markdown in docs/ or the handbook, it doesn’t exist.
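
The Day 5 gate ("fail PRs missing required pages") can be a few lines of Python. The sketch below assumes a repo layout that is only one possible reading of the paved road: `README.md` at the root, at least one Markdown file in each of `runbooks/` and `adr/`, and either `catalog-info.yaml` or `owners.yaml`. The function name is hypothetical.

```python
from pathlib import Path


def missing_required_docs(repo_root):
    """Return the names of required documentation artifacts absent from a service repo."""
    root = Path(repo_root)
    checks = {
        "README.md": (root / "README.md").is_file(),
        "runbooks/*.md": any((root / "runbooks").glob("*.md")),
        "adr/*.md": any((root / "adr").glob("*.md")),
        "catalog-info.yaml or owners.yaml": (
            (root / "catalog-info.yaml").is_file() or (root / "owners.yaml").is_file()
        ),
    }
    # Dicts preserve insertion order, so the report is stable between runs.
    return [name for name, present in checks.items() if not present]
```

Run it as a PR check and fail when the returned list is not empty. It verifies existence only; whether the runbook is any good is what the Day 6 drill is for.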

Guardrails:

  • Require at least one owner review on docs/** changes.
  • Add a weekly reminder to verify “Last Verified” dates in runbooks; 30-day SLA.
  • Track KPIs: onboarding time to first PR, time-to-diagnosis in incidents, and doc PR volume.

Before/after: the fintech that ditched Notion sprawl

Before we showed up:

  • 4 knowledge tools; none authoritative. No CODEOWNERS. Backstage proof-of-concept gathering dust.
  • Onboarding: median 47 days to first meaningful PR across platform. MTTR ~2.8 hours on P2s.
  • “Ask Alice” culture. Alice left.

After 5 weeks of paved-road rollout:

  • Consolidated to docs/ + handbook; Confluence/Notion archived. 180 pages migrated, 60% deleted.
  • Every service had README, runbook, ADR index, catalog-info.yaml, and CODEOWNERS.
  • GitHub Actions built and published docs on every merge; broken links went red in CI.
  • Onboarding: 18 days median to first PR; MTTR down to 1.6 hours; on-call satisfaction from 2.1 to 4.0/5 in retro surveys.

Costs vs benefits:

  • Cost: ~2 engineer-weeks to bootstrap; ongoing ~2 hours/week per team to maintain.
  • Benefit: One avoided incident paid for the program. Fewer interrupts; seniors got back 20–30% focus time.

Not sexy. Highly effective.

Keep it alive: incentives and automation

You don’t need doc martyrs. You need incentives, defaults, and a little shame from CI.

  • Bake into SDLC: “No runbook, no prod.” Rely on mkdocs build --strict and a simple PR check to verify required pages.
  • Make it visible: Pin the handbook link in Slack; add it to the GitHub org header; include in onboarding tickets.
  • Rotate stewardship: quarterly doc champions per team; 10% time, measured.
  • Auto-stale detection: a CI job that fails when Last Verified is more than 30 days old on a Tier 1 runbook. For example:
      # naive check for Last Verified older than 30 days
      awk '/Last Verified:/ {print FILENAME, $0}' $(git ls-files "runbooks/*.md") | \
        python .ci/check_last_verified.py
  • Close the loop: make every postmortem require a runbook PR.
  • Don’t fetishize tools: If someone suggests a bespoke portal, ask what it will do that Markdown + CI + ownership doesn’t.
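
The post never shows `.ci/check_last_verified.py`, so here is one way it might look. It assumes the `awk` output format above (filename, then the `Last Verified: YYYY-MM-DD` line) and filenames without spaces. The function names and the choice to treat unparseable lines as stale are assumptions.

```python
import re
import sys
from datetime import date, timedelta

# Expected input line: "<filename> Last Verified: YYYY-MM-DD (anything)"
LINE_RE = re.compile(r"^(\S+)\s+.*Last Verified:\s*(\d{4}-\d{2}-\d{2})")
MAX_AGE = timedelta(days=30)


def stale_runbooks(lines, today):
    """Return (name, verified_date) pairs older than MAX_AGE; unparseable lines count as stale."""
    stale = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            stale.append((line, None))
            continue
        try:
            verified = date.fromisoformat(match.group(2))
        except ValueError:  # looks like a date but is not one, e.g. 2025-13-45
            stale.append((match.group(1), None))
            continue
        if today - verified > MAX_AGE:
            stale.append((match.group(1), verified))
    return stale


def main():
    """Read awk output from stdin and return a process exit code."""
    stale = stale_runbooks(sys.stdin, date.today())
    for name, verified in stale:
        print(f"STALE: {name} (last verified: {verified})")
    return 1 if stale else 0
```

To use it as the script in the pipeline above, finish the file with `sys.exit(main())`. Note that this sketch checks every line it receives; restricting it to Tier 1 would mean reading the `Tier:` field from each runbook as well.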

What not to do (I’ve seen this fail)

  • Bespoke knowledge graphs. You’ll spend quarters building entity models and still have stale content.
  • “We’ll centralize in Backstage first.” Backstage is great as a catalog once you have content. It won’t write your docs.
  • “The wiki is fine.” If it isn’t versioned, reviewed, and built, it will rot. Every. Single. Time.
  • AI-first help desks without a clean corpus. Garbage in, hallucinations out.
  • Policing tone and style before existence. First make it exist, then make it good.

Start boring. You can layer smarter search, embeddings, and catalogs later. Paved roads first; fast lanes after.

Key takeaways

  • Knowledge evaporates by default; treat it like code with ownership, reviews, and CI.
  • Favor paved-road defaults: Markdown, MkDocs Material, GitHub Actions, CODEOWNERS, ADRs, and runbooks.
  • Consolidate to two homes: a repo-level `docs/` and an org-level `handbook` repo—kill tool sprawl.
  • Bake docs into delivery: PR checks, doc coverage gates, and on-call runbook SLOs.
  • A one-week rollout is realistic; bespoke portals take quarters and still rot.

Implementation checklist

  • Create an org-level `handbook` repo with MkDocs Material and GitHub Pages
  • Add `docs/`, `runbooks/`, and `adr/` to each service repo with templates
  • Define `CODEOWNERS` for docs and runbooks; integrate review in PRs
  • Publish a minimal `catalog-info.yaml` (or owners.yaml) per service
  • Wire a GitHub Actions workflow to build/publish docs on `main`
  • Set a doc SLO: every service has README, runbook, ADR index, and owner
  • Add linting: `markdownlint`, link check, and `mkdocs build --strict`
  • Measure: onboarding lead time, MTTR, and time-to-first-PR for new devs

Questions we hear from teams

Why not use Docusaurus or a full Backstage portal?
You can, but they’re heavier. MkDocs Material + GitHub Pages is enough to prove the model in a week. Once content and ownership are healthy, you can migrate to Docusaurus or feed Backstage. Content > portal.

How do we keep runbooks from going stale?
Add a “Last Verified” header, a 30-day SLA, and a CI check that fails PRs if the date is too old for Tier 1 services. Tie postmortems to runbook updates.

Who owns the org handbook?
Platform or DevEx owns the repo and CI. Each domain section has CODEOWNERS for review. The goal is distributed ownership with a paved road, not central gatekeeping.

What about sensitive content (secrets, IP)?
Docs live with code, so use the same controls: repo permissions, secret scanners, and no secrets-in-docs. For truly sensitive runbooks, keep them in private repos and link from the handbook with access controls.

Can we plug this into AI assistants later?
Yes. This is the right corpus: clean Markdown with ownership and recency metadata. You can index it with embeddings, add `last_verified` frontmatter, and build RAG with confidence. Don’t start with AI; earn it.
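
To show what "ownership and recency metadata" buys you at indexing time, here is a sketch that turns Markdown files into records carrying `last_verified` alongside the text. It assumes flat `key: value` frontmatter between `---` lines. The `owner` key and both function names are hypothetical, and the embedding or indexing step is deliberately left out.

```python
from pathlib import Path


def split_frontmatter(text):
    """Split a Markdown document into (metadata dict, body). Flat `key: value` frontmatter only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # unterminated frontmatter: treat everything as body
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def corpus_records(docs_dir):
    """Yield one record per Markdown file, ready to hand to whatever indexer you choose."""
    root = Path(docs_dir)
    for path in sorted(root.rglob("*.md")):
        meta, body = split_frontmatter(path.read_text(encoding="utf-8"))
        yield {
            "path": path.relative_to(root).as_posix(),
            "owner": meta.get("owner"),  # hypothetical frontmatter key
            "last_verified": meta.get("last_verified"),
            "text": body,
        }
```

With records shaped like this, a retrieval layer can filter or down-rank anything whose `last_verified` is missing or old before it reaches the model.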
