# The Day the Last Staff Engineer Quit: Rebuilding Institutional Knowledge Without Buying Yet Another Wiki
Institutional expertise leaks out through Slack, tribal memory, and “that one person.” Here’s the paved-road system that keeps knowledge shippable—even when the original authors are long gone.
## The institutional knowledge leak is real (and it’s usually invisible until it isn’t)
I’ve watched this movie across banks, SaaS companies, and “we move fast” startups: everything is fine until the one person who knows the payment flow, the Kafka topology, or the Kubernetes quirks gets promoted, burns out, or leaves. Then you’re paying interest on missing context.
The failure mode isn’t “we lack documentation.” It’s:
- The “docs” are 400 Slack threads, two Confluence spaces, and a Notion page last edited in 2022.
- On-call runbooks are aspirational (“restart the pod”) instead of runnable.
- Decisions get made in meetings, not recorded, so teams repeat the same arguments every quarter.
- AI-assisted changes (“vibe-coded” diffs) land without the why, and six weeks later no one can explain the behavior.
The cost shows up in engineering KPIs leaders actually care about:
- Longer onboarding (weeks to months) because new hires can’t build a mental model.
- Higher MTTR because incidents turn into archaeology.
- More change failures because “unknown unknowns” aren’t written down.
A knowledge sharing system is just an availability system for context. Treat it like production: it needs ownership, versioning, and feedback loops.
## Stop buying “a knowledge platform.” Build a paved road with boring defaults
I’ve seen plenty of teams buy an “all-in-one” knowledge tool and still fail because the habit doesn’t stick. The simplest thing that works for most orgs:
- Docs live with code (Markdown in the same repo as the service)
- A single build system to publish a portal (MkDocs/Docusaurus)
- Ownership + review enforced via `CODEOWNERS`
- A few templates (ADRs, runbooks, postmortems) to remove bikeshedding
Here’s the cost/benefit trade-off I usually explain to leadership:
- Bespoke tool (custom portal, custom taxonomy, custom search)
  - Benefit: tailored UX, fancy integrations
  - Cost: you just created another product to maintain; adoption dies the first time it’s flaky
- Paved-road docs-as-code (Markdown + CI + portal)
  - Benefit: same workflow as code, reviewable, diffable, easy to keep current
  - Cost: less “pretty”; requires minimal discipline (templates + checks solve most of it)
A concrete before/after from a GitPlumbers engagement (mid-size B2B SaaS, ~60 engineers):
- Before: onboarding to first meaningful PR averaged ~12 business days; Sev1 MTTR ~90 minutes; runbooks in Confluence with dead links
- After (6 weeks): onboarding to first PR ~5 days; Sev1 MTTR ~45 minutes; 80% of Sev1s referenced a repo runbook; fewer “tribal knowledge escalations” during incidents
Not magic—just context shipped the same way code ships.
## The minimum viable knowledge system: one folder, one portal, one way of working
Pick a repo convention and don’t negotiate it for every team. The goal is muscle memory.
Recommended baseline per service repo:
- `README.md` (what it is, how to run, how to deploy)
- `/docs/` (architecture + ops)
- `/docs/runbooks/` (incident procedures)
- `/docs/adr/` (decision log)
Then publish it into a single internal portal. MkDocs Material is a workhorse for this because it’s dead simple and has good search.
Example `mkdocs.yml` (per repo or aggregated):

```yaml
site_name: payments-service
repo_url: https://github.com/acme/payments-service
theme:
  name: material
  features:
    - navigation.instant
    - content.code.copy
plugins:
  - search
nav:
  - Overview: index.md
  - Architecture:
      - System Context: architecture/context.md
      - Data Flow: architecture/data-flow.md
  - Runbooks:
      - Queue Backlog: runbooks/queue-backlog.md
      - Stripe Webhook Failures: runbooks/stripe-webhooks.md
  - ADRs:
      - "ADR-0007: Idempotency Keys": adr/0007-idempotency-keys.md
```

And a simple GitHub Actions pipeline to publish (GitHub Pages shown, but the same idea works for S3/CloudFront, GCS, or an internal runner):
```yaml
name: docs
on:
  push:
    branches: [ main ]
permissions:
  contents: read
  pages: write
  id-token: write
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install mkdocs-material
      - run: mkdocs build --strict
      - uses: actions/upload-pages-artifact@v3
        with:
          path: site
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
```

Key detail: `mkdocs build --strict` fails the build on broken links and missing pages. That’s the paved road doing the nagging so humans don’t have to.
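If several repos feed one org-level portal, you need an aggregation step before `mkdocs build`. Here is a minimal Python sketch of that step; the repo list and layout are hypothetical, and maintained plugins such as `mkdocs-monorepo-plugin` are an alternative to rolling your own:

```python
"""Aggregate per-repo docs/ trees into one portal source tree (sketch)."""
import shutil
from pathlib import Path

def aggregate_docs(checkouts: dict[str, Path], out_dir: Path) -> list[Path]:
    """Copy each service's docs/ folder to out_dir/<service>/; return copied roots."""
    copied = []
    for service, repo_root in checkouts.items():
        src = repo_root / "docs"
        if not src.is_dir():
            # Service has no docs yet; surface this in a report rather than failing.
            continue
        dest = out_dir / service
        shutil.copytree(src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied
```

Run it in CI before the portal build, and the missing-`docs/` case doubles as a coverage report.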
## Capture decisions with ADRs (because “we’ll remember why” is a lie)
If you do nothing else, do ADRs. I’ve watched teams re-litigate “Postgres vs DynamoDB,” “Kubernetes vs ECS,” “GraphQL vs REST,” and “Avro vs Protobuf” like it’s Groundhog Day.
An ADR doesn’t need to be a thesis. It needs:
- Context (what problem, what constraints)
- Decision (what we chose)
- Consequences (what this breaks/enables)
- Status (proposed/accepted/superseded)
Template (`/docs/adr/000x-something.md`):

```markdown
# ADR-0007: Idempotency keys for payment creation

- Status: accepted
- Date: 2025-02-14
- Owners: @payments-platform

## Context
We receive retries from clients and from our message queue. Double-charging is unacceptable.
We run on Postgres 15. Writes peak at ~800 rps.

## Decision
Require an `Idempotency-Key` header for `POST /payments`.
Store keys in `payment_idempotency` with a 24h TTL (partitioned by day).

## Consequences
- Client SDKs must generate stable keys per user action.
- We will reject requests missing the header (400).
- Cleanup job required; adds ~2GB/day at peak.

## Alternatives considered
- Upsert on business key (rejected: too many edge cases)
- Redis cache (rejected: persistence + operational risk)
```

This is where a lot of AI-generated code goes off the rails: models happily implement a solution without recording constraints, and six months later you can’t tell whether the weird behavior is intentional or accidental. ADRs are the antidote.
## Runbooks that actually work: executable steps, dashboards, and rollback paths
A runbook that says “check logs” is just a cry for help.
What works in practice:
- Start with symptoms and blast radius
- Link the Grafana dashboard and the exact Prometheus queries
- Provide copy/paste commands (`kubectl`, `aws`, `gcloud`, `psql`)
- Include a safe rollback and a “stop the bleeding” option
Example runbook snippet (`/docs/runbooks/queue-backlog.md`):

````markdown
# Runbook: Kafka consumer lag spike (payments-settlements)

## Symptoms
- `consumer_lag{group="payments-settlements"}` > 50k for 10m
- Increased `p95` settlement latency

## Dashboards
- Grafana: https://grafana.acme.internal/d/settlements

## Triage (5 minutes)
1. Confirm deploy status:
   ```bash
   kubectl -n payments get deploy payments-settlements -o wide
   ```
2. Check error rate:
   ```promql
   sum(rate(http_server_errors_total{service="payments-settlements"}[5m]))
   ```
3. Check broker issues (if errors are low but lag rises):
   ```promql
   sum(rate(kafka_network_request_errors_total[5m])) by (broker)
   ```

## Mitigation
- If CPU throttling:
  ```bash
  kubectl -n payments top pods | grep settlements
  kubectl -n payments patch hpa payments-settlements --type merge -p '{"spec":{"maxReplicas":30}}'
  ```

## Rollback
- If the lag spike started after a deploy in the last 60 minutes:
  ```bash
  kubectl -n payments rollout undo deploy/payments-settlements
  kubectl -n payments rollout status deploy/payments-settlements
  ```

## Follow-ups
- Add alert: lag > 25k for 5m
- If caused by schema change, reference ADR-0012
````
Notice what’s missing: philosophy. This is the stuff you want an on-caller to do at 2am with half a brain.
## Make it stick with automation: ownership, PR nudges, and “docs drift” checks
Most knowledge systems die from neglect, not bad intent. Your job is to reduce the activation energy.
### Use `CODEOWNERS` so docs have real owners
```text
# .github/CODEOWNERS
/docs/ @platform-enablement
/docs/runbooks/ @sre-oncall
/docs/adr/ @architecture-council
/src/payments/ @payments-team
```

This does two things:
- Prevents unreviewed changes to critical knowledge paths
- Makes ownership explicit when teams reorganize (a surprisingly common failure point)
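A small CI check can catch the reorg failure mode automatically. Here is a sketch; the required paths are illustrative, and real `CODEOWNERS` matching uses gitignore-style globs that this simple exact-match version ignores:

```python
"""Check that critical CODEOWNERS paths still have at least one owner (sketch)."""

REQUIRED_PATHS = ["/docs/", "/docs/runbooks/", "/docs/adr/"]

def parse_codeowners(text: str) -> dict[str, list[str]]:
    """Map each pattern to its owners, skipping comments and blank lines."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        rules[pattern] = owners
    return rules

def missing_owners(text: str) -> list[str]:
    """Return required paths with no owner entry (absent pattern or empty owner list)."""
    rules = parse_codeowners(text)
    return [p for p in REQUIRED_PATHS if not rules.get(p)]
```

Wire it into the docs CI job and fail the build when a team deletion leaves `/docs/runbooks/` orphaned.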
### Add a PR template that forces the tiny habit
```markdown
<!-- .github/pull_request_template.md -->
## What changed

## Why

## Docs/Runbooks
- [ ] Updated `/docs` (architecture or usage)
- [ ] Updated/added runbook (if operational behavior changed)
- [ ] ADR added/updated (if decision-level change)
```

### Enforce “docs build” in CI (but don’t be a jerk about it)
- Fail builds on broken links (`--strict`)
- Warn (don’t fail) on a missing docs checkbox at first; tighten later
One pattern I’ve seen work: start with soft enforcement for 2–4 weeks, then flip to hard enforcement once the templates and paths are stable.
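The soft-then-hard flip can live in one small script. Here is a hypothetical sketch: the checkbox labels match the PR template above, and how `strict` gets toggled (an env var, a repo setting) is up to your CI:

```python
"""Check a PR body for the docs checklist; warn first, fail later (sketch)."""

CHECKBOXES = ["Updated `/docs`", "Updated/added runbook", "ADR added/updated"]

def docs_check(pr_body: str, strict: bool) -> tuple[int, list[str]]:
    """Return (exit_code, unchecked_items). Nonzero exit only in strict mode."""
    unchecked = [
        label for label in CHECKBOXES
        if f"- [x] {label}" not in pr_body  # a ticked box renders as "- [x] ..."
    ]
    code = 1 if (strict and unchecked) else 0
    return code, unchecked
```

During the soft-enforcement weeks, print `unchecked` as a PR comment; after the flip, the same script fails the job.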
## The numbers that tell you it’s working (and the anti-patterns that waste money)
If you can’t measure it, you’ll end up arguing about vibes. The metrics I like for knowledge preservation:
- Onboarding time-to-first-PR (median, not best-case)
- MTTR for Sev1/Sev2 incidents
- Change failure rate (deployments requiring rollback/hotfix)
- % incidents with a linked runbook (or created/updated runbook within 48 hours)
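These metrics roll up from incident records you already have. A sketch of the computation; the record fields are hypothetical, and in practice they come from PagerDuty, FireHydrant, or even a spreadsheet:

```python
"""Compute knowledge-health metrics from incident records (sketch)."""
from statistics import median

def knowledge_metrics(incidents: list[dict]) -> dict:
    """Percent of incidents with a linked runbook, plus median MTTR in minutes."""
    if not incidents:
        return {"runbook_pct": 0.0, "median_mttr_min": 0.0}
    with_runbook = sum(1 for i in incidents if i.get("runbook_url"))
    return {
        "runbook_pct": round(100 * with_runbook / len(incidents), 1),
        "median_mttr_min": median(i["mttr_min"] for i in incidents),
    }
```

Report the median, not the mean: one marathon Sev1 shouldn’t mask a broadly improving trend.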
Concrete before/after example (another GitPlumbers “code rescue” scenario, heavy AI-assisted commits, messy ownership):
- Before:
  - New on-call engineers escalated 70% of pages to two staff engineers
  - 30–40% of incidents had “no runbook” notes in postmortems
  - Slack was the source of truth for deployment steps
- After (one quarter):
  - Escalations dropped to ~25%
  - 85% of incidents referenced a runbook
  - Deployment knowledge moved into `/docs/deploy.md` + CI/CD pipeline output
Anti-patterns I’ve seen fail (repeatedly):
- Building a custom portal before you have templates and habits
- Letting every team pick a different tool (Notion here, Confluence there, Google Docs everywhere)
- Treating docs as “someone else’s job” (hello, zombie wiki)
If your knowledge system isn’t part of the developer workflow, it’s a museum.
## When you need help: the “institutional knowledge retrofit” playbook
This is the kind of work GitPlumbers gets called in for when teams are drowning in legacy complexity or AI-generated diffs with zero context. The approach is intentionally unsexy:
- Inventory where knowledge actually lives (repos, wikis, Slack, tickets, tribal memory)
- Choose paved-road defaults (repo layout + MkDocs/Docusaurus + templates)
- Migrate the 20% that drives 80% of outcomes (runbooks + top ADRs + onboarding)
- Add enforcement (CODEOWNERS, CI doc builds, PR nudges)
- Close the loop from incidents and migrations back into docs
You don’t need perfect documentation. You need operationally relevant documentation that survives org churn.
If your org is feeling the bus-factor pain, or you’re trying to tame AI-assisted changes without turning into the “no” department, that’s exactly the lane we live in.
## Key takeaways
- If knowledge isn’t versioned with code, it will drift—and you’ll discover that drift at 2am.
- The winning stack is boring: Markdown + repo structure + CI checks + ownership. Avoid bespoke knowledge platforms until you’ve proven the habit.
- Capture decisions with lightweight ADRs; capture operations with runnable runbooks; capture ownership with CODEOWNERS.
- Measure success with onboarding time, change failure rate, and MTTR—not “number of pages in the wiki.”
- Automate the nudges: PR templates, doc builds, stale checks, and incident-to-runbook loops.
## Implementation checklist
- Pick a single docs home per service (`/docs` in-repo) and one org-level portal (built from repos).
- Adopt an ADR template and require it for non-trivial changes (data model, queues, auth, deployment topology).
- Convert “runbook pages” into executable steps: copy/paste `bash` commands, dashboards, and rollback paths.
- Add `CODEOWNERS` for docs and critical paths; ensure reviewers include domain owners.
- Add CI to fail PRs when docs don’t build or runbook links break.
- Create an incident loop: every Sev1 adds/updates a runbook and at least one dashboard link.
- Track onboarding time-to-first-PR, MTTR, and % of incidents with a referenced runbook.
- Keep it paved: standard templates, same layout, same toolchain across repos.
## Questions we hear from teams
- Should we use Confluence/Notion instead of docs-as-code?
- Use them if you already have strong adoption, but treat them as the exception—not the default. In practice, docs-as-code wins because it’s versioned with the service, reviewed via PRs, and forced through the same delivery pipeline. If you keep Confluence/Notion, at least link to repo runbooks/ADRs as the source of truth and automate checks so it doesn’t drift.
- How do we prevent docs from becoming stale?
- Make staleness detectable and owned: `mkdocs build --strict` to catch broken links, `CODEOWNERS` so changes require review, and an incident loop where every Sev1 updates a runbook within 48 hours. Add PR templates and (eventually) CI enforcement for doc updates when operational behavior changes.
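One way to make staleness detectable in CI, sketched under the assumption that each page carries a `last_reviewed: YYYY-MM-DD` front-matter line (a team convention, not an MkDocs feature; the 90-day threshold is illustrative):

```python
"""Flag docs pages that haven't been reviewed recently (sketch)."""
import re
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)

def is_stale(page_text: str, today: date) -> bool:
    """True if last_reviewed is missing or older than the threshold."""
    m = re.search(r"^last_reviewed:\s*(\d{4})-(\d{2})-(\d{2})\s*$", page_text, re.M)
    if not m:
        return True  # no review date at all counts as stale
    reviewed = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    return today - reviewed > STALE_AFTER
```

Run it over `/docs/**/*.md` weekly and file an issue per stale page; owners from `CODEOWNERS` tell you who gets the ping.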
- What’s the smallest set of documents that moves the needle?
- Start with (1) onboarding doc (`README.md` + local run), (2) top 5 runbooks that cover most pages, and (3) ADRs for irreversible decisions (data model, auth, queuing, deployment architecture). Don’t migrate everything—migrate what reduces escalations and MTTR.
- How does this help with AI-assisted development and “vibe coding”?
- AI-generated code often lacks the why: constraints, trade-offs, and operational implications. ADRs capture the rationale, and runbooks capture how to operate the behavior safely. Combined with ownership and CI, this turns AI speed into something you can actually maintain.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
