Platform-productivity · Oct 18, 2025 · 10 minute read

Slack Is Not a Knowledge Base: Build a Paved Road That Survives Reorgs

If your production runbooks live in someone’s head (or a Slack DM), you’re one departure away from a very expensive incident. Here’s the simple, boring, and durable system that actually preserves institutional knowledge.

Alex Mercer

Principal Consultant, GitPlumbers

Twenty years in the trenches across fintech, SaaS, and infra. Ex-SRE manager, platform lead, and recovery specialist for systems that paged too much and shipped too little.

Boring, paved roads beat bespoke portals every quarter and twice on incident nights.

The outage that exposed your "knowledge base"

I watched a payment platform eat a 7-hour outage because the only working runbook for "Rotating OAuth client secrets" was a Slack thread from 18 months ago. The last SRE who knew the exact curl incantation had left. Confluence had five conflicting pages; none were correct. Finance pegged the incident at low seven figures in lost processing and refunds.

I’ve seen this movie at startups and Fortune 100s. The pattern is the same: knowledge lives in heads, DMs, random wikis, and bespoke portals nobody maintains. When people leave, your MTTR and onboarding time go vertical. When auditors show up, you scramble. When the pager goes off, someone doom-scrolls Slack.

Here’s the paved road that actually works and survives reorgs: simple, boring, docs-as-code with light automation and ownership baked in. No heroics. No bespoke React portal. Just the shortest path that engineers will actually use.

Why bespoke portals fail and Slack rots

I’ve seen teams burn quarters building internal "knowledge hubs" with Next.js, GraphQL, and a custom search layer. It demos great and dies quietly 6 months later. Why?

Authoring friction: If writing a runbook requires leaving the repo and learning a CMS, it won’t happen.
No ownership: "Who owns this page?" If that answer isn’t in CODEOWNERS, it’s nobody.
Drift: Code changes, docs don’t. Separate systems drift by default.
Search lies: Slack looks useful until you realize discoverability dies after 2 weeks and context is gone.
Maintenance tax: Bespoke portals require a product team. Docs-as-code piggybacks on your CI/CD and Git skills.

Slack is great for questions, terrible as a system of record. Portals are great for browseability, terrible for day-2 upkeep without dedicated staff. The paved road splits the difference: put knowledge next to code, render it nicely, and automate the boring parts.

The paved-road system: simple, boring, durable

If I had to roll this out in 4 weeks, I’d do this and nothing else:

Docs-as-code: Put Markdown in docs/, adr/, runbooks/ per repo. Use repo templates so every service starts with a good skeleton.
MkDocs Material: Render docs with search, nav, dark mode. Publish via gh-pages or GitLab Pages on every merge.
ADRs (Architecture Decision Records): Capture why you chose PostgreSQL 15 over Aurora, or why you turned off Istio mTLS in staging. They age better than tribal memory.
Runbooks: Concrete, step-by-step guides for critical ops: rotate keys, unstick Kafka consumers, restart a stuck ArgoCD app, recover from RDS failover.
Lightweight RFCs: Use GitHub Issues + labels + a template. Discuss in PRs. Merge decisions into ADRs.
Ownership: Enforce CODEOWNERS for docs/, runbooks/, adr/. Tie PagerDuty or OnCall rotations to owners.
Index page: A generated list of services with owners, SLOs, endpoints, and top runbooks. No Backstage required (yet).

This stack wins because it’s the path of least resistance. Engineers write Markdown. CI builds pages. Ownership comes from the same file that gates code.
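The index page is the one piece people assume needs a portal. It doesn't. Here is a minimal sketch of a generator, assuming each service README opens with an H1 and carries `- Owner:`, `- SLO:`, and `- Pager:` lines, and assuming every service repo is cloned side by side under a hypothetical `repos/` directory. The function names and layout are mine, not a standard tool.

```python
"""Sketch: build a services.md index from per-repo README headers."""
import re
from pathlib import Path

# Matches "- Owner: ...", "- SLO: ...", "- Pager: ..." header lines.
HEADER_RE = re.compile(r"^- (Owner|SLO|Pager):[ \t]*(.+)$", re.MULTILINE)
COLUMNS = ("Service", "Owner", "SLO", "Pager")


def parse_readme(text: str) -> dict:
    """Extract the service name (first H1) and the Owner/SLO/Pager fields."""
    title = re.search(r"^# (.+)$", text, re.MULTILINE)
    fields = {key: value.strip() for key, value in HEADER_RE.findall(text)}
    fields["Service"] = title.group(1).strip() if title else "(unnamed)"
    return fields


def render_index(services: list) -> str:
    """Render a Markdown table, one row per service, sorted by name."""
    lines = ["# Services", "", "| Service | Owner | SLO | Pager |", "| --- | --- | --- | --- |"]
    for svc in sorted(services, key=lambda s: s["Service"].lower()):
        # A missing field is shown loudly instead of being silently dropped.
        row = [svc.get(col, "MISSING") for col in COLUMNS]
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines) + "\n"


def build_index(checkout_root: Path) -> str:
    """Scan <checkout_root>/<repo>/README.md for every checked-out repo."""
    readmes = sorted(checkout_root.glob("*/README.md"))
    return render_index([parse_readme(p.read_text(encoding="utf-8")) for p in readmes])


if __name__ == "__main__":
    # Hypothetical layout: service repos cloned under ./repos, output into docs/.
    out = Path("docs/services.md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(build_index(Path("repos")), encoding="utf-8")
```

Printing `MISSING` for absent fields is deliberate: a visible gap on the index page gets fixed faster than a silent one.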

Implementation: 7 steps with configs you can paste today

Bootstrap a repo template

Create a company-wide template repository with starter docs. Every new service should start here.

# one-time template repo creation (GitHub)
gh repo create org/service-template --private
gh repo edit org/service-template --template

Add a minimal README.md skeleton:

# Service Name

- Owner: @team-example
- SLO: 99.9% monthly availability
- Pager: PD-Example-Primary
- Docs: ./docs
- Runbooks: ./runbooks

## Quickstart
- make build
- make run

## Links
- Prod: https://prod.example.com/service
- Dashboards: Grafana > Service > Overview
- Alerts: PagerDuty > Services > Service Name

Add ADRs and a template

<!-- adr/0001-record-architecture-decisions.md -->
# 1. Record architecture decisions

Date: 2025-01-10

Status: Accepted

Context: We need a durable way to capture decisions.

Decision: We will use Markdown ADRs stored in `adr/` with sequential numbering.

Consequences: New decisions are added via PR, reviewed by CODEOWNERS, and linked from README.

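Optionally use adr-tools to add new ADRs. It writes to doc/adr by default, so point it at adr/ first:

# macOS
brew install adr-tools
echo adr > .adr-dir   # adr-tools reads the ADR directory from this file
adr new Use PostgreSQL 15 for metadata store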

Wire up MkDocs Material

# mkdocs.yml
site_name: Engineering Handbook
theme:
  name: material
  features:
    - navigation.instant
    - content.code.copy
    - search.suggest
nav:
  - Home: index.md
  - Services: services.md
  - Runbooks: runbooks/index.md
  - ADRs: adr/index.md
plugins:
  - search
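markdown_extensions:
  - admonition
  - pymdownx.highlight
  - pymdownx.superfences

MkDocs only builds from its docs directory (docs/ by default), so symlink or copy the top-level runbooks/ and adr/ into docs/ so the nav entries above resolve.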

Publish docs on every merge (GitHub Actions)

# .github/workflows/docs.yml
name: docs
on:
  push:
    branches: [ main ]
permissions:
  contents: write  # lets GITHUB_TOKEN push the built site to gh-pages
jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install mkdocs-material
      - run: mkdocs build --strict
      - uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./site

Enforce ownership

# CODEOWNERS (teams need the org prefix; "org" is a placeholder)
/docs/ @org/platform-team
/runbooks/ @org/sre-team
/adr/ @org/architecture-review-board
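Ownership only holds if someone notices when an entry disappears. A minimal sketch of a CI check follows, assuming the three directories are listed with exactly the patterns `/docs/`, `/runbooks/`, and `/adr/`. It does exact-pattern matching only and does not implement GitHub's full CODEOWNERS glob semantics; the function names are illustrative.

```python
"""Sketch: fail CI when a required directory has no CODEOWNERS entry."""
import sys
from pathlib import Path

REQUIRED_DIRS = ("/docs/", "/runbooks/", "/adr/")


def parse_codeowners(text: str) -> dict:
    """Map each pattern to its list of owners, skipping comments and blanks."""
    rules = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line:
            continue
        pattern, *owners = line.split()
        rules[pattern] = owners
    return rules


def missing_owners(text: str, required=REQUIRED_DIRS) -> list:
    """Return required directories that lack a rule or list no owner."""
    rules = parse_codeowners(text)
    return [d for d in required if not rules.get(d)]


if __name__ == "__main__":
    path = Path(sys.argv[1] if len(sys.argv) > 1 else "CODEOWNERS")
    gaps = missing_owners(path.read_text(encoding="utf-8"))
    for directory in gaps:
        print(f"no owner for {directory}")
    sys.exit(1 if gaps else 0)
```

Run it as a required status check in the service template so every new repo inherits it.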

Lightweight RFCs in Issues

# .github/ISSUE_TEMPLATE/rfc.yml
name: RFC
description: Propose a change for discussion before it becomes an ADR
labels: [rfc]
body:
- type: textarea
  attributes:
    label: Summary
- type: textarea
  attributes:
    label: Motivation
- type: textarea
  attributes:
    label: Proposal
- type: textarea
  attributes:
    label: Alternatives
- type: textarea
  attributes:
    label: Rollout / Risks

Make it easy to preview locally

.PHONY: docs-serve
docs-serve:
	pip install --quiet mkdocs-material
	mkdocs serve -a 0.0.0.0:8000

Optional polish that pays back fast:

pre-commit with markdownlint and broken-link checking.
A nightly job that flags docs older than 180 days.
A simple services.md generated by a script that scans repos for README headers and owners.
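The nightly staleness job is small enough to sketch. This version assumes docs live under `docs/`, `runbooks/`, and `adr/`, and uses the last commit date from `git log` instead of file mtimes, because a fresh CI checkout resets mtimes. The 180-day threshold comes from the list above; the function names are mine.

```python
"""Sketch: list Markdown files whose last commit is older than a threshold."""
import subprocess
from datetime import date, datetime, timedelta
from pathlib import Path
from typing import Optional

MAX_AGE_DAYS = 180


def git_last_modified(path: Path) -> Optional[date]:
    """Date of the last commit touching `path`, or None if git has no record."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromisoformat(out).date() if out else None


def stale_docs(last_modified: dict, today: date, max_age_days: int = MAX_AGE_DAYS) -> list:
    """Return (path, age_in_days) for docs older than max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [(p, (today - d).days) for p, d in last_modified.items() if d < cutoff]
    return sorted(stale, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    files = [p for root in ("docs", "runbooks", "adr") for p in Path(root).rglob("*.md")]
    dates = {}
    for p in files:
        last = git_last_modified(p)
        if last is not None:
            dates[str(p)] = last
    for path, age in stale_docs(dates, date.today()):
        print(f"{age:>4}d  {path}")
```

One caveat if you run this in GitHub Actions: `actions/checkout` fetches a shallow clone by default, which makes every file look freshly committed. Set `fetch-depth: 0` on the checkout step for this job.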

Before/after: numbers that move the business

We rolled this out at two clients last year. No bespoke portals. No committees.

B2B SaaS (150 engineers)
- Before: Onboarding to first meaningful PR averaged 6–8 weeks. On-call handoffs were chaotic. Runbooks lived in Confluence and Slack.
- After 6 weeks: Docs-as-code + ADRs + runbooks + RFCs landed. Every service had a README with owners. MkDocs site published on every merge.
- Results in 90 days:
  - Onboarding to first PR: 8.2 weeks → 3.1 weeks.
  - Incident MTTR: down 37% (runbooks reduced flailing).
  - Ad-hoc help requests in #dev-help: down 42% (searchable docs + index page).
  - Auditor findings about change management: 0 (ADRs + RFCs closed the gap).

Fintech (45 engineers, heavy on ops)
- Before: Custom internal portal stalled (1 FTE maintaining). Runbooks stale. Rotate-keys procedure existed only in Slack.
- After 4 weeks: Portal retired. GitHub Pages + MkDocs Material. Mandatory CODEOWNERS for docs/adr.
- Results in 60 days:
  - Key-rotation playbook tested in staging and prod, twice. Last prod rotation: 20 minutes, zero pages.
  - Docs older than 180 days flagged; ~70 pages archived or updated.
  - Portal maintenance FTE: 1.0 → 0.1 (occasional CI bumps only).

If you measure anything, measure onboarding time and MTTR. If they don’t move, your knowledge system is theater.

Trade-offs: what to keep boring, when to extend

I’m not anti-Backstage. I’m anti-starting-with-Backstage. It’s great when you need a catalog with templates, scorecards, and plugin integrations. But the operational cost is real.

Start with boring: Markdown + MkDocs + Actions + CODEOWNERS. Low lift, easy adoption, zero proprietary lock-in.
Search: MkDocs search is fine for most teams. If you truly need cross-repo search, consider GitHub code search or a minimal OpenSearch/Meilisearch sidecar that indexes Markdown nightly. Avoid yak-shaving on relevancy tuning until you have usage data.
Wikis: If you already have Confluence, keep it for cross-functional stuff (HR, compliance) and link out. Don’t put runbooks there.
AI assistants: Great as a layer on top of validated docs. Bad as a source of truth. If you add AI Q&A, use retrieval from the published docs and log unknowns as GitHub Issues. Treat hallucinations as defects.
Compliance: ADRs and RFCs double as change-management artifacts. Tag them with jira: references if you must. Keep PII out of docs; link to secure stores for secrets.

Rule of thumb: if a tool isn’t in your engineers’ daily flow (Git, PRs, CI), it’s a sidecar at best and a graveyard at worst.

Make it stick: governance, habits, and metrics

Technology is the easy part. The glue is expectations and tiny bits of automation.

Define a docs SLO: e.g., 95% of runbooks touched in the last 180 days. Report it monthly like uptime.
Gate changes: If a PR changes a critical path (e.g., auth), require an ADR update in the same PR.
Incident hygiene: Every post-incident review should update or create at least two runbooks.
Rotate ownership: Put docs updates on the on-call weekly checklist. It takes 10 minutes.
Kill stale content: Archiving increases trust. Stale docs are worse than no docs.
Track KPIs:
- Onboarding to first PR
- MTTR (and # of incidents resolved by following a runbook)
- Docs freshness score (last-modified distribution)
- Help channel deflection rate (answered by links vs human)
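The "gate changes" rule is the easiest of these to automate. Below is a minimal sketch of a PR check, assuming critical code lives under hypothetical prefixes like `src/auth/`, ADRs live under `adr/`, and the base branch is `origin/main`. All three are placeholders to adjust for your repos.

```python
"""Sketch: require an ADR change when a PR touches a critical path."""
import subprocess
import sys

# Hypothetical critical paths; adjust to your repo layout.
CRITICAL_PREFIXES = ("src/auth/", "src/payments/")
ADR_PREFIX = "adr/"


def needs_adr_update(paths, critical=CRITICAL_PREFIXES, adr_prefix=ADR_PREFIX) -> list:
    """Return the critical files changed without any accompanying ADR change."""
    touched = [p for p in paths if p.startswith(critical)]
    has_adr = any(p.startswith(adr_prefix) and p.endswith(".md") for p in paths)
    return [] if has_adr else touched


def git_changed_files(base_ref: str = "origin/main") -> list:
    """Files changed on this branch relative to the merge base with base_ref."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


if __name__ == "__main__":
    offenders = needs_adr_update(git_changed_files())
    for path in offenders:
        print(f"critical path changed without an ADR update: {path}")
    sys.exit(1 if offenders else 0)
```

It only checks that some ADR file changed, not that the change is relevant. That judgment stays with the reviewers named in CODEOWNERS.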

If you need a first-month plan, do this:

Stand up MkDocs Material and the Actions workflow.
Add CODEOWNERS and ADR templates to your service template.
Migrate top 10 runbooks from Slack/Confluence into repos.
Publish a services.md index with owners and links.
Set a docs SLO and report it at eng review.

Boring systems win. Your future self at 2 a.m. will thank you.

FAQs

What about teams that hate writing docs?
- Pair write during incident reviews and design reviews. Keep docs short, procedural, and close to code. Reward links in PRs.
Can we do this on GitLab/Bitbucket?
- Yes. Use GitLab Pages/CI or Bitbucket Pipelines plus a static hosting bucket. Same pattern and principles.
How do we handle secrets in runbooks?
- Never inline secrets. Reference commands like aws ssm get-parameter or vault kv get, and link to access controls with least privilege.
What if we already invested in Backstage?
- Keep it, but feed it from the same Markdown sources. Don’t fork the truth. Backstage can render or link to the docs; the source of truth remains Markdown in repos.
How do we avoid stale ADRs?
- Treat ADRs like code: PRs, reviews, and deprecations. Add a linter that checks Status and Last Reviewed date; include ADR updates in change PRs.
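The ADR linter from the last answer can be a few dozen lines. This sketch assumes each ADR carries a `Status:` line and a `Last Reviewed: YYYY-MM-DD` line. The allowed statuses and the 365-day review window are assumptions to tune, not a standard.

```python
"""Sketch: lint an ADR for a Status line and a recent Last Reviewed date."""
import re
import sys
from datetime import date, timedelta
from pathlib import Path

ALLOWED_STATUSES = {"proposed", "accepted", "deprecated", "superseded"}  # assumed set
MAX_REVIEW_AGE_DAYS = 365  # assumed review cadence

STATUS_RE = re.compile(r"^Status:[ \t]*(\w+)", re.MULTILINE | re.IGNORECASE)
REVIEWED_RE = re.compile(
    r"^Last Reviewed:[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$", re.MULTILINE | re.IGNORECASE
)


def lint_adr(text: str, today: date) -> list:
    """Return a list of human-readable problems; empty means the ADR passes."""
    problems = []
    status = STATUS_RE.search(text)
    if not status:
        problems.append("missing Status line")
    elif status.group(1).lower() not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status.group(1)}")

    reviewed = REVIEWED_RE.search(text)
    if not reviewed:
        problems.append("missing Last Reviewed date (YYYY-MM-DD)")
        return problems
    try:
        last = date.fromisoformat(reviewed.group(1))
    except ValueError:
        problems.append(f"invalid Last Reviewed date: {reviewed.group(1)}")
        return problems
    if today - last > timedelta(days=MAX_REVIEW_AGE_DAYS):
        problems.append(f"Last Reviewed is {(today - last).days} days old")
    return problems


if __name__ == "__main__":
    failed = False
    for path in sorted(Path("adr").glob("*.md")):
        for problem in lint_adr(path.read_text(encoding="utf-8"), date.today()):
            print(f"{path}: {problem}")
            failed = True
    sys.exit(1 if failed else 0)
```

A status such as "Superseded by ADR 12" passes, because only the first word of the Status line is checked.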

Key takeaways

Stop treating Slack and tribal memory as your knowledge system; answers there stop being findable within weeks.
Favor paved-road defaults: Markdown, repo templates, MkDocs Material, GitHub Pages/GitLab Pages, and automation.
Capture decisions (ADRs), operations (runbooks), and change proposals (RFCs) next to code, not in a separate portal.
Make docs part of the delivery pipeline with owners, checks, and an SLO; don’t rely on goodwill.
Measure outcomes: onboarding time, MTTR, doc freshness, and unblocked-help requests.

Implementation checklist

Create a company-wide repo template with README, ADR, runbook, and RFC templates.
Stand up MkDocs Material + GitHub Actions to publish docs on every merge.
Require CODEOWNERS for docs/, runbooks/, and adr/ directories.
Adopt a lightweight RFC process using GitHub Issues + labels.
Instrument doc freshness and broken-link checks in CI.
Add an index page that enumerates services with owners, SLOs, and top runbooks.
Review two docs per incident in postmortems; update or archive stale content.
