The Debt Diet That Saved a Rocket Ship: Cutting MTTR 88% and Doubling Deploys in 90 Days

A Series B startup was scaling like crazy and bleeding reliability. We halted the spiral by treating technical debt like real debt—with a budget, an owner, and a paydown plan—without slowing feature velocity.

Debt doesn’t kill you overnight; it bleeds you dry every deploy.

The quarter the wheels started shaking

You know the pattern. Headcount jumps from 12 to 45 engineers, revenue doubles, and your main branch starts feeling like a war zone. At this client—a Series B B2B SaaS—deploys slowed to twice a week, CI stretched past 45 minutes, and every schema migration felt like Russian roulette. On-call was paging 60+ times a month. Cloud spend was growing faster than ARR. I’ve watched this movie since the Rails monolith days at unicorns-you-know; the plot doesn’t change.

The founders called us when a Friday hotfix caused a 41-minute checkout outage. The postmortem read like a reliability bingo card: missing SLOs, noisy alerts, database lock contention, and a canary that was really just “deploy to 10% and pray.” We started Monday.

The debt we found and why it mattered

We ran a two-week diagnostic across code, infra, and delivery. Tools: Datadog and Prometheus for golden signals, git log hotspots, pg_stat_statements for query offenders, kubectl + kubecost for cluster efficiency, and DORA baselines from GitHub PR analytics.
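For the query offenders, a single triage pass over pg_stat_statements usually surfaces the CPU hogs. A sketch of the kind of query we run (column names assume Postgres 13+, where `total_time` became `total_exec_time`; adjust for older versions):

```sql
-- Top 10 statements by cumulative execution time (a proxy for CPU).
-- Requires the pg_stat_statements extension to be enabled.
SELECT queryid,
       left(query, 80)                    AS query_sample,
       calls,
       round(total_exec_time::numeric, 0) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Two rows dominating this list is exactly how the "two queries = 37% of CPU" finding below falls out.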

Highlights:

  • CI/CD: 45–55 min pipeline, 28% change failure rate, manual rollbacks.
  • Observability: 1,200+ alerts, 90% unactionable, no SLOs, no error budgets.
  • DB: Two queries accounted for 37% of CPU on r6g.4xlarge Postgres; migrations locked orders for 10–20s.
  • K8s: Requests set to limits, no HPAs, 60% of services over-provisioned, noisy microservices (“two pizza teams” turned into “twelve one-slice services”).
  • Feature management: Flags existed but weren’t used for safe refactors; canaries were eyeballed via dashboards.

Why this matters to the business:

  • Sales was discounting enterprise deals by 10–15% to compensate for uptime anxiety.
  • Engineering was burning cycles on pages and rollbacks instead of roadmap work.
  • Cash burn: infra + incident toil were adding ~$280k/quarter in avoidable costs.

The 90-day intervention that actually worked

We’ve seen big-bang rewrites crater startups. Here’s the play we run when the room is on fire and you still need to ship.

  1. Set SLOs and an error budget policy
    • Pick two user journeys: login and checkout.
    • SLOs: login 99.95% (latency p95 < 300ms), checkout 99.9% (p95 < 600ms).
    • Policy: When 2% budget burns in 1 hour, pause risky deploys; at 5%, freeze and roll back.
  2. Create a visible debt backlog and budget
    • Scored by blast radius, MTTR reduction, and cost savings.
    • Leadership backed a hard 20% capacity for debt for 6 sprints. Non-negotiable.
  3. Stabilize delivery before refactors
    • Trunk-based dev, mandatory canary via Argo Rollouts, automated rollback on SLO burn.
    • Cut CI time to under 15 minutes; parallelize tests; cache everything.
  4. Instrument and prune
    • Reduce alerts by 70%; convert the rest to SLO burn and a handful of symptom alerts.
  5. Right-size infra and reduce surface area
    • Requests/limits sane defaults, HPAs across the board, kill/merge low-value services.
  6. Attack the top 3 DB hotspots and the top 3 sources of incidents
    • Indexes, query plans, migration strategy, and circuit breakers on external calls.
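The burn policy in step 1 is simple enough to express as a small decision function. This is an illustrative sketch of the gate logic, not a real deploy-pipeline API; the thresholds mirror the 2%/5% policy above and the 30-day SLO window is an assumption:

```typescript
// Error-budget gate: given the percent of budget consumed in the last
// window, decide whether deploys may proceed. Names are illustrative.

type BudgetAction = 'proceed' | 'pause-risky-deploys' | 'freeze-and-rollback'

// budgetBurnedPct: percent of the error budget consumed in the last hour
function decideDeployAction(budgetBurnedPct: number): BudgetAction {
  if (budgetBurnedPct >= 5) return 'freeze-and-rollback' // 5% in 1h: freeze
  if (budgetBurnedPct >= 2) return 'pause-risky-deploys' // 2% in 1h: pause
  return 'proceed'
}

// Burn rate relative to the sustainable pace: at 1x you exhaust the
// budget exactly at the end of a 30-day SLO window (720 hours).
function burnRate(budgetBurnedPct: number, windowHours: number): number {
  const sustainablePctPerHour = 100 / (30 * 24)
  return budgetBurnedPct / windowHours / sustainablePctPerHour
}

console.log(decideDeployAction(1.0)) // proceed
console.log(decideDeployAction(2.5)) // pause-risky-deploys
console.log(burnRate(2, 1).toFixed(1)) // burning 2% in one hour = 14.4x sustainable
```

The point of writing it down as code: the policy is mechanical, so the rollback can be too. Humans review the postmortem, not the dashboard.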

The plumbing: what we changed (with receipts)

A few representative changes that moved the needle fast.

  • CI cache + parallel tests on GitHub Actions
# .github/workflows/ci.yaml
name: CI
on:
  pull_request:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci --prefer-offline --no-audit
      - name: Restore turbo cache
        uses: actions/cache@v4
        with:
          path: .turbo
          key: turbo-${{ github.ref }}-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            turbo-
      - run: npx turbo run test -- --maxWorkers=50%
  • Canary deploys with automated rollback using Argo Rollouts
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  strategy:
    canary:
      canaryService: api-canary
      stableService: api
      trafficRouting:
        nginx: {}
      steps:
        - setWeight: 10
        - pause: { duration: 120 }
        - analysis:
            templates:
              - templateName: error-rate
        - setWeight: 50
        - pause: { duration: 180 }
        - analysis:
            templates:
              - templateName: error-rate
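The rollout references an `error-rate` AnalysisTemplate that isn't shown above. A minimal version might look like the following; the Prometheus address, metric labels, and 2% threshold are assumptions about this stack, so tune them per service:

```yaml
# analysis-template.yaml (illustrative)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 60s
      count: 3
      # Abort the canary (triggering rollback) if the 5xx ratio exceeds 2%
      successCondition: result[0] <= 0.02
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="api-canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="api-canary"}[5m]))
```

With `failureLimit: 1`, a single bad measurement aborts the rollout and shifts traffic back to stable; no human in the loop.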
  • SLO alerts in Prometheus (checkout p95 latency combined with 5xx error rate)
# slo-alerts.yaml
- alert: CheckoutErrorBudgetBurnFast
  expr: (histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="checkout"}[5m])) by (le)) > 0.6)
    and (sum(rate(http_requests_total{job="checkout",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.02)
  for: 5m
  labels:
    severity: critical
  annotations:
    runbook: https://runbooks.internal/checkout-slo
  • Right-sizing Kubernetes with HPAs and sane requests/limits (Helm values)
# values.yaml
resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 15
  targetCPUUtilizationPercentage: 65
  • Postgres index to kill a 400ms query on orders
-- before: WHERE status = 'open' AND created_at > now() - interval '7 days'
CREATE INDEX CONCURRENTLY idx_orders_status_created_at ON orders (status, created_at DESC);
  • Safer external calls with a TypeScript circuit breaker (opossum)
import CircuitBreaker from 'opossum'
import fetch from 'node-fetch'

const callPartner = async (payload: any) => {
  const res = await fetch('https://partner.example.com/api', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  })
  if (!res.ok) throw new Error(`Partner error: ${res.status}`)
  return res.json()
}

const breaker = new CircuitBreaker(callPartner, {
  timeout: 300,                 // treat calls slower than 300ms as failures
  errorThresholdPercentage: 50, // open the circuit when half of recent calls fail
  resetTimeout: 10000,          // after 10s, let a single probe call through
})

export const safePartnerCall = (payload: any) => breaker.fire(payload)
  • Safer migrations using gh-ost style pattern and read-only toggles via LaunchDarkly
# step 1: add the nullable column (brief metadata-only lock, no table rewrite)
psql $DB <<'SQL'
SET lock_timeout = '2s';  -- fail fast instead of queueing behind long transactions
ALTER TABLE orders ADD COLUMN meta jsonb;
SQL

# step 2: backfill in batches so no single transaction holds locks for long
node scripts/backfill-orders-meta.js --batch-size=1000 --sleep=200
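The backfill script itself is straightforward; a reconstruction of its core loop in TypeScript, with the database call injected so the batching and pacing logic stands alone (the `runBatch` shape and names are illustrative, not the actual script):

```typescript
// Batched backfill loop: walk the table in keyset-paginated batches and
// sleep between them so locks stay short and autovacuum keeps up.

// runBatch executes something like:
//   UPDATE orders SET meta = '{}'::jsonb
//   WHERE id IN (SELECT id FROM orders WHERE id > $1 AND meta IS NULL
//                ORDER BY id LIMIT $2)
//   RETURNING id
// and returns the updated ids.
type RunBatch = (lastId: number, batchSize: number) => Promise<number[]>

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms))

async function backfillInBatches(
  runBatch: RunBatch,
  batchSize = 1000,
  sleepMs = 200
): Promise<number> {
  let lastId = 0
  let total = 0
  for (;;) {
    const ids = await runBatch(lastId, batchSize)
    if (ids.length === 0) return total // nothing left to backfill
    total += ids.length
    lastId = Math.max(...ids) // keyset pagination: resume past the last id
    await sleep(sleepMs) // pause so foreground traffic isn't starved
  }
}
```

Keyset pagination (`WHERE id > $1 ... LIMIT $2`) matters here: `OFFSET`-based batching rescans ever more rows as the backfill progresses, while keyset batches stay index-range scans.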

We also killed four low-value microservices (two were just glue for cron + an HTTP call), moved their logic back into the api service, and reclaimed ~18% CPU across the cluster. Not everything needs to be a service. Your incident graph will thank you.

The results the business could feel

We don’t ship slideware. We ship deltas.

  • DORA metrics, 90 days in:
    • Deploy frequency: 2/week -> 25/day (canary + automated rollback made this safe).
    • Lead time for changes: ~4 days -> ~6 hours.
    • Change failure rate: 28% -> 7%.
    • MTTR: 180 min -> 22 min (88% reduction).
  • Reliability and on-call:
    • P1 incidents/month: 11 -> 4.
    • On-call pages/month: 65 -> 18.
    • Alert noise: 1,200+ alerts -> ~320, 80% mapped to SLO burn.
  • Cost and capacity:
    • Cloud spend: -23% via right-sizing and killing services; paid for the engagement by week 6.
    • Postgres CPU: -31% peak after two indexes and query tuning.
  • Business impact:
    • Quarter-end churn: -1.2 pts; NRR: +3 pts (CISO objections cooled with SLO reporting).
    • Sales cycle shortened ~12 days on enterprise deals with reliability requirements.
    • Roadmap regained ~15–20% capacity previously lost to incident toil.

“We didn’t slow down to pay debt. We sped up because we paid it.” — VP Eng

What we’d do again (and what to skip)

What worked:

  • Tie debt to SLOs and DORA or you’ll lose budget in the first exec review.
  • Stabilize deploys before refactors; Argo Rollouts + flags + auto-rollback is the speed enabler.
  • Attack the DB early. Nine out of ten “platform” fires start with queries and migrations.
  • Kill vanity microservices. Consolidation reduces blast radius and cost.
  • Make the debt budget visible. We kept a burn-up chart next to the roadmap.

What we’d skip next time:

  • “Dashboard-driven” canaries without automated rollback. Humans are too slow.
  • Global circuit breaker configs. Tune breakers per dependency; defaults hurt either reliability or latency.
  • Alert hoarding. If an alert didn’t lead to action in 30 days, we deleted or converted it.
  • Rewriting flaky tests before measuring flake rate. First, quarantine and parallelize; then fix the worst 10.

Actionable steps you can start this week:

  1. Write two SLOs and a burn policy. Put them in the runbook.
  2. Add a canary stage with auto-rollback on burn. No exceptions.
  3. Right-size requests/limits for your top 5 services; add HPAs.
  4. Identify the top 3 slowest queries via pg_stat_statements; add indexes, remove N+1s.
  5. Cap CI at 15 minutes: cache, split tests, fail fast.

If you’re here, we’ve been there

GitPlumbers exists to fix the plumbing so your teams can ship safely—AI-assisted code or legacy monolith, we’ve seen it. If you need a 90-day plan that doesn’t blow up your roadmap, we’ll bring the wrenches and leave you with working gauges, not PowerPoints.

Key takeaways

  • Debt is an interest payment on every deploy—track it, budget it, and pay it down intentionally.
  • Tie debt to business impact using DORA and SLOs or you’ll lose the exec room.
  • Fix CI/CD, observability, and DB hotspots first; they unlock everything else.
  • Use canaries, feature flags, and error budgets to ship safely while refactoring.
  • Right-size Kubernetes and cut noisy services; it funds your debt work with real savings.
  • Time-box a 90-day intervention with a visible debt backlog and weekly burn-up.

Implementation checklist

  • Define 2–3 product SLOs with error budgets tied to on-call policy.
  • Create a visible debt backlog scored by blast radius and MTTR impact.
  • Carve a 20% debt budget for 6 sprints; make it leadership-backed and non-negotiable.
  • Stabilize CI/CD: cache aggressively, parallelize tests, add canaries and automated rollbacks.
  • Instrument golden signals; add SLO burn alerts at 2%/5% budget burn rates.
  • Right-size K8s requests/limits; add HPAs; kill or merge low-value microservices.
  • Fix the top 3 DB hotspots (indexes, N+1s, slow migrations) before new features.
  • Review outcomes weekly; track DORA metrics and cost deltas; publicize wins.

Questions we hear from teams

How do I justify a debt budget to my CFO and CRO?
Translate it into DORA improvements and revenue protection. Show how a 20% capacity allocation yielded a 23% cloud cost reduction, 88% MTTR drop, and 3-pt NRR lift. That’s less discounting, fewer churn drivers, and reclaimed roadmap capacity. Treat it like capex with measurable payback, not a science project.
Won’t a debt focus slow down feature delivery?
Done right, it speeds you up. We stabilize deploys (canaries + auto-rollback), shorten CI, and reduce on-call toil first. Teams ship more often with less fear. In this case study, deploy frequency increased from 2/week to 25/day during the same quarter.
We’re in hypergrowth—how do we start without a moratorium on features?
Time-box a 90-day intervention with a 20% debt budget. Prioritize fixes that unlock speed and safety: CI under 15 minutes, SLOs with burn alerts, right-sized K8s, and the top 3 DB hotspots. You don’t need a freeze; you need guardrails.
What metrics should we track weekly?
DORA (deploy frequency, lead time, change failure rate, MTTR), SLO burn rate, alert volume, top incident categories, CI duration, and infra cost/unit (per req or per tenant). Visualize trend deltas; celebrate wins to protect the debt budget.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer · Get the SLO runbook
