The Rewrite We Didn’t Ship: 90 Days of Tech-Debt Paydown Dropped MTTR 85% and Cut Cloud Spend 24%

A high-growth startup was stuck in incident purgatory and AWS sticker shock. We killed three classes of debt, avoided a risky rewrite, and bought back 30% of engineering time in one quarter.


The rewrite we didn’t ship

I walked into a Monday exec sync where the VP Eng had a deck titled “Greenfield or Graveyard.” Classic. Incidents every week, AWS spend doubling quarter-over-quarter, and a sales team throwing shade about “demo-safe” windows. A senior IC pitched a total rewrite to a new stack. I’ve seen that movie. The sequel is called missed roadmap, attrition, and a bridge round.

We didn’t rewrite. We paid down surgical technical debt for 90 days and took the engineering org from panic deploys to predictable throughput. Here’s exactly what we changed and the business impact it created.

What we walked into

  • Context: B2B SaaS, PLG + enterprise. Traffic +40% MoM. Core stack: Node 18, Kubernetes 1.27 on EKS, PostgreSQL 13 (RDS), Kafka for ingestion, Redis for cache.
  • Constraints: 12 squads, thin SRE team (2 people), on-call burnout, “AI-assisted” code peppered everywhere—some of it pure vibe coding from early growth. Lead time for changes was ~2 days. MTTR averaged 3 hours. Change failure rate sat at ~23%.
  • Tool sprawl: Jenkins, GitHub Actions, manual kubectl apply from laptops (yes, really), three feature-flag providers (LaunchDarkly, a homebrew JSON, and OpenFeature experiments), and dashboards that looked pretty but didn’t answer “is the user okay?”

Where the debt was hiding

We mapped incidents for 90 days and scored debt by how often it hurt revenue and how long it blocked teams.

  • Deploy safety debt: No GitOps, no canaries, no progressive delivery. Rollbacks were manual and slow. Circuit breakers didn’t exist. One bad merge could take prod down for an hour.
  • Observability debt: Metrics for CPU but not SLIs. Logs without trace context. Pager alerts for noise, not for error budget burn. Debugging took forever.
  • Infra cost debt: Orphaned EBS, over-provisioned nodes, burstable instance roulette, Kafka retention set to “forever.” Tagging was optional; forecasting was fiction.
  • Code health debt: AI-generated code (“helpful” ChatGPT and Copilot drops) that duplicated patterns, ignored domain boundaries, and introduced silent retries. Reviews turned into archaeology.

“If you can’t see it, you can’t fix it. If you can’t deploy it safely, you won’t.”

What we changed in 90 days

We picked three levers with straight-line business impact.

  1. Make deployments boring

    • Standardized on ArgoCD for GitOps.
    • Introduced canary deployment via Istio and enforced circuit breaker defaults.
    • Mandated feature flags (LaunchDarkly) for risky paths; killed the homebrew flag JSON.
  2. Make debugging fast

    • Defined SLIs/SLOs for signup, checkout, and ingestion. Alerted on SLO burn via Prometheus.
    • Added tracing (OpenTelemetry exported to Jaeger) and log correlation.
    • Built a single on-call dashboard per domain with “are users okay?” at the top.
  3. Make infra spend predictable

    • Moved to Terraform with tagging, budgets, and autoscaling policy baselines.
    • Right-sized EKS nodes, shifted EBS to gp3, fixed Kafka retention, and enabled downscaling windows.
    • Weekly cost reviews tied to product traffic and SLO risk.
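To make lever 2 concrete, here's a minimal sketch of what an SLI for checkout looks like as a Prometheus recording rule, assuming a standard `http_requests_total` counter with `route` and `code` labels (the metric and label names are illustrative, not our actual config):

```yaml
# prometheus/rules/checkout-sli.yaml (hypothetical path)
groups:
  - name: checkout-sli
    rules:
      # Availability SLI: fraction of checkout requests that did not 5xx
      - record: sli:checkout_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="web", route="/checkout", code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="web", route="/checkout"}[5m]))
```

The SLO is then just a target on that recorded series (say, 99.5% over 30 days), and burn-rate alerts compare the observed error rate against the remaining budget.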

Implementation details you can steal

  • ArgoCD application for production sync with auto-prune and self-heal:
# apps/prod-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-prod
spec:
  project: default
  source:
    repoURL: https://github.com/acme/web
    targetRevision: main
    path: k8s/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  • Istio canary with circuit breaker (outlier detection) baked in:
# k8s/overlays/prod/virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-svc
spec:
  host: web-svc
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 1s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:  # referenced by the VirtualService below
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-svc
spec:
  hosts: ["web.example.com"]
  http:
    - route:
        - destination: { host: web-svc, subset: canary }
          weight: 10
        - destination: { host: web-svc, subset: stable }
          weight: 90

Roll forward by bumping weights in git and letting ArgoCD sync. If 5xx spikes, Istio’s outlier detection ejects the unhealthy endpoints from the load-balancing pool. No heroics, no midnight kubectl roulette.

  • Prometheus burn-rate alert for fast error-budget burn (per the SRE Workbook):
# prometheus/alerts/slo-burn.yaml
groups:
- name: slo-burn
  rules:
  - alert: SLOHighBurn
    expr: (
      sum(rate(http_requests_total{job="web", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="web"}[5m]))
    ) > (14.4 * (1 - 0.995))
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Fast error budget burn for web"
      description: "Error budget burning too fast. Consider rollback or canary freeze."
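A fast window alone gets noisy on low traffic. A slow-burn companion rule (a sketch using the SRE Workbook's 6x-over-6h pairing and the same assumed request metrics) catches gradual budget burn and files a ticket instead of paging:

```yaml
# prometheus/alerts/slo-burn-slow.yaml (hypothetical path)
groups:
- name: slo-burn-slow
  rules:
  - alert: SLOSlowBurn
    expr: (
        sum(rate(http_requests_total{job="web", code=~"5.."}[6h]))
        / sum(rate(http_requests_total{job="web"}[6h]))
      ) > (6 * (1 - 0.995))
    for: 30m
    labels:
      severity: ticket  # no page; handle during business hours
    annotations:
      summary: "Slow error budget burn for web"
      description: "Budget burning steadily over hours. Schedule a fix; no page needed."
```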
  • Terraform guardrails for tagging, storage, and budgets:
# terraform/modules/aws_baseline/main.tf
resource "aws_ebs_volume" "default" {
  # ...
  type = "gp3"
  encrypted = true
  tags = merge(var.default_tags, { component = var.component })
}

resource "aws_budgets_budget" "monthly" {
  name              = "prod-monthly"
  budget_type       = "COST"
  limit_amount      = var.monthly_limit
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  cost_types { include_credit = true }
  notification {
    comparison_operator = "GREATER_THAN"
    notification_type   = "ACTUAL"
    threshold           = 80
    threshold_type      = "PERCENTAGE"
    subscriber_email_addresses = var.owner_emails
  }
}
  • Kill vibe code and AI hallucinations with a single standard:
# enforce one toolchain in CI
pnpm dlx @biomejs/biome ci .   # check-only in CI; never --write
pnpm run typecheck
pnpm run test:unit
pnpm run test:contract

We ripped out duplicated “AI-generated helpers,” moved to one module per domain, and codified patterns. Code reviews dropped from archaeology to deltas.
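One way to wire that gate into CI, sketched as a GitHub Actions workflow (the `typecheck` and `test:*` scripts assume matching entries in `package.json`):

```yaml
# .github/workflows/ci.yml (hypothetical)
name: ci
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
          cache: pnpm
      - run: pnpm install --frozen-lockfile
      - run: pnpm dlx @biomejs/biome ci .   # lint + format check, no writes
      - run: pnpm run typecheck
      - run: pnpm run test:unit
      - run: pnpm run test:contract
```

Required-status-check the `gate` job on `main` and the standard enforces itself; nobody has to argue style in review.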

Business results that mattered

We measured before and after. No vanity metrics.

  • MTTR: 3h → 27m (−85%). Faster rollback/cutover and real SLO paging.
  • Deployment frequency: 2/week → 30/day. GitOps + canaries + flags turned deploys into routine.
  • Lead time for changes: ~2 days → 45 minutes. CI/CD caching, smaller PRs, and pre-prod parity.
  • Change failure rate: 23% → 7%. Canary + outlier ejection caught bad releases early.
  • AWS spend: −24% (EKS right-sizing, gp3, autoscaling windows, Kafka retention). Variance tightened; finance stopped guessing.
  • Incidents: P0/P1 from 9/month → 2/month. On-call pages dropped 63%.
  • Product impact: Checkout success +3.2% (stability + tracing fixed a flaky retry). Churn down 0.6 pts; sales stopped asking for “demo-safe windows.”

Timeline: we saw early wins in week 2 with GitOps and canaries; cost curve bent by week 5; SLO-driven alerts stabilized on-call by week 7; the rest was compounding.

Lessons learned and what we’d do differently

  • Pick three debts max. We focused on deploy safety, observability, and infra cost. Everything else waited.
  • Tie debt to revenue. “This alert fires when we burn 4h of error budget” gets funded. CPU graphs don’t.
  • Standardize or die. One feature-flag system (LaunchDarkly or OpenFeature with Flagsmith) beats three half-broken ones.
  • Avoid rewrites. Use the strangler pattern: wrap old endpoints, isolate behind Istio, and pay down the seam.
  • Make rollbacks blameless and instant. If rollbacks are painful, you’ll keep broken code in prod to “save face.”
  • Chaos test the happy path. Light Chaos Engineering—drop Kafka for 30s in staging, verify circuit breaker and retries, then ship.
  • Don’t ignore AI code rot. We found non-idempotent scripts and silent retries copied from “vibe coding.” Do a vibe code cleanup pass, document patterns, and gate with CI.
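The Kafka-drop experiment doesn't need a chaos platform. A deny-all NetworkPolicy on the staging brokers (a sketch, assuming the Kafka pods carry an `app: kafka` label) simulates the outage:

```yaml
# chaos/isolate-kafka.yaml — staging only (hypothetical labels/namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: chaos-isolate-kafka
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: kafka
  policyTypes: ["Ingress"]
  ingress: []   # no rules = deny all inbound connections to the brokers
```

Apply it, wait 30 seconds, delete it, and watch whether circuit breakers open and retries back off instead of stampeding the brokers when connectivity returns.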

If you’re here right now

If your team is arguing about rewrites while the business is bleeding, you don’t need a prettier deck—you need safer deploys, real SLOs, and cost controls that stick.

GitPlumbers comes in as the calm adult: we map the top-10 debts, implement GitOps + canaries, wire SLOs to the pager, and Terraform the cost basics so finance smiles. It’s not magic; it’s boring, proven engineering. And it works.

Ping us when you’re ready to buy back engineering time without a rewrite. We’ll bring the guardrails and the receipts.


Key takeaways

  • You don’t need a rewrite to get outsized gains—target the three debts that hurt deploy safety, debugging speed, and infra costs.
  • Codify deployments with `GitOps` (ArgoCD) and add canaries plus circuit breakers (Istio). It’s the fastest way to lower change failure rate.
  • Make SLOs the center of gravity. Alert on burn rate, not CPU. Engineers respond to what leadership measures.
  • Kill ‘vibe code’ and AI-generated inconsistencies by standardizing patterns and linters; this shrinks cognitive load and review time.
  • Infra cost wins come from boring fixes: right-size, tag, autoscale, storage tiers. Use Terraform to make it stick.

Implementation checklist

  • Define 3-5 SLIs and SLOs that match real customer pain.
  • Automate deployments with `ArgoCD` and add `canary` + `circuit breaker` in the mesh.
  • Create a `Technical Debt Wall`: 10 highest-impact items with owner, ETA, and expected KPI movement.
  • Instrument user journeys end-to-end; budget time for `Prometheus`/`Grafana` and tracing.
  • Standardize on one feature flag system and one testing pyramid.
  • Move to infra-as-code (`Terraform`) with enforced tagging and budgets.
  • Run weekly cost and error-budget reviews; decide work from data, not vibes.

Questions we hear from teams

Why not just rewrite the system?
Rewrites rarely land faster than 12–18 months and you carry the old system while building the new one. Debt compounds on both. We targeted deploy safety, observability, and infra cost—the three areas with immediate impact on MTTR, change failure rate, and margin—so the business could keep shipping.
Do we need Istio to do canaries and circuit breakers?
No, but a mesh like Istio makes it consistent. You can start with NGINX annotations for canaries and app-level circuit breakers (e.g., `resilience4j`). The key is automated rollout/rollback and health-based gating.
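For the ingress-nginx route, the canary annotations handle weighted splitting (a sketch; hostnames and service names are placeholders):

```yaml
# canary receives 10% of traffic; bump the weight in git to roll forward
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
    - host: web.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-canary
                port:
                  number: 80
```

Pair it with an identical non-canary Ingress pointing at the stable service, and you get the same gradual-rollout workflow without a mesh.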
How do we avoid AI-generated ‘vibe code’ in the future?
Keep AI in the loop but enforce patterns: architecture decision records (ADRs), shared libraries, strong linters/formatters, and CI gates. Run periodic `AI code refactoring` sprints to consolidate helpers and remove dead code. Treat Copilot/ChatGPT as interns: review everything.
What’s the minimal tool stack to replicate this?
ArgoCD for GitOps, your feature flags of choice (LaunchDarkly/OpenFeature), Prometheus + Grafana for SLOs, OpenTelemetry for tracing, Terraform for infra guardrails. If you’re on a PaaS (Fly.io, Render), you can still apply the same principles with their equivalents.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers · See how we fix tech debt without rewrites
