From 1 Deploy/Week to 20/Day: The 90‑Day Tech Debt Cut That Paid for Itself
A high‑growth startup was burning cash and engineers. We killed the debt, tightened the loop, and turned release fear into release fuel—in one quarter.
The Startup: Growth, Gravity, and a Mountain of Debt
I got the call from a Series B SaaS startup (B2B checkout tooling) riding 8x YoY growth and dragging a canoe full of anchors: a half‑baked microservices sprawl, EKS running hot, CI held together by bash glue, and a pile of AI‑generated “vibe code” that looked right until you tried to scale it. They were six weeks out from Series C diligence. Freeze wasn’t an option; the roadmap was contractual.
Constraints we walked into:
- Stack: AWS EKS 1.27, Terraform 1.6, ArgoCD 2.9 (half‑deployed), Istio 1.20, Node.js 18, Aurora Postgres 13, ClickHouse for events, Prometheus Operator, Grafana, OpenTelemetry traces.
- Team: 42 engineers, 4 SREs, rotating on‑call with escalation to VPs every other week.
- Symptoms: 1 deploy/week, 35% change failure rate, MTTR 14 hours, p95 checkout latency 380 ms, cloud spend $420k/quarter.
“We thought we needed more engineers. What we needed was fewer footguns.”
Symptoms You Can’t Ignore
Here’s what we found in the first 10 days—none of this will shock anyone who’s lived through hypergrowth:
- Pipeline roulette: Three CI systems (GitHub Actions, a legacy Jenkins, and a bespoke runner). No consistent gates. Reverts were manual and political.
- Test flake hell: ~19% of merges re‑ran tests due to flakiness. A third of those were never root‑caused.
- Observability theater: Dashboards everywhere, SLOs nowhere. Alerts fired late and loud. MTTR suffered.
- AI‑code drift: Copilot‑crafted helpers duplicated across repos with inconsistent error handling and types. Good demos, bad production.
- Infra drift: Terraform modules forked across teams, tagging inconsistent, auto‑scaling disabled “temporarily.” Spend climbed regardless of traffic.
Business impact, in CFO‑speak:
- Revenue risk: Feature lead time stretched to 21 days; enterprise pilots were slipping.
- Support costs: On‑call burnout and credits issued for incidents.
- Dilution risk: Diligence flagged engineering execution as a financing risk.
What We Changed in 90 Days
We didn’t boil the ocean. We killed 20% of the debt that caused 80% of the pain. Playbook, with owners and timeboxes:
Standardize delivery with GitOps and guardrails (3 weeks)
- Consolidated CI to GitHub Actions → one ArgoCD App per service.
- Enforced canary deployment + auto‑rollback via Argo Rollouts and health checks.
- Required kubectl rollout health + smoke tests before promotion.
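The "health checks before promotion" gate can be expressed as an Argo Rollouts AnalysisTemplate queried against Prometheus. A minimal sketch — the Prometheus address, metric names, and success threshold here are illustrative, not the exact values we shipped:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: checkout
spec:
  metrics:
    - name: success-rate
      interval: 60s
      failureLimit: 1
      # Fail the canary if success rate drops below 99% over the window.
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="checkout",status=~"2.."}[2m]))
            / sum(rate(http_requests_total{app="checkout"}[2m]))
```

A Rollout references this by `templateName`, so a failing analysis aborts the canary and triggers the automated rollback.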
Put SLOs in writing and alerts on burn rate (2 weeks)
- Defined checkout SLO: 99.9% success, p95 latency < 200 ms.
- Added PrometheusRule multi‑window burn alerts; pager only on budget burn, not just errors.
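The multi‑window pattern pairs a fast window with a slow one so brief spikes don't page. A hedged sketch of the rule fragment — metric names mirror our checkout service, and the 14x fast‑burn threshold follows the Google SRE workbook convention:

```yaml
# Page only when BOTH the 5m and 1h windows exceed 14x budget burn.
- alert: CheckoutFastBurn
  expr: |
    (
      sum(rate(http_requests_total{app="checkout",status!~"2.."}[5m]))
      / sum(rate(http_requests_total{app="checkout"}[5m]))
      > (1 - 0.999) * 14
    )
    and
    (
      sum(rate(http_requests_total{app="checkout",status!~"2.."}[1h]))
      / sum(rate(http_requests_total{app="checkout"}[1h]))
      > (1 - 0.999) * 14
    )
  labels:
    severity: page
```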
Kill flake, shorten feedback (3 weeks)
- Quarantined flaky tests with a bot; owners had 2 weeks to fix or delete.
- Introduced ephemeral preview envs per PR (namespaced on EKS) for product sign‑off.
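Preview environments were one namespace per PR, created and torn down by CI. A sketch of the GitHub Actions wiring, assuming cluster credentials are already configured on the runner (overlay path and job names are illustrative):

```yaml
name: preview-env
on:
  pull_request:
    types: [opened, synchronize, closed]
jobs:
  deploy-preview:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to ephemeral namespace
        run: |
          NS="pr-${{ github.event.number }}"
          kubectl create namespace "$NS" --dry-run=client -o yaml | kubectl apply -f -
          kustomize build deploy/overlays/preview | kubectl -n "$NS" apply -f -
  teardown:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      - name: Delete preview namespace
        run: kubectl delete namespace "pr-${{ github.event.number }}" --ignore-not-found
```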
Refactor AI‑generated “vibe code” (2 weeks, parallel)
- Audited helpers; collapsed 11 error‑handling libs → 1 package.
- Gated PRs with eslint, tsc --noEmit, architectural rules (depcruise), and coverage checks.
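The architectural rules ran through dependency-cruiser's config file. A sketch of the kind of rules we gated on — the layer paths are illustrative for this codebase:

```javascript
// .dependency-cruiser.js
module.exports = {
  forbidden: [
    {
      name: 'no-http-from-domain',
      comment: 'Domain logic must not import transport code',
      severity: 'error',
      from: { path: '^src/domain' },
      to: { path: '^src/http' },
    },
    {
      // Circular imports were a major source of the duplicated helpers.
      name: 'no-circular',
      severity: 'error',
      from: {},
      to: { circular: true },
    },
  ],
};
```

Running `depcruise src` in CI fails the PR on any `error`-severity violation.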
Right‑size infra with Terraform and budgets (ongoing, 3 weeks heavy)
- Replaced snowflake stacks with reviewed modules; enforced tagging and budgets.
- Set HPA/VPA for hotspots; fixed noisy retry storms with circuit breakers.
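For the hotspots, a plain HPA targeting the Rollout was enough. A minimal sketch — the CPU threshold and replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout
  namespace: checkout
spec:
  scaleTargetRef:
    # Argo Rollouts objects can be HPA scale targets directly.
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: checkout
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
```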
DB reality check (1 week)
- Indexed slow paths, fixed N+1s, established p50/p95/p99 targets by endpoint.
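The N+1 fixes mostly followed one pattern: collect the IDs, issue a single query. A hedged sketch — `Db` here stands in for your Postgres client (e.g. `pg.Pool`), and the table and column names are hypothetical:

```typescript
type Order = { id: number; customerId: number };
type Customer = { id: number; name: string };

// Stand-in for a real Postgres client; only the shape we need.
interface Db {
  query(sql: string, params: unknown[]): Promise<Customer[]>;
}

// Before: one SELECT per order inside a loop (N+1).
// After: dedupe IDs and fetch all customers in a single round trip.
export async function customersForOrders(
  db: Db,
  orders: Order[]
): Promise<Map<number, Customer>> {
  const ids = [...new Set(orders.map((o) => o.customerId))];
  if (ids.length === 0) return new Map();
  const rows = await db.query(
    'SELECT id, name FROM customers WHERE id = ANY($1)',
    [ids]
  );
  return new Map<number, Customer>(rows.map((c) => [c.id, c] as [number, Customer]));
}
```

Pair this with an index on the join column and the per-endpoint p95 targets become achievable without touching the schema further.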
Code and Config We Actually Shipped
We get asked for receipts. Here are a few representative slices.
ArgoCD Application with Canary Rollout
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service
spec:
  project: default
  source:
    repoURL: https://github.com/acme/checkout
    targetRevision: main
    path: deploy/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: checkout
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 6
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 120}
        - setWeight: 50
        - pause: {duration: 300}
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
  selector:
    matchLabels: { app: checkout }
  template:
    metadata:
      labels: { app: checkout }
    spec:
      containers:
        - name: checkout
          image: ghcr.io/acme/checkout:1.23.4
          ports: [{containerPort: 8080}]
```

SLO Burn‑Rate Alert (Prometheus Operator)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-slo
  namespace: monitoring
spec:
  groups:
    - name: checkout.slo
      rules:
        - alert: CheckoutErrorBudgetBurn
          # Error ratio (non-2xx / total) compared against 14x budget burn.
          expr: |
            sum(rate(http_requests_total{app="checkout",status!~"2.."}[5m]))
              / sum(rate(http_requests_total{app="checkout"}[5m]))
            > (1 - 0.999) * 14
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Checkout SLO burn > 14x (5m)"
            runbook: "https://internal.runbooks/checkout-slo"
```

Terraform: Tagging, Budgets, and Right‑Sizing
```hcl
module "eks_nodegroup_checkout" {
  source         = "git::ssh://git@github.com/acme/infra-modules.git//eks-nodegroup?ref=v0.8.1"
  cluster_name   = module.eks.name
  name           = "checkout-ng"
  min_size       = 3
  max_size       = 12
  desired_size   = 6
  instance_types = ["m6i.large"]
  tags = {
    Service    = "checkout"
    Owner      = "payments"
    CostCenter = "cc-1234"
    Env        = "prod"
    SLO        = "checkout-99.9"
  }
}

resource "aws_budgets_budget" "checkout_prod" {
  name         = "checkout-prod-monthly"
  budget_type  = "COST"
  limit_amount = "60000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finops@acme.com", "sre@acme.com"]
  }
}
```

Feature Flag + Circuit Breaker (Node.js)
```typescript
import CircuitBreaker from 'opossum';
import fetch from 'node-fetch';
// legacyCheckout is the existing code path elsewhere in the service
// (import path illustrative).
import { legacyCheckout } from './legacy-checkout';

const FF_ROLLOUT_NEW_CHECKOUT = process.env.FF_ROLLOUT_NEW_CHECKOUT === 'true';

const pay = async (payload: any) => {
  const res = await fetch(process.env.PAYMENTS_URL!, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`payments ${res.status}`);
  return res.json();
};

const breaker = new CircuitBreaker(pay, {
  errorThresholdPercentage: 50,
  resetTimeout: 10000,
  rollingCountTimeout: 30000,
});

export async function checkout(payload: any) {
  if (!FF_ROLLOUT_NEW_CHECKOUT) return legacyCheckout(payload);
  try {
    return await breaker.fire(payload);
  } catch (e) {
    // Fallback + metric
    return legacyCheckout(payload);
  }
}
```

Ops hygiene we enforced in PRs
- eslint + tsc --noEmit + architectural import rules (dependency-cruiser).
- Coverage gates at 80% for changed lines (diff‑coverage).
- Canary required in deploy/overlays/*/kustomization.yaml.
- Preview env required for user‑facing changes.
Results You Can Take to the Board
In 90 days we moved the needle on both engineering and dollars. Actuals:
- Deploy frequency: 1/week → 20/day (median) across core services.
- Change failure rate: 35% → 7% (5x reduction).
- MTTR: 14h → 28m (burn‑rate alerts + rollback automation).
- Latency (p95 checkout): 380 ms → 110 ms (DB indexes + circuit breakers + autoscaling).
- Infra spend: $420k/quarter → $310k/quarter (−26%) without hurting SLOs.
- Lead time for changes: 21 days → 2 days (PR to production via GitOps + previews).
- On‑call pages: −62% pages/week; we removed “alert fatigue” as a risk.
Business outcomes:
- Sales velocity: Enterprise pilots unblocked; two deals closed citing “reliable deploys.”
- Retention: Support credits down 70%; NRR ticked up 3 points in the following quarter.
- Financing: Series C diligence flagged “mature SRE practices” as a strength.
Yes, the work paid for itself in ~6 weeks on infra savings alone—even before counting fewer incidents and faster features.
What We’d Do Differently Next Time
I’ve seen this movie a dozen times. A few scars worth sharing:
- Adopt fewer tools, more conventions. The team didn't need a new service mesh; they needed ArgoCD consistency and SLOs.
- Treat AI suggestions like junior devs. Great for scaffolding, terrible for standards. We now auto‑label "AI‑assisted" PRs and require senior review.
- Delete aggressively. Quarantining flaky tests is nice; deleting dead tests and code pays faster dividends.
- Push cost ownership to teams. Tagging + budgets per service turned cost from “cloud bill” to “your SLO tax.”
- Document rollback over root‑cause theater. A clean rollback path keeps Friday deploys safe; RCA can wait until Monday.
Playbook You Can Steal
If you’re staring at similar graphs, here’s a condensed sequence you can run without a full rewrite:
- Inventory pain by metric: change failure, MTTR, lead time, and cost per SLO.
- Pick three levers: GitOps + canary, SLOs + burn alerts, Terraform cost discipline.
- Add feature flags and circuit breakers to high‑risk endpoints.
- Quarantine and fix flake; stand up PR preview envs.
- Audit AI‑generated code for duplication and error handling; consolidate to one lib.
- Publish before/after dashboards; celebrate fewer pages as loudly as new features.
A quarter from now, you can be shipping daily without sweating the pager. That’s not optimism—that’s muscle memory you can buy with focus.
Key takeaways
- Tech debt is a business problem: reducing it can fund itself fast through higher deploy frequency, lower MTTR, and reduced cloud spend.
- Standardize the pipeline: GitOps with ArgoCD and canaries turns fear into fast, reversible change.
- Define and enforce SLOs with burn‑rate alerts—debate stops when error budgets are visible.
- Treat AI‑generated code as untrusted input; add quality gates and refactor to common libraries.
- Right‑size infra with Terraform modules and tags; measure cost per SLA, not per node.
- Make test flakiness illegal—quarantine and fix in 2 weeks or delete the test.
Implementation checklist
- Inventory debt by impact: change failure, MTTR, infra cost, lead time.
- Pick 3 levers for 90 days: CI/CD standardization, SLOs/observability, infra cost controls.
- Move to GitOps (`ArgoCD`) and require canary + automated rollback.
- Codify SLOs with `PrometheusRule` and burn‑rate alerts.
- Introduce feature flags + circuit breakers for high‑risk paths.
- Consolidate Terraform into reviewed modules; enforce tagging and budgets.
- Quarantine flaky tests and add preview envs to cut merge friction.
- Audit and refactor AI‑generated code; gate with linters, coverage, and architectural checks.
Questions we hear from teams
- How do you choose which technical debt to pay first?
- Score debt by business impact, not aesthetics. We use four numbers: change failure rate, MTTR, lead time, and cost per SLO. Anything that improves deploy safety (GitOps, canaries) and time‑to‑rollback usually wins in the first 90 days.
- Do we need a full platform team to adopt GitOps and SLOs?
- No. In this case, 2 SREs and 1 Staff Eng did the heavy lift. Start with one golden path service, templatize it, and scale by copying the pattern—not by inventing a platform.
- What about AI‑generated code—ban it or embrace it?
- Treat AI like a junior pair: useful scaffolding, untrustworthy defaults. Keep it, but gate it with linters, coverage, and architectural rules. Then refactor duplicated helpers into a hardened library.
- Will canaries slow us down?
- They speed you up. Automated promotion + rollback removes human bottlenecks and reduces blast radius. Median lead time dropped from 21 days to 2 days once canaries and GitOps were standard.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
