From 180 Microservices to 75: The Migration That Cut Ops Toil 45%
A real-world refactor of a sprawl of Kubernetes services into a manageable platform—without killing delivery velocity or breaking SOC 2.
The microservices migration that stopped the pager from dictating the roadmap
Two summers ago, a mid-market fintech (2,300 people, global B2B payments) asked GitPlumbers to help untangle a Kubernetes estate that had grown like ivy: 180 microservices spread across EKS 1.24–1.27, two cloud accounts, three CI systems, and enough Helm drift to make `helm diff` cry. On-call was a blood sport—SREs were averaging 20+ pages/week and product teams had normalized 3 a.m. canary rollbacks.
“We can’t keep hiring SREs to mask platform complexity.” — CTO
They didn’t want a rewrite. They wanted fewer moving parts, fewer 2 a.m. surprises, and the ability to ship without an incident budget line item. We delivered a migration that cut ops toil by 45% and reduced pages by 65%, while keeping deploy frequency steady. Here’s the real playbook—warts, tradeoffs, and the boring tech that actually works.
What we walked into
I’ve seen this movie before. Lots of good intentions, too many knobs:
- 180 services, 9 languages (heavy `Go` and `Node.js`, pockets of `Java 11` and `Python 3.9`).
- DIY service mesh: `Istio 1.16` with per-team `EnvoyFilter` snowflakes, mutual TLS misconfigurations, and seven flavors of `VirtualService`.
- Inconsistent deploys: Jenkins freestyle jobs, GitHub Actions, and a rogue `GitLab CI` island.
- Mix of Helm and raw `kubectl apply`; three different `values.yaml` conventions.
- Observability in name only: Prometheus scraping some namespaces, logs in `CloudWatch` and `Loki`, traces nowhere.
- Compliance guardrails bolted on after the fact: PSP deprecation half-migrated, `NetworkPolicy` optional, admission controllers inconsistent.
KPIs told the story:
- MTTR: 94 minutes (P1s).
- Change failure rate: 18% across critical services.
- Pages: 320/month across 6 SREs (~13/SRE/week).
- Cloud spend trending +11% QoQ without usage growth.
The constraints that made this hairy
- Zero downtime mandate: Payment rails can’t go dark. No “big bang.”
- SOC 2 + PCI DSS: Audit trails for deploys, immutable infra changes, and access boundaries.
- Multi-region active/active: US-East + EU-West with data residency constraints.
- No feature freeze: Product kept shipping; we had to thread the needle.
- Budget-aware: No six-figure platform licenses; prioritize ROI and boring tech.
What we changed, in the order that worked
If you only take one thing: sequence matters. We didn’t start with a new mesh. We started with a map.
Service taxonomy and consolidation
- We built a catalog in `Backstage` and tagged every service by domain, data criticality, deploy cadence, and runtime complexity (see the catalog sketch after this list). Call graphs from `pyroscope` sampling and `vflow` interrogations informed coupling.
- Merged 42 nanoservices into 12 domain services where 90% of deploys and rollbacks were correlated. Yes, we ate some repo and contract churn. Delivery got simpler.
- Rule of thumb: if two services always ship together, share the same pager, and roll back together, they’re the same service.
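For teams replicating the tagging, it lived in each service’s `catalog-info.yaml`. A minimal sketch of the shape (the component name, owner, and tag values are illustrative, not the client’s actual catalog):

```yaml
# catalog-info.yaml — registered in Backstage; tags drive the consolidation review
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments                  # illustrative service name
  tags:
    - domain-ledger               # business domain
    - criticality-tier1           # data criticality
    - cadence-daily               # deploy cadence
spec:
  type: service
  lifecycle: production
  owner: team-ledger              # illustrative owning team
```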
Standardized cluster baseline (EKS, one per env per region)
- Unified on `EKS 1.27` with managed node groups and `Bottlerocket` for stateless pools.
- Replaced deprecated PSP with `Pod Security Admission` and enforced policies via `Kyverno` (namespace-label sketch after the policy example below).
- Locked logging to `Fluent Bit -> Grafana Loki`, metrics via `Prometheus Operator`.
Example Kyverno policy to block privileged pods:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: enforce
  rules:
    - name: no-privileged
      match:
        resources:
          kinds: [Pod]
      validate:
        message: Privileged mode is not allowed
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false
```
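The `Pod Security Admission` half of that baseline is just namespace labels. A sketch of an enforced namespace, assuming the `restricted` profile (namespace name illustrative; tune the level per workload class):

```yaml
# Namespace opted into Pod Security Admission enforcement
# (the "restricted" level is an assumption, not the client's exact setting)
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                   # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```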
GitOps with ArgoCD (app-of-apps)
- No more imperative `kubectl`. One repo per service, one `environment` repo per env. ArgoCD managed all workloads.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ledgerloop/infra-environments
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
- Helm stayed, but we normalized charts and values. Some teams moved to `kustomize` overlays where it simplified deltas.
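Where teams took the `kustomize` route, the overlay stayed deliberately thin: a shared base, a per-environment delta, and an image tag that CI bumps by PR. A sketch of the shape, with paths and the tag value as placeholders rather than the client’s actual layout:

```yaml
# clusters/prod/payments/kustomization.yaml — prod overlay on a shared base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/payments        # common Deployment/Service manifests
patches:
  - path: replicas-prod.yaml      # prod-only delta (replica count, resources)
images:
  - name: ghcr.io/ledgerloop/payments
    newTag: 3f9c2ab               # placeholder; bumped by CI via PR, synced by ArgoCD
```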
Simplified the mesh: Istio out, Linkerd + Gateway API in
- I love Istio for complex edge cases. This estate didn’t need it. We removed 80% of routing config by moving to `Linkerd 2.14` (mTLS, retries, timeouts) and `Gateway API` for north-south (route sketch after the TrafficSplit example below).
- Canary via `TrafficSplit` beat five layers of `EnvoyFilter` magic.
```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: payments
  namespace: prod
spec:
  service: payments
  backends:
    - service: payments-v1
      weight: 80
    - service: payments-v2
      weight: 20
```
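North-south stayed equally plain: one shared `Gateway` per cluster and an `HTTPRoute` per domain service. A representative route, where the gateway name, namespace, hostname, and port are placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments
  namespace: prod
spec:
  parentRefs:
    - name: public-gateway          # shared cluster Gateway (placeholder name)
      namespace: infra-ingress      # placeholder namespace
  hostnames:
    - payments.example.com          # placeholder hostname
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: payments
          port: 8080                # placeholder service port
```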
Observability that enforces reality
- Standardized on `OpenTelemetry` SDKs exporting to the Collector, metrics scraped by Prometheus, logs in Loki, traces in `Tempo` (Collector sketch after the SLO example below).
- SLOs codified with `Sloth` and alerts wired to on-call rotations per domain.
Example SLO for `payments` availability:

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-availability
  namespace: slo
spec:
  service: payments
  labels:
    team: ledger
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="payments",status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total{job="payments"}[5m]))
      alerting:
        name: payments-availability
        labels:
          severity: page
        annotations:
          summary: Payments availability SLO burn
```
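The Collector config behind the OpenTelemetry bullet is short. A sketch of the traces and metrics pipelines, with illustrative in-cluster endpoints; logs stay on the `Fluent Bit -> Loki` path rather than flowing through the Collector:

```yaml
# otel-collector config: apps export OTLP; traces go to Tempo, metrics get scraped
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317   # illustrative Tempo address
    tls:
      insecure: true                                  # in-cluster; tighten to fit your PCI scope
  prometheus:
    endpoint: 0.0.0.0:8889                            # scrape target for the Prometheus Operator
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```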
Paved-path CI/CD templates (GitHub Actions)
- We didn’t outlaw experimentation; we made the golden path easier. Reusable workflow with build, test, `trivy` scan, and ArgoCD image tag bump via PR to the env repo (caller sketch below).
```yaml
name: ci
on: [push]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.21' }
      - run: go test ./...
      - uses: aquasecurity/trivy-action@master
        with: { scan-type: 'fs', ignore-unfixed: true }
      - run: |
          docker build -t ghcr.io/ledgerloop/payments:${{ github.sha }} .
          echo "image: ghcr.io/ledgerloop/payments:${{ github.sha }}" > image.txt
      - name: bump env
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: Bump payments image
          title: Bump payments image
          branch: bump/payments-${{ github.sha }}
          path: infra-environments/clusters/prod/payments
```
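Because the workflow is reusable, service repos don’t copy that YAML; they call it. A sketch of the caller side, assuming the paved-path workflow lives in a hypothetical `platform-workflows` repo and exposes `on: workflow_call`:

```yaml
# .github/workflows/ci.yaml in a service repo (hypothetical caller)
name: ci
on: [push]
jobs:
  paved-path:
    # shared golden-path workflow; repo, file, and tag names are illustrative
    uses: ledgerloop/platform-workflows/.github/workflows/go-service.yaml@v1
    with:
      service-name: payments       # hypothetical input defined by the shared workflow
    secrets: inherit
```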
Infra as Code everywhere (Terraform)
- One `terraform` root per account, modules for clusters, node pools, gateways, and secrets. No click-ops.
module "eks" { source = "terraform-aws-modules/eks/aws" version = "20.8.3" cluster_name = "prod-us-east" cluster_version = "1.27" eks_managed_node_groups = { stateless = { instance_types = ["m6g.large"], desired_size = 6 } stateful = { instance_types = ["r6g.large"], taints = [{ key = "stateful", value = "true", effect = "NO_SCHEDULE" }] } } }
None of this is rocket science. The trick was sequencing, guardrails, and holding the line on paved paths.
Results you can take to the board
Six months, zero downtime, and metrics that mattered:
- Services: 180 -> 75 (58% reduction; 42 merged, 63 retired, 20 kept as-is).
- Pages: 320 -> 112/month (−65%).
- MTTR: 94 -> 22 minutes (−77%).
- Change failure rate: 18% -> 6% (tracked via ArgoCD health + rollbacks).
- Deploy frequency: steady at ~240 deploys/week, but with fewer rollbacks (−71%).
- Cloud spend: −28% on compute and data egress, mostly from right-sized node pools and fewer cross-service hops.
- Tickets/SRE/month: 52 -> 28 (−46%).
- Audit findings: 0 material issues; PCI scope simplified due to consistent ingress/egress patterns.
The CTO didn’t need a slide deck—the burn chart on pages and cost told the story.
What I'd repeat—and what I'd skip next time
What worked:
- Consolidation first, platform second. Merge nanoservices before arguing about the mesh.
- Git as the single source of truth. ArgoCD’s drift detection paid for itself the first weekend we avoided a mystery hotfix.
- SLOs before dashboards. Alert on burn, not noise. Pages dropped because the pager stopped lying.
- Boring defaults. `EKS + Linkerd + Prometheus Operator + ArgoCD` handled 90% of cases without bespoke YAML.
What I’d do differently:
- Earlier repo scaffolding. We waited on Backstage templates. Ship them day one and avoid template drift.
- Avoid “temporary” mesh overlap. We ran Istio and Linkerd in parallel for two weeks; it complicated root cause. Move service families wholesale.
- Budget a deprecation sprint. Killing dead Helm charts took longer than it should have. Timebox it and be ruthless.
Steal this and adapt it
Here’s the short version you can run next quarter without a feature freeze:
- Build a service catalog and tag by domain, cadence, and criticality.
- Merge coupled nanoservices; retire the zombies.
- Standardize one cluster baseline per env; enforce `Pod Security Admission` and `Kyverno`.
- Move deploys to GitOps with ArgoCD; app-of-apps for platform.
- Simplify the network path (Linkerd + Gateway API) and introduce `TrafficSplit` canaries.
- Instrument with OpenTelemetry; define SLOs via `Sloth`; alert on burn rate.
- Ship a golden CI/CD workflow and enforce via templates and scorecards.
- Track toil weekly: pages, MTTR, change failure rate, tickets/SRE, and spend. Celebrate deltas.
If you need a partner who’s done this under SOC 2/PCI pressure without pausing product, GitPlumbers has the scars and the receipts. Let’s make your pager boring again.
Key takeaways
- Consolidate nanoservices ruthlessly—merge by domain and failure blast radius, not org chart.
- Pick boring tech for the core path: EKS + ArgoCD + Linkerd + Prometheus Operator is plenty for 90% of use cases.
- GitOps isn’t magic; you need a service taxonomy, repo structure, and paved path templates to avoid drift.
- Measure what matters: SLOs, change failure rate, MTTR, pager volume, and tickets per SRE.
- Reduce mesh complexity before you scale it—ambient promises don’t fix your routing graph.
- No freeze required: migrate incrementally with canaries and traffic splitting per service family.
Implementation checklist
- Inventory and categorize services by domain, data criticality, and runtime complexity.
- Merge nanoservices where call graphs and deploy cadence are tightly coupled.
- Standardize cluster baselines (version, PSP replacement, network policy, logging).
- Adopt ArgoCD app-of-apps and lock deployments to Git as source of truth.
- Simplify the mesh or remove it; start with least surprise for 80% traffic paths.
- Instrument services with OpenTelemetry and define SLOs with Sloth.
- Ship a golden CI/CD template and enforce with repo scaffolding and scorecards.
- Track toil: pages/SRE/month, tickets/SRE, MTTR, change failure rate, and cloud spend.
Questions we hear from teams
- Why did you replace Istio instead of fixing it?
- The estate didn’t need Istio’s feature set, and its configuration surface was causing operator error. Linkerd delivered mTLS, retries, timeouts, and simple canaries with a fraction of the YAML. When you’re fighting toil, choose the smallest tool that meets your 80% path and simplify first.
- Did consolidation slow delivery for teams?
- Short term, yes—merging 42 nanoservices into 12 domains required interface changes and shared repos. We mitigated with temporary adapters and parallel releases. Net effect after two sprints: fewer coordinated deploys, fewer rollbacks, and faster root cause analysis.
- Why ArgoCD over Flux?
- Both are solid. The org already had ArgoCD expertise, and its UI + app-of-apps model fit their platform team’s mental model. Flux would also have worked; the value is in GitOps discipline, not the specific tool.
- How did you avoid downtime during the migration?
- We migrated per service family with `TrafficSplit` canaries, kept old and new paths live, and rolled forward only after SLO burn stayed below thresholds for 24 hours. Database changes were backward compatible and gated via feature flags.
- What metrics should I track to prove success?
- Pages per SRE, MTTR, change failure rate, deploy frequency, rollback rate, tickets per SRE, and cloud spend. Tie alerts to SLO burn and ensure every change is traceable back to Git.