From 180 Microservices to 75: The Migration That Cut Ops Toil 45%
A real-world refactor of a sprawl of Kubernetes services into a manageable platform—without killing delivery velocity or breaking SOC 2.
The microservices migration that stopped the pager from dictating the roadmap
Two summers ago, a mid-market fintech (2,300 people, global B2B payments) asked GitPlumbers to help untangle a Kubernetes estate that had grown like ivy: 180 microservices spread across EKS 1.24–1.27, two cloud accounts, three CI systems, and enough Helm drift to make `helm diff` cry. On-call was a blood sport—SREs were averaging 20+ pages/week and product teams had normalized 3 a.m. canary rollbacks.
“We can’t keep hiring SREs to mask platform complexity.” — CTO
They didn’t want a rewrite. They wanted fewer moving parts, fewer 2 a.m. surprises, and the ability to ship without an incident budget line item. We delivered a migration that cut ops toil by 45% and reduced pages by 65%, while keeping deploy frequency steady. Here’s the real playbook—warts, tradeoffs, and the boring tech that actually works.
What we walked into
I’ve seen this movie before. Lots of good intentions, too many knobs:
- 180 services, 9 languages (heavy `Go` and `Node.js`, pockets of `Java 11` and `Python 3.9`).
- DIY service mesh: `Istio 1.16` with per-team `EnvoyFilter` snowflakes, mutual TLS misconfigurations, and seven flavors of `VirtualService`.
- Inconsistent deploys: Jenkins freestyle jobs, GitHub Actions, and a rogue `GitLab CI` island.
- Mix of Helm and raw `kubectl apply`; three different `values.yaml` conventions.
- Observability in name only: Prometheus scraping some namespaces, logs in `CloudWatch` and `Loki`, traces nowhere.
- Compliance guardrails bolted on after the fact: PSP deprecation half-migrated, `NetworkPolicy` optional, admission controllers inconsistent.
KPIs told the story:
- MTTR: 94 minutes (P1s).
- Change failure rate: 18% across critical services.
- Pages: 320/month across 6 SREs (~13/SRE/week).
- Cloud spend trending +11% QoQ without usage growth.
The constraints that made this hairy
- Zero downtime mandate: Payment rails can’t go dark. No “big bang.”
- SOC 2 + PCI DSS: Audit trails for deploys, immutable infra changes, and access boundaries.
- Multi-region active/active: US-East + EU-West with data residency constraints.
- No feature freeze: Product kept shipping; we had to thread the needle.
- Budget-aware: No six-figure platform licenses; prioritize ROI and boring tech.
What we changed, in the order that worked
If you only take one thing: sequence matters. We didn’t start with a new mesh. We started with a map.
Service taxonomy and consolidation
- We built a catalog in `Backstage` and tagged every service by domain, data criticality, deploy cadence, and runtime complexity (see the catalog sketch after this list). Call graphs from `pyroscope` sampling and `vflow` interrogations informed coupling.
- Merged 42 nanoservices into 12 domain services where 90% of deploys and rollbacks were correlated. Yes, we ate some repo and contract churn. Delivery got simpler.
- Rule of thumb: if two services always ship together, share the same pager, and roll back together, they’re the same service.
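For teams replicating the tagging, it lived in each service’s `catalog-info.yaml`. A minimal sketch of the shape (the component name, owner, and tag values are illustrative, not the client’s actual catalog):

```yaml
# catalog-info.yaml — registered in Backstage; tags drive the consolidation review
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments                  # illustrative service name
  tags:
    - domain-ledger               # business domain
    - criticality-tier1           # data criticality
    - cadence-daily               # deploy cadence
spec:
  type: service
  lifecycle: production
  owner: team-ledger              # illustrative owning team
```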
Standardized cluster baseline (EKS, one per env per region)
- Unified on `EKS 1.27` with managed node groups and `Bottlerocket` for stateless pools.
- Replaced deprecated PSP with `Pod Security Admission` and enforced policies via `Kyverno` (namespace-label sketch after the policy example below).
- Locked logging to `Fluent Bit -> Grafana Loki`, metrics via `Prometheus Operator`.
Example Kyverno policy to block privileged pods:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: enforce
  rules:
    - name: no-privileged
      match:
        resources:
          kinds: [Pod]
      validate:
        message: Privileged mode is not allowed
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false
```
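The `Pod Security Admission` half of that baseline is just namespace labels. A sketch of an enforced namespace, assuming the `restricted` profile (namespace name illustrative; tune the level per workload class):

```yaml
# Namespace opted into Pod Security Admission enforcement
# (the "restricted" level is an assumption, not the client's exact setting)
apiVersion: v1
kind: Namespace
metadata:
  name: payments                                   # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```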
GitOps with ArgoCD (app-of-apps)
- No more imperative `kubectl`. One repo per service, one `environment` repo per env. ArgoCD managed all workloads.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/ledgerloop/infra-environments
    targetRevision: main
    path: clusters/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
- Helm stayed, but we normalized charts and values. Some teams moved to `kustomize` overlays where it simplified deltas.
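Where teams took the `kustomize` route, the overlay stayed deliberately thin: a shared base, a per-environment delta, and an image tag that CI bumps by PR. A sketch of the shape, with paths and the tag value as placeholders rather than the client’s actual layout:

```yaml
# clusters/prod/payments/kustomization.yaml — prod overlay on a shared base
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../base/payments        # common Deployment/Service manifests
patches:
  - path: replicas-prod.yaml      # prod-only delta (replica count, resources)
images:
  - name: ghcr.io/ledgerloop/payments
    newTag: 3f9c2ab               # placeholder; bumped by CI via PR, synced by ArgoCD
```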
Simplified the mesh: Istio out, Linkerd + Gateway API in
- I love Istio for complex edge cases. This estate didn’t need it. We removed 80% of routing config by moving to `Linkerd 2.14` (mTLS, retries, timeouts) and `Gateway API` for north-south (route sketch after the TrafficSplit example below).
- Canary via `TrafficSplit` beat five layers of `EnvoyFilter` magic.
```yaml
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: payments
  namespace: prod
spec:
  service: payments
  backends:
    - service: payments-v1
      weight: 80
    - service: payments-v2
      weight: 20
```
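North-south stayed equally plain: one shared `Gateway` per cluster and an `HTTPRoute` per domain service. A representative route, where the gateway name, namespace, hostname, and port are placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: payments
  namespace: prod
spec:
  parentRefs:
    - name: public-gateway          # shared cluster Gateway (placeholder name)
      namespace: infra-ingress      # placeholder namespace
  hostnames:
    - payments.example.com          # placeholder hostname
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: payments
          port: 8080                # placeholder service port
```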
Observability that enforces reality
- Standardized on `OpenTelemetry` SDKs exporting to the Collector, metrics scraped by Prometheus, logs in Loki, traces in `Tempo` (Collector sketch after the SLO example below).
- SLOs codified with `Sloth` and alerts wired to on-call rotations per domain.
Example SLO for `payments` availability:

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-availability
  namespace: slo
spec:
  service: payments
  labels:
    team: ledger
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="payments",status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total{job="payments"}[5m]))
      alerting:
        name: payments-availability
        labels:
          severity: page
        annotations:
          summary: Payments availability SLO burn
```
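The Collector config behind the OpenTelemetry bullet is short. A sketch of the traces and metrics pipelines, with illustrative in-cluster endpoints; logs stay on the `Fluent Bit -> Loki` path rather than flowing through the Collector:

```yaml
# otel-collector config: apps export OTLP; traces go to Tempo, metrics get scraped
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/tempo:
    endpoint: tempo-distributor.observability:4317   # illustrative Tempo address
    tls:
      insecure: true                                  # in-cluster; tighten to fit your PCI scope
  prometheus:
    endpoint: 0.0.0.0:8889                            # scrape target for the Prometheus Operator
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```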
Paved-path CI/CD templates (GitHub Actions)
- We didn’t outlaw experimentation; we made the golden path easier. Reusable workflow with build, test, `trivy` scan, and ArgoCD image tag bump via PR to the env repo (caller sketch below).
```yaml
name: ci
on: [push]
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.21' }
      - run: go test ./...
      - uses: aquasecurity/trivy-action@master
        with: { scan-type: 'fs', ignore-unfixed: true }
      - run: |
          docker build -t ghcr.io/ledgerloop/payments:${{ github.sha }} .
          echo "image: ghcr.io/ledgerloop/payments:${{ github.sha }}" > image.txt
      - name: bump env
        uses: peter-evans/create-pull-request@v5
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: Bump payments image
          title: Bump payments image
          branch: bump/payments-${{ github.sha }}
          path: infra-environments/clusters/prod/payments
```
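Because the workflow is reusable, service repos don’t copy that YAML; they call it. A sketch of the caller side, assuming the paved-path workflow lives in a hypothetical `platform-workflows` repo and exposes `on: workflow_call`:

```yaml
# .github/workflows/ci.yaml in a service repo (hypothetical caller)
name: ci
on: [push]
jobs:
  paved-path:
    # shared golden-path workflow; repo, file, and tag names are illustrative
    uses: ledgerloop/platform-workflows/.github/workflows/go-service.yaml@v1
    with:
      service-name: payments       # hypothetical input defined by the shared workflow
    secrets: inherit
```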
Infra as Code everywhere (Terraform)
- One `terraform` root per account, modules for clusters, node pools, gateways, and secrets. No click-ops.
module "eks" { source = "terraform-aws-modules/eks/aws" version = "20.8.3" cluster_name = "prod-us-east" cluster_version = "1.27" eks_managed_node_groups = { stateless = { instance_types = ["m6g.large"], desired_size = 6 } stateful = { instance_types = ["r6g.large"], taints = [{ key = "stateful", value = "true", effect = "NO_SCHEDULE" }] } } }
None of this is rocket science. The trick was sequencing, guardrails, and holding the line on paved paths.
Results you can take to the board
Six months, zero downtime, and metrics that mattered:
- Services: 180 -> 75 (58% reduction; 42 merged, 63 retired, 20 kept as-is).
- Pages: 320 -> 112/month (−65%).
- MTTR: 94 -> 22 minutes (−77%).
- Change failure rate: 18% -> 6% (tracked via ArgoCD health + rollbacks).
- Deploy frequency: steady at ~240 deploys/week, but with fewer rollbacks (−71%).
- Cloud spend: −28% on compute and data egress, mostly from right-sized node pools and fewer cross-service hops.
- Tickets/SRE/month: 52 -> 28 (−46%).
- Audit findings: 0 material issues; PCI scope simplified due to consistent ingress/egress patterns.
The CTO didn’t need a slide deck—the burn chart on pages and cost told the story.
What I'd repeat—and what I'd skip next time
What worked:
- Consolidation first, platform second. Merge nanoservices before arguing about the mesh.
- Git as the single source of truth. ArgoCD’s drift detection paid for itself the first weekend we avoided a mystery hotfix.
- SLOs before dashboards. Alert on burn, not noise. Pages dropped because the pager stopped lying.
- Boring defaults. `EKS + Linkerd + Prometheus Operator + ArgoCD` handled 90% of cases without bespoke YAML.
What I’d do differently:
- Earlier repo scaffolding. We waited on Backstage templates. Ship them day one and avoid template drift.
- Avoid “temporary” mesh overlap. We ran Istio and Linkerd in parallel for two weeks; it complicated root cause. Move service families wholesale.
- Budget a deprecation sprint. Killing dead Helm charts took longer than it should have. Timebox it and be ruthless.
Steal this and adapt it
Here’s the short version you can run next quarter without a feature freeze:
- Build a service catalog and tag by domain, cadence, and criticality.
- Merge coupled nanoservices; retire the zombies.
- Standardize one cluster baseline per env; enforce `Pod Security Admission` and `Kyverno`.
- Move deploys to GitOps with ArgoCD; app-of-apps for platform.
- Simplify the network path (Linkerd + Gateway API) and introduce `TrafficSplit` canaries.
- Instrument with OpenTelemetry; define SLOs via `Sloth`; alert on burn rate.
- Ship a golden CI/CD workflow and enforce via templates and scorecards.
- Track toil weekly: pages, MTTR, change failure rate, tickets/SRE, and spend. Celebrate deltas.
If you need a partner who’s done this under SOC 2/PCI pressure without pausing product, GitPlumbers has the scars and the receipts. Let’s make your pager boring again.
Key takeaways
- Consolidate nanoservices ruthlessly—merge by domain and failure blast radius, not org chart.
- Pick boring tech for the core path: EKS + ArgoCD + Linkerd + Prometheus Operator is plenty for 90% of use cases.
- GitOps isn’t magic; you need a service taxonomy, repo structure, and paved path templates to avoid drift.
- Measure what matters: SLOs, change failure rate, MTTR, pager volume, and tickets per SRE.
- Reduce mesh complexity before you scale it—ambient promises don’t fix your routing graph.
- No freeze required: migrate incrementally with canaries and traffic splitting per service family.
Implementation checklist
- Inventory and categorize services by domain, data criticality, and runtime complexity.
- Merge nanoservices where call graphs and deploy cadence are tightly coupled.
- Standardize cluster baselines (version, PSP replacement, network policy, logging).
- Adopt ArgoCD app-of-apps and lock deployments to Git as source of truth.
- Simplify the mesh or remove it; start with least surprise for 80% traffic paths.
- Instrument services with OpenTelemetry and define SLOs with Sloth.
- Ship a golden CI/CD template and enforce with repo scaffolding and scorecards.
- Track toil: pages/SRE/month, tickets/SRE, MTTR, change failure rate, and cloud spend.
Questions we hear from teams
- Why did you replace Istio instead of fixing it?
- The estate didn’t need Istio’s feature set, and its configuration surface was causing operator error. Linkerd delivered mTLS, retries, timeouts, and simple canaries with a fraction of the YAML. When you’re fighting toil, choose the smallest tool that meets your 80% path and simplify first.
- Did consolidation slow delivery for teams?
- Short term, yes—merging 42 nanoservices into 12 domains required interface changes and shared repos. We mitigated with temporary adapters and parallel releases. Net effect after two sprints: fewer coordinated deploys, fewer rollbacks, and faster root cause analysis.
- Why ArgoCD over Flux?
- Both are solid. The org already had ArgoCD expertise, and its UI + app-of-apps model fit their platform team’s mental model. Flux would also have worked; the value is in GitOps discipline, not the specific tool.
- How did you avoid downtime during the migration?
- We migrated per service family with `TrafficSplit` canaries, kept old and new paths live, and rolled forward only after SLO burn stayed below thresholds for 24 hours. Database changes were backward compatible and gated via feature flags.
- What metrics should I track to prove success?
- Pages per SRE, MTTR, change failure rate, deploy frequency, rollback rate, tickets per SRE, and cloud spend. Tie alerts to SLO burn and ensure every change is traceable back to Git.