From 180 Microservices to 75: The Migration That Cut Ops Toil 45%

A real-world refactor of a sprawl of Kubernetes services into a manageable platform—without killing delivery velocity or breaking SOC 2.

The microservices migration that stopped the pager from dictating the roadmap

Two summers ago, a mid-market fintech (2,300 people, global B2B payments) asked GitPlumbers to help untangle a Kubernetes estate that had grown like ivy: 180 microservices spread across EKS 1.24–1.27, two cloud accounts, three CI systems, and enough Helm drift to make `helm diff` cry. On-call was a blood sport—whoever held the pager was fielding 20+ pages a week, and product teams had normalized 3 a.m. canary rollbacks.

“We can’t keep hiring SREs to mask platform complexity.” — CTO

They didn’t want a rewrite. They wanted fewer moving parts, fewer 2 a.m. surprises, and the ability to ship without an incident budget line item. We delivered a migration that cut ops toil by 45% and reduced pages by 65%, while keeping deploy frequency steady. Here’s the real playbook—warts, tradeoffs, and the boring tech that actually works.

What we walked into

I’ve seen this movie before. Lots of good intentions, too many knobs:

  • 180 services, 9 languages (heavy Go and Node.js, pockets of Java 11 and Python 3.9).
  • DIY service mesh: Istio 1.16 with per-team EnvoyFilter snowflakes, mutual TLS misconfigurations, and seven flavors of VirtualService.
  • Inconsistent deploys: Jenkins freestyle jobs, GitHub Actions, and a rogue GitLab CI island.
  • Mix of Helm and raw kubectl apply; three different values.yaml conventions.
  • Observability in name only: Prometheus scraping some namespaces, logs in CloudWatch and Loki, traces nowhere.
  • Compliance guardrails bolted on after the fact: PSP deprecation half-migrated, NetworkPolicy optional, admission controllers inconsistent.

KPIs told the story:

  • MTTR: 94 minutes (P1s).
  • Change failure rate: 18% across critical services.
  • Pages: 320/month across 6 SREs (~13/SRE/week).
  • Cloud spend trending +11% QoQ without usage growth.

The constraints that made this hairy

  • Zero downtime mandate: Payment rails can’t go dark. No “big bang.”
  • SOC 2 + PCI DSS: Audit trails for deploys, immutable infra changes, and access boundaries.
  • Multi-region active/active: US-East + EU-West with data residency constraints.
  • No feature freeze: Product kept shipping; we had to thread the needle.
  • Budget-aware: No six-figure platform licenses; prioritize ROI and boring tech.

What we changed, in the order that worked

If you only take one thing: sequence matters. We didn’t start with a new mesh. We started with a map.

  1. Service taxonomy and consolidation

    • We built a catalog in Backstage and tagged every service by domain, data criticality, deploy cadence, and runtime complexity (catalog entry sketch below). Call graphs built from Pyroscope profiling samples and vflow network-flow data told us which services were actually coupled.
    • Merged 42 nanoservices into 12 domain services where 90% of deploys and rollbacks were correlated. Yes, we ate some repo and contract churn. Delivery got simpler.
    • Rule of thumb: if two services always ship together, share the same pager, and roll back together, they’re the same service.
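
    A minimal sketch of the Backstage catalog entry shape we standardized on (service name, owner, and tag values are illustrative):

    apiVersion: backstage.io/v1alpha1
    kind: Component
    metadata:
      name: payments                      # illustrative service name
      # tags encode the taxonomy: domain, data criticality, deploy cadence
      tags: [domain-ledger, data-critical, deploys-daily]
    spec:
      type: service
      lifecycle: production
      owner: team-ledger                  # illustrative owning team
      system: payments-domain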
  2. Standardized cluster baseline (EKS, one per env per region)

    • Unified on EKS 1.27 with managed node groups and Bottlerocket for stateless pools.
    • Replaced deprecated PSP with Pod Security Admission and enforced policies via Kyverno.
    • Locked logging to Fluent Bit -> Grafana Loki, metrics via Prometheus Operator.

    Example Kyverno policy to block privileged pods:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: disallow-privileged
    spec:
      validationFailureAction: Enforce
      rules:
        - name: no-privileged
          match:
            resources:
              kinds: [Pod]
          validate:
            message: Privileged mode is not allowed
            pattern:
              spec:
                containers:
                  - =(securityContext):
                      =(privileged): false
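
    Pod Security Admission itself is just namespace labels enforced by the built-in admission controller; a minimal sketch of what we applied per namespace (the namespace name is illustrative):

    apiVersion: v1
    kind: Namespace
    metadata:
      name: payments-prod                 # illustrative namespace
      labels:
        pod-security.kubernetes.io/enforce: restricted
        pod-security.kubernetes.io/enforce-version: v1.27
        pod-security.kubernetes.io/warn: restricted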
  3. GitOps with ArgoCD (app-of-apps)

    • No more imperative kubectl. One repo per service, one environment repo per env. ArgoCD managed all workloads.
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: platform
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/ledgerloop/infra-environments
        targetRevision: main
        path: clusters/prod
      destination:
        server: https://kubernetes.default.svc
        namespace: argocd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    • Helm stayed, but we normalized charts and values. Some teams moved to kustomize overlays where it simplified deltas.
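
    Where teams took the kustomize route, overlays stayed deliberately thin; a sketch, with paths and the image tag purely illustrative:

    # clusters/prod/payments/kustomization.yaml
    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
      - ../../../base/payments
    images:
      - name: ghcr.io/ledgerloop/payments
        newTag: 3f2c1ab                   # bumped by CI via PR to this repo
    patches:
      - path: replicas-prod.yaml          # illustrative env-specific delta
        target:
          kind: Deployment
          name: payments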
  4. Simplified the mesh: Istio out, Linkerd + Gateway API in

    • I love Istio for complex edge cases. This estate didn’t need it. We removed 80% of routing config by moving to Linkerd 2.14 (mTLS, retries, timeouts) and Gateway API for north-south.
    • Canary via TrafficSplit beat five layers of EnvoyFilter magic.
    apiVersion: split.smi-spec.io/v1alpha1
    kind: TrafficSplit
    metadata:
      name: payments
      namespace: prod
    spec:
      service: payments
      backends:
        - service: payments-v1
          weight: 80
        - service: payments-v2
          weight: 20
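
    North-south routing moved to Gateway API instead of another layer of VirtualService; a minimal sketch of the shape (gateway name and hostname are illustrative):

    apiVersion: gateway.networking.k8s.io/v1beta1
    kind: HTTPRoute
    metadata:
      name: payments
      namespace: prod
    spec:
      parentRefs:
        - name: edge-gateway              # illustrative shared Gateway
          namespace: gateway-system
      hostnames: ["payments.example.com"]
      rules:
        - matches:
            - path: { type: PathPrefix, value: /api/payments }
          backendRefs:
            - name: payments
              port: 8080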
  5. Observability that enforces reality

    • Standardized on OpenTelemetry SDKs exporting to the Collector, metrics scraped by Prometheus, logs in Loki, traces in Tempo.
    • SLOs codified with Sloth and alerts wired to on-call rotations per domain.

    Example SLO for payments availability:

    apiVersion: sloth.slok.dev/v1
    kind: PrometheusServiceLevel
    metadata:
      name: payments-availability
      namespace: slo
    spec:
      service: payments
      labels:
        team: ledger
      slos:
        - name: availability
          objective: 99.9
          sli:
            events:
              errorQuery: sum(rate(http_requests_total{job="payments",status=~"5.."}[{{.window}}]))
              totalQuery: sum(rate(http_requests_total{job="payments"}[{{.window}}]))
          alerting:
            name: payments-availability
            labels:
              severity: page
            annotations:
              summary: Payments availability SLO burn
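
    The Collector config stayed boring too; a trimmed sketch of the trace and metric pipelines we ran (endpoints are illustrative):

    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    processors:
      batch: {}
    exporters:
      otlp/tempo:
        endpoint: tempo.observability.svc:4317   # illustrative in-cluster Tempo
        tls: { insecure: true }
      prometheus:
        endpoint: 0.0.0.0:8889                   # scraped by the Prometheus Operator
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]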
  6. Paved-path CI/CD templates (GitHub Actions)

    • We didn’t outlaw experimentation; we made the golden path easier. A reusable workflow handles build, test, Trivy scanning, and the ArgoCD image-tag bump via a PR to the env repo; service repos call it rather than copy it (caller sketch after the workflow).
    name: ci
    on: [push]
    permissions:
      contents: read
      packages: write                     # required to push to GHCR with GITHUB_TOKEN
    jobs:
      build-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with: { go-version: '1.21' }
          - run: go test ./...
          - uses: aquasecurity/trivy-action@master
            with: { scan-type: 'fs', ignore-unfixed: true }
          - name: build and push image
            run: |
              echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
              docker build -t ghcr.io/ledgerloop/payments:${{ github.sha }} .
              docker push ghcr.io/ledgerloop/payments:${{ github.sha }}
          - name: check out env repo
            uses: actions/checkout@v4
            with:
              repository: ledgerloop/infra-environments
              token: ${{ secrets.ENV_REPO_PAT }}   # PAT; GITHUB_TOKEN can't open PRs in another repo
              path: infra-environments
          - name: bump image tag
            run: |
              # target file and path are illustrative; adjust to your chart layout
              cd infra-environments/clusters/prod/payments
              sed -i "s|ghcr.io/ledgerloop/payments:.*|ghcr.io/ledgerloop/payments:${{ github.sha }}|" values.yaml
          - name: open PR against env repo
            uses: peter-evans/create-pull-request@v5
            with:
              token: ${{ secrets.ENV_REPO_PAT }}
              path: infra-environments
              commit-message: Bump payments image
              title: Bump payments image
              branch: bump/payments-${{ github.sha }}
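
    In practice those steps live in one shared reusable workflow and each service repo just calls it, so the golden path is a ten-line file; a sketch of the caller side (the shared repo path and input name are illustrative):

    # .github/workflows/ci.yml in each service repo
    name: ci
    on: [push]
    jobs:
      ci:
        uses: ledgerloop/platform-workflows/.github/workflows/service-ci.yml@main
        with:
          image-name: ghcr.io/ledgerloop/payments
        secrets: inherit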
  7. Infra as Code everywhere (Terraform)

    • One terraform root per account, modules for clusters, node pools, gateways, and secrets. No click-ops.
    module "eks" {
      source          = "terraform-aws-modules/eks/aws"
      version         = "20.8.3"
      cluster_name    = "prod-us-east"
      cluster_version = "1.27"
      eks_managed_node_groups = {
        stateless = { instance_types = ["m6g.large"], desired_size = 6 }
        stateful = {
          instance_types = ["r6g.large"]
          taints = {
            stateful = { key = "stateful", value = "true", effect = "NO_SCHEDULE" }
          }
        }
      }
    }

None of this is rocket science. The trick was sequencing, guardrails, and holding the line on paved paths.

Results you can take to the board

Six months, zero downtime, and metrics that mattered:

  • Services: 180 -> 75 (58% reduction; 42 merged, 63 retired, 20 kept as-is).
  • Pages: 320 -> 112/month (−65%).
  • MTTR: 94 -> 22 minutes (−77%).
  • Change failure rate: 18% -> 6% (tracked via ArgoCD health + rollbacks).
  • Deploy frequency: steady at ~240 deploys/week, but with fewer rollbacks (−71%).
  • Cloud spend: −28% on compute and data egress, mostly from right-sized node pools and fewer cross-service hops.
  • Tickets/SRE/month: 52 -> 28 (−46%).
  • Audit findings: 0 material issues; PCI scope simplified due to consistent ingress/egress patterns.

The CTO didn’t need a slide deck—the burn chart on pages and cost told the story.

What I'd repeat—and what I'd skip next time

What worked:

  • Consolidation first, platform second. Merge nanoservices before arguing about the mesh.
  • Git as the single source of truth. ArgoCD’s drift detection paid for itself the first weekend we avoided a mystery hotfix.
  • SLOs before dashboards. Alert on burn, not noise. Pages dropped because the pager stopped lying.
  • Boring defaults. EKS + Linkerd + Prometheus Operator + ArgoCD handled 90% of cases without bespoke YAML.

What I’d do differently:

  • Earlier repo scaffolding. We waited on Backstage templates. Ship them day one and avoid template drift.
  • Avoid “temporary” mesh overlap. We ran Istio and Linkerd in parallel for two weeks; it complicated root cause. Move service families wholesale.
  • Budget a deprecation sprint. Killing dead Helm charts took longer than it should have. Timebox it and be ruthless.

Steal this and adapt it

Here’s the short version you can run next quarter without a feature freeze:

  1. Build a service catalog and tag by domain, cadence, and criticality.
  2. Merge coupled nanoservices; retire the zombies.
  3. Standardize one cluster baseline per env; enforce Pod Security Admission and Kyverno.
  4. Move deploys to GitOps with ArgoCD; app-of-apps for platform.
  5. Simplify the network path (Linkerd + Gateway API) and introduce TrafficSplit canaries.
  6. Instrument with OpenTelemetry; define SLOs via Sloth; alert on burn rate.
  7. Ship a golden CI/CD workflow and enforce via templates and scorecards.
  8. Track toil weekly: pages, MTTR, change failure rate, tickets/SRE, and spend. Celebrate deltas.

If you need a partner who’s done this under SOC 2/PCI pressure without pausing product, GitPlumbers has the scars and the receipts. Let’s make your pager boring again.

Key takeaways

  • Consolidate nanoservices ruthlessly—merge by domain and failure blast radius, not org chart.
  • Pick boring tech for the core path: EKS + ArgoCD + Linkerd + Prometheus Operator is plenty for 90% of use cases.
  • GitOps isn’t magic; you need a service taxonomy, repo structure, and paved path templates to avoid drift.
  • Measure what matters: SLOs, change failure rate, MTTR, pager volume, and tickets per SRE.
  • Reduce mesh complexity before you scale it—ambient promises don’t fix your routing graph.
  • No freeze required: migrate incrementally with canaries and traffic splitting per service family.

Implementation checklist

  • Inventory and categorize services by domain, data criticality, and runtime complexity.
  • Merge nanoservices where call graphs and deploy cadence are tightly coupled.
  • Standardize cluster baselines (version, PSP replacement, network policy, logging).
  • Adopt ArgoCD app-of-apps and lock deployments to Git as source of truth.
  • Simplify the mesh or remove it; start with least surprise for 80% traffic paths.
  • Instrument services with OpenTelemetry and define SLOs with Sloth.
  • Ship a golden CI/CD template and enforce with repo scaffolding and scorecards.
  • Track toil: pages/SRE/month, tickets/SRE, MTTR, change failure rate, and cloud spend.

Questions we hear from teams

Why did you replace Istio instead of fixing it?
The estate didn’t need Istio’s feature set, and its configuration surface was causing operator error. Linkerd delivered mTLS, retries, timeouts, and simple canaries with a fraction of the YAML. When you’re fighting toil, choose the smallest tool that meets your 80% path and simplify first.
Did consolidation slow delivery for teams?
Short term, yes—merging 42 nanoservices into 12 domains required interface changes and shared repos. We mitigated with temporary adapters and parallel releases. Net effect after two sprints: fewer coordinated deploys, fewer rollbacks, and faster root cause analysis.
Why ArgoCD over Flux?
Both are solid. The org already had ArgoCD expertise, and its UI + app-of-apps model fit their platform team’s mental model. Flux would also have worked; the value is in GitOps discipline, not the specific tool.
How did you avoid downtime during the migration?
We migrated per service family with `TrafficSplit` canaries, kept old and new paths live, and rolled forward only after SLO burn stayed below thresholds for 24 hours. Database changes were backward compatible and gated via feature flags.
What metrics should I track to prove success?
Pages per SRE, MTTR, change failure rate, deploy frequency, rollback rate, tickets per SRE, and cloud spend. Tie alerts to SLO burn and ensure every change is traceable back to Git.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
