The Microservices Migration That Cut On‑Call Pages 72% and Retired 38 Helm Charts

We took a fintech with 120+ chatty services, seven snowflake EKS clusters, and a pager screaming every night—and turned it into a lean platform with GitOps, Linkerd, and managed data services. No silver bullets. Just boring, proven engineering.

We stopped babysitting clusters and started shipping features again. On‑call finally got boring.

The mess we walked into

Twelve months ago, a late‑stage fintech asked us to “make on‑call boring again.” They had 120+ microservices, seven EKS clusters (one per team, plus a couple of “temporary” ones that stuck around), two service meshes (half the fleet on Istio 1.7, the rest with no mesh at all), and three CI systems fighting for attention (Jenkins, CircleCI, and GitHub Actions). Deploys were kubectl apply in Slack threads. SOC 2 auditors were not amused.

  • Weekly deploys per service: 0–1 (most teams feared prod)
  • MTTR: 62 min median
  • On‑call pages: ~140/month across SRE rotation
  • Change failure rate: 23% (rolled back or hotfixed)
  • Infra spend: healthy, but ops time was the real bleed (SREs doing YAML archaeology)

I’ve seen this movie. It ends with a platform team rewriting the world in six months and burning out. We did the opposite: reduce blast radius, standardize the boring stuff, and merge services where it actually pays off.

Constraints that actually mattered

  • Regulated: SOC 2 Type II in flight, PCI scope on payments path.
  • Zero downtime for money flows: Payments path had a hard 99.95% SLO.
  • Team autonomy: We couldn’t shove a monolith down everyone’s throat.
  • Cost neutral: Any infra spend had to be offset by reduced toil or other savings.
  • Skill mix: Strong backend engineers, thin SRE bench. We needed tech that doesn’t need a PhD to run.

Translation: prefer managed services, Linkerd over Istio, ArgoCD over homegrown CD, and Terraform for everything that moves. And wherever we merged microservices, we did it by bounded contexts, not by org chart.

What we changed (and why it worked)

We ran three workstreams in parallel, with one rule: no platform feature ships without a reference app and docs.

  1. Topology simplification

    • From seven clusters to three regional EKS clusters (us‑east‑1, eu‑west‑1, ap‑southeast‑1). Namespaces per domain (e.g., payments, risk, catalog).
    • Traffic managed by the AWS Load Balancer Controller (formerly ALB Ingress Controller) + ExternalDNS. No more NGINX per team.
    • Linkerd instead of upgrading snowflake Istio. Fewer knobs; fewer 3 a.m. surprises.
  2. GitOps + policy by default

    • ArgoCD for all workloads; infra in Terraform; OPA Gatekeeper to enforce guardrails (resource requests, no :latest, proper livenessProbe).
    • External Secrets Operator wired to AWS Secrets Manager. Deleted dozens of bespoke init containers.
  3. Service consolidation

    • Merged 22 chatty nanoservices in payments into 6 domain services (auth, ledger, payouts, reconciliation, notifications, reporting). Co‑located code in a shared repo or monorepo where it helped; kept clear API boundaries.
    • Killed deadweight jobs with event‑driven patterns. Moved self‑managed Kafka to Confluent Cloud to stop babysitting Zookeeper.
  4. Observability and reliability baked in

    • Standardized Prometheus metrics with Grafana Cloud and Loki for logs. SLOs codified via Sloth. Alerts via Alertmanager → PagerDuty.
    • Argo Rollouts for canary; default circuit breakers and retries at the mesh.

It sounds neat now. It wasn’t. We cut over domain by domain with canaries and backstops. The key was shipping a platform template every team could copy without thinking.

The boring but essential templates

Here’s what “paved road” looked like. Teams pasted these into deploy/ and moved on.

  • ArgoCD Application (one per service):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-ledger
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/payments
    targetRevision: main
    path: deploy/overlays/prod
    helm:
      values: |
        image:
          tag: {{ .SHA }}
        rollouts:
          strategy: canary
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
  • Argo Rollouts canary with Linkerd traffic split:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ledger
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 120}
        - setWeight: 25
        - pause: {duration: 180}
        - setWeight: 50
        - pause: {duration: 300}
      trafficRouting:
        smi:
          rootService: ledger
  • Linkerd service profile with timeouts/retries (no per‑service snowflakes):
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: ledger.payments.svc.cluster.local
  namespace: payments
spec:
  routes:
    - name: POST /entries
      condition: {method: POST, pathRegex: "/entries"}
      isRetryable: true
      timeout: 2s
      responseClasses:
        - condition: {status: {min: 500, max: 599}}
          isFailure: true
  • Resource policy + HPA baked in:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ledger
spec:
  template:
    spec:
      containers:
        - name: app
          image: ghcr.io/acme/ledger:{{ .Values.image.tag }}
          resources:
            requests: {cpu: "250m", memory: "256Mi"}
            limits: {cpu: "1", memory: "512Mi"}
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ledger
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ledger
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  • SLO as code using Sloth:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-ledger-availability
  namespace: payments
spec:
  service: ledger
  slos:
    - name: availability
      objective: 99.95
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="ledger",status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total{job="ledger"}[5m]))
      alerting:
        name: LedgerAvailability
        labels: {severity: page}
        annotations:
          runbook: https://runbooks.acme.internal/ledger
  • GitHub Actions to stamp images with SHA and let ArgoCD pull:
name: build
on:
  push:
    branches: [main]
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/ledger:${{ github.sha }}
  • Terraform for EKS add‑ons and ArgoCD app‑of‑apps:
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "acme-us-east-1"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
}

resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = "argocd"
}
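  • Ingress via the AWS Load Balancer Controller + ExternalDNS (one per exposed service). A sketch of the shape we handed teams; hostnames, ports, and the internet‑facing scheme here are illustrative, not the real config:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ledger
  namespace: payments
  annotations:
    # ALB provisioned by the AWS Load Balancer Controller
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # ExternalDNS creates the Route 53 record (hostname is hypothetical)
    external-dns.alpha.kubernetes.io/hostname: ledger.acme.example
spec:
  ingressClassName: alb
  rules:
    - host: ledger.acme.example
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ledger
                port:
                  number: 80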

Nothing exotic. Just the stuff that keeps weekends quiet.

Cutovers, trade‑offs, and the “don’t do this” list

We migrated by domain in this order: catalog (low risk), risk, then payments last. A few hard‑earned lessons:

  • Consolidate services before consolidating clusters. We tried moving a chatty mesh of 10 nanoservices as‑is. Canary looked fine; steady‑state lit up p99. We merged them first; autoscaling stabilized.
  • Choose one service mesh. We killed Istio early. Keeping two meshes is a pager tax.
  • Managed Kafka was a cheat code. Confluent Cloud removed an entire class of “which broker died” incidents. The bill was lower than the SRE time it replaced.
  • Policy saved us from ourselves. Gatekeeper caught a few :latest image‑tag oops moments that would’ve caused thundering herds.
  • Docs matter. A 30‑minute “how to onboard to the platform template” recording did more than any all‑hands decree.
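
That :latest guardrail is a small Gatekeeper policy. A minimal sketch (template and constraint names are illustrative; note untagged images also default to latest, which this simple check doesn’t catch):

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowlatest
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowLatest
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowlatest
        # Flag any container whose image ends in :latest
        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          endswith(container.image, ":latest")
          msg := sprintf("container %v uses :latest", [container.name])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowLatest
metadata:
  name: no-latest-tags
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]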

We didn’t “adopt microservices.” We adopted fewer microservices with better boundaries and boring platform defaults.

The results you can actually measure

Three months per region, nine months total. Here’s what moved the needle:

  • On‑call pages: 140 → 39 per month (−72%)
  • MTTR: 62 min → 19 min median (playbooks + rollbacks + standard dashboards)
  • Change failure rate: 23% → 9% (canary + policy + pre‑prod parity)
  • Lead time for changes: 2–3 days → <4 hours (DORA)
  • Deploy frequency: 0–1/week → 5–10/week per service
  • Helm charts: 63 bespoke → 25 standardized (retired 38)
  • Clusters: 7 → 3 (regional)
  • SRE “toil” time: ~30 hrs/wk → <8 hrs/wk (measured via ticket tags)
  • Infra cost: roughly flat; Confluent spend offset by killing EC2 Kafka, EBS, and engineer time.

For the SOC 2 auditors: every prod change had a Git commit, PR approval, and ArgoCD audit trail. No more “who applied this at 2 a.m.?” mysteries.

What we’d do differently next time

  • Earlier golden paths. We waited a sprint too long to publish the platform template, and teams invented their own. Undoing that took cycles.
  • Staging traffic replay. We used synthetic load; we should’ve done shadow traffic with Service Mesh Interface earlier. It caught one payments edge case late.
  • Dashboard curation. We shipped too many Grafana dashboards. We now enforce one golden dashboard per service with the four golden signals and SLO.
  • Cost showback. Teams behaved better once we added namespace‑level cost allocation via AWS CUR + Kubecost. Should’ve shipped it on day one.

If you’re about to do this, start here

  • Pick a simple target topology (2–3 regional clusters, namespaces by domain). Draw it. Socialize it.
  • Ship a copy‑paste platform template (deployment, HPA, SLO, canary, runbook link). Don’t let every team reinvent it.
  • Move deploys to ArgoCD and enforce a few OPA policies. You’ll get safety and auditability for free.
  • Use a lightweight mesh like Linkerd unless you truly need Istio features. Your on‑call will thank you.
  • Offload undifferentiated heavy lifting: Confluent, Grafana Cloud, RDS/Aurora. Stop racking your own pet clusters.
  • Measure DORA and SLOs weekly. Optimize for MTTR and change failure, not CPU graphs.
  • Cut over by domain, not by team. Use canaries. Keep rollbacks as git revert + Argo sync.

And if you need a crew that’s done this without burning your team out, GitPlumbers has the scars and the playbooks.


Key takeaways

  • Consolidate services around bounded contexts before you consolidate clusters.
  • Adopt GitOps (ArgoCD) and policy-as-code to kill snowflake environments and audit gaps.
  • Prefer boring tech: Linkerd over complex meshes, managed Kafka over self-hosted.
  • Standardize deploy templates (HPA, SLOs, probes) so you can scale ops with headcount ≈ 0.
  • Measure with DORA + SLOs; optimize for MTTR and change failure rate, not vanity metrics.

Implementation checklist

  • Inventory services and traffic; identify chatty nanoservices to merge.
  • Design 2–3 regional clusters with clear namespace/domain boundaries.
  • Move deploys to GitOps (ArgoCD), infra to Terraform, enforce with OPA Gatekeeper.
  • Adopt a lightweight mesh (Linkerd) and standard retries/timeouts at the platform level.
  • Shift self-managed Kafka/Elastic to managed equivalents where possible.
  • Instrument golden signals and codify SLOs; wire Prometheus/Alertmanager to PagerDuty.
  • Ship a platform template: resource requests, HPA, probes, canary, dashboards, runbooks.
  • Cut over by domain with canaries and rollbacks; measure DORA metrics every week.

Questions we hear from teams

Why Linkerd instead of Istio?
Operational overhead. Linkerd’s defaults cover retries, timeouts, and mTLS without constant YAML spelunking. The team didn’t need multi‑cluster gateways or custom EnvoyFilters; they needed reliability with fewer levers.
How did you manage secrets across environments?
External Secrets Operator read from AWS Secrets Manager with environment‑scoped prefixes. ArgoCD managed ESO manifests; engineers never touched raw Kubernetes Secrets. Rotations were handled in ASM with zero redeploys when apps watched for updates.
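A minimal sketch of that wiring, assuming a ClusterSecretStore named aws-asm pointed at Secrets Manager (store name, secret names, and the prod/ prefix are illustrative):
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ledger-db
  namespace: payments
spec:
  refreshInterval: 1h          # ESO re-reads ASM hourly, so rotations land without redeploys
  secretStoreRef:
    name: aws-asm              # hypothetical ClusterSecretStore backed by AWS Secrets Manager
    kind: ClusterSecretStore
  target:
    name: ledger-db            # the Kubernetes Secret ESO creates and keeps in sync
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments/ledger/database_url   # environment-scoped prefix in ASM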
What about stateful services like Kafka and Elasticsearch?
We moved Kafka to Confluent Cloud and logs to Grafana Cloud Loki. We kept Aurora for relational data, managed by RDS. The rule was: if a vendor can run it better and cheaper than our SRE time, buy it.
Did you consider going back to a monolith?
We merged nanoservices into domain services where latency and ownership made sense. Full re‑monolith would’ve blown the schedule and team autonomy. Bounded contexts gave 80% of the benefits with 20% of the churn.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

