The Microservices Migration That Cut On‑Call Pages 72% and Retired 38 Helm Charts
We took a fintech with 120+ chatty services, seven snowflake EKS clusters, and a pager screaming every night—and turned it into a lean platform with GitOps, Linkerd, and managed data services. No silver bullets. Just boring, proven engineering.
We stopped babysitting clusters and started shipping features again. On‑call finally got boring.
The mess we walked into
Twelve months ago, a late‑stage fintech asked us to “make on‑call boring again.” They had 120+ microservices, seven EKS clusters (one per team, plus a couple of “temporary” ones that stuck around), two service meshes (half the fleet on Istio 1.7, the rest using no mesh), and three CI systems fighting for attention (Jenkins, CircleCI, and GitHub Actions). Deploys were kubectl apply in Slack threads. SOC 2 auditors were not amused.
- Weekly deploys per service: 0–1 (most teams feared prod)
- MTTR: 62 min median
- On‑call pages: ~140/month across SRE rotation
- Change failure rate: 23% (rolled back or hotfixed)
- Infra spend: healthy, but ops time was the real bleed (SREs doing YAML archaeology)
I’ve seen this movie. It ends with a platform team rewriting the world in six months and burning out. We did the opposite: reduce blast radius, standardize the boring stuff, and merge services where it actually pays off.
Constraints that actually mattered
- Regulated: SOC 2 Type II in flight, PCI scope on payments path.
- Zero downtime for money flows: Payments path had a hard 99.95% SLO.
- Team autonomy: We couldn’t shove a monolith down everyone’s throat.
- Cost neutral: Any infra spend had to be offset by reduced toil or other savings.
- Skill mix: Strong backend engineers, thin SRE bench. We needed tech that doesn’t need a PhD to run.
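That 99.95% SLO on the payments path is less abstract once you turn it into minutes of allowed downtime. A quick back‑of‑the‑envelope sketch (the 30‑day window is an assumption; substitute your actual SLO window):

```python
# Error budget implied by an availability objective over a rolling window.
# Pure arithmetic; the numbers are not taken from the engagement.

def error_budget_minutes(objective_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given objective."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - objective_pct / 100)

budget = error_budget_minutes(99.95)  # 43200 * 0.0005 = 21.6 minutes/month
```

About 21.6 minutes a month. That budget is what makes canary pauses, rollbacks, and burn‑rate alerts worth the ceremony: you can spend it deliberately or have incidents spend it for you.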
Translation: prefer managed services, Linkerd over Istio, ArgoCD over homegrown CD, and Terraform for everything that moves. And wherever we merged microservices, we did it by bounded contexts, not by org chart.
What we changed (and why it worked)
We ran three workstreams in parallel, with one rule: no platform feature ships without a reference app and docs.
Topology simplification
- From seven clusters to three regional EKS clusters (us‑east‑1, eu‑west‑1, ap‑southeast‑1). Namespaces per domain (e.g., payments, risk, catalog).
- Traffic managed by AWS ALB Ingress Controller + ExternalDNS. No more NGINX per team.
- Linkerd instead of upgrading snowflake Istio. Fewer knobs; fewer 3 a.m. surprises.
GitOps + policy by default
- ArgoCD for all workloads; infra in Terraform; OPA Gatekeeper to enforce guardrails (resource requests, no :latest tags, proper livenessProbes).
- External Secrets Operator wired to AWS Secrets Manager. Deleted dozens of bespoke init containers.
Service consolidation
- Merged 22 chatty nanoservices in payments into 6 domain services (auth, ledger, payouts, reconciliation, notifications, reporting). Same repo or monorepo where it helped; kept clear APIs.
- Killed deadweight jobs with event‑driven patterns. Moved self‑managed Kafka to Confluent Cloud to stop babysitting ZooKeeper.
Observability and reliability baked in
- Standardized Prometheus metrics with Grafana Cloud and Loki for logs. SLOs codified via Sloth. Alerts via Alertmanager → PagerDuty.
- Argo Rollouts for canary; default circuit breakers and retries at the mesh.
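For context on how those SLOs turn into pages: SLO tooling like Sloth alerts on error‑budget burn rate, not raw error counts. A toy sketch of the underlying arithmetic, mirroring the errorQuery/totalQuery split (the request counts are invented):

```python
# Availability SLI and error-budget burn rate from error/total request counts.
# burn rate 1.0 means spending budget exactly at the allowed pace;
# sustained rates well above 1.0 are what should page a human.

def availability(errors: float, total: float) -> float:
    return 1.0 if total == 0 else 1 - errors / total

def burn_rate(errors: float, total: float, objective: float) -> float:
    allowed = 1 - objective  # error budget as a fraction of requests
    return (1 - availability(errors, total)) / allowed

# 12 errors in 10,000 requests against a 99.95% objective:
rate = burn_rate(12, 10_000, 0.9995)  # 0.0012 / 0.0005 = 2.4x budget
```

Burning budget at 2.4x means the monthly budget is gone in under two weeks; that is a ticket‑now signal, while short spikes can wait for business hours.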
It sounds neat now. It wasn’t. We cut over domain by domain with canaries and backstops. The key was shipping a platform template every team could copy without thinking.
The boring but essential templates
Here’s what “paved road” looked like. Teams pasted these into deploy/ and moved on.
- ArgoCD Application (one per service):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-ledger
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/payments
    targetRevision: main
    path: deploy/overlays/prod
    helm:
      values: |
        image:
          tag: {{ .SHA }}
        rollouts:
          strategy: canary
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
- Argo Rollouts canary with Linkerd traffic split:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ledger
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 120}
        - setWeight: 25
        - pause: {duration: 180}
        - setWeight: 50
        - pause: {duration: 300}
      trafficRouting:
        smi:
          rootService: ledger
- Linkerd service profile with timeouts/retries (no per‑service snowflakes):
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: ledger.payments.svc.cluster.local
  namespace: payments
spec:
  routes:
    - name: POST /entries
      condition: {method: POST, pathRegex: "/entries"}
      isRetryable: true
      timeout: 2s
      responseClasses:
        - condition: {status: {min: 500, max: 599}}
          isFailure: true
- Resource policy + HPA baked in:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ledger
spec:
  selector:
    matchLabels: {app: ledger}
  template:
    metadata:
      labels: {app: ledger}
    spec:
      containers:
        - name: app
          image: ghcr.io/acme/ledger:{{ .Values.image.tag }}
          resources:
            requests: {cpu: "250m", memory: "256Mi"}
            limits: {cpu: "1", memory: "512Mi"}
          readinessProbe:
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ledger
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ledger
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- SLO as code using Sloth:
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-ledger-availability
  namespace: payments
spec:
  service: ledger
  slos:
    - name: availability
      objective: 99.95
      sli:
        events:
          errorQuery: sum(rate(http_requests_total{job="ledger",status=~"5.."}[5m]))
          totalQuery: sum(rate(http_requests_total{job="ledger"}[5m]))
      alerting:
        name: LedgerAvailability
        labels: {severity: page}
        annotations:
          runbook: https://runbooks.acme.internal/ledger
- GitHub Actions to stamp images with SHA and let ArgoCD pull:
name: build
on:
  push:
    branches: [main]
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/acme/ledger:${{ github.sha }}
- Terraform for EKS add‑ons and ArgoCD app‑of‑apps:
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "acme-us-east-1"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
}

resource "helm_release" "argocd" {
  name       = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"
  namespace  = "argocd"
}

Nothing exotic. Just the stuff that keeps weekends quiet.
Cutovers, trade‑offs, and the “don’t do this” list
We migrated by domain in this order: catalog (low risk), risk, then payments last. A few hard‑earned lessons:
- Consolidate services before consolidating clusters. We tried moving a chatty mesh of 10 nanoservices as‑is. Canary looked fine; steady‑state lit up p99. We merged them first; autoscaling stabilized.
- Choose one service mesh. We killed Istio early. Keeping two meshes is a pager tax.
- Managed Kafka was a cheat code. Confluent Cloud removed an entire class of “which broker died” incidents. The bill was lower than the SRE time it replaced.
- Policy saved us from ourselves. Gatekeeper caught a few image: latest oops moments that would’ve caused thundering herds.
- Docs matter. A 30‑minute “how to onboard to the platform template” recording did more than any all‑hands decree.
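The consolidation lesson is mostly tail math: when one user request traverses many chatty hops, the odds that some hop lands in its service’s slow tail compound with chain length. A hypothetical illustration (the 1% tail fraction is an assumption, not a measurement from this fleet):

```python
# Probability that a request hits at least one service's slowest 1%
# grows quickly with the number of hops it touches: 1 - (1 - p)^n.
# This is why a chain of 10 nanoservices shows ugly p99 even when
# each individual service looks healthy.

def p_hits_tail(n_calls: int, tail_fraction: float = 0.01) -> float:
    return 1 - (1 - tail_fraction) ** n_calls

one_hop = p_hits_tail(1)    # 1% of requests see tail latency
ten_hops = p_hits_tail(10)  # ~9.6% of requests see some hop's tail
```

Merging the nanoservices shortened the chain, which is why steady‑state p99 stabilized after consolidation even though no individual service got faster.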
We didn’t “adopt microservices.” We adopted fewer microservices with better boundaries and boring platform defaults.
The results you can actually measure
Three months per region, nine months total. Here’s what moved the needle:
- On‑call pages: 140 → 39 per month (−72%)
- MTTR: 62 min → 19 min median (playbooks + rollbacks + standard dashboards)
- Change failure rate: 23% → 9% (canary + policy + pre‑prod parity)
- Lead time for changes: 2–3 days → <4 hours (DORA)
- Deploy frequency: 0–1/week → 5–10/week per service
- Helm charts: 63 bespoke → 25 standardized (retired 38)
- Clusters: 7 → 3 (regional)
- SRE “toil” time: ~30 hrs/wk → <8 hrs/wk (measured via ticket tags)
- Infra cost: roughly flat; Confluent spend offset by retiring self‑managed Kafka on EC2, its EBS volumes, and the engineer time spent babysitting it.
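None of these numbers require fancy tooling; MTTR and change failure rate fall straight out of incident and deploy logs. A minimal sketch with made‑up records (the record shape here is illustrative, not the client’s actual schema):

```python
from statistics import median

# Time-to-restore per incident, in minutes (invented sample data).
restore_minutes = [12, 19, 25, 8, 31]

# Deploy log: each entry records whether the change was rolled back
# or hotfixed (invented sample data).
deploys = [{"failed": False}] * 18 + [{"failed": True}] * 2

mttr = median(restore_minutes)                                  # 19 minutes
cfr = sum(1 for d in deploys if d["failed"]) / len(deploys)     # 0.10 = 10%
```

Using the median for MTTR (as the post does) keeps one marathon incident from masking a real trend; track the long tail separately.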
For the SOC 2 auditors: every prod change had a Git commit, PR approval, and ArgoCD audit trail. No more “who applied this at 2 a.m.?” mysteries.
What we’d do differently next time
- Earlier golden paths. We waited a sprint too long to publish the platform template, and teams invented their own. Undoing that took cycles.
- Staging traffic replay. We used synthetic load; we should’ve done shadow traffic with Service Mesh Interface earlier. It caught one payments edge case late.
- Dashboard curation. We shipped too many Grafana dashboards. We now enforce one golden dashboard per service with the four golden signals and SLO.
- Cost showback. Teams behaved better once we added namespace‑level cost allocation via AWS CUR + Kubecost. Should’ve shipped it on day one.
If you’re about to do this, start here
- Pick a simple target topology (2–3 regional clusters, namespaces by domain). Draw it. Socialize it.
- Ship a copy‑paste platform template (deployment, HPA, SLO, canary, runbook link). Don’t let every team reinvent it.
- Move deploys to ArgoCD and enforce a few OPA policies. You’ll get safety and auditability for free.
- Use a lightweight mesh like Linkerd unless you truly need Istio features. Your on‑call will thank you.
- Offload undifferentiated heavy lifting: Confluent, Grafana Cloud, RDS/Aurora. Stop racking your own pet clusters.
- Measure DORA and SLOs weekly. Optimize for MTTR and change failure, not CPU graphs.
- Cut over by domain, not by team. Use canaries. Keep rollbacks as git revert + Argo sync.
And if you need a crew that’s done this without burning your team out, GitPlumbers has the scars and the playbooks.
Key takeaways
- Consolidate services around bounded contexts before you consolidate clusters.
- Adopt GitOps (ArgoCD) and policy-as-code to kill snowflake environments and audit gaps.
- Prefer boring tech: Linkerd over complex meshes, managed Kafka over self-hosted.
- Standardize deploy templates (HPA, SLOs, probes) so you can scale ops with headcount ≈ 0.
- Measure with DORA + SLOs; optimize for MTTR and change failure rate, not vanity metrics.
Implementation checklist
- Inventory services and traffic; identify chatty nanoservices to merge.
- Design 2–3 regional clusters with clear namespace/domain boundaries.
- Move deploys to GitOps (ArgoCD), infra to Terraform, enforce with OPA Gatekeeper.
- Adopt a lightweight mesh (Linkerd) and standard retries/timeouts at the platform level.
- Shift self-managed Kafka/Elastic to managed equivalents where possible.
- Instrument golden signals and codify SLOs; wire Prometheus/Alertmanager to PagerDuty.
- Ship a platform template: resource requests, HPA, probes, canary, dashboards, runbooks.
- Cut over by domain with canaries and rollbacks; measure DORA metrics every week.
Questions we hear from teams
- Why Linkerd instead of Istio?
- Operational overhead. Linkerd’s defaults cover retries, timeouts, and mTLS without constant YAML spelunking. The team didn’t need multi‑cluster gateways or custom EnvoyFilters; they needed reliability with fewer levers.
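For intuition, here’s the client‑side equivalent of what the mesh handles declaratively: capped retries, a per‑try timeout, and an overall deadline so retries can’t pile up into a storm against a struggling upstream. This is an illustrative Python sketch, not Linkerd internals:

```python
import time

def call_with_retries(fn, tries: int = 3, per_try_timeout_s: float = 2.0,
                      deadline_s: float = 5.0):
    """Retry fn up to `tries` times, but never past an overall deadline."""
    start = time.monotonic()
    last_err = None
    for _ in range(tries):
        if time.monotonic() - start > deadline_s:
            break  # overall budget exhausted; give up rather than pile on
        try:
            return fn(timeout=per_try_timeout_s)
        except TimeoutError as err:
            last_err = err
    raise last_err or TimeoutError("no attempts made")

# A flaky stub: times out twice, then succeeds.
calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream slow")
    return "ok"

result = call_with_retries(flaky)  # succeeds on the third attempt
```

Pushing this policy into the mesh means every service gets the same bounded behavior without each team re‑implementing (and mis‑tuning) it.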
- How did you manage secrets across environments?
- External Secrets Operator read from AWS Secrets Manager with environment‑scoped prefixes. ArgoCD managed ESO manifests; engineers never touched raw Kubernetes Secrets. Rotations were handled in ASM with zero redeploys when apps watched for updates.
- What about stateful services like Kafka and Elasticsearch?
- We moved Kafka to Confluent Cloud and logs to Grafana Cloud Loki. We kept Aurora for relational data, managed by RDS. The rule was: if a vendor can run it better and cheaper than our SRE time, buy it.
- Did you consider going back to a monolith?
- We merged nanoservices into domain services where latency and ownership made sense. Full re‑monolith would’ve blown the schedule and team autonomy. Bounded contexts gave 80% of the benefits with 20% of the churn.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
