From Snowflake Jenkins to GitOps: The Platform Migration That Cut Lead Time by 71%
How a mid-market fintech ditched brittle pipelines, shipped 6x faster, and stopped paging platform engineers at 2am.
“The best part is it’s boring. People deploy without asking permission, and prod is calm.” — Director of Platform, Fintech Client
The situation: a fintech drowning in bespoke pipelines
A 200-person fintech (think AML reporting, not a unicorn) asked us to fix developer productivity without blowing up SOC 2 and their PCI-ish boundaries. They had:
- A pet Jenkins VM with 300 freestyle jobs and a “do not reboot” wiki page
- Three Kubernetes clusters (two EKS, one on-prem) that drifted monthly
- Terraform sprinkled across repos with vendor modules forked three times
- Static AWS keys in GitHub org secrets and on two build agents (I know…)
- Lead time for changes averaging 4.1 days; deploys once per service per week; on-call burned out
I’ve seen this fail when teams buy a platform product and pray. We did the opposite: keep the platform thin, make the paved road unavoidable, and measure everything.
Constraints we had to respect (and why they matter)
These weren’t optional:
- Compliance: SOC 2 Type II, data residency for EU tenants, segmenting PII services
- Uptime: 99.9% SLO on core APIs; change freezes at quarter-end
- Cost: No new headcount; infra spend already under procurement heat
- Tooling reality: Developers were on GitHub Enterprise Cloud; AWS for prod; some stubborn services on-prem
Translation: no greenfield rewrites; we had to migrate in-place, minimize blast radius, and preserve auditability. That steered us to GitOps with ArgoCD, Terraform modules consolidated under a single registry, GitHub Actions with OIDC, and a Backstage catalog to hide the sharp edges.
What we changed: thin platform, thick paved road
We killed bespoke as policy. The platform stayed boring: EKS, managed Postgres (RDS), S3, SQS. The developer experience got opinionated.
- GitOps with ArgoCD: Drift in every cluster is converged by reconcilers; no kubectl in prod. App state is code in the environments repo.
- Backstage for service catalog + templates: One golden path per runtime (Node.js, Go) with Dockerfile, health checks, Helm chart, OpenTelemetry, and a GitHub Actions workflow baked in.
- Terraform modules + Terragrunt: Centralized AWS/EKS modules, versioned and consumable; stop copying vendor examples.
- OIDC for CI: GitHub Actions federated to AWS; no long-lived secrets.
- Preview environments: Namespace-per-PR driven by the same Helm chart; automatic teardown on merge.
- SLOs and budgets: Default SLOs per service with Prometheus and RED/USE dashboards in Grafana.
Here’s the ArgoCD Application we standardized (one per service per env):
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-prod
  namespace: argocd
spec:
  project: core-services
  source:
    repoURL: https://github.com/fintech/environments
    targetRevision: main
    path: charts/payments-api/overlays/prod
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
```

Boring on purpose. When ArgoCD owns prod, humans stop “just hotfixing in the cluster,” and your audits get easier overnight.
The migration plan that didn’t melt prod
Big bangs are for blog posts that leave out the pager history. We ran a migration factory over 12 weeks:
- Week 1–2: Baseline and pilot
  - Captured DORA metrics using Four Keys + GitHub data; verified with Datadog deploy markers
  - Piloted GitOps on a non-critical service: read-only Argo first, then automated sync
- Week 3–6: Golden path + OIDC
  - Landed Backstage catalog; shipped v1 templates for Node.js/Go
  - Replaced Jenkins jobs with GitHub Actions; enabled OIDC to AWS
- Week 7–10: App migrations
  - Moved 18 services onto the paved road; dual-ran the old pipeline for 48h per service
  - Introduced preview environments for two customer-facing apps
- Week 11–12: Retire and harden
  - Turned off Jenkins; rotated all static keys; enabled ArgoCD Image Updater for patch bumps
  - Added SLO dashboards; codified incident runbooks per service
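The Week 1–2 baseline doesn’t require a product: the four DORA numbers fall out of commit/deploy timestamps and incident records. A minimal Python sketch, where the event shapes are our illustration (not the Four Keys schema) and you’d adapt them to your GitHub/Datadog exports:

```python
from statistics import median

def dora_baseline(deploys, incidents, window_weeks):
    """Compute the four DORA metrics from raw events.

    deploys:   dicts with 'commit_ts'/'deploy_ts' datetimes and a 'failed' flag
    incidents: (started, resolved) datetime pairs
    (Hypothetical shapes -- adapt to whatever your CI and APM export.)
    """
    lead_days = median(
        (d["deploy_ts"] - d["commit_ts"]).total_seconds() / 86400 for d in deploys
    )
    return {
        "lead_time_days": lead_days,                      # commit -> running in prod
        "deploys_per_week": len(deploys) / window_weeks,  # deployment frequency
        "change_failure_rate": sum(d["failed"] for d in deploys) / len(deploys),
        "mttr_minutes": median(
            (end - start).total_seconds() / 60 for start, end in incidents
        ),
    }
```

Run it against a month of history before you change anything; those numbers are the before column in every report that follows.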
The GitHub Actions workflow became the default. Note the OIDC block and deployment job handing off to GitOps via PR to the env repo:
```yaml
name: ci
on:
  push:
    branches: [main]
  pull_request:
permissions:
  id-token: write
  contents: read
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci && npm test -- --ci
      - run: npm run build
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with: { registry: ghcr.io, username: ${{ github.actor }}, password: ${{ secrets.GITHUB_TOKEN }} }
      - run: |
          docker buildx build \
            --push \
            -t ghcr.io/fintech/payments-api:${{ github.sha }} .
  deploy:
    needs: build-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { repository: fintech/environments, token: ${{ secrets.ENV_REPO_PAT }} }
      - name: Bump image tag
        run: |
          yq -i '.image.tag = "${{ github.sha }}"' charts/payments-api/overlays/prod/values-prod.yaml
      - name: Create PR to trigger ArgoCD
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.ENV_REPO_PAT }}
          branch: bump/payments-${{ github.sha }}
          title: "Deploy payments-api ${{ github.sha }} to prod"
```

We also carved out a Terraform module boundary that developers could consume without learning AWS arcana:
```hcl
module "service" {
  source = "git::https://github.com/fintech/infra-modules.git//service?ref=v1.6.0"

  name                = "payments-api"
  cpu                 = 250
  memory              = 512
  replicas            = 3
  ingress_hostnames   = ["api.prod.example.com"]
  enable_sqs_dlq      = true
  enable_rds_iam_auth = true
}
```

No Helm hell, no 400-line YAML PRs. The module renders sane defaults, the chart takes values, Argo reconciles.
The boring automation that paid for itself
A few unsexy details made the difference:
- Secret management with SOPS: Encrypted values in Git with age keys; ArgoCD decrypts server-side. No more “who rotated that?” mysteries.
- Preview environments per PR: Same chart, new namespace; ephemeral RDS via templates for one team, mock services for others.
- OpenTelemetry out of the box: Every template emitted traces, logs, and metrics. PMs stopped arguing about “it’s slow” vs “it’s fine.”
- Default SLOs: a 99% SLO on p95 latency; burn alerts to Slack with Alertmanager routing by squad.
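Those burn alerts are simple arithmetic under the hood: burn rate is the observed error rate divided by the error budget, and paging on a fast plus a slow window keeps one bad scrape from waking a squad. A sketch of the multiwindow check — the thresholds are illustrative, not this team’s exact config:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget burns; 1.0 means on pace to exactly spend it."""
    return error_rate / (1.0 - slo)

def should_page(err_5m: float, err_1h: float, slo: float = 0.99,
                threshold: float = 14.4) -> bool:
    """Page only when both the fast (5m) and slow (1h) windows burn hot,
    in the style of Google SRE multiwindow, multi-burn-rate alerts."""
    return (burn_rate(err_5m, slo) >= threshold
            and burn_rate(err_1h, slo) >= threshold)
```

The same two-window shape translates directly into a pair of Alertmanager rules per squad.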
Backstage template snippet (developers choose runtime, we hard-code the good choices):
```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golden-node-service
spec:
  owner: platform
  type: service
  parameters:
    - title: Service Info
      properties:
        name: { type: string }
        squad: { type: string }
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=fintech&repo=${{ parameters.name }}
    - id: register
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
```

Developers click “Create,” get a repo with a Dockerfile, Helm chart, CI workflow, SLO dashboard, and a catalog-info.yaml. That’s the paved road.
Results by the numbers (90 days, no heroics)
We didn’t add headcount. We stopped doing platform magic tricks and let the tools work.
- Lead time for changes: 4.1 days → 1.2 days (−71%)
- Deployment frequency: 1.1/week/service → 6.8/week/service (6x)
- MTTR: 98 minutes → 41 minutes (−58%), thanks to fast rollbacks via Git revert + Argo sync
- Change failure rate: 18% → 9% (−50%), feature flags + preview envs caught the silly stuff
- Onboarding: new hire to first prod PR: 14 days → 3 days
- Build agent outages: 2/month → 0 (Jenkins retired)
- Infra cost: flat to −8% despite more deploys, because we right-sized with cluster autoscaler and killed zombie workloads
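For anyone auditing the headline numbers: they are plain relative deltas on the Week 1 baseline, nothing vendor-weighted. A quick check (the helper name is ours):

```python
def pct_change(before: float, after: float) -> int:
    """Relative change in whole percent; negative means the metric dropped."""
    return round((after - before) / before * 100)

# Lead time: 4.1 -> 1.2 days is -71%; MTTR: 98 -> 41 min is -58%;
# change failure rate: 18% -> 9% is -50%; 6.8 / 1.1 deploys/week is ~6x.
```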
“The best part is it’s boring. People deploy without asking permission, and prod is calm.” — Director of Platform, Fintech Client
What we’d do differently next time
I’ve seen teams over-rotate into platformization and ship a labyrinth. We intentionally kept it simple, but a few things we’d tweak:
- Start OpenFeature earlier to decouple the feature-flag vendor choice from code.
- Set ArgoCD to manual sync in prod for the first two weeks per service; one hotfix sprint got spicy.
- Bake perf-test scaffolding into the template from day one; we added k6 only after hitting p95 regressions in preview envs.
- Use a shared environments repo per domain instead of a monorepo for everything; merge conflicts got annoying around week 8.
Copy/paste: configs you can steal
Don’t ask for permission—steal what works. Three drop-ins that moved the needle:
- Minimal SOPS setup for values files:

```bash
# create an age key
age-keygen -o key.txt
export SOPS_AGE_KEY_FILE=key.txt
sops -e values.yaml > values.enc.yaml
```

- ArgoCD ApplicationSet for preview environments:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-previews
spec:
  generators:
    - pullRequest:
        github:
          owner: fintech
          repo: payments-api
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 60
  template:
    metadata:
      name: payments-pr-{{number}}
    spec:
      project: core-services
      source:
        repoURL: https://github.com/fintech/payments-api
        targetRevision: pull/{{number}}/head
        path: chart
        helm:
          values: |
            image:
              tag: pr-{{number}}
            env: pr-{{number}}
      destination:
        namespace: payments-pr-{{number}}
        server: https://kubernetes.default.svc
      syncPolicy:
        automated: { prune: true, selfHeal: true }
```

- Default SLO alert (PrometheusRule):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo
spec:
  groups:
    - name: latency-slo
      rules:
        - alert: PaymentsP95TooHigh
          expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{service="payments-api"}[5m])) by (le)) > 0.300
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Payments API p95 latency over 300ms"
            runbook_url: https://runbooks.internal/payments/p95
```

Key takeaways
- Thin platform, thick paved road: keep the platform boring and the developer experience opinionated.
- GitOps with ArgoCD eliminated snowflake drift and weekend change windows.
- Backstage + golden templates prevented template sprawl and reduced onboarding time.
- OIDC to AWS removed long-lived keys and most flaky CI secrets issues.
- Preview environments paid off immediately for PR cycle time and PM sign-off.
- Measure with DORA and SLOs or you’re just rearranging deck chairs.
- Start with a migration factory: 80/20 automation plus office hours beats “big bang.”
Implementation checklist
- Baseline DORA metrics before touching anything.
- Stand up a read-only GitOps pilot on a non-critical service first.
- Codify one golden path (app template + infra modules) before you boil the ocean.
- Flip CI to OIDC; kill static cloud keys in repos and org secrets.
- Enable preview environments for at least one customer-facing service.
- Instrument with OpenTelemetry; add a default dashboard per service.
- Run a weekly migration clinic—engineers bring repos, leave with PRs.
Questions we hear from teams
- Why ArgoCD over Flux?
- Both are solid. This team wanted UI-first visibility and RBAC granularity for auditors. Argo’s Application notion mapped cleanly to their domains, and their SREs already knew it.
- Why not keep Jenkins and just fix it?
- We tried. The pet server, plugins, and secrets model were the problem. Moving to GitHub Actions with OIDC removed whole classes of failure and simplified compliance.
- How do preview environments avoid database collisions?
- Two modes: for stateless services we use seeded ephemeral DBs (RDS snapshot clones) in a shared test account; for stateful constraints we swap live DB calls with contract-tested mocks in preview envs.
- What did you measure to prove productivity gains?
- DORA metrics (lead time, deployment frequency, change failure rate, MTTR), onboarding time-to-first-PR, PR cycle time, and incident counts. We also tracked SLO burn and change freeze violations.
- Can we do this without Backstage?
- Yes, but you’ll pay the tax in templates and documentation drift. Backstage made the paved road discoverable and consistent. Start small with one template and a catalog.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
