From Jenkins Snowflakes to GitOps: The Platform Migration That Cut Lead Time by 92%
How we took a mid-market fintech from brittle pipelines and ticket queues to a sane, automated platform that devs actually like using.
## The legacy platform that slowed a fintech to a crawl
I walked into a 220-engineer fintech with six Jenkins masters, each a bespoke snowflake. Deployments required a ServiceNow ticket, a Slack nudge, and a prayer. Four environments (dev, qa, stage, prod) drifted like tectonic plates. A single flaky Selenium job could stall a release for a day. Security loved the gates; devs hated the wait.
- Stack reality: Java and Node services, Python batch, PostgreSQL and Redis, Kafka for streaming, AWS everything
- Compliance: SOC 2 Type II and customer audits; every change had to be traceable
- Constraints: tight budget, no 6-month freeze, and zero appetite for another consultant PowerPoint migration
"If you can cut our lead time under a day without blowing up audits, you’ll be heroes." — VP Eng
## Why this hurt (and what we measured)
We started with DORA baselines and incident data from PagerDuty and Jira. No hand-wavy promises.
- Lead time for change: ~5 days median (merge to prod)
- Deployment frequency: ~2 per week per service (with spikes + rollbacks)
- Change failure rate: 21% (rollback/retry within 24h)
- MTTR: ~4 hours
- Environment drift: 1–2 critical diffs per week caught late (Helm values, IAM perms, image tags)
- Dev experience (internal NPS): 32
Auditors flagged manual steps with inconsistent evidence. Cloud costs weren’t insane, but nodes idled due to static ASGs and spiky workloads.
## What we changed (in plain English)
We didn’t boil the ocean. We paved a road and moved the traffic.
- CI: `GitHub Actions` replaced most Jenkins jobs. We kept a tiny Jenkins footprint for edge cases with a sunset plan.
- CD via GitOps: `ArgoCD` managed desired state from environment repos. No more clicking deploy.
- IDP: `Backstage` gave devs templates (Golden Paths) for services, infra, and runbooks.
- Infra: `EKS` with `Karpenter` for autoscaling; `Terraform` for foundations; `Crossplane` for AWS resources via CRDs when teams needed autonomy.
- Traffic + safety: `Istio` for mesh and traffic splitting; `Argo Rollouts` for canaries; `LaunchDarkly` for feature flags.
- Security: `OPA Gatekeeper` policies, `Cosign` image signing, `Semgrep` and `Trivy` in CI, `Checkov`/`Conftest` for IaC policy.
- Observability: `Prometheus` + `Grafana`, SLOs with burn-rate alerts to PagerDuty.
We shipped an app-of-apps GitOps model in 6 weeks, migrated five high-traffic services in 90 days, then scaled out.
## CI/CD in the trenches (configs we actually used)
We standardised one GitHub Actions workflow that built, scanned, signed, and updated the env repo via PR. It killed 80% of the Jenkins fragility in a week.
```yaml
# .github/workflows/build-and-release.yml
name: build-and-release
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - name: Install deps
        run: npm ci
      - name: Unit tests
        run: npm test -- --ci
      - name: Static analysis (Semgrep)
        uses: returntocorp/semgrep-action@v1
      - name: Build image
        run: docker build -t ${{ secrets.ECR_REPO }}:${{ github.sha }} .
      - name: Trivy image scan
        uses: aquasecurity/trivy-action@0.20.0
        with:
          image-ref: ${{ secrets.ECR_REPO }}:${{ github.sha }}
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
      - name: Push image
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin ${{ secrets.ECR_REPO_HOST }}
          docker push ${{ secrets.ECR_REPO }}:${{ github.sha }}
      - name: Cosign sign
        run: cosign sign --key ${{ secrets.COSIGN_KEY }} ${{ secrets.ECR_REPO }}:${{ github.sha }}
      - name: Update env repo (ArgoCD watches this)
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.ENV_REPO_TOKEN }}
          branch: bump/${{ github.sha }}
          title: Bump image to ${{ github.sha }}
          commit-message: Bump image to ${{ github.sha }}
          add-paths: chart/values/prod.yaml
```

ArgoCD watched the env repo and reconciled changes. We used an app-of-apps pattern to keep things sane.
```yaml
# environments/prod/apps/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/fintech/env-prod.git
    path: apps
    targetRevision: main
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [ CreateNamespace=true ]
```

Canary deployments were declarative with Argo Rollouts.
```yaml
# k8s/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 6
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: payments-vs          # assumes a single HTTP route in the VirtualService
          destinationRule:
            name: payments-dr          # DestinationRule carrying the stable/canary subsets (name assumed)
            stableSubsetName: stable
            canarySubsetName: canary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
```
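For subset-level splitting, the Rollout leans on an Istio `VirtualService` and `DestinationRule` pair that look roughly like this; the Service host and object names are illustrative, and Argo Rollouts rewrites the route weights and subset labels at runtime.

```yaml
# istio/payments-traffic.yaml (illustrative; Service host and object names assumed)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
spec:
  hosts:
    - payments-api                 # Kubernetes Service name (assumed)
  http:
    - route:
        - destination: { host: payments-api, subset: stable }
          weight: 100              # Argo Rollouts adjusts these weights at each canary step
        - destination: { host: payments-api, subset: canary }
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments-api
  subsets:
    - name: stable                 # Rollouts adds rollouts-pod-template-hash labels to each subset
      labels: { app: payments-api }
    - name: canary
      labels: { app: payments-api }
```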
## Paved roads with Backstage (Golden Paths)
We killed the copy-paste hell by shipping Backstage templates that encoded our best practices. New services weren’t a blank page; they were a paved road.
```yaml
# templates/service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-service-golden-path
  title: Node Service (Golden Path)
  description: Node + Helm + Argo Rollouts + SLO skeleton
spec:
  owner: platform-team
  parameters:
    - title: Service details
      required: [ name, owner ]
      properties:
        name: { type: string }
        owner: { type: string }
  steps:
    - id: fetch-base
      name: Fetch base
      action: fetch:template
      input:
        url: ./skeleton
    - id: publish
      name: Publish to GitHub
      action: publish:github
    - id: register
      name: Register in Backstage
      action: catalog:register
```

Teams got a catalog-info.yaml, SLOs, Dockerfile, Helm chart, RBAC, and a GitHub Actions workflow out of the box. Onboarding dropped from weeks to days.
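The generated catalog-info.yaml is just the standard Backstage component descriptor; roughly this shape, with names and the GitHub org illustrative:

```yaml
# catalog-info.yaml (illustrative; service name and org assumed)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    github.com/project-slug: fintech/payments-api   # links the catalog entry to its repo
spec:
  type: service
  lifecycle: production
  owner: team-payments
```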
## Guardrails that pass audits (without killing velocity)
Speed means nothing if your next SOC 2 audit torpedoes the roadmap. We embedded compliance as code.
- Policy: `OPA Gatekeeper` blocked non-signed images, privileged pods, and naked `LoadBalancer` Services.
- IaC: `Checkov` and `Conftest` ran on Terraform. Drift was tracked with `terraform plan` in CI.
- Provenance: image signing with `Cosign`, build attestation via reusable workflows. Not full SLSA, but auditors loved the paper trail.
- Secrets: `External Secrets Operator` for AWS Secrets Manager. No app secrets in repos. Ever. (See the sketch after this list.)
- AI-generated code sanity: we saw “vibe coding” PRs with sketchy libs. `Semgrep` rules flagged risky patterns; dependency diffing auto-commented CVEs.
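Per the secrets bullet, the pattern is an `ExternalSecret` that projects an AWS Secrets Manager entry into a Kubernetes Secret, roughly like this (store, Secret, and key names illustrative):

```yaml
# k8s/external-secret.yaml (illustrative; ClusterSecretStore and Secrets Manager key names assumed)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-api-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # ClusterSecretStore pointed at AWS Secrets Manager (name assumed)
    kind: ClusterSecretStore
  target:
    name: payments-api-env         # Kubernetes Secret the app mounts; never committed to Git
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments-api/database-url
```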
Example OPA policy snippet that saved us from a 2 a.m. page:
```rego
package k8s.allowedImages

deny[msg] {
  input.review.object.kind == "Pod"
  some c
  img := input.review.object.spec.containers[c].image
  not startswith(img, "${ECR_REPO}/")
  msg := sprintf("image %s is not from allowed registry", [img])
}
```
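Gatekeeper consumes that Rego wrapped in a `ConstraintTemplate` with a `violation` rule, plus a `Constraint` instance. The wrapping looks roughly like this; the kind, constraint name, and parameterised registry are illustrative, not the exact policy we shipped:

```yaml
# gatekeeper/allowed-images.yaml (illustrative wrapping; kind and parameter names assumed)
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedimages
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedImages
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedimages
        violation[{"msg": msg}] {
          img := input.review.object.spec.containers[_].image
          not startswith(img, input.parameters.registry)
          msg := sprintf("image %s is not from allowed registry", [img])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedImages
metadata:
  name: images-from-ecr-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    registry: "${ECR_REPO}/"       # same placeholder as the snippet above
```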
And a tiny Terraform module we standardized for Karpenter:
module "karpenter" {
source = "terraform-aws-modules/eks/aws//modules/karpenter"
cluster_name = module.eks.cluster_name
irsa_iam_role_name = "karpenter-controller"
tags = { app = "platform", env = var.env }
}
``;
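The Terraform module only installs the controller and its IAM plumbing; the actual scaling behaviour lives in a Karpenter `NodePool`. A consolidation-enabled sketch, assuming the v1beta1 API and an `EC2NodeClass` named `default`:

```yaml
# karpenter/nodepool.yaml (illustrative; assumes Karpenter v1beta1 APIs and an existing EC2NodeClass)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                    # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenUnderutilized   # bin-pack workloads and remove idle nodes
```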
## Results that stuck (90 and 180 days)
We didn’t count story points. We measured outcomes.
- Lead time: **~5 days → ~2 hours** median (92% reduction)
- Deployment frequency: **~2/week → 12/day** for the top 20 services (8x)
- Change failure rate: **21% → 6%**
- MTTR: **~4h → 28m** (Prometheus SLO burn-rate + quick rollback via Rollouts)
- Drift: near-zero; ArgoCD kept prod honest, and preview envs caught config issues
- Costs: **-18%** for compute on comparable load (Karpenter + right-sizing)
- Dev experience: internal NPS **32 → 67**
- Audit: next SOC 2 cycle had **0 platform findings**; evidence came from pipelines and Git history
We cut over 5 flagship services in 90 days with zero downtime. Full portfolio migration took 6 months with a long tail of low-risk services.
## What we’d do differently (and what you can reuse)
Things we’d change:
- Start `Backstage` one sprint earlier. The paved road accelerates migration more than you think.
- Push `Crossplane` later. Terraform covered 80%; Crossplane shines when teams truly need self-serve infra CRDs.
- Budget more time for `Istio` multi-cluster if you have strict availability zones or regional failover requirements.
What you can copy-paste tomorrow:
1. Stand up ArgoCD with an app-of-apps repo. Keep it boring.
2. Ship a single `GitHub Actions` workflow for build/scan/sign/push/PR-to-env.
3. Write 5 OPA policies: signed images, resource limits, no privileged, no hostPath, allowed registries.
4. Add `Argo Rollouts` and make canary the default via a Helm subchart.
5. Publish a Backstage Golden Path that generates service + SLO + pipeline.
6. Track DORA + MTTR weekly and report to the exec team.
If you’re fighting legacy CI, environment drift, or AI-generated “vibe code” sneaking into prod, you do not need a ground-up rewrite. You need a platform that makes the right thing the easy thing. That’s exactly what we built here.

## Key takeaways
- Automate from the repo out: GitHub Actions + ArgoCD + Backstage templates beat bespoke Jenkins every time.
- Guardrails, not gates: OPA policies, canary rollouts, and SLOs enable speed and safety.
- Golden Paths matter: Templates and paved roads reduced PR review time by 40% and onboarding from weeks to days.
- Measure outcomes: DORA metrics and incident MTTR made value visible to execs and auditors.
- Cutover in phases: Start with an app-of-apps GitOps model and migrate one service slice at a time.
## Implementation checklist
- Define DORA baselines (lead time, deploy frequency, MTTR, change failure rate).
- Stand up GitOps core (ArgoCD, env repos, app-of-apps).
- Codify guardrails (OPA Gatekeeper, image signing, RBAC, SLSA-ish provenance).
- Create Golden Paths in Backstage with service templates.
- Introduce canaries with `Argo Rollouts` and feature flags for safety.
- Automate preview environments for every PR.
- Migrate the top 5 services first, then scale the pattern.
- Publish SLOs with Prometheus and burn-rate alerts tied to PagerDuty.
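To make the last item concrete, a multi-window burn-rate alert as a `PrometheusRule` looks roughly like this; the SLI recording-rule names and the 99.9% objective are illustrative, and Alertmanager routes `severity: page` to PagerDuty:

```yaml
# prometheus/slo-burnrate.yaml (illustrative; recording-rule names and objective assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  labels:
    release: prometheus            # must match your Prometheus Operator ruleSelector
spec:
  groups:
    - name: payments-api.slo
      rules:
        - alert: PaymentsApiFastBurn
          expr: |
            slo:sli_error:ratio_rate5m{service="payments-api"} > (14.4 * 0.001)
            and
            slo:sli_error:ratio_rate1h{service="payments-api"} > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page         # routed to PagerDuty by Alertmanager
          annotations:
            summary: "payments-api is burning its 99.9% error budget too fast"
```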
## Questions we hear from teams
- How did you migrate without a freeze?
- We ran old and new pipelines in parallel. For each service, we cut a feature flag, shipped the new pipeline and GitOps manifests, and used Argo Rollouts to route 10% traffic, then 30%, then 100%. Rollback was one `kubectl argo rollouts undo` away. No big-bang weekends.
- Why GitHub Actions over Jenkins?
- We needed consistency, marketplace actions for security scanners, and fewer pets to patch. Actions gave us reusable workflows, per-repo configs, and simple org-wide policies. We kept a minimal Jenkins for edge workloads and retired it within 6 months.
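A reusable-workflow call from a service repo looks roughly like this; the shared repo, workflow path, and inputs are illustrative, not a copy of the client's setup:

```yaml
# .github/workflows/ci.yml in a service repo (illustrative; org/repo, path, and inputs assumed)
name: ci
on:
  push:
    branches: [ main ]
jobs:
  build-and-release:
    uses: fintech/platform-workflows/.github/workflows/service-ci.yml@main
    secrets: inherit               # pass repo/org secrets through to the shared workflow
    with:
      service-name: payments-api   # input the shared workflow would declare under workflow_call
```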
- How did you handle database migrations?
- We templated `Flyway`/`Liquibase` steps in the pipeline and gated canary rollout until migrations were applied. For risky migrations, we used LaunchDarkly to decouple schema rollout from feature exposure.
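One way to wire that gate, if migrations ship next to the manifests, is an ArgoCD `PreSync` hook Job that runs Flyway before the Rollout syncs. A sketch, not necessarily the exact pipeline shape we used (image tag and Secret name illustrative):

```yaml
# k8s/flyway-presync-job.yaml (illustrative; Secret name and connection env vars assumed)
apiVersion: batch/v1
kind: Job
metadata:
  name: payments-api-flyway
  annotations:
    argocd.argoproj.io/hook: PreSync          # run before ArgoCD syncs the Rollout
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway
          image: flyway/flyway:10
          args: ["migrate"]
          envFrom:
            - secretRef:
                name: payments-api-db         # expected to provide FLYWAY_URL / FLYWAY_USER / FLYWAY_PASSWORD
```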
- What about AI-generated code and supply chain risk?
- We treated it like any other code risk. Semgrep rulesets flagged insecure patterns, Trivy scanned images, and Cosign signed artifacts. We also diffed SBOMs in PRs and blocked unknown sources. The goal is guardrails for speed, not locks that force shadow IT.
