From Jenkins Snowflakes to GitOps: The Platform Migration That Cut Lead Time by 92%
How we took a mid-market fintech from brittle pipelines and ticket queues to a sane, automated platform that devs actually like using.
## The legacy platform that slowed a fintech to a crawl
I walked into a 220-engineer fintech with six Jenkins masters, each a bespoke snowflake. Deployments required a ServiceNow ticket, a Slack nudge, and a prayer. Four environments (dev, qa, stage, prod) drifted like tectonic plates. A single flaky Selenium job could stall a release for a day. Security loved the gates; devs hated the wait.
- Stack reality: Java and Node services, Python batch, PostgreSQL and Redis, Kafka for streaming, AWS everything
- Compliance: SOC 2 Type II and customer audits; every change had to be traceable
- Constraints: tight budget, no 6-month freeze, and zero appetite for another consultant PowerPoint migration
"If you can cut our lead time under a day without blowing up audits, you’ll be heroes." — VP Eng
## Why this hurt (and what we measured)
We started with DORA baselines and incident data from PagerDuty and Jira. No hand-wavy promises.
- Lead time for change: ~5 days median (merge to prod)
- Deployment frequency: ~2 per week per service (with spikes + rollbacks)
- Change failure rate: 21% (rollback/retry within 24h)
- MTTR: ~4 hours
- Environment drift: 1–2 critical diffs per week caught late (Helm values, IAM perms, image tags)
- Dev experience (internal NPS): 32
Auditors flagged manual steps with inconsistent evidence. Cloud costs weren’t insane, but nodes idled due to static ASGs and spiky workloads.
## What we changed (in plain English)
We didn’t boil the ocean. We paved a road and moved the traffic.
- CI: `GitHub Actions` replaced most Jenkins jobs. We kept a tiny Jenkins footprint for edge cases with a sunset plan.
- CD via GitOps: `ArgoCD` managed desired state from environment repos. No more clicking deploy.
- IDP: `Backstage` gave devs templates (Golden Paths) for services, infra, and runbooks.
- Infra: `EKS` with `Karpenter` for autoscaling; `Terraform` for foundations; `Crossplane` for AWS resources via CRDs when teams needed autonomy.
- Traffic + safety: `Istio` for mesh and traffic splitting; `Argo Rollouts` for canaries; `LaunchDarkly` for feature flags.
- Security: `OPA Gatekeeper` policies, `Cosign` image signing, `Semgrep` and `Trivy` in CI, `Checkov`/`Conftest` for IaC policy.
- Observability: `Prometheus` + `Grafana`, SLOs with burn-rate alerts to PagerDuty.
We shipped an app-of-apps GitOps model in 6 weeks, migrated five high-traffic services in 90 days, then scaled out.
## CI/CD in the trenches (configs we actually used)
We standardised one GitHub Actions workflow that built, scanned, signed, and updated the env repo via PR. It killed 80% of the Jenkins fragility in a week.
```yaml
# .github/workflows/build-and-release.yml
name: build-and-release
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - name: Install deps
        run: npm ci
      - name: Unit tests
        run: npm test -- --ci
      - name: Static analysis (Semgrep)
        uses: returntocorp/semgrep-action@v1
      - name: Build image
        run: docker build -t ${{ secrets.ECR_REPO }}:${{ github.sha }} .
      - name: Trivy image scan
        uses: aquasecurity/trivy-action@0.20.0
        with:
          image-ref: ${{ secrets.ECR_REPO }}:${{ github.sha }}
          format: 'table'
          exit-code: '1'
          ignore-unfixed: true
      - name: Push image
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin ${{ secrets.ECR_REPO_HOST }}
          docker push ${{ secrets.ECR_REPO }}:${{ github.sha }}
      - name: Cosign sign
        run: cosign sign --key ${{ secrets.COSIGN_KEY }} ${{ secrets.ECR_REPO }}:${{ github.sha }}
      - name: Update env repo (ArgoCD watches this)
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.ENV_REPO_TOKEN }}
          branch: bump/${{ github.sha }}
          title: Bump image to ${{ github.sha }}
          commit-message: Bump image to ${{ github.sha }}
          add-paths: chart/values/prod.yaml
```

ArgoCD watched the env repo and reconciled changes. We used an app-of-apps pattern to keep things sane.
```yaml
# environments/prod/apps/app-of-apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prod-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/fintech/env-prod.git
    path: apps
    targetRevision: main
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [ CreateNamespace=true ]
```

Canary deployments were declarative with Argo Rollouts.
```yaml
# k8s/rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
spec:
  replicas: 6
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: payments-vs          # assumes a single HTTP route in the VirtualService
          destinationRule:
            name: payments-dr          # DestinationRule carrying the stable/canary subsets (name assumed)
            stableSubsetName: stable
            canarySubsetName: canary
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 30
        - pause: { duration: 10m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
```
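For subset-level splitting, the Rollout leans on an Istio `VirtualService` and `DestinationRule` pair that look roughly like this; the Service host and object names are illustrative, and Argo Rollouts rewrites the route weights and subset labels at runtime.

```yaml
# istio/payments-traffic.yaml (illustrative; Service host and object names assumed)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-vs
spec:
  hosts:
    - payments-api                 # Kubernetes Service name (assumed)
  http:
    - route:
        - destination: { host: payments-api, subset: stable }
          weight: 100              # Argo Rollouts adjusts these weights at each canary step
        - destination: { host: payments-api, subset: canary }
          weight: 0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments-dr
spec:
  host: payments-api
  subsets:
    - name: stable                 # Rollouts adds rollouts-pod-template-hash labels to each subset
      labels: { app: payments-api }
    - name: canary
      labels: { app: payments-api }
```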
## Paved roads with Backstage (Golden Paths)
We killed the copy-paste hell by shipping Backstage templates that encoded our best practices. New services weren’t a blank page; they were a paved road.
```yaml
# templates/service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: node-service-golden-path
  title: Node Service (Golden Path)
  description: Node + Helm + Argo Rollouts + SLO skeleton
spec:
  owner: platform-team
  parameters:
    - title: Service details
      required: [ name, owner ]
      properties:
        name: { type: string }
        owner: { type: string }
  steps:
    - id: fetch-base
      name: Fetch base
      action: fetch:template
      input:
        url: ./skeleton
    - id: publish
      name: Publish to GitHub
      action: publish:github
    - id: register
      name: Register in Backstage
      action: catalog:register
```

Teams got a catalog-info.yaml, SLOs, Dockerfile, Helm chart, RBAC, and a GitHub Actions workflow out of the box. Onboarding dropped from weeks to days.
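The generated catalog-info.yaml is just the standard Backstage component descriptor; roughly this shape, with names and the GitHub org illustrative:

```yaml
# catalog-info.yaml (illustrative; service name and org assumed)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  annotations:
    github.com/project-slug: fintech/payments-api   # links the catalog entry to its repo
spec:
  type: service
  lifecycle: production
  owner: team-payments
```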
## Guardrails that pass audits (without killing velocity)
Speed means nothing if your next SOC 2 audit torpedoes the roadmap. We embedded compliance as code.
- Policy: `OPA Gatekeeper` blocked non-signed images, privileged pods, and naked `LoadBalancer` Services.
- IaC: `Checkov` and `Conftest` ran on Terraform. Drift was tracked with `terraform plan` in CI.
- Provenance: image signing with `Cosign`, build attestation via reusable workflows. Not full SLSA, but auditors loved the paper trail.
- Secrets: `External Secrets Operator` for AWS Secrets Manager. No app secrets in repos. Ever. (See the sketch after this list.)
- AI-generated code sanity: we saw “vibe coding” PRs with sketchy libs. `Semgrep` rules flagged risky patterns; dependency diffing auto-commented CVEs.
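Per the secrets bullet, the pattern is an `ExternalSecret` that projects an AWS Secrets Manager entry into a Kubernetes Secret, roughly like this (store, Secret, and key names illustrative):

```yaml
# k8s/external-secret.yaml (illustrative; ClusterSecretStore and Secrets Manager key names assumed)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-api-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # ClusterSecretStore pointed at AWS Secrets Manager (name assumed)
    kind: ClusterSecretStore
  target:
    name: payments-api-env         # Kubernetes Secret the app mounts; never committed to Git
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments-api/database-url
```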
Example OPA policy snippet that saved us from a 2 a.m. page:
```rego
package k8s.allowedImages

deny[msg] {
  input.review.object.kind == "Pod"
  some c
  img := input.review.object.spec.containers[c].image
  not startswith(img, "${ECR_REPO}/")
  msg := sprintf("image %s is not from allowed registry", [img])
}
```
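Gatekeeper consumes that Rego wrapped in a `ConstraintTemplate` with a `violation` rule, plus a `Constraint` instance. The wrapping looks roughly like this; the kind, constraint name, and parameterised registry are illustrative, not the exact policy we shipped:

```yaml
# gatekeeper/allowed-images.yaml (illustrative wrapping; kind and parameter names assumed)
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedimages
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedImages
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedimages
        violation[{"msg": msg}] {
          img := input.review.object.spec.containers[_].image
          not startswith(img, input.parameters.registry)
          msg := sprintf("image %s is not from allowed registry", [img])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedImages
metadata:
  name: images-from-ecr-only
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    registry: "${ECR_REPO}/"       # same placeholder as the snippet above
```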
And a tiny Terraform module we standardized for Karpenter:
module "karpenter" {
source = "terraform-aws-modules/eks/aws//modules/karpenter"
cluster_name = module.eks.cluster_name
irsa_iam_role_name = "karpenter-controller"
tags = { app = "platform", env = var.env }
}
``;
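The Terraform module only installs the controller and its IAM plumbing; the actual scaling behaviour lives in a Karpenter `NodePool`. A consolidation-enabled sketch, assuming the v1beta1 API and an `EC2NodeClass` named `default`:

```yaml
# karpenter/nodepool.yaml (illustrative; assumes Karpenter v1beta1 APIs and an existing EC2NodeClass)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"                    # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenUnderutilized   # bin-pack workloads and remove idle nodes
```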
## Results that stuck (90 and 180 days)
We didn’t count story points. We measured outcomes.
- Lead time: **~5 days → ~2 hours** median (92% reduction)
- Deployment frequency: **~2/week → 12/day** for the top 20 services (8x)
- Change failure rate: **21% → 6%**
- MTTR: **~4h → 28m** (Prometheus SLO burn-rate + quick rollback via Rollouts)
- Drift: near-zero; ArgoCD kept prod honest, and preview envs caught config issues
- Costs: **-18%** for compute on comparable load (Karpenter + right-sizing)
- Dev experience: internal NPS **32 → 67**
- Audit: next SOC 2 cycle had **0 platform findings**; evidence came from pipelines and Git history
We cut over 5 flagship services in 90 days with zero downtime. Full portfolio migration took 6 months with a long tail of low-risk services.
## What we’d do differently (and what you can reuse)
Things we’d change:
- Start `Backstage` one sprint earlier. The paved road accelerates migration more than you think.
- Push `Crossplane` later. Terraform covered 80%; Crossplane shines when teams truly need self-serve infra CRDs.
- Budget more time for `Istio` multi-cluster if you have strict availability zones or regional failover requirements.
What you can copy-paste tomorrow:
1. Stand up ArgoCD with an app-of-apps repo. Keep it boring.
2. Ship a single `GitHub Actions` workflow for build/scan/sign/push/PR-to-env.
3. Write 5 OPA policies: signed images, resource limits, no privileged, no hostPath, allowed registries.
4. Add `Argo Rollouts` and make canary the default via a Helm subchart.
5. Publish a Backstage Golden Path that generates service + SLO + pipeline.
6. Track DORA + MTTR weekly and report to the exec team.
If you’re fighting legacy CI, environment drift, or AI-generated “vibe code” sneaking into prod, you do not need a ground-up rewrite. You need a platform that makes the right thing the easy thing. That’s exactly what we built here.

## Key takeaways
- Automate from the repo out: GitHub Actions + ArgoCD + Backstage templates beat bespoke Jenkins every time.
- Guardrails, not gates: OPA policies, canary rollouts, and SLOs enable speed and safety.
- Golden Paths matter: Templates and paved roads reduced PR review time by 40% and onboarding from weeks to days.
- Measure outcomes: DORA metrics and incident MTTR made value visible to execs and auditors.
- Cutover in phases: Start with an app-of-apps GitOps model and migrate one service slice at a time.
## Implementation checklist
- Define DORA baselines (lead time, deploy frequency, MTTR, change failure rate).
- Stand up GitOps core (ArgoCD, env repos, app-of-apps).
- Codify guardrails (OPA Gatekeeper, image signing, RBAC, SLSA-ish provenance).
- Create Golden Paths in Backstage with service templates.
- Introduce canaries with `Argo Rollouts` and feature flags for safety.
- Automate preview environments for every PR.
- Migrate the top 5 services first, then scale the pattern.
- Publish SLOs with Prometheus and burn-rate alerts tied to PagerDuty.
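To make the last item concrete, a multi-window burn-rate alert as a `PrometheusRule` looks roughly like this; the SLI recording-rule names and the 99.9% objective are illustrative, and Alertmanager routes `severity: page` to PagerDuty:

```yaml
# prometheus/slo-burnrate.yaml (illustrative; recording-rule names and objective assumed)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slo
  labels:
    release: prometheus            # must match your Prometheus Operator ruleSelector
spec:
  groups:
    - name: payments-api.slo
      rules:
        - alert: PaymentsApiFastBurn
          expr: |
            slo:sli_error:ratio_rate5m{service="payments-api"} > (14.4 * 0.001)
            and
            slo:sli_error:ratio_rate1h{service="payments-api"} > (14.4 * 0.001)
          for: 2m
          labels:
            severity: page         # routed to PagerDuty by Alertmanager
          annotations:
            summary: "payments-api is burning its 99.9% error budget too fast"
```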
## Questions we hear from teams
- How did you migrate without a freeze?
- We ran old and new pipelines in parallel. For each service, we cut a feature flag, shipped the new pipeline and GitOps manifests, and used Argo Rollouts to route 10% traffic, then 30%, then 100%. Rollback was one `kubectl argo rollouts undo` away. No big-bang weekends.
- Why GitHub Actions over Jenkins?
- We needed consistency, marketplace actions for security scanners, and fewer pets to patch. Actions gave us reusable workflows, per-repo configs, and simple org-wide policies. We kept a minimal Jenkins for edge workloads and retired it within 6 months.
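A reusable-workflow call from a service repo looks roughly like this; the shared repo, workflow path, and inputs are illustrative, not a copy of the client's setup:

```yaml
# .github/workflows/ci.yml in a service repo (illustrative; org/repo, path, and inputs assumed)
name: ci
on:
  push:
    branches: [ main ]
jobs:
  build-and-release:
    uses: fintech/platform-workflows/.github/workflows/service-ci.yml@main
    secrets: inherit               # pass repo/org secrets through to the shared workflow
    with:
      service-name: payments-api   # input the shared workflow would declare under workflow_call
```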
- How did you handle database migrations?
- We templated `Flyway`/`Liquibase` steps in the pipeline and gated canary rollout until migrations were applied. For risky migrations, we used LaunchDarkly to decouple schema rollout from feature exposure.
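One way to wire that gate, if migrations ship next to the manifests, is an ArgoCD `PreSync` hook Job that runs Flyway before the Rollout syncs. A sketch, not necessarily the exact pipeline shape we used (image tag and Secret name illustrative):

```yaml
# k8s/flyway-presync-job.yaml (illustrative; Secret name and connection env vars assumed)
apiVersion: batch/v1
kind: Job
metadata:
  name: payments-api-flyway
  annotations:
    argocd.argoproj.io/hook: PreSync          # run before ArgoCD syncs the Rollout
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flyway
          image: flyway/flyway:10
          args: ["migrate"]
          envFrom:
            - secretRef:
                name: payments-api-db         # expected to provide FLYWAY_URL / FLYWAY_USER / FLYWAY_PASSWORD
```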
- What about AI-generated code and supply chain risk?
- We treated it like any other code risk. Semgrep rulesets flagged insecure patterns, Trivy scanned images, and Cosign signed artifacts. We also diffed SBOMs in PRs and blocked unknown sources. The goal is guardrails for speed, not locks that force shadow IT.
