From Snowflake Jenkins to GitOps: The Platform Migration That Cut Lead Time by 71%

How a mid-market fintech ditched brittle pipelines, shipped 6x faster, and stopped paging platform engineers at 2am.

“The best part is it’s boring. People deploy without asking permission, and prod is calm.” — Director of Platform, Fintech Client

The situation: a fintech drowning in bespoke pipelines

A 200-person fintech (think AML reporting, not a unicorn) asked us to fix developer productivity without blowing up SOC 2 and their PCI-ish boundaries. They had:

  • A pet Jenkins VM with 300 freestyle jobs and a “do not reboot” wiki page
  • Three Kubernetes clusters (two EKS, one on-prem) that drifted monthly
  • Terraform sprinkled across repos with vendor modules forked three times
  • Static AWS keys in GitHub org secrets and on two build agents (I know…)
  • Lead time for changes averaging 4.1 days; deploys once per service per week; on-call burned out

I’ve seen this fail when teams buy a platform product and pray. We did the opposite: keep the platform thin, make the paved road unavoidable, and measure everything.

Constraints we had to respect (and why they matter)

These weren’t optional:

  • Compliance: SOC 2 Type II, data residency for EU tenants, segmenting PII services
  • Uptime: 99.9% SLO on core APIs; change freezes at quarter-end
  • Cost: No new headcount; infra spend already under procurement heat
  • Tooling reality: Developers were on GitHub Enterprise Cloud; AWS for prod; some stubborn services on-prem

Translation: no greenfield rewrites; we had to migrate in-place, minimize blast radius, and preserve auditability. That steered us to GitOps with ArgoCD, Terraform modules consolidated under a single registry, GitHub Actions with OIDC, and a Backstage catalog to hide the sharp edges.

What we changed: thin platform, thick paved road

We killed bespoke as policy. The platform stayed boring: EKS, managed Postgres (RDS), S3, SQS. The developer experience got opinionated.

  • GitOps with ArgoCD: Reconcilers converge cluster drift continuously; no kubectl in prod. App state lives as code in environments/ repos.
  • Backstage for service catalog + templates: One golden path per runtime (Node.js, Go) with Dockerfile, health checks, Helm chart, OpenTelemetry, and a GitHub Actions workflow baked in.
  • Terraform modules + Terragrunt: Centralized AWS/EKS modules versioned and consumable; stop copying vendor examples.
  • OIDC for CI: GitHub Actions federated to AWS; no long-lived secrets.
  • Preview environments: Namespace-per-PR driven by the same Helm chart; automatic teardown on merge.
  • SLOs and budgets: Default SLOs per service with Prometheus and RED/USE dashboards in Grafana.

Here’s the ArgoCD Application we standardized (one per service per env):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-prod
  namespace: argocd
spec:
  project: core-services
  source:
    repoURL: https://github.com/fintech/environments
    targetRevision: main
    path: charts/payments-api/overlays/prod
    helm:
      valuesFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: payments-api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Boring on purpose. When ArgoCD owns prod, humans stop “just hotfixing in the cluster,” and your audits get easier overnight.
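The Application above points at an AppProject (`core-services`). A minimal sketch of what that project might restrict — names, namespaces, and repo URLs here are assumptions for illustration, not the client's actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: core-services
  namespace: argocd
spec:
  description: Paved-road services; deploys only from the environments repo
  sourceRepos:
    - https://github.com/fintech/environments
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "payments-*"
  clusterResourceWhitelist:
    # Needed so CreateNamespace=true can create the target namespace;
    # nothing else cluster-scoped is allowed.
    - group: ""
      kind: Namespace
```

Scoping sourceRepos and destinations per project is what gives auditors the RBAC granularity this team wanted.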

The migration plan that didn’t melt prod

Big bangs are for blog posts that leave out the pager history. We ran a migration factory over 12 weeks:

  1. Week 1–2: Baseline and pilot
    • Captured DORA metrics using Four Keys + GitHub data; verified with Datadog deploy markers
    • Piloted GitOps on a non-critical service, read-only Argo first, then automated sync
  2. Week 3–6: Golden path + OIDC
    • Landed Backstage catalog; shipped v1 templates for Node.js/Go
    • Replaced Jenkins jobs with GitHub Actions; enabled OIDC to AWS
  3. Week 7–10: App migrations
    • 18 services onto the paved road; dual-running old pipeline for 48h per service
    • Introduced preview environments for two customer-facing apps
  4. Week 11–12: Retire and harden
    • Turned off Jenkins; rotated all static keys; enabled ArgoCD Image Updater for patch bumps
    • Added SLO dashboards; codified incident runbooks per service

The GitHub Actions workflow became the default. Note the OIDC permissions block and the deploy job handing off to GitOps via a PR to the env repo:

name: ci
on:
  push:
    branches: [main]
  pull_request:
permissions:
  id-token: write
  contents: read
jobs:
  build-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci && npm test -- --ci
      - run: npm run build
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - run: |
          docker buildx build \
            --push \
            -t ghcr.io/fintech/payments-api:${{ github.sha }} .
  deploy:
    needs: build-test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          repository: fintech/environments
          token: ${{ secrets.ENV_REPO_PAT }}
      - name: Bump image tag
        env:
          SHA: ${{ github.sha }}
        run: |
          yq -i '.image.tag = strenv(SHA)' charts/payments-api/overlays/prod/values-prod.yaml
      - name: Create PR to trigger ArgoCD
        uses: peter-evans/create-pull-request@v6
        with:
          branch: bump/payments-${{ github.sha }}
          title: "Deploy payments-api ${{ github.sha }} to prod"

We also carved out a Terraform module boundary that developers could consume without learning AWS arcana:

module "service" {
  source             = "git::https://github.com/fintech/infra-modules.git//service?ref=v1.6.0"
  name               = "payments-api"
  cpu                = 250
  memory             = 512
  replicas           = 3
  ingress_hostnames  = ["api.prod.example.com"]
  enable_sqs_dlq     = true
  enable_rds_iam_auth = true
}

No Helm hell, no 400-line YAML PRs. The module renders sane defaults, the chart takes values, Argo reconciles.
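"Sane defaults" means the module's variables ship with opinionated fallbacks, so the call site above stays short. A sketch of what the variable block might look like — illustrative, not the client's actual module:

```hcl
variable "cpu" {
  description = "CPU request in millicores"
  type        = number
  default     = 250
}

variable "replicas" {
  description = "Desired replica count; HPA can override"
  type        = number
  default     = 2
}

variable "enable_sqs_dlq" {
  description = "Provision a dead-letter queue alongside the service queue"
  type        = bool
  default     = false
}
```

Teams override only what they need; everything else follows the paved road.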

The boring automation that paid for itself

A few unsexy details made the difference:

  • Secret management with SOPS: Encrypted values live in Git with age keys; ArgoCD decrypts them at manifest render time. No more “who rotated that?” mysteries.
  • Preview environments per PR: Same chart, new namespace; ephemeral RDS via templates for one team; mock services for others.
  • OpenTelemetry out of the box: Every template emitted traces, logs, metrics. PMs stopped arguing about “it’s slow” vs “it’s fine.”
  • Default SLOs: p95 latency targets with a 99% objective; burn-rate alerts routed to each squad’s Slack channel via Alertmanager.
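The SOPS setup hinges on a `.sops.yaml` at the repo root so encryption rules apply automatically on `sops -e`. A minimal example — the age recipient is a placeholder, and the regexes are illustrative:

```yaml
creation_rules:
  - path_regex: .*values-.*\.yaml$
    # Placeholder recipient; use your team's age public key.
    age: age1examplepublickey00000000000000000000000000000000000000
    # Encrypt only sensitive keys; leave the rest diffable in PRs.
    encrypted_regex: ^(data|stringData|password|apiKey)$
```

Keeping non-secret values in plaintext means reviewers can still read the diff that triggers a deploy.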

Backstage template snippet (developers choose runtime, we hard-code the good choices):

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: golden-node-service
spec:
  owner: platform
  type: service
  parameters:
    - title: Service Info
      properties:
        name: { type: string }
        squad: { type: string }
  steps:
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=fintech&repo=${{ parameters.name }}
    - id: register
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}

Developers click “Create,” get a repo with a Dockerfile, Helm chart, CI workflow, SLO dashboard, and a catalog-info.yaml. That’s the paved road.

Results by the numbers (90 days, no heroics)

We didn’t add headcount. We stopped doing platform magic tricks and let the tools work.

  • Lead time for changes: 4.1 days → 1.2 days (−71%)
  • Deployment frequency: 1.1/week/service → 6.8/week/service (6x)
  • MTTR: 98 minutes → 41 minutes (−58%), thanks to fast rollbacks via Git revert + Argo sync
  • Change failure rate: 18% → 9% (−50%), feature flags + preview envs caught the silly stuff
  • Onboarding (new hire to first prod PR): 14 days → 3 days
  • Build agent outages: 2/month → 0 (Jenkins retired)
  • Infra cost: flat to −8% despite more deploys, because we right-sized with cluster autoscaler and killed zombie workloads
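The MTTR drop leans on Git-revert rollbacks: revert the image-bump commit in the environments repo and let ArgoCD converge. A self-contained simulation of the revert step, using a throwaway repo and hypothetical tags — in production you would push the revert and let ArgoCD sync (or run `argocd app sync`):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q env-repo && cd env-repo
git config user.email ci@example.com
git config user.name ci

# Two deploys: a good one, then a bad one.
echo 'tag: good-sha' > values-prod.yaml
git add . && git commit -qm "Deploy payments-api good-sha"
echo 'tag: bad-sha' > values-prod.yaml
git add . && git commit -qm "Deploy payments-api bad-sha"

# Roll back: revert the bad bump; ArgoCD would reconcile prod to the old tag.
git revert --no-edit HEAD
grep 'tag:' values-prod.yaml   # prints: tag: good-sha
```

Because the rollback is itself a commit, the audit trail stays intact — no one ever kubectl-edits prod back to health.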


What we’d do differently next time

I’ve seen teams over-rotate into platformization and ship a labyrinth. We intentionally kept it simple, but a few things we’d tweak:

  • Start OpenFeature earlier to decouple feature-flag vendor choice from code.
  • Set ArgoCD to manual sync in prod for the first two weeks per service; one hotfix sprint got spicy.
  • Bake perf test scaffolding into the template from day one; we added k6 later after we hit p95 regressions in preview envs.
  • Use a shared environments repo per domain instead of a monorepo for everything—merge conflicts got annoying around week 8.

Copy/paste: configs you can steal

Don’t ask for permission—steal what works. Three drop-ins that moved the needle:

  • Minimal SOPS setup for values files:
# create an age keypair
age-keygen -o key.txt
# the private half handles decryption; derive the public half for encryption
export SOPS_AGE_KEY_FILE=key.txt
sops --encrypt --age "$(age-keygen -y key.txt)" values.yaml > values.enc.yaml
  • ArgoCD ApplicationSet for preview environments:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-previews
spec:
  generators:
    - pullRequest:
        github:
          owner: fintech
          repo: payments-api
          tokenRef:
            secretName: github-token
            key: token
        requeueAfterSeconds: 60
  template:
    metadata:
      name: payments-pr-{{number}}
    spec:
      source:
        repoURL: https://github.com/fintech/payments-api
        targetRevision: pull/{{number}}/head
        path: chart
        helm:
          values: |
            image:
              tag: pr-{{number}}
            env: pr-{{number}}
      destination:
        namespace: payments-pr-{{number}}
        server: https://kubernetes.default.svc
      syncPolicy:
        automated: { prune: true, selfHeal: true }
  • Default SLO alert (PrometheusRule):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-slo
spec:
  groups:
  - name: latency-slo
    rules:
    - alert: PaymentsP95TooHigh
      expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{service="payments-api"}[5m])) by (le)) > 0.300
      for: 10m
      labels:
        severity: page
      annotations:
        summary: "Payments API p95 latency over 300ms"
        runbook_url: https://runbooks.internal/payments/p95


Key takeaways

  • Thin platform, thick paved road: keep the platform boring and the developer experience opinionated.
  • GitOps with ArgoCD eliminated snowflake drift and weekend change windows.
  • Backstage + golden templates prevented template sprawl and reduced onboarding time.
  • OIDC to AWS removed long-lived keys and most flaky CI secrets issues.
  • Preview environments paid off immediately for PR cycle time and PM sign-off.
  • Measure with DORA and SLOs or you’re just rearranging deck chairs.
  • Start with a migration factory: 80/20 automation plus office hours beats “big bang.”

Implementation checklist

  • Baseline DORA metrics before touching anything.
  • Stand up a read-only GitOps pilot on a non-critical service first.
  • Codify one golden path (app template + infra modules) before you boil the ocean.
  • Flip CI to OIDC; kill static cloud keys in repos and org secrets.
  • Enable preview environments for at least one customer-facing service.
  • Instrument with OpenTelemetry; add a default dashboard per service.
  • Run a weekly migration clinic—engineers bring repos, leave with PRs.

Questions we hear from teams

Why ArgoCD over Flux?
Both are solid. This team wanted UI-first visibility and RBAC granularity for auditors. Argo’s Application notion mapped cleanly to their domains, and their SREs already knew it.
Why not keep Jenkins and just fix it?
We tried. The pet server, plugins, and secrets model were the problem. Moving to GitHub Actions with OIDC removed whole classes of failure and simplified compliance.
How do preview environments avoid database collisions?
Two modes: for stateless services we use seeded ephemeral DBs (RDS snapshot clones) in a shared test account; for services with stateful constraints we replace live DB calls with contract-tested mocks in preview envs.
What did you measure to prove productivity gains?
DORA metrics (lead time, deployment frequency, change failure rate, MTTR), onboarding time-to-first-PR, PR cycle time, and incident counts. We also tracked SLO burn and change freeze violations.
Can we do this without Backstage?
Yes, but you’ll pay the tax in templates and documentation drift. Backstage made the paved road discoverable and consistent. Start small with one template and a catalog.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to a GitPlumbers engineer, or see how we approach platform engineering.
