The Fintech Rollout That Didn’t Breach: Security‑First Dev That Paid Off When Prod Got Probed

A scale-up in payments went from “please don’t let us be on Have I Been Pwned” to measurable risk reduction by baking security into the dev loop. Here’s exactly what we did, what it cost, and what it saved.


The moment that proved the bet

Midnight cutover, fintech scale-up, new card‑issuing features riding on EKS behind CloudFront + WAF. Within 7 minutes of DNS flipping, the usual internet fauna showed up: Shodan, masscan, opportunistic SSRF pokes. We watched the WAF logs spike, Istio mTLS handshakes humming, and—most importantly—no exfil. Two months earlier, the same shop had public S3 buckets and long‑lived creds tucked into a values.yaml. I’ve seen that movie end in breach reports. This time, the blast radius was pre‑shrunk by design.

“I expected a 2 a.m. incident bridge. Instead, we went home.” — VP Eng, anonymized fintech client

We didn’t get lucky. We changed how code was written, built, and admitted to prod. Security wasn’t a gate at the end; it was a failing test from the first commit.

Where we started (and the constraints)

Context matters:

  • Industry: card issuing + payouts (PCI DSS, SOC 2 Type II). Seasonality: Friday traffic spikes from gig worker payouts.
  • Stack: Monolith (Node 14 + MongoDB) plus new TypeScript microservices on EKS (1.28) with Argo CD GitOps. API Gateway → Istio ingress → services.
  • Reality: AI‑assisted code boosted PR count 40%, but also shipped “vibe code” patterns—SSRF‑prone HTTP clients, lax input validation, and copy‑pasted axios calls to metadata endpoints.
  • Constraints: 14 engineers total, 1 security hire, 3 months to audit for a partner bank. No appetite for a “stop ship” program.

Symptoms we found in week one:

  • Hardcoded JWT_SECRET in env.ts, API keys in Git history, and an S3 bucket with public-read ACLs.
  • Containers running as root, no NetworkPolicies, and wide‑open egress.
  • Dependencies with known vulns (e.g., jsonwebtoken@8 pre-patch, transitive lodash CVEs).
  • IaC drift and hand‑edited security groups.

I’ve seen teams try to fix this with a quarterly pen test and a Confluence page. It never sticks. We needed guardrails that broke builds when risk increased.

What we changed in the dev loop

We added opinionated friction where it counts and speed everywhere else.

  1. Pre-commit and pre-push hooks to catch dumb stuff fast:

    • pre-commit with gitleaks and detect-secrets to stop secrets at source.
    • semgrep rules for SSRF, unsanitized axios calls, and insecure deserialization.
  2. CI that fails on real risk, not noise:

    • SAST: semgrep --config auto with a handful of custom rules.
    • SCA/Containers: osv-scanner, npm audit --omit=dev, and trivy image.
    • SBOMs: syft packages -o cyclonedx attached to releases.
    • Signing: cosign sign --keyless on images; verify at admission.
  3. Remediation SLOs baked into delivery:

    • P0: 48h, P1: 7d, P2: 30d. Jira automation opens tickets with CVE + owning team.
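The pre-commit side of step 1 fits in one file. A sketch of the `.pre-commit-config.yaml` shape (the `rev` pins and `args` here are illustrative, not our exact config — pin to current releases before adopting):

```yaml
# .pre-commit-config.yaml — hook setup for secrets + SAST at commit time
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4           # pin to a current release
    hooks:
      - id: gitleaks
  - repo: https://github.com/Yelp/detect-secrets
    rev: v1.5.0
    hooks:
      - id: detect-secrets
        args: ['--baseline', '.secrets.baseline']
  - repo: https://github.com/returntocorp/semgrep
    rev: v1.78.0
    hooks:
      - id: semgrep
        # run only the curated local rules at commit time; full scan stays in CI
        args: ['--config', 'rules/', '--error']
```

Keeping the commit-time hook set small is deliberate: anything slower than a couple of seconds gets bypassed with `--no-verify`.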

Here’s the condensed GitHub Actions we used to gate merges:

name: ci-security
on:
  pull_request:
  push:
    branches: [main]
permissions:
  contents: read
  id-token: write  # OIDC for keyless signing
  packages: write
jobs:
  build_and_scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - name: Install deps
        run: npm ci
      - name: Semgrep SAST
        uses: returntocorp/semgrep-action@v1
        with:
          config: auto
          generateSarif: true
          publishToken: ${{ secrets.SEMGREP_TOKEN }}
      - name: Secret scan
        uses: gitleaks/gitleaks-action@v2
      - name: Build image
        run: |
          docker build -t ghcr.io/acme/payments:${{ github.sha }} .
      - name: Trivy image scan
        uses: aquasecurity/trivy-action@0.24.0
        with:
          image-ref: ghcr.io/acme/payments:${{ github.sha }}
          format: 'table'
          vuln-type: 'os,library'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'  # fail on high/critical
      - name: Generate SBOM (CycloneDX)
        run: |
          curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
          syft packages dir:. -o cyclonedx-json > sbom.json
      - name: Push image
        run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker push ghcr.io/acme/payments:${{ github.sha }}
      - name: Cosign sign (keyless)
        run: |
          COSIGN_EXPERIMENTAL=1 cosign sign --yes ghcr.io/acme/payments:${{ github.sha }}

No silver bullets—just fast, deterministic feedback. Devs saw why a PR failed, with links to the line and the fix.
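To give a flavor of the custom rules: here is a hypothetical Semgrep taint rule (not our exact ruleset) for the pattern that bit us most, request-derived values flowing into an axios URL. Sources and sinks would need tuning to your framework:

```yaml
# rules/axios-ssrf.yaml — illustrative taint rule for Express + axios
rules:
  - id: axios-url-from-request
    languages: [typescript, javascript]
    severity: ERROR
    message: >
      Request-derived value reaches an axios URL — SSRF risk.
      Validate against a host allowlist before making the call.
    mode: taint
    pattern-sources:
      - pattern: req.query
      - pattern: req.body
      - pattern: req.params
    pattern-sinks:
      - pattern: axios.get($URL, ...)
      - pattern: axios.post($URL, ...)
      - pattern: axios($URL, ...)
```

A taint rule like this catches the flow even when the URL is built several assignments away from the handler, which is exactly where grep-style rules miss.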

Guardrails in the platform (contain the blast)

Catching issues early is half the game. The other half is making “bad” code non‑exploitable.

  • Admission control with signatures and baseline security:
    • We used Kyverno to require cosign signatures and runAsNonRoot.
    • Pod Security level: baseline cluster‑wide; privileged only via just‑in‑time exceptions.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-and-harden
spec:
  validationFailureAction: enforce
  rules:
    - name: require-signed-images
      match: { resources: { kinds: [Pod] } }
      verifyImages:
        - imageReferences: ['ghcr.io/acme/*']
          attestors:
            - entries:
                - keyless:
                    # keyless: trust the CI workflow's OIDC identity, not a static key
                    subject: 'https://github.com/acme/*'
                    issuer: 'https://token.actions.githubusercontent.com'
                    rekor:
                      url: https://rekor.sigstore.dev
    - name: require-nonroot-and-readonly
      match: { resources: { kinds: [Pod] } }
      validate:
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true
  • Network segmentation and egress deny:
    • Default‑deny egress with targeted allowlists killed SSRF to metadata/IPs.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-and-allowlist
  namespace: payments
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector: { matchLabels: { istio-injection: enabled } }
  egress:
    # DNS first, or nothing resolves under default-deny
    - to:
        - namespaceSelector: { matchLabels: { kubernetes.io/metadata.name: kube-system } }
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]
    # internal infra services, any port
    - to:
        - namespaceSelector: { matchLabels: { name: infra } }
    # external HTTPS only, with the IMDS address carved out
    - to:
        - ipBlock: { cidr: 0.0.0.0/0, except: ['169.254.169.254/32'] }
      ports: [{ protocol: TCP, port: 443 }]
  • Cloud hardening:
    • Enforced AWS IMDSv2 on all nodes, blocked IMDS IP at network layer, and moved to OIDC‑based federation for CI to AWS (no long‑lived keys).
  • IaC scanning and drift control:
    • checkov and tfsec in CI; Terraform plans auto‑commented on PRs; Argo CD drift alerts.
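The IaC gate was one more job in the same CI workflow. Roughly this shape, using the Checkov action (version pin and `infra/` path are illustrative):

```yaml
  # runs alongside build_and_scan in the ci-security workflow
  iac_scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Checkov (Terraform + Kubernetes manifests)
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infra/
          framework: terraform,kubernetes
          soft_fail: false  # hard-fail once the existing backlog is burned down
```

Start with `soft_fail: true` while you triage the initial findings, then flip it — same warn-to-fail ramp as the other scanners.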

The supply chain piece most teams skip

If an attacker can push an unsigned image or you can’t answer “what’s running where,” you’re already in breach‑adjacent territory.

  • SBOMs by default: We generated CycloneDX SBOMs per image and uploaded them as release assets, searchable by CVE.
  • Sign and verify: cosign sign --keyless during CI; Kyverno enforced verification before a Pod was admitted.
  • Provenance: We added SLSA provenance attestations so we could answer auditors on “who built this and how.”
  • Runtime: Falco rules on nodes for unexpected syscall patterns; signature mismatches emitted Prometheus alerts funneled into Grafana + PagerDuty.
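A sketch of the kind of Falco rule we mean, flagging any container that opens a connection to the metadata IP. With egress denied at the network layer this should never fire, which makes it a cheap tripwire (condition uses Falco's stock `outbound` and `container` macros; adapt to your ruleset):

```yaml
- rule: Outbound connection to cloud metadata service
  desc: >
    A container attempted to reach 169.254.169.254. If this fires,
    something slipped past the NetworkPolicy egress deny.
  condition: >
    outbound and container and fd.sip = "169.254.169.254"
  output: >
    Metadata service contact (command=%proc.cmdline
    container=%container.name image=%container.image.repository)
  priority: WARNING
  tags: [network, ssrf]
```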

When a third‑party SDK shipped a critical CVE, the SBOM allowed us to find and patch all affected services in hours, not days.

What happened when prod got probed (and the numbers)

Within the first week post‑launch we saw:

  • WAF spikes: 18k rule hits in 24h (SQLi/SSRF probes), 0 successful requests to sensitive paths.
  • Container security: 3 attempted pods killed at admission due to missing signatures (misconfigured internal jobs, not attackers—but the control worked).
  • Secrets protection: 137 secret pushes blocked by gitleaks in the first month; by month two, that dropped to 9 as devs learned.

Measurable outcomes over 90 days:

  • P0 vulnerability MTTR dropped from 23 days to 48 hours (tracked in Jira; dashboard visible in standups).
  • High/critical container vulns in running workloads down 82% (Trivy reports baseline vs. month 3).
  • Public resource misconfigs (S3/SG) from 7 to 0 (Checkov policy coverage).
  • Zero long‑lived cloud keys in repos (migrated to OIDC; validated by org‑wide secret scanning).
  • Pen test results: High findings from 12 → 1; no exploitable data paths.

Financially, the CFO could finally run a risk line: expected loss modeled from peers ~ $1.8–$3.2M for a data‑exposure event. Our controls avoided two classes of incidents we’ve personally cleaned up elsewhere: untrusted image in prod and SSRF to metadata. We can’t claim a counterfactual, but the cheapest incident is the one that never starts.

What actually saved us (not the slideware)

  • Default‑deny egress: SSRF attempts died on the vine. AI‑written HTTP clients couldn’t wander off to IMDS.
  • Signed images + verify at admission: Stopped “helpful” hotfixes from engineers pushing locally built images—and the classic crypto‑miner dropper angle.
  • Pre‑commit secret scanning: Prevented blast radius from ever reaching CI. You don’t want to rotate Stripe keys on a Friday.
  • SLOs for remediation: Time‑bound fixes create focus. Treat security like availability.
  • Developer‑centric feedback: We curated Semgrep rules to 14 that mattered to our stack. False positives kill programs.

I’ve seen teams buy a tool buffet and drown. This worked because it was ruthlessly small, automated, and enforced by code.

If you copy one thing, copy this rollout plan

Start small and ratchet up friction only after you prove value.

  1. Week 1–2: Add gitleaks, semgrep, and trivy to CI in warn‑only. Generate SBOMs with syft.
  2. Week 3: Turn on fail for CRITICAL/HIGH in Trivy; set remediation SLOs; open Jira automation.
  3. Week 4–6: Enforce Kyverno policies for signatures and non‑root. Ship default‑deny egress NetworkPolicy.
  4. Week 6–8: Move CI to OIDC; kill all long‑lived keys; enable IMDSv2 and block IMDS IP.
  5. Week 8–10: Add Falco + Prometheus alerts; codify exceptions with timeboxes (“break‑glass” labels).
  6. Ongoing: Curate SAST rules quarterly; practice incident drills; keep SBOMs visible to product.
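For the timeboxed break-glass exceptions in step 5, Kyverno's PolicyException resource is one way to codify them. A hypothetical example (names and labels are placeholders; Kyverno doesn't expire exceptions itself, so the `expires` label is enforced by a review cadence):

```yaml
apiVersion: kyverno.io/v2
kind: PolicyException
metadata:
  name: breakglass-migration-job   # hypothetical one-off job
  namespace: payments
  labels:
    breakglass: "true"
    expires: "2024-07-01"          # reviewed and deleted on schedule
spec:
  exceptions:
    - policyName: verify-and-harden
      ruleNames: [require-nonroot-and-readonly]
  match:
    any:
      - resources:
          kinds: [Pod]
          namespaces: [payments]
          names: ['db-migration-*']
```

Because the exception is a Git-tracked resource, it shows up in PR review and Argo CD diffs like everything else — no Slack-thread exemptions.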

Tooling we actually used and recommend:

  • Code: semgrep, gitleaks, osv-scanner
  • Containers & SBOM: trivy, syft (CycloneDX)
  • Supply chain: cosign, SLSA provenance
  • Platform: Kyverno (or OPA Gatekeeper), Pod Security, Istio mTLS
  • IaC: checkov, tfsec, Terraform Cloud policy sets
  • Secrets: AWS OIDC, HashiCorp Vault for app creds

What we’d do differently next time

  • Bake developer training earlier: 30 minutes on SSRF and egress rules saved a week of PR back‑and‑forth.
  • Start with soft‑fail longer: We flipped hard‑fail in week 3; one team lost a day to dependency churn. I’d stage by repo criticality.
  • Fewer tools, better rules: We dropped a DAST scanner that created noise and stuck to curated SAST + runtime checks.
  • Plan for AI code realities: Add lint rules that flag risky patterns common in AI‑generated snippets and budget time for vibe code cleanup and AI code refactoring in sprint planning.

Key takeaways

  • Security is a development concern, not a separate phase—treat misconfigs and vulnerable dependencies like failing tests.
  • CI gates with quick feedback (Semgrep, gitleaks, Trivy) beat after-the-fact audits and reduce MTTR dramatically.
  • Supply-chain controls (Cosign signing + verify at admission, SBOMs) stop untrusted images from ever running.
  • Platform guardrails (OPA/Kyverno, NetworkPolicy, IMDSv2) contain impact even when a service is probed.
  • Adopt a remediation SLO and ticket it like product work; measure time-to-fix just like MTTR.
  • Roll out in thin slices: warn → soft-fail → hard-fail to avoid developer revolt.

Implementation checklist

  • Add pre-commit hooks: `pre-commit`, `gitleaks`, `semgrep`
  • Gate CI on SAST/SCA/Container scan: `semgrep --config auto`, `trivy fs|image`, `osv-scanner`
  • Generate and publish SBOMs (`syft`) and sign artifacts (`cosign sign --keyless`)
  • Enforce signature & policy at admission (Kyverno or OPA Gatekeeper)
  • Default-deny egress with Kubernetes `NetworkPolicy`; enable Istio mTLS
  • Use short-lived creds (OIDC to cloud) and enforce AWS IMDSv2
  • Set remediation SLOs (e.g., P0 ≤ 48h) and track in Jira with dashboards
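The short-lived-creds item, as a workflow step. The account ID and role name are placeholders, and the job needs `id-token: write` permissions as in the CI workflow earlier:

```yaml
      # short-lived AWS creds via GitHub OIDC — no stored access keys
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # example role
          aws-region: us-east-1
```

Scope the IAM role's trust policy to your org/repo and branch so a fork can't assume it.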

Questions we hear from teams

How did you keep developer velocity while adding security gates?
We pushed checks as far left as possible: pre-commit for secrets and simple Semgrep rules, CI for heavier scans. We curated rules (14 custom Semgrep rules) to reduce false positives and staged warn→fail over three weeks. MTTR dropped while PR cycle time stayed within ±6% of baseline.
Why Kyverno over OPA Gatekeeper?
Both work. Kyverno’s policy-as-CRD model and built-in `verifyImages` made Cosign enforcement straightforward for the team. If you already have Gatekeeper expertise and a Rego library, stick with it.
Did you run DAST?
We trialed a DAST scanner but the signal-to-noise was poor for APIs behind mTLS. We focused on SAST, supply chain, IaC scanning, and runtime detection (Falco). For public-facing web apps, we’d pair this with targeted DAST and canary endpoints.
What’s the business impact you actually measured?
Security incident MTTR equivalents, pen test findings reduced from 12 high to 1, and policy coverage (0 public buckets, 0 long-lived keys). Finance used peer incident data to estimate avoided loss in the low seven figures for a data exposure event.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about security-first delivery, or see how we fix AI-generated code safely.