The Security Gates That Didn't Slow Us Down: How a B2B Fintech Dodged a Seven-Figure Breach

Security-first development usually reads like overhead. Here’s how making it the default saved a payments platform from a very public, very expensive incident—without killing velocity.

> We didn’t buy a SIEM and hope. We made unsafe changes un-mergeable. That’s what actually prevents breaches.

The setup you never want to inherit

I walked into a B2B payments platform scaling from 30 to 200+ deploys/week on EKS with Terraform IaC, ArgoCD for GitOps, and a mixed Go/Node.js stack. Think: SOC 2 Type II in-flight, PCI SAQ A-EP on the horizon, and an executive mandate not to slow feature delivery. Classic.

  • Infra: AWS EKS, RDS Postgres, MSK, S3, CloudFront. Istio for mTLS. Cilium for networking.
  • Tooling: GitHub Enterprise, Actions, CODEOWNERS, Renovate. Trivy/Grype for scanning. Syft for SBOM. Cosign for signing.
  • Constraints: 45 engineers, no feature freeze, p95 deploy latency target unchanged, auditors poking around logging and change control.

The team had decent hygiene—unit tests, canaries, on-call with SLOs—but security was “after QA.” I’ve seen that movie. It ends with a weekend incident and a board deck you don’t want to write.

The near-miss that changed the conversation

Two things happened within a week:

  1. A pen tester found an SSRF path in a legacy Node service calling out to a third-party AML API. Nothing hit prod, but the pattern was everywhere.
  2. Our Terraform plan for analytics accidentally widened an S3 bucket policy. The dev caught it in review, but it was luck, not process.

I’ve seen both become headline breaches. We needed security defaults that prevented these classes of bugs from ever merging—and we needed them fast.

What we changed in 90 days

We didn’t “boil the ocean.” We embedded four controls where they hurt least and helped most:

  • Shift-left checks in CI for code and IaC
  • Supply chain integrity (SBOM, signing, provenance)
  • Kubernetes guardrails that fail-closed
  • Runtime egress controls to blunt exfil and SSRF

We paired that with golden templates and clear failure messages so devs could self-serve fixes.

CI that blocks the right things (and nothing else)

We wired GitHub Actions to make risky changes impossible to merge. No tickets. No humans in the loop. If it’s unsafe, it doesn’t land.

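# .github/workflows/security-gates.yml
name: security-gates
on:
  pull_request:
    branches: [ main ]
jobs:
  sast-iac-supplychain:
    runs-on: ubuntu-22.04
    permissions:
      contents: read
      security-events: write
      id-token: write
      packages: write   # needed to push to ghcr.io before signing
    steps:
      - uses: actions/checkout@v4
      - name: SAST (Semgrep)
        uses: returntocorp/semgrep-action@v1
        with:
          config: "p/ci,p/security-audit"
          severity: WARNING
      - name: IaC policy (Conftest)
        # --all-namespaces evaluates packages other than the default `main`.
        # Policies that inspect planned changes need plan JSON
        # (`terraform show -json`) in this directory, not just .tf files.
        run: |
          docker run --rm -v "$PWD/policies:/policies" -v "$PWD:/project" openpolicyagent/conftest test /project/terraform -p /policies --all-namespaces
      - name: SBOM (Syft) + scan (Grype)
        run: |
          curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
          curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
          syft . -o json > sbom.json
          grype sbom:sbom.json --fail-on critical
      - uses: sigstore/cosign-installer@v3
      - name: Build, push, and sign image (Cosign)
        env:
          COSIGN_EXPERIMENTAL: "1"
          COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_KEY }}
        run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker build -t ghcr.io/org/service:${{ github.sha }} .
          # Cosign signs images that exist in a registry, so push before signing.
          docker push ghcr.io/org/service:${{ github.sha }}
          # Read the key from the environment instead of writing it to disk.
          cosign sign --yes --key env://COSIGN_PRIVATE_KEY ghcr.io/org/service:${{ github.sha }}

  • Semgrep caught the SSRF patterns. We added company rules for outbound calls.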
  • Conftest blocked dangerous Terraform plans (public S3, 0.0.0.0/0 egress, unencrypted RDS).
  • Syft/Grype built and scanned an SBOM, failing PRs on criticals, including transitives.
  • Cosign signed images; later we enforced verification at admission.

Result: PRs failed with actionable output. p95 time-to-merge? +8 minutes, which the VP Eng signed off on because deploy frequency stayed flat.
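
To make "actionable output" concrete, here is a minimal Python sketch of a post-processing step that turns a Grype JSON report into one line per blocking finding. It is an illustration, not the pipeline we ran: `summarize`, `blocking_findings`, `gate`, and the `SAMPLE` report are hypothetical names and data, and the only assumption about Grype is the `matches[].vulnerability` / `matches[].artifact` layout of `grype -o json`.

```python
from collections import Counter

# Assumption: mirror the `--fail-on critical` threshold used in the workflow.
BLOCKING = {"Critical"}


def summarize(report: dict) -> Counter:
    """Count matches per severity in a Grype JSON report."""
    return Counter(
        match["vulnerability"].get("severity", "Unknown")
        for match in report.get("matches", [])
    )


def blocking_findings(report: dict) -> list[str]:
    """Return one 'ID: package version' line per blocking match."""
    lines = []
    for match in report.get("matches", []):
        vuln, artifact = match["vulnerability"], match["artifact"]
        if vuln.get("severity") in BLOCKING:
            lines.append(f'{vuln["id"]}: {artifact["name"]} {artifact["version"]}')
    return sorted(lines)


def gate(report: dict) -> int:
    """Return a process exit code: 1 if anything blocking was found, else 0."""
    findings = blocking_findings(report)
    for line in findings:
        print(f"BLOCKED {line}")
    return 1 if findings else 0


# Hypothetical report, trimmed to the fields used above.
SAMPLE = {
    "matches": [
        {
            "vulnerability": {"id": "CVE-2021-44228", "severity": "Critical"},
            "artifact": {"name": "log4j-core", "version": "2.14.1"},
        },
        {
            "vulnerability": {"id": "EXAMPLE-0001", "severity": "Medium"},
            "artifact": {"name": "example-lib", "version": "1.0.0"},
        },
    ]
}
```

In CI this could be wired as `sys.exit(gate(json.load(open("grype.json"))))` after running `grype sbom:sbom.json -o json > grype.json`; the file name is an assumption.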

Guardrails in the cluster (no heroics required)

We enforced safety with policy-as-code so reviewers didn’t have to play security cop. Two examples that paid off immediately:

  1. Reject risky pods by default with Kyverno
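apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-security
spec:
  validationFailureAction: Enforce
  rules:
    # Match Pods only: Kyverno auto-generates equivalent rules for
    # Deployments and other pod controllers, and a Pod-shaped pattern
    # would never match a Deployment's spec.template.spec.
    - name: no-privileged-no-hostpath
      match:
        any:
        - resources:
            kinds: ["Pod"]
      validate:
        message: "Pods must be non-root and unprivileged, with a read-only root FS, no hostPath volumes, and no :latest tag."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
            # If volumes are defined, none may be a hostPath.
            =(volumes):
            - X(hostPath): "null"
            containers:
            - name: "*"
              securityContext:
                # If privileged is set at all, it must be false.
                =(privileged): "false"
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                capabilities:
                  drop: ["ALL"]
              image: "!*:latest"
    - name: require-limits
      # Every rule needs its own match block.
      match:
        any:
        - resources:
            kinds: ["Pod"]
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
            - name: "*"
              resources:
                limits:
                  cpu: "?*"
                  memory: "?*"

  2. Only run signed images from our registry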
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
  - name: verify-cosign
    match:
      any:
      - resources:
          kinds: ["Pod"]
    # Verifies signatures on matching images only; restricting pods to this
    # registry needs a separate validate rule.
    verifyImages:
    - imageReferences:
      - "ghcr.io/org/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |-
              -----BEGIN PUBLIC KEY-----
              ...redacted...
              -----END PUBLIC KEY-----

No more :latest. No more privileged pods. No unsigned images. When Log4Shell-style transitive vulnerabilities surface, the SBOM scan blocks them at PR, and any image that skipped the pipeline is unsigned and fails admission before hitting a node.
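
For local feedback before a manifest reaches the cluster, the baseline described in this section (non-root, unprivileged, no privilege escalation, read-only root filesystem, dropped capabilities, no :latest, resource limits) can be approximated in a few lines. This is a minimal Python sketch and not how Kyverno evaluates policies; `baseline_violations`, `GOOD_POD`, and the image tag are hypothetical.

```python
def baseline_violations(pod: dict) -> list[str]:
    """Return human-readable violations for a Pod manifest given as a dict."""
    problems = []
    spec = pod.get("spec", {})
    if spec.get("securityContext", {}).get("runAsNonRoot") is not True:
        problems.append("spec.securityContext.runAsNonRoot must be true")
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext", {})
        if ctx.get("privileged") is True:
            problems.append(f"{name}: privileged containers are not allowed")
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if ctx.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: readOnlyRootFilesystem must be true")
        if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        if container.get("image", "").endswith(":latest"):
            problems.append(f"{name}: image must not use the :latest tag")
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                problems.append(f"{name}: resources.limits.{resource} is required")
    return problems


# Hypothetical manifest that satisfies every check above.
GOOD_POD = {
    "spec": {
        "securityContext": {"runAsNonRoot": True},
        "containers": [
            {
                "name": "payments",
                "image": "ghcr.io/org/service:1.4.2",
                "securityContext": {
                    "allowPrivilegeEscalation": False,
                    "readOnlyRootFilesystem": True,
                    "capabilities": {"drop": ["ALL"]},
                },
                "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
            }
        ],
    }
}
```

A check like this suits a pre-commit hook or a golden-template test; the admission policy remains the source of truth.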

Blunting exfil and SSRF with egress allowlists

The pen test SSRF finding pushed us to lock egress at the network layer. With Cilium, we used FQDN policies so services could only call what they were supposed to.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payments-egress
  namespace: payments
spec:
  endpointSelector:
    matchLabels:
      app: payments
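  egress:
    # DNS must be allowed and proxied so Cilium can learn the IPs behind
    # toFQDNs. kube-dns is not a Cilium entity, so select it by label.
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s:k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    - toFQDNs:
        - matchName: api.partner-aml.com
        - matchName: auth.stripe.com
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: observability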

Combined with app-layer timeouts and explicit DNS, this killed whole classes of data exfil and SSRF attempts without developer heroics.
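
The app-layer half of that defense can be sketched as an outbound URL check that runs before any request is made. This is a minimal Python illustration under stated assumptions, not the Node service's actual code: `ALLOWED_HOSTS`, `resolve_ips`, and `outbound_allowed` are hypothetical names, and the host list simply reuses the two FQDNs from the network policy.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Assumption: the same destinations the network policy allows.
ALLOWED_HOSTS = {"api.partner-aml.com", "auth.stripe.com"}


def resolve_ips(host: str) -> list[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def outbound_allowed(url: str, resolver=resolve_ips) -> bool:
    """Allow only HTTPS calls to allowlisted hosts that resolve to public IPs."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme != "https" or host not in ALLOWED_HOSTS:
        return False
    ips = resolver(host)
    # Reject if any resolved address is private, loopback, or link-local
    # (for example the cloud metadata endpoint at 169.254.169.254).
    return bool(ips) and all(ipaddress.ip_address(ip).is_global for ip in ips)
```

The check resolves the name separately from the eventual connection, so it does not cover DNS rebinding on its own; that gap is one reason the network-layer allowlist still matters.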

What it saved us (with numbers)

We’re allergic to vanity metrics. Here’s what moved:

  • 94% reduction in open critical vulns in 60 days (from 126 to 8) as Renovate plus SBOM scanning cleared the backlog.
  • Zero production security incidents in 12 months.
  • MTTR for security patches dropped from ~3 days to <24 hours; most were under a single business day.
  • p95 time-to-merge increased 8 minutes; deployment frequency and change failure rate stayed within DORA targets.
  • 99.8% of images submitted to prod were signed and verified; the remaining 0.2% were blocked at admission.
  • Estimated breach cost avoided: $3–5M, based on IBM’s 2023 average breach cost ($4.45M) and our data footprint. Not a stretch given the S3 policy near-miss.
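
For anyone reproducing these figures on their own data, the arithmetic is simple enough to keep in a script next to the dashboard. A minimal Python sketch follows; `percent_reduction` and `p95` are hypothetical helper names, and the nearest-rank percentile is one common definition, not necessarily the one your metrics tool uses.

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# The vulnerability backlog figures quoted above: 126 open criticals down to 8.
print(round(percent_reduction(126, 8), 1))  # 93.7
```

Feeding `p95` the per-PR merge durations before and after the gates went in gives the delta to report; the +8 minutes quoted here came from the team's own data, which this sketch does not reproduce.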

Business impact: We passed SOC 2 Type II and tightened PCI scope without adding headcount. Feature delivery didn’t stall. The CFO stopped asking why security was “R&D overhead.”

A concrete example: killing a risky Terraform change at PR

The S3 policy near-miss? We turned it into a Conftest rule so it can never happen again.

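# policies/s3.rego
# Assumes the input is plan JSON from `terraform show -json`, where each
# planned change sits under resource_changes[_].
package terraform.aws.s3

import rego.v1

# A public access block that stops blocking public bucket policies.
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket_public_access_block"
  rc.change.after.block_public_policy == false
  msg := sprintf("Public S3 policy not allowed: %v", [rc.address])
}

# A bucket whose planned ACL is world-readable.
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket"
  rc.change.after.acl == "public-read"
  msg := sprintf("S3 ACL must not be public: %v", [rc.address])
}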

Hooked up to the workflow above, any PR that widens S3 access fails fast with a message the developer can act on. No security review meeting. No exceptions spreadsheet.
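
To show the input shape such a rule works against, here is a Python mirror of the two checks, useful for unit-testing fixtures locally. It is a sketch under assumptions: the input is the JSON from `terraform show -json`, where each entry in `resource_changes` carries `type`, `address`, and `change.after`; mapping "public policy" to `block_public_policy == false` is an assumption; and `s3_denials` and the sample addresses are hypothetical.

```python
def s3_denials(plan: dict) -> list[str]:
    """Return denial messages for planned S3 changes that widen access."""
    messages = []
    for rc in plan.get("resource_changes", []):
        # `after` is null in plan JSON when a resource is being destroyed.
        after = (rc.get("change") or {}).get("after") or {}
        if (
            rc.get("type") == "aws_s3_bucket_public_access_block"
            and after.get("block_public_policy") is False
        ):
            messages.append(f"Public S3 policy not allowed: {rc['address']}")
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            messages.append(f"S3 ACL must not be public: {rc['address']}")
    return messages


# Hypothetical plan fragment, trimmed to the fields the checks read.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket_public_access_block.analytics",
            "type": "aws_s3_bucket_public_access_block",
            "change": {"after": {"block_public_policy": False}},
        },
        {
            "address": "aws_s3_bucket.reports",
            "type": "aws_s3_bucket",
            "change": {"after": {"acl": "private"}},
        },
    ]
}
```

Keeping a few fixtures like `SAMPLE_PLAN` in the repo makes it cheap to confirm that a policy still fires after a provider upgrade changes attribute names.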

What we’d do differently (and what you can do next week)

I’ve seen this fail when people try to roll out everything at once. The playbook that works:

  1. Pick two high-signal gates. For most teams: SBOM+vuln scan and IaC policy. Wire them to fail PRs.
  2. Add a single cluster guardrail with visible bite. Enforce no :latest and require limits.
  3. Lock down egress for your highest-risk namespace with Cilium or Calico; observe for a week, then enforce.
  4. Sign images and verify at admission. It’s boring. It works. Use cosign.
  5. Track the real KPIs: p95 merge time, deploy frequency, vuln counts by severity, MTTR. Publish weekly.

Two things we’d adjust next time:

  • Threat modeling earlier. Lightweight, service-by-service. It drives better custom rules (like our SSRF checks).
  • More golden templates. The more you pave the path, the fewer policy errors you see in PRs.

If you want a partner who has scars from doing this at scale, GitPlumbers will sit with your leads, wire the gates, and leave you with dashboards that show you didn’t trade velocity for safety.

Key takeaways

  • Security gates don’t have to slow delivery if they’re automated, fast, and fail-closed with clear remediation.
  • Policy-as-code (OPA/Kyverno) prevents entire classes of risky configs from ever reaching the cluster.
  • Supply-chain controls (SBOM + Cosign + provenance) catch the ugly stuff—transitive vulns and unsigned images—before runtime.
  • Keep the metrics honest: track p95 merge time, deployment frequency, and MTTR alongside vuln counts.
  • Make devs successful by default: pre-commit hooks, PR checks, golden templates, and paved paths.

Implementation checklist

  • Add SBOM generation (Syft) and vuln scan (Grype/Trivy) to CI; fail PRs on criticals.
  • Enforce signed images with Kyverno or Gatekeeper; verify with `cosign verify` and SLSA attestations.
  • Shift-left IaC checks using Conftest/Rego; block risky Terraform plans (S3 public access, wide security groups).
  • Create Kubernetes guardrails: disallow `privileged`, require limits, forbid `:latest`, read-only root FS, drop `NET_RAW`.
  • Lock down egress with CiliumNetworkPolicy or equivalent; allowlist destinations per service.
  • Instrument DORA + security KPIs; publish dashboards so teams see the trade-offs (or lack thereof).
  • Run quarterly red team or chaos-security days to validate the controls actually bite.

Questions we hear from teams

Will security gates slow our teams down?
Not if they’re automated, fast, and specific. In this engagement, p95 time-to-merge increased by 8 minutes, deployment frequency stayed flat, and change failure rate didn’t move. The trick is picking high-signal checks (SBOM + IaC policy), providing golden templates, and making failure messages fixable without a meeting.
Why Kyverno and not Gatekeeper?
Both work. Kyverno’s policy ergonomics (patterns, verifyImages) are friendlier for platform teams and don’t require learning Rego for common cases. If your org already speaks Rego and wants centralized OPA, Gatekeeper is fine. We’ve implemented both at banks and SaaS unicorns; pick the one your team will actually maintain.
Do we need Cosign and SBOMs if we already scan images?
Yes. Scanning images after they’re built doesn’t tell you what changed or whether the artifact is trustworthy. SBOMs give you visibility into transitives (think Log4Shell), and Cosign ensures you only run what you built. Together with provenance attestations, you get real supply-chain integrity, not just best-effort scanning.
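
As a small illustration of that visibility, a stored SBOM can answer "are we shipping this package anywhere?" without rebuilding or rescanning. The Python sketch below assumes Syft's JSON output with a top-level `artifacts` list whose entries carry `name` and `version`; `find_packages` and the sample SBOM are hypothetical.

```python
def find_packages(sbom: dict, needle: str) -> list[tuple[str, str]]:
    """Return sorted (name, version) pairs whose name contains `needle`."""
    needle = needle.lower()
    hits = {
        (artifact["name"], artifact.get("version", ""))
        for artifact in sbom.get("artifacts", [])
        if needle in artifact["name"].lower()
    }
    return sorted(hits)


# Hypothetical SBOM, trimmed to the fields used above.
SAMPLE_SBOM = {
    "artifacts": [
        {"name": "log4j-core", "version": "2.14.1", "type": "java-archive"},
        {"name": "log4j-api", "version": "2.14.1", "type": "java-archive"},
        {"name": "express", "version": "4.18.2", "type": "npm"},
    ]
}

for name, version in find_packages(SAMPLE_SBOM, "log4j"):
    print(f"{name} {version}")
```

Run across the SBOMs archived for every deployed image, a query like this turns a new advisory into a list of affected services.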
What about developers working locally?
Make the paved path the easy path. Use pre-commit hooks (`detect-secrets`, `tfsec`, `yamllint`), dev containers with default policies, and local `conftest` scripts. The same checks run locally and in CI, so PR failures are rare and predictable.
