The Security Gates That Didn't Slow Us Down: How a B2B Fintech Dodged a Seven-Figure Breach
Security-first development usually reads like overhead. Here’s how making it the default saved a payments platform from a very public, very expensive incident—without killing velocity.
> We didn’t buy a SIEM and hope. We made unsafe changes un-mergeable. That’s what actually prevents breaches.
The setup you never want to inherit
I walked into a B2B payments platform scaling from 30 to 200+ deploys/week on EKS with Terraform IaC, ArgoCD for GitOps, and a mixed Go/Node.js stack. Think: SOC 2 Type II in-flight, PCI SAQ A-EP on the horizon, and an executive mandate not to slow feature delivery. Classic.
- Infra: AWS EKS, RDS Postgres, MSK, S3, CloudFront. Istio for mTLS. Cilium for networking.
- Tooling: GitHub Enterprise, Actions, CODEOWNERS, Renovate. Trivy/Grype for scanning. Syft for SBOM. Cosign for signing.
- Constraints: 45 engineers, no feature freeze, p95 deploy latency target unchanged, auditors poking around logging and change control.
The team had decent hygiene—unit tests, canaries, on-call with SLOs—but security was “after QA.” I’ve seen that movie. It ends with a weekend incident and a board deck you don’t want to write.
The near-miss that changed the conversation
Two things happened within a week:
- A pen tester found an SSRF path in a legacy Node service calling out to a third-party AML API. Nothing hit prod, but the pattern was everywhere.
- Our Terraform plan for analytics accidentally widened an S3 bucket policy. The dev caught it in review, but it was luck, not process.
I’ve seen both become headline breaches. We needed security defaults that prevented these classes of bugs from ever merging—and we needed them fast.
What we changed in 90 days
We didn’t “boil the ocean.” We embedded four controls where they hurt least and helped most:
- Shift-left checks in CI for code and IaC
- Supply chain integrity (SBOM, signing, provenance)
- Kubernetes guardrails that fail-closed
- Runtime egress controls to blunt exfil and SSRF
We paired that with golden templates and clear failure messages so devs could self-serve fixes.
CI that blocks the right things (and nothing else)
We wired GitHub Actions to make risky changes impossible to merge. No tickets. No humans in the loop. If it’s unsafe, it doesn’t land.

    # .github/workflows/security-gates.yml
    name: security-gates
    on:
      pull_request:
        branches: [ main ]
    jobs:
      sast-iac-supplychain:
        runs-on: ubuntu-22.04
        permissions:
          contents: read
          packages: write   # needed to push the image to GHCR before signing
          security-events: write
          id-token: write
        steps:
          - uses: actions/checkout@v4
          - name: SAST (Semgrep)
            uses: returntocorp/semgrep-action@v1
            with:
              config: "p/ci,p/security-audit"
              severity: WARNING
          - name: IaC policy (Conftest)
            run: |
              # --all-namespaces evaluates policy packages other than the default "main"
              docker run --rm -v "$PWD/policies:/policies" -v "$PWD:/project" openpolicyagent/conftest test /project/terraform -p /policies --all-namespaces
          - name: SBOM (Syft) + scan (Grype)
            run: |
              curl -sSfL https://raw.githubusercontent.com/anchore/syft/main/install.sh | sh -s -- -b /usr/local/bin
              curl -sSfL https://raw.githubusercontent.com/anchore/grype/main/install.sh | sh -s -- -b /usr/local/bin
              syft . -o json > sbom.json
              grype sbom:sbom.json --fail-on critical
          - uses: sigstore/cosign-installer@v3
          - name: Build, push, and sign image (Cosign)
            env:
              COSIGN_EXPERIMENTAL: "1"
              COSIGN_PRIVATE_KEY: ${{ secrets.COSIGN_KEY }}
              COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}   # assumed secret holding the key passphrase
            run: |
              echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
              docker build -t ghcr.io/org/service:${{ github.sha }} .
              docker push ghcr.io/org/service:${{ github.sha }}   # Cosign signs images that already exist in the registry
              cosign sign --yes --key env://COSIGN_PRIVATE_KEY ghcr.io/org/service:${{ github.sha }}

- `Semgrep` caught the SSRF patterns. We added company rules for outbound calls.
- `Conftest` blocked dangerous Terraform plans (public S3, `0.0.0.0/0` egress, unencrypted RDS).
- `Syft`/`Grype` built and scanned an SBOM, failing PRs on criticals, including transitives.
- `Cosign` signed images; later we enforced verification at admission.
Result: PRs failed with actionable output. p95 time-to-merge? +8 minutes, which the VP Eng signed off on because deploy frequency stayed flat.
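One way to turn a raw scan into the kind of actionable output described here is to post-process Grype's JSON report. The sketch below assumes the report came from `grype sbom:sbom.json -o json` and that each match carries `vulnerability.severity`, `vulnerability.fix.versions`, and `artifact.name`/`artifact.version`. The function name, message format, and sample report are hypothetical, not what the team actually ran.

```python
def critical_findings(report: dict) -> list[str]:
    """Return one human-readable line per Critical match in a Grype JSON report."""
    lines = []
    for match in report.get("matches", []):
        vuln = match.get("vulnerability", {})
        if vuln.get("severity", "").lower() != "critical":
            continue
        artifact = match.get("artifact", {})
        fixes = vuln.get("fix", {}).get("versions") or []
        advice = f"upgrade to {', '.join(fixes)}" if fixes else "no fix published yet"
        lines.append(
            f"{vuln.get('id', 'UNKNOWN')}: "
            f"{artifact.get('name')}@{artifact.get('version')} ({advice})"
        )
    return lines


# Hypothetical report, trimmed to the fields used above.
sample_report = {
    "matches": [
        {
            "vulnerability": {
                "id": "CVE-2021-44228",
                "severity": "Critical",
                "fix": {"versions": ["2.15.0"]},
            },
            "artifact": {"name": "log4j-core", "version": "2.14.1"},
        },
        {
            "vulnerability": {
                "id": "CVE-0000-0001",  # placeholder ID for a non-critical match
                "severity": "Medium",
                "fix": {"versions": []},
            },
            "artifact": {"name": "example-lib", "version": "1.0.0"},
        },
    ]
}

if __name__ == "__main__":
    for line in critical_findings(sample_report):
        print(line)
```

Printing one line per critical finding, with the fixed version when Grype reports one, gives the developer something to act on without opening the full report.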
Guardrails in the cluster (no heroics required)
We enforced safety with policy-as-code so reviewers didn’t have to play security cop. Two examples that paid off immediately:
- Reject risky pods by default with Kyverno
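
      apiVersion: kyverno.io/v1
      kind: ClusterPolicy
      metadata:
        name: baseline-security
      spec:
        validationFailureAction: Enforce
        rules:
          - name: no-privileged-no-hostpath
            match:
              any:
                - resources:
                    # Match Pods only; Kyverno auto-generates rules for Deployments and other controllers.
                    kinds: ["Pod"]
            validate:
              message: "Pods must run as non-root with no privilege escalation, a read-only root filesystem, all capabilities dropped, and no :latest tag."
              pattern:
                spec:
                  securityContext:
                    runAsNonRoot: true
                  containers:
                    - name: "*"
                      securityContext:
                        allowPrivilegeEscalation: false
                        readOnlyRootFilesystem: true
                        capabilities:
                          drop: ["ALL"]
                      image: "!*:latest"
          - name: require-limits
            match:
              any:
                - resources:
                    kinds: ["Pod"]
            validate:
              message: "Every container must set CPU and memory limits."
              pattern:
                spec:
                  containers:
                    - name: "*"
                      resources:
                        limits:
                          cpu: "?*"
                          memory: "?*"

- Only run signed images from our registry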

      apiVersion: kyverno.io/v1
      kind: ClusterPolicy
      metadata:
        name: require-signed-images
      spec:
        validationFailureAction: Enforce
        rules:
          - name: verify-cosign
            match:
              any:
                - resources:
                    kinds: ["Pod"]
            verifyImages:
              - imageReferences:
                  - "ghcr.io/org/*"
                attestors:
                  - entries:
                      - keys:
                          publicKeys: |-
                            -----BEGIN PUBLIC KEY-----
                            ...redacted...
                            -----END PUBLIC KEY-----

No more `:latest`. No more privileged pods. No unsigned images. When Log4Shell-style transitives pop, they’re blocked at PR or fail admission before hitting a node.
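To give developers the same signal before they open a PR, a few of these rules can be approximated locally. This is a minimal sketch, assuming the manifest has already been parsed into a dict (for example by a YAML loader). `pod_violations` and the sample pods are hypothetical, and the admission policy remains the source of truth.

```python
def pod_violations(pod: dict) -> list[str]:
    """Check a parsed Pod manifest against a few baseline rules."""
    problems = []
    for c in pod.get("spec", {}).get("containers", []):
        name = c.get("name", "<unnamed>")
        image = c.get("image", "")

        # Slightly stricter than a "!*:latest" pattern: an untagged image
        # also resolves to latest, so it is flagged too. Digest-pinned
        # references (name@sha256:...) are accepted.
        last_segment = image.rsplit("/", 1)[-1]
        tag = last_segment.partition(":")[2]
        if "@" not in last_segment and tag in ("", "latest"):
            problems.append(f"{name}: image '{image}' must use a pinned tag")

        sc = c.get("securityContext", {})
        if sc.get("privileged") is True:
            problems.append(f"{name}: privileged containers are not allowed")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: set allowPrivilegeEscalation: false")

        limits = c.get("resources", {}).get("limits", {})
        for key in ("cpu", "memory"):
            if key not in limits:
                problems.append(f"{name}: missing resources.limits.{key}")
    return problems


# Hypothetical manifests for illustration.
bad_pod = {
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "ghcr.io/org/service:latest",
                "resources": {"limits": {"cpu": "500m"}},
            }
        ]
    }
}
good_pod = {
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "ghcr.io/org/service:1.4.2",
                "securityContext": {"allowPrivilegeEscalation": False},
                "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
            }
        ]
    }
}

if __name__ == "__main__":
    for problem in pod_violations(bad_pod):
        print(problem)
```

A check like this fits a pre-commit hook; it only covers a subset of the cluster policy, so passing it locally does not guarantee admission.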
Blunting exfil and SSRF with egress allowlists
The pen test SSRF finding pushed us to lock egress at the network layer. With Cilium, we used FQDN policies so services could only call what they were supposed to.

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: payments-egress
      namespace: payments
    spec:
      endpointSelector:
        matchLabels:
          app: payments
      egress:
        - toFQDNs:
            - matchName: api.partner-aml.com
            - matchName: auth.stripe.com
        - toEndpoints:
            - matchLabels:
                k8s:io.kubernetes.pod.namespace: observability
        # DNS has to pass through Cilium's DNS proxy so toFQDNs can map names to IPs.
        - toEndpoints:
            - matchLabels:
                k8s:io.kubernetes.pod.namespace: kube-system
                k8s:k8s-app: kube-dns
          toPorts:
            - ports:
                - port: "53"
                  protocol: ANY
              rules:
                dns:
                  - matchPattern: "*"

Combined with app-layer timeouts and explicit DNS, this killed whole classes of data exfil and SSRF attempts without developer heroics.
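The network policy is the backstop; the app-layer half can be sketched as an outbound URL guard. This is a minimal illustration, assuming the same two hostnames as the allowlist. `ALLOWED_HOSTS`, `is_allowed_url`, and `is_public_ip` are hypothetical names, not code from the services described here.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumed to mirror the FQDN allowlist enforced at the network layer.
ALLOWED_HOSTS = {"api.partner-aml.com", "auth.stripe.com"}


def is_allowed_url(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Allow only https URLs whose host is exactly on the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, broken IPv6 brackets) are rejected.
        return False
    if parts.scheme != "https":
        return False
    # hostname strips userinfo and port, so "allowed.host@evil.example" fails.
    host = (parts.hostname or "").rstrip(".").lower()
    return host in allowed_hosts


def is_public_ip(address: str) -> bool:
    """Reject loopback, link-local (cloud metadata), private, and reserved ranges."""
    return ipaddress.ip_address(address).is_global


if __name__ == "__main__":
    print(is_allowed_url("https://api.partner-aml.com/v1/screen"))
    print(is_allowed_url("http://169.254.169.254/latest/meta-data/"))
```

A caller would check `is_allowed_url` before issuing the request, set an explicit timeout on the HTTP client, and run `is_public_ip` on the resolved address to blunt DNS rebinding. Exact-match hostnames are a deliberate choice: suffix matching is where lookalike domains slip through.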
What it saved us (with numbers)
We’re allergic to vanity metrics. Here’s what moved:
- 94% reduction in open critical vulns in 60 days (from 126 to 8) as Renovate plus SBOM scanning cleared the backlog.
- Zero production security incidents in 12 months.
- MTTR for security patches dropped from ~3 days to <24 hours; most were under a single business day.
- p95 time-to-merge increased 8 minutes; deployment frequency and change failure rate stayed within DORA targets.
- 99.8% of images submitted to prod were signed and verified; the remaining 0.2% were blocked at admission.
- Estimated breach cost avoided: $3–5M, based on IBM’s 2023 average breach cost ($4.45M) and our data footprint. Not a stretch given the S3 policy near-miss.
Business impact: We passed SOC 2 Type II and tightened PCI scope without adding headcount. Feature delivery didn’t stall. The CFO stopped asking why security was “R&D overhead.”
A concrete example: killing a risky Terraform change at PR
The S3 policy near-miss? We turned it into a Conftest rule so it can never happen again.

    # policies/s3.rego
    # Evaluated against Terraform plan JSON (terraform show -json).
    package terraform.aws.s3

    deny[msg] {
      rc := input.resource_changes[_]
      rc.type == "aws_s3_bucket_public_access_block"
      rc.change.after.block_public_policy == false
      msg := sprintf("Public S3 policy not allowed: %v", [rc.address])
    }

    deny[msg] {
      rc := input.resource_changes[_]
      rc.type == "aws_s3_bucket"
      rc.change.after.acl == "public-read"
      msg := sprintf("S3 ACL must not be public: %v", [rc.address])
    }

Hooked up to the workflow above, any PR that widens S3 access fails fast with a message the developer can act on. No security review meeting. No exceptions spreadsheet.
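For local experimentation, the same two checks can be sketched in Python against Terraform's plan JSON (`terraform show -json`), which lists each change under `resource_changes` with `address`, `type`, and `change.after`. This is an illustrative mirror for unit tests, not a replacement for the Conftest rules. It assumes the access-block check keys off `block_public_policy`, and the sample plan is hypothetical.

```python
def s3_denials(plan: dict) -> list[str]:
    """Return a denial message for every S3 change that widens public access."""
    messages = []
    for rc in plan.get("resource_changes", []):
        # "after" is null when a resource is being destroyed.
        after = (rc.get("change") or {}).get("after") or {}
        rtype = rc.get("type")
        address = rc.get("address", "<unknown>")

        if (
            rtype == "aws_s3_bucket_public_access_block"
            and after.get("block_public_policy") is False
        ):
            messages.append(f"Public S3 policy not allowed: {address}")

        if rtype == "aws_s3_bucket" and after.get("acl") == "public-read":
            messages.append(f"S3 ACL must not be public: {address}")
    return messages


# Hypothetical plan, trimmed to the fields the checks read.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket_public_access_block.analytics",
            "type": "aws_s3_bucket_public_access_block",
            "change": {"after": {"block_public_policy": False}},
        },
        {
            "address": "aws_s3_bucket.reports",
            "type": "aws_s3_bucket",
            "change": {"after": {"acl": "private"}},
        },
        {
            "address": "aws_s3_bucket.old",
            "type": "aws_s3_bucket",
            "change": {"after": None},  # destroy
        },
    ]
}

if __name__ == "__main__":
    for message in s3_denials(sample_plan):
        print(message)
```

Feeding known-bad and known-good plans through a function like this is a cheap way to confirm a rule bites before it starts failing real PRs.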
What we’d do differently (and what you can do next week)
I’ve seen this fail when people try to roll out everything at once. The playbook that works:
- Pick two high-signal gates. For most teams: `SBOM + vuln scan` and `IaC policy`. Wire them to fail PRs.
- Add a single cluster guardrail with visible bite. Enforce `no :latest` and `require limits`.
- Lock down egress for your highest-risk namespace with Cilium or Calico; observe for a week, then enforce.
- Sign images and verify at admission. It’s boring. It works. Use `cosign`.
- Track the real KPIs: p95 merge time, deploy frequency, vuln counts by severity, MTTR. Publish weekly.
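The KPI arithmetic is simple enough to script. Here is a minimal sketch, assuming merge durations have already been exported as minutes per PR; the helper names and sample durations are hypothetical, and nearest-rank is only one of several percentile definitions.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def reduction_pct(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


if __name__ == "__main__":
    merge_minutes = [12, 15, 9, 30, 22]  # hypothetical PR merge times
    print("p95 merge time:", percentile(merge_minutes, 95))
    # The critical-vuln figures quoted above: 126 open down to 8.
    print("critical vuln reduction: %.0f%%" % reduction_pct(126, 8))
```

Publishing p95 rather than the mean keeps one slow PR from hiding behind a fast majority, which matters when the question is whether the gates hurt anyone.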
Two things we’d adjust next time:
- Threat modeling earlier. Lightweight, service-by-service. It drives better custom rules (like our SSRF checks).
- More golden templates. The more you pave the path, the fewer policy errors you see in PRs.
If you want a partner who has scars from doing this at scale, GitPlumbers will sit with your leads, wire the gates, and leave you with dashboards that show you didn’t trade velocity for safety.
Key takeaways
- Security gates don’t have to slow delivery if they’re automated, fast, and fail-closed with clear remediation.
- Policy-as-code (OPA/Kyverno) prevents entire classes of risky configs from ever reaching the cluster.
- Supply-chain controls (SBOM + Cosign + provenance) catch the ugly stuff—transitive vulns and unsigned images—before runtime.
- Keep the metrics honest: track p95 merge time, deployment frequency, and MTTR alongside vuln counts.
- Make devs successful by default: pre-commit hooks, PR checks, golden templates, and paved paths.
Implementation checklist
- Add SBOM generation (Syft) and vuln scan (Grype/Trivy) to CI; fail PRs on criticals.
- Enforce signed images with Kyverno or Gatekeeper; verify with `cosign verify` and SLSA attestations.
- Shift-left IaC checks using Conftest/Rego; block risky Terraform plans (S3 public access, wide security groups).
- Create Kubernetes guardrails: disallow `privileged`, require limits, forbid `:latest`, read-only root FS, drop `NET_RAW`.
- Lock down egress with CiliumNetworkPolicy or equivalent; allowlist destinations per service.
- Instrument DORA + security KPIs; publish dashboards so teams see the trade-offs (or lack thereof).
- Run quarterly red team or chaos-security days to validate the controls actually bite.
Questions we hear from teams
- Will security gates slow our teams down?
- Not if they’re automated, fast, and specific. In this engagement, p95 time-to-merge increased by 8 minutes, deployment frequency stayed flat, and change failure rate didn’t move. The trick is picking high-signal checks (SBOM + IaC policy), providing golden templates, and making failure messages fixable without a meeting.
- Why Kyverno and not Gatekeeper?
- Both work. Kyverno’s policy ergonomics (patterns, verifyImages) are friendlier for platform teams and don’t require learning Rego for common cases. If your org already speaks Rego and wants centralized OPA, Gatekeeper is fine. We’ve implemented both at banks and SaaS unicorns; pick the one your team will actually maintain.
- Do we need Cosign and SBOMs if we already scan images?
- Yes. Scanning images after they’re built doesn’t tell you what changed or whether the artifact is trustworthy. SBOMs give you visibility into transitives (think Log4Shell), and Cosign ensures you only run what you built. Together with provenance attestations, you get real supply-chain integrity, not just best-effort scanning.
- What about developers working locally?
- Make the paved path the easy path. Use pre-commit hooks (`detect-secrets`, `tfsec`, `yamllint`), dev containers with default policies, and local `conftest` scripts. The same checks run locally and in CI, so PR failures are rare and predictable.
