The Breach That Didn’t Happen: Security-First Dev Saved a Fintech’s Quarter (and Sleep)
A real-world case study: how we moved a team from “scan at the end” to security-first delivery—catching exploitable bugs before prod, shrinking MTTR, and avoiding the kind of incident that shows up in board decks.
Security-first isn’t a toolchain. It’s making the secure path the easiest path—at commit time, PR time, and deploy time.
The day the pen test stopped being “just a report”
I’ve watched this movie since the early PCI days: a security report lands, everyone agrees it’s bad, and then it gets Jira’d into oblivion because “we have roadmap commitments.”
This client was a mid-market fintech SaaS (Kubernetes on AWS EKS, Terraform, Node/TypeScript + a bit of Java). They weren’t reckless—just busy. They were also leaning hard into AI-assisted coding (Copilot-style), which increased throughput and… let’s call it “creative” input handling.
The trigger wasn’t paranoia. It was math:
- A pen test found 3 critical issues (authz bypass in an internal admin endpoint, SSRF in a PDF rendering service, and a dependency chain with known RCE).
- Their incident budget was already thin: on-call load was trending up, and their MTTR on security fixes was hovering around ~3 weeks.
- Leadership had regulatory and customer pressure (SOC 2 + bank partners asking uncomfortable questions).
They didn’t need a 200-page “DevSecOps transformation.” They needed guardrails that prevented the next breach without slowing delivery to a crawl.
Constraints we couldn’t hand-wave away
Security-first only works if it respects reality. Here’s what we had to work within:
- Two-week sprints, weekly releases, and a “no freeze” culture.
- A platform team of 3, app teams totaling ~18 engineers.
- A mixed legacy footprint: some services were solid, others were… 2017 Express middleware soup.
- A backlog of known issues, but no consistent definition of “done” for security.
- Vendor integrations and partner APIs that made “just lock it down” a fantasy.
So we focused on a pragmatic target: stop high-severity classes of bugs from reaching production, and make fixes fast when something slips through.
What we changed: security-first that devs actually used
We implemented a tight set of interventions that developers encountered naturally—at commit time, PR time, and deploy time.
1) PR-time threat modeling (lightweight, not theater)
No one wants a two-hour STRIDE workshop for a one-line change. But high-risk changes do need a pause.
We added a PR template section that only triggers when certain labels/files change (auth, file upload, HTTP client usage, infra):
- Data flow updated?
- New trust boundary?
- Any user-controlled URLs/paths?
- Authz decisions added/modified?
It wasn’t bureaucracy. It was a tripwire that caught obvious foot-guns—especially in AI-generated code.
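The exact wiring depends on your CI, but the tripwire can be as dumb as a script that runs on every PR: if security-sensitive paths changed and the checklist in the PR description isn’t filled in, the check fails. Here’s a minimal sketch; the paths, env vars (`BASE_REF`, `PR_BODY`), and checklist markers are illustrative, not this client’s exact setup.

```typescript
// check-threat-model.ts — fail the PR check when sensitive files change without a completed checklist
import { execSync } from "node:child_process";

// Paths that should trigger the threat-model questions; adjust to your repo layout (illustrative).
const SENSITIVE_PATHS = [/(^|\/)auth\//, /upload/i, /http[-_]?client/i, /(^|\/)infrastructure\//];

// Changed files relative to the target branch; BASE_REF is assumed to be set by the workflow.
const base = process.env.BASE_REF ?? "origin/main";
const changed = execSync(`git diff --name-only ${base}...HEAD`)
  .toString()
  .split("\n")
  .filter(Boolean);

const touchesSensitiveCode = changed.some(f => SENSITIVE_PATHS.some(p => p.test(f)));

// The workflow exports the PR description into PR_BODY; checked items look like "- [x] ...".
const prBody = process.env.PR_BODY ?? "";
const checklistCompleted = /- \[x\]/.test(prBody);

if (touchesSensitiveCode && !checklistCompleted) {
  console.error("Security-sensitive files changed, but the threat-model checklist in the PR description is incomplete.");
  process.exit(1);
}
```

In GitHub Actions you’d pass `${{ github.event.pull_request.body }}` into `PR_BODY` and run the script with `tsx` or `ts-node`; label-based triggers work just as well.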
2) Commit-time secret hygiene (stop leaking keys before CI even starts)
The fastest breach is still “someone committed a token.” We installed pre-commit with gitleaks and a couple of sanity linters.
```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4
    hooks:
      - id: gitleaks
        args: ["--redact"]
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
```

This alone prevented two “oops” moments in week one: an AWS access key in a debug script and a partner API token in a Postman export.
3) CI guardrails that fail for the right reasons
We used GitHub Actions as the control plane (they were already there). The key was setting fail thresholds that were strict on critical issues but didn’t create alert fatigue.
```yaml
# .github/workflows/security.yml
name: security

on:
  pull_request:
  push:
    branches: [ main ]

jobs:
  sast-and-deps:
    runs-on: ubuntu-latest
    permissions:
      security-events: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Semgrep (SAST)
        uses: returntocorp/semgrep-action@v1
        with:
          config: p/owasp-top-ten

      - name: CodeQL (deep analysis)
        uses: github/codeql-action/init@v3
        with:
          languages: javascript-typescript # CodeQL treats JS and TS as one language
      - uses: github/codeql-action/analyze@v3

      - name: Dependency scan (Snyk)
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high
```

We tuned the rules (especially Semgrep) to reduce noise and allowed suppressions only with explicit justification. I’ve seen teams “ignore” their way into a breach; we didn’t allow that.
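What “suppression with explicit justification” looks like in practice is mostly social, but it helps to keep the excuse next to the code it excuses. Semgrep honors inline `nosemgrep` comments; here’s a sketch of the shape (the rule ID, ticket, owner, and expiry are illustrative):

```typescript
import path from "node:path";

const REPORTS_DIR = "/srv/reports";

export function resolveReportPath(sanitizedName: string): string {
  // Suppression with sunlight: the reason, owner, and expiry sit beside the finding,
  // so reviewers (and a periodic grep) can see when it goes stale. Details below are illustrative.
  return path.join(REPORTS_DIR, sanitizedName); // nosemgrep: path-join-resolve-traversal -- input validated upstream (SEC-1234); owner: @payments-team; expires: 2024-09-30
}
```

A scheduled job that greps for suppression comments and flags expired dates keeps the list from quietly growing.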
4) Supply chain and container scanning that blocks bad builds
The fintech ran EKS, so container hygiene mattered. We added trivy scanning for images and CycloneDX SBOM generation.
```bash
# in CI
trivy image --severity CRITICAL,HIGH --exit-code 1 ghcr.io/acme/payments:${GIT_SHA}

# generate SBOM
cyclonedx-npm --output-file sbom.json
```

The practical win: we caught a transitive dependency with a known exploit before it shipped (the “it’s only in a sub-dependency” excuse doesn’t play well during incident response).
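The SBOM earns its keep when you can act on it. Scanners like `trivy` can read a CycloneDX SBOM directly, but a blunt, explicit gate is also just a few lines of script. A rough sketch (the package names, versions, and file path are illustrative, not the client’s actual finding):

```typescript
// check-sbom.ts — fail the build if the CycloneDX SBOM contains denylisted packages
import { readFileSync } from "node:fs";

// Versions you refuse to ship, e.g. pinned after a security advisory (illustrative entries).
const DENYLIST: Record<string, string[]> = {
  vm2: ["3.9.17"],
  xml2js: ["0.4.23"],
};

type Component = { name: string; version?: string };
const sbom = JSON.parse(readFileSync("sbom.json", "utf8")) as { components?: Component[] };

const hits = (sbom.components ?? []).filter(c => DENYLIST[c.name]?.includes(c.version ?? ""));

if (hits.length > 0) {
  console.error("Denylisted components in SBOM:", hits.map(c => `${c.name}@${c.version}`).join(", "));
  process.exit(1);
}
```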
5) IaC policy-as-code (because the cloud will happily let you hurt yourself)
Their Terraform had the usual “temporary” permissions that became permanent. We used checkov for scanning and conftest (OPA/Rego) for enforcement.
```rego
# policy/iam.rego
package terraform.security

deny[msg] {
  input.resource_type == "aws_iam_policy"
  contains(input.config.policy, "\"Action\": \"*\"")
  msg := "IAM policy allows Action '*'"
}

deny[msg] {
  input.resource_type == "aws_iam_policy"
  contains(input.config.policy, "\"Resource\": \"*\"")
  msg := "IAM policy allows Resource '*'"
}
```

And wired it into CI:
```bash
conftest test plan.json -p policy/
checkov -d infrastructure/ --quiet   # hard fail on findings is checkov's default
```

This is where “security-first” stops being a slide deck and becomes a merge gate.
Concrete catches: two exploit paths stopped before production
SSRF in a “simple” PDF preview service
A developer added URL-based PDF rendering. The AI-assisted patch looked tidy—until you noticed that it would happily fetch internal metadata endpoints.
We flagged it during PR review because the threat model tripwire lit up (“user-controlled URL”). Then Semgrep backed it up.
Before (vulnerable):

```typescript
import fetch from "node-fetch";

export async function renderFromUrl(url: string) {
  const res = await fetch(url);
  return await res.buffer();
}
```

After (fixed with allowlist + DNS/IP hardening):
```typescript
import dns from "dns/promises";

const ALLOWED_HOSTS = new Set(["docs.partner-a.com", "cdn.ourcompany.com"]);

async function isPublicIp(host: string) {
  const addrs = await dns.lookup(host, { all: true });
  return addrs.every(a => !a.address.startsWith("10.") && !a.address.startsWith("169.254."));
}

export async function renderFromUrl(rawUrl: string) {
  const url = new URL(rawUrl);
  if (!ALLOWED_HOSTS.has(url.hostname)) throw new Error("host not allowed");
  if (!(await isPublicIp(url.hostname))) throw new Error("non-public host");
  // fetch via hardened egress in infra (not shown)
  const res = await fetch(url.toString(), { redirect: "error" });
  return Buffer.from(await res.arrayBuffer());
}
```

This prevented a classic path to cloud credential theft (169.254.169.254) and internal service probing.
Authz bypass in “internal admin” routes
They had an internal admin endpoint protected by “it’s behind the VPN” logic (I’ve seen this fail in every era, from Cisco VPNs to Zero Trust migrations).
We enforced explicit authz checks and added a regression test. The measurable outcome here wasn’t academic: the prior pen test showed a reliable bypass via a mis-scoped JWT claim.
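The fix itself was boring, which is the point: authorization becomes an explicit, deny-by-default check in code rather than a network assumption. A simplified sketch of the shape in Express-style middleware (the claim and role names are illustrative, not the client’s actual token schema):

```typescript
import type { Request, Response, NextFunction } from "express";

// Assumes an upstream middleware has already verified the JWT and attached its claims to req.user.
interface AuthedRequest extends Request {
  user?: { sub: string; roles?: string[] };
}

export function requireRole(role: string) {
  return (req: AuthedRequest, res: Response, next: NextFunction) => {
    const roles = req.user?.roles ?? [];
    // Deny by default: missing user, missing claim, or wrong role all end in 403.
    if (!roles.includes(role)) {
      return res.status(403).json({ error: "forbidden" });
    }
    next();
  };
}

// Usage: app.use("/internal/admin", requireRole("admin"), adminRouter);
```

The regression test then asserts the old bypass stays dead: a token with a mis-scoped claim gets a 403 on every `/internal/admin` route, VPN or not.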
Results: what changed in 10 weeks (numbers that leadership cared about)
We tracked outcomes weekly—because if you can’t measure it, you’re doing security cosplay.
- Critical vulnerabilities in CI (open at end of sprint): down 92% (from 13 across repos to 1 lingering legacy issue with a compensating control).
- Mean time to remediate (MTTR) for high/critical security findings: 19 days → 3.4 days.
- Escaped findings from pen test re-test: 7 high/critical → 1 medium (and it was a known low-risk legacy component slated for retirement).
- Secrets incidents: 6 “near-misses” in the prior quarter (found manually) → 0 leaked secrets merged after the `gitleaks` pre-commit hook + repo scanning.
- Deployment confidence: they kept weekly releases; we didn’t introduce a release freeze. Security checks added ~6–9 minutes to CI, which was acceptable given their build times.
The bigger win was cultural: developers stopped treating security as a separate phase. It became part of “normal engineering.”
What actually made this stick (and what I’ve seen fail)
I’ve seen security-first initiatives die for predictable reasons:
- Too many tools, no owner, and 1,000 alerts nobody trusts.
- Policies written like legal docs instead of executable rules.
- “Security says no” without giving devs a paved path.
Here’s what actually worked here:
- One merge gate per risk category (SAST, deps, IaC, containers), each with a clear severity threshold.
- Fast feedback (pre-commit + PR checks) so devs didn’t discover problems after context-switching.
- Exception handling with sunlight: temporary suppressions required an owner and expiry.
- Security rules aligned to architecture: SSRF mattered because they had internal services and cloud metadata; we didn’t waste time bikeshedding low-impact items.
GitPlumbers’ role was part mechanic, part referee: fix the worst debt, wire in the guardrails, and train the team so the system stayed healthy after we left.
If you want to replicate this without boiling the ocean
- Start with where breaches actually happen in your org: secrets, authz, SSRF/egress, dependency supply chain, cloud IAM.
- Add commit-time protection for secrets and basic hygiene.
- Add PR-time checks: SAST + dependency scan with a strict threshold for critical.
- Add deploy-time checks: image scanning + IaC policy-as-code.
- Track three metrics for 60 days:
- Open critical/high findings over time
- MTTR for security findings
- Escaped vulns (found after release)
If your CI turns red constantly, don’t weaken the gate—fix the top offenders, tune noisy rules, and keep moving.
The goal isn’t “perfect security.” The goal is preventing preventable incidents while shipping.
If you’re in the spot where AI-assisted code and legacy systems are increasing delivery risk, GitPlumbers can help you install guardrails that engineers won’t route around.
Key takeaways
- Security-first works when it’s embedded in the developer workflow (PRs, CI, and deploy gates), not bolted on at the end.
- Policy-as-code (OPA/Conftest) turns tribal security rules into repeatable guardrails that scale with team growth.
- Secrets hygiene and least-privilege IAM fix the “one leaked token = breach” failure mode.
- Measuring outcomes (critical vulns, MTTR, escaped defects) keeps security from becoming a faith-based initiative.
- AI-assisted code increases the need for consistent controls; it doesn’t replace them.
Implementation checklist
- Add `pre-commit` hooks for `gitleaks`/`detect-secrets` and linting before code leaves a laptop.
- Enforce SAST and dependency scanning in CI with a fail threshold for **critical** issues.
- Generate an SBOM (`CycloneDX`) and verify provenance (at least SLSA L2-ish) for build artifacts.
- Scan container images with `trivy` and block deploys on critical CVEs without compensating controls.
- Scan IaC (`terraform`) with `checkov` and enforce policy-as-code with `conftest`.
- Require lightweight threat modeling for security-sensitive changes (auth, file upload, SSRF surfaces).
- Rotate credentials, move secrets to `Vault`/cloud secret manager (a sketch follows this checklist), and reduce IAM blast radius.
- Track security KPIs: critical findings over time, MTTR for vulns, escaped vulns, and “time-to-fix” by repo/team.
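For the secrets item above, the destination matters less than getting tokens out of code, `.env` files, and images. A minimal sketch of reading one at startup from AWS Secrets Manager; the secret name is hypothetical, and `Vault` or another manager follows the same pattern:

```typescript
import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";

// Region and credentials come from the environment or the workload's IAM role, not from code.
const client = new SecretsManagerClient({});

// Fetch the token at startup instead of baking it into code or images (secret name is hypothetical).
export async function getPartnerApiToken(): Promise<string> {
  const res = await client.send(
    new GetSecretValueCommand({ SecretId: "prod/payments/partner-api-token" })
  );
  if (!res.SecretString) throw new Error("secret has no string value");
  return res.SecretString;
}
```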
Questions we hear from teams
- Does security-first development slow teams down?
- It slows down the *bad merges*. In this case, CI added ~6–9 minutes, but MTTR on high/critical issues dropped from 19 days to 3.4 days because fixes happened while the code was still fresh.
- What’s the minimum viable toolchain to start?
- Start with `gitleaks` (or similar) + one SAST (`Semgrep` or `CodeQL`) + one dependency scanner (`Snyk`/`Dependabot`) in CI. Add `trivy` for container images and `checkov` + `conftest` if you’re heavy on `Terraform`.
- How do you handle false positives without turning security into a joke?
- Treat suppressions like production code: require an owner, a reason, and an expiry date. Tune rulesets to your codebase, but keep the fail thresholds for critical issues.
- Is this still relevant if we’re mostly using managed services and serverless?
- Yes. The failure modes shift (IAM, secrets, dependency supply chain, event injection), but the pattern holds: PR-time guardrails and policy-as-code prevent high-impact misconfigurations from shipping.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
