The Fintech Breach We Dodged: Shipping Faster After Making Security a First-Class Feature
A Series C fintech almost learned about data breaches the expensive way. We rewired their SDLC to make security boring, automated, and non-negotiable—without slowing delivery.
If it isn’t signed and scanned, it doesn’t ship. No exceptions, no Slack approvals.
The near-miss that changed everything
Three days before a planned Series C diligence review, a fintech client on AWS/EKS saw a spike in 401s and suspicious egress to an IP in an ASN we all recognize from bot farms. A partner’s embedded SDK had a server-side request forgery (SSRF) bug. In a previous life, this would’ve been a 2 a.m. war room and a press release. This time, nothing left the VPC. Why?
- Deny-all egress at the namespace level blocked calls to the internet.
- Istio mTLS and strict NetworkPolicy rules prevented lateral movement.
- CI had already blocked shipping the vulnerable image because it was unsigned and failed SCA.
That was the payoff. Here’s how we got there without tanking velocity.
Constraints and landmines
You know the drill: constraints are where security programs die.
- Domain: PCI-adjacent fintech handling PII, SOC 2 Type II on the horizon, auditors asking about SBOMs post-Log4Shell.
- Stack: EKS 1.28, Istio 1.20, ArgoCD GitOps, GitHub Actions, Terraform, Java 17 + Node 18, Postgres (Aurora), Redis (ElastiCache).
- Reality: Two platform engineers, 10+ product squads. Legacy services still on Java 8, a few hand-rolled AMIs, and some “pet” EC2 boxes. Fast release cadence (20–40 deploys/day), very noisy alerting, and a dormant bug bounty program.
- Threats: Credential stuffing against auth, supply chain drift, SSRF via third-party SDKs, over-permissive IAM, and the usual S3 misconfig booby traps.
We couldn’t ask for a headcount bump or a velocity tax. Security had to be invisible and automated.
The playbook: make security a default, not a heroics event
We didn’t ship a 50-page policy deck. We wired guardrails straight into the developer workflow and platform.
1) CI gates that mean it
- SAST + SCA: `CodeQL` and `trivy` on every PR; hard gate on criticals in changed paths.
- IaC scanning: `tfsec` + `checkov` on Terraform modules.
- DAST in staging nightly with `OWASP ZAP` against ephemeral environments.
- SBOMs with `syft` published to the registry.
- Sign everything with `cosign` and verify before deploy.

# .github/workflows/secure-build.yaml
name: secure-build
on: [pull_request]
jobs:
  build-test-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '18' }
      - run: npm ci && npm test
      # Static analysis: init must run before analyze
      - name: CodeQL
        uses: github/codeql-action/init@v3
        with: { languages: javascript }
      - uses: github/codeql-action/analyze@v3
      - name: Build image
        run: docker build -t ghcr.io/org/app:${{ github.sha }} .
      - name: SBOM
        run: syft packages ghcr.io/org/app:${{ github.sha }} -o spdx-json > sbom.json
      # Fail the job on any CRITICAL or HIGH finding in OS packages or libraries
      - name: Trivy SCA image scan
        uses: aquasecurity/trivy-action@0.20.0
        with:
          image-ref: ghcr.io/org/app:${{ github.sha }}
          vuln-type: 'os,library'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'
      # COSIGN_KEY and COSIGN_PUB are assumed to be supplied to the job (for example from repository secrets).
      # The image must already be pushed to the registry before cosign can sign it.
      - name: Cosign sign
        env: { COSIGN_EXPERIMENTAL: '1' }
        run: cosign sign --key $COSIGN_KEY ghcr.io/org/app:${{ github.sha }}
      - name: Cosign verify gate
        run: cosign verify --key $COSIGN_PUB ghcr.io/org/app:${{ github.sha }}

If it isn’t signed and scanned, it doesn’t ship. No exceptions, no Slack approvals.
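
If you want the gate decision outside the action (say, to post a summary on the PR), here is a minimal Python sketch of the same rule: read a Trivy JSON report and collect anything CRITICAL or HIGH. This is an illustration, not what ran in the client's pipeline. The `blocking_findings` helper, the sample report, and its placeholder CVE IDs are hypothetical, and the sketch assumes Trivy's `--format json` layout with a top-level `Results` list.

```python
import json

# Severities that fail the build; mirrors `severity: 'CRITICAL,HIGH'` in the workflow.
BLOCKING = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict, blocking: set = BLOCKING) -> list:
    """Return (target, vulnerability id, severity) for every finding that should fail the gate."""
    hits = []
    # Findings are grouped per scanned target; a clean target may omit "Vulnerabilities".
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in blocking:
                hits.append(
                    (result.get("Target", "?"), vuln.get("VulnerabilityID", "?"), vuln["Severity"])
                )
    return hits


# Hypothetical, hand-written report fragment; the CVE IDs are placeholders.
SAMPLE_REPORT = json.loads("""
{
  "Results": [
    {
      "Target": "app (alpine)",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
        {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
      ]
    },
    {"Target": "package-lock.json"}
  ]
}
""")

if __name__ == "__main__":
    findings = blocking_findings(SAMPLE_REPORT)
    for target, vuln_id, severity in findings:
        print(f"BLOCK {severity:8} {vuln_id} in {target}")
    # A non-zero exit code is what actually fails the CI job.
    raise SystemExit(1 if findings else 0)
```

In a real job you would load the report with `json.load` from the file Trivy wrote and keep the exit-code behaviour; the severity set is the only policy knob.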
2) Policy-as-code everywhere
OPA/Rego and Kyverno rules made “security policy” a PR, not a wiki page.

# conftest policy: deny privileged containers
package k8s.security

violation[msg] {
  input.kind == "Pod"
  input.spec.containers[_].securityContext.privileged == true
  msg := "Privileged containers are not allowed"
}

# Kyverno: require image signatures
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  rules:
    - name: require-cosign
      match:
        resources:
          kinds: [Deployment, StatefulSet, Job]
      verifyImages:
        - imageReferences: ["ghcr.io/org/*"]
          attestors:
            - entries:
                - keys:
                    publicKeys: |
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----

3) Least-privilege by default
We killed wildcard IAM and long-lived creds.

# Terraform IAM policy: read-only access to objects in a single bucket
resource "aws_iam_policy" "s3_ro_limited" {
  name = "s3-ro-limited"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Action   = ["s3:GetObject"],
      Effect   = "Allow",
      Resource = "arn:aws:s3:::org-app-prod/*"
    }]
  })
}

- Workloads used IRSA (OIDC to AWS) for short-lived tokens.
- Developers lost standing prod access; break-glass via audited SSM Sessions.
- Secrets moved to HashiCorp Vault with automatic rotation.
4) Zero-trust networking
- Kubernetes `NetworkPolicy` default deny; namespaces whitelisted by label.
- Istio `STRICT` mTLS and authorization policies.

# Deny-all, allow only service-to-db
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: app-egress
  namespace: payments
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector: { matchLabels: { name: databases } }
          podSelector: { matchLabels: { role: postgres } }
      ports:
        - protocol: TCP
          port: 5432

# Istio: enforce mTLS cluster-wide
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT

5) GitOps with verification
ArgoCD synced only signed manifests and images. Unsigned? The sync failed and Slack got pinged; nobody could approve it through by hand.
- ArgoCD image updater watched for new, signed tags only.
- Drift from Terraform or K8s was flagged in minutes—no more snowflakes.
The incident that didn’t happen (and how we know)
Two weeks in, a third-party SDK bug allowed SSRF from an edge service. Historically, that’s exfil city. This time:
- The unsigned image never passed the CI gate.
- A previously deployed, older image with the SDK existed, so the attacker tried lateral movement.
- NetworkPolicy blocked egress to the internet; only Postgres:5432 was allowed.
- IAM on the service role had zero `s3:*` and no wildcard `*` permissions.
- Istio AuthorizationPolicy blocked service-to-service calls not explicitly allowed.
CloudTrail and Istio telemetry showed multiple denied connection attempts, zero data egress beyond VPC endpoints, and no anomalous I/O on Aurora. We rotated the SDK, bumped the minor, and shipped a fix the same day.
Forensics boring? Good. That’s the goal.
Measurable outcomes the board actually cared about
Security theater is easy. Numbers are harder. Here’s what moved in 90 days:
- Critical vulns in prod: 78% reduction (from 32 to 7), zero exploitable paths per `trivy` in running workloads.
- Mean time to remediate (MTTR) vulns: 12 days → 1.8 days. P95 patch window: 17 days → 36 hours.
- Unsigned artifact deploys: from unknown to 0; Kyverno blocked 3 attempted deploys of unsigned images in week 1.
- Auth abuse incidents: 44% drop after tightening rate limits and bot detection at the edge.
- Egress blocks: 27,000+ outbound denies/month from non-approved namespaces, all false-positives resolved by adding explicit egress rules.
- Cost avoidance: Using IBM’s 2023 Cost of a Data Breach average (~$4.45M), we conservatively peg avoided exposure at $1.8–$3.2M based on the systems in blast radius.
- Delivery tempo: Deploys/day stayed flat (median 28), change failure rate down from 21% → 12%.
Translation: we got safer without slowing down.
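
For anyone checking our math, the percentages fall straight out of the before/after figures above. The short Python sketch below reproduces them. It also separates a relative reduction from a percentage-point change, which matters for change failure rate (21% → 12% is 9 points, but a larger relative drop). The helper names are ours, purely for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percent drop relative to the starting value."""
    return (before - after) / before * 100


def point_change(before_pct: float, after_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return before_pct - after_pct


# Inputs are the before/after figures quoted in the list above.
critical_vulns = relative_reduction(32, 7)       # critical vulns in prod
mttr = relative_reduction(12, 1.8)               # MTTR, in days
p95_patch = relative_reduction(17 * 24, 36)      # P95 patch window, both in hours
cfr_points = point_change(21, 12)                # change failure rate, in points
cfr_relative = relative_reduction(21, 12)        # same metric, relative to the start

print(f"Critical vulns: {critical_vulns:.0f}% fewer")
print(f"MTTR: {mttr:.0f}% shorter")
print(f"P95 patch window: {p95_patch:.0f}% shorter")
print(f"Change failure rate: {cfr_points} points ({cfr_relative:.0f}% relative)")
```

Converting both ends of the patch window to hours before dividing is the one step that is easy to get wrong; mixing days and hours would make the improvement look far smaller than it is.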
Actionable patterns you can lift tomorrow
You don’t need a platform team to start. You need good defaults and a willingness to break builds.
- Enforce signatures: Start with `cosign` in CI and Kyverno or Gatekeeper in the cluster.

cosign generate-key-pair
cosign sign --key cosign.key ghcr.io/org/app:$(git rev-parse --short HEAD)
cosign verify --key cosign.pub ghcr.io/org/app:$(git rev-parse --short HEAD)

- Produce SBOMs and wire notifications when criticals appear:

syft packages ghcr.io/org/app:latest -o cyclonedx-json > sbom.json
grype sbom:sbom.json --fail-on critical

- Kill wildcard IAM: Run `terraform-compliance` or `checkov` in CI; block on `Action: "*"`.
- Flip to deny-all networking and add rules as teams request them. Yes, it’s annoying for a week. It’s worth it.
- Short-lived creds with OIDC/IRSA and Vault. If your prod AWS keys live in 1Password, you’re one fat-finger from a breach.
- Gate merges on risk: Red/yellow/green based on SAST/SCA/DAST results and policy checks. Yellow requires a risk acceptance PR signed by an EM.
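
The red/yellow/green gate in the last item can be as small as a single function. Here is a minimal Python sketch under thresholds we are assuming for illustration: any critical finding or policy violation is red, any high finding is yellow, everything else is green. The `ScanSummary` type and `merge_gate` function are hypothetical names, and `risk_accepted` stands in for "a risk acceptance PR signed by an EM exists".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSummary:
    """Aggregated counts from SAST/SCA/DAST and policy checks for one merge request."""
    critical: int = 0
    high: int = 0
    policy_violations: int = 0


def merge_gate(summary: ScanSummary, risk_accepted: bool = False) -> tuple:
    """Return (color, may_merge) for a merge request."""
    # Red: never mergeable, regardless of sign-off (assumed threshold).
    if summary.critical > 0 or summary.policy_violations > 0:
        return ("red", False)
    # Yellow: mergeable only with an explicit, recorded risk acceptance.
    if summary.high > 0:
        return ("yellow", risk_accepted)
    # Green: nothing blocking.
    return ("green", True)


if __name__ == "__main__":
    print(merge_gate(ScanSummary(critical=1)))
    print(merge_gate(ScanSummary(high=2)))
    print(merge_gate(ScanSummary(high=2), risk_accepted=True))
    print(merge_gate(ScanSummary()))
```

Keeping red non-overridable is the design choice that matters; once red can be accepted too, the gate is back to being a Slack approval.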
What we’d do differently next time
- Centralize exceptions with expiration. We had a few permanent allowlists that should’ve auto-expired.
- Automate SDK provenance: Use `Renovate` rules to auto-open PRs only for vetted org registries.

// renovate.json
{
  "extends": ["config:recommended"],
  "packageRules": [
    { "matchDatasources": ["docker"], "registryUrls": ["https://ghcr.io/org"], "matchUpdateTypes": ["minor", "patch"] }
  ]
}

- SLSA levels: We hit the basics; next is attestation for build provenance and isolated runners for high-sensitivity services.
- Threat modeling cadence: Quarterly is too slow for fast-moving teams; we’re moving to lightweight per-epic reviews.
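
On the first item in that list: expiring exceptions are mostly a data-model decision. The Python sketch below shows one way it could look, assuming each exception is a record with an ISO-8601 `expires` date. The record shape, field names, and the `active_exceptions` helper are all hypothetical; the point is that an entry with no expiry is rejected rather than treated as permanent.

```python
from datetime import date


def active_exceptions(exceptions: list, today: date) -> list:
    """Keep only exceptions whose expiry date exists and has not passed."""
    active = []
    for exc in exceptions:
        expires = exc.get("expires")
        if expires is None:
            # No expiry means no exception: permanent allowlists are what we want to prevent.
            continue
        if date.fromisoformat(expires) >= today:
            active.append(exc)
    return active


# Hypothetical exception records for illustration.
EXCEPTIONS = [
    {"id": "egress-partner-api", "expires": "2024-06-30"},
    {"id": "legacy-java8-cve", "expires": "2024-01-15"},
    {"id": "forever-allowlist"},  # missing expiry, so it is dropped
]

if __name__ == "__main__":
    for exc in active_exceptions(EXCEPTIONS, today=date(2024, 3, 1)):
        print("still active:", exc["id"])
```

Passing `today` in explicitly keeps the function testable and makes it easy to preview what will lapse next month.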
If this sounds familiar, here’s how GitPlumbers engages
- 2–3 week assessment and hardening sprint: CI gates, signing, network policies, IAM cleanup.
- Month 2–3 platform patterns: GitOps verification, policy catalogs, self-serve templates.
- Ongoing coaching: make risk visible on the same dashboards as SLOs.
No silver bullets, just boring, repeatable guardrails that catch sharp edges before prod.
Key takeaways
- Bake security into the SDLC, not as an afterthought—gate merges on risk, not vibes.
- Sign and verify every artifact; block anything unsigned in CI/CD and at the cluster boundary.
- Use policy-as-code to make guardrails enforceable and reviewable just like app code.
- Least-privilege IAM plus egress restrictions stops most lateral movement attempts cold.
- Measure what matters: time-to-patch, block rates, and signature coverage—not vanity scans.
Implementation checklist
- Create SBOMs on every build and store them alongside images.
- Sign container images with `cosign` and verify in CI and admission controls.
- Enforce `deny-all` network policies and layer Istio `STRICT` mTLS.
- Adopt OPA/Kyverno policies for namespaces, resources, and image provenance.
- Rotate secrets via Vault and use short-lived, workload identity-based AWS creds.
- Automate SAST, SCA, IaC, and DAST in pipelines with hard gates on criticals.
- Track MTTR for vulns and patch windows as first-class engineering KPIs.
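
If you have never computed the last item yourself, it is only a few lines. This Python sketch takes (detected, fixed) timestamp pairs and returns the mean time to remediate and a nearest-rank P95, both in days. The function names and sample timestamps are hypothetical, the nearest-rank method is our choice (other percentile definitions give slightly different numbers), and the sketch assumes a non-empty list of closed findings.

```python
from datetime import datetime


def remediation_days(findings: list) -> list:
    """Sorted remediation times in days for (detected, fixed) datetime pairs."""
    return sorted((fixed - detected).total_seconds() / 86400 for detected, fixed in findings)


def mttr_days(findings: list) -> float:
    """Mean time to remediate, in days."""
    days = remediation_days(findings)
    return sum(days) / len(days)


def p95_days(findings: list) -> float:
    """Nearest-rank 95th percentile of remediation time, in days."""
    days = remediation_days(findings)
    rank = (95 * len(days) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return days[rank - 1]


# Hypothetical findings: remediation took 1, 2, 3, and 10 days.
FINDINGS = [
    (datetime(2024, 1, 1), datetime(2024, 1, 2)),
    (datetime(2024, 1, 1), datetime(2024, 1, 3)),
    (datetime(2024, 1, 5), datetime(2024, 1, 8)),
    (datetime(2024, 1, 10), datetime(2024, 1, 20)),
]

if __name__ == "__main__":
    print(f"MTTR: {mttr_days(FINDINGS):.1f} days")
    print(f"P95:  {p95_days(FINDINGS):.1f} days")
```

Tracking the P95 next to the mean is what exposes the one finding that sat for weeks while the average looked healthy.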
Questions we hear from teams
- Did these controls slow developers down?
- No. We kept deploy volume flat and cut change failure rate by ~9 points. The trick is making security defaults invisible: templates, generators, and automated gates that run in seconds, not manual approvals.
- Why Cosign and Kyverno over other tools?
- Cosign is the de facto standard for container signing and works well with public registries. Kyverno’s native K8s CRD approach made it easier for app teams to reason about policies than authoring Rego. We still use OPA/Rego where we need expressiveness (e.g., Conftest in CI).
- How did you handle secrets and credentials?
- HashiCorp Vault for app secrets with short TTLs, and AWS IRSA for workload identity. No long-lived human access to prod; break-glass via SSM Session Manager with approvals and full session logs.
- What about legacy EC2 hosts and AMIs?
- We wrapped them: SSM for access, Inspector for CVEs, and a WAF for edge traffic. Then we put them on a deprecation clock and moved critical data paths behind services with the new guardrails.
