The Payment API Rewrite That Finally Passed Audit: Threat Modeling Without Hitting the Brakes
Bake threat modeling into modernization sprints by turning policies into guardrails, checks, and automated proofs—without killing velocity.
“Make the fast path the safe path: policies as code, evidence as artifacts, and threat models as acceptance criteria.”
The sprint that shipped fast and failed audit
I’ve watched teams rip out a Java monolith, spin up a shiny Go service on EKS, and crush their sprint goals—only to get wrecked in a PCI pre-assessment two weeks later. At one fintech we helped, the payment API rewrite looked good in staging: green tests, fast p99s, zero prod incidents. Then audit asked for encryption proofs and data-flow diagrams. We had docs, but not evidence: no attestations, no IaC policies, no lineage of changes. Cue weeks of screenshots and spreadsheet archaeology.
Here’s the part folks miss: you don’t need a separate security waterfall to pass audit. You need to make security design decisions explicit in each story, enforce them automatically, and capture proof as a normal byproduct of the pipeline. That’s threat modeling that ships.
A two-hour threat model that fits every modernization story
Threat modeling doesn’t need a room of PhDs or a 30-page report. Treat it like a design control for each significant change.
- Keep it lightweight: 30–90 minutes per story that changes trust boundaries (new service, new data store, new external dependency, new auth flows).
- Use a one-page template per story (a minimal sketch follows this list):
  - System/context: who calls this, what data, where it runs.
  - Data classification: Restricted (PII/PHI/PAN), Internal, Public.
  - STRIDE prompts: spoofing, tampering, repudiation, info disclosure, DoS, elevation. List the top 3 risks only.
  - Controls: what guardrails enforce this? (mTLS, TLS 1.2+, egress policy, KMS encryption, secret mgmt, authz).
  - Security acceptance criteria: concrete checks you’ll automate in CI/CD.
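Here’s a minimal sketch of that one-pager as YAML you could embed in the story template (the ticket ID, field names, and values are all illustrative, not a standard):

story: PAY-142                      # hypothetical ticket ID
context:
  callers: [checkout-web, billing-batch]
  data: [PAN, email]
  runs_on: eks/payments
classification: restricted          # restricted | internal | public
top_risks:                          # STRIDE-prompted, capped at three
  - spoofed caller reaches the detokenization endpoint
  - PAN leaks into application logs
  - tampered image deployed from an unsigned registry
controls: [mTLS, SSE-KMS, egress-deny, cosign-verify]
acceptance_criteria:                # each one becomes an automated CI/CD check
  - conftest: S3 buckets must enable SSE-KMS
  - semgrep: no PII fields in log calls
  - admission: Ingress must set spec.tls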
Bake it into the story template and DoD: if it changes data or trust, it gets a threat model and acceptance criteria.
Pro tip: tie risks to specific controls you can codify. “Protect PAN at rest” becomes “Terraform S3 buckets must enable SSE-KMS with key rotation; CI fails otherwise.”
I’ve had good luck with Threat Dragon for quick diagrams, or just a README with an ASCII diagram. The key is linking design to automated checks, not prettying up Visio.
Turn policy into guardrails: policy-as-code you actually ship
Most orgs have binders full of policy text. Translate it into code that runs where changes happen: in PRs and at admission to the cluster.
- Kubernetes (EKS/GKE/AKS): enforce base controls with OPA Gatekeeper or Kyverno. Example: deny Ingress without TLS (a Kyverno equivalent follows the Rego):
# Conftest-style check against a rendered Kubernetes manifest,
# where `input` is the manifest itself (so `kind` is a string)
package kubernetes.network

deny[msg] {
  input.kind == "Ingress"
  not input.spec.tls
  msg := "Ingress must terminate TLS (spec.tls required)"
}
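If you lean Kyverno instead, here’s a rough equivalent; this is a hedged sketch, so check the kyverno/policies library for a vetted version before enforcing (the array pattern assumes each TLS entry names a cert secret):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-tls
spec:
  validationFailureAction: Audit    # report first; flip to Enforce after two sprints
  rules:
    - name: require-tls
      match:
        any:
          - resources:
              kinds: [Ingress]
      validate:
        message: "Ingress must terminate TLS (spec.tls required)"
        pattern:
          spec:
            tls:                    # requires at least one TLS entry
              - secretName: "?*"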
- Terraform/IaC: run Conftest or Checkov in CI. Example: require S3 SSE-KMS:
# Evaluated with Conftest against `terraform show -json` plan output
package terraform.s3

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  not resource.change.after.server_side_encryption_configuration
  msg := sprintf("Bucket %s must enable SSE-KMS", [resource.address])
}
- Containers: Trivy or Grype for images; gate on HIGH severity initially.
- App code: Semgrep rules for logging PII, insecure crypto, SSRF. Add a small custom ruleset that matches your stack (Go/Java/Node).
- Secrets: `gitleaks` in pre-commit and CI (a pre-commit sketch follows this list). Block merges on confirmed secrets; add a break-glass flow for false positives.
- Supply chain: sign and attest images with `cosign`. Use SLSA-style provenance later; don’t boil the ocean on day one.
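For the pre-commit side, a minimal .pre-commit-config.yaml sketch using the hook gitleaks ships in its repo (the rev tag is an example; pin to whatever release you’ve vetted):

repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4                    # example tag; pin your vetted version
    hooks:
      - id: gitleaks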
Wiring this into CI/CD is boring but critical. Example GitHub Actions snippet that: runs IaC policy checks, scans containers, emits SARIF, and produces an attestation:
name: ci-security
on: [pull_request]
permissions:
  contents: read
  security-events: write   # needed to upload SARIF
  id-token: write          # needed for keyless cosign signing
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Conftest (Terraform)
        run: conftest test -p policy/terraform terraform-plan.json
      - name: Trivy (image)
        run: |
          # Emit SARIF for visibility without failing the job...
          trivy image --exit-code 0 --format sarif -o trivy.sarif ghcr.io/acme/payment:pr-${{ github.sha }}
          # ...then gate hard on HIGH/CRITICAL findings
          trivy image --severity HIGH,CRITICAL --exit-code 1 ghcr.io/acme/payment:pr-${{ github.sha }}
      - name: Upload SARIF
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy.sarif
      - name: Cosign attest
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          cosign attest --predicate build.json --type slsa-provenance ghcr.io/acme/payment:pr-${{ github.sha }}
In the cluster, apply admission guardrails early in “audit” mode so developers can see violations without blocking. Flip to “enforce” after two sprints.
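For example, a Gatekeeper constraint in dry-run mode; this sketch assumes the demo K8sRequiredLabels template from the Gatekeeper docs is installed, and the label is illustrative:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-data-tier-label
spec:
  enforcementAction: dryrun         # log violations without blocking admission
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["data.tier"]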
Automated proofs: evidence or it didn’t happen
Auditors want evidence that controls are effective over time, not screenshots from last Thursday. Generate proofs automatically and store them per commit and release.
Evidence catalog per commit/tag (an upload sketch follows this list):
- SBOM (`trivy sbom` or `syft`)
- Vulnerability scan results (SARIF)
- Policy evaluations (OPA/Conftest outputs)
- Signed provenance (`cosign attest`)
- Deployment diff (ArgoCD app history)
Immutable evidence store: an S3 bucket with versioning + retention (e.g., 3 years), or a GCS bucket with a locked retention policy. Use a key like `/service/payment/commit/<sha>/`.
Link evidence to change approvals: the PR description includes links; Change Advisory can click and verify. No PDFs, no screenshots.
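A sketch of the upload step in CI; the bucket name and artifact paths are hypothetical, and it assumes AWS credentials are already configured in the job:

- name: Upload evidence
  run: |
    # Copy per-commit artifacts into the versioned, retention-locked bucket
    aws s3 cp sbom.json s3://acme-audit-evidence/service/payment/commit/${{ github.sha }}/sbom.json
    aws s3 cp trivy.sarif s3://acme-audit-evidence/service/payment/commit/${{ github.sha }}/trivy.sarif
    aws s3 cp conftest-output.json s3://acme-audit-evidence/service/payment/commit/${{ github.sha }}/policy.json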
For GitOps: ArgoCD’s app status and sync history provide a deployment audit trail. Export periodically and park JSON in the evidence store.
That fintech I mentioned? We cut audit prep from weeks to hours because the pipeline produced everything they needed on every merge and release. No theater, just artifacts.
Regulated data without the brakes: patterns that work
If you handle PAN, PII, or PHI, you need to constrain blast radius without paralyzing delivery. The trick is consistent patterns and defaults that are easy to follow.
Data classification as code:
- Tag data stores and topics with labels like `data.tier=restricted` and enforce policies (e.g., no public egress, KMS required).
- Annotate services with `handles.pii=true` to trigger stricter admission policies (mTLS, sidecar, logging redaction).
Tokenize or mask at the edge:
- Use HashiCorp Vault Transform or an in-house tokenization service for PAN before it hits downstream services.
- Keep the detokenization boundary tiny and well-guarded (HSM-backed KMS, limited egress).
Encrypt everything by default:
- AWS KMS/GCP KMS for storage and secrets; Vault for dynamic DB creds.
- Service-to-service mTLS (Istio/Linkerd) for restricted tiers; a sketch follows this list. Start with ingress TLS if a mesh rollout is too heavy on day one.
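A minimal sketch of mesh-enforced mTLS, assuming Istio is already installed (the payments namespace is hypothetical):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: restricted-tier-mtls
  namespace: payments               # hypothetical namespace for restricted-tier services
spec:
  mtls:
    mode: STRICT                    # reject plaintext service-to-service traffic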
Redact telemetry: OpenTelemetry Collector processors to strip PII before export:
processors:
  attributes/scrub:
    actions:
      - key: user.ssn
        action: delete
      - key: credit_card
        action: update
        value: "[REDACTED]"
Add Semgrep rules to block logging of `email`, `ssn`, `pan`, and `address` fields; a sketch follows.
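A hedged sketch of one such rule for Go; the struct fields and the log call are illustrative, and real rules should match your actual logging library:

rules:
  - id: no-pii-in-logs
    languages: [go]
    severity: ERROR
    message: Do not log PII fields (email, ssn, pan, address)
    patterns:
      - pattern-either:
          - pattern: log.Printf($FMT, ..., $X.Email, ...)
          - pattern: log.Printf($FMT, ..., $X.SSN, ...)
          - pattern: log.Printf($FMT, ..., $X.PAN, ...)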
Access and egress:
- Lock down egress in Kubernetes (NetworkPolicy) and in the cloud (VPC egress, PrivateLink); see the sketch after this list.
- Per-tenant authz at the service layer with an OPA sidecar or library. Cache decisions to avoid latency hits.
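A minimal default-deny egress sketch, again assuming a hypothetical payments namespace; DNS stays open so pods can still resolve the destinations you explicitly allow later:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: payments               # hypothetical restricted-tier namespace
spec:
  podSelector: {}                   # selects every pod in the namespace
  policyTypes: [Egress]
  egress:
    - ports:                        # only DNS allowed; add explicit rules per approved destination
        - protocol: UDP
          port: 53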
Testing without real data: synthetic datasets + realistic generators. If you must sample production, de-identify in a separate account and time-bound access with audit logs.
Combine these with feature flags (LaunchDarkly/OpenFeature) so you can canary restrictions safely. The defaults do the heavy lifting; developers keep shipping.
Keep pipelines fast: scan smart, gate smart
Security that slows delivery gets bypassed. Make the fast path the safe path.
- Run cheap checks first: `gitleaks`, `semgrep --config auto`, `conftest` on diffs only.
- Parallelize heavy scans: image scans vs. IaC vs. code can run concurrently.
- Cache vulnerability DBs (Trivy DB, pip/npm caches) to avoid cold-start pain; a caching sketch follows this list.
- Severity-based gates: warn on MEDIUM, block on HIGH/CRITICAL for 2–3 sprints; then ratchet the threshold down as debt burns down.
- Incremental scanning: limit analysis to changed modules; run full scans nightly.
- DAST without blocking PRs: spin up an ephemeral env, run `zap-baseline`, and comment results on the PR; gate only on criticals pre-merge or before release.
- Admission policies in “audit” first, “enforce” later. Developers need signal before you add friction.
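For example, caching the Trivy DB between runs in GitHub Actions (a sketch; ~/.cache/trivy is Trivy’s default cache directory on Linux):

- name: Cache Trivy DB
  uses: actions/cache@v4
  with:
    path: ~/.cache/trivy            # reused across runs to skip the DB download
    key: trivy-db-${{ runner.os }}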
Observability on your pipeline: measure every step. If CI time jumps by 5 minutes after adding a scanner, you need caching or scope control.
We routinely hold the line at <10 minutes added to CI for services, with full nightly jobs doing the heavier sweeps.
Metrics that matter and a rollout plan that sticks
If you can’t measure it, you’ll argue about it in the next QBR. Track delivery and security together.
Delivery (DORA): lead time, deployment frequency, change failure rate, MTTR.
Security:
- Policy pass rate per PR (goal: >90% after two sprints).
- Time to remediate HIGH/CRITICAL findings (goal: <7 days).
- Exception rate and aging (goal: <5% open >30 days).
- CI duration and flake rate (goal: <10% increase after controls).
Compliance: evidence completeness per release (goal: 100%).
Rollout in three phases:
- Sprint 0–1: instrument in “audit” mode. Add the lightweight threat model to story templates. Publish guardrails and links in PR comments.
- Sprint 2–3: flip on enforcement for a handful of high-value rules (TLS, no public S3, secrets). Gate on HIGH/CRITICAL only.
- Sprint 4+: expand rules, lower severities as debt drops, and introduce provenance attestations into release promotion.
Real result: a payments team we supported moved from a Java monolith to Go on EKS. In six weeks, they:
- Kept lead time steady (median PR-to-prod ~12 hours).
- Dropped high-severity vulns in images by 82% via pinned base-image digests and Trivy gates.
- Passed the PCI pre-assessment with zero corrective actions because evidence was linked to every tag in the GitOps repo.
If you want help wiring this up without slowing your sprints, GitPlumbers has the scars and the playbooks.
Key takeaways
- Threat modeling can be a 30–90 minute ritual per story, not a multi-week workshop.
- Translate policy text into guardrails with `OPA/Gatekeeper`, `Kyverno`, `Conftest`, and CI checks that fail fast.
- Generate automated proofs (SBOMs, scan results, attestations) and store them as audit evidence per commit and release.
- Handle regulated data with design patterns: tokenization, masking at the edge, redaction in telemetry, and tiered data handling.
- Keep pipelines fast with incremental scans, parallel jobs, severity-based gates, and pre-commit hooks.
- Measure both delivery and security: DORA + time-to-remediate highs, exception rate, policy pass rate, and pipeline duration.
Implementation checklist
- Add a 30–90 minute threat-model step to story kickoff; use a 1-page template and STRIDE prompts.
- Write or adopt `policy-as-code` for IaC and Kubernetes; start in warn-only, then enforce.
- Generate SBOMs and vulnerability scan results for every build; upload SARIF to your repo.
- Attest images and releases with `cosign`; store evidence in an immutable bucket with retention.
- Classify data (Restricted vs. Internal) and enforce defaults: `no PII in logs`, `TLS everywhere`, `no plaintext secrets`.
- Speed up CI: parallelize scans, cache databases, and gate only on `high` severity initially.
- Track metrics weekly: policy pass rate, high-sev time-to-fix, exceptions, and CI duration.
Questions we hear from teams
- How do we keep threat modeling from becoming a time sink?
- Scope it to stories that change trust or data boundaries and cap it at 30–90 minutes. Use a one-page template, STRIDE prompts, and define security acceptance criteria you can automate. Defer deep dives to a backlog if needed; the point is to catch the top risks early and codify controls.
- Won’t adding scanners and policies balloon our CI times?
- Only if you run everything on every change. Run cheap checks first, parallelize heavy scans, cache DBs, and scan diffs. Gate initially only on HIGH/CRITICAL and move admission controllers from audit to enforce gradually. Keep added CI time under ~10 minutes.
- What counts as acceptable audit evidence?
- Artifacts generated by your pipeline: SBOMs, vulnerability scans (SARIF), policy evaluation outputs (OPA/Conftest), signed provenance (cosign/in-toto), and GitOps deployment history. Store them in an immutable bucket with retention and link them to PRs and releases.
- We’re not ready for a service mesh. Can we still meet encryption-in-transit requirements?
- Yes. Start with TLS at the edge and between services using mTLS libraries or sidecars where needed. Many teams meet PCI/SOC requirements with managed ingress + TLS 1.2+, cloud KMS for certs, and narrow egress policies. You can adopt a mesh later for consistency.
- How do exceptions work without becoming the Wild West?
- Use a standard exception template (risk, compensating controls, expiry), require owner approval, and track aging. Keep the bar high and visible; review exceptions weekly. Most teams can keep long-lived exceptions under 5% once good defaults are in place.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.