The Payment API Rewrite That Finally Passed Audit: Threat Modeling Without Hitting the Brakes

Bake threat modeling into modernization sprints by turning policies into guardrails, checks, and automated proofs—without killing velocity.

“Make the fast path the safe path: policies as code, evidence as artifacts, and threat models as acceptance criteria.”

The sprint that shipped fast and failed audit

I’ve watched teams rip out a Java monolith, spin up a shiny Go service on EKS, and crush their sprint goals—only to get wrecked in a PCI pre-assessment two weeks later. At one fintech we helped, the payment API rewrite looked good in staging: green tests, fast p99s, zero prod incidents. Then audit asked for encryption proofs and data-flow diagrams. We had docs, but not evidence: no attestations, no IaC policies, no lineage of changes. Cue weeks of screenshots and spreadsheet archaeology.

Here’s the part folks miss: you don’t need a separate security waterfall to pass audit. You need to make security design decisions explicit in each story, enforce them automatically, and capture proof as a normal byproduct of the pipeline. That’s threat modeling that ships.

A two-hour threat model that fits every modernization story

Threat modeling doesn’t need a room of PhDs or a 30-page report. Treat it like a design control for each significant change.

  • Keep it lightweight: 30–90 minutes per story that changes trust boundaries (new service, new data store, new external dependency, new auth flows).

  • Use a one-page template per story:

    • System/context: who calls this, what data, where it runs.

    • Data classification: Restricted (PII/PHI/PAN), Internal, Public.

    • STRIDE prompts: spoofing, tampering, repudiation, info disclosure, DoS, elevation. List top 3 risks only.

    • Controls: what guardrails enforce this? (mTLS, TLS 1.2+, egress policy, KMS encryption, secret mgmt, authz).

    • Security acceptance criteria: concrete checks you’ll automate in CI/CD.

  • Bake it into the story template and DoD: if it changes data or trust, it gets a threat model and acceptance criteria.

Pro tip: tie risks to specific controls you can codify. “Protect PAN at rest” becomes “Terraform S3 buckets must enable SSE-KMS with key rotation; CI fails otherwise.”
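
For teams that want something to copy, here is a minimal sketch of that one-pager as YAML you could keep next to the service or paste into the story. The field names and story details are illustrative, not a standard:

# threat-model.yaml (illustrative field names; one per story that changes trust or data boundaries)
story: PAY-512 "Expose refunds endpoint to the partner gateway"
context: partner gateway -> API gateway -> payments-svc -> refunds DB (Go service on EKS, prod VPC)
data_classification: restricted            # PAN transits; tokens stored downstream
top_risks:                                 # STRIDE prompts, top 3 only
  - tampering: partner could replay or alter refund requests
  - info_disclosure: PAN could leak into logs or traces
  - elevation: partner credentials reused against internal-only endpoints
controls:
  - mTLS between gateway and payments-svc
  - request signing plus idempotency keys on refunds
  - log/trace redaction for PAN-bearing fields
acceptance_criteria:                       # each line maps to an automated CI/CD check
  - conftest: refunds data store must be SSE-KMS encrypted
  - semgrep: no PAN-bearing fields in log statements
  - admission policy: Ingress must terminate TLS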

I’ve had good luck with Threat Dragon for quick diagrams or just a README with an ASCII diagram. The key is linking design to automated checks, not prettying up Visio.

Turn policy into guardrails: policy-as-code you actually ship

Most orgs have binders full of policy text. Translate it into code that runs where changes happen: in PRs and at cluster admission.

  • Kubernetes (EKS/GKE/AKS): enforce base controls with OPA Gatekeeper or Kyverno. Example: deny Ingress without TLS (written as plain Rego you can run with Conftest against manifests; Gatekeeper wraps the same rule in a ConstraintTemplate):
package kubernetes.network

# Flag any Ingress manifest that does not terminate TLS
deny[msg] {
  input.kind == "Ingress"
  not input.spec.tls
  msg := "Ingress must terminate TLS (spec.tls required)"
}
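If Kyverno is your admission controller, the equivalent guardrail is a small ClusterPolicy. This is a hedged sketch (pattern semantics differ slightly across Kyverno versions), so dry-run it with kyverno apply before turning on enforcement:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ingress-tls
spec:
  validationFailureAction: Audit   # start in audit; flip to Enforce after the grace period
  background: true
  rules:
    - name: require-tls-block
      match:
        any:
          - resources:
              kinds:
                - Ingress
      validate:
        message: "Ingress must terminate TLS (spec.tls required)"
        pattern:
          spec:
            tls:
              - secretName: "?*"   # at least one TLS entry backed by a secret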
  • Terraform/IaC: run Conftest or Checkov in CI. Example: require S3 SSE-KMS:
package terraform.s3

# Evaluated in CI against `terraform show -json` plan output
deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_s3_bucket"
  # Checks the inline SSE block (AWS provider <4). On newer providers, add a companion rule
  # for the standalone aws_s3_bucket_server_side_encryption_configuration resource.
  not resource.change.after.server_side_encryption_configuration
  msg := sprintf("Bucket %s must enable SSE-KMS", [resource.address])
}
  • Containers: Trivy or Grype for images; gate on HIGH severity initially.

  • App code: Semgrep rules for logging PII, insecure crypto, SSRF. Add a small custom ruleset that matches your stack (Go/Java/Node); a minimal Go-flavored sketch follows after this list.

  • Secrets: gitleaks in pre-commit and CI. Block merges on confirmed secrets; add a break-glass flow for false positives.

  • Supply chain: sign and attest images with cosign. Use SLSA-style provenance later; don’t boil the ocean on day one.
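
For the Semgrep point above, here is a minimal Go-flavored rule sketch. The pattern and field regex are illustrative and assume log.Printf-style calls, so tune them to your actual logging API:

rules:
  - id: no-pii-in-logs
    languages: [go]
    severity: ERROR
    message: Do not log fields that look like PII or PAN (ssn, pan, card, email)
    patterns:
      - pattern: log.Printf($FMT, ..., $ARG, ...)
      - metavariable-regex:
          metavariable: $ARG
          regex: (?i).*(ssn|pan|card|email).*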

Wiring this into CI/CD is boring but critical. Here's an example GitHub Actions snippet that runs IaC policy checks, scans containers, emits SARIF, and produces an attestation:

name: ci-security
on: [pull_request]
permissions:
  contents: read
  security-events: write   # needed to upload SARIF
  id-token: write          # needed for keyless cosign signing
jobs:
  checks:
    runs-on: ubuntu-latest
    # Assumes conftest, trivy, and cosign are installed earlier in the job (or baked into a
    # builder image), and that terraform-plan.json and the PR image already exist.
    steps:
      - uses: actions/checkout@v4
      - name: Conftest (Terraform)
        run: conftest test -p policy/terraform terraform-plan.json
      - name: Trivy (image)
        run: |
          trivy image --exit-code 0 --format sarif -o trivy.sarif ghcr.io/acme/payment:pr-${{ github.sha }}
          trivy image --severity HIGH,CRITICAL --exit-code 1 ghcr.io/acme/payment:pr-${{ github.sha }}
      - name: Upload SARIF
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy.sarif
      - name: Cosign attest
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          cosign attest --predicate build.json --type slsaprovenance ghcr.io/acme/payment:pr-${{ github.sha }}

In the cluster, apply admission guardrails early in “audit” mode so developers can see violations without blocking. Flip to “enforce” after two sprints.
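
With Gatekeeper, that audit-first posture is just the enforcementAction field on the constraint. The kind below (K8sRequireIngressTLS) is hypothetical; it would come from a ConstraintTemplate you define yourself:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequireIngressTLS        # hypothetical kind defined by your own ConstraintTemplate
metadata:
  name: require-ingress-tls
spec:
  enforcementAction: dryrun       # violations show up in audit results without blocking; switch to deny to enforce
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["Ingress"]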

Automated proofs: evidence or it didn’t happen

Auditors want evidence that controls are effective over time, not screenshots from last Thursday. Generate proofs automatically and store them per commit and release.

  • Evidence catalog per commit/tag:

    1. SBOM (syft, or trivy image --format spdx-json)

    2. Vulnerability scan results (SARIF)

    3. Policy evaluations (OPA/Conftest outputs)

    4. Signed provenance (cosign attest)

    5. Deployment diff (ArgoCD app history)

  • Immutable evidence store: S3 bucket with versioning plus Object Lock retention (e.g., 3 years), or a GCS bucket with a retention policy (Bucket Lock). Use a key like /service/payment/commit/<sha>/.

  • Link evidence to change approvals: PR description includes links; Change Advisory can click and verify. No PDFs, no screenshots.

  • For GitOps: ArgoCD’s app status and sync history provide a deployment audit trail. Export periodically and park JSON in the evidence store.
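
As a concrete sketch, the tail of the CI job can generate the SBOM and push the bundle under that key scheme. The bucket name is made up, and the step assumes syft and the AWS CLI are on the runner and that trivy.sarif and conftest.json were produced earlier in the job:

- name: Publish evidence bundle
  env:
    EVIDENCE_PREFIX: s3://acme-audit-evidence/service/payment/commit/${{ github.sha }}   # hypothetical bucket
  run: |
    syft ghcr.io/acme/payment:pr-${{ github.sha }} -o spdx-json > sbom.spdx.json
    aws s3 cp sbom.spdx.json "$EVIDENCE_PREFIX/sbom.spdx.json"
    aws s3 cp trivy.sarif    "$EVIDENCE_PREFIX/trivy.sarif"
    aws s3 cp conftest.json  "$EVIDENCE_PREFIX/conftest.json"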

That fintech I mentioned? We cut audit prep from weeks to hours because the pipeline produced everything they needed on every merge and release. No theater, just artifacts.

Regulated data without the brakes: patterns that work

If you handle PAN, PII, or PHI, you need to constrain blast radius without paralyzing delivery. The trick is consistent patterns and defaults that are easy to follow.

  • Data classification as code:

    • Tag data stores and topics with labels like data.tier=restricted and enforce policies (e.g., no public egress, KMS required).

    • Annotate services with handles.pii=true to trigger stricter admission policies (mTLS, sidecar, logging redaction).

  • Tokenize or mask at the edge:

    • Use HashiCorp Vault Transform or an in-house tokenization service for PAN before it hits downstream services.

    • Keep the detokenization boundary tiny and well-guarded (HSM-backed KMS, limited egress).

  • Encrypt everything by default:

    • AWS KMS/GCP KMS for storage and secrets; Vault for dynamic DB creds.

    • Service-to-service mTLS (Istio/Linkerd) for restricted tiers. Start with ingress TLS if mesh rollout is too heavy on day one.
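
If you do adopt a mesh, strict mTLS for a restricted-tier namespace is one small Istio resource (namespace name illustrative):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments        # illustrative: the namespace holding restricted-tier services
spec:
  mtls:
    mode: STRICT             # reject plaintext traffic between sidecars in this namespace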

  • Redact telemetry:

    • OpenTelemetry Collector processors to strip PII before export:
# Add attributes/scrub to the relevant pipeline under service.pipelines so it runs before any exporter.
processors:
  attributes/scrub:
    actions:
      - key: user.ssn
        action: delete
      - key: credit_card
        action: update
        value: "[REDACTED]"
  • Add Semgrep rules to block logging of email, ssn, pan, address fields.

  • Access and egress:

    • Lock down egress in Kubernetes (NetworkPolicy) and cloud (VPC egress, PrivateLink).
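
A default-deny egress policy per restricted namespace is the usual starting point. A minimal sketch (namespace illustrative; add explicit allowances for the dependencies each service actually needs):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: payments          # illustrative restricted-tier namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53             # keep DNS working; everything else is denied until allowed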

    • Per-tenant authz at the service layer with OPA sidecar or library. Cache decisions to avoid latency hits.

  • Testing without real data: synthetic datasets + realistic generators. If you must sample production, de-identify in a separate account and time-bound access with audit logs.

Combine these with feature flags (LaunchDarkly/OpenFeature) so you can canary restrictions safely. The defaults do the heavy lifting; developers keep shipping.

Keep pipelines fast: scan smart, gate smart

Security that slows delivery gets bypassed. Make the fast path the safe path.

  • Run cheap checks first: gitleaks, semgrep --config auto, conftest on diffs only.

  • Parallelize heavy scans: image scans vs. IaC vs. code can run concurrently.

  • Cache vulnerability DBs (Trivy DB, pip/npm caches) to avoid cold-start pain.

  • Severity-based gates: warn on MEDIUM, block on HIGH/CRITICAL for 2–3 sprints; then ratchet down as debt burns down.

  • Incremental scanning: limit analysis to changed modules; run full scans nightly.

  • DAST without blocking PRs: spin an ephemeral env, run zap-baseline, and comment results on the PR; gate only on criticals in pre-merge or before release (a sketch follows after this list).

  • Admission policies in “audit” first, “enforce” later. Developers need signal before you add friction.

  • Observability on your pipeline: put metrics on every step. If CI time jumps by 5 minutes after adding a scanner, you need caching or scope control.
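
For the DAST point above, a nightly or post-deploy job can run the ZAP baseline scan against the ephemeral environment. The target URL is illustrative, and the image tag is whatever ZAP release you've vetted:

- name: ZAP baseline (non-blocking)
  continue-on-error: true      # report findings without failing the PR; gate separately on criticals
  run: |
    docker run --rm -v "$PWD:/zap/wrk" ghcr.io/zaproxy/zaproxy:stable \
      zap-baseline.py -t https://pr-preview.acme.dev -r zap-report.html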

We routinely hold the line at <10 minutes added to CI for services, with full nightly jobs doing the heavier sweeps.

Metrics that matter and a rollout plan that sticks

If you can’t measure it, you’ll argue about it in the next QBR. Track delivery and security together.

  • Delivery (DORA): lead time, deployment frequency, change failure rate, MTTR.

  • Security:

    • Policy pass rate per PR (goal: >90% after two sprints).

    • Time to remediate HIGH/CRITICAL findings (goal: <7 days).

    • Exception rate and aging (goal: <5% open >30 days).

    • CI duration and flake rate (goal: less than a 10% increase after controls are added).

  • Compliance: evidence completeness per release (goal: 100%).

Rollout in three phases:

  1. Sprint 0–1: instrument in “audit” mode. Add the lightweight threat model to story templates. Publish guardrails and links in PR comments.

  2. Sprint 2–3: flip on enforcement for a handful of high-value rules (TLS, no public S3, secrets). Gate on HIGH/CRIT only.

  3. Sprint 4+: expand rules, lower severities as debt drops, introduce provenance attestations into release promotion.

Real result: a payments team we supported moved from a Java monolith to Go on EKS. In six weeks, they:

  • Kept lead time steady (median PR-to-prod ~12 hours).

  • Dropped high-sev vulns in images by 82% via base-image pinned digests and Trivy gates.

  • Passed PCI pre-assessment with zero corrective actions because evidence was linked to every tag in the GitOps repo.

If you want help wiring this up without slowing your sprints, GitPlumbers has the scars and the playbooks.

Key takeaways

  • Threat modeling can be a 30–90 minute ritual per story, not a multi-week workshop.
  • Translate policy text into guardrails with `OPA/Gatekeeper`, `Kyverno`, `Conftest`, and CI checks that fail fast.
  • Generate automated proofs (SBOMs, scan results, attestations) and store them as audit evidence per commit and release.
  • Handle regulated data with design patterns: tokenization, masking at the edge, redaction in telemetry, and tiered data handling.
  • Keep pipelines fast with incremental scans, parallel jobs, severity-based gates, and pre-commit hooks.
  • Measure both delivery and security: DORA + time-to-remediate highs, exception rate, policy pass rate, and pipeline duration.

Implementation checklist

  • Add a 30–90 minute threat-model step to story kickoff; use a 1-page template and STRIDE prompts.
  • Write or adopt `policy-as-code` for IaC and Kubernetes; start in warn-only, then enforce.
  • Generate SBOMs and vulnerability scan results for every build; upload SARIF to your repo.
  • Attest images and releases with `cosign`; store evidence in an immutable bucket with retention.
  • Classify data (Restricted vs. Internal) and enforce defaults: `no PII in logs`, `TLS everywhere`, `no plaintext secrets`.
  • Speed up CI: parallelize scans, cache databases, and gate only on `high` severity initially.
  • Track metrics weekly: policy pass rate, high-sev time-to-fix, exceptions, and CI duration.

Questions we hear from teams

How do we keep threat modeling from becoming a time sink?
Scope it to stories that change trust or data boundaries and cap it at 30–90 minutes. Use a one-page template, STRIDE prompts, and define security acceptance criteria you can automate. Defer deep dives to a backlog if needed; the point is to catch the top risks early and codify controls.
Won’t adding scanners and policies balloon our CI times?
Only if you run everything on every change. Run cheap checks first, parallelize heavy scans, cache DBs, and scan diffs. Gate initially only on HIGH/CRITICAL and move admission controllers from audit to enforce gradually. Keep added CI time under ~10 minutes.
What counts as acceptable audit evidence?
Artifacts generated by your pipeline: SBOMs, vulnerability scans (SARIF), policy evaluation outputs (OPA/Conftest), signed provenance (cosign/in-toto), and GitOps deployment history. Store them in an immutable bucket with retention and link them to PRs and releases.
We’re not ready for a service mesh. Can we still meet encryption-in-transit requirements?
Yes. Start with TLS at the edge and between services using mTLS libraries or sidecars where needed. Many teams meet PCI/SOC requirements with managed ingress + TLS 1.2+, cloud KMS for certs, and narrow egress policies. You can adopt a mesh later for consistency.
How do exceptions work without becoming the Wild West?
Use a standard exception template (risk, compensating controls, expiry), require owner approval, and track aging. Keep the bar high and visible; review exceptions weekly. Most teams can keep long-lived exceptions under 5% once good defaults are in place.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Schedule a working session · Download the threat-model one-pager
