Zero Trust That Ships: Turning Policies Into Guardrails, Checks, and Proofs

War stories and working patterns for building zero-trust into distributed systems without grinding delivery to a halt.

Zero trust that ships isn’t a firewall—it’s an identity plane, a policy engine, and an evidence store that developers barely notice.

The outage that sold me on zero trust

A few years ago at a fintech on AWS, one compromised CI runner pivoted into a shared Kubernetes cluster. No mTLS. Nodes shared IAM instance roles. Lateral movement was a bash one-liner. We spent a weekend rotating keys and explaining to auditors why “private VPC” didn’t mean “private.”

What fixed it wasn’t a new firewall. It was treating identity, authorization, and evidence as first-class product features:

  • Workload identity via SPIFFE/SPIRE and Istio mTLS
  • Policy as code with OPA/Gatekeeper and Kyverno
  • Supply-chain proofs with Sigstore Cosign, SBOMs, and SLSA provenance
  • GitOps so the secure path became the fast path

If you’ve been burned by slideware zero trust, this is the version that actually ships.

Principles that survive contact with prod

Keep the poster on the wall if you want; here’s what matters in distributed systems:

  • Strong workload identity: Every workload gets an identity (spiffe://…) tied to a service account, not a host. Certs are short-lived and auto-rotated.
  • AuthZ everywhere: Default deny with precise AuthorizationPolicy and least privilege IAM. No shared instance roles.
  • Encrypted and attested: STRICT mTLS service-to-service; artifacts and config changes are signed and verifiable.
  • Automated, auditable controls: Policies enforced at build, deploy, and runtime—leaving an immutable evidence trail.
  • Developer speed as a requirement: Guardrails, not roadblocks. Fast feedback in PR, automated exceptions with expiries.

This isn’t theoretical. It’s AWS/GCP/Azure, K8s ≥1.25, Istio ≥1.19, SPIRE ≥1.8, OPA/Gatekeeper ≥3.12, Kyverno ≥1.12, ArgoCD ≥2.11, Cosign ≥2.2.

Translate policy into guardrails

Start with policy language your auditors care about, then encode it where it bites the risk with minimal developer friction.

  1. Classify data and label it
    • Define data-classification labels: public, internal, restricted, regulated (PII/PHI/PCI).
    • Namespaces and workloads carry the label; policies key off it.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    data-classification: regulated
  2. Admission guardrails with Gatekeeper (Rego)
    • Example: block privileged containers and require runAsNonRoot for anything regulated.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-classification
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels:
      - key: data-classification
---
# Constraints from the gatekeeper-library pod-security-policy templates
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: regulated-no-privileged
spec:
  match:
    namespaces: ["payments"]
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPAllowedUsers
metadata:
  name: regulated-run-as-nonroot
spec:
  match:
    namespaces: ["payments"]
  parameters:
    runAsUser:
      rule: MustRunAsNonRoot
  3. Signature verification with Kyverno
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cosign
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet]
      verifyImages:
        - imageReferences:
            - "ghcr.io/acme/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |
                      -----BEGIN PUBLIC KEY-----
                      ...cosignpub...
                      -----END PUBLIC KEY-----
  4. Terraform plan checks with Conftest
# policy/s3_public.rego
package terraform.deny

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_s3_bucket_public_access_block"
  rc.change.after.block_public_acls == false
  msg := "S3 bucket allows public ACLs"
}
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
conftest test plan.json -p policy/ --namespace terraform.deny

Automated proofs: make audits queryable

Auditors don’t want vibes; they want evidence with timestamps. Automate it.

  • SBOM + signatures + provenance
# Build
syft packages ghcr.io/acme/payments@sha256:… -o spdx-json > sbom.json
cosign sign --key cosign.key ghcr.io/acme/payments@sha256:…
cosign attest --key cosign.key \
  --predicate sbom.json --type spdx \
  ghcr.io/acme/payments@sha256:…
  • Admission must verify

    • Use Kyverno verifyImages (above) or Gatekeeper with an external data provider.
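If you're on Gatekeeper rather than Kyverno, admission verification runs through an external data Provider that calls out to a signature-checking service. A sketch, assuming a hypothetical in-cluster verifier (the service name and URL are ours):

```yaml
apiVersion: externaldata.gatekeeper.sh/v1beta1
kind: Provider
metadata:
  name: cosign-verifier              # hypothetical provider name
spec:
  url: https://cosign-verifier.gatekeeper-system:8443/validate  # hypothetical verifier endpoint
  timeout: 3
  caBundle: "<base64 CA bundle for the verifier's TLS cert>"
```

A constraint template's Rego then calls external_data against this provider for each image reference.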
  • SLSA provenance (Tekton + in-toto)

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-sign-provenance
spec:
  tasks:
    - name: build
      taskRef: { name: kaniko }
    - name: sbom
      runAfter: [build]
      taskRef: { name: syft }
    - name: sign
      runAfter: [sbom]
      taskRef: { name: cosign-sign }
    - name: attest
      runAfter: [sign]
      taskRef: { name: cosign-attest }
  • Evidence store

    • Push SBOMs, signatures, and in-toto attestations to an OCI registry and a WORM bucket (immutability) with lifecycle rules.
    • Index by image digest and Git commit SHA. Queryable in minutes, not weeks.
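The index shape matters more than the storage engine. A toy in-memory sketch (class and field names are ours) of evidence keyed by both digest and commit, so either one answers an audit query:

```python
from collections import defaultdict

class EvidenceIndex:
    """Toy in-memory index; production backs this with the OCI registry + WORM bucket."""

    def __init__(self):
        self.by_digest = defaultdict(list)
        self.by_commit = defaultdict(list)

    def record(self, digest: str, commit: str, kind: str, uri: str):
        # Every artifact (SBOM, signature, attestation) gets indexed both ways.
        entry = {"digest": digest, "commit": commit, "kind": kind, "uri": uri}
        self.by_digest[digest].append(entry)
        self.by_commit[commit].append(entry)

    def evidence_for_commit(self, commit: str):
        return self.by_commit.get(commit, [])

idx = EvidenceIndex()
idx.record("sha256:abc", "deadbeef", "sbom", "oci://registry/acme/payments@sha256:abc")
idx.record("sha256:abc", "deadbeef", "signature", "oci://registry/acme/payments:sha256-abc.sig")
print(len(idx.evidence_for_commit("deadbeef")))  # 2 records for that commit
```

"Show me everything we know about commit X" becomes one lookup instead of a week of archaeology.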
  • Runtime attestation

    • SPIRE issues SVIDs to workloads; Istio enforces STRICT mTLS and authZ by spiffe:// identity.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls: { mode: STRICT }
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-from-api
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments
  rules:
    - from:
        - source:
            principals: ["spiffe://cluster.local/ns/api/sa/api-sa"]
      to:
        - operation:
            ports: ["8443"]
  • SPIRE registration
spire-server entry create \
  -spiffeID spiffe://cluster.local/ns/api/sa/api-sa \
  -selector k8s:sa:api-sa \
  -selector k8s:ns:api

Regulated data without killing delivery

Most teams get stuck here. You don’t need a separate cluster for every acronym. You need boundaries and defaults.

  • Data egress control

    • Egress gateways with Envoy filters; only allow endpoints on an allowlist per classification.
    • Cloud org policies/SCPs to block public storage and keys without rotation.
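In Istio, the allowlist pattern is REGISTRY_ONLY outbound traffic plus one ServiceEntry per approved endpoint; a sketch (the Stripe host is just an example):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: payments
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY     # block egress to anything not explicitly registered
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-stripe
  namespace: payments
spec:
  hosts: ["api.stripe.com"]  # example allowlisted endpoint
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
```

Adding an endpoint becomes a reviewable PR instead of a firewall ticket.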
  • Encryption and secrets

    • App-layer encryption for regulated data; keys in KMS/CloudHSM wrapped via Vault with short-lived tokens.
    • Turn on envelope encryption for queues, topics, and DBs; rotate keys quarterly with automation.
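The envelope pattern itself is a few lines: a fresh data key per object, with only the wrapped key ever leaving the app. A toy Python sketch, where kms_wrap/kms_unwrap stand in for real KMS calls and the XOR keystream stands in for AES-GCM (do not use it as a cipher):

```python
import hashlib
import secrets

def keystream(key: bytes, n: int) -> bytes:
    # Toy keystream for illustration only; use AES-GCM in practice.
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.blake2b(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

MASTER_KEY = secrets.token_bytes(32)  # stands in for the KMS-held root key

def kms_wrap(data_key: bytes) -> bytes:    # hypothetical stand-in for kms.Encrypt
    return xor(data_key, MASTER_KEY)

def kms_unwrap(wrapped: bytes) -> bytes:   # hypothetical stand-in for kms.Decrypt
    return xor(wrapped, MASTER_KEY)

def encrypt_record(plaintext: bytes):
    data_key = secrets.token_bytes(32)     # fresh key per object
    return xor(plaintext, data_key), kms_wrap(data_key)

def decrypt_record(ciphertext: bytes, wrapped_key: bytes) -> bytes:
    return xor(ciphertext, kms_unwrap(wrapped_key))

ct, wk = encrypt_record(b"pan=4111111111111111")
assert decrypt_record(ct, wk) == b"pan=4111111111111111"
```

Rotation then means re-wrapping data keys against the new KMS key, not re-encrypting every object.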
  • No-logs zones

    • For PII/PHI, enforce logging redaction policies and sampling; block trace exports that include payloads.
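Redaction can live in a logging filter so payload-shaped values never reach an exporter; a minimal Python sketch (the patterns are illustrative, not a complete PII detector):

```python
import logging
import re

# Illustrative patterns; real deployments key these off the data-classification label.
REDACT = [
    (re.compile(r"\b\d{13,16}\b"), "[PAN]"),              # card numbers
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),  # email addresses
]

class RedactionFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in REDACT:
            msg = pattern.sub(replacement, msg)
        record.msg, record.args = msg, None  # freeze the redacted message
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.addFilter(RedactionFilter())
logger.addHandler(handler)
logger.warning("charge failed for 4111111111111111 (user@example.com)")
# emits: charge failed for [PAN] ([EMAIL])
```

Attaching the filter at the handler means every log line passes through it before leaving the process.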
  • Safe-by-default service template (golden path)

    • A repo template with:
      • SPIFFE-enabled deployment
      • ServiceAccount mapped to least-privilege IAM
      • Kyverno/Gatekeeper labels pre-set
      • OTel with headers-only tracing
  • Exception flow with timers

    • JIT access and policy exceptions via tickets that expire in hours/days, not months. Capture the reason + compensating controls.
# Example exception CRD
apiVersion: compliance.gitplumbers.io/v1
kind: PolicyException
metadata:
  name: allow-debug-shell
spec:
  policy: regulated-no-privileged
  subjectRef: deployment/payments
  reason: "Prod break-glass, P1 incident"
  expiresAt: "2025-01-02T12:00:00Z"
  approvers: ["sec-lead", "sr-sre"]
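Enforcing the timer is mostly a timestamp comparison in whatever controller watches these objects; a sketch (the function name is ours, the fields mirror the example above):

```python
from datetime import datetime, timezone

def exception_active(exception: dict, now=None) -> bool:
    """An exception suppresses enforcement only before expiresAt; after that the policy bites again."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(exception["spec"]["expiresAt"].replace("Z", "+00:00"))
    return now < expires

exc = {"spec": {"expiresAt": "2025-01-02T12:00:00Z"}}
print(exception_active(exc, now=datetime(2025, 1, 2, 11, 0, tzinfo=timezone.utc)))  # True: still in window
print(exception_active(exc, now=datetime(2025, 1, 3, 0, 0, tzinfo=timezone.utc)))   # False: expired, enforce again
```

Expiry is the default, so forgetting to close an exception fails safe.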

Wire it together with GitOps

Don’t rely on humans clicking in consoles. Make the secure path the only path.

  • PR-time checks
    • conftest on Terraform plans
    • kubeconform + Gatekeeper dry-run
    • Image signature verification against staging key
opa eval --input k8s.yaml --data policy/ "data.violation"
cosign verify --key cosign.pub ghcr.io/acme/payments:pr-123
  • ArgoCD with policy sync waves
    • Policies sync first, then namespaces, then apps. Block app sync if policy app isn’t healthy.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: regulated
spec:
  destinations:
    - namespace: payments
      server: https://kubernetes.default.svc
  namespaceResourceWhitelist:
    - group: "*"
      kind: "*"
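In an app-of-apps layout, the ordering itself comes from sync-wave annotations on the child Applications; a sketch (wave numbers are arbitrary, lower syncs first):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: policies
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # policies land before anything else
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-app
  annotations:
    argocd.argoproj.io/sync-wave: "1"    # apps sync only after earlier waves are healthy
```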
  • Progressive delivery with checks

    • Canary via Argo Rollouts gated on:
      • Error budget remaining
      • No new policy violations
      • Signature verified
  • Evidence stamps on merge

    • When Argo syncs, a controller writes an attestation: commit SHA, image digest, policy set, and approvers. Hello, audit trail.
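The stamp can be a small JSON document; a sketch of what such a controller might write (field names are ours, not a standard predicate, though in practice you'd wrap this as an in-toto statement and cosign-attest it):

```python
import json
from datetime import datetime, timezone

def deployment_attestation(commit: str, digest: str, policy_set: str, approvers: list) -> str:
    # Minimal evidence stamp tying a sync to what ran and who approved it.
    return json.dumps({
        "commit": commit,
        "imageDigest": digest,
        "policySet": policy_set,
        "approvers": approvers,
        "syncedAt": datetime.now(timezone.utc).isoformat(),
    }, sort_keys=True)

stamp = deployment_attestation("deadbeef", "sha256:abc", "regulated-v7", ["sec-lead"])
print(json.loads(stamp)["policySet"])  # regulated-v7
```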

Metrics that prove it works

Security without delivery speed is just a very expensive IDS.

Track both security and flow:

  • Security
    • Policy violation rate (per env)
    • Percentage of signed images running
    • mTLS coverage (% of mesh requests encrypted)
    • Time to produce audit evidence (target: minutes)
  • Delivery
    • DORA: lead time for changes, deployment frequency, change failure rate, MTTR
    • Time-in-PR for policy issues (target: <15 min from push)
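mTLS coverage, for instance, falls out of Istio's standard request metrics; a sketch of the Prometheus query, assuming the default connection_security_policy label on istio_requests_total:

```promql
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_requests_total[5m]))
```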

What we’ve seen after 90 days:

  • 100% mTLS in mesh; detected lateral-movement attempts drop to zero
  • 95%+ images signed and verified at admission; the rest blocked before prod
  • Audit prep shrinks from weeks to hours; “show me all changes touching PII” becomes one query
  • No measurable increase in lead time when using golden paths; actually faster in high-churn teams

What I’d do differently next time

  • Don’t start with the mesh. Start with identity and signatures, then layer in authZ rules.
  • Keep Rego readable. If your security team can’t maintain it, you’ll accumulate policy debt.
  • Avoid exception creep. Every exception must expire with a reason and a follow-up story to close the gap.
  • Run tabletop incident drills that include revoking SVIDs, rotating signing keys, and blocking egress.
  • Put someone in charge of the evidence store. It’s production, not a junk drawer.

If this sounds like the platform you want but don’t have time to build, GitPlumbers has done it in banks, adtech, and healthcare. We’ll pair with your platform team, not parachute in with a slide deck.

Key takeaways

  • Zero trust is a product capability, not a slide—tie it to identity, authZ, segmentation, and verifiable automation.
  • Translate policies into code at build, deploy, and runtime; don’t centralize all checks at admission and call it a day.
  • Automate evidence: signatures, SBOMs, provenance, and runtime attestation—so audits become a query, not a fire drill.
  • Use GitOps and golden paths to make the secure path the fast path; bake in exceptions with time bounds and evidence.
  • Measure both security and delivery: change failure rate, MTTR, policy violation rate, and audit lead time.

Implementation checklist

  • Classify data domains and label namespaces/workloads with sensitivity levels.
  • Issue workload identity via SPIFFE/SPIRE and enforce STRICT mTLS service-to-service.
  • Gate builds with SBOM + signature + provenance (Sigstore Cosign, SLSA).
  • Use OPA/Gatekeeper or Kyverno to block bad manifests and verify image signatures.
  • Run `conftest` against Terraform plans to stop public data exposure before apply.
  • Adopt GitOps (ArgoCD/Flux) with policy checks in PR and at admission.
  • Store attestations and logs (S3/GCS + immutability) to answer audits quickly.
  • Define JIT access and a documented exception workflow with expiry and compensating controls.

Questions we hear from teams

Do we need a service mesh to do zero trust?
Not on day one. Start with workload identity (SPIFFE/SPIRE) and signed artifacts. Add mesh when you need fine-grained authZ, mTLS everywhere, and traffic policy. You can enforce image signatures and Terraform checks without a mesh.
How do we handle third-party services (payments, LLM APIs) in a zero-trust model?
Terminate through egress gateways with per-service identities, outbound allowlists, and rate limits. Use separate secrets and keys per service, rotate automatically, and log request metadata only (no payloads for regulated data).
Won’t this slow down developers?
If you push checks to PR time and provide golden-path templates, it speeds teams up by reducing rework. We target <15 minutes feedback for policy issues and automate exceptions with expiries.
What about multi-cloud?
Keep the control planes portable: SPIRE for identity, OPA/Kyverno for policy, Git as the source of truth, Sigstore for signing. Cloud-specific enforcement stays at the edge (SCPs, org policies).
How do we prove compliance to auditors?
Store SBOMs, signatures, provenance, admission logs, and deployment attestations in an immutable bucket and/or OCI registry. Build dashboards that answer: who deployed what, where, under which policy set, and with which approvals.
