Zero Trust That Ships: Turning Policies Into Guardrails, Checks, and Proofs
War stories and working patterns for building zero-trust into distributed systems without grinding delivery to a halt.
Zero trust that ships isn’t a firewall: it’s an identity plane, a policy engine, and an evidence store that developers barely notice.
The outage that sold me on zero trust
A few years ago at a fintech on AWS, one compromised CI runner pivoted into a shared Kubernetes cluster. No mTLS. Nodes shared IAM instance roles. Lateral movement was a bash one-liner. We spent a weekend rotating keys and explaining to auditors why “private VPC” didn’t mean “private.”
What fixed it wasn’t a new firewall. It was treating identity, authorization, and evidence as first-class product features:
- Workload identity via SPIFFE/SPIRE and Istio mTLS
- Policy as code with OPA/Gatekeeper and Kyverno
- Supply-chain proofs with Sigstore Cosign, SBOMs, and SLSA provenance
- GitOps so the secure path became the fast path
If you’ve been burned by slideware zero trust, this is the version that actually ships.
Principles that survive contact with prod
Keep the poster on the wall if you want; here’s what matters in distributed systems:
- Strong workload identity: Every workload gets an identity (`spiffe://…`) tied to a service account, not a host. Certs are short-lived and auto-rotated.
- AuthZ everywhere: Default deny with precise `AuthorizationPolicy` and least-privilege IAM. No shared instance roles.
- Encrypted and attested: STRICT mTLS service-to-service; artifacts and config changes are signed and verifiable.
- Automated, auditable controls: Policies enforced at build, deploy, and runtime—leaving an immutable evidence trail.
- Developer speed as a requirement: Guardrails, not roadblocks. Fast feedback in PR, automated exceptions with expiries.
This isn’t theoretical. It’s AWS/GCP/Azure, K8s ≥1.25, Istio ≥1.19, SPIRE ≥1.8, OPA/Gatekeeper ≥3.12, Kyverno ≥1.12, ArgoCD ≥2.11, Cosign ≥2.2.
Translate policy into guardrails
Start with policy language your auditors care about, then encode it where it bites the risk with minimal developer friction.
- Classify data and label it
  - Define `data-classification` labels: `public`, `internal`, `restricted`, `regulated` (PII/PHI/PCI).
  - Namespaces and workloads carry the label; policies key off it.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    data-classification: regulated
```

- Admission guardrails with Gatekeeper (Rego)
  - Example: block privileged containers and require `runAsNonRoot` for anything `regulated`.
```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-classification
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["data-classification"]
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSP
metadata:
  name: regulated-no-privileged
spec:
  match:
    namespaces: ["payments"]
  parameters:
    privileged: false
    runAsUser:
      rule: "MustRunAsNonRoot"
```

- Signature verification with Kyverno
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cosign
      match:
        any:
          - resources:
              kinds: [Deployment, StatefulSet]
      verifyImages:
        - imageReferences:
            - "ghcr.io/acme/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |
                      -----BEGIN PUBLIC KEY-----
                      ...cosignpub...
                      -----END PUBLIC KEY-----
```

- Terraform plan checks with Conftest
```rego
# policy/s3_public.rego
package terraform.deny

deny[msg] {
  some i
  input.resource_changes[i].type == "aws_s3_bucket_public_access_block"
  input.resource_changes[i].change.after.block_public_acls == false
  msg := "S3 bucket allows public ACLs"
}
```

```shell
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
# --all-namespaces so the terraform.deny package is evaluated
conftest test plan.json -p policy/ --all-namespaces
```

Automated proofs: make audits queryable
Auditors don’t want vibes; they want evidence with timestamps. Automate it.
- SBOM + signatures + provenance

```shell
# Build
syft packages ghcr.io/acme/payments@sha256:… -o spdx-json > sbom.json
cosign sign --key cosign.key ghcr.io/acme/payments@sha256:…
cosign attest --key cosign.key \
  --predicate sbom.json --type spdx \
  ghcr.io/acme/payments@sha256:…
```

- Admission must verify
  - Use Kyverno verifyImages (above) or Gatekeeper with an external data provider.
- SLSA provenance (Tekton + in-toto)
```yaml
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-sign-provenance
spec:
  tasks:
    - name: build
      taskRef: { name: kaniko }
    - name: sbom
      runAfter: [build]
      taskRef: { name: syft }
    - name: sign
      runAfter: [sbom]
      taskRef: { name: cosign-sign }
    - name: attest
      runAfter: [sign]
      taskRef: { name: cosign-attest }
```

- Evidence store
  - Push SBOMs, signatures, and in-toto attestations to an OCI registry and a WORM bucket (immutability) with lifecycle rules.
  - Index by image digest and Git commit SHA. Queryable in minutes, not weeks.
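The index itself can be as simple as two maps over the attestation metadata. A minimal sketch, with hypothetical record shapes and URIs (your evidence store will have its own schema):

```python
from collections import defaultdict

# Hypothetical attestation records as they might land in the evidence bucket.
records = [
    {"imageDigest": "sha256:aaa", "commitSha": "c0ffee1",
     "type": "spdx-sbom", "uri": "s3://evidence/sbom/aaa.json"},
    {"imageDigest": "sha256:aaa", "commitSha": "c0ffee1",
     "type": "slsa-provenance", "uri": "s3://evidence/prov/aaa.json"},
    {"imageDigest": "sha256:bbb", "commitSha": "deadbee",
     "type": "spdx-sbom", "uri": "s3://evidence/sbom/bbb.json"},
]

# Index by image digest and by Git commit SHA so "show me the evidence for
# this deploy" is a dictionary lookup, not a bucket crawl.
by_digest, by_commit = defaultdict(list), defaultdict(list)
for rec in records:
    by_digest[rec["imageDigest"]].append(rec)
    by_commit[rec["commitSha"]].append(rec)

assert {r["type"] for r in by_digest["sha256:aaa"]} == {"spdx-sbom", "slsa-provenance"}
assert len(by_commit["deadbee"]) == 1
```

In practice this lives behind a small query API or a search index, but the shape of the problem is exactly this: join evidence on digest and commit.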
- Runtime attestation
  - SPIRE issues SVIDs to workloads; Istio enforces STRICT mTLS and authZ by `spiffe://` identity.
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls: { mode: STRICT }
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-from-api
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments
  rules:
    - from:
        - source:
            principals: ["spiffe://cluster.local/ns/api/sa/api-sa"]
      to:
        - operation:
            ports: ["8443"]
```

- SPIRE registration
```shell
# Assumes a -parentID for the node/agent alias is also supplied.
spire-server entry create \
  -spiffeID spiffe://cluster.local/ns/api/sa/api-sa \
  -selector k8s:sa:api-sa \
  -selector k8s:ns:api
```

Regulated data without killing delivery
Most teams get stuck here. You don’t need a separate cluster for every acronym. You need boundaries and defaults.
- Data egress control
  - Egress gateways with Envoy filters; only allow endpoints on an allowlist per classification.
  - Cloud org policies/SCPs to block public storage and keys without rotation.
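In an Istio mesh, the allowlist pattern looks roughly like this: a `Sidecar` default of `REGISTRY_ONLY` denies anything not in the service registry, and each approved external endpoint gets a `ServiceEntry` (the Stripe host here is an illustrative example, not a prescription):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: payments
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY   # unlisted destinations are denied
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-stripe
  namespace: payments
spec:
  hosts:
    - api.stripe.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
```

Pair this with an egress gateway if you also need a single choke point for TLS origination and logging.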
- Encryption and secrets
  - App-layer encryption for `regulated` data; keys in KMS/CloudHSM, wrapped via Vault with short-lived tokens.
  - Turn on envelope encryption for queues, topics, and DBs; rotate keys quarterly with automation.
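The envelope pattern is worth seeing end to end: a fresh data-encryption key (DEK) per object, the payload encrypted under the DEK, and the DEK wrapped by a key-encryption key (KEK) that never leaves KMS. This sketch is purely illustrative; the XOR keystream stands in for a real AEAD (in production, use KMS `GenerateDataKey` plus AES-GCM), and all names are ours:

```python
import hashlib
import os
import secrets

def _xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Illustrative PRF keystream. Stand-in for AES-GCM; NOT for production use."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def encrypt_envelope(kek: bytes, plaintext: bytes) -> dict:
    dek = secrets.token_bytes(32)            # fresh DEK per object
    data_nonce = os.urandom(16)
    ciphertext = _xor_stream(dek, data_nonce, plaintext)
    wrap_nonce = os.urandom(16)
    wrapped_dek = _xor_stream(kek, wrap_nonce, dek)  # KEK wraps the DEK
    # Persist everything EXCEPT the plaintext DEK.
    return {"ciphertext": ciphertext, "data_nonce": data_nonce,
            "wrapped_dek": wrapped_dek, "wrap_nonce": wrap_nonce}

def decrypt_envelope(kek: bytes, blob: dict) -> bytes:
    dek = _xor_stream(kek, blob["wrap_nonce"], blob["wrapped_dek"])
    return _xor_stream(dek, blob["data_nonce"], blob["ciphertext"])

kek = secrets.token_bytes(32)                # in prod: never leaves KMS/CloudHSM
blob = encrypt_envelope(kek, b"pan=4111-1111")
assert decrypt_envelope(kek, blob) == b"pan=4111-1111"
```

Key rotation then means re-wrapping DEKs under a new KEK, not re-encrypting every payload, which is why envelope encryption scales.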
- No-logs zones
  - For PII/PHI, enforce logging redaction policies and sampling; block trace exports that include payloads.
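A redaction policy like that can be enforced centrally in the OpenTelemetry Collector rather than in every app. A minimal sketch using the `attributes` processor; the attribute keys here are hypothetical:

```yaml
processors:
  attributes/scrub-pii:
    actions:
      - key: http.request.body   # never export payloads from regulated namespaces
        action: delete
      - key: user.email
        action: hash             # keep correlation, drop the raw value
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub-pii]
      exporters: [otlp]
```

Running this at the collector means a single, auditable config governs what leaves the cluster, instead of per-service SDK settings.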
- Safe-by-default service template (golden path)
  - A repo template with:
    - SPIFFE-enabled deployment
    - ServiceAccount mapped to least-privilege IAM
    - Kyverno/Gatekeeper labels pre-set
    - OTel with headers-only tracing
- Exception flow with timers
  - JIT access and policy exceptions via tickets that expire in hours/days, not months. Capture the reason and compensating controls.
```yaml
# Example exception CRD
apiVersion: compliance.gitplumbers.io/v1
kind: PolicyException
metadata:
  name: allow-debug-shell
spec:
  policy: regulated-no-privileged
  subjectRef: deployment/payments
  reason: "Prod break-glass, P1 incident"
  expiresAt: "2025-01-02T12:00:00Z"
  approvers: ["sec-lead", "sr-sre"]
```

Wire it together with GitOps
Don’t rely on humans clicking in consoles. Make the secure path the only path.
- PR-time checks
  - `conftest` on Terraform plans
  - `kubeconform` + Gatekeeper dry-run
  - Image signature verification against staging key

```shell
opa eval --input k8s.yaml --data policy/ "data.violation"
cosign verify --key cosign.pub ghcr.io/acme/payments:pr-123
```

- ArgoCD with policy sync waves
  - Policies sync first, then namespaces, then apps. Block app sync if the policy app isn’t healthy.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: regulated
spec:
  destinations:
    - namespace: payments
      server: https://kubernetes.default.svc
  namespaceResourceWhitelist:
    - group: "*"
      kind: "*"
```

- Progressive delivery with checks
  - Canary via Argo Rollouts gated on:
    - Error budget remaining
    - No new policy violations
    - Signature verified
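Those gates can be expressed as an Argo Rollouts `AnalysisTemplate`. A sketch assuming an in-cluster Prometheus and Istio request metrics; the service name, address, and threshold are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-error-rate
spec:
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 1
      successCondition: result[0] < 0.01   # abort if 5xx rate exceeds 1%
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(istio_requests_total{destination_app="payments",response_code=~"5.."}[5m]))
            /
            sum(rate(istio_requests_total{destination_app="payments"}[5m]))
```

Reference this template from the Rollout's canary steps so a bad release rolls back without a human watching dashboards.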
- Evidence stamps on merge
  - When Argo syncs, a controller writes an attestation: commit SHA, image digest, policy set, and approvers. Hello, audit trail.
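The deploy attestation can be a small custom predicate attached with `cosign attest --type custom`. A hypothetical shape; the field names are ours, not a standard, and the placeholder values are just that:

```json
{
  "deployRef": "argocd/payments",
  "commitSha": "<git-sha>",
  "imageDigest": "sha256:<digest>",
  "policySet": ["verify-signed-images", "regulated-no-privileged"],
  "approvers": ["sec-lead", "sr-sre"],
  "syncedAt": "2025-01-02T12:00:00Z"
}
```

Because it rides the same Sigstore tooling as your SBOMs and provenance, the deploy record is verifiable with the same keys and queryable from the same evidence store.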
Metrics that prove it works
Security without delivery speed is just a very expensive IDS.
Track both security and flow:
- Security
- Policy violation rate (per env)
- Percentage of signed images running
- mTLS coverage (% of mesh requests encrypted)
- Time to produce audit evidence (target: minutes)
- Delivery
- DORA: lead time for changes, deployment frequency, change failure rate, MTTR
- Time-in-PR for policy issues (target: <15 min from push)
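mTLS coverage, for example, falls out of standard Istio telemetry. A PromQL sketch using Istio's default `connection_security_policy` label; the window is a reasonable default, not a rule:

```promql
# Share of mesh requests that used mutual TLS over the last 5 minutes
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))
  /
sum(rate(istio_requests_total[5m]))
```

Alert when this dips below 1.0 in namespaces labeled `regulated` and you have a live control, not a quarterly spreadsheet.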
What we’ve seen after 90 days:
- 100% mTLS in mesh; detected lateral-movement attempts drop to zero
- 95%+ images signed and verified at admission; the rest blocked before prod
- Audit prep shrinks from weeks to hours; “show me all changes touching PII” becomes one query
- No measurable increase in lead time when using golden paths; actually faster in high-churn teams
What I’d do differently next time
- Don’t start with the mesh. Start with identity and signatures, then layer in authZ rules.
- Keep Rego readable. If your security team can’t maintain it, you’ll accumulate policy debt.
- Avoid exception creep. Every exception must expire with a reason and a follow-up story to close the gap.
- Run tabletop incident drills that include revoking SVIDs, rotating signing keys, and blocking egress.
- Put someone in charge of the evidence store. It’s production, not a junk drawer.
If this sounds like the platform you want but don’t have time to build, GitPlumbers has done it in banks, adtech, and healthcare. We’ll pair with your platform team, not parachute in with a slide deck.
Key takeaways
- Zero trust is a product capability, not a slide—tie it to identity, authZ, segmentation, and verifiable automation.
- Translate policies into code at build, deploy, and runtime; don’t centralize all checks at admission and call it a day.
- Automate evidence: signatures, SBOMs, provenance, and runtime attestation—so audits become a query, not a fire drill.
- Use GitOps and golden paths to make the secure path the fast path; bake in exceptions with time bounds and evidence.
- Measure both security and delivery: change failure rate, MTTR, policy violation rate, and audit lead time.
Implementation checklist
- Classify data domains and label namespaces/workloads with sensitivity levels.
- Issue workload identity via SPIFFE/SPIRE and enforce STRICT mTLS service-to-service.
- Gate builds with SBOM + signature + provenance (Sigstore Cosign, SLSA).
- Use OPA/Gatekeeper or Kyverno to block bad manifests and verify image signatures.
- Run `conftest` against Terraform plans to stop public data exposure before apply.
- Adopt GitOps (ArgoCD/Flux) with policy checks in PR and at admission.
- Store attestations and logs (S3/GCS + immutability) to answer audits quickly.
- Define JIT access and a documented exception workflow with expiry and compensating controls.
Questions we hear from teams
- Do we need a service mesh to do zero trust?
- Not on day one. Start with workload identity (SPIFFE/SPIRE) and signed artifacts. Add mesh when you need fine-grained authZ, mTLS everywhere, and traffic policy. You can enforce image signatures and Terraform checks without a mesh.
- How do we handle third-party services (payments, LLM APIs) in a zero-trust model?
- Terminate through egress gateways with per-service identities, outbound allowlists, and rate limits. Use separate secrets and keys per service, rotate automatically, and log request metadata only (no payloads for regulated data).
- Won’t this slow down developers?
- If you push checks to PR time and provide golden-path templates, it speeds teams up by reducing rework. We target <15 minutes feedback for policy issues and automate exceptions with expiries.
- What about multi-cloud?
- Keep the control planes portable: SPIRE for identity, OPA/Kyverno for policy, Git as the source of truth, Sigstore for signing. Cloud-specific enforcement stays at the edge (SCPs, org policies).
- How do we prove compliance to auditors?
- Store SBOMs, signatures, provenance, admission logs, and deployment attestations in an immutable bucket and/or OCI registry. Build dashboards that answer: who deployed what, where, under which policy set, and with which approvals.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
