Zero Trust Without Killing Velocity: Guardrails, Proofs, and Shipping Regulated Data
How to design zero-trust for distributed systems that auditors love and engineers don’t hate—using policy-as-code, identity, and automated evidence.
Policies that don’t execute as code are just opinions. Zero trust starts when your pipeline can prove what it enforces.
The breach that taught me zero trust the hard way
A few years back, I watched an internal service account token leak from a misconfigured CI job. It wasn’t “nation state” level—just a bored contractor poking around a flat Kubernetes cluster. With no mTLS, permissive NetworkPolicy, and wildcard RBAC, lateral movement took under 10 minutes. Fortunately, we caught it early. Unfortunately, we still had to explain “why our staging cluster could write to a production S3 bucket.”
I’ve seen this fail over and over: expensive zero-trust slide decks, then a parking lot full of tickets no one closes. What actually works is treating policies as code, identity as the perimeter, and proofs as first-class artifacts. Do that, and you can lock down regulated data without turning delivery into molasses.
Turn policy into guardrails, not tickets
If your policy lives in Confluence, your engineers will trip over it. If it lives in CI/admission as code, it becomes a guardrail. The pattern we deploy:
- Shift-left checks in CI: OPA/Rego via `conftest` for Terraform; `kubeconform` and Kyverno tests for K8s; `cosign` for signatures.
- Shift-right enforcement at cluster boundaries: Kyverno or OPA Gatekeeper admission policies; image signature verification; runtime authz in the mesh (Istio + OPA Envoy plugin).
- Single source of truth via GitOps (ArgoCD), so the evidence is the repo history.
Example 1: block risky K8s specs at admission with Kyverno.
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-pod-security
spec:
  validationFailureAction: enforce
  # request.* variables (used below) cannot be resolved in background scans
  background: false
  rules:
  - name: require-run-as-nonroot
    match:
      resources:
        # Match Pods; Kyverno auto-generates equivalent rules for Deployments,
        # StatefulSets, and other pod controllers.
        kinds: ["Pod"]
    validate:
      message: "Containers must not run as root."
      pattern:
        spec:
          securityContext:
            runAsNonRoot: true
  - name: deny-host-network
    match:
      resources:
        kinds: ["Pod"]
    validate:
      message: "hostNetwork is not allowed."
      deny:
        conditions:
          any:
          - key: "{{ request.object.spec.hostNetwork }}"
            operator: Equals
            value: true
```

Example 2: keep Terraform from opening the world.
```rego
# policy/terraform/security_group.rego
package terraform.aws.security_group

# Evaluated against `terraform show -json` plan output (see the CI job below).
deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ing := rc.change.after.ingress[_]
  ing.cidr_blocks[_] == "0.0.0.0/0"
  ing.from_port < 1024
  msg := sprintf("%s: public ingress to privileged ports is forbidden", [rc.address])
}
```

Wire these into GitHub Actions so merges fail fast:
```yaml
name: ci
on: { push: { branches: [ main ] } }
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform plan + OPA
        run: |
          terraform init -backend=false
          terraform plan -out tf.plan
          terraform show -json tf.plan > plan.json
          conftest test plan.json -p policy/terraform --all-namespaces
      - name: Build, sign, verify image
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          docker build -t ghcr.io/acme/payments:${{ github.sha }} .
          # cosign signs by registry digest, so push first (registry login assumed earlier in the job)
          docker push ghcr.io/acme/payments:${{ github.sha }}
          cosign sign --key $COSIGN_KEY ghcr.io/acme/payments:${{ github.sha }}
          cosign verify ghcr.io/acme/payments:${{ github.sha }} --key $COSIGN_PUB
```

Result: the developer gets a precise failure message in under a minute, fixes it in code, and tries again—no compliance Slack drama.
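The workflow above covers Terraform and image signing. For the Kubernetes-side shift-left checks mentioned earlier (kubeconform plus Kyverno CLI dry-runs), a minimal sketch; the paths are illustrative:

```bash
# Validate rendered manifests against Kubernetes API schemas
kubeconform -strict -summary k8s/

# Dry-run the same Kyverno policies the cluster enforces at admission,
# so CI and the cluster reject the same specs
kyverno apply policies/baseline-pod-security.yaml --resource k8s/deployment.yaml
```

Because the CI check and the admission policy share the same policy files, a red build and a rejected apply always tell the same story.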
Identity first: SPIFFE, mTLS, and least privilege
Flat networks and IP allowlists don’t scale in microservices. Make identity the new perimeter:
- Workload identity: SPIFFE IDs (`spiffe://cluster.local/ns/<ns>/sa/<sa>`) via SPIRE or Istio.
- mTLS everywhere: mesh policy set to `STRICT`.
- AuthZ by principal: services talk because policies say they can, not because the network is flat.
- Cloud IAM bindings for pods: IRSA (AWS), Workload Identity (GCP), Azure Managed Identity.
Istio mTLS + allow-only-from-catalog to payments:
```yaml
# Enable STRICT mTLS for the namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT
---
# Only allow calls from the catalog service account on 8443
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-catalog
  namespace: payments
spec:
  rules:
  - from:
    - source:
        principals: ["spiffe://cluster.local/ns/catalog/sa/catalog-svc"]
    to:
    - operation:
        ports: ["8443"]
```

Pod-to-AWS with least privilege (IRSA):
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-sa
```

This kills the class of “stolen node credentials can write to prod S3” incidents. You can still misconfigure it—but at least mistakes are localized and observable.
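The role behind that annotation is where least privilege actually lives. A hedged sketch of creating it with eksctl; the cluster name and policy ARN are placeholders, and the attached policy should grant only the buckets and prefixes the service needs:

```bash
# Create only the IAM role (and its OIDC trust policy); the ServiceAccount
# manifest above stays in Git and is applied by GitOps.
eksctl create iamserviceaccount \
  --cluster prod-eks \
  --namespace payments \
  --name payments \
  --role-name payments-sa \
  --role-only \
  --attach-policy-arn arn:aws:iam::123456789012:policy/payments-s3-rw \
  --approve
```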
Regulated data without slowing delivery
You don’t need a separate cluster for every data class (though sometimes you will). What you need is consistent segmentation and golden paths.
- Label namespaces by data class: `data.class: pii|phi|pci|public`.
- Apply stronger defaults to sensitive classes: deny-all egress, restricted images, secrets only from Vault (see the SecretStore sketch after this list), runtime profiling on.
- Kustomize overlays + ArgoCD AppSets: the secure template is the only template.
- Network egress controls: only allow traffic to approved services or CIDRs.
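The “secrets only from Vault” default usually shows up as an External Secrets Operator SecretStore scoped to the sensitive namespace. A minimal sketch, assuming Kubernetes auth to Vault; the server URL and role names are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault
  namespace: pii
spec:
  provider:
    vault:
      server: https://vault.internal.example.com  # placeholder Vault address
      path: secret                                # KV mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: pii-reader                        # Vault role limited to this namespace's paths
          serviceAccountRef:
            name: external-secrets
```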
Minimal egress policy for a PII namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pii-egress-allowlist
  namespace: pii
  labels:
    data.class: pii
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          istio-injection: enabled
    - ipBlock:
        cidr: 10.0.0.0/16
  # Keep DNS working under default-deny egress
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Bundle that with a Kyverno policy to reject unsigned images, and a SecretStore reference to Vault (sketched above). Expose it as a one-line scaffold (Backstage template or cookiecutter). Delivery speed comes from paved roads—not from bypassing controls.
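To make that scaffold the default rather than a suggestion, an ArgoCD ApplicationSet can stamp the hardened overlay out per service. A sketch under assumptions: the repo URL, overlay paths, and service names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: pii-services
  namespace: argocd
spec:
  generators:
  - list:
      elements:                # hypothetical services that handle PII
      - service: payments
      - service: ledger
  template:
    metadata:
      name: '{{service}}-pii'
    spec:
      project: default
      source:
        repoURL: https://github.com/acme/platform-templates  # placeholder golden-path repo
        targetRevision: main
        path: overlays/pii/{{service}}                        # Kustomize overlay with the hardened defaults
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{service}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a service to the generator list is the one-line scaffold; everything else comes from the overlay.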
Automated proofs: signatures, attestations, and decision logs
Auditors don’t want promises; they want evidence. Make the pipeline produce machine-verifiable proofs:
- Sign everything: containers with `cosign`, manifests with `gitsign`, Terraform plans with checksums.
- Attest supply chain: SLSA provenance, SBOMs (CycloneDX), vulnerability scan results.
- Verify at admission: reject unsigned images with Kyverno `verifyImages` or Chainguard `cosigned`.
- Log policy decisions: OPA/Kyverno decision logs to Loki/S3 with retention and immutability (object lock).
Verify only signed images run:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: enforce
  rules:
  - name: require-cosign
    match:
      resources:
        kinds: ["Pod","Deployment","StatefulSet"]
    verifyImages:
    - image: "ghcr.io/acme/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |
              -----BEGIN PUBLIC KEY-----
              ...
              -----END PUBLIC KEY-----
```

Create attestations during build:
```bash
cosign attest \
  --predicate sbom.cdx.json \
  --type cyclonedx \
  --key $COSIGN_KEY \
  ghcr.io/acme/payments:${GITHUB_SHA}
```

Now your change request links to immutable proofs: signed image digest, SLSA provenance, passing policy decisions, and ArgoCD sync history. For SOC 2/ISO/HIPAA, that’s gold.
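On the storage side, the evidence needs to be immutable, not just durable. A hedged sketch of an S3 evidence bucket with Object Lock; the bucket name and retention period are placeholders:

```bash
# Object Lock must be enabled when the bucket is created (this also turns on versioning)
aws s3api create-bucket \
  --bucket acme-policy-evidence \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

# Default WORM retention: COMPLIANCE mode cannot be shortened or removed, even by the account root
aws s3api put-object-lock-configuration \
  --bucket acme-policy-evidence \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":365}}}'
```

Ship decision logs, SBOMs, and attestations there, and the retention policy, not a person, answers the auditor's tamper-evidence question.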
Roll out controls safely: canaries, SLOs, and chaos
I’ve seen teams flip enforce on day one and nuke availability. Don’t. Treat security controls like any other risky change:
- Canary policies: start in `audit` mode, surface violations, then enforce for 10%, 50%, 100% of namespaces (see the policy fragment after this list).
- Measure impact: watch SLOs and error budgets (Prometheus, Grafana). If MTTR spikes, you went too hard.
- Progressive delivery: Argo Rollouts with guardrails tied to error rate/latency.
- Circuit breakers: keep cascading failures from becoming incidents when authz gets too tight.
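For Kyverno, the policy canary doesn't need separate policy files: `validationFailureActionOverrides` lets one policy audit everywhere and enforce only in the namespaces you've graduated. A fragment, with hypothetical canary namespaces:

```yaml
# Fragment: drop into the spec of a ClusterPolicy such as baseline-pod-security above
spec:
  validationFailureAction: audit          # observe violations fleet-wide
  validationFailureActionOverrides:
  - action: enforce                       # hard-fail only where the canary has graduated
    namespaces:
    - payments
    - checkout
```

Widen the namespace list as violation counts drop; when it covers everything, flip the default to enforce and delete the override.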
Canary a sensitive rollout:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 300 }
      - setWeight: 50
      - pause: { duration: 600 }
      - setWeight: 100
```

Protect callers with Istio circuit breaking:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
```

Zero trust should reduce incident blast radius and improve MTTR, not tank your SLOs. If it does, your rollout plan—not the principle—is the problem.
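The Rollout above promotes on a timer. To tie promotion to the error-rate guardrails mentioned earlier, Argo Rollouts can run an analysis between steps; a sketch assuming Istio's standard metrics and an in-cluster Prometheus at a placeholder address:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-error-rate
spec:
  metrics:
  - name: http-error-rate
    interval: 1m
    failureLimit: 1
    successCondition: result[0] < 0.01   # abort if the 5xx rate exceeds 1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
        query: |
          sum(rate(istio_requests_total{destination_service_name="payments",response_code=~"5.."}[5m]))
          /
          sum(rate(istio_requests_total{destination_service_name="payments"}[5m]))
```

Reference it from the canary steps with `- analysis: { templates: [ { templateName: payments-error-rate } ] }` so a bad canary aborts instead of waiting out the pause.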
What actually works (and what doesn’t)
What works:
- Golden paths: pre-approved modules/templates with sane defaults (Terraform, Helm, Kustomize). New services ship fast and compliant by default.
- GitOps: ArgoCD gives a visible, auditable diff of intended vs actual. It’s remarkable how much that calms auditors.
- Tight feedback loops: CI fails within minutes with actionable messages; admission policies echo the same rules.
- Identity-first policies: authorize by SPIFFE principal and namespace labels, not IPs and ports scribbled in a wiki.
What fails every time:
- Big-bang enforcement: turning on hard fail everywhere. Start with `audit`, phase to `enforce` with canaries.
- “Security owns it”: platform/security write policies; app teams discover them in prod. Bring app owners into the policy tests.
- Forked exceptions: bespoke bypasses. Make exceptions time-bound with explicit expiry and alerts.
- Vibe coding configs: I’ve seen AI-generated YAML that “looks right” but disables mTLS or opens egress. Run vibe code cleanup checks in CI and admission, or expect a long weekend.
If the secure way isn’t the easiest way, your engineers will route around it—usually at 2 a.m.
A pragmatic starting plan (30–60 days)
- Inventory and label: tag namespaces/apps with `data.class`. Identify crown jewels.
- Mesh + identity: enable Istio mTLS `STRICT`; adopt SPIFFE IDs; wire IRSA/Workload Identity.
- Policy-as-code: add OPA/Kyverno tests to CI; enable matching admission policies in `audit` mode.
- Sign + attest: `cosign` sign images; add SBOM and provenance attestations.
- GitOps: manage infra/app manifests with ArgoCD; require PR reviews for sync waves.
- Canary to enforce: progressively enforce policies on low-risk namespaces; measure SLOs.
- Evidence pipeline: ship decision logs and attestations to an immutable bucket; document the link in change templates.
If you want someone who’s peeled AI-generated YAML off the prod floor and unwound decade-old RBAC cruft, GitPlumbers has done this dance. We turn policies into shipping lanes, not roadblocks.
Key takeaways
- Translate written policies into policy-as-code that runs in CI and at cluster admission.
- Make identity the new perimeter: SPIFFE IDs, mTLS everywhere, least-privilege RBAC/IAM.
- Prove compliance automatically: signed artifacts, attestations, and decision logs.
- Segment data classes (PII/PHI/PCI) with namespace labels, egress policies, and golden paths.
- Roll out zero-trust controls progressively with canaries and SLO guardrails.
- Keep auditors and developers in the same loop with GitOps and machine-verifiable evidence.
Implementation checklist
- Inventory data classes and crown-jewel services; label them in code (namespaces, apps).
- Enable mTLS mesh-wide; authorize by SPIFFE principal, not IPs.
- Adopt policy-as-code: OPA/Kyverno in CI and admission; Terraform checks with Conftest.
- Sign and attest every artifact with Cosign; verify at admission.
- Gate deployments with Argo Rollouts canaries; watch SLOs and error budgets.
- Log policy decisions and store proofs (attestations, approvals, SBOMs) in an evidence bucket.
- Create golden paths (templates/modules) that make the secure way the easiest way.
Questions we hear from teams
- Do I need service mesh for zero trust?
- You need authenticated, encrypted, and authorized service-to-service calls. A mesh like Istio makes mTLS and authz practical at scale, but you can start with sidecars or library-based mTLS for a subset. The key is workload identity (SPIFFE IDs) and policy-driven authorization.
- Will zero trust slow my team down?
- Not if you ship it as paved roads. Policy-as-code in CI, golden templates, and GitOps keep velocity high. Teams slow down when controls are manual, inconsistent, or discovered only at deploy time.
- How do I handle AI-generated configs safely?
- Treat AI output as untrusted. Run K8s and Terraform through OPA/Kyverno policies, kubeconform, and signature verification. We routinely do vibe code cleanup passes and add guardrails that block risky defaults (e.g., privileged pods, `0.0.0.0/0` ingress).
- What evidence do auditors actually want?
- Immutable proofs: signed image digests, SLSA provenance, SBOMs, passing policy decisions, and a GitOps change history. Link those artifacts in your change requests—no screenshots, just verifiable objects.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
