Zero Trust Without Killing Velocity: Guardrails, Proofs, and Shipping Regulated Data
How to design zero-trust for distributed systems that auditors love and engineers don’t hate—using policy-as-code, identity, and automated evidence.
Policies that don’t execute as code are just opinions. Zero trust starts when your pipeline can prove what it enforces.
The breach that taught me zero trust the hard way
A few years back, I watched an internal service account token leak from a misconfigured CI job. It wasn’t “nation state” level—just a bored contractor poking around a flat Kubernetes cluster. With no mTLS, permissive NetworkPolicy, and wildcard RBAC, lateral movement took under 10 minutes. Fortunately, we caught it early. Unfortunately, we still had to explain “why our staging cluster could write to a production S3 bucket.”
I’ve seen this fail over and over: expensive zero-trust slide decks, then a parking lot full of tickets no one closes. What actually works is treating policies as code, identity as the perimeter, and proofs as first-class artifacts. Do that, and you can lock down regulated data without turning delivery into molasses.
Turn policy into guardrails, not tickets
If your policy lives in Confluence, your engineers will trip over it. If it lives in CI/admission as code, it becomes a guardrail. The pattern we deploy:
- Shift-left checks in CI: OPA/Rego via `conftest` for Terraform; `kubeconform` and Kyverno tests for K8s; `cosign` for signatures.
- Shift-right enforcement at cluster boundaries: Kyverno or OPA Gatekeeper admission policies; image signature verification; runtime authz in the mesh (Istio + OPA Envoy plugin).
- Single source of truth via GitOps (ArgoCD), so the evidence is the repo history.
Example 1: block risky K8s specs at admission with Kyverno.
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-pod-security
spec:
  validationFailureAction: enforce
  # request.* variables (used below) cannot be resolved in background scans
  background: false
  rules:
  - name: require-run-as-nonroot
    match:
      resources:
        # Match Pods; Kyverno auto-generates equivalent rules for Deployments,
        # StatefulSets, and other pod controllers.
        kinds: ["Pod"]
    validate:
      message: "Containers must not run as root."
      pattern:
        spec:
          securityContext:
            runAsNonRoot: true
  - name: deny-host-network
    match:
      resources:
        kinds: ["Pod"]
    validate:
      message: "hostNetwork is not allowed."
      deny:
        conditions:
          any:
          - key: "{{ request.object.spec.hostNetwork }}"
            operator: Equals
            value: true
```

Example 2: keep Terraform from opening the world.
```rego
# policy/terraform/security_group.rego
package terraform.aws.security_group

# Evaluated against `terraform show -json` plan output (see the CI job below).
deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ing := rc.change.after.ingress[_]
  ing.cidr_blocks[_] == "0.0.0.0/0"
  ing.from_port < 1024
  msg := sprintf("%s: public ingress to privileged ports is forbidden", [rc.address])
}
```

Wire these into GitHub Actions so merges fail fast:
```yaml
name: ci
on: { push: { branches: [ main ] } }
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform plan + OPA
        run: |
          terraform init -backend=false
          terraform plan -out tf.plan
          terraform show -json tf.plan > plan.json
          conftest test plan.json -p policy/terraform --all-namespaces
      - name: Build, sign, verify image
        env:
          COSIGN_EXPERIMENTAL: "1"
        run: |
          docker build -t ghcr.io/acme/payments:${{ github.sha }} .
          # cosign signs by registry digest, so push first (registry login assumed earlier in the job)
          docker push ghcr.io/acme/payments:${{ github.sha }}
          cosign sign --key $COSIGN_KEY ghcr.io/acme/payments:${{ github.sha }}
          cosign verify ghcr.io/acme/payments:${{ github.sha }} --key $COSIGN_PUB
```

Result: the developer gets a precise failure message in under a minute, fixes it in code, and tries again—no compliance Slack drama.
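The workflow above covers Terraform and image signing. For the Kubernetes-side shift-left checks mentioned earlier (kubeconform plus Kyverno CLI dry-runs), a minimal sketch; the paths are illustrative:

```bash
# Validate rendered manifests against Kubernetes API schemas
kubeconform -strict -summary k8s/

# Dry-run the same Kyverno policies the cluster enforces at admission,
# so CI and the cluster reject the same specs
kyverno apply policies/baseline-pod-security.yaml --resource k8s/deployment.yaml
```

Because the CI check and the admission policy share the same policy files, a red build and a rejected apply always tell the same story.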
Identity first: SPIFFE, mTLS, and least privilege
Flat networks and IP allowlists don’t scale in microservices. Make identity the new perimeter:
- Workload identity: SPIFFE IDs (`spiffe://cluster.local/ns/<ns>/sa/<sa>`) via SPIRE or Istio.
- mTLS everywhere: mesh policy set to `STRICT`.
- AuthZ by principal: services talk because policies say they can, not because the network is flat.
- Cloud IAM bindings for pods: IRSA (AWS), Workload Identity (GCP), Azure Managed Identity.
Istio mTLS + allow-only-from-catalog to payments:
```yaml
# Enable STRICT mTLS for the namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: payments
spec:
  mtls:
    mode: STRICT
---
# Only allow calls from the catalog service account on 8443
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-catalog
  namespace: payments
spec:
  rules:
  - from:
    - source:
        principals: ["spiffe://cluster.local/ns/catalog/sa/catalog-svc"]
    to:
    - operation:
        ports: ["8443"]
```

Pod-to-AWS with least privilege (IRSA):
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-sa
```

This kills the class of “stolen node credentials can write to prod S3” incidents. You can still misconfigure it—but at least mistakes are localized and observable.
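The role behind that annotation is where least privilege actually lives. A hedged sketch of creating it with eksctl; the cluster name and policy ARN are placeholders, and the attached policy should grant only the buckets and prefixes the service needs:

```bash
# Create only the IAM role (and its OIDC trust policy); the ServiceAccount
# manifest above stays in Git and is applied by GitOps.
eksctl create iamserviceaccount \
  --cluster prod-eks \
  --namespace payments \
  --name payments \
  --role-name payments-sa \
  --role-only \
  --attach-policy-arn arn:aws:iam::123456789012:policy/payments-s3-rw \
  --approve
```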
Regulated data without slowing delivery
You don’t need a separate cluster for every data class (though sometimes you will). What you need is consistent segmentation and golden paths.
- Label namespaces by data class: `data.class: pii|phi|pci|public`.
- Apply stronger defaults to sensitive classes: deny-all egress, restricted images, secrets only from Vault (see the SecretStore sketch after this list), runtime profiling on.
- Kustomize overlays + ArgoCD AppSets: the secure template is the only template.
- Network egress controls: only allow traffic to approved services or CIDRs.
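The “secrets only from Vault” default usually shows up as an External Secrets Operator SecretStore scoped to the sensitive namespace. A minimal sketch, assuming Kubernetes auth to Vault; the server URL and role names are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault
  namespace: pii
spec:
  provider:
    vault:
      server: https://vault.internal.example.com  # placeholder Vault address
      path: secret                                # KV mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: pii-reader                        # Vault role limited to this namespace's paths
          serviceAccountRef:
            name: external-secrets
```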
Minimal egress policy for a PII namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: pii-egress-allowlist
  namespace: pii
  labels:
    data.class: pii
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          istio-injection: enabled
    - ipBlock:
        cidr: 10.0.0.0/16
  # Keep DNS working under default-deny egress
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Bundle that with a Kyverno policy to reject unsigned images, and a SecretStore reference to Vault (sketched above). Expose it as a one-line scaffold (Backstage template or cookiecutter). Delivery speed comes from paved roads—not from bypassing controls.
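To make that scaffold the default rather than a suggestion, an ArgoCD ApplicationSet can stamp the hardened overlay out per service. A sketch under assumptions: the repo URL, overlay paths, and service names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: pii-services
  namespace: argocd
spec:
  generators:
  - list:
      elements:                # hypothetical services that handle PII
      - service: payments
      - service: ledger
  template:
    metadata:
      name: '{{service}}-pii'
    spec:
      project: default
      source:
        repoURL: https://github.com/acme/platform-templates  # placeholder golden-path repo
        targetRevision: main
        path: overlays/pii/{{service}}                        # Kustomize overlay with the hardened defaults
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{service}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding a service to the generator list is the one-line scaffold; everything else comes from the overlay.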
Automated proofs: signatures, attestations, and decision logs
Auditors don’t want promises; they want evidence. Make the pipeline produce machine-verifiable proofs:
- Sign everything: containers with `cosign`, manifests with `gitsign`, Terraform plans with checksums.
- Attest supply chain: SLSA provenance, SBOMs (CycloneDX), vulnerability scan results.
- Verify at admission: reject unsigned images with Kyverno `verifyImages` or Chainguard `cosigned`.
- Log policy decisions: OPA/Kyverno decision logs to Loki/S3 with retention and immutability (object lock).
Verify only signed images run:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: enforce
  rules:
  - name: require-cosign
    match:
      resources:
        kinds: ["Pod","Deployment","StatefulSet"]
    verifyImages:
    - image: "ghcr.io/acme/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |
              -----BEGIN PUBLIC KEY-----
              ...
              -----END PUBLIC KEY-----
```

Create attestations during build:
```bash
cosign attest \
  --predicate sbom.cdx.json \
  --type cyclonedx \
  --key $COSIGN_KEY \
  ghcr.io/acme/payments:${GITHUB_SHA}
```

Now your change request links to immutable proofs: signed image digest, SLSA provenance, passing policy decisions, and ArgoCD sync history. For SOC 2/ISO/HIPAA, that’s gold.
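On the storage side, the evidence needs to be immutable, not just durable. A hedged sketch of an S3 evidence bucket with Object Lock; the bucket name and retention period are placeholders:

```bash
# Object Lock must be enabled when the bucket is created (this also turns on versioning)
aws s3api create-bucket \
  --bucket acme-policy-evidence \
  --region us-east-1 \
  --object-lock-enabled-for-bucket

# Default WORM retention: COMPLIANCE mode cannot be shortened or removed, even by the account root
aws s3api put-object-lock-configuration \
  --bucket acme-policy-evidence \
  --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":365}}}'
```

Ship decision logs, SBOMs, and attestations there, and the retention policy, not a person, answers the auditor's tamper-evidence question.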
Roll out controls safely: canaries, SLOs, and chaos
I’ve seen teams flip enforce on day one and nuke availability. Don’t. Treat security controls like any other risky change:
- Canary policies: start in `audit` mode, surface violations, then enforce for 10%, 50%, 100% of namespaces (see the policy fragment after this list).
- Measure impact: watch SLOs and error budgets (Prometheus, Grafana). If MTTR spikes, you went too hard.
- Progressive delivery: Argo Rollouts with guardrails tied to error rate/latency.
- Circuit breakers: keep cascading failures from becoming incidents when authz gets too tight.
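For Kyverno, the policy canary doesn't need separate policy files: `validationFailureActionOverrides` lets one policy audit everywhere and enforce only in the namespaces you've graduated. A fragment, with hypothetical canary namespaces:

```yaml
# Fragment: drop into the spec of a ClusterPolicy such as baseline-pod-security above
spec:
  validationFailureAction: audit          # observe violations fleet-wide
  validationFailureActionOverrides:
  - action: enforce                       # hard-fail only where the canary has graduated
    namespaces:
    - payments
    - checkout
```

Widen the namespace list as violation counts drop; when it covers everything, flip the default to enforce and delete the override.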
Canary a sensitive rollout:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: { duration: 300 }
      - setWeight: 50
      - pause: { duration: 600 }
      - setWeight: 100
```

Protect callers with Istio circuit breaking:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http1MaxPendingRequests: 100
        maxRequestsPerConnection: 100
```

Zero trust should reduce incident blast radius and improve MTTR, not tank your SLOs. If it does, your rollout plan—not the principle—is the problem.
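The Rollout above promotes on a timer. To tie promotion to the error-rate guardrails mentioned earlier, Argo Rollouts can run an analysis between steps; a sketch assuming Istio's standard metrics and an in-cluster Prometheus at a placeholder address:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-error-rate
spec:
  metrics:
  - name: http-error-rate
    interval: 1m
    failureLimit: 1
    successCondition: result[0] < 0.01   # abort if the 5xx rate exceeds 1%
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed Prometheus endpoint
        query: |
          sum(rate(istio_requests_total{destination_service_name="payments",response_code=~"5.."}[5m]))
          /
          sum(rate(istio_requests_total{destination_service_name="payments"}[5m]))
```

Reference it from the canary steps with `- analysis: { templates: [ { templateName: payments-error-rate } ] }` so a bad canary aborts instead of waiting out the pause.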
What actually works (and what doesn’t)
What works:
- Golden paths: pre-approved modules/templates with sane defaults (Terraform, Helm, Kustomize). New services ship fast and compliant by default.
- GitOps: ArgoCD gives a visible, auditable diff of intended vs actual. It’s remarkable how much that calms auditors.
- Tight feedback loops: CI fails within minutes with actionable messages; admission policies echo the same rules.
- Identity-first policies: authorize by SPIFFE principal and namespace labels, not IPs and ports scribbled in a wiki.
What fails every time:
- Big-bang enforcement: turning on hard fail everywhere. Start with `audit`, phase to `enforce` with canaries.
- “Security owns it”: platform/security write policies; app teams discover them in prod. Bring app owners into the policy tests.
- Forked exceptions: bespoke bypasses. Make exceptions time-bound with explicit expiry and alerts.
- Vibe coding configs: I’ve seen AI-generated YAML that “looks right” but disables mTLS or opens egress. Run vibe code cleanup checks in CI and admission, or expect a long weekend.
If the secure way isn’t the easiest way, your engineers will route around it—usually at 2 a.m.
A pragmatic starting plan (30–60 days)
- Inventory and label: tag namespaces/apps with `data.class`. Identify crown jewels.
- Mesh + identity: enable Istio mTLS `STRICT`; adopt SPIFFE IDs; wire IRSA/Workload Identity.
- Policy-as-code: add OPA/Kyverno tests to CI; enable matching admission policies in `audit` mode.
- Sign + attest: `cosign` sign images; add SBOM and provenance attestations.
- GitOps: manage infra/app manifests with ArgoCD; require PR reviews for sync waves.
- Canary to enforce: progressively enforce policies on low-risk namespaces; measure SLOs.
- Evidence pipeline: ship decision logs and attestations to an immutable bucket; document the link in change templates.
If you want someone who’s peeled AI-generated YAML off the prod floor and unwound decade-old RBAC cruft, GitPlumbers has done this dance. We turn policies into shipping lanes, not roadblocks.
Key takeaways
- Translate written policies into policy-as-code that runs in CI and at cluster admission.
- Make identity the new perimeter: SPIFFE IDs, mTLS everywhere, least-privilege RBAC/IAM.
- Prove compliance automatically: signed artifacts, attestations, and decision logs.
- Segment data classes (PII/PHI/PCI) with namespace labels, egress policies, and golden paths.
- Roll out zero-trust controls progressively with canaries and SLO guardrails.
- Keep auditors and developers in the same loop with GitOps and machine-verifiable evidence.
Implementation checklist
- Inventory data classes and crown-jewel services; label them in code (namespaces, apps).
- Enable mTLS mesh-wide; authorize by SPIFFE principal, not IPs.
- Adopt policy-as-code: OPA/Kyverno in CI and admission; Terraform checks with Conftest.
- Sign and attest every artifact with Cosign; verify at admission.
- Gate deployments with Argo Rollouts canaries; watch SLOs and error budgets.
- Log policy decisions and store proofs (attestations, approvals, SBOMs) in an evidence bucket.
- Create golden paths (templates/modules) that make the secure way the easiest way.
Questions we hear from teams
- Do I need service mesh for zero trust?
- You need authenticated, encrypted, and authorized service-to-service calls. A mesh like Istio makes mTLS and authz practical at scale, but you can start with sidecars or library-based mTLS for a subset. The key is workload identity (SPIFFE IDs) and policy-driven authorization.
- Will zero trust slow my team down?
- Not if you ship it as paved roads. Policy-as-code in CI, golden templates, and GitOps keep velocity high. Teams slow down when controls are manual, inconsistent, or discovered only at deploy time.
- How do I handle AI-generated configs safely?
- Treat AI output as untrusted. Run K8s and Terraform through OPA/Kyverno policies, kubeconform, and signature verification. We routinely do vibe code cleanup passes and add guardrails that block risky defaults (e.g., privileged pods, `0.0.0.0/0` ingress).
- What evidence do auditors actually want?
- Immutable proofs: signed image digests, SLSA provenance, SBOMs, passing policy decisions, and a GitOps change history. Link those artifacts in your change requests—no screenshots, just verifiable objects.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
