The Night the SOC Missed It: Real‑Time Detections, Guardrails, and Audit‑Ready Proofs Without Slowing Delivery
Build detections that fire in minutes, encode policies as code, and produce automated evidence—while keeping PII out of your telemetry and shipping on schedule.
Real-time security isn’t another dashboard; it’s guardrails that prevent dumb mistakes, detectors that page in minutes, and proofs that speak auditor.
The 2 a.m. page that shouldn’t have happened
Two summers ago, I watched a consumer fintech’s SOC miss an interactive shell spawned in a prod container. Cloud logs were flowing, metrics looked healthy, and dashboards were very green. The attacker lived for 52 minutes—long enough to scrape env vars and hit an internal API. The postmortem wasn’t about heroics; it was about gaps. No admission guardrails. No runtime detections. No automated proofs. And worst, the SIEM was full of PII because “we needed context.” Sound familiar?
Here’s what we’ve seen actually work across fintech, healthtech, and adtech: translate policies into guardrails that run at CI and admission, wire real‑time detections that trigger in minutes, and produce automated proofs auditors accept—without slowing delivery or leaking PII.
Translate policy into guardrails that actually run
Policies that live in a wiki don’t block anything. Put them in code and make them binary: pass/fail.
- Use OPA Gatekeeper or Kyverno for Kubernetes admission controls
- Use Conftest to fail CI on Terraform/K8s/Helm misconfig
- For Terraform Cloud/Enterprise, consider Sentinel if you’re already invested, but we prefer OPA for portability
Example: block privileged pods and :latest tags, and enforce runAsNonRoot.
package kubernetes.admission

deny[msg] {
  input.review.kind.kind == "Pod"
  some c
  container := input.review.object.spec.containers[c]
  container.securityContext.privileged == true
  msg := sprintf("privileged container %s is not allowed", [container.name])
}

deny[msg] {
  input.review.kind.kind == "Pod"
  some c
  container := input.review.object.spec.containers[c]
  endswith(container.image, ":latest")
  msg := sprintf("container %s uses :latest tag", [container.name])
}

deny[msg] {
  input.review.kind.kind == "Pod"
  not input.review.object.spec.securityContext.runAsNonRoot
  msg := "runAsNonRoot must be set at pod or container level"
}

Gatekeeper constraint (trimmed):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowed
metadata:
  name: gp-no-privileged-no-latest
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

CI guard with conftest:
# Fail PR if k8s/ or terraform/ violates policies
conftest test k8s/ -p policy/
conftest test terraform/ -p policy/

Pro tip: keep the same policy library for CI and admission. Drift between “what we test” and “what we allow” is how exceptions creep in.
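If you’re on Kyverno instead of Gatekeeper, the same guardrails translate directly. Here’s a minimal sketch of the :latest block as a Kyverno ClusterPolicy — the policy and rule names are ours, and you’d scope the match to your own namespaces:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must not use the :latest tag."
        pattern:
          spec:
            containers:
              - image: "!*:latest"

Kyverno also emits PolicyReports, which double as evidence later in the pipeline.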
Wire up real-time detections without drowning
You don’t need a seven-figure SIEM to get fast MTTD. You need the right signals and sane routing.
- Runtime: Falco (syscall) or Cilium Tetragon (eBPF) for process/network anomalies
- Control plane: Kubernetes Audit Logs, OPA/Gatekeeper decision logs
- Cloud: AWS CloudTrail + GuardDuty, GCP Audit Logs + Security Command Center, Azure Activity Logs + Defender for Cloud
- IdP: Okta sign‑ins and admin events
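For the control-plane signal, a minimal Kubernetes audit policy sketch that captures interactive access and secret reads without logging request bodies everywhere — this assumes you control the API server’s --audit-policy-file, and the levels should be tuned to your volume:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request/response for interactive access to workloads
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # Metadata only for secret access (never log secret contents)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Everything else at Metadata to keep volume sane
  - level: Metadata

Ship the audit log through the same router as your runtime events so control-plane and container signals correlate.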
Falco rule: alert on bash spawned inside containers and kubeconfig reads.
# falco_rules_local.yaml
- rule: Terminal shell in container
  desc: Detect bash/sh spawned in a container
  condition: spawned_process and container and proc.name in (bash, sh)
  output: "Terminal shell spawned (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING

- rule: Read kubeconfig in container
  desc: Detect reads of kubeconfig
  condition: (open_read) and fd.name startswith "/root/.kube/"
  output: "Kubeconfig read in container (user=%user.name container=%container.name file=%fd.name)"
  priority: WARNING

Ship detections to your router (Kafka/HTTP) with Fluent Bit.
# fluent-bit.conf (snippet)
[INPUT]
    Name   kmsg
    Tag    falco.*

[INPUT]
    Name   tcp
    Listen 0.0.0.0
    Port   2801
    Tag    falco.json

[FILTER]
    Name   throttle
    Match  falco.*
    Rate   1000

[OUTPUT]
    Name            es
    Match           falco.*
    Host            elasticsearch
    Port            9200
    Logstash_Format On
    Logstash_Prefix falco

Route provider findings directly too; don’t re‑implement GuardDuty. Dedup at the edge and tag with env, service, owner to keep false positives manageable.
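A cheap way to do that tagging at the edge is a record_modifier filter in the same Fluent Bit config — the env/service/owner values below are placeholders for whatever your labels or deployment metadata provide:

[FILTER]
    Name    record_modifier
    Match   falco.*
    Record  env prod
    Record  service payments
    Record  owner team-payments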
Automated proofs: evidence auditors accept without screenshots
You can ship fast and still have receipts. Make the pipeline produce cryptographic attestations, store decision logs, and keep immutable artifacts.
- Sign images with Sigstore cosign and publish SLSA provenance
- Emit OPA/Kyverno decision logs to an append‑only bucket
- Record deploys (ArgoCD/Flux) and link commit SHAs to releases
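If you run plain OPA (as a sidecar or admission webhook) alongside Gatekeeper, its decision-log feature ships every allow/deny to a remote sink. A minimal config sketch — the endpoint is illustrative and would front your append-only bucket; Gatekeeper users get similar evidence from its audit results and deny logs:

# opa-config.yaml (illustrative endpoint)
services:
  decision-sink:
    url: https://evidence.internal.example.com
decision_logs:
  service: decision-sink
  reporting:
    min_delay_seconds: 5
    max_delay_seconds: 10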
GitHub Actions example that runs policy checks, signs the image, and publishes provenance:
name: build-and-prove
on:
  push:
    branches: [ main ]
permissions:
  contents: read
  packages: write
  id-token: write
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      digest: ${{ steps.push.outputs.digest }}
    steps:
      - uses: actions/checkout@v4
      - name: Policy check (Conftest)
        run: |
          conftest test k8s/ -p policy/
          conftest test terraform/ -p policy/
      - name: Build image
        run: |
          docker build -t ghcr.io/acme/payments:${{ github.sha }} .
          echo "IMAGE=ghcr.io/acme/payments:${{ github.sha }}" >> "$GITHUB_ENV"
      - name: Login to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - name: Push
        id: push
        run: |
          docker push "$IMAGE"
          echo "digest=$(docker inspect --format='{{index .RepoDigests 0}}' "$IMAGE" | cut -d@ -f2)" >> "$GITHUB_OUTPUT"
      - name: Cosign sign
        env:
          COSIGN_KEY: ${{ secrets.COSIGN_KEY }}   # private key material; add COSIGN_PASSWORD if encrypted
        run: cosign sign --yes --key env://COSIGN_KEY "$IMAGE"
  # SLSA provenance via the slsa-github-generator container generator (a reusable workflow, so it runs as its own job)
  provenance:
    needs: build
    permissions:
      actions: read
      id-token: write
      packages: write
    uses: slsa-framework/slsa-github-generator/.github/workflows/generator_container_slsa3.yml@v2.0.0
    with:
      image: ghcr.io/acme/payments
      digest: ${{ needs.build.outputs.digest }}
      registry-username: ${{ github.actor }}
    secrets:
      registry-password: ${{ secrets.GITHUB_TOKEN }}

Evidence you keep:
- Signed image digest, SLSA provenance blob
- Conftest pass report (artifact), Gatekeeper decision logs (bucket)
- ArgoCD deploy event linking commit -> environment
This is the “automated proof” you hand to auditors instead of a screen‑recording marathon.
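Verification is what makes the evidence durable. A sketch of the checks you (or an auditor) can rerun at any time — the public key, image reference, and repo are illustrative:

# Verify the image signature against your published public key
cosign verify --key cosign.pub ghcr.io/acme/payments:<sha>

# Verify SLSA provenance from the GitHub generator against the source repo
slsa-verifier verify-image ghcr.io/acme/payments@sha256:<digest> \
  --source-uri github.com/acme/payments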
Regulated data vs speed: keep PII out, keep signals rich
If you’re piping raw PII into your SIEM to “debug incidents faster,” you’re building your own breach headline. Tokenize at the edge and keep context with reversible or keyed hashes.
- Define a policy: PII never leaves prod VPC untransformed
- Tokenize emails/phones in logs at collection time
- Use Postgres RLS and Snowflake masking policies to scope who can read raw data
Hash PII in Fluent Bit using a Lua filter with a secret salt (from Vault):
-- pii_hash.lua
local openssl = require('openssl')

-- Salted SHA-256 of a sensitive value; the salt is injected from Vault via the environment
function hash(s, salt)
  return openssl.digest.new('sha256'):final(s .. salt)
end

-- Fluent Bit Lua filter callback: replace user_email with its keyed hash
function cb(tag, timestamp, record)
  local salt = os.getenv('PII_SALT') or ''
  if record['user_email'] then
    record['user_email_hash'] = hash(record['user_email'], salt)
    record['user_email'] = nil
  end
  -- return code 1 = record was modified
  return 1, timestamp, record
end

Wire it in Fluent Bit:
[FILTER]
    Name   lua
    Match  app.*
    script /fluent-bit/scripts/pii_hash.lua
    call   cb

Database protections that won’t slow you down:
- Postgres RLS: enforce tenant/user scoping at the DB layer
- Snowflake: MASKING POLICY for columns like email, ssn
- mTLS with a mesh (Istio/Cilium) and egress policies to stop data drip
You still correlate using the hash; you just don’t leak PII into every log store and dashboard.
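To make the database bullets concrete, here are minimal sketches for Postgres RLS and a Snowflake masking policy — table, column, and role names are illustrative:

-- Postgres: tenant-scoped row-level security
ALTER TABLE payments ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON payments
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Snowflake: mask email for everyone except an approved role
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;
ALTER TABLE users MODIFY COLUMN email SET MASKING POLICY email_mask;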
Ship detections like code: test, canary, promote, rollback
Most teams toss rules into the SIEM and hope. Treat detections like features.
- Write detection with tests (sample events) in Git
- Run unit tests in CI; spin a sandbox to replay prod‑like data
- Canary to 10% of namespaces or one cluster
- Measure precision/recall for a week; gate on false positive SLO
- Promote with a version tag; rollback by revert
Example test for a Falco rule using replay:
falco --rules falco_rules_local.yaml --trace-file sample_syscalls.scap --disable-source k8s_audit

Honeytokens are cheap wins: drop a Canarytoken API key in your private repo; any use should page immediately. We’ve caught red‑teamers and one unlucky contractor this way.
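To make promote and rollback auditable, we keep a small metadata file next to each rule in Git. The schema below is our own convention (purely illustrative), not a Falco or SIEM feature:

# detections/terminal-shell-in-container/meta.yaml
rule: terminal-shell-in-container
version: 1.3.0
owner: team-platform-security          # on-call team that triages and can roll back
severity: warning
false_positive_slo: 0.05               # promotion gate: 7-day FP rate must stay under 5%
rollout:
  stage: canary                        # canary -> promoted -> retired
  scope: one-cluster                   # start with one cluster or 10% of namespaces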
Metrics that matter and a 60‑day plan
You can’t improve what you don’t measure.
- MTTD (p95): target < 5 minutes for critical events
- MTTR (p95): target < 30 minutes with runbooks
- False positive rate: < 5% per rule over 7 days
- Coverage: % workloads under guardrails, % clusters with runtime sensors
- Evidence freshness: time from deploy to attestation available (< 5 minutes)
A pragmatic 60‑day rollout:
- Days 1–10: pick top 10 guardrails; implement OPA/Kyverno; add Conftest to CI
- Days 11–20: deploy Falco/Tetragon; route to SIEM; integrate GuardDuty/SCC
- Days 21–30: add cosign + SLSA; store OPA decision logs; wire ArgoCD evidence
- Days 31–45: tokenize PII at the edge; enable RLS/masking; backfill docs
- Days 46–60: build rule testing/canary pipeline; set SLOs; quarterly review cadence
At a payments client, this cut p95 MTTD from 47 min to 3 min and eliminated PII in their SIEM in three sprints. Zero slowdown in deploy frequency (still ~40/day).
What GitPlumbers does on these engagements
We’ve done this at seed‑stage startups and at public fintechs. The pattern works.
- Rapid policy pack: OPA/Kyverno guardrails aligned to your stack
- Runtime detections: Falco or Tetragon, tuned to your risk model
- Evidence plumbing: cosign/SLSA, decision logs, ArgoCD hooks
- Data hygiene: edge tokenization, RLS/masking, mesh egress
- Detection SLOs and a rule lifecycle that ops will actually maintain
If you want a partner who’s burned their hands on this stuff and still ships, we’ll help you wire it in without blowing up your roadmap.
Key takeaways
- Translate policies into code with OPA/Kyverno and enforce them at both CI and admission to prevent drift.
- Use eBPF/syscall detectors (Falco/Tetragon) plus cloud-native findings (GuardDuty/SCC) to reduce MTTD to minutes.
- Generate automated proofs with cosign, SLSA provenance, and OPA decision logs—no screenshots for auditors.
- Protect speed and privacy by tokenizing PII at the edge and enforcing data-scoped logging policies.
- Treat detections like code: test, canary, promote, and rollback with clear SLOs and ownership.
Implementation checklist
- Define top 10 guardrails as OPA/Kyverno policies and enforce them in CI and at cluster admission.
- Deploy Falco or Tetragon for container runtime detections; forward to SIEM with dedup and routing.
- Hook cloud logs (CloudTrail, Audit Logs, Activity Logs) and provider detectors (GuardDuty, SCC, Defender).
- Create a GitHub Actions pipeline that signs images with cosign and emits SLSA provenance.
- Instrument OPA decision logging and store immutable evidence in an append-only bucket.
- Add data tokenization at the logging edge; block PII in SIEM with policy-based sinks.
- Set detection SLOs (MTTD, false positive rate) and build a rule promotion workflow.
Questions we hear from teams
- What if we’re not on Kubernetes?
- You can still apply the model: use OPA/Conftest for IaC (Terraform/CloudFormation), cloud-native detections (GuardDuty/SCC/Defender), host-based sensors (OSQuery/Elastic Agent), and sign artifacts with cosign. The primitives are the same.
- Will this slow down our deploys?
- Not if you design it right. Policy checks run in seconds and admission controls are cheap. Our clients maintain deploy frequencies of 20–100/day with guardrails and runtime detections in place.
- Do we need a SIEM?
- You need a place to search and alert. Datadog, Elastic, Splunk, or Chronicle all work. Start with what your team knows. The key is clean routing, PII hygiene, and rule lifecycle management.
- How do we keep false positives under control?
- Test rules against real data, canary them, and set an explicit false positive SLO (<5%). Add ownership: every rule has an on-call team and a rollback plan.
- How do auditors trust automated proofs?
- Because they’re cryptographically verifiable and repeatable. Cosign signatures, SLSA provenance, and immutable decision logs provide stronger evidence than screenshots. We map them to your control framework (SOC 2, ISO 27001, HIPAA).
