Real-Time Security Monitoring Without Slowing You Down: Turning Policy Into Guardrails, Checks, and Proofs
Stop shipping blind. Wire your SDLC for real-time signals, automated enforcement, and evidence you can hand to auditors without killing velocity.
If your monitoring can’t answer “what changed” in under a minute, you don’t have monitoring; you have an archive.
The on-call page that changed the roadmap
We watched a prod cluster start spawning `bash` in a sidecar at 2:13 AM. kube-audit showed a just-created ClusterRole with `*` on secrets. The deploy came from a temp branch no one recognized. Classic: a well-meaning engineer debugging a data pipeline, accidentally punching a hole you could drive a semi through. The SIEM got the logs. It didn’t get us the save.
What did? A Falco rule firing within seconds, a Kyverno admission policy that blocked the second bad deploy, and a signed artifact trail that proved which pipeline produced what. That night convinced leadership to stop treating security as a quarterly audit and start treating it as a real-time system.
This is how we wire that system without grinding delivery to a halt.
Real-time means signals across the whole SDLC
If your “real-time monitoring” is just a Splunk or Datadog dashboard on CloudTrail, you’re blind to 80% of the attack surface: pre-merge changes, CI, artifact promotion, and K8s control plane events.
You need a graph of events from code to prod:
- Code: `git` commits, PR reviews, branch protections, secrets scanning (`gitleaks`).
- Build: CI workflows (GitHub Actions, GitLab CI), provenance (`slsa-github-generator`), signatures (`cosign`), SBOMs (`syft`), vuln scans (`grype`).
- Deploy: CD events (ArgoCD/Flux), policy gates (OPA/Kyverno), canaries (Flagger), change windows.
- Runtime: Kubernetes audit logs, Falco/eBPF, GuardDuty/Security Hub, Istio mTLS anomalies, Prometheus/Alertmanager.
Pipe them to a correlation layer (Datadog, Elastic, Snowflake, or even Loki + Tempo) with consistent IDs:
- Annotate everything with `trace_id`, `build_id`, `commit_sha`, `artifact_digest`.
- Emit OpenTelemetry from CI and CD so deploys connect to runtime alerts.
The goal: when an alert fires, you can answer “what changed, who approved it, and is the artifact trustworthy?” in under 60 seconds.
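The 60-second answer boils down to one join: runtime alert to build record, keyed on `artifact_digest`. A minimal sketch of that lookup, assuming build metadata is already indexed by digest (the in-memory dict and field names stand in for whatever your correlation layer actually stores):

```python
# Hypothetical build-metadata index. In practice this is a query against your
# correlation layer (Datadog, Elastic, Snowflake...), keyed on artifact_digest.
BUILD_INDEX = {
    "sha256:abc123": {
        "commit_sha": "9f2c1d4",
        "build_id": "run-4812",
        "approved_by": ["alice"],
        "signed": True,
    }
}

def answer_what_changed(alert: dict) -> dict:
    """Join a runtime alert to its build metadata via artifact_digest."""
    digest = alert.get("artifact_digest")
    meta = BUILD_INDEX.get(digest)
    if meta is None:
        # No provenance on file: treat the workload as untrusted by default.
        return {"digest": digest, "trusted": False, "reason": "no build record"}
    return {
        "digest": digest,
        "commit_sha": meta["commit_sha"],
        "approved_by": meta["approved_by"],
        "trusted": meta["signed"],
    }

print(answer_what_changed({"artifact_digest": "sha256:abc123"}))
```

The important design choice is the default: an alert whose digest has no build record should fail closed, not fall off the dashboard.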
Turn policy into guardrails, checks, and proofs
Most policies die as PDFs. Make them executable.
- Guardrails: pre-merge checks and default configs that steer engineers right.
- Checks: hard gates that block risky changes where it matters.
- Proofs: cryptographic evidence that a control ran and passed.
Examples that actually work:
- Infrastructure policy as code with OPA/Rego via `conftest` or Checkov in CI.
- Kubernetes admission with Kyverno or Gatekeeper for runtime enforcement.
- Artifact provenance and signatures with `cosign` + SLSA attestations.
Rego for Terraform (block public S3 unless tagged `public-approved`). Note this is written as a deny rule: an allow-by-default rule that only matched public buckets would reject every private bucket too.

```rego
package terraform.s3

# Deny public-read buckets that lack an explicit approval tag.
deny[msg] {
  input.resource_type == "aws_s3_bucket"
  input.config.acl == "public-read"
  not approved
  msg := "public S3 buckets require tag public-approved=true"
}

approved {
  input.config.tags["public-approved"] == "true"
}
```
Kyverno to deny privileged pods unless approved:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-privileged
      match:
        any:
          - resources:
              kinds: [Pod]
      # Pods carrying an explicit risk-approval label are exempt.
      exclude:
        any:
          - resources:
              selector:
                matchLabels:
                  risk-approval: approved
      validate:
        message: "Privileged containers require risk-approval label"
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false
```
Proofs during build:
```sh
# Generate SBOM and vulnerability scan
syft dir:. -o spdx-json > sbom.json
grype sbom:sbom.json -o json > vuln.json

# Generate SLSA provenance and sign. Note: slsa-github-generator is a
# reusable GitHub Actions workflow, not a local CLI; it produces the
# provenance.json predicate attested below.
cosign sign --key "$COSIGN_KEY" "$IMAGE_DIGEST"
cosign attest --predicate sbom.json --type spdx "$IMAGE_DIGEST"
cosign attest --predicate provenance.json --type slsaprovenance "$IMAGE_DIGEST"
```
Store proofs alongside artifacts (e.g., `ghcr.io` or ECR with attached attestations) and index them in your data platform.
Instrument the pipeline: code, build, deploy
You can’t protect what you can’t see. Wire events where attackers (or rushed engineers) make mistakes.
Code
- Enable branch protections and required reviews; log review metadata.
- Run `gitleaks` and `trufflehog` pre-commit and in CI; block on high-confidence hits.
- Dependabot/Renovate with security-only auto-merge under tests + policy gates.
Build
- Standardize CI with reusable workflows.
- Example GitHub Actions workflow snippet:
```yaml
name: secure-build
on: [push]
jobs:
  build:
    permissions:
      id-token: write   # OIDC for cosign keyless signing
      contents: read
      packages: write   # needed to push the image before reading RepoDigests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sigstore/cosign-installer@v3
      - run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"   # RepoDigests is only populated after a push
          digest=$(docker inspect --format='{{index .RepoDigests 0}}' "$IMAGE")
          echo "IMAGE_DIGEST=$digest" >> "$GITHUB_ENV"
      - run: syft "$IMAGE" -o spdx-json > sbom.json
      - run: grype sbom:sbom.json -o json --fail-on high
      - run: cosign sign --yes ${{ env.IMAGE_DIGEST }}
      - run: cosign attest --yes --predicate sbom.json --type spdx ${{ env.IMAGE_DIGEST }}
```
Deploy
- ArgoCD with Sync Waves and Sync Windows; feed events to the bus.
- Admission policies (Kyverno) enforce `image: $DIGEST` only (no mutable tags).
- Canary with Flagger + auto-rollback on SLO breach.
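The digest-only rule is also worth failing fast on in CI, before a manifest ever reaches admission. A small sketch; the regex and function name are ours, not a standard tool:

```python
import re

# An image reference counts as "pinned" only if it carries a sha256 digest.
# Mutable tags (":latest", ":v2") do not qualify.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_pinned(image_ref: str) -> bool:
    """True if the image reference is pinned by digest rather than tag."""
    return bool(DIGEST_RE.search(image_ref))

print(is_digest_pinned("ghcr.io/org/payments-api@sha256:" + "a" * 64))  # True
print(is_digest_pinned("ghcr.io/org/payments-api:latest"))              # False
```

Run this over rendered manifests in the PR pipeline and the admission controller becomes a backstop instead of the first line of defense.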
Admission to block unsigned images:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  rules:
    - name: verify-cosign
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - image: "ghcr.io/org/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
```
Runtime signals that actually catch bad days
Catching the blast radius early is cheaper than incident review therapy.
- Kubernetes audit logs -> central store with queries like: create/update of RBAC and Secrets by service accounts not in an allowlist.
- Falco/eBPF for syscall-level detections (crypto mining, shell spawns, package installs in containers).
- Cloud-native services: GuardDuty, Security Hub, CloudTrail with anomaly detection.
- Network and identity: Istio mTLS failures, unexpected external egress, OIDC/OAuth anomalies.
Falco rule example (shell in container):
```yaml
- rule: Terminal shell in container
  desc: A shell was spawned in a container
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: "Shell spawned in container (user=%user.name container=%container.id image=%container.image.repository)"
  priority: WARNING
```
Prometheus alert for privilege escalation attempts (from audit log exporter):
```yaml
- alert: K8sPrivilegeEscalation
  expr: sum(rate(kube_audit_event_total{verb="create",resource="clusterrolebindings"}[5m])) by (user) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Possible privilege escalation by {{ $labels.user }}
```
Connect the dots: when Falco fires on a shell spawn, enrich with `deployment`, `commit_sha`, `image_digest`, and the latest `cosign` verification status. If the image is unsigned or missing an SBOM, auto-quarantine the namespace.
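Once enrichment is in place, the quarantine decision itself is a few lines. A hedged sketch; the field names mirror the enrichment above, and the patch shape is illustrative:

```python
def should_quarantine(enriched_alert: dict) -> bool:
    """Quarantine when the running image lacks a valid signature or an SBOM."""
    return not enriched_alert.get("signed", False) or not enriched_alert.get("sbom", False)

def quarantine_patch() -> dict:
    """Strategic-merge patch that labels a workload for the egress-deny policy."""
    return {"metadata": {"labels": {"quarantine": "true"}}}

# Unsigned image that triggered a shell-spawn alert: quarantine it.
print(should_quarantine({"rule": "shell_in_container", "signed": False, "sbom": True}))  # True
# Fully attested image: leave it running and page a human instead.
print(should_quarantine({"rule": "shell_in_container", "signed": True, "sbom": True}))   # False
```

Note the defaults: missing enrichment fields are treated as failures, so a gap in the evidence pipeline quarantines rather than silently passes.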
Auto-remediation pattern:
- Detect -> tag the workload with `quarantine=true`.
- Kyverno policy denies network egress for quarantined pods.
- PagerDuty alert routed with enriched context and a rollback command link.
Keep auditors happy without killing velocity
Regulated data changes the calculus, but it doesn’t have to stall delivery. The trick is risk tiers and progressive enforcement.
Risk tiers
- Tier 0 (PCI/HIPAA prod): block on any failed policy, require signed artifacts, mandatory peer review, mTLS, DLP egress scanning.
- Tier 1 (prod non-regulated): block on criticals; warn on mediums with 7-day SLAs.
- Tier 2 (staging/dev): warn-only, but still collect evidence.
Progressive enforcement
- Week 1–2: emit warnings (no blocks), measure noise.
- Week 3–4: block in Tier 0, warn in Tier 1–2.
- Week 5+: ratchet thresholds, add auto-remediation.
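The tiers and rollout weeks above reduce to a small decision table. A sketch of how a gate might pick block vs. warn; the function and thresholds are illustrative, mirroring the schedule described:

```python
def gate_action(tier: int, severity: str, week: int) -> str:
    """Map (risk tier, finding severity, rollout week) to an enforcement action."""
    blocking = {
        0: {"critical", "high", "medium", "low"},  # Tier 0: block on any failed policy
        1: {"critical"},                           # Tier 1: block criticals only
        2: set(),                                  # Tier 2: warn-only
    }
    if week <= 2:
        return "warn"  # weeks 1-2: observe mode, measure noise before blocking
    if week <= 4 and tier != 0:
        return "warn"  # weeks 3-4: blocks start in Tier 0 first
    return "block" if severity in blocking[tier] else "warn"

print(gate_action(0, "medium", 5))    # block
print(gate_action(1, "medium", 5))    # warn (7-day SLA instead)
print(gate_action(2, "critical", 9))  # warn (evidence still collected)
```

Keeping this logic in one reviewed function, instead of scattered per-pipeline flags, is what makes the ratchet auditable.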
Time-bound exceptions
- All waivers recorded as `Exception` CRDs with owner, scope, risk, expiry:
```yaml
apiVersion: security.gitplumbers.dev/v1
kind: Exception
metadata:
  name: allow-nodeport-temporary
spec:
  control: deny-nodeport
  owner: team-ml
  scope: namespace:ml-inference
  expires: 2025-01-31
  justification: "Partner demo; VPN cutover pending"
```
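An expiry field only matters if something enforces it. A sketch of a nightly job that flags expired waivers, using the CRD shape from the example above (the scan itself is ours):

```python
from datetime import date

def expired_exceptions(exceptions: list[dict], today: date) -> list[str]:
    """Return names of Exception resources whose expiry date has passed."""
    out = []
    for exc in exceptions:
        expires = date.fromisoformat(exc["spec"]["expires"])
        if expires < today:
            out.append(exc["metadata"]["name"])
    return out

waivers = [
    {"metadata": {"name": "allow-nodeport-temporary"},
     "spec": {"expires": "2025-01-31"}},
]
print(expired_exceptions(waivers, date(2025, 2, 15)))  # ['allow-nodeport-temporary']
```

Feed the result into the exception-debt dashboard and into a Kyverno policy update that stops honoring the expired waiver.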
Data-aware controls
- Tag resources with data classification (`public`, `internal`, `restricted`).
- DLP on egress for `restricted` namespaces; block to unknown destinations.
- Vaulted secrets (HashiCorp Vault or AWS KMS + Secrets Manager), rotated automatically; forbid inline `Secret` manifests.
Evidence for auditors
- Attestations: SBOM, vuln scan, SLSA provenance, signature.
- Policy results: pass/fail with policy versions and links to PRs.
- Change approvals: PR review logs, change tickets, and deploy metadata.
- Retention: 1–3 years in cold storage (S3 Glacier) with immutability (S3 Object Lock).
What to measure (and how you’ll know it works)
If it doesn’t change your graphs, it didn’t happen.
- MTTD for critical runtime events: target < 2 minutes from first signal.
- MTTR for policy-violating deploys: target < 15 minutes to rollback or remediate.
- False-positive rate for high-severity alerts: < 5%.
- Blocked deploy rate: < 2% in Tier 1; 0% in Tier 2 (warn-only).
- Exception debt: count, aging, and percent expired.
- Evidence completeness: % of prod images with SBOM + signature + provenance (> 98%).
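Evidence completeness is just a ratio over your artifact index, which makes it easy to put on a graph and a hard gate. A sketch; the record shape is illustrative:

```python
def evidence_completeness(images: list[dict]) -> float:
    """Percent of prod images carrying SBOM, signature, and provenance."""
    if not images:
        return 0.0
    complete = sum(
        1 for img in images
        if img.get("sbom") and img.get("signed") and img.get("provenance")
    )
    return 100.0 * complete / len(images)

fleet = [
    {"digest": "sha256:aaa", "sbom": True, "signed": True, "provenance": True},
    {"digest": "sha256:bbb", "sbom": True, "signed": False, "provenance": True},
]
print(evidence_completeness(fleet))  # 50.0
```

Alert when the number dips below the 98% target rather than eyeballing a dashboard.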
Dashboards worth staring at
- “What changed in the last 60 minutes?” (deploys, infra changes, RBAC updates)
- Top policy violations by team/service
- Runtime criticals mapped to commit SHAs
- Exception queue with SLA timers
A 30/60/90 you can actually ship
30 days (prove the loop works)
- CI: add SBOM (`syft`), scan (`grype --fail-on high`), and `cosign` signing in one service.
- Admission: enforce digest pins and signature verification in one cluster.
- Runtime: enable Falco and one Prometheus alert; route to PagerDuty.
- Evidence: store attestations and CI logs; index by `commit_sha` and `image_digest`.
60 days (turn up the lights)
- Expand to the top 10 services; add Checkov/`conftest` to Terraform repos.
- Ingest K8s audit logs; add the privilege escalation alert.
- Roll out progressive enforcement in Tier 1; start exception CRDs.
- Dashboards for MTTD, MTTR, blocked deploys; weekly tuning.
90 days (make it boring and durable)
- Cover 80% of prod images with SBOM, signature, provenance.
- Canary + auto-rollback for Tier 0 with Flagger SLO hooks.
- DLP egress policy for restricted namespaces; enforce Vault-only secrets.
- Quarterly control reviews as code PRs, not meetings; auditors get read-only dashboards.
What this looks like when it works
- A Datadog alert fires for unexpected `ClusterRoleBinding` creation.
- The event is enriched with `deployment=payments-api`, `commit=abc123`, `digest=sha256:...`, `signed=true`, `sbom=true`.
- Kyverno quarantines the namespace; Flagger rolls traffic back.
- On-call clicks the evidence link: SLSA provenance, SBOM, scan results, PR approvals.
- Postmortem: 7 minutes MTTD, 11 minutes MTTR, zero customer impact. Audit trail closed itself.
I’ve seen the opposite too: 3-hour MTTD because logs trickled into a SIEM, no idea who changed what, and a painful all-hands the next day. The difference isn’t budget—it’s wiring policy into the operational fabric and insisting on proofs.
Key takeaways
- Real-time detection starts with event coverage across code, CI, deploy, and runtime—not just a SIEM feed.
- Translate policies into machine-enforceable rules (OPA/Kyverno), not PDF checklists, and collect cryptographic proofs (attestations, SBOMs).
- Use progressive enforcement: warn in dev, block in prod; risk-tier controls to keep delivery moving.
- Standardize evidence pipelines: provenance, signatures, SBOMs, and policy pass/fail recorded per artifact.
- Measure outcomes: MTTD, MTTR, false-positive rate, and time-to-exception-closure drive trust.
- Start small: wire 3–5 critical controls end-to-end, then iterate with dashboards and auto-remediation.
Implementation checklist
- Instrument code-to-prod event streams: `git`, CI, artifact registry, deploy, Kubernetes audit, cloud logs.
- Encode policies as code using `OPA`/`Rego` or `Kyverno` and gate them in CI and admission controllers.
- Generate and store attestations: SBOM (`syft`), vuln scan (`grype`), provenance (`slsa-generator`), signature (`cosign`).
- Implement runtime detection: `Falco`/eBPF, `GuardDuty`, `CloudTrail`, `Istio` mTLS anomalies, and `Prometheus` alert rules.
- Adopt progressive enforcement with risk tiers and time-bound exceptions.
- Build dashboards for MTTD, MTTR, blocked deploys, and exception debt; tune weekly.
Questions we hear from teams
- How do we avoid drowning in false positives?
- Start with a narrow ruleset tied to change events and high-signal runtime detections (RBAC changes, unsigned images, shell spawns). Run in observe mode for two weeks, label every alert as actionable/not, and only then enable blocks. Track false-positive rate and require a tuning PR for every noisy rule.
- Is this overkill for a non-regulated SaaS?
- No—reduce scope. Keep signatures, SBOMs, and one or two runtime detections. The payback is faster incident triage and fewer Friday night rollbacks. You can still do warn-only in lower tiers and reserve hard blocks for prod.
- We’re on EKS with multiple clusters. Where do we start?
- Pick one cluster and one service on the critical path. Add digest pinning + signature verification in admission, and Falco with two rules. In CI, generate SBOM and sign artifacts. Ingest K8s audit logs. Expand by namespace, not by cluster, to keep blast radius contained.
- Do we need a full SIEM to do this?
- Helpful, not required. You can get far with OpenTelemetry, Loki/Tempo, and a modest Elasticsearch or Datadog footprint. The critical piece is correlation: consistent IDs across CI/CD and runtime so you can stitch events together.
- What about AI-generated code and hallucinated dependencies?
- Treat AI like a junior dev: guardrails plus review. Enforce dependency allowlists, scan SBOMs, and block new packages without a ticket. Use repo-level policies to require PR descriptions referencing tasks. Provenance attestations help prove where code came from and which pipeline built it.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.