When Your SIEM Sleeps Through Production: Building Real-Time Detection and Automated Proofs Without Killing Delivery

No more PDF policies gathering dust. Turn intent into guardrails, real-time signals, and auditable proofs that ship with your software.

I don’t need another dashboard—I need a page within five minutes and artifacts that make my auditor nod once and say, “Thanks.”

The 2 a.m. page you don’t forget

We had EKS, ArgoCD, Prometheus, and a shiny SIEM license. Still, the page came from a customer, not our tooling: an exfil attempt riding on a compromised CI token. The SIEM ingested logs… fifteen minutes late. Our “policy” lived in a Confluence wiki and a quarterly audit ritual. I’ve seen this failure pattern at startups and Fortune 100s. The fix wasn’t buying another dashboard—it was translating policy into code, wiring real-time sensors, and generating automated proofs as a byproduct of delivery.

This is the playbook we apply at GitPlumbers when teams need signal in seconds, not slides next quarter.

Translate policy to guardrails, checks, and proofs

Most orgs stop at checkbox compliance. Instead, implement policy as three layers that ship with your system:

  • Guardrails (prevent): Make the paved road easy and the shoulder painful. Think pre-commit, admission controllers, required checks in GitHub/GitLab.
  • Checks (detect): Real-time sensors and anomaly detection (Falco, GuardDuty, WAF, app signals).
  • Proofs (document): Machine-generated evidence for auditors and incident reviews (signed attestations, policy evaluation logs).

A few concrete examples:

  • Policy intent: “No public S3 buckets.” Implement guardrail with IaC checks, detect with cloud findings, prove with stored policy evaluations.
package s3.public

deny["Bucket is public"] {
  input.resource_type == "aws_s3_bucket"
  acl := input.acl[_]
  acl.grantee == "AllUsers"
}
  • Admission rule: “Images must be pinned and signed.” Kyverno makes this painless:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-tag-and-signature
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-nonlatest
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Images must use a non-latest tag"
        pattern:
          spec:
            containers:
              - image: "!*:latest"
    - name: verify-signature
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
          attestations:
            - predicateType: https://slsa.dev/provenance/v0.2
              attestors:
                - entries:
                    - keyless:
                        issuer: https://token.actions.githubusercontent.com
                        # cert SAN of the signing workflow, not the OIDC sub claim
                        subject: https://github.com/org/repo/*
  • Proofs: Store the Kyverno admission decision and OPA evaluations alongside build artifacts for audit.

Real-time detection without the SIEM lag

Your SIEM can stay; it’s just not the first responder. You need low-latency detection with sane routing. The pattern that works:

  1. Runtime sensors on the nodes/containers: Falco (eBPF) flags suspicious syscalls within milliseconds.
  2. Cloud-native findings: GuardDuty/Security Hub or GCP Security Command Center for account-level threats.
  3. App-level signals: OpenTelemetry traces/logs with security attributes (user.role, pii.redacted=true).
  4. Event router: Falcosidekick or Vector fans out to PagerDuty/Slack and a data lake.
  5. Single normalized pipeline to the SIEM for correlation and history.
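The router in steps 4–5 is mostly normalization plus severity-based fan-out. A minimal sketch in Python, assuming Falco's JSON output shape; the sink names (`pagerduty`, `slack`, `lake`) are illustrative stand-ins for real clients:

```python
import json

# Severity -> sinks. Only CRITICAL pages; everything lands in the lake/SIEM.
SEVERITY_ROUTES = {
    "CRITICAL": ["pagerduty", "slack", "lake"],
    "WARNING": ["slack", "lake"],
    "NOTICE": ["lake"],
}

def normalize(raw: str) -> dict:
    """Flatten a Falco JSON event into the fields downstream tools need."""
    event = json.loads(raw)
    return {
        "rule": event.get("rule", "unknown"),
        "severity": event.get("priority", "NOTICE").upper(),
        "ts": event.get("time"),
        "fields": event.get("output_fields", {}),
    }

def route(event: dict) -> list[str]:
    """Return the sinks this event fans out to; unknown severities go to the lake."""
    return SEVERITY_ROUTES.get(event["severity"], ["lake"])

raw = ('{"rule": "Unexpected Binary in Container", "priority": "Warning", '
       '"time": "2024-05-01T02:13:00Z", "output_fields": {"container.id": "abc123"}}')
event = normalize(raw)
print(route(event))  # ['slack', 'lake']
```

In practice the routing table lives in config (Falcosidekick or Vector does this for you); the point is one normalized shape before any sink sees the event.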

A minimal Falco rule that’s saved us more than once:

- list: known_binaries
  items: [sh, bash, busybox, node, java]

- rule: Unexpected Binary in Container
  desc: Detect new executable dropped and executed in a running container
  condition: container and evt.type = execve and evt.dir = < and not proc.name in (known_binaries)
  output: "Falco: unexpected exec in container (user=%user.name command=%proc.cmdline image=%container.image.repository)"
  priority: WARNING
  tags: [container, process]

Wire alerts fast and loud:

# falcosidekick config snippet
customfields:
  team: "platform-security"
  env: "prod"
outputs:
  pagerduty:
    enable: true
    routingkey: ${PAGERDUTY_ROUTING_KEY}
  slack:
    enable: true
    webhookurl: ${SLACK_WEBHOOK}
  loki:
    enable: true
    hostport: http://loki.grafana:3100

And set a real SLO for detection:

  • MTTD SLO: 95% of critical threats paged within 5 minutes of event time.
  • False positive budget: < 10% of paged incidents per month. Reduce noisy rules, not the pager.
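Measuring that SLO is just comparing event time to page time per incident. A minimal sketch, with hypothetical incident records:

```python
from datetime import datetime, timedelta

# MTTD SLO check: 95% of critical pages within 5 minutes of event time.
SLO_TARGET = 0.95
SLO_WINDOW = timedelta(minutes=5)

# (event_time, page_time) pairs -- hypothetical incident records.
incidents = [
    (datetime(2024, 5, 1, 2, 13), datetime(2024, 5, 1, 2, 15)),    # paged in 2m
    (datetime(2024, 5, 2, 9, 0), datetime(2024, 5, 2, 9, 4)),      # paged in 4m
    (datetime(2024, 5, 3, 17, 30), datetime(2024, 5, 3, 17, 42)),  # 12m: breach
]

within = sum(1 for event, page in incidents if page - event <= SLO_WINDOW)
attainment = within / len(incidents)
print(f"MTTD attainment: {attainment:.0%}, SLO met: {attainment >= SLO_TARGET}")
```

Run this weekly over your incident log and the number either earns trust in the pager or tells you which rules to tune.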

Prometheus also belongs here. If your auth failure rate spikes or egress jumps 10x, page it like an outage. That’s real-time detection from your system’s heartbeat.

# Prometheus alert: suspicious egress spike
- alert: EgressTrafficSpike
  expr: rate(node_network_transmit_bytes_total{instance=~"prod.*"}[5m]) > 1e8
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Egress spike on {{ $labels.instance }}"
    description: "Potential exfiltration or runaway job; investigate now."

Keep velocity: shift-left guardrails in CI and GitOps

Don’t make developers guess what security wants. Bake it into the tooling. We usually start with GitHub Actions and ArgoCD policies:

  • IaC scanning: Checkov, tfsec on Terraform. Fail on high severity.
  • Container scanning: Trivy in CI and Trivy Operator in-cluster.
  • Required checks: Block merges if scans or policy evaluations fail.
  • GitOps drift detection: ArgoCD health + OPA policy bundles blocking unsafe changes.

Here’s a GitHub Actions workflow that catches most foot-guns early:

name: security-ci
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trivy image scan
        uses: aquasecurity/trivy-action@0.20.0
        with:
          image-ref: ghcr.io/org/app:${{ github.sha }}
          format: table
          exit-code: '1'
          severity: CRITICAL,HIGH
      - name: Checkov Terraform scan
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: ./infra
          framework: terraform
          skip_download: true
          quiet: true
      - name: OPA policy evaluation (conftest)
        run: |
          docker run --rm -v "$PWD":/project openpolicyagent/conftest test /project/k8s --policy /project/policy

For delivery, keep your ArgoCD app-of-apps model but add policy gates:

  • Admission controllers enforce image signing and namespace controls.
  • Time-bound exceptions via a break-glass label that requires a change ticket and expires via Kyverno’s ttl pattern.
  • Canary security: deploy detection rules to staging and run chaos drills (attempt to exec into pods, open egress) before production rollout.

Automated proofs: evidence that builds itself

Auditors don’t need a novel—they need consistent, immutable artifacts. Generate them as part of the pipeline:

  • Sign everything: containers, SBOMs, and attestations with cosign (Sigstore).
cosign sign --key cosign.key ghcr.io/org/app@sha256:...
cosign attach sbom --sbom sbom.spdx.json ghcr.io/org/app@sha256:...
cosign attest --predicate build.json --type slsaprovenance ghcr.io/org/app@sha256:...
  • Store policy evaluations: push OPA/Kyverno results to an immutable S3 bucket with bucket lock or GCS with retention policies.
  • In-toto attestations: encode that “these checks ran with these results” and attach to the image digest.
  • Export control run logs: nightly job writes a summary (controls passed/failed, exceptions) to a versioned folder.

A lightweight exporter pattern:

# after CI completes
jq -n --arg commit "$GITHUB_SHA" --arg ts "$(date -Is)" \
  --slurpfile opa opa_results.json --slurpfile trivy trivy.json \
  '{commit:$commit, ts:$ts, opa:$opa[0], trivy:$trivy[0]}' \
| aws s3 cp - "s3://audit-evidence/prod/$GITHUB_SHA.json" \
    --acl bucket-owner-full-control --sse AES256

When the auditor asks, you don’t schedule a doc-writing sprint—you give them the bucket URL and a report generator.

Handling regulated data without blocking releases

Regulated environments (HIPAA, PCI) fail when logs leak PII or access is too permissive. Fix it at the edge and keep moving:

  • Redact in the pipeline: use OpenTelemetry/Vector processors to strip PII before it hits disk.
  • Tokenize/seal secrets: Vault or cloud KMS; no raw secrets in env vars or logs.
  • Scoped access: short-lived credentials via OIDC federated workload identities; no static keys in CI.
  • Environment controls: data seeding with synthetic datasets; no prod dumps in staging.

Example OpenTelemetry Collector redaction pipeline:

receivers:
  otlp:
    protocols:
      http:
processors:
  attributes/pii_redact:
    actions:
      - key: http.request.body
        action: update
        value: "[REDACTED]"
      - key: user.email
        action: delete
exporters:
  loki:
    endpoint: http://loki.grafana:3100/loki/api/v1/push
  otlphttp/splunk:
    endpoint: https://ingest.splunk.example.com
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [attributes/pii_redact]
      exporters: [loki, otlphttp/splunk]

And lock down secrets with workload identity instead of long-lived keys:

# GKE Workload Identity example annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-sa
  annotations:
    iam.gke.io/gcp-service-account: app-prod@project.iam.gserviceaccount.com

Run it like SRE: SLOs, drills, and dull runbooks

Security needs the same operational rigor as reliability:

  • SLOs for detection and response (MTTD, MTTR, false positive rate).
  • Chaos security: monthly drills to validate rules and runbooks (simulate credential theft, egress anomalies, container breakout attempts).
  • DORA meets security: track “deployment frequency with security gate pass rate” so leadership sees impact (or lack thereof) on velocity.
  • Versioned runbooks: every alert has a runbook with kubectl/aws commands and rollback steps.

A brutally simple runbook snippet we keep handy:

# Investigate Falco unexpected exec
kubectl -n falco logs deploy/falco --since=10m | grep "unexpected exec"
# Identify offending pod and isolate
POD=$(kubectl -n prod get pods -o json | jq -r '.items[] | select(.metadata.annotations["falco.alert"]=="true") | .metadata.name')
kubectl -n prod label pod/$POD quarantine=true --overwrite
kubectl -n prod apply -f deny-egress-networkpolicy.yaml
# Snapshot container filesystem
kubectl -n prod exec $POD -- tar czf - / > /tmp/$POD.tar.gz

What we’ve seen work (and not)

What works:

  • Policy as code with one engine per layer (Kyverno in-cluster, OPA for IaC). Less is more.
  • Real-time runtime sensing with Falco, plus cloud findings. Page from the sensors, not the SIEM.
  • Evidence automated from CI/CD and admission logs; no spreadsheet archaeology.
  • Redaction at the edge and workload identity instead of shared secrets.

What fails:

  • “We’ll fix it in the SIEM.” You won’t, not in real-time.
  • Dozens of overlapping scanners with nobody owning the noise.
  • AI-generated policies pasted into production. We clean up this vibe code weekly; test policies in staging with replayed traffic.
  • Break-glass with no expiry or audit. That’s just “permanent exception” with extra steps.

If you need a partner who’s done this under pressure, GitPlumbers builds this stack, de-noises the alerts, and leaves you with runbooks your team actually uses.


Key takeaways

  • Translate policy into three layers: guardrails (prevent), checks (detect), proofs (document).
  • Use admission controllers (Kyverno/Gatekeeper) and runtime sensors (Falco/eBPF) for real-time signals.
  • Stream normalized events to a single pipeline (OTel/Vector) and set paging SLOs for security alerts.
  • Automate evidence: sign artifacts (Sigstore), store policy evaluations, and export audit-ready reports.
  • Balance speed and compliance with pre-commit rules, break-glass workflows, and data redaction at the edge.

Implementation checklist

  • Define top 10 policy intents as code (Rego/Kyverno).
  • Instrument real-time runtime sensors (Falco) and cloud detectors (GuardDuty/Security Hub).
  • Normalize events with OpenTelemetry or Vector and route to your SIEM plus PagerDuty.
  • Enforce at the door: admission policies for images, secrets, and network egress.
  • Add CI checks for IaC and containers (Checkov/Trivy) with failing thresholds.
  • Sign builds and SBOMs with Sigstore and attach in-toto attestations.
  • Store policy evaluations and control run logs in an immutable bucket for audit.
  • Set alert SLOs (MTTD < 5m) and run monthly chaos drills for detection rules.

Questions we hear from teams

How do we avoid drowning in false positives?
Start with a narrow rule set targeting high-signal events (exec in containers, policy violations, public exposure). Track a false-positive budget and kill or tune noisy rules weekly. Route only CRITICAL to PagerDuty; send lower severities to Slack with thresholds.
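The budget math is worth automating. A sketch with hypothetical per-rule counts:

```python
# Flag rules whose paged-incident false-positive rate blows the
# 10% monthly budget. Rule names and counts are hypothetical.
FP_BUDGET = 0.10

paged = {  # rule -> (total pages, confirmed false positives) this month
    "unexpected-exec": (40, 2),
    "egress-spike": (25, 9),
    "public-bucket": (5, 0),
}

noisy = sorted(
    rule for rule, (total, fps) in paged.items() if fps / total > FP_BUDGET
)
print(noisy)  # ['egress-spike']
```

Anything in `noisy` gets tuned or demoted to Slack that week; the pager keeps its credibility.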
Which policy engine should we standardize on?
Use OPA/Rego for IaC and service-level decisions, and Kyverno for Kubernetes admission—its UX is friendlier for cluster policies. Avoid running both for the same control. Keep a single policy repo and publish versioned bundles.
How do we produce audit evidence without manual work?
Automate: sign builds and SBOMs with Sigstore, export CI scan results, and store admission decisions/policy evaluations in immutable storage with lifecycle policies. Generate reports from those artifacts—not from memory.
Won’t this slow our delivery teams?
Done right, no. Pre-commit hooks and PR checks catch issues before reviews. Admission controllers prevent bad deploys. Break-glass paths exist but are time-bound and audited. Teams move faster because the rules are clear and automated.
Where should we start if our stack is messy and includes AI-generated code?
Triage critical controls first (identity, secrets, network egress). Run a vibe code cleanup on AI-generated configs and policies in a staging sandbox. Add guardrails before detection: block obviously unsafe patterns, then layer runtime sensors. GitPlumbers can help prioritize and refactor without halting releases.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about real-time security monitoring
See our compliance-as-code templates
