The Night Falco Saved Prod: Real‑Time Detection, Guardrails, and Proofs Without Slowing Delivery

Turn your security policies into code that blocks the dumb stuff, detects the sneaky stuff, and proves compliance automatically.

“If you can’t detect it in under five minutes and prove it in under five clicks, you don’t have real-time security—you have theater.”

The page-out that changed our approach

I’ve been on too many 2 a.m. calls where everyone insists, “We passed the audit last quarter.” That night on EKS, an engineer popped a debug shell in a prod pod to chase a latency spike. Falco fired within seconds: Terminal shell in container. Slack lit up, PagerDuty paged, and we cordoned the node before credentials could walk. If we were relying on our weekly audit scripts or a SIEM dashboard refresh, we’d have caught it Monday—maybe.

Real-time security monitoring isn’t a tool, it’s a pipeline: guardrails that prevent foot-guns, detections that catch what slips through, and proofs that show auditors (and yourself) that controls exist and work. The trick is doing all three without turning your devs into ticket clerks.

From policy PDF to code: guardrails, checks, and automated proofs

Your policy doc says “No public S3 buckets” and “Only signed images in prod.” Translate that into three layers:

  • Guardrails (preventative): non-negotiable policies enforced in CI/CD and admission. Think OPA/Gatekeeper or Kyverno, branch protection, CODEOWNERS, and Conftest on terraform plan.
  • Checks (detective in CI): fast SAST/SCA/IaC scanning (Semgrep, Trivy, Checkov/Tfsec, Grype) with severity-based gating.
  • Proofs (attested evidence): signatures, SBOMs, provenance (cosign, in-toto, SLSA), PR approvals, and immutable logs stored with retention.

Example: enforce no-public S3 at plan-time with Rego and Conftest:

package s3.guardrails

# Deny any bucket whose planned ACL is public
deny[msg] {
  some i
  rc := input.resource_changes[i]
  rc.type == "aws_s3_bucket"
  startswith(rc.change.after.acl, "public-")
  msg := sprintf("S3 bucket %s has a public ACL", [rc.address])
}

# Deny plans that create buckets without any public access block
deny[msg] {
  some i
  input.resource_changes[i].type == "aws_s3_bucket"
  not has_public_access_block
  msg := "S3 bucket is missing an aws_s3_bucket_public_access_block"
}

has_public_access_block {
  some i
  input.resource_changes[i].type == "aws_s3_bucket_public_access_block"
}

Run it in CI:

terraform plan -out tf.plan
terraform show -json tf.plan > plan.json
conftest test --namespace s3.guardrails plan.json

Make it visible but not painful: fail on critical, warn on medium. Keep developers moving while you fix the baseline.
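One way to wire that tiering into CI, sketched with Trivy (the image name is a placeholder): critical findings fail the job, everything else prints as a non-fatal report.

# Block the merge only on CRITICAL vulnerabilities.
trivy image --exit-code 1 --severity CRITICAL myapp:latest
# Still surface HIGH/MEDIUM findings, but never fail the job on them.
trivy image --exit-code 0 --severity HIGH,MEDIUM myapp:latest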

Runtime that actually catches badness (in minutes)

The runtime stack that consistently works for k8s and cloud:

  • Containers/Kubernetes: Falco (eBPF) or Tetragon for syscall/network detection.
  • Cloud: AWS GuardDuty + CloudTrail + EKS Audit routed to Security Hub.
  • Pipelines & apps: OpenTelemetry for structured security events; ship via Fluent Bit or Vector to Datadog/Splunk.

A Falco rule that’s saved my bacon more than once:

- rule: Terminal shell in container
  desc: Detect a shell running inside a container
  condition: spawned_process and container and proc.name in (bash, zsh, sh)
  output: "Shell spawned in container (user=%user.name container=%container.id image=%container.image.repository)"
  priority: WARNING
  tags: [container, shell]

Install with Helm and wire to Slack/PagerDuty via falcosidekick:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm upgrade --install falco falcosecurity/falco --namespace falco --create-namespace \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl=$SLACK_WEBHOOK \
  --set falcosidekick.config.pagerduty.routingkey=$PD_KEY

On AWS, enable the basics in one pass:

aws guardduty create-detector --enable
aws securityhub enable-security-hub
aws cloudtrail create-trail --name org-trail --is-multi-region-trail --enable-log-file-validation \
  --s3-bucket-name org-cloudtrail-logs
aws cloudtrail start-logging --name org-trail

Tie containment to detections. Example: label a suspect pod and apply a restrictive NetworkPolicy automatically:

kubectl label pod $POD quarantine=true --overwrite

A standing policy then cuts off egress for anything carrying that label:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-egress-deny
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes: [Egress]
  egress: []
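To close the loop without a human in it, a thin responder can apply the label when Falco fires — a minimal sketch, assuming a falcosidekick webhook receiver extracts the namespace and pod name from the alert (the script name and its arguments are hypothetical):

#!/usr/bin/env bash
# quarantine.sh — hypothetical responder invoked as: quarantine.sh <namespace> <pod>
# (e.g. by a falcosidekick webhook receiver parsing k8s.ns.name / k8s.pod.name).
set -euo pipefail
NS="$1"; POD="$2"
kubectl label pod "$POD" -n "$NS" quarantine=true --overwrite
# Leave a breadcrumb so responders know when containment kicked in.
kubectl annotate pod "$POD" -n "$NS" quarantined-at="$(date -u +%FT%TZ)" --overwrite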

Low drama, high impact. You can roll this out in hours, not weeks.

Prove it: signatures, SBOMs, and provenance in admission

If you can’t prove it, it didn’t happen. We require three things per deploy:

  1. Signed images with cosign.
  2. SBOMs built with Syft and scanned with Trivy/Grype.
  3. Provenance (in-toto/SLSA) showing what built what.

Generate and sign in CI:

syft packages myapp:$(git rev-parse --short HEAD) -o spdx-json -q > sbom.spdx.json
cosign sign --key cosign.key ghcr.io/acme/myapp:$(git rev-parse --short HEAD)
cosign attest --predicate sbom.spdx.json --type spdxjson \
  --key cosign.key ghcr.io/acme/myapp:$(git rev-parse --short HEAD)
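Before turning on admission enforcement, sanity-check that verification passes outside the cluster — a quick sketch, assuming cosign.pub is the public half of cosign.key:

cosign verify --key cosign.pub ghcr.io/acme/myapp:$(git rev-parse --short HEAD)
cosign verify-attestation --key cosign.pub --type spdxjson \
  ghcr.io/acme/myapp:$(git rev-parse --short HEAD)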

Enforce at the cluster with Kyverno verifyImages:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-cosign
      match:
        any:
          - resources:
              kinds: [Pod, Deployment]
      verifyImages:
        - imageReferences:
            - "ghcr.io/acme/*"
          verifyDigest: true
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----
          attestations:
            - type: https://spdx.dev/Document
              attestors:
                - entries:
                    - keys:
                        publicKeys: |-
                          -----BEGIN PUBLIC KEY-----
                          ...
                          -----END PUBLIC KEY-----

Store evidence immutably: shove attestations, PR approvals, and terraform plan/apply logs into S3 with Object Lock (compliance mode) and lifecycle policies. When the SOC 2 auditor asks, you search, attach, done.
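A sketch of that evidence bucket — names and retention window are placeholders, and compliance mode means even root can't shorten the clock:

# Add --create-bucket-configuration for regions other than us-east-1.
aws s3api create-bucket --bucket org-security-evidence \
  --object-lock-enabled-for-bucket
aws s3api put-object-lock-configuration --bucket org-security-evidence \
  --object-lock-configuration \
  'ObjectLockEnabled=Enabled,Rule={DefaultRetention={Mode=COMPLIANCE,Years=1}}'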

Ship fast under HIPAA/PCI/SOC 2: keep data safe and out of logs

Real talk: most “security incidents” in regulated orgs are self-inflicted PII leaks via logs and debug dumps. Fix the plumbing at the edge.

  • Redact at collectors. A Fluent Bit pair that retags records, then drops any that contain an email or SSN before they hit your SIEM (masking in place, rather than dropping, needs a Lua filter):

[FILTER]
  Name    rewrite_tag
  Match   kube.*
  # Rule format: key regex new_tag keep — keep=false drops the original record
  Rule    $log ^.*$ kube.redacted.$TAG false

[FILTER]
  Name     grep
  Match    kube.redacted.*
  Exclude  log (?i)\b\d{3}-\d{2}-\d{4}\b|\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b
  • Tokenize sensitive fields. Use Vault Transform for format-preserving tokenization—let devs see realistic shapes, not real data. The role name goes in the path; the transformation is a parameter:

vault write transform/encode/payments transformation=ccn value=4111111111111111
  • No prod data in dev. Synthetic datasets (Tonic.ai, Mockaroo) and contract tests instead of prod snapshots. If you must mirror schemas, use data masking and separate accounts/projects by boundary.

  • DLP and secret scanning. Turn on GitHub Advanced Security secret scanning or GitGuardian for repos and logs. Better to be embarrassed in CI than compromised in prod.

  • Access is a feature. Short-lived creds (Vault, AWS SSO) and just-in-time elevation. Log access decisions centrally (see the sketch below).
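For the short-lived-credentials piece, a minimal sketch using STS — the role ARN and session name are placeholders, and the AssumeRole call itself lands in CloudTrail, which is your central access log:

# 15-minute credentials instead of long-lived keys; they expire on their own.
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/incident-responder \
  --role-session-name "jit-$(whoami)" \
  --duration-seconds 900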

The result: less compliance drag, fewer oops moments, faster incident response.

Make detections an SLO-backed product

If you can’t operate it, you don’t own it. Treat security like SRE:

  • SLOs that matter:
    • MTTA < 5 minutes for high-severity detections.
    • MTTR < 30 minutes to containment for known playbooks.
    • False-positive rate < 10% for paging alerts.
  • Runbooks in code: docs/runbooks/*.md files linked from every alert, with kubectl snippets and rollback steps.
  • Noise budgets: weekly tuning. If an alert pages 3x with no action, demote or fix the rule.
  • Detection coverage map: list “crown jewels” (secrets manager, CI runners, DBs) and ensure at least one prevention and one detection per asset.
  • Honeytokens: plant AWS keys and file beacons from Canarytokens.org in private repos and S3. If they fire, you know you’ve got exfil or sloppy handling.

Route alerts where work happens: PagerDuty for Sev1, Slack for warn-level, ticket creation automated via Jira for follow-ups. No spreadsheets.

A pragmatic 30/60/90 rollout plan

30 days:

  1. Ship Falco + falcosidekick in prod clusters; wire to Slack/PD.
  2. Turn on GuardDuty, Security Hub, org-level CloudTrail with object lock.
  3. Add Conftest on terraform plan + Trivy on images in CI (fail critical only).
  4. Add basic PII redaction in Fluent Bit/Vector.

60 days:

  1. Require cosign signatures and Syft SBOMs; enforce with Kyverno in staging, then prod.
  2. Add Semgrep with a curated ruleset; gate on high severity in PRs.
  3. Quarantine workflow: label + NetworkPolicy automation from alerts.
  4. Start honeytokens and coverage map; set MTTA/MTTR SLOs.

90 days:

  1. Provenance (in-toto/SLSA) and immutable evidence store with retention.
  2. Expand runtime rules (credential file access, unexpected outbound, kernel module loads).
  3. Tune weekly; publish a security ops dashboard (MTTA, MTTR, FP rate, top detections).
  4. Audit dry-run: pull proofs for a random change and confirm you can answer “who/what/when/why” (see the sketch below).
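What that dry-run can look like, assuming evidence lands in the Object Lock bucket from earlier and images verify against the same cosign key — the digest, paths, and bucket names are illustrative:

IMAGE=ghcr.io/acme/myapp@sha256:<digest>
cosign verify --key cosign.pub "$IMAGE"                    # who signed it
cosign download attestation "$IMAGE" \
  | jq -r '.payload' | base64 -d | jq '.predicateType'     # what was attested
aws s3 ls s3://org-security-evidence/deploys/ --recursive  # plans, approvals, timestamps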

If this feels like a lot, it is—but it replaces three worse things: endless tickets, noisy SIEMs, and audit fire drills.


Key takeaways

  • Translate policy PDFs into code: guardrails in CI/CD, runtime detections, and automated proofs tied to controls.
  • Use eBPF-based runtime tools like Falco or Tetragon to catch shells-in-containers, privilege escalation, and data exfil in minutes, not days.
  • Gate production with signatures, SBOMs, and provenance: verify with Kyverno/Gatekeeper and Sigstore in admission.
  • Keep regulated data out of logs with redaction at the edge (Fluent Bit/Vector) and tokenization (Vault Transform).
  • Treat detections like product features: own MTTA/MTTR SLOs, tune rules weekly, and automate containment.
  • Roll out in 90 days with a minimal viable detection pipeline that proves its value via measurable risk reduction.

Implementation checklist

  • Instrument clusters and clouds for runtime detections (Falco/Tetragon, GuardDuty, CloudTrail).
  • Enforce policy-as-code in CI for IaC and containers (Conftest/Rego, Checkov/Tfsec, Trivy/Semgrep).
  • Require signed images and SBOMs; verify in Kubernetes admission (Kyverno/Gatekeeper + cosign).
  • Centralize logs with PII redaction at collectors (Fluent Bit/Vector) and ship to your SIEM/Datadog.
  • Set alert routing and on-call with runbooks; measure MTTA/MTTR and false-positive rate.
  • Store immutable proof artifacts (attestations, approvals, logs) with retention and object lock.
  • Start with a 30/60/90 rollout plan; tune weekly; expand detections based on postmortems.

Questions we hear from teams

Will this slow my engineers down?
Not if you tier it. Gate only high/critical issues in CI and admission. Everything else is a visible warning with a fix link. Guardrails catch the foot-guns; runtime detections handle the edge cases. We’ve rolled this out in orgs shipping daily without increasing lead time.
We already have a SIEM—why add Falco/Kyverno/cosign?
SIEMs aggregate and search. You still need sensors (Falco/Tetragon/GuardDuty), prevention (OPA/Kyverno), and provenance/signatures (cosign/SBOM) to make the SIEM useful. Think sensors + policy + proofs feeding the SIEM, not replacing it.
How do we handle multi-cloud and hybrid?
Standardize the patterns: policy-as-code (Rego/Kyverno), eBPF runtime where you run Kubernetes, cloud-native findings into a common bus (Security Hub or Datadog), and evidence to a single immutable store. Use OpenTelemetry to normalize app security events across environments.
What about cost and alert fatigue?
Set a noise budget and SLOs up front. Start with a minimal ruleset targeting crown jewels, tune weekly, and only page for high-confidence detections. Shipping fewer, better alerts costs less than drowning your team and ignoring everything.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about real-time detection

Download the 30/60/90 rollout checklist
