Real-Time Security Monitoring Without Slowing You Down: Turning Policy Into Guardrails, Checks, and Proofs

Stop shipping blind. Wire your SDLC for real-time signals, automated enforcement, and evidence you can hand to auditors without killing velocity.

If your monitoring can’t answer “what changed” in under a minute, you don’t have monitoring—you have an archive.

The on-call page that changed the roadmap

We watched a prod cluster start spawning bash in a sidecar at 2:13 AM. kube-audit showed a just-created ClusterRole with * on secrets. The deploy came from a temp branch no one recognized. Classic: a well-meaning engineer debugging a data pipeline, accidentally punching a hole you could drive a semi through. The SIEM got the logs. It didn’t get us the save.

What did? A Falco rule firing within seconds, a Kyverno admission policy that blocked the second bad deploy, and a signed artifact trail that proved which pipeline produced what. That night convinced leadership to stop treating security as a quarterly audit and start treating it as a real-time system.

This is how we wire that system without grinding delivery to a halt.

Real-time means signals across the whole SDLC

If your “real-time monitoring” is just a Splunk or Datadog dashboard on CloudTrail, you’re blind to most of the attack surface: pre-merge changes, CI, artifact promotion, and K8s control plane events.

You need a graph of events from code to prod:

  • Code: git commits, PR reviews, branch protections, secrets scanning (gitleaks).
  • Build: CI workflows (GitHub Actions, GitLab CI), provenance (slsa-github-generator), signatures (cosign), SBOMs (syft), vuln scan (grype).
  • Deploy: CD events (ArgoCD/Flux), policy gates (OPA/Kyverno), canaries (Flagger), change windows.
  • Runtime: Kubernetes audit logs, Falco/eBPF, GuardDuty/Security Hub, Istio mTLS anomalies, Prometheus/Alertmanager.

Pipe them to a correlation layer (Datadog, Elastic, Snowflake, or even Loki + Tempo) with consistent IDs:

  • Annotate everything with trace_id, build_id, commit_sha, artifact_digest.
  • Emit OpenTelemetry from CI and CD so deploys connect to runtime alerts.

The goal: when an alert fires, you can answer “what changed, who approved it, and is the artifact trustworthy?” in under 60 seconds.
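
Concretely, every event carries the same ID envelope so a runtime alert joins straight back to its deploy and build. One enriched deploy event might look like this (sketched in YAML for readability; field names are illustrative, not a vendor schema):

event_type: deploy
timestamp: 2024-06-03T02:13:04Z
service: payments-api
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
build_id: gha-run-8841
commit_sha: abc123def456
artifact_digest: sha256:9f86d081...   # truncated for readability
approved_by: jdoe
policy_results:
  verify-image-signatures: pass
  terraform-s3: pass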

Turn policy into guardrails, checks, and proofs

Most policies die as PDFs. Make them executable.

  • Guardrails: pre-merge checks and default configs that steer engineers right.
  • Checks: hard gates that block risky changes where it matters.
  • Proofs: cryptographic evidence that a control ran and passed.

Examples that actually work:

  • Infrastructure policy as code with OPA/Rego via conftest or Checkov in CI.
  • Kubernetes admission with Kyverno or Gatekeeper for runtime enforcement.
  • Artifact provenance and signature with cosign + SLSA attestations.

Rego for Terraform (block public S3 unless tagged public-approved):

package terraform.s3

# conftest evaluates deny rules; flag public buckets that lack the approval tag
deny[msg] {
  input.resource_type == "aws_s3_bucket"
  input.config.acl == "public-read"
  not input.config.tags["public-approved"] == "true"
  msg := "public-read S3 buckets require the tag public-approved=true"
}

Kyverno to deny privileged pods unless approved:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-privileged
spec:
  validationFailureAction: enforce
  rules:
    - name: no-privileged
      match:
        resources:
          kinds: [Pod]
      validate:
        message: "Privileged containers require the risk-approval=approved label"
        anyPattern:
          - metadata:
              labels:
                risk-approval: "approved"
          - spec:
              containers:
                - =(securityContext):
                    =(privileged): false

Proofs during build:

# Generate SBOM and vulnerability scan
syft dir:. -o spdx-json > sbom.json
grype sbom:sbom.json -o json > vuln.json

# SLSA provenance is produced by the slsa-github-generator reusable workflow in CI
# (there is no standalone CLI); it publishes the provenance.json attested below

# Sign the image and attach attestations
cosign sign --key "$COSIGN_KEY" "$IMAGE_DIGEST"
cosign attest --key "$COSIGN_KEY" --predicate sbom.json --type spdx "$IMAGE_DIGEST"
cosign attest --key "$COSIGN_KEY" --predicate provenance.json --type slsaprovenance "$IMAGE_DIGEST"

Store proofs alongside artifacts (e.g., ghcr.io or ECR with attached attestations) and index them in your data platform.

Instrument the pipeline: code, build, deploy

You can’t protect what you can’t see. Wire events where attackers (or rushed engineers) make mistakes.

  1. Code

    • Enable branch protections and required reviews; log review metadata.
    • Run gitleaks and trufflehog pre-commit and in CI; block on high-confidence hits.
    • Dependabot/Renovate with security-only auto-merge under tests + policy gates.
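    • Example pre-commit config for gitleaks (a minimal sketch; the rev is illustrative, pin a release you’ve vetted):
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4   # illustrative; pin the release you have vetted
    hooks:
      - id: gitleaks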
  2. Build

    • Standardize CI with reusable workflows.
    • Example GitHub Actions workflow snippet:
name: secure-build
on: [push]
env:
  IMAGE: ghcr.io/${{ github.repository }}:${{ github.sha }}   # registry path is illustrative
jobs:
  build:
    permissions:
      id-token: write   # OIDC for cosign keyless
      contents: read
      packages: write   # push to ghcr.io so the digest exists
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sigstore/cosign-installer@v3
      - run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - run: |
          docker build -t "$IMAGE" .
          docker push "$IMAGE"   # RepoDigests is only populated after a push
          digest=$(docker inspect --format='{{index .RepoDigests 0}}' "$IMAGE")
          echo "IMAGE_DIGEST=$digest" >> "$GITHUB_ENV"
      # assumes syft and grype are preinstalled on the runner
      - run: syft "$IMAGE" -o spdx-json > sbom.json
      - run: grype sbom:sbom.json -o json --fail-on high
      - run: cosign sign --yes ${{ env.IMAGE_DIGEST }}
      - run: cosign attest --yes --predicate sbom.json --type spdx ${{ env.IMAGE_DIGEST }}
  3. Deploy

    • ArgoCD with Sync Waves and Sync Windows; feed events to the bus.
    • Admission policies (Kyverno) enforce digest-pinned images only (no mutable tags).
    • Canary with Flagger + auto-rollback on SLO breach (sketch below).
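
A minimal Flagger Canary sketch for the SLO-driven rollback step above (service name, port, and the 99% success threshold are illustrative):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-api            # illustrative service
  namespace: prod
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  service:
    port: 8080
  analysis:
    interval: 1m                # how often Flagger evaluates metrics
    threshold: 5                # failed checks before automatic rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate   # Flagger built-in metric
        thresholdRange:
          min: 99                    # roll back if success rate drops below 99%
        interval: 1m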

Admission to block unsigned images:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: enforce
  rules:
    - name: verify-cosign
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "ghcr.io/org/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...
                      -----END PUBLIC KEY-----

Runtime signals that actually catch bad days

Containing the blast radius early is cheaper than incident-review therapy.

  • Kubernetes audit logs -> central store with queries like: create/update of RBAC and Secrets by service accounts not in an allowlist.
  • Falco/eBPF for syscall-level detections (crypto mining, shell spawn, package install in containers).
  • Cloud-native services: GuardDuty, Security Hub, CloudTrail with anomaly detection.
  • Network and identity: Istio mTLS failures, unexpected external egress, OIDC/OAuth anomalies.
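
The RBAC-and-Secrets query above only works if the API server emits those events in the first place; a minimal audit Policy sketch that captures them in full while keeping everything else at metadata level:

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse            # full payloads for high-signal writes
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
      - group: ""                     # core API group
        resources: ["secrets"]
  - level: Metadata                   # everything else: who, what, when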

Falco rule example (shell in container):

- rule: Terminal shell in container
  desc: A shell was spawned in a container
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: "Shell spawned in container (user=%user.name container=%container.id image=%container.image.repository)"
  priority: WARNING
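
The same pattern covers the package-install detection mentioned above; a sketch (trim the process list to what your base images actually ship):

- rule: Package manager launched in container
  desc: A package manager ran inside a running container (possible drift or live patching)
  condition: spawned_process and container and proc.name in (apt, apt-get, yum, dnf, apk, pip, npm)
  output: "Package manager in container (cmd=%proc.cmdline container=%container.id image=%container.image.repository)"
  priority: WARNING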

Prometheus alert for privilege escalation attempts (from audit log exporter):

- alert: K8sPrivilegeEscalation
  expr: sum(rate(kube_audit_event_total{verb="create",resource="clusterrolebindings"}[5m])) by (user) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Possible privilege escalation by {{ $labels.user }}

Connect the dots: when Falco fires on shell spawn, enrich with deployment, commit_sha, image_digest, and latest cosign status. If the image is unsigned or missing SBOM, auto-quarantine the namespace.

Auto-remediation pattern:

  • Detect -> Tag workload with quarantine=true.
  • Kyverno policy denies network egress for quarantined pods.
  • PagerDuty alert routed with enriched context and rollback command link.
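
The enforcement half of that loop is ultimately just a deny-all-egress NetworkPolicy keyed on the quarantine label; a minimal sketch, assuming the quarantine=true label from the detect step (Kyverno can generate or sync this object into each namespace):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-egress        # pre-staged (or Kyverno-generated) in each namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"              # label applied by the detect step
  policyTypes:
    - Egress                          # no egress rules listed, so all egress is denied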

Keep auditors happy without killing velocity

Regulated data changes the calculus, but it doesn’t have to stall delivery. The trick is risk tiers and progressive enforcement.

  • Risk tiers

    • Tier 0 (PCI/HIPAA prod): block on any failed policy, require signed artifacts, mandatory peer review, mTLS, DLP egress scanning.
    • Tier 1 (prod non-regulated): block on criticals; warn on mediums with 7-day SLAs.
    • Tier 2 (staging/dev): warn-only, but still collect evidence.
  • Progressive enforcement

    1. Week 1–2: emit warnings (no blocks), measure noise.
    2. Week 3–4: block in Tier 0, warn in Tier 1–2.
    3. Week 5+: ratchet thresholds, add auto-remediation.
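    • In Kyverno this maps directly onto validationFailureActionOverrides; a ClusterPolicy spec fragment as a sketch (namespace names are illustrative):
spec:
  validationFailureAction: audit          # default: warn-only (Tier 2)
  validationFailureActionOverrides:
    - action: enforce                     # Tier 0: hard block
      namespaces: ["payments-prod"]
    - action: audit                       # Tier 1: warn while tuning, ratchet later
      namespaces: ["web-prod"]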
  • Time-bound exceptions

    • All waivers recorded as Exception CRDs with control, owner, scope, justification, and expiry.
apiVersion: security.gitplumbers.dev/v1
kind: Exception
metadata:
  name: allow-nodeport-temporary
spec:
  control: deny-nodeport
  owner: team-ml
  scope: namespace:ml-inference
  expires: 2025-01-31
  justification: "Partner demo; VPN cutover pending"
  • Data-aware controls
    • Tag resources with data classification (public, internal, restricted).
    • DLP on egress for restricted namespaces; block to unknown destinations.
    • Vaulted secrets (HashiCorp Vault or AWS KMS + Secrets Manager), rotated automatically; forbid inline Secret manifests.
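
A sketch of the classification guardrail, assuming the label values above (start it warn-only, per the progressive-enforcement rollout):

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-data-classification   # illustrative name
spec:
  validationFailureAction: audit      # start warn-only, then ratchet
  rules:
    - name: classification-label
      match:
        resources:
          kinds: [Namespace]
      validate:
        message: "Namespaces must carry a data-classification label (public, internal, or restricted)"
        pattern:
          metadata:
            labels:
              data-classification: "public | internal | restricted"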

Evidence for auditors

  • Attestations: SBOM, vuln scan, SLSA provenance, signature.
  • Policy results: pass/fail with policy versions and links to PRs.
  • Change approvals: PR review logs, change tickets, and deploy metadata.
  • Retention: 1–3 years in cold storage (S3 Glacier) with immutability (S3 Object Lock).

What to measure (and how you’ll know it works)

If it doesn’t change your graphs, it didn’t happen.

  • MTTD for critical runtime events: target < 2 minutes from first signal.
  • MTTR for policy-violating deploys: target < 15 minutes to rollback or remediate.
  • False-positive rate for high-severity alerts: < 5%.
  • Blocked deploy rate: < 2% in Tier 1; 0% in Tier 2 (warn-only).
  • Exception debt: count, aging, and percent expired.
  • Evidence completeness: % of prod images with SBOM + signature + provenance (> 98%).

Dashboards worth staring at

  • “What changed in the last 60 minutes?” (deploys, infra changes, RBAC updates)
  • Top policy violations by team/service
  • Runtime criticals mapped to commit SHAs
  • Exception queue with SLA timers

A 30/60/90 you can actually ship

30 days (prove the loop works)

  1. CI: add SBOM (syft), scan (grype --fail-on high), and cosign signing in one service.
  2. Admission: enforce digest pins and signature verify in one cluster.
  3. Runtime: enable Falco and one Prometheus alert; route to PagerDuty.
  4. Evidence: store attestations and CI logs; index by commit_sha and image_digest.

60 days (turn up the lights)

  1. Expand to top 10 services; add Checkov/conftest to Terraform repos.
  2. Ingest K8s audit logs; add privilege escalation alert.
  3. Roll out progressive enforcement in Tier 1; start exception CRDs.
  4. Dashboards for MTTD, MTTR, blocked deploys; weekly tuning.

90 days (make it boring and durable)

  1. Cover 80% of prod images with SBOM, signature, provenance.
  2. Canary+auto-rollback for Tier 0 with Flagger SLO hooks.
  3. DLP egress policy for restricted namespaces; enforce Vault-only secrets.
  4. Quarterly control reviews as code PRs, not meetings; auditors get read-only dashboards.

What this looks like when it works

  • A Datadog alert fires for unexpected ClusterRoleBinding creation.
  • The event is enriched with deployment=payments-api, commit=abc123, digest=sha256:..., signed=true, sbom=true.
  • Kyverno quarantines the namespace; Flagger rolls traffic back.
  • On-call clicks the evidence link: SLSA provenance, SBOM, scan results, PR approvals.
  • Postmortem: 7 minutes MTTD, 11 minutes MTTR, zero customer impact. Audit trail closed itself.

I’ve seen the opposite too: 3-hour MTTD because logs trickled into a SIEM, no idea who changed what, and a painful all-hands the next day. The difference isn’t budget—it’s wiring policy into the operational fabric and insisting on proofs.

Key takeaways

  • Real-time detection starts with event coverage across code, CI, deploy, and runtime—not just a SIEM feed.
  • Translate policies into machine-enforceable rules (OPA/Kyverno), not PDF checklists, and collect cryptographic proofs (attestations, SBOMs).
  • Use progressive enforcement: warn in dev, block in prod; risk-tier controls to keep delivery moving.
  • Standardize evidence pipelines: provenance, signatures, SBOMs, and policy pass/fail recorded per artifact.
  • Measure outcomes: MTTD, MTTR, false-positive rate, and time-to-exception-closure drive trust.
  • Start small: wire 3–5 critical controls end-to-end, then iterate with dashboards and auto-remediation.

Implementation checklist

  • Instrument code-to-prod event streams: `git`, CI, artifact registry, deploy, Kubernetes audit, cloud logs.
  • Encode policies as code using `OPA`/`Rego` or `Kyverno` and gate them in CI and admission controllers.
  • Generate and store attestations: SBOM (`syft`), vuln scan (`grype`), provenance (`slsa-github-generator`), signature (`cosign`).
  • Implement runtime detection: `Falco`/eBPF, `GuardDuty`, `CloudTrail`, `Istio` mTLS anomalies, and `Prometheus` alert rules.
  • Adopt progressive enforcement with risk tiers and time-bound exceptions.
  • Build dashboards for MTTD, MTTR, blocked deploys, and exception debt; tune weekly.

Questions we hear from teams

How do we avoid drowning in false positives?
Start with a narrow ruleset tied to change events and high-signal runtime detections (RBAC changes, unsigned images, shell spawns). Run in observe mode for two weeks, label every alert as actionable/not, and only then enable blocks. Track false-positive rate and require a tuning PR for every noisy rule.
Is this overkill for a non-regulated SaaS?
No—reduce scope. Keep signatures, SBOMs, and one or two runtime detections. The payback is faster incident triage and fewer Friday night rollbacks. You can still do warn-only in lower tiers and reserve hard blocks for prod.
We’re on EKS with multiple clusters. Where do we start?
Pick one cluster and one service on the critical path. Add digest pinning + signature verification in admission, and Falco with two rules. In CI, generate SBOM and sign artifacts. Ingest K8s audit logs. Expand by namespace, not by cluster, to keep blast radius contained.
Do we need a full SIEM to do this?
Helpful, not required. You can get far with OpenTelemetry, Loki/Tempo, and a modest Elasticsearch or Datadog footprint. The critical piece is correlation: consistent IDs across CI/CD and runtime so you can stitch events together.
What about AI-generated code and hallucinated dependencies?
Treat AI like a junior dev: guardrails plus review. Enforce dependency allowlists, scan SBOMs, and block new packages without a ticket. Use repo-level policies to require PR descriptions referencing tasks. Provenance attestations help prove where code came from and which pipeline built it.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to an engineer about real-time detection
See how we wire policy-as-code without blocking deploys
