Your DR Plan Won’t Save You From a Breach (Unless You Do This)
Most DR runbooks assume failed disks and bad deploys. Breaches behave differently. Here’s how to turn policies into guardrails, checks, and automated proofs—without turning delivery into molasses.
The DR runbook that choked on a breach
A few years back, I watched a team ace every quarterly DR test—then face-plant during a credential-stuffing breach. Their DR plan assumed dead AZs and flaky disks. The attacker had persistence in an app node, lateral movement via a too-wide iam:PassRole, and was exfiltrating S3 through an allowed egress. Failing over just moved the malware and the role abuse to a new region faster.
I’ve seen this fail more than once. Traditional DR is about uptime. Breach-ready DR is about isolation, integrity, and provable cleanliness. Different playbook. Different metrics. Different automation.
If your DR plan doesn’t start with isolation and end with auditable proof of cleanliness, it’s theater.
Make security incidents first-class in DR
When you plan for burst pipes but not arson, your insurance is fiction. For breach scenarios, you need extra objectives beyond classic RTO and RPO:
- RTO-S (Time to Secure/Isolate): Time to cut persistence and stop exfiltration.
- RTO-C (Time to Clean): Time to restore to a known-good, verified state in a sterile environment.
- Data Integrity SLO: Evidence that restored data wasn’t tampered with (hashes, signatures).
Concrete breach DR steps I’ve seen work:
- Isolate first. Toggle pre-built kill switches: block network egress, revoke OAuth tokens, disable compromised service accounts, rotate KMS keys where practical.
- Freeze and fork. Snapshot compromised assets for forensics; fork operations to a clean-room environment built from Git and immutable artifacts.
- Rehydrate cleanly. Restore from immutable backups; redeploy infra/app from code; replay data with integrity checks before reopening egress.
- Prove it. Attach attestations, backup hashes, and policy-evaluation reports to the incident record.
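The "prove it" step is cheap to automate: hash every artifact into a manifest at containment time so the incident record carries verifiable evidence. A minimal sketch (file names and contents are illustrative stand-ins for real snapshots and reports):

```shell
#!/usr/bin/env bash
# Sketch: bundle incident evidence with hashes. Paths and contents are
# illustrative; in practice these are snapshot IDs, policy reports, and dumps.
set -euo pipefail

evidence_dir=$(mktemp -d)

# Stand-ins for real artifacts.
echo '{"finding":"exfil-suspected"}' > "$evidence_dir/guardduty.json"
echo "snap-0abc123 frozen"           > "$evidence_dir/snapshots.txt"

# Hash everything into one manifest; this is what gets signed and WORM-stored.
(cd "$evidence_dir" && sha256sum guardduty.json snapshots.txt > MANIFEST.sha256)

# Verification is a one-liner anyone can rerun later.
(cd "$evidence_dir" && sha256sum -c MANIFEST.sha256)
```

The manifest, not the raw files, is what you sign and attach to the incident record; anyone can recompute the hashes later.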
Metrics to track:
- MTTC (Mean Time to Contain), RTO-S, RTO-C
- % of controls enforced as code vs. docs
- Restore verification pass rate (checksums, DB consistency)
From policies to guardrails: encode, enforce, and prove
Your auditor doesn’t want a PDF; they want evidence your controls actually run. Translate policy into:
- Guardrails: Pre-approved configs and golden modules.
- Checks: Policy-as-code in CI, admission controllers, and drift detection.
- Automated proofs: Signed reports stored on WORM (write-once-read-many) storage.
Here’s a simple OPA/Rego example that flags unencrypted S3 buckets in Terraform plans (conftest acts on deny rules, so the policy is written deny-style):

package terraform.s3

deny[msg] {
  some r
  rc := input.resource_changes[r]
  rc.type == "aws_s3_bucket"
  not encryption_enabled(rc)
  msg := sprintf("%s must enable server-side encryption", [rc.address])
}

encryption_enabled(rc) {
  rc.change.after.server_side_encryption_configuration[_].rule[_].apply_server_side_encryption_by_default[_].sse_algorithm != ""
}

Wire it into CI with conftest and produce an artifact:
# .github/workflows/policy.yml
name: policy
on: [pull_request]
jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform Plan JSON
        run: |
          terraform -chdir=infra init
          terraform -chdir=infra plan -out tf.plan
          terraform -chdir=infra show -json tf.plan > tfplan.json
      - name: OPA Policy Check
        run: |
          conftest test tfplan.json --policy policies/rego --namespace terraform.s3 --output junit > policy.xml
      - name: Upload Evidence
        if: always() # keep the evidence even when the policy check fails
        uses: actions/upload-artifact@v4
        with:
          name: policy-evidence
          path: policy.xml

For Kubernetes, use Kyverno or Gatekeeper to reject risky workloads at admission:
# Kyverno: deny egress to 0.0.0.0/0 in regulated namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: deny-wide-egress }
spec:
  validationFailureAction: enforce
  rules:
    - name: no-wide-egress
      match:
        resources:
          kinds: [NetworkPolicy]
          namespaces: ["pii-*"]
      validate:
        message: "Egress to 0.0.0.0/0 is not allowed in regulated namespaces"
        pattern:
          spec:
            egress:
              - to:
                  - ipBlock:
                      cidr: "!0.0.0.0/0"

Automated proofs your auditor will actually accept:
- OPA/Kyverno evaluation reports
- terraform plan JSON + tfsec/checkov results
- Signed SBOMs (CycloneDX) via syft + attestations with cosign/in-toto
- Cloud config queries (e.g., steampipe) stored with hashes
Tip: Have CI sign evidence with cosign and push to an evidence bucket with Object Lock. Auditors love WORM. Attackers hate it.
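A minimal Terraform sketch of such an evidence bucket (bucket name and retention are illustrative; note that COMPLIANCE-mode retention cannot be shortened, even by the root account):

```hcl
# Evidence bucket with S3 Object Lock in compliance mode (sketch).
resource "aws_s3_bucket" "evidence" {
  bucket              = "org-evidence-worm" # hypothetical name
  object_lock_enabled = true
}

resource "aws_s3_bucket_versioning" "evidence" {
  bucket = aws_s3_bucket.evidence.id
  versioning_configuration {
    status = "Enabled" # Object Lock requires versioning
  }
}

resource "aws_s3_bucket_object_lock_configuration" "evidence" {
  bucket = aws_s3_bucket.evidence.id
  rule {
    default_retention {
      mode = "COMPLIANCE" # nobody, including root, can delete before expiry
      days = 365
    }
  }
}
```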
Clean-room failover: restore sterile, not fast-and-dirty
Failing over to an infected environment is speed-running your own postmortem. Build a clean room:
- Immutable backups: S3 Object Lock and AWS Backup Vault Lock (governance or compliance mode). Same story on Azure with Immutable Blob, on GCP with Bucket Lock.
- GitOps redeploy: Infra and apps rehydrated via ArgoCD or Flux from signed manifests.
- No trust from prod: No peering, no shared secrets, no shared IAM roles.
Minimal Terraform to get AWS Backup with Vault Lock:
resource "aws_backup_vault" "main" {
  name = "dr-vault"
}

resource "aws_backup_vault_lock_configuration" "lock" {
  backup_vault_name   = aws_backup_vault.main.name
  min_retention_days  = 30
  max_retention_days  = 365
  changeable_for_days = 3
}

resource "aws_backup_plan" "daily" {
  name = "daily-immutable"
  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)" # daily 03:00 UTC
    lifecycle {
      delete_after = 90
    }
  }
}

Rehydrate with ArgoCD into the clean room:
# app-of-apps ArgoCD pattern
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
spec:
  project: default
  source:
    repoURL: https://github.com/org/platform.git
    targetRevision: main
    path: clusters/cleanroom
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]

Guardrail: only allow signed images in clean room:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: verify-signed-images }
spec:
  validationFailureAction: enforce
  rules:
    - name: cosign-verify
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - image: "ghcr.io/org/*"
          key: "k8s://openshift-pipelines/signing-keys"

Keep delivery fast under regulated data constraints
This is where teams either slow to a crawl or cheat. The path that works:
- Golden path for regulated workloads: Pre-approved base images, Terraform modules, K8s charts, and CI templates with guardrails baked in. If you’re HIPAA/PCI, your developers shouldn’t debate TLS or KMS—those are defaults.
- Dual lanes:
- Fast lane: non-regulated features using synthetic data and ephemeral environments.
- Governed lane: regulated data, with extra checks and pre-approval baked into templates.
- Data tactics: tokenization, field-level encryption, and synthetic datasets for tests. Keep real PII/PAN/PHI only in the governed lane.
- Mesh egress policies: lock down Istio/Linkerd egress for regulated namespaces; only approved destinations.
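In Istio, for example, this can be sketched as a namespace-scoped Sidecar that blocks anything outside the mesh registry, plus one ServiceEntry per approved destination (the namespace and host below are illustrative):

```yaml
# Sketch: assume a regulated namespace "pii-payments" (hypothetical).
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: pii-payments
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY # only destinations known to the mesh are reachable
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-payments-api
  namespace: pii-payments
spec:
  hosts:
    - api.stripe.com # example approved destination
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
```

Adding a destination becomes a reviewed pull request, not a firewall ticket.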
Sample GitHub Actions matrix with different lanes:
jobs:
  build-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        lane: [fast, governed]
    steps:
      - uses: actions/checkout@v4
      - run: make build
      - if: matrix.lane == 'fast'
        run: make test-synthetic
      - if: matrix.lane == 'governed'
        run: |
          make test-synthetic
          conftest test tfplan.json --policy policies/rego
          kubectl apply -f policies/kyverno

Break-glass access (time-bound, logged):
- Use AWS IAM Identity Center with just-in-time elevation and iam:PassRole scoped to tickets.
- Auto-expire credentials; dump CloudTrail and SSO audit logs to a locked bucket.
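One way to make break-glass access self-expiring is an IAM policy condition on wall-clock time, stamped by automation when the ticket is opened (a sketch; the account ID, role name, and expiry are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BreakGlassUntilTicketExpiry",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/break-glass-ops",
      "Condition": {
        "DateLessThan": { "aws:CurrentTime": "2025-01-15T04:00:00Z" }
      }
    }
  ]
}
```

CloudTrail records every AssumeRole call, so the audit trail comes for free.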
DLP isn’t a silver bullet, but VPC egress allowlists, mesh egress gateways, and Git pre-commit secret scans (gitleaks, trufflehog) keep you safe without killing velocity.
Test like you mean it: isolation, restore, verification
Your plan is only as good as the last drill. Blend tabletop with hands-on chaos:
- Containment drill: simulate GuardDuty finding exfil; flip egress-deny NetworkPolicies, revoke tokens, and rotate critical secrets.
- Restore drill: rebuild clean room from scratch; restore DB; verify integrity; run smoke tests.
- Chaos: use AWS FIS or Gremlin to inject failures while restoring.
Quick DB restore verification example:
#!/usr/bin/env bash
set -euo pipefail
pg_restore -h cleanroom-db -U restore -d app < /backup/immutable.dump
psql -h cleanroom-db -U restore -d app -c "SELECT COUNT(*) FROM orders;" | tee /evidence/rowcount.txt
sha256sum /backup/immutable.dump | tee /evidence/backup.sha256

Track what matters:
- RTO-S: isolation in <10 minutes
- RTO-C: sterile restore <2 hours
- RPO: <15 minutes (binlog/CDC)
- Integrity: checksum match rate 100%, app smoke tests green
- SLOs: “Time to policy-compliant state after restore” <15 minutes
Prometheus note: expose a dr_last_restore_ok metric and alert if stale.
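A sketch of that alert, assuming dr_last_restore_ok is exported as the unix timestamp of the last verified restore (the metric name comes from the note above; the 35-day threshold is illustrative):

```yaml
groups:
  - name: dr
    rules:
      - alert: DrRestoreEvidenceStale
        # Fires when the last verified clean-room restore is over 35 days old,
        # or the metric has disappeared entirely.
        expr: (time() - dr_last_restore_ok > 35 * 86400) or absent(dr_last_restore_ok)
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "No verified DR restore in the last 35 days"
```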
Evidence or it didn’t happen: automated proofs for auditors
Don’t hand-wave. Produce durable artifacts every run:
- steampipe snapshot of cloud posture
- Policy results (OPA/Kyverno/Gatekeeper)
- SBOMs + signatures
- Backup inventories (AWS Backup/Azure Backup) with retention configs
Example steampipe query to prove S3 encryption:
select name, server_side_encryption_configuration from aws_s3_bucket where server_side_encryption_configuration is null;

Automate and store on WORM:
steampipe query --output json "$(cat queries/s3-encryption.sql)" > s3_encryption.json
cosign sign-blob --key cosign.key s3_encryption.json > s3_encryption.sig
aws s3 cp s3_encryption.json s3://evidence-worm/$(date +%F)/ --request-payer requester
aws s3 cp s3_encryption.sig s3://evidence-worm/$(date +%F)/ --request-payer requester

For software supply chain, sign images and attest builds:
cosign sign ghcr.io/org/svc:$(git rev-parse --short HEAD)
cosign attest --predicate sbom.cdx.json --key cosign.key ghcr.io/org/svc:$(git rev-parse --short HEAD)

When auditors ask, you show immutable logs, signatures, and passing policies—not a SharePoint graveyard.
A 30-60-90 that actually ships
30 days:
- Define breach-specific objectives: RTO-S, RTO-C, integrity checks.
- Implement 3 high-value guardrails (KMS on, no public buckets, restricted egress).
- Turn on immutable backups (Object Lock/Vault Lock) for crown jewels.
60 days:
- Stand up clean-room environment; wire ArgoCD to rebuild from Git.
- Add CI policy gates (OPA/Kyverno), secrets scanning, and evidence signing.
- Run first tabletop + restore drill; capture metrics.
90 days:
- Tighten mesh egress; implement tokenization/synthetic data path.
- Integrate automated proofs into audit process (SOC 2/HIPAA/PCI).
- Run a breach game day with chaos injection; iterate on RTO-S/RTO-C.
I’ve watched teams go from “we pass DR but would fail a breach” to “we isolate in minutes and restore sterile in under two hours” with this approach. It’s not flashy. It’s boring-by-design and it works. And yes—delivery gets faster when decisions are encoded, not debated.
Key takeaways
- Treat breaches as a distinct DR class: isolate first, then restore to a clean room.
- Translate policies into guardrails and automated proofs using OPA/Rego, Kyverno/Gatekeeper, and CI gates.
- Use immutable, air-gapped backups (Object Lock/Vault Lock) and rehearse clean-room rebuilds via GitOps.
- Keep delivery fast by creating a regulated “golden path” with pre-approved components and synthetic data.
- Continuously test: game days, chaos, and restore drills with measurable RTO/RPO and auditor-ready evidence.
Implementation checklist
- Define RTO/RPO plus RTO-S (time to secure/isolate) for breach scenarios.
- Codify guardrails with OPA/Kyverno and enforce via CI/CD and admission controllers.
- Implement immutable backups (S3 Object Lock, AWS Backup Vault Lock) and document break-glass.
- Stand up a clean-room environment and practice rehydration via ArgoCD/Flux from Git.
- Create a regulated golden path with tokenization, synthetic data, and egress controls.
- Automate evidence capture (attestations, SBOMs, policy reports) and store on WORM.
- Run quarterly breach game days and restore drills; track MTTC, RTO-S/RTO-C, RPO, and data integrity KPIs.
Questions we hear from teams
- How often should we test breach-ready DR?
- Quarterly, minimum. Alternate between tabletop and full restore to a clean room. Include identity/key rotation in at least one drill per year.
- What if immutable backups slow down restores?
- Use tiered backups: frequent incremental snapshots for speed, periodic immutable copies for safety. Practice both. Your RTO-C should account for verification time, not just restore time.
- Can we skip GitOps for the clean room?
- You can, but you’ll reinvent it. GitOps gives you declarative, repeatable state, drift detection, and a clean audit trail. In a breach, you want zero manual snowflakes.
- Which tools do auditors actually accept as evidence?
- Signed policy reports (OPA/Kyverno), IaC plans, SBOMs and image signatures (cosign/in-toto), backup retention configs, and posture queries (steampipe). The key is immutability and provenance—sign and store on WORM.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
