Your DR Plan Won’t Save You From a Breach (Unless You Do This)
Most DR runbooks assume failed disks and bad deploys. Breaches behave differently. Here’s how to turn policies into guardrails, checks, and automated proofs—without turning delivery into molasses.
The DR runbook that choked on a breach
A few years back, I watched a team ace every quarterly DR test—then face-plant during a credential-stuffing breach. Their DR plan assumed dead AZs and flaky disks. The attacker had persistence in an app node, lateral movement via a too-wide iam:PassRole, and was exfiltrating S3 through an allowed egress. Failing over just moved the malware and the role abuse to a new region faster.
I’ve seen this fail more than once. Traditional DR is about uptime. Breach-ready DR is about isolation, integrity, and provable cleanliness. Different playbook. Different metrics. Different automation.
If your DR plan doesn’t start with isolation and end with auditable proof of cleanliness, it’s theater.
Make security incidents first-class in DR
When you plan for burst pipes but not arson, your insurance is fiction. For breach scenarios, you need extra objectives beyond classic RTO and RPO:
- RTO-S (Time to Secure/Isolate): Time to cut persistence and stop exfiltration.
- RTO-C (Time to Clean): Time to restore to a known-good, verified state in a sterile environment.
- Data Integrity SLO: Evidence that restored data wasn’t tampered with (hashes, signatures).
Concrete breach DR steps I’ve seen work:
- Isolate first. Toggle pre-built kill switches: block network egress, revoke OAuth tokens, disable compromised service accounts, rotate KMS keys where practical.
- Freeze and fork. Snapshot compromised assets for forensics; fork operations to a clean-room environment built from Git and immutable artifacts.
- Rehydrate cleanly. Restore from immutable backups; redeploy infra/app from code; replay data with integrity checks before reopening egress.
- Prove it. Attach attestations, backup hashes, and policy-evaluation reports to the incident record.
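The "prove it" step is cheap to automate: hash every artifact into a manifest at containment time so the incident record carries verifiable evidence. A minimal sketch (file names and contents are illustrative stand-ins for real snapshots and reports):

```shell
#!/usr/bin/env bash
# Sketch: bundle incident evidence with hashes. Paths and contents are
# illustrative; in practice these are snapshot IDs, policy reports, and dumps.
set -euo pipefail

evidence_dir=$(mktemp -d)

# Stand-ins for real artifacts.
echo '{"finding":"exfil-suspected"}' > "$evidence_dir/guardduty.json"
echo "snap-0abc123 frozen"           > "$evidence_dir/snapshots.txt"

# Hash everything into one manifest; this is what gets signed and WORM-stored.
(cd "$evidence_dir" && sha256sum guardduty.json snapshots.txt > MANIFEST.sha256)

# Verification is a one-liner anyone can rerun later.
(cd "$evidence_dir" && sha256sum -c MANIFEST.sha256)
```

The manifest, not the raw files, is what you sign and attach to the incident record; anyone can recompute the hashes later.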
Metrics to track:
- MTTC (Mean Time to Contain), RTO-S, RTO-C
- % of controls enforced as code vs. docs
- Restore verification pass rate (checksums, DB consistency)
From policies to guardrails: encode, enforce, and prove
Your auditor doesn’t want a PDF; they want evidence your controls actually run. Translate policy into:
- Guardrails: Pre-approved configs and golden modules.
- Checks: Policy-as-code in CI, admission controllers, and drift detection.
- Automated proofs: Signed reports stored on WORM (write-once-read-many) storage.
Here’s a simple OPA/Rego example that flags unencrypted S3 buckets in Terraform plans (conftest acts on deny rules, so the policy is written deny-style):

package terraform.s3

deny[msg] {
  some r
  rc := input.resource_changes[r]
  rc.type == "aws_s3_bucket"
  not encryption_enabled(rc)
  msg := sprintf("%s must enable server-side encryption", [rc.address])
}

encryption_enabled(rc) {
  rc.change.after.server_side_encryption_configuration[_].rule[_].apply_server_side_encryption_by_default[_].sse_algorithm != ""
}

Wire it into CI with conftest and produce an artifact:
# .github/workflows/policy.yml
name: policy
on: [pull_request]
jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform Plan JSON
        run: |
          terraform -chdir=infra init
          terraform -chdir=infra plan -out tf.plan
          terraform -chdir=infra show -json tf.plan > tfplan.json
      - name: OPA Policy Check
        run: |
          conftest test tfplan.json --policy policies/rego --namespace terraform.s3 --output junit > policy.xml
      - name: Upload Evidence
        if: always() # keep the evidence even when the policy check fails
        uses: actions/upload-artifact@v4
        with:
          name: policy-evidence
          path: policy.xml

For Kubernetes, use Kyverno or Gatekeeper to reject risky workloads at admission:
# Kyverno: deny egress to 0.0.0.0/0 in regulated namespaces
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: deny-wide-egress }
spec:
  validationFailureAction: enforce
  rules:
    - name: no-wide-egress
      match:
        resources:
          kinds: [NetworkPolicy]
          namespaces: ["pii-*"]
      validate:
        message: "Egress to 0.0.0.0/0 is not allowed in regulated namespaces"
        pattern:
          spec:
            egress:
              - to:
                  - ipBlock:
                      cidr: "!0.0.0.0/0"

Automated proofs your auditor will actually accept:
- OPA/Kyverno evaluation reports
- terraform plan JSON + tfsec/checkov results
- Signed SBOMs (CycloneDX) via syft + attestations with cosign/in-toto
- Cloud config queries (e.g., steampipe) stored with hashes
Tip: Have CI sign evidence with cosign and push to an evidence bucket with Object Lock. Auditors love WORM. Attackers hate it.
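A minimal Terraform sketch of such an evidence bucket (bucket name and retention are illustrative; note that COMPLIANCE-mode retention cannot be shortened, even by the root account):

```hcl
# Evidence bucket with S3 Object Lock in compliance mode (sketch).
resource "aws_s3_bucket" "evidence" {
  bucket              = "org-evidence-worm" # hypothetical name
  object_lock_enabled = true
}

resource "aws_s3_bucket_versioning" "evidence" {
  bucket = aws_s3_bucket.evidence.id
  versioning_configuration {
    status = "Enabled" # Object Lock requires versioning
  }
}

resource "aws_s3_bucket_object_lock_configuration" "evidence" {
  bucket = aws_s3_bucket.evidence.id
  rule {
    default_retention {
      mode = "COMPLIANCE" # nobody, including root, can delete before expiry
      days = 365
    }
  }
}
```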
Clean-room failover: restore sterile, not fast-and-dirty
Failing over to an infected environment is speed-running your own postmortem. Build a clean room:
- Immutable backups: S3 Object Lock and AWS Backup Vault Lock (governance or compliance mode). Same story on Azure with Immutable Blob, on GCP with Bucket Lock.
- GitOps redeploy: Infra and apps rehydrated via ArgoCD or Flux from signed manifests.
- No trust from prod: No peering, no shared secrets, no shared IAM roles.
Minimal Terraform to get AWS Backup with Vault Lock:
resource "aws_backup_vault" "main" {
  name = "dr-vault"
}

resource "aws_backup_vault_lock_configuration" "lock" {
  backup_vault_name   = aws_backup_vault.main.name
  min_retention_days  = 30
  max_retention_days  = 365
  changeable_for_days = 3
}

resource "aws_backup_plan" "daily" {
  name = "daily-immutable"
  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)" # daily 03:00 UTC
    lifecycle {
      delete_after = 90
    }
  }
}

Rehydrate with ArgoCD into the clean room:
# app-of-apps ArgoCD pattern
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform
spec:
  project: default
  source:
    repoURL: https://github.com/org/platform.git
    targetRevision: main
    path: clusters/cleanroom
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: { prune: true, selfHeal: true }
    syncOptions: [CreateNamespace=true]

Guardrail: only allow signed images in clean room:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: verify-signed-images }
spec:
  validationFailureAction: enforce
  rules:
    - name: cosign-verify
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - image: "ghcr.io/org/*"
          key: "k8s://openshift-pipelines/signing-keys"

Keep delivery fast under regulated data constraints
This is where teams either slow to a crawl or cheat. The path that works:
- Golden path for regulated workloads: Pre-approved base images, Terraform modules, K8s charts, and CI templates with guardrails baked in. If you’re HIPAA/PCI, your developers shouldn’t debate TLS or KMS—those are defaults.
- Dual lanes:
- Fast lane: non-regulated features using synthetic data and ephemeral environments.
- Governed lane: regulated data, with extra checks and pre-approval baked into templates.
- Data tactics: tokenization, field-level encryption, and synthetic datasets for tests. Keep real PII/PAN/PHI only in the governed lane.
- Mesh egress policies: lock down Istio/Linkerd egress for regulated namespaces; only approved destinations.
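In Istio, for example, this can be sketched as a namespace-scoped Sidecar that blocks anything outside the mesh registry, plus one ServiceEntry per approved destination (the namespace and host below are illustrative):

```yaml
# Sketch: assume a regulated namespace "pii-payments" (hypothetical).
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: pii-payments
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY # only destinations known to the mesh are reachable
---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-payments-api
  namespace: pii-payments
spec:
  hosts:
    - api.stripe.com # example approved destination
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
```

Adding a destination becomes a reviewed pull request, not a firewall ticket.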
Sample GitHub Actions matrix with different lanes:
jobs:
  build-test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        lane: [fast, governed]
    steps:
      - uses: actions/checkout@v4
      - run: make build
      - if: matrix.lane == 'fast'
        run: make test-synthetic
      - if: matrix.lane == 'governed'
        run: |
          make test-synthetic
          conftest test tfplan.json --policy policies/rego
          kubectl apply -f policies/kyverno

Break-glass access (time-bound, logged):
- Use AWS IAM Identity Center with just-in-time elevation and iam:PassRole scoped to tickets.
- Auto-expire credentials; dump CloudTrail and SSO audit logs to a locked bucket.
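One way to make break-glass access self-expiring is an IAM policy condition on wall-clock time, stamped by automation when the ticket is opened (a sketch; the account ID, role name, and expiry are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BreakGlassUntilTicketExpiry",
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::123456789012:role/break-glass-ops",
      "Condition": {
        "DateLessThan": { "aws:CurrentTime": "2025-01-15T04:00:00Z" }
      }
    }
  ]
}
```

CloudTrail records every AssumeRole call, so the audit trail comes for free.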
DLP isn’t a silver bullet, but VPC egress allowlists, mesh egress gateways, and Git pre-commit secret scans (gitleaks, trufflehog) keep you safe without killing velocity.
Test like you mean it: isolation, restore, verification
Your plan is only as good as the last drill. Blend tabletop with hands-on chaos:
- Containment drill: simulate GuardDuty finding exfil; flip egress-deny NetworkPolicies, revoke tokens, and rotate critical secrets.
- Restore drill: rebuild clean room from scratch; restore DB; verify integrity; run smoke tests.
- Chaos: use AWS FIS or Gremlin to inject failures while restoring.
Quick DB restore verification example:
#!/usr/bin/env bash
set -euo pipefail
pg_restore -h cleanroom-db -U restore -d app < /backup/immutable.dump
psql -h cleanroom-db -U restore -d app -c "SELECT COUNT(*) FROM orders;" | tee /evidence/rowcount.txt
sha256sum /backup/immutable.dump | tee /evidence/backup.sha256

Track what matters:
- RTO-S: isolation in <10 minutes
- RTO-C: sterile restore <2 hours
- RPO: <15 minutes (binlog/CDC)
- Integrity: checksum match rate 100%, app smoke tests green
- SLOs: “Time to policy-compliant state after restore” <15 minutes
Prometheus note: expose a dr_last_restore_ok metric and alert if stale.
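A sketch of that alert, assuming dr_last_restore_ok is exported as the unix timestamp of the last verified restore (the metric name comes from the note above; the 35-day threshold is illustrative):

```yaml
groups:
  - name: dr
    rules:
      - alert: DrRestoreEvidenceStale
        # Fires when the last verified clean-room restore is over 35 days old,
        # or the metric has disappeared entirely.
        expr: (time() - dr_last_restore_ok > 35 * 86400) or absent(dr_last_restore_ok)
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "No verified DR restore in the last 35 days"
```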
Evidence or it didn’t happen: automated proofs for auditors
Don’t hand-wave. Produce durable artifacts every run:
- steampipe snapshot of cloud posture
- Policy results (OPA/Kyverno/Gatekeeper)
- SBOMs + signatures
- Backup inventories (AWS Backup/Azure Backup) with retention configs
Example steampipe query to prove S3 encryption:
select name, server_side_encryption_configuration from aws_s3_bucket where server_side_encryption_configuration is null;

Automate and store on WORM:
steampipe query --output json "$(cat queries/s3-encryption.sql)" > s3_encryption.json
cosign sign-blob --key cosign.key s3_encryption.json > s3_encryption.sig
aws s3 cp s3_encryption.json s3://evidence-worm/$(date +%F)/ --request-payer requester
aws s3 cp s3_encryption.sig s3://evidence-worm/$(date +%F)/ --request-payer requester

For software supply chain, sign images and attest builds:
cosign sign ghcr.io/org/svc:$(git rev-parse --short HEAD)
cosign attest --predicate sbom.cdx.json --key cosign.key ghcr.io/org/svc:$(git rev-parse --short HEAD)

When auditors ask, you show immutable logs, signatures, and passing policies—not a SharePoint graveyard.
A 30-60-90 that actually ships
30 days:
- Define breach-specific objectives: RTO-S, RTO-C, integrity checks.
- Implement 3 high-value guardrails (KMS on, no public buckets, restricted egress).
- Turn on immutable backups (Object Lock/Vault Lock) for crown jewels.
60 days:
- Stand up clean-room environment; wire ArgoCD to rebuild from Git.
- Add CI policy gates (OPA/Kyverno), secrets scanning, and evidence signing.
- Run first tabletop + restore drill; capture metrics.
90 days:
- Tighten mesh egress; implement tokenization/synthetic data path.
- Integrate automated proofs into audit process (SOC 2/HIPAA/PCI).
- Run a breach game day with chaos injection; iterate on RTO-S/RTO-C.
I’ve watched teams go from “we pass DR but would fail a breach” to “we isolate in minutes and restore sterile in under two hours” with this approach. It’s not flashy. It’s boring-by-design and it works. And yes—delivery gets faster when decisions are encoded, not debated.
Key takeaways
- Treat breaches as a distinct DR class: isolate first, then restore to a clean room.
- Translate policies into guardrails and automated proofs using OPA/Rego, Kyverno/Gatekeeper, and CI gates.
- Use immutable, air-gapped backups (Object Lock/Vault Lock) and rehearse clean-room rebuilds via GitOps.
- Keep delivery fast by creating a regulated “golden path” with pre-approved components and synthetic data.
- Continuously test: game days, chaos, and restore drills with measurable RTO/RPO and auditor-ready evidence.
Implementation checklist
- Define RTO/RPO plus RTO-S (time to secure/isolate) for breach scenarios.
- Codify guardrails with OPA/Kyverno and enforce via CI/CD and admission controllers.
- Implement immutable backups (S3 Object Lock, AWS Backup Vault Lock) and document break-glass.
- Stand up a clean-room environment and practice rehydration via ArgoCD/Flux from Git.
- Create a regulated golden path with tokenization, synthetic data, and egress controls.
- Automate evidence capture (attestations, SBOMs, policy reports) and store on WORM.
- Run quarterly breach game days and restore drills; track MTTC, RTO-S/RTO-C, RPO, and data integrity KPIs.
Questions we hear from teams
- How often should we test breach-ready DR?
- Quarterly, minimum. Alternate between tabletop and full restore to a clean room. Include identity/key rotation in at least one drill per year.
- What if immutable backups slow down restores?
- Use tiered backups: frequent incremental snapshots for speed, periodic immutable copies for safety. Practice both. Your RTO-C should account for verification time, not just restore time.
- Can we skip GitOps for the clean room?
- You can, but you’ll reinvent it. GitOps gives you declarative, repeatable state, drift detection, and a clean audit trail. In a breach, you want zero manual snowflakes.
- Which tools do auditors actually accept as evidence?
- Signed policy reports (OPA/Kyverno), IaC plans, SBOMs and image signatures (cosign/in-toto), backup retention configs, and posture queries (steampipe). The key is immutability and provenance—sign and store on WORM.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
