Disaster Recovery That Doesn’t Crumble in a Breach: Guardrails, Checks, and Automated Proofs
If your DR plan assumes “hardware failure” but not “root creds stolen,” you don’t have a DR plan—you have wishful thinking. Here’s how to translate policy into real guardrails, run breach-grade drills, and ship fast without nuking compliance.
“If your DR plan doesn’t assume your automation can be turned against you, it’s not a DR plan. It’s a fantasy.”
The outage that wasn’t an outage
I’ve lived the DR fairy tale that turned horror story: payments company, multi-AZ RDS, nightly snapshots, warm standby in us-east-2. Looked great on the slide. Then a contractor’s laptop got popped, long‑lived AWS keys exfiltrated. Attacker rotated secrets, disabled alarms, and encrypted the primaries. The warm standby? Same account, same IAM blast radius. Gone in 14 minutes.
Traditional DR assumes disks fail and regions wobble. Security incidents assume your automations work against you. Different game. If your plan doesn’t include credential burn, quarantine, clean-room restores, and immutable evidence for auditors, you’re betting the company on luck.
This is the playbook we now implement at GitPlumbers: translate policies into guardrails, checks, and proofs; design DR for breach modes; and keep delivery velocity without blowing up compliance.
What changes when the disaster is a breach
When the root cause is adversarial, your recovery must expect your control plane is hostile.
- Assume compromised credentials: API keys, GitHub tokens, cloud SSO sessions. Your first move is containment, not restore.
- Quarantine before recovery: Service Control Policies (SCPs) or org policies to slam the door while you rotate.
- Restore into a clean room: New accounts/projects/subscriptions, new KMS keys, new networking, restricted egress. Only then reconnect.
- Immutability over availability: Snapshots and backups must be locked (e.g., AWS Backup Vault Lock, S3 Object Lock) so an attacker can’t prune your parachute.
- Evidence by default: Every control (policy pass, backup, restore) must produce timestamped, signed artifacts auditors will accept.
If your DR doc doesn’t explicitly list ransomware, insider threat, and cloud credential leak as scenarios with RTO/RPO and runbooks, it’s incomplete.
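A minimal sketch of what that catalog can look like; the file name, field names, and targets are our own illustrative convention, not a standard:

```yaml
# dr-scenarios.yaml — illustrative catalog entries; adapt fields to your own runbook index
scenarios:
  - name: ransomware-prod-db
    trigger: primary datastores encrypted via compromised automation
    rto: 4h
    rpo: 15m
    runbook: runbooks/ransomware-clean-room-restore.md
  - name: cloud-credential-leak
    trigger: long-lived access keys exfiltrated from a developer laptop
    rto: 2h          # time to contain, not to fully restore
    rpo: n/a
    runbook: runbooks/credential-burn.md
  - name: insider-exfil
    trigger: privileged user copies regulated data out of band
    rto: 8h
    rpo: 1h
    runbook: runbooks/insider-quarantine.md
```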
Turn policy into guardrails, checks, and automated proofs
Policies like “encrypt PHI at rest” are useless unless they compile into code that blocks bad changes and produces receipts.
Guardrails (prevent unsafe changes)
- Cloud: Terraform + OPA/Sentinel/Checkov to block non‑encrypted EBS, public S3, open SGs.
- K8s: Admission control with OPA Gatekeeper or Kyverno to enforce PodSecurity, NetworkPolicy, and image signatures.
- Org: AWS SCPs or Azure Policies that deny key deletion, public ACLs, and unrestricted egress.
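As a concrete sketch of the K8s guardrail, a Kyverno ClusterPolicy that blocks unsigned images; the registry pattern and key are placeholders you would replace with your own:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce   # reject, don't just audit
  rules:
    - name: verify-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"   # placeholder: your private registry
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...your Cosign public key...
                      -----END PUBLIC KEY-----
```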
Checks (detect drift and misconfig)
- CI: tflint, tfsec/Checkov, and conftest on Terraform plans; kubeconform plus Kyverno unit tests on manifests.
- Runtime: AWS Config and Detective, plus Prometheus metrics on policy violations; cluster policy reports.
Automated proofs (evidence you can hand to an auditor)
- Store policy verdicts, backup logs, and restore outcomes as immutable artifacts (e.g., S3 Object Lock).
- Sign build and deployment provenance with Sigstore Cosign; adopt SLSA attestations.
Example: OPA policy to block unencrypted storage and public buckets in Terraform plans:

```rego
package terraform.security

# Deny unencrypted EBS volumes
deny[msg] {
  r := input.resource_changes[_]
  r.type == "aws_ebs_volume"
  not r.change.after.encrypted
  msg := sprintf("EBS volume %s must be encrypted with a KMS CMK", [r.address])
}

# Deny public S3 bucket ACLs (Rego has no infix `or`; match against a set)
deny[msg] {
  r := input.resource_changes[_]
  r.type == "aws_s3_bucket_acl"
  acl := r.change.after.acl
  {"public-read", "public-read-write"}[acl]
  msg := sprintf("Public S3 ACL not allowed on %s", [r.address])
}
```

And an SCP we pre-stage to protect keys and block public ACLs across all accounts:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyKMSDisableAndDelete",
      "Effect": "Deny",
      "Action": [
        "kms:ScheduleKeyDeletion",
        "kms:DisableKey",
        "kms:DisableKeyRotation"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyPublicS3Acls",
      "Effect": "Deny",
      "Action": [
        "s3:PutBucketAcl",
        "s3:PutObjectAcl"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": ["public-read", "public-read-write"]
        }
      }
    }
  ]
}
```

Finally, capture proofs in CI and lock them:
```yaml
name: policy-and-proof
on: [push]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform plan
        run: |
          terraform init -input=false
          terraform plan -out=tfplan
          terraform show -json tfplan > tfplan.json
      - name: Conftest OPA
        run: |
          conftest test tfplan.json --policy policy/
      - name: Generate evidence bundle
        run: |
          jq -n --arg ts "$(date -u +%FT%TZ)" '{timestamp:$ts, commit:env.GITHUB_SHA, repo:env.GITHUB_REPOSITORY}' > evidence.json
      - name: Sign evidence
        run: |
          cosign attest --predicate evidence.json --type evidence ${{ secrets.ARTIFACT_REF }} --yes
      - name: Upload to immutable bucket
        env:
          AWS_REGION: us-east-1
        run: |
          aws s3 cp evidence.json s3://audit-proofs/evidence/${GITHUB_SHA}.json
          aws s3api put-object-retention --bucket audit-proofs --key evidence/${GITHUB_SHA}.json \
            --retention Mode=COMPLIANCE,RetainUntilDate=$(date -u -d "+365 days" +%Y-%m-%dT%H:%M:%SZ)
```

Design DR for ransomware and credential leaks
Here’s the minimum viable runbook we implement for breach-grade DR in AWS; adapt the concepts to Azure/GCP.
Immediate containment (minutes)
- Apply an org-level quarantine SCP to the affected account(s): deny iam:*, organizations:LeaveOrganization, and other high-risk APIs; allow sts:AssumeRole only for a break-glass role.
- Revoke active sessions: rotate access keys, invalidate OIDC tokens, and attach deny policies conditioned on aws:TokenIssueTime to cut off existing STS sessions.
- Disable pipelines and webhooks; put GitHub/CI in read-only mode if needed.
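A sketch of the pre-staged quarantine SCP: deny everything in the account except the break-glass role. The role name is a placeholder; scope it to your actual responder role before using:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "QuarantineAllButBreakGlass",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/break-glass-responder"
        }
      }
    }
  ]
}
```

Keep it attached to an empty OU and move the compromised account into that OU during an incident; attaching policies is faster and more reversible than editing them under pressure.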
Backup integrity and clean-room restore (hours)
- Use AWS Backup Vault Lock and cross-account, cross-region copies. Verify the last immutable point‑in‑time copy predates the breach.
- Restore into a clean account with fresh KMS keys and limited egress. No peering to prod until validation passes.
- K8s workloads: restore with Velero into a new EKS cluster with Gatekeeper/Kyverno enforcing restricted baseline.
Example restore test we schedule weekly:

```shell
# Restore last nightly EKS backup into staging-cleanroom
velero restore create --from-backup nightly-eks \
  --include-namespaces prod \
  --selector 'tier=frontend' \
  --restore-volumes=true \
  --wait

# Smoke test via ephemeral test runner
kubectl -n prod run curl --image=curlimages/curl --rm -it --restart=Never --command -- \
  curl -sSf https://frontend.prod.svc.cluster.local/healthz
```

Data validation and cutover (hours to a day)
- Verify database restores with pg_waldump or wal-g LSNs; reconcile message queues; re-seed caches.
- Rotate all app secrets and issue new tokens; enforce device posture for human access.
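One way to sanity-check the restore point, sketched in shell: convert PostgreSQL LSNs (as reported by pg_waldump or wal-g) into integer byte positions and confirm the restored LSN predates the last known-clean marker. The LSN values below are made up for illustration:

```shell
# Convert a PostgreSQL LSN like "16/B374D848" into an integer byte position
lsn_to_int() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

clean_marker="16/B374D848"   # last LSN known to predate the breach (illustrative)
restored="16/A0000000"       # LSN of the restored snapshot (illustrative)

if [ "$(lsn_to_int "$restored")" -le "$(lsn_to_int "$clean_marker")" ]; then
  echo "restore point predates breach marker: OK"
else
  echo "restore point is AFTER the breach marker: investigate"
fi
```

The same comparison belongs in the automated drill, not just the incident runbook, so a bad backup chain surfaces before you need it.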
Post-incident hardening (days)
- Replace long-lived credentials with short-lived, scoped roles; remove wildcard IAM.
- Bind GitOps to signed commits and verified images (Cosign + admission policies).
Your RTO might be 4–8 hours for Tier 1. If that number is aspirational, your drill will tell you.
Balance regulated-data constraints with delivery speed
This is where most teams freeze: “Compliance wants PII locked down, product wants daily releases.” You can have both if you stop relying on hope and human review.
Use safe datasets by default
- Masked replicas with deterministic tokenization, or synthetic data (e.g., Tonic, Gretel). Wire it into dbt so dev/test never touch raw PII.
- Enforce at the pipeline: any job targeting dev or staging must use the masked source. Policy blocks direct prod clones.
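Deterministic tokenization can be as simple as an HMAC over the raw value with a per-environment secret: equal inputs produce equal tokens, so joins across masked tables still work, but the mapping is not reversible without the key. A sketch; the key here is a placeholder and would live in a secrets manager in practice:

```shell
MASKING_KEY="dev-only-placeholder-key"   # placeholder; fetch from a secrets manager in practice

tokenize() {
  # HMAC-SHA256 keeps the mapping deterministic but unguessable without the key
  printf '%s' "$1" | openssl dgst -sha256 -hmac "$MASKING_KEY" -r | cut -d' ' -f1
}

t1=$(tokenize "alice@example.com")
t2=$(tokenize "alice@example.com")
t3=$(tokenize "bob@example.com")
[ "$t1" = "$t2" ] && echo "same input, same token: joins preserved"
[ "$t1" != "$t3" ] && echo "different inputs, different tokens"
```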
Make egress a policy, not a promise
- Route all outbound traffic via egress proxies; deny 0.0.0.0/0 security-group rules; require DNS allowlists.
- For cloud storage, deny public ACLs/policies at the org level (SCP/Azure Policy/GCP Org Policy) and enforce bucket policies that require TLS and VPC endpoints.
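At the cluster layer, the same idea expressed as a default-deny egress NetworkPolicy that allows only DNS and the egress proxy; the namespace and label selectors are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-except-proxy
  namespace: prod              # illustrative namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Egress]
  egress:
    - to:                      # allow DNS lookups to kube-dns
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
    - to:                      # everything else must transit the egress proxy
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: egress
          podSelector:
            matchLabels:
              app: egress-proxy
```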
Compliance as code, not tickets
- HIPAA/GDPR/PCI controls mapped to checks: encryption, access logs, retention. Use Chef InSpec to validate hosts and AWS Config for cloud posture.
Example InSpec control to verify auditd on Linux nodes:
```ruby
control 'cis-4.1.1' do
  impact 1.0
  title 'Ensure auditd is installed and running'
  describe package('auditd') { it { should be_installed } }
  describe service('auditd') do
    it { should be_enabled }
    it { should be_running }
  end
end
```

If a policy can’t be enforced or proved automatically, assume it will fail in a crunch.
Make it real: drills, metrics, and immutable evidence
Run tabletop exercises, then run the chaos.
Quarterly breach game day
- Inject: revoke CI token, simulate S3 ransomware (write-only bucket), or compromise a non-prod key.
- Force the drill to use the runbook: quarantine, restore into cleanroom, rotate secrets, run health checks.
Automate monthly restore tests
- Randomly select a backup point; restore to isolated env; run smoke tests; measure RTO and data integrity.
- Store results as signed artifacts in WORM storage (S3 Object Lock, retention = 1 year).
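A sketch of the drill wrapper: time the restore, record the outcome as a JSON evidence record, and ship it to WORM storage. The restore command is stubbed here, and the file name is our own convention:

```shell
start=$(date -u +%s)

# Stub for the actual restore step, e.g.:
#   velero restore create --from-backup "$BACKUP" --wait
restore_ok=true

end=$(date -u +%s)
rto_seconds=$(( end - start ))

# Evidence record; in the real pipeline this is signed and copied
# to an S3 Object Lock bucket with a one-year retention
printf '{"drill":"monthly-restore","ok":%s,"rto_seconds":%d,"ts":"%s"}\n' \
  "$restore_ok" "$rto_seconds" "$(date -u +%FT%TZ)" > drill-evidence.json

cat drill-evidence.json
```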
Measure what matters
- MTTR (to contain) and RTO (to restore), RPO (data loss), backup freshness, policy pass rate in CI, and “restore success rate.”
- Create a simple “restore confidence score” from last three drills; report to execs alongside uptime SLOs.
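The “restore confidence score” can be as simple as a recency-weighted success rate over the last three drills; the weighting below is our own convention, not a standard metric:

```shell
# Outcomes of the last three drills, newest first: 1 = restored within RTO, 0 = failed
drills="1 1 0"

score=$(awk -v outcomes="$drills" 'BEGIN {
  n = split(outcomes, o, " ")
  # newest drill gets the largest weight (3, 2, 1)
  for (i = 1; i <= n; i++) { w = n - i + 1; s += o[i] * w; t += w }
  printf "%.2f", s / t
}')
echo "restore confidence: $score"
```

Two recent passes after an older failure scores 0.83; a score trending down is an earlier warning than any uptime SLO.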
If you can’t restore in a drill, you won’t in production. I’ve never seen a team “rise to the occasion.” You fall to the level of your automation.
What good looks like in 90 days
- Day 0–14: Catalog systems, declare RTO/RPO, add basic guardrails (SCPs, OPA/Kyverno), enable Vault Lock/Object Lock, and turn on CI checks (Checkov, conftest).
- Day 15–45: Implement clean-room account/project and tested restore paths (DB + K8s). Start signing artifacts and storing proofs. Run first tabletop.
- Day 46–90: Full breach game day with quarantine + clean restore. Close gaps. Wire metrics to a dashboard. Bake restore tests into CI nightly or weekly.
Results we’ve delivered at GitPlumbers:
- Reduced RTO for Tier 1 from 12h to 3.5h; RPO from 1h to 15m with incremental backups.
- 95% policy pass rate in CI/CD; zero drift-induced incidents quarter over quarter.
- Auditors accepted automated proofs without extra sampling, cutting audit time by 40%.
Common traps we keep fixing
- Warm standby in the same blast radius (same account/project, same KMS keys). Don’t do this.
- Backups without restore tests. That’s just expensive blob storage.
- Policies in Confluence with no enforcement. Write Rego/YAML/Sentinel or accept that it won’t happen.
- Break-glass accounts sharing the same IdP. Use hardware MFA, out-of-band creds, and monitor usage.
- Over-permissioned CI/CD with repo admin rights. Scope tokens and enforce signed commits/images.
Key takeaways
- Design DR for security breaches, not just hardware failures—assume credentials are compromised.
- Translate compliance policies into code-level guardrails, checks, and immutable proofs.
- Automate restore tests and capture evidence artifacts auditors will accept.
- Balance regulated-data constraints with delivery speed using safe datasets, guardrail enforcement, and GitOps.
- Practice breach-grade game days: rotate creds, quarantine accounts, and restore into clean rooms.
- Measure RTO/RPO, MTTD/MTTR, and “restore confidence” with real drills, not slide decks.
Implementation checklist
- Document breach-specific DR scenarios: ransomware, cloud cred leak, insider exfil.
- Define RTO/RPO per system and align with SLOs; track in a central catalog.
- Implement policy-as-code (OPA/Kyverno/Sentinel) for encryption, network isolation, and no-public buckets.
- Add AWS Backup Vault Lock (or equivalent immutability) and schedule restore tests.
- Create break-glass accounts with hardware MFA; pre-stage SCPs to quarantine accounts.
- Automate evidence: policy verdicts, backup logs, restore timestamps, signed attestations.
- Use GitOps with admission controls; block drift and enforce guardrails at the cluster and repo layer.
- Adopt masked/synthetic datasets for dev/test; enforce data egress controls.
- Run quarterly breach game days; include legal/PR and verify comms runbooks.
- Store proofs in WORM storage (S3 Object Lock) with retention aligned to audit needs.
- Continuously scan infra-as-code (Checkov/tfsec), container images, and cluster policies in CI/CD.
- Instrument metrics: MTTR, restore success rate, backup freshness, policy pass rate.
Questions we hear from teams
- How often should we run full restore drills?
- Quarterly for breach-grade scenarios (quarantine + clean-room restore + cutover), monthly for partial restores (DB, a cluster, or a critical service). Smaller weekly automated restores validate backups and health checks. Capture artifacts for each run.
- Do we need multi-region warm standby for everything?
- No. Tier 1 only. Pair the cost with business impact and RTO/RPO. For breach scenarios, prioritize clean-room capability and immutability over always-on standby. Many teams get 80% of the benefit with cross-account snapshots and tested restores.
- What counts as acceptable audit evidence?
- Time-stamped, immutable, and attributable artifacts: signed policy verdicts, backup/restore logs, screenshots from automated health checks, and change approvals linked to commits. Store in WORM (S3 Object Lock) with retention aligned to your standard (e.g., SOC 2, PCI).
- How do we handle break-glass access?
- One or two dedicated accounts/roles using hardware MFA, out-of-band stored secrets, and strict CloudTrail alarms. Pre-stage SCPs to allow only the minimum for containment and restoration. Every use produces an incident ticket and proof artifact.
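For the “every use produces a proof artifact” part, one option is an EventBridge rule matching CloudTrail AssumeRole events for the break-glass role, wired to an alarm and ticket-creation target; the role name is a placeholder:

```json
{
  "source": ["aws.sts"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["AssumeRole"],
    "requestParameters": {
      "roleArn": [{ "wildcard": "arn:aws:iam::*:role/break-glass-responder" }]
    }
  }
}
```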
- What about AI/LLM data exposure during incidents?
- Treat prompts and logs as regulated data. Disable third-party log shipping during breach, and route any AI tooling through approved gateways with redaction. Include LLM tools in egress policies and access reviews.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
