The Restore That Doesn’t Re‑Open the Breach: DR Plans for When Security Fails

Most DR plans assume hardware dies, not that your prod identity is burning. Here’s how to design breach‑aware recovery with real guardrails, automated proofs, and enough speed to keep the business alive—without leaking regulated data.

In breach DR, speed without rekeying is just a faster way to fail compliance.

The breach no one planned for

Two summers ago, a unicorn SaaS called us after they “passed” a DR drill by restoring production in 45 minutes—straight into an attacker’s waiting arms. Their AWS keys were still live, their golden AMI had been poisoned, and their RDS restore reconnected to a VPC the adversary had persistence in. They met RTO. They also re-opened the breach.

I’ve seen that movie too many times. Most DR plans assume a disk dies. Real incidents look like: lateral movement, stolen OIDC tokens, ransomware in a sidecar, or a compromised CI runner. Your plan has to assume compromise and still move fast—especially under regulated data constraints (PCI, HIPAA, SOC 2, GDPR).

What actually breaks under breach conditions

  • Keys and trust anchors are stale: IAM roles, GitHub OIDC trust, cluster service accounts—all suspect.
  • Backups are contaminated or mutable: S3 buckets without Object Lock; snapshots in the same account the attacker owns.
  • Networks default to convenience: Restores land in prod VPCs with shared endpoints and open egress.
  • Evidence is ad hoc: Auditors want proof; you have Slack screenshots.
  • Policies live in PDFs: No guardrails, no checks, no proofs—just wishful thinking.

When this goes wrong, you might hit your RTO and still fail the business goal: safe restoration without further data exposure. So measure more than RTO/RPO and MTTD/MTTR: add RTO’ (time to safely reconnect) and MTTK (mean time to key rotation).
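As a sketch of what tracking these looks like (timeline field names are illustrative, not from any standard), the extended metrics fall out of a few timestamps on the incident timeline:

```python
from datetime import datetime, timedelta

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts)

def dr_metrics(timeline: dict) -> dict:
    """Compute breach-aware DR metrics from ISO-8601 timestamps.

    RTO  = outage start -> service restored
    RTO' = outage start -> safe reconnection (after rekey + scans)
    MTTK = mean time from outage start to each credential rotation
    """
    start = parse(timeline["outage_start"])
    rto = parse(timeline["service_restored"]) - start
    rto_prime = parse(timeline["safe_reconnect"]) - start
    rotations = [parse(t) - start for t in timeline["key_rotations"]]
    mttk = sum(rotations, timedelta()) / len(rotations)
    return {"RTO": rto, "RTO'": rto_prime, "MTTK": mttk}

metrics = dr_metrics({
    "outage_start": "2025-07-15T02:00:00",
    "service_restored": "2025-07-15T02:55:00",  # inside a 60m RTO target
    "safe_reconnect": "2025-07-15T03:35:00",    # RTO' lands at 95m
    "key_rotations": ["2025-07-15T02:20:00", "2025-07-15T02:40:00"],
})
```

The point of the separate numbers is that "service restored" can look great while "safe reconnect" is hours away.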

Turn policy into guardrails, checks, and automated proofs

Paper policies don’t help at 2 a.m. Translate them:

  • Guardrails (preventive): Infrastructure defaults that enforce policy (immutable backups, encryption, deny-by-default networks).
  • Checks (detective): CI/CD and pre-merge tests that fail unsafe changes (Terraform, K8s, images).
  • Automated proofs (evidence): Artifacts that a human and an auditor can trust (signed builds, conformance reports, drill logs).

A few concrete patterns:

  • Use S3 Object Lock and cross-account replication for backup immutability.
  • Enforce encryption and immutability via OPA or Checkov in CI.
  • Require signed container images with Cosign; verify at admission with Kyverno or Gatekeeper.
  • Generate machine-readable evidence (SARIF, JUnit, signed attestations) each pipeline run.
# Terraform 1.6: immutable backup bucket with KMS (pair with cross-account replication)
resource "aws_kms_key" "backup" {
  description             = "DR backups KMS key"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_s3_bucket" "backup" {
  bucket = "org-dr-backups-prod"
  object_lock_enabled = true
}

resource "aws_s3_bucket_object_lock_configuration" "backup" {
  bucket = aws_s3_bucket.backup.id
  rule {
    default_retention {
      mode  = "COMPLIANCE"
      days  = 30
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "backup" {
  bucket = aws_s3_bucket.backup.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.backup.arn
    }
  }
}
# Rego (OPA) policy: deny S3 buckets without versioning and object lock
package dr.guardrails

deny[msg] {
  input.resource.type == "aws_s3_bucket"
  not input.resource.versioning
  msg := "S3 bucket missing versioning"
}

deny[msg] {
  input.resource.type == "aws_s3_bucket"
  not input.resource.object_lock_enabled
  msg := "S3 bucket must enable Object Lock for backups"
}
# Kyverno 1.13: verify image signatures at admission
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-cosign
    match:
      any:
      - resources:
          kinds: [Pod]
    verifyImages:
    - imageReferences: ["ghcr.io/yourorg/*"]
      attestors:
      - entries:
        - keyless:
            issuer: https://token.actions.githubusercontent.com
            subject: "repo:yourorg/*:ref:refs/heads/main"
# GitHub Actions: checks + proofs (SARIF) for IaC
name: policy-checks
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bridgecrewio/checkov-action@v12
        with:
          soft_fail: false
          output_format: sarif
          output_file_path: checkov.sarif
      - uses: instrumenta/conftest-action@v0.3
        with:
          files: terraform
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: checkov.sarif

Architect restores for isolation first, speed second

You can’t treat a breach like a power outage. Restore into a clean room with new trust anchors, then reconnect.

  • Separate accounts and regions: Backups replicate to a dedicated DR account with different KMS admins; warm standby is in another region.
  • Quarantine networking: Dedicated VPC with no peering, no shared endpoints, egress blocked by default. Only IR tooling allowed.
  • Rotate trust: New IAM roles, new OIDC provider thumbprints, fresh cluster service account keys, new DB creds.
  • Scan before attach: Run SBOM + malware scans, schema diffs, and data classification before any east-west traffic.
# Minimal runbook fragment: restore RDS into clean VPC with new creds
# Assumes you pre-created subnet groups, security groups, and KMS in DR account
SNAP=prod-rds-snap-2025-07-15
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dr-quarantine-db \
  --db-snapshot-identifier $SNAP \
  --db-subnet-group-name dr-quarantine-subnets \
  --kms-key-id arn:aws:kms:us-east-2:DR:key/abcd-... \
  --no-publicly-accessible

# Rotate creds immediately and disable old rotation workflows
aws secretsmanager put-secret-value --secret-id dr/quarantine/db --secret-string file://new.json
# Kubernetes 1.29: deny-all egress in quarantine namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
  namespace: quarantine
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress: []

We also separate RTO from RTO’. Example target: RTO ≤ 60m for P0, RTO’ ≤ 120m after rekeying and scans. That framing stops “fast but unsafe” restores from winning the day.
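One way to keep "fast but unsafe" from winning is a deny-by-default reconnection gate that refuses to flip traffic until every rekey-and-scan step is green (the step names here are illustrative, not a canonical list):

```python
REQUIRED_BEFORE_RECONNECT = (
    "iam_roles_rotated",
    "oidc_trust_reissued",
    "db_creds_rotated",
    "malware_scan_clean",
    "sbom_diff_reviewed",
)

def reconnect_allowed(completed: set) -> tuple:
    """Deny by default: reconnection is allowed only when every required
    step is done; otherwise report exactly what is still outstanding."""
    missing = [s for s in REQUIRED_BEFORE_RECONNECT if s not in completed]
    return (not missing, missing)

ok, missing = reconnect_allowed({"iam_roles_rotated", "db_creds_rotated"})
# ok is False here; `missing` names the outstanding rekey/scan steps
```

Wire the gate into whatever flips DNS or peering: the restore can finish early (RTO), but the gate holds the line on RTO’.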

Automate drills—and generate evidence while you sleep

If you can’t drill it, you can’t do it. If you can’t prove it, you didn’t do it.

  • Tabletops monthly; live restores quarterly: Simulate stolen CI OIDC token, compromised cluster SA, or ransomware notes in a volume. Involve SRE, AppSec, Legal, and PR.
  • Evidence bots: Store logs, screenshots, and JSON artifacts in a write-once evidence bucket; sign with cosign attest.
  • Conformance packs: Use AWS Config or InSpec to produce machine-verifiable reports.
# Evidence snapshot helper
TS=$(date -u +%Y%m%dT%H%M%SZ)
CMD="aws rds describe-db-instances --db-instance-identifier dr-quarantine-db"
$CMD | tee evidence/$TS-rds.json
cosign attest --key cosign.key --predicate evidence/$TS-rds.json \
  --type custom --replace ghcr.io/yourorg/dr-evidence:latest
# AWS Config conformance pack (excerpt): encrypted EBS volumes
Resources:
  EncryptedVolumes:
    Type: AWS::Config::ConfigRule
    Properties:
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES

After four quarters of disciplined drills, one fintech we worked with cut RTO from 3h to 55m and RTO’ from “¯\_(ツ)_/¯” to 95m, while reducing time-to-auditable-evidence from days to minutes.

Moving fast with regulated data without getting sued

You can ship and stay compliant if you design for it.

  • Data minimization by architecture: Tokenize and segregate regulated data into separate stores and VPCs; make most services non-regulated by design.
  • Immutable logs and backups: S3 Object Lock, CloudTrail Lake with immutable retention, and KMS keys with separate admins.
  • Signed artifacts: Require cosign verify at deploy; keep SBOMs (Syft) and vulnerability scans (Grype, Trivy) as artifacts.
  • Paved roads: Provide secure-by-default Terraform modules and Helm charts; developers move fast on the rails.
# Verify a regulated image out-of-band, matching the admission policy's keyless trust
cosign verify ghcr.io/yourorg/payments/api@sha256:... \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --certificate-identity-regexp "^https://github.com/yourorg/payments/"

Balance comes down to the latency of your controls: move checks left (fast CI), keep guardrails in infra (always-on), and reserve human review for exceptions. GitOps with ArgoCD gives great leverage: declarative state, diffs, and rollback with provenance.

A 30/60/90 plan you can actually run

  1. 30 days: baseline and guardrails

    • Inventory backups; enable S3 Object Lock and cross-account replication.
    • Split DR into a new account; create clean-room VPC, subnets, SGs.
    • Enforce basic policies in CI with Checkov/Conftest; turn on AWS Config.
    • Adopt signed images path in CI with Cosign; document break-glass.
  2. 60 days: drills and proofs

    • Run a tabletop: stolen CI OIDC token + restore to clean room.
    • Add Kyverno/Gatekeeper policies for signature verification.
    • Automate evidence capture; store in immutable bucket.
    • Define RTO/RPO, RTO’, and MTTK targets; alert when violated.
  3. 90 days: scale and harden

    • Quarterly live restore; rotate keys as part of drill.
    • Introduce data classification scans pre-attach; quarantine network defaults.
    • Roll out paved-road modules; require SBOMs for regulated workloads.
    • Report metrics to execs: RTO, RTO’, MTTK, drill pass rate, evidence latency.
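The "define targets; alert when violated" step from day 60 can start as a simple threshold check over each drill's results, run by whatever scheduler you already have (the targets and field names below are placeholders, not recommendations):

```python
# Targets in minutes; tune per service tier
TARGETS_MIN = {"RTO": 60, "RTO_prime": 120, "MTTK": 45}

def drill_violations(results_min: dict) -> list:
    """Return one message per metric that exceeded its target, so a
    scheduler can page (or open a ticket) after each drill run."""
    return [
        f"{metric} {results_min[metric]:.0f}m exceeds target {limit}m"
        for metric, limit in TARGETS_MIN.items()
        if results_min.get(metric, float("inf")) > limit
    ]

violations = drill_violations({"RTO": 55, "RTO_prime": 140, "MTTK": 30})
# RTO and MTTK pass; RTO_prime is over its 120-minute target
```

Feeding the same numbers into the exec report keeps the drill scorecard and the alerting honest with each other.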

GitPlumbers has helped orgs from seed-stage fintechs to stodgy enterprises get here with minimal thrash. No silver bullets—just boring, tested patterns that hold up when the pager goes off.


Key takeaways

  • Your DR plan must assume compromise: stale keys, poisoned images, and persistent access.
  • Translate policies into guardrails (infra), checks (CI), and automated proofs (artifacts) or they won’t survive a 2 a.m. incident.
  • Isolate restores in clean rooms with cross-account, cross-region, immutable backups; quarantine before reconnecting.
  • Automate drills and evidence collection so compliance isn’t the bottleneck during recovery.
  • Use signed artifacts, encrypted snapshots, and network isolation to keep regulated data safe while moving fast.

Implementation checklist

  • Define breach-aware RTO/RPO, plus RTO’ (time to safe re-connection) and MTTK (mean time to key rotation).
  • Backups: immutable (S3 Object Lock), cross-account, cross-region, KMS-encrypted with separate key admins.
  • Pre-provision a clean-room restore environment with deny-by-default networking and break-glass access.
  • Policy-as-code: OPA/Kyverno for clusters, Checkov/Conftest for IaC, InSpec or AWS Config for evidence.
  • Run tabletop + live restore drills quarterly; capture artifacts (logs, screenshots, signatures) automatically.
  • Require signed images (Cosign), SBOMs, and secret hygiene checks before reattaching services.
  • Keep runbooks in repo with `make` targets; test them under chaos conditions.

Questions we hear from teams

What’s the difference between RTO and RTO’?
RTO is time to restore service. RTO’ is time to restore safely—after rotating keys, scanning artifacts, and verifying controls. Track both so you don’t reward unsafe speed.
How do we prove to auditors that our DR works?
Automate evidence: SARIF from policy checks, signed attestations (Cosign), AWS Config/InSpec conformance reports, and timestamped drill logs in a write-once bucket (Object Lock). Treat audits like CI: reproducible, machine-readable outputs.
Is Kyverno or Gatekeeper better for admission policy?
If you’re already invested in OPA/Rego across infra, Gatekeeper keeps the language consistent. If you want Kubernetes-native ergonomics and built-ins for image verification, Kyverno 1.13+ is excellent. We deploy both depending on team skill and stack.
Do we need a second AWS account for DR?
Yes. Cross-account backups and restores reduce blast radius and let you separate KMS administrators. Same-account backups are convenient until they’re not—especially during a breach.
How often should we drill?
Monthly tabletops and quarterly live restores. Rotate keys during drills. If you can’t practice it, you won’t ship it under pressure.
