The DR Plan That Survived a Breach: Policy to Guardrails, Checks, and Proofs
If your disaster recovery plan ignores security incidents, it’s not a plan—it’s a wish. Here’s how to bake breach scenarios into DR, codify policy as guardrails, and ship fast without leaking PII.
The 2 a.m. breach that broke the “DR-only” plan
We had a client with picture-perfect DR for region failover. Runbooks laminated, Route 53 failover tested, RDS snapshots like clockwork. Then 2 a.m. on a Tuesday: an engineer’s personal access token was pushed to a public repo during a rushed hotfix, and an attacker started siphoning data through a seemingly legit service account. Zero floods. No AZ outage. Just a quiet exfiltration through the front door.
The DR playbook didn’t mention token revocation, KMS key rotation, or terminating Okta sessions. The incident response doc existed, but it wasn’t wired into deployments, access, or observability. MTTR wasn’t measured for “revoked keys,” only “cluster back online.” That night we learned: if your DR plan doesn’t include security breach scenarios, you don’t have DR—you have wishful thinking.
Security incidents are disasters. Treat them with the same rigor, SLOs, and automation as failovers.
Make DR objectives security-aware
Classic DR talks RTO and RPO. That’s necessary, but during a breach you also need to prioritize containment and integrity over raw uptime.
- Add breach SLOs alongside RTO/RPO: MTTD (mean time to detect) and MTTR for containment.
- Blast-radius SLO: max number of credentials or tenants exposed before containment.
- Key rotation SLO: e.g., rotate compromised KMS key grants within 15 minutes.
- Session revocation SLO: invalidate SSO sessions and API tokens within 10 minutes.
- Define minimal viable service: what you can run safely while secrets rotate and audit runs.
- Prioritize integrity over availability: better to go read-only or shed load than serve tampered data.
- Pre-approve compensating controls: feature flags to disable risky flows, circuit breakers to block outbound data egress, coarse-grained kill switches.
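To make those SLOs measurable rather than aspirational, the containment math can be scripted straight off the incident timeline. A minimal sketch, assuming GNU date; the timestamps, function name, and 600-second threshold are illustrative, not from any standard:

```shell
# Sketch: compute containment time from incident timestamps and compare to an SLO.
# Assumes GNU date; timestamps and the 600s threshold are illustrative.
set -euo pipefail

containment_seconds() {
  local detected_at=$1 contained_at=$2
  echo $(( $(date -d "$contained_at" +%s) - $(date -d "$detected_at" +%s) ))
}

secs=$(containment_seconds "2024-03-12T02:00:00Z" "2024-03-12T02:09:30Z")
echo "containment took ${secs}s"
if [ "$secs" -gt 600 ]; then echo "session-revocation SLO breached"; fi
```

Wire the same arithmetic into the post-incident report so "did we hit the SLO" is computed, not debated.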
When leadership asks “are we up?”, you want to answer “we’ve contained the blast, revoked tokens, rotated keys, and restored read-only service to 90% of tenants” with metrics, not vibes.
Turn policy into guardrails you can’t ignore
Policies in PDFs don’t stop breaches; executable guardrails do. Here’s what actually works.
- Infrastructure policy as code with OPA/Conftest
```rego
# policy/terraform_s3.rego
package rules.s3

# Deny buckets with public ACLs
violation[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    lower(resource.change.after.acl) == "public-read"
    msg := sprintf("S3 bucket %s has public ACL", [resource.address])
}

# Helper: true when a default SSE algorithm is configured
has_default_sse(resource) {
    resource.change.after.rule[_].apply_server_side_encryption_by_default[_].sse_algorithm
}

# Deny encryption configs that don't enforce a default algorithm
violation[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket_server_side_encryption_configuration"
    not has_default_sse(resource)
    msg := sprintf("SSE not enforced for %s", [resource.address])
}
```

Run it in CI with `conftest test plan.json` after `terraform plan -out=plan && terraform show -json plan > plan.json`.
- Kubernetes admission controls with Kyverno
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-cosign
      match:
        resources:
          kinds: [Pod]
      verifyImages:
        - image: "ghcr.io/yourorg/*"
          key: "cosign.pub"
          attestations:
            - predicateType: https://slsa.dev/provenance/v1
```

Block unsigned images and require SLSA provenance. No signature, no deploy.
- Org-wide safety rails with AWS SCPs
Deny dangerous actions at the root so a compromised role can’t nuke logs or turn off KMS:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Deny", "Action": ["kms:DisableKey", "kms:ScheduleKeyDeletion"], "Resource": "*"},
    {"Effect": "Deny", "Action": ["logs:DeleteLogGroup", "logs:DeleteLogStream"], "Resource": "*"}
  ]
}
```

- CI that fails closed
```yaml
# .github/workflows/policy.yml
name: policy
on: [pull_request]
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init && terraform plan -out=plan
      - run: terraform show -json plan > plan.json
      - uses: instrumenta/conftest-action@v1
        with: {files: plan.json}
      - name: Scan images
        run: |
          trivy fs --exit-code 1 --severity CRITICAL,HIGH .
      - name: Verify signatures
        run: cosign verify --key cosign.pub ghcr.io/yourorg/service:${{ github.sha }}
```

Your policy now stops bad infra, blocks unsigned code, and breaks builds that put you at risk.
Automated proofs, not screenshots
Auditors don’t want slide decks; they want evidence. You don’t want humans screenshotting dashboards at quarter-end, either. Produce signed, machine-verifiable proofs during CI/CD.
- Emit JSON evidence for each control and sign it with cosign.
```bash
# generate_evidence.sh
set -euo pipefail
mkdir -p evidence
jq -n --arg build "$GITHUB_SHA" --arg time "$(date -Iseconds)" \
  '{control:"S3 encryption enforced", build:$build, time:$time, status:"pass"}' > evidence/s3_encryption.json
cosign attest --predicate evidence/s3_encryption.json \
  --predicate-type https://example.com/controls/v1 \
  --key cosign.key ghcr.io/yourorg/service:${GITHUB_SHA}
```

Store immutable proofs
- Push attestations to an OCI registry or transparency log (Rekor).
- Mirror to an append-only S3 bucket with object lock and AWS Backup.
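For the append-only bucket, Object Lock in compliance mode is what makes "immutable" true. A sketch that builds and sanity-checks the lock configuration before applying it; the bucket name and one-year retention are assumptions, and the aws call is left commented so the validation can run anywhere:

```shell
# Sketch: build and validate an S3 Object Lock config with jq before applying it.
# Bucket name "evidence-prod" and 365-day retention are illustrative assumptions.
set -euo pipefail

LOCK_CONFIG=$(jq -n '{
  ObjectLockEnabled: "Enabled",
  Rule: {DefaultRetention: {Mode: "COMPLIANCE", Days: 365}}
}')

# Fail fast if the config does not say COMPLIANCE (vs. the weaker GOVERNANCE mode)
echo "$LOCK_CONFIG" | jq -e '.Rule.DefaultRetention.Mode == "COMPLIANCE"' >/dev/null

# aws s3api put-object-lock-configuration \
#   --bucket evidence-prod \
#   --object-lock-configuration "$LOCK_CONFIG"
```

Compliance mode matters here: even the root account can't shorten the retention, which is exactly the property an auditor wants to hear.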
Aggregate control status
- Use Security Hub, AWS Config, or OpenSearch dashboards to show “last-pass time” per control.
- Set alerts if a control goes stale (no fresh evidence within SLO).
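The "stale control" alert doesn't need a platform to get started; the evidence files' own timestamps can drive it. A minimal sketch, assuming GNU stat, a daily freshness window, and a stand-in evidence file (all three are illustrative):

```shell
# Sketch: flag controls whose evidence artifact is older than the freshness SLO.
# Assumes GNU stat; the 24h window and file layout are illustrative.
set -euo pipefail
SLO_SECONDS=$((24 * 3600))

mkdir -p evidence && touch evidence/s3_encryption.json   # stand-in for real CI output
now=$(date +%s)
for f in evidence/*.json; do
  age=$(( now - $(stat -c %Y "$f") ))
  if [ "$age" -gt "$SLO_SECONDS" ]; then
    echo "STALE: $f (${age}s old)"
  else
    echo "fresh: $f"
  fi
done
```

Run it on a schedule and page on any STALE line; a control with no fresh evidence is a control you can't prove.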
End result: when the auditor asks “prove encryption at rest,” you point to signed, time-stamped artifacts linked to commits and deployments.
Breach playbooks you can run, not read
In a breach, humans are slow and sleepy. Convert runbooks into idempotent scripts and make targets that can be run by on-call with minimal context.
- Rotate IAM access keys and update K8s secrets
```bash
# scripts/rotate_access_key.sh
set -euo pipefail
USER=$1

# Capture the compromised key's ID before creating its replacement
OLD_ID=$(aws iam list-access-keys --user-name "$USER" \
  | jq -r '.AccessKeyMetadata[0].AccessKeyId')

# Create the new key and keep both halves of the credential
NEW=$(aws iam create-access-key --user-name "$USER")
NEW_ID=$(echo "$NEW" | jq -r '.AccessKey.AccessKeyId')
NEW_SECRET=$(echo "$NEW" | jq -r '.AccessKey.SecretAccessKey')

# Update the K8s secret, then retire the old key and roll the pods
kubectl create secret generic api-creds \
  --from-literal=AWS_ACCESS_KEY_ID="$NEW_ID" \
  --from-literal=AWS_SECRET_ACCESS_KEY="$NEW_SECRET" \
  -o yaml --dry-run=client | kubectl apply -f -
aws iam delete-access-key --user-name "$USER" --access-key-id "$OLD_ID"
kubectl rollout restart deploy -n apps backend
```

- Revoke SSO sessions and OAuth tokens
```bash
# Okta example (requires Okta API token)
USER_ID=$1
curl -s -X POST "https://yourorg.okta.com/api/v1/users/$USER_ID/lifecycle/revoke_sessions" \
  -H "Authorization: SSWS $OKTA_TOKEN" -H "Content-Type: application/json"
```

- Segment traffic fast
- Pre-wire Istio/NGINX rules to block egress to sensitive endpoints behind a feature flag.
- Keep per-tenant allowlists and a kill switch to force read-only for impacted tenants.
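The kill switch itself can be a pre-written NetworkPolicy that on-call applies with one command. A sketch, assuming an `apps` namespace and a blunt deny-all-egress stance (both illustrative; the apply is left commented, and you'd want to carve out DNS and health-check egress before using this for real):

```shell
# Sketch: generate a deny-all-egress NetworkPolicy as the breach kill switch.
# Namespace "apps" is an assumption; review DNS/health-check egress before real use.
set -euo pipefail

cat > /tmp/breach-kill-switch.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: breach-kill-switch
  namespace: apps
spec:
  podSelector: {}        # all pods in the namespace
  policyTypes: [Egress]
  egress: []             # no rules listed = all egress denied
EOF

# kubectl apply -f /tmp/breach-kill-switch.yaml
```

Because the manifest is checked in and reviewed ahead of time, the 2 a.m. decision is "apply it or not," not "write YAML under pressure."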
Runbooks should log every action to a dedicated audit sink with request IDs. If you can’t run it safely in staging, it won’t save you at 2 a.m.
Drill it like you mean it
We run quarterly security GameDays with clients. The rules: production-like data, real tooling, and a stopwatch.
- Announce a scenario: leaked CI token detected; attacker using a service account.
- Start the clock. PagerDuty fires. On-call isolates affected apps via feature flags.
- Revoke tokens and sessions. Rotate keys with scripts. Force redeploy with new secrets.
- Query logs and DLP to confirm exfil window. Notify stakeholders.
- Restore minimal service. Collect evidence artifacts.
- Retrospect with metrics: time to isolate, revoke, rotate, restore.
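To get the stopwatch for free, each drill phase can log an epoch timestamp and the durations fall out afterwards. A sketch with illustrative phase names and sleeps standing in for the real work:

```shell
# Sketch: log GameDay phases with epoch timestamps, then report phase durations.
# Phase names and sleeps are illustrative stand-ins for real drill steps.
set -euo pipefail
LOG=$(mktemp)

phase() { echo "$(date +%s) $1" >> "$LOG"; }

phase isolate
sleep 1    # ... on-call flips feature flags ...
phase revoke
sleep 1    # ... tokens revoked, keys rotated ...
phase restore

# Seconds between consecutive phases
awk 'NR > 1 { printf "%s -> %s: %ds\n", p_name, $2, $1 - p_ts } { p_ts = $1; p_name = $2 }' "$LOG"
```

Keep the log file as a drill artifact; it becomes the raw data for the retrospective metrics below.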
Targets we like to see after two cycles:
- MTTD under 5 minutes via anomaly alerts (impossible travel, unusual egress, CI token use from new ASN).
- Session revocation within 10 minutes; credential rotation within 15 minutes.
- Read-only restore for 90% of tenants within 30 minutes.
- 95% controls covered by automated, signed evidence artifacts.
If your numbers aren’t improving, your guardrails aren’t where the work happens (hint: put them in CI and admissions, not Confluence).
Move fast with regulated data—safely
You don’t have to pick between HIPAA/GDPR and shipping velocity. You need a safety envelope.
- Data classification by default: tag resources and schemas with data_class=public/internal/confidential/regulated. Deny egress of regulated data to non-compliant paths.
- Ephemeral environments: per-PR namespaces with masked secrets; destroy on merge. Seed with synthetic or tokenized data.
- JIT and break-glass access: use Teleport or AWS IAM Identity Center for time-bound roles. Log every elevation; require manager plus security approval for regulated datasets.
- DLP and schema guards in CI: block code that writes PII to logs; fail builds on accidental S3 public policies.
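The classification rule only bites if CI rejects unknown labels. A sketch of the validation half, with hard-coded sample tags standing in for values that a real pipeline would read from the Terraform plan or the tagging API:

```shell
# Sketch: fail CI when a data_class tag is not one of the four approved values.
# Sample tags are hard-coded for illustration; a real pipeline would read them
# from the Terraform plan JSON or aws resourcegroupstaggingapi.
set -euo pipefail

valid_class() {
  case "$1" in
    public|internal|confidential|regulated) return 0 ;;
    *) return 1 ;;
  esac
}

violations=0
for tag in internal regulated "pii-ish"; do
  if valid_class "$tag"; then
    echo "ok: $tag"
  else
    echo "DENY: unknown data_class '$tag'"
    violations=$((violations + 1))
  fi
done
# exit "$violations"   # uncomment to fail the build on any violation
```

An unknown label fails closed, which is the point: a dataset nobody classified is treated as regulated until someone says otherwise.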
Example: enforce PII log blocking with semgrep and prevent prod deploy without approvals.
```yaml
# .github/workflows/deploy.yml
name: deploy
on:
  push:
    branches: [main]
jobs:
  scan-and-deploy:
    permissions: {id-token: write, contents: read}
    environment: production
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: returntocorp/semgrep-action@v1
      - name: Require approvals for prod
        if: github.ref == 'refs/heads/main'
        run: |
          rules=$(gh api "repos/${{ github.repository }}/environments/production" --jq '.protection_rules | length')
          test "$rules" -ge 2
      - name: Deploy
        run: ./scripts/deploy.sh
```

And block PII-in-logs in K8s with Kyverno:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: forbid-pii-logs
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-pii
      match:
        resources: {kinds: [Pod]}
      validate:
        message: "Containers must not set LOG_PII=true"
        pattern:
          spec:
            containers:
              - =(env):
                  - =(name): LOG_PII
                    value: "!true"
```

The result: developers move fast within a guardrail that keeps regulators and customers happy.
What good looks like (and how GitPlumbers helps)
When we retrofit breach-ready DR at clients, we aim for:
- 90% of controls enforced pre-merge or at admission; changes that violate policy don’t land.
- Automated, signed evidence for all high-risk controls (encryption, provenance, access).
- Runnable runbooks with a measured rotation time under 15 minutes.
- Quarterly drills with improving SLOs and clear residual-risk reports.
I’ve seen the other version—binder full of policies, human approvals, screenshots. It always fails under real pressure. If you want help building the version that works, that’s what we do at GitPlumbers: turn policies into guardrails, checks, and proofs you can run before the pager goes off.
- Need to make this real? Bring us your current DR/IR docs and a week of pipeline access. We’ll map controls, encode guardrails, and run your first GameDay.
- Want to see examples? We’ve got case studies where we cut rotation MTTR by 70% and replaced auditor screenshots with signed attestations.
Ship safely, sleep better. The pager will still ring—but you’ll be ready.
Key takeaways
- Treat security incidents as first-class DR scenarios with explicit RTO/RPO, MTTD/MTTR, and blast-radius SLOs.
- Translate policy into executable guardrails using OPA, Kyverno, AWS SCPs, and CI checks—no PDFs-as-process.
- Produce automated, tamper-evident proofs in the pipeline to satisfy auditors and reduce manual toil.
- Build runnable breach playbooks (not PDFs) that rotate keys, revoke sessions, and segment access with a single command.
- Drill quarterly with chaos-style security GameDays; measure revocation and recovery times like uptime SLOs.
- Balance speed and compliance with data classification, ephemeral envs, JIT access, and evidence-by-default.
Implementation checklist
- Inventory crown jewels: data stores, signing keys, CI/CD secrets, identity providers, and SBOM pipeline.
- Define breach-specific SLOs: time to detect, isolate, revoke, rotate, and restore minimal service.
- Codify controls: OPA for IaC, Kyverno for K8s, AWS SCPs for org-wide sanity, CI for attestations.
- Automate evidence: store signed control results and artifacts in an immutable bucket/log.
- Create runnable runbooks for key rotation, session revocation, and traffic segmentation.
- Drill with red/blue exercises and rehearse on production-like data without exposing PII.
- Track metrics: MTTR for key rotation, % coverage of automated evidence, drill pass rate, residual risk.
Questions we hear from teams
- What’s the difference between DR and IR, and why combine them?
- DR traditionally handles availability events (AZ/region outages), while IR handles security breaches. In reality, a breach is a disaster: you need to contain, restore integrity, and resume service. Combining them means you define shared SLOs (revocation, rotation, minimal service), run joint drills, and encode guardrails that reduce both failure modes.
- We’re regulated (HIPAA/GDPR). How do we keep speed?
- Automate guardrails and evidence. Use data classification tags, Kyverno/OPA admissions, and CI checks that fail fast. Ephemeral envs with synthetic data keep dev velocity, while JIT access and immutable evidence keep auditors happy.
- What tools do you recommend to start?
- Start with OPA/Conftest for IaC, Kyverno or Gatekeeper for K8s admissions, Sigstore Cosign for signing/attestations, Trivy for scanning, AWS SCPs + Config + Security Hub for cloud controls, and GitHub Actions for wiring proofs into CI.
- How often should we drill breach scenarios?
- Quarterly is a good baseline, with at least one unannounced GameDay per year. Track metrics (revocation time, rotation time, minimal-service restore) and raise the bar until it’s boring.
- What evidence do auditors actually accept?
- Time-stamped, signed artifacts tied to commits/deploys, plus logs from Config/Security Hub. Screenshots rot. Signed attestations and immutable buckets plus control dashboards survive scrutiny.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
