Privacy That Ships: Data Controls Regulators Sign Off On (And Your Pipelines Don’t Hate)
Build privacy-by-design into your data platform so audits pass, engineers move fast, and the business actually gets value.
Privacy that works looks like SRE: declarative, enforced by the platform, and leaving an evidence trail you don’t have to scramble to rebuild.
The audit that froze the roadmap
I watched a national retailer’s analytics program get kneecapped by a GDPR DPIA. They had Snowflake, Looker, dbt, the whole modern stack. But when auditors asked, “Show us purpose-based access to PII since Jan 1 and evidence of timely deletion for DSRs,” the answers were screenshots and vibes. Two red flags later, leadership froze the roadmap.
We rebuilt the privacy controls in six weeks without re-architecting. The trick wasn’t a new vendor. It was treating privacy like an SRE problem: policy-as-code, golden paths, and evidence by default. We shipped faster after the audit than before because the engineers stopped arguing with GRC in Slack and let the platform do the talking.
Regulators want evidence, not vibes
If you’ve been through GDPR, CCPA/CPRA, HIPAA, or SOC 2, you know the game. They don’t need perfection. They need consistent controls and proof:
- Classification and minimization (GDPR Art. 5): know what’s PII and don’t copy it everywhere.
- Privacy by design (Art. 25): controls in the platform, not hidden in dashboards.
- Access control + purpose limitation (Art. 6/32): ABAC beats ad-hoc grants.
- Retention and deletion (Art. 17): DSR/RTBF within SLA, evidence captured.
- Audit trail and lineage (Art. 30/33): who did what, where data flowed, and when.
Here’s the kicker: these are the same things you need for data reliability and quality. If you can’t prove provenance, purpose, and retention, you can’t trust your metrics either. Privacy done right is just good data engineering with a compliance wrapper.
The architecture that doesn’t crumble under audit
You don’t need a full rewrite. You need a thin privacy control plane that rides along your platform:
- Classify at ingest: Detect PII with `Microsoft Presidio` or `Snowflake Classification`, attach tags to fields. Persist in `DataHub` or `Amundsen`.
- Propagate tags: Push tags into `dbt` models, `Schema Registry` (for Kafka), and warehouse catalogs.
- Enforce with ABAC: Use platform-native guards: `Snowflake` masking policies and tags, `BigQuery` column access policies, `AWS Lake Formation` or `Apache Ranger` for Hive/Presto/Trino.
- Policy-as-code: Centralize purpose/role rules in `OPA`/`Rego`, rendered to platform-specific objects via CI/CD.
- Quality gates: `Great Expectations`/`Soda` to stop PII drift and masking regressions before they hit prod.
- Lineage and audit: `OpenLineage` + `Marquez` or `DataHub` to trace PII through Airflow/Dagster. Store access logs in `BigQuery`/`Athena` with 400+ day retention.
- Encryption: KMS-backed at rest and in transit; default deny for buckets and stages.
- GitOps: All policies and exceptions in Git; `ArgoCD`/`Terraform` to apply; no console-only snowflakes.
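In production, "classify at ingest" means a real classifier like Presidio. The shape of the step can be sketched with a stdlib stand-in that emits catalog-style tags per field (detector names, patterns, and the record shape here are illustrative):

```python
import re

# Illustrative stand-in for a real classifier like Microsoft Presidio:
# map detector names to regexes and emit catalog-style tags per field.
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_record(record: dict) -> dict:
    """Return {field_name: [tags]} for fields whose sample values look like PII."""
    tags = {}
    for field, value in record.items():
        hits = [name for name, rx in DETECTORS.items() if rx.search(str(value))]
        if hits:
            tags[field] = ["pii"] + hits
    return tags

sample = {"email": "jane@example.com", "note": "call 555-123-4567", "city": "Austin"}
print(classify_record(sample))
# The resulting tags would then be written to DataHub/Amundsen and the schema registry.
```

The point is that classification output is structured metadata, not a PDF report: downstream enforcement and CI gates consume the same tags.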
This pattern scales whether you’re in Snowflake, Databricks, or BigQuery. The details change; the control plane stays the same.
Implementation patterns that survive auditors and Friday deploys
Tag PII at the source
- Use Presidio at Kafka ingress or in your ingestion jobs to identify `email`, `phone`, `name`, `dob`.
- Write tags to schema metadata and propagate to dbt models and warehouses.
ABAC at the warehouse
- Replace ad-hoc grants with tag- or attribute-based policies. Your BI users shouldn’t need direct table grants to see masked views.
Data contracts include privacy
- Require producers to declare PII in schemas. Fail CI if new PII fields are added without tags or purpose.
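A minimal sketch of that CI gate, assuming a simplified contract shape (in practice you'd parse dbt's schema.yml or your schema registry):

```python
# Illustrative CI gate: fail the build when a contract declares a field whose
# name looks like PII but carries no "pii" tag. The contract shape below is a
# hypothetical simplification of what you'd pull from dbt's schema.yml.
PII_HINTS = ("email", "phone", "ssn", "dob", "address", "name")

def untagged_pii_fields(contract: dict) -> list:
    violations = []
    for col in contract["columns"]:
        looks_pii = any(hint in col["name"].lower() for hint in PII_HINTS)
        if looks_pii and "pii" not in col.get("tags", []):
            violations.append(col["name"])
    return violations

contract = {
    "model": "customers",
    "columns": [
        {"name": "email", "tags": ["pii", "contact"]},
        {"name": "phone_number", "tags": []},   # producer forgot the tag
        {"name": "order_count", "tags": []},
    ],
}

bad = untagged_pii_fields(contract)
if bad:
    print(f"CI FAIL: untagged PII fields: {bad}")
```

Wire this into the same pipeline that runs your dbt tests so an untagged PII column can never merge.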
Deletion as a product
- One queue for DSRs, one orchestrated pipeline to fan out deletes across Snowflake/BigQuery, S3, and derived tables. Idempotent and auditable.
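"Idempotent and auditable" is the whole game. A sketch of the fan-out, with in-memory stores standing in for Snowflake/BigQuery, S3, and derived tables:

```python
import hashlib
from datetime import datetime, timezone

# Sketch of an idempotent DSR fan-out. Each (request, target) pair gets a
# deterministic key, so retries never delete or count twice, and every
# delete emits an evidence artifact. The in-memory "stores" are illustrative.
LEDGER = {}   # delete_key -> evidence artifact
STORES = {
    "warehouse.customers": {"u42": {"email": "u42@example.com"}},
    "s3.raw-events":       {"u42": {"ip": "10.0.0.1"}},
}

def process_dsr(request_id: str, subject_id: str) -> list:
    artifacts = []
    for target, store in STORES.items():
        key = hashlib.sha256(f"{request_id}:{target}".encode()).hexdigest()
        if key in LEDGER:                         # retry: already handled
            artifacts.append(LEDGER[key])
            continue
        rows = 1 if store.pop(subject_id, None) is not None else 0
        evidence = {
            "request_id": request_id,
            "target": target,
            "rows_deleted": rows,
            "completed_at": datetime.now(timezone.utc).isoformat(),
        }
        LEDGER[key] = evidence
        artifacts.append(evidence)
    return artifacts

first = process_dsr("dsr-001", "u42")
retry = process_dsr("dsr-001", "u42")   # no-op: returns the same evidence
```

The ledger is what you show the auditor: every request, every target, rows touched, timestamps.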
Evidence by default
- Every policy, exception, and DSR execution produces an artifact: policy version, approver, lineage impacted, rows touched, time to close.
Privacy SLOs
- “0 unauthorized PII reads,” “95% masking coverage,” “DSR MTTR < 24h.” Track in the same place you track data freshness.
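Both SLOs fall out of evidence you already collect. A sketch of the computation, with illustrative record shapes (in practice these come from access logs and the DSR ledger):

```python
from datetime import datetime, timedelta

# Illustrative column metadata and DSR records
columns = [
    {"name": "email", "tags": ["pii"], "masked": True},
    {"name": "phone", "tags": ["pii"], "masked": False},   # regression!
    {"name": "city",  "tags": [],      "masked": False},
]
dsrs = [
    {"opened": datetime(2024, 1, 1, 9), "closed": datetime(2024, 1, 1, 21)},
    {"opened": datetime(2024, 1, 2, 9), "closed": datetime(2024, 1, 3, 3)},
]

# Masking coverage: share of PII-tagged columns with an active policy
pii = [c for c in columns if "pii" in c["tags"]]
masking_coverage = sum(c["masked"] for c in pii) / len(pii)

# DSR MTTR: mean time from request opened to evidence captured
mttr = sum(((d["closed"] - d["opened"]) for d in dsrs), timedelta()) / len(dsrs)

print(f"masking coverage: {masking_coverage:.0%}")   # target > 95%
print(f"DSR MTTR: {mttr}")                            # target < 24h
```

Run it nightly and page when coverage drops or MTTR breaches, exactly as you would for freshness.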
Code you can steal
Tag PII in dbt and let enforcement follow the tags:
dbt_project.yml:

```yaml
models:
  marts:
    +meta:
      sensitivity: non_pii
```

models/marts/schema.yml (column-level tags belong in the YAML schema file, where CI and masking automation can read them):

```yaml
models:
  - name: customers
    columns:
      - name: email
        tags: [pii, contact]
      - name: phone_number
        tags: [pii, contact]
```

Snowflake dynamic masking with tag-based policies:
```sql
-- Define a masking policy
create or replace masking policy mask_email as (val string) returns string ->
  case
    when current_role() in ('ANALYST_PII','SECURITY_ADMIN') then val
    else regexp_replace(val, '(^.).*(@.*$)', '\\1***\\2')
  end;

-- Tag the column
create tag pii_tag allowed_values 'pii','non_pii';
alter table analytics.customers modify column email set tag pii_tag = 'pii';

-- Apply the policy to any column tagged as PII (via automation or CI)
alter table analytics.customers modify column email set masking policy mask_email;
```

BigQuery column access policy for least privilege:
```sql
-- Policy tags live in a Data Catalog taxonomy; create the taxonomy via the
-- Data Catalog API, console, or Terraform, and grant readers
-- roles/datacatalog.categoryFineGrainedReader (e.g. group:pii-analysts@company.com).
-- Then attach the policy tag to a column by its resource name:
alter table `prod.analytics.customers`
  alter column email
  set options (policy_tags = ['projects/p/locations/us/taxonomies/123/policyTags/456']);
```

OPA/Rego policy for purpose-based access (rendered into warehouse grants by CI):
```rego
package privacy

# Input: {"user": {"roles": ["analyst"], "purpose": "marketing"},
#         "resource": {"tags": ["pii", "contact"]}}

default allow = false
default mask = false

# Only the fraud purpose may see raw PII, and never restricted health data
allow {
    input.user.purpose == "fraud"
    not has_tag("restricted_health")
}

# Everyone else gets masked values for PII-tagged columns
mask {
    has_tag("pii")
    input.user.purpose != "fraud"
}

has_tag(t) {
    input.resource.tags[_] == t
}
```

Retention and encryption with Terraform on S3:
```hcl
resource "aws_s3_bucket" "raw" {
  bucket        = "company-raw"
  force_destroy = false
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw" {
  bucket = aws_s3_bucket.raw.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_key.arn
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "raw" {
  bucket = aws_s3_bucket.raw.id
  rule {
    id     = "retention"
    status = "Enabled"
    filter {} # apply to the whole bucket
    expiration {
      days = 365
    }
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
```

Guard against PII drift with Great Expectations:
```python
import great_expectations as ge
import pandas as pd

# Validate the curated layer before it ships
df = ge.from_pandas(pd.read_parquet("curated/customers.parquet"))

# Emails should be masked in the curated layer, e.g. "j***@example.com"
df.expect_column_values_to_match_regex("email", r"^.\*{3}@.*$", mostly=1.0)

# No SSNs or raw email addresses hiding in free-text comments
df.expect_column_values_to_not_match_regex(
    "comments", r"\d{3}-\d{2}-\d{4}|@", mostly=1.0
)
```

What good looks like: measurable outcomes
I’ve seen this work across fintech, healthcare, and retail with numbers that move the board deck:
- Audit findings → 0: From “needs improvement” on access logging to clean external audit in one quarter.
- DSR MTTR: Down from 14 days of manual SQL to under 24 hours, with an artifact for every request.
- Unauthorized PII reads: Zero, measured via warehouse access logs cross-checked with OPA decisions.
- Masking coverage: > 95% of columns tagged PII have active masking verified nightly; regressions paged.
- Cost/perf impact: < 3% query overhead using native policies vs proxy layers; no new vendors required.
- Delivery speed: Fewer Slack fights; data teams ship new marts faster because the guardrails are automatic.
Traps I keep seeing (and how to avoid them)
- Masking in BI tools only: If it’s not at the warehouse, it’s bypassable. Enforce at the platform and let BI inherit.
- RBAC explosion: Dozens of roles per team is unmaintainable. Go ABAC with tags and purposes.
- Un-tagged sprawl: If tagging is optional, it won’t happen. Fail CI when PII shows up untagged.
- DSR as a spreadsheet: Manual ticketing won’t scale. Put DSRs in a queue and treat deletes like debit transactions—atomic and recorded.
- No lineage, no evidence: Without OpenLineage/DataHub, your audit story is guesswork. Instrument it like uptime.
- LLM leakage: Redact at the edge before calling `OpenAI`/`Vertex AI`. Use reversible tokenization for re-identification only where legally allowed.
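Edge redaction can be sketched with stdlib tooling: replace PII with opaque tokens before the API call and keep the mapping in a separate vault (the pattern and in-memory vault here are illustrative; Presidio's anonymizer does this at production grade):

```python
import re
import secrets

# Illustrative edge redaction before an LLM call: swap PII for opaque tokens,
# keep the mapping in a separate "vault" for re-identification only where
# legally allowed. The regex and in-memory vault are simplifications.
VAULT = {}
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def redact(prompt: str) -> str:
    def tokenize(match: re.Match) -> str:
        token = f"<PII_{secrets.token_hex(4)}>"
        VAULT[token] = match.group(0)   # reversible only via the vault
        return token
    return EMAIL.sub(tokenize, prompt)

safe = redact("Summarize the complaint from jane@example.com about billing.")
# `safe` now carries a token instead of the email; send `safe` to the API,
# and detokenize the response only under a purpose-approved path.
```

Keep the vault behind the same ABAC purposes as the raw data, and log every detokenization.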
A 90‑day plan you can actually finish
- Week 1–2: Inventory
- Run Presidio or native classifiers on top 20 tables. Tag fields and push into DataHub.
- Week 2–3: ABAC MVP
- Pick Snowflake or BigQuery. Implement masking policies for `email`, `phone`, `address`. Grant access by purpose.
- Week 3–4: Policy-as-code
- Stand up OPA. Encode purpose rules. Wire CI to render warehouse grants/policies from Rego.
- Week 4–5: Quality gates
- Add Great Expectations checks for masking patterns and PII drift. Fail the build on regressions.
- Week 5–6: Lineage + logs
- Enable OpenLineage in Airflow/Dagster. Centralize warehouse access logs in BigQuery/Athena with 400+ day retention.
- Week 6–7: Deletion pipeline
- Build an Airflow DAG that consumes DSRs, deletes across storage, captures evidence, and retries idempotently.
- Week 7–8: Retention
- Apply Terraform lifecycle policies to S3/GCS and set equivalent retention in warehouse time-travel settings.
- Week 8–9: Exceptions flow
- Git-based approvals for temporary unmasking by DPO/security. Time-bound, auto-revoke.
- Week 9–10: SLOs + dashboards
- Track unauthorized PII reads, DSR MTTR, masking coverage. Page on breach.
- Week 10–12: Rollout
- Expand tags, policies, and checks to the long tail. Kill legacy direct grants.
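The policy-as-code render step (weeks 3–4) is where CI turns Rego decisions into warehouse objects. A minimal sketch with stubbed decisions — in practice CI would call `opa eval` and feed its output in; the DDL follows Snowflake syntax:

```python
# Sketch of the "render" step: CI evaluates the Rego policy and turns
# decisions into warehouse DDL. The decision records here are stubs.
decisions = [
    {"column": "analytics.customers.email", "mask": True,  "policy": "mask_email"},
    {"column": "analytics.customers.city",  "mask": False, "policy": None},
]

def render_ddl(decisions: list) -> list:
    stmts = []
    for d in decisions:
        if not d["mask"]:
            continue
        table, column = d["column"].rsplit(".", 1)
        stmts.append(
            f"alter table {table} modify column {column} "
            f"set masking policy {d['policy']};"
        )
    return stmts

for stmt in render_ddl(decisions):
    print(stmt)   # applied by CI/CD, never by hand
```

Because the DDL is generated, a policy change in Git is the only way masking changes in prod — which is exactly the evidence trail auditors want.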
If it’s not in code, it didn’t happen. If it didn’t produce evidence, it won’t pass audit.
Key takeaways
- Privacy that works is boring: classify once, tag everywhere, enforce with ABAC, prove it with logs and lineage.
- Policy-as-code (OPA/Rego) + GitOps prevents Friday surprises and keeps auditors happy.
- Treat deletion (DSR/RTBF) like a product: idempotent, testable, with SLAs and end-to-end evidence.
- Data quality tooling doubles as privacy defense—build checks for PII drift and masking regressions.
- Measure what matters: zero unauthorized PII access, DSR MTTR < 24h, masking coverage > 95%, audit gap rate = 0.
Implementation checklist
- Inventory PII and sensitive data; propagate tags to schema/catalog and warehouse policies.
- Enforce ABAC at the platform layer (Snowflake policies, BigQuery column access, Lake Formation/Ranger).
- Encrypt everywhere by default with KMS-managed keys; rotate and log key use.
- Automate retention and the right-to-be-forgotten with idempotent deletion pipelines and evidence artifacts.
- Instrument lineage (OpenLineage) and access logs; keep them queryable for at least the statutory period.
- Add privacy SLOs to your data platform (e.g., 0 unauthorized PII reads per quarter).
- Shift-left with data contracts that include privacy metadata; block merges when PII fields appear untagged.
- Use GitOps to version policies, approvals, and exceptions; no console-only changes.
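Time-bound exceptions only work if revocation is automatic. A sketch of the sweep: a scheduled job reads the Git-approved exception records and emits revokes for anything past its window (record shape and role names are illustrative):

```python
from datetime import datetime, timezone

# Illustrative exception records, as approved in Git by DPO/security
exceptions = [
    {"role": "ANALYST_PII", "user": "alice", "expires": "2024-01-10T00:00:00+00:00"},
    {"role": "ANALYST_PII", "user": "bob",   "expires": "2999-01-01T00:00:00+00:00"},
]

def expired_grants(exceptions: list, now=None) -> list:
    """Emit revoke statements for every exception past its expiry."""
    now = now or datetime.now(timezone.utc)
    return [
        f"revoke role {e['role']} from user {e['user']};"
        for e in exceptions
        if datetime.fromisoformat(e["expires"]) < now
    ]

print(expired_grants(exceptions))
```

Run it on a schedule alongside the policy render job; an exception that outlives its approval is itself an audit finding.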
Questions we hear from teams
- Do I need a new vendor to pass audit?
- Usually no. Start with the platform-native controls you already pay for: Snowflake masking policies and tags, BigQuery policy tags, Lake Formation/Ranger for lakes. Add OPA for policy-as-code and OpenLineage for evidence. Vendors like Immuta/Privacera help at scale, but you can ship an MVP without them.
- Will masking and ABAC hurt performance?
- In practice, native policies add <3% overhead on typical analytics workloads. The cost of not enforcing privacy—unlimited copies, accidental exposure, audit delays—hits both cloud spend and delivery speed harder.
- How do we handle the right to be forgotten (DSR) in derived tables?
- Use idempotent deletion pipelines that replay deletes through lineage. For warehouses, rebuild or backfill affected materializations (dbt `--full-refresh` for impacted models). For lakes, propagate tombstones and compact. Capture artifacts proving which rows and datasets were touched.
- What about LLMs and AI features?
- Redact or tokenize PII before calling LLM APIs. Keep reversible tokens in a separate vault with strict purpose-based access. Log prompts and outputs, and disallow raw PII in prompts by policy. Treat AI integrations as another sink in your lineage graph.
- Who owns privacy policies: security, data, or legal?
- Security and legal define the rules; the data platform enforces them. Put the rules in code (OPA/Rego) and require Git-based approvals from DPO/security for exceptions. That keeps accountability clear and audits simple.
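Replaying deletes through lineage, as in the derived-tables question above, reduces to a graph walk: start at every table that held the subject's rows and collect all downstream models to rebuild. A sketch with an illustrative lineage graph (in practice this comes from OpenLineage/DataHub):

```python
# Illustrative lineage graph: table -> direct downstream consumers
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_ltv", "marts.churn_features"],
    "marts.customer_ltv": [],
    "marts.churn_features": ["ml.churn_training_set"],
    "ml.churn_training_set": [],
}

def impacted_downstream(roots: list) -> set:
    """All derived datasets reachable from the roots; candidates for rebuild."""
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

targets = impacted_downstream(["raw.customers"])
# Each target becomes a dbt --full-refresh (or lake compaction) job,
# and the target list itself is part of the DSR evidence artifact.
```

The same walk powers the evidence artifact: "we deleted from these sources, and rebuilt these derived datasets."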
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
