Privacy That Ships: Data Controls Regulators Sign Off On (And Your Pipelines Don’t Hate)
Build privacy-by-design into your data platform so audits pass, engineers move fast, and the business actually gets value.
Privacy that works looks like SRE: declarative, enforced by the platform, and leaving an evidence trail you don’t have to scramble to rebuild.
The audit that froze the roadmap
I watched a national retailer’s analytics program get kneecapped by a GDPR DPIA. They had Snowflake, Looker, dbt, the whole modern stack. But when auditors asked, “Show us purpose-based access to PII since Jan 1 and evidence of timely deletion for DSRs,” the answers were screenshots and vibes. Two red flags later, leadership froze the roadmap.
We rebuilt the privacy controls in six weeks without re-architecting. The trick wasn’t a new vendor. It was treating privacy like an SRE problem: policy-as-code, golden paths, and evidence by default. We shipped faster after the audit than before because the engineers stopped arguing with GRC in Slack and let the platform do the talking.
Regulators want evidence, not vibes
If you’ve been through GDPR, CCPA/CPRA, HIPAA, or SOC 2, you know the game. They don’t need perfection. They need consistent controls and proof:
- Classification and minimization (GDPR Art. 5): know what’s PII and don’t copy it everywhere.
- Privacy by design (Art. 25): controls in the platform, not hidden in dashboards.
- Access control + purpose limitation (Art. 6/32): ABAC beats ad-hoc grants.
- Retention and deletion (Art. 17): DSR/RTBF within SLA, evidence captured.
- Audit trail and lineage (Art. 30/33): who did what, where data flowed, and when.
Here’s the kicker: these are the same things you need for data reliability and quality. If you can’t prove provenance, purpose, and retention, you can’t trust your metrics either. Privacy done right is just good data engineering with a compliance wrapper.
The architecture that doesn’t crumble under audit
You don’t need a full rewrite. You need a thin privacy control plane that rides along your platform:
- Classify at ingest: Detect PII with `Microsoft Presidio` or `Snowflake Classification`, attach tags to fields. Persist in `DataHub` or `Amundsen`.
- Propagate tags: Push tags into `dbt` models, `Schema Registry` (for Kafka), and warehouse catalogs.
- Enforce with ABAC: Use platform-native guards: `Snowflake` masking policies and tags, `BigQuery` column access policies, `AWS Lake Formation` or `Apache Ranger` for Hive/Presto/Trino.
- Policy-as-code: Centralize purpose/role rules in `OPA`/`Rego`, rendered to platform-specific objects via CI/CD.
- Quality gates: `Great Expectations`/`Soda` to stop PII drift and masking regressions before they hit prod.
- Lineage and audit: `OpenLineage` + `Marquez` or `DataHub` to trace PII through Airflow/Dagster. Store access logs in `BigQuery`/`Athena` with 400+ day retention.
- Encryption: KMS-backed at rest and in transit; default deny for buckets and stages.
- GitOps: All policies and exceptions in Git; `ArgoCD`/`Terraform` to apply; no console-only snowflakes.
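In production, "classify at ingest" means a real classifier like Presidio. The shape of the step can be sketched with a stdlib stand-in that emits catalog-style tags per field (detector names, patterns, and the record shape here are illustrative):

```python
import re

# Illustrative stand-in for a real classifier like Microsoft Presidio:
# map detector names to regexes and emit catalog-style tags per field.
DETECTORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_record(record: dict) -> dict:
    """Return {field_name: [tags]} for fields whose sample values look like PII."""
    tags = {}
    for field, value in record.items():
        hits = [name for name, rx in DETECTORS.items() if rx.search(str(value))]
        if hits:
            tags[field] = ["pii"] + hits
    return tags

sample = {"email": "jane@example.com", "note": "call 555-123-4567", "city": "Austin"}
print(classify_record(sample))
# The resulting tags would then be written to DataHub/Amundsen and the schema registry.
```

The point is that classification output is structured metadata, not a PDF report: downstream enforcement and CI gates consume the same tags.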
This pattern scales whether you’re in Snowflake, Databricks, or BigQuery. The details change; the control plane stays the same.
Implementation patterns that survive auditors and Friday deploys
Tag PII at the source
- Use Presidio at Kafka ingress or in your ingestion jobs to identify `email`, `phone`, `name`, `dob`.
- Write tags to schema metadata and propagate to dbt models and warehouses.
ABAC at the warehouse
- Replace ad-hoc grants with tag- or attribute-based policies. Your BI users shouldn’t need direct table grants to see masked views.
Data contracts include privacy
- Require producers to declare PII in schemas. Fail CI if new PII fields are added without tags or purpose.
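A minimal sketch of that CI gate, assuming a simplified contract shape (in practice you'd parse dbt's schema.yml or your schema registry):

```python
# Illustrative CI gate: fail the build when a contract declares a field whose
# name looks like PII but carries no "pii" tag. The contract shape below is a
# hypothetical simplification of what you'd pull from dbt's schema.yml.
PII_HINTS = ("email", "phone", "ssn", "dob", "address", "name")

def untagged_pii_fields(contract: dict) -> list:
    violations = []
    for col in contract["columns"]:
        looks_pii = any(hint in col["name"].lower() for hint in PII_HINTS)
        if looks_pii and "pii" not in col.get("tags", []):
            violations.append(col["name"])
    return violations

contract = {
    "model": "customers",
    "columns": [
        {"name": "email", "tags": ["pii", "contact"]},
        {"name": "phone_number", "tags": []},   # producer forgot the tag
        {"name": "order_count", "tags": []},
    ],
}

bad = untagged_pii_fields(contract)
if bad:
    print(f"CI FAIL: untagged PII fields: {bad}")
```

Wire this into the same pipeline that runs your dbt tests so an untagged PII column can never merge.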
Deletion as a product
- One queue for DSRs, one orchestrated pipeline to fan out deletes across Snowflake/BigQuery, S3, and derived tables. Idempotent and auditable.
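"Idempotent and auditable" is the whole game. A sketch of the fan-out, with in-memory stores standing in for Snowflake/BigQuery, S3, and derived tables:

```python
import hashlib
from datetime import datetime, timezone

# Sketch of an idempotent DSR fan-out. Each (request, target) pair gets a
# deterministic key, so retries never delete or count twice, and every
# delete emits an evidence artifact. The in-memory "stores" are illustrative.
LEDGER = {}   # delete_key -> evidence artifact
STORES = {
    "warehouse.customers": {"u42": {"email": "u42@example.com"}},
    "s3.raw-events":       {"u42": {"ip": "10.0.0.1"}},
}

def process_dsr(request_id: str, subject_id: str) -> list:
    artifacts = []
    for target, store in STORES.items():
        key = hashlib.sha256(f"{request_id}:{target}".encode()).hexdigest()
        if key in LEDGER:                         # retry: already handled
            artifacts.append(LEDGER[key])
            continue
        rows = 1 if store.pop(subject_id, None) is not None else 0
        evidence = {
            "request_id": request_id,
            "target": target,
            "rows_deleted": rows,
            "completed_at": datetime.now(timezone.utc).isoformat(),
        }
        LEDGER[key] = evidence
        artifacts.append(evidence)
    return artifacts

first = process_dsr("dsr-001", "u42")
retry = process_dsr("dsr-001", "u42")   # no-op: returns the same evidence
```

The ledger is what you show the auditor: every request, every target, rows touched, timestamps.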
Evidence by default
- Every policy, exception, and DSR execution produces an artifact: policy version, approver, lineage impacted, rows touched, time to close.
Privacy SLOs
- “0 unauthorized PII reads,” “95% masking coverage,” “DSR MTTR < 24h.” Track in the same place you track data freshness.
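Both SLOs fall out of evidence you already collect. A sketch of the computation, with illustrative record shapes (in practice these come from access logs and the DSR ledger):

```python
from datetime import datetime, timedelta

# Illustrative column metadata and DSR records
columns = [
    {"name": "email", "tags": ["pii"], "masked": True},
    {"name": "phone", "tags": ["pii"], "masked": False},   # regression!
    {"name": "city",  "tags": [],      "masked": False},
]
dsrs = [
    {"opened": datetime(2024, 1, 1, 9), "closed": datetime(2024, 1, 1, 21)},
    {"opened": datetime(2024, 1, 2, 9), "closed": datetime(2024, 1, 3, 3)},
]

# Masking coverage: share of PII-tagged columns with an active policy
pii = [c for c in columns if "pii" in c["tags"]]
masking_coverage = sum(c["masked"] for c in pii) / len(pii)

# DSR MTTR: mean time from request opened to evidence captured
mttr = sum(((d["closed"] - d["opened"]) for d in dsrs), timedelta()) / len(dsrs)

print(f"masking coverage: {masking_coverage:.0%}")   # target > 95%
print(f"DSR MTTR: {mttr}")                            # target < 24h
```

Run it nightly and page when coverage drops or MTTR breaches, exactly as you would for freshness.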
Code you can steal
Tag PII in dbt and let enforcement follow the tags:
dbt_project.yml:

```yaml
models:
  marts:
    +meta:
      sensitivity: non_pii
```

models/marts/schema.yml (column-level tags belong in the YAML schema file, where CI and masking automation can read them):

```yaml
models:
  - name: customers
    columns:
      - name: email
        tags: [pii, contact]
      - name: phone_number
        tags: [pii, contact]
```

Snowflake dynamic masking with tag-based policies:
```sql
-- Define a masking policy
create or replace masking policy mask_email as (val string) returns string ->
  case
    when current_role() in ('ANALYST_PII','SECURITY_ADMIN') then val
    else regexp_replace(val, '(^.).*(@.*$)', '\\1***\\2')
  end;

-- Tag the column
create tag pii_tag allowed_values 'pii','non_pii';
alter table analytics.customers modify column email set tag pii_tag = 'pii';

-- Apply the policy to any column tagged as PII (via automation or CI)
alter table analytics.customers modify column email set masking policy mask_email;
```

BigQuery column access policy for least privilege:
```sql
-- Policy tags live in a Data Catalog taxonomy; create the taxonomy via the
-- Data Catalog API, console, or Terraform, and grant readers
-- roles/datacatalog.categoryFineGrainedReader (e.g. group:pii-analysts@company.com).
-- Then attach the policy tag to a column by its resource name:
alter table `prod.analytics.customers`
  alter column email
  set options (policy_tags = ['projects/p/locations/us/taxonomies/123/policyTags/456']);
```

OPA/Rego policy for purpose-based access (rendered into warehouse grants by CI):
```rego
package privacy

# Input: {"user": {"roles": ["analyst"], "purpose": "marketing"},
#         "resource": {"tags": ["pii", "contact"]}}

default allow = false
default mask = false

# Only the fraud purpose may see raw PII, and never restricted health data
allow {
    input.user.purpose == "fraud"
    not has_tag("restricted_health")
}

# Everyone else gets masked values for PII-tagged columns
mask {
    has_tag("pii")
    input.user.purpose != "fraud"
}

has_tag(t) {
    input.resource.tags[_] == t
}
```

Retention and encryption with Terraform on S3:
```hcl
resource "aws_s3_bucket" "raw" {
  bucket        = "company-raw"
  force_destroy = false
}

resource "aws_s3_bucket_server_side_encryption_configuration" "raw" {
  bucket = aws_s3_bucket.raw.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_key.arn
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "raw" {
  bucket = aws_s3_bucket.raw.id
  rule {
    id     = "retention"
    status = "Enabled"
    filter {} # apply to the whole bucket
    expiration {
      days = 365
    }
    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }
}
```

Guard against PII drift with Great Expectations:
```python
import great_expectations as ge
import pandas as pd

# Validate the curated layer before it ships
df = ge.from_pandas(pd.read_parquet("curated/customers.parquet"))

# Emails should be masked in the curated layer, e.g. "j***@example.com"
df.expect_column_values_to_match_regex("email", r"^.\*{3}@.*$", mostly=1.0)

# No SSNs or raw email addresses hiding in free-text comments
df.expect_column_values_to_not_match_regex(
    "comments", r"\d{3}-\d{2}-\d{4}|@", mostly=1.0
)
```

What good looks like: measurable outcomes
I’ve seen this work across fintech, healthcare, and retail with numbers that move the board deck:
- Audit findings → 0: From “needs improvement” on access logging to clean external audit in one quarter.
- DSR MTTR: Down from 14 days of manual SQL to under 24 hours, with an artifact for every request.
- Unauthorized PII reads: Zero, measured via warehouse access logs cross-checked with OPA decisions.
- Masking coverage: > 95% of columns tagged PII have active masking verified nightly; regressions paged.
- Cost/perf impact: < 3% query overhead using native policies vs proxy layers; no new vendors required.
- Delivery speed: Fewer Slack fights; data teams ship new marts faster because the guardrails are automatic.
Traps I keep seeing (and how to avoid them)
- Masking in BI tools only: If it’s not at the warehouse, it’s bypassable. Enforce at the platform and let BI inherit.
- RBAC explosion: Dozens of roles per team is unmaintainable. Go ABAC with tags and purposes.
- Un-tagged sprawl: If tagging is optional, it won’t happen. Fail CI when PII shows up untagged.
- DSR as a spreadsheet: Manual ticketing won’t scale. Put DSRs in a queue and treat deletes like debit transactions—atomic and recorded.
- No lineage, no evidence: Without OpenLineage/DataHub, your audit story is guesswork. Instrument it like uptime.
- LLM leakage: Redact at the edge before calling `OpenAI`/`Vertex AI`. Use reversible tokenization for re-identification only where legally allowed.
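Edge redaction can be sketched with stdlib tooling: replace PII with opaque tokens before the API call and keep the mapping in a separate vault (the pattern and in-memory vault here are illustrative; Presidio's anonymizer does this at production grade):

```python
import re
import secrets

# Illustrative edge redaction before an LLM call: swap PII for opaque tokens,
# keep the mapping in a separate "vault" for re-identification only where
# legally allowed. The regex and in-memory vault are simplifications.
VAULT = {}
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def redact(prompt: str) -> str:
    def tokenize(match: re.Match) -> str:
        token = f"<PII_{secrets.token_hex(4)}>"
        VAULT[token] = match.group(0)   # reversible only via the vault
        return token
    return EMAIL.sub(tokenize, prompt)

safe = redact("Summarize the complaint from jane@example.com about billing.")
# `safe` now carries a token instead of the email; send `safe` to the API,
# and detokenize the response only under a purpose-approved path.
```

Keep the vault behind the same ABAC purposes as the raw data, and log every detokenization.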
A 90‑day plan you can actually finish
- Week 1–2: Inventory
- Run Presidio or native classifiers on top 20 tables. Tag fields and push into DataHub.
- Week 2–3: ABAC MVP
- Pick Snowflake or BigQuery. Implement masking policies for `email`, `phone`, `address`. Grant access by purpose.
- Week 3–4: Policy-as-code
- Stand up OPA. Encode purpose rules. Wire CI to render warehouse grants/policies from Rego.
- Week 4–5: Quality gates
- Add Great Expectations checks for masking patterns and PII drift. Fail the build on regressions.
- Week 5–6: Lineage + logs
- Enable OpenLineage in Airflow/Dagster. Centralize warehouse access logs in BigQuery/Athena with 400+ day retention.
- Week 6–7: Deletion pipeline
- Build an Airflow DAG that consumes DSRs, deletes across storage, captures evidence, and retries idempotently.
- Week 7–8: Retention
- Apply Terraform lifecycle policies to S3/GCS and set equivalent retention in warehouse time-travel settings.
- Week 8–9: Exceptions flow
- Git-based approvals for temporary unmasking by DPO/security. Time-bound, auto-revoke.
- Week 9–10: SLOs + dashboards
- Track unauthorized PII reads, DSR MTTR, masking coverage. Page on breach.
- Week 10–12: Rollout
- Expand tags, policies, and checks to the long tail. Kill legacy direct grants.
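The policy-as-code render step (weeks 3–4) is where CI turns Rego decisions into warehouse objects. A minimal sketch with stubbed decisions — in practice CI would call `opa eval` and feed its output in; the DDL follows Snowflake syntax:

```python
# Sketch of the "render" step: CI evaluates the Rego policy and turns
# decisions into warehouse DDL. The decision records here are stubs.
decisions = [
    {"column": "analytics.customers.email", "mask": True,  "policy": "mask_email"},
    {"column": "analytics.customers.city",  "mask": False, "policy": None},
]

def render_ddl(decisions: list) -> list:
    stmts = []
    for d in decisions:
        if not d["mask"]:
            continue
        table, column = d["column"].rsplit(".", 1)
        stmts.append(
            f"alter table {table} modify column {column} "
            f"set masking policy {d['policy']};"
        )
    return stmts

for stmt in render_ddl(decisions):
    print(stmt)   # applied by CI/CD, never by hand
```

Because the DDL is generated, a policy change in Git is the only way masking changes in prod — which is exactly the evidence trail auditors want.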
If it’s not in code, it didn’t happen. If it didn’t produce evidence, it won’t pass audit.
Key takeaways
- Privacy that works is boring: classify once, tag everywhere, enforce with ABAC, prove it with logs and lineage.
- Policy-as-code (OPA/Rego) + GitOps prevents Friday surprises and keeps auditors happy.
- Treat deletion (DSR/RTBF) like a product: idempotent, testable, with SLAs and end-to-end evidence.
- Data quality tooling doubles as privacy defense—build checks for PII drift and masking regressions.
- Measure what matters: zero unauthorized PII access, DSR MTTR < 24h, masking coverage > 95%, audit gap rate = 0.
Implementation checklist
- Inventory PII and sensitive data; propagate tags to schema/catalog and warehouse policies.
- Enforce ABAC at the platform layer (Snowflake policies, BigQuery column access, Lake Formation/Ranger).
- Encrypt everywhere by default with KMS-managed keys; rotate and log key use.
- Automate retention and the right-to-be-forgotten with idempotent deletion pipelines and evidence artifacts.
- Instrument lineage (OpenLineage) and access logs; keep them queryable for at least the statutory period.
- Add privacy SLOs to your data platform (e.g., 0 unauthorized PII reads per quarter).
- Shift-left with data contracts that include privacy metadata; block merges when PII fields appear untagged.
- Use GitOps to version policies, approvals, and exceptions; no console-only changes.
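Time-bound exceptions only work if revocation is automatic. A sketch of the sweep: a scheduled job reads the Git-approved exception records and emits revokes for anything past its window (record shape and role names are illustrative):

```python
from datetime import datetime, timezone

# Illustrative exception records, as approved in Git by DPO/security
exceptions = [
    {"role": "ANALYST_PII", "user": "alice", "expires": "2024-01-10T00:00:00+00:00"},
    {"role": "ANALYST_PII", "user": "bob",   "expires": "2999-01-01T00:00:00+00:00"},
]

def expired_grants(exceptions: list, now=None) -> list:
    """Emit revoke statements for every exception past its expiry."""
    now = now or datetime.now(timezone.utc)
    return [
        f"revoke role {e['role']} from user {e['user']};"
        for e in exceptions
        if datetime.fromisoformat(e["expires"]) < now
    ]

print(expired_grants(exceptions))
```

Run it on a schedule alongside the policy render job; an exception that outlives its approval is itself an audit finding.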
Questions we hear from teams
- Do I need a new vendor to pass audit?
- Usually no. Start with the platform-native controls you already pay for: Snowflake masking policies and tags, BigQuery policy tags, Lake Formation/Ranger for lakes. Add OPA for policy-as-code and OpenLineage for evidence. Vendors like Immuta/Privacera help at scale, but you can ship an MVP without them.
- Will masking and ABAC hurt performance?
- In practice, native policies add <3% overhead on typical analytics workloads. The cost of not enforcing privacy—unlimited copies, accidental exposure, audit delays—hits both cloud spend and delivery speed harder.
- How do we handle the right to be forgotten (DSR) in derived tables?
- Use idempotent deletion pipelines that replay deletes through lineage. For warehouses, rebuild or backfill affected materializations (dbt `--full-refresh` for impacted models). For lakes, propagate tombstones and compact. Capture artifacts proving which rows and datasets were touched.
- What about LLMs and AI features?
- Redact or tokenize PII before calling LLM APIs. Keep reversible tokens in a separate vault with strict purpose-based access. Log prompts and outputs, and disallow raw PII in prompts by policy. Treat AI integrations as another sink in your lineage graph.
- Who owns privacy policies: security, data, or legal?
- Security and legal define the rules; the data platform enforces them. Put the rules in code (OPA/Rego) and require Git-based approvals from DPO/security for exceptions. That keeps accountability clear and audits simple.
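Replaying deletes through lineage, as in the derived-tables question above, reduces to a graph walk: start at every table that held the subject's rows and collect all downstream models to rebuild. A sketch with an illustrative lineage graph (in practice this comes from OpenLineage/DataHub):

```python
# Illustrative lineage graph: table -> direct downstream consumers
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_ltv", "marts.churn_features"],
    "marts.customer_ltv": [],
    "marts.churn_features": ["ml.churn_training_set"],
    "ml.churn_training_set": [],
}

def impacted_downstream(roots: list) -> set:
    """All derived datasets reachable from the roots; candidates for rebuild."""
    seen, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

targets = impacted_downstream(["raw.customers"])
# Each target becomes a dbt --full-refresh (or lake compaction) job,
# and the target list itself is part of the DSR evidence artifact.
```

The same walk powers the evidence artifact: "we deleted from these sources, and rebuilt these derived datasets."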
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
