The Privacy Controls That Won’t Break Your Dashboards (Or Your Audit)
Stop treating privacy as a bolt-on. Build tag-test-enforce pipelines that ship reliable analytics and pass GDPR/CCPA/HIPAA without heroics.
Privacy controls are just data quality controls with regulators watching.
I’ve lost count of the times we’ve been called after a “minor” analytics tweak turned into a privacy incident. A marketing attribution pipeline at a well-known DTC brand started copying raw signup_events into a “sandbox” schema for experimentation. Six weeks later: GDPR DSAR backlog, audit scramble, and a board asking why product managers could query unhashed emails. None of this was malicious. It was missing controls that should have been baked into the pipeline, not left to Confluence and hope.
The business problem nobody wants to admit
Privacy incidents aren’t just regulatory fines. They torch reliability and slow delivery.
- When PII leaks into the wrong tier, every downstream dataset becomes radioactive. Releases freeze. BI teams go dark.
- Manual access reviews and ad-hoc masking add brittle logic to every DAG. MTTR for data incidents goes from hours to days.
- Auditors don’t care about your Jira tickets. They want enforced controls, evidence trails, and repeatability.
The fix isn’t a bigger governance committee. It’s engineering the pipeline so privacy is a default, not an afterthought.
What regulators actually expect (and where teams get fined)
You don’t need a lawyer to ship good controls, but you do need to meet the spirit of the regs:
- GDPR/CCPA: purpose limitation, data minimization, access control, deletion/retention, and auditability. Data Subject Access Requests (DSAR) must be fulfilled accurately and on time.
- HIPAA: protect PHI with minimum necessary access, audit logs, BAAs, encryption, and breach notification.
- PCI DSS: strong network segmentation, encryption, strict access and monitoring for PAN data.
Fines land when teams can’t prove:
- What data is where (lineage and catalogs are blank or stale).
- Who can access it (RBAC/ABAC are ad-hoc, spreadsheets are wrong).
- How it’s protected (masking/encryption/tokenization are inconsistent).
- How long it lives (retention is “we’ll get to it”).
The architecture that actually works: tag, test, enforce, audit
We’ve shipped this in anger at SaaS, fintech, and healthtech clients. It scales without turning every query into a compliance debate.
- Tag: Classify columns with policy tags at ingestion (PII, PHI, Sensitive, Public). Automate discovery with scanners, but require data owners to confirm.
- Test: Data contracts + DQ checks in CI and orchestrators (dbt + Great Expectations/Soda). Fail fast before bad schemas or unmasked PII land in prod.
- Enforce: Platform-native policies at the warehouse/lakehouse (Snowflake masking policies, BigQuery policy tags, Databricks Unity Catalog row filters). Attribute-based access beats duct-taped SQL.
- Audit: Lineage and access logs wired into a control coverage dashboard. Evidence on demand for auditors and leadership.
This is privacy-by-default. If something isn’t tagged, it can’t ship. If it’s tagged, controls apply automatically. GitOps for governance.
Step-by-step: ship privacy controls without blocking delivery
Here’s a blueprint we implement in a typical 6–10 week engagement.
- Inventory and tag
- Run automated discovery: AWS Macie, Google DLP, or Microsoft Purview flag candidate PII/PHI.
- Store tags in your catalog (DataHub, Collibra, or Alation) and sync them to the platform as policy tags.
- Make tagging code-reviewable in Terraform.
```hcl
# Terraform: BigQuery policy tags
resource "google_data_catalog_taxonomy" "privacy" {
  display_name = "Privacy"
  region       = "us"
}

resource "google_data_catalog_policy_tag" "pii_email" {
  taxonomy     = google_data_catalog_taxonomy.privacy.id
  display_name = "PII.Email"
}
```
- Contracts and tests
- Define schema and privacy expectations in code. Validate pre-merge and in your orchestrator.
```yaml
# great_expectations: email should be hashed
expectations:
  - expect_column_values_to_match_regex:
      column: email_hash
      regex: "^[a-f0-9]{64}$"
  - expect_table_columns_to_match_set:
      column_set: [user_id, email_hash, created_at]
```
- For pipelines: block jobs if a column tagged PII.* lands without a masking policy or tokenization flag (see the coverage query below).
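One way to implement that gate, sketched against Snowflake's ACCOUNT_USAGE views; the PRIVACY_CLASS tag name and PII.* values are assumptions standing in for your own taxonomy:
```sql
-- Sketch: PII-tagged columns with no masking policy attached (Snowflake ACCOUNT_USAGE)
-- Any rows returned should fail the CI job or orchestrator task.
select t.object_name, t.column_name, t.tag_value
from snowflake.account_usage.tag_references t
left join snowflake.account_usage.policy_references p
  on  t.object_name = p.ref_entity_name
  and t.column_name = p.ref_column_name
  and p.policy_kind = 'MASKING_POLICY'
where t.tag_name = 'PRIVACY_CLASS'   -- illustrative tag name
  and t.tag_value like 'PII%'
  and p.policy_name is null;
```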
- Enforcement at the platform
- BigQuery: use policy tags + IAM conditions to restrict column access.
```hcl
# Terraform: grant dataset access to the analyst group; policy tags still gate PII columns
resource "google_bigquery_dataset_access" "analyst_policy" {
  dataset_id     = google_bigquery_dataset.analytics.dataset_id
  role           = "roles/bigquery.dataViewer"
  group_by_email = "analysts@company.com"
  # Columns carrying PII.* policy tags stay unreadable without the Fine-Grained Reader role
}
```
- Snowflake: dynamic masking + row access policies.
```sql
create masking policy mask_email as (val string) returns string ->
  case when current_role() in ('PII_READER') then val else sha2(val) end;

alter table prod.users modify column email set masking policy mask_email;
```
- Databricks (Unity Catalog): row/column-level permissions and data lineage; prefer ABAC via UC privileges (minimal grant sketch below).
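A minimal Unity Catalog sketch, assuming an analysts group and illustrative catalog/table names:
```sql
-- Sketch: analysts read curated tables; raw PII stays off-limits
GRANT SELECT ON TABLE prod.analytics.orders TO `analysts`;
REVOKE ALL PRIVILEGES ON TABLE prod.raw.users FROM `analysts`;
```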
- Tokenize/encrypt where required
- Use HashiCorp Vault Transform for format-preserving tokenization of emails/phone numbers. Don’t build your own crypto.
```hcl
# Vault policy: allow callers to tokenize via the Transform secrets engine's encode endpoint
path "transform/encode/email" {
  capabilities = ["update"]
}
```
- Encrypt at rest with KMS and rotate keys. Envelope encryption for exports.
- Retention and deletion that actually runs
- S3 lifecycle: delete raw PII after 30 days; retain aggregates for 2 years.
```json
{
  "Rules": [{
    "ID": "pii-raw-expiration",
    "Filter": {"Prefix": "raw/pii/"},
    "Status": "Enabled",
    "Expiration": {"Days": 30}
  }]
}
```
- Audit, lineage, and evidence
- Enable OpenLineage via Marquez, or use DataHub, to capture column-level lineage.
- Pipe warehouse access logs (Snowflake Access History, BigQuery Audit Logs) into a queryable store with dashboards on who queried sensitive columns, and when (a Snowflake starting point is sketched below).
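For Snowflake, a starting point for that dashboard; ACCESS_HISTORY and the flatten pattern are standard, while the PII column list is an assumption:
```sql
-- Sketch: who read sensitive columns in the last 7 days (Snowflake ACCESS_HISTORY)
select
  ah.user_name,
  obj.value:objectName::string as table_name,
  col.value:columnName::string as column_name,
  ah.query_start_time
from snowflake.account_usage.access_history ah,
     lateral flatten(input => ah.base_objects_accessed) obj,
     lateral flatten(input => obj.value:columns) col
where ah.query_start_time >= dateadd(day, -7, current_timestamp())
  and col.value:columnName::string in ('EMAIL', 'PHONE', 'SSN');  -- illustrative PII columns
```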
- Guardrails as code
- Block merges that add untagged PII or remove masking.
```rego
# OPA/Rego: reject PR if sensitive columns lack policy tags
package privacy

violation[msg] {
  some c
  input.pr.columns[c].sensitivity == "PII"
  not input.pr.columns[c].policy_tag
  msg := sprintf("Column %s missing policy tag", [c])
}
```
This isn’t theory. We’ve used this exact flow at fintechs on Snowflake, media companies on BigQuery, and healthtechs on Databricks.
Concrete platform examples that hold up under audit
Snowflake
- Use ACCOUNTADMIN only for break-glass; create SECURITYADMIN workflows via Terraform or Snowflake schemas.
- Masking policies for email, phone, ssn; row access for region-based restrictions.
- ACCESS_HISTORY + LOGIN_HISTORY exported to a controlled schema for reporting.
```sql
-- EU-scoped roles see only EU rows; everyone else is unrestricted
create row access policy eu_only as (region string) returns boolean ->
  case when current_role() in ('EU_ANALYST') then region = 'EU' else true end;

alter table prod.orders add row access policy eu_only on (region);
```
BigQuery
- Policy tags applied at column level; analysts get roles/bigquery.dataViewer but cannot read columns tagged PII.*.
- DSAR pipeline uses DELETE FROM with WHERE user_id IN lookups, then re-materializes aggregates (a minimal sketch follows).
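A minimal sketch of that deletion step, assuming an ops.dsar_requests lookup table (all names illustrative):
```sql
-- Sketch: DSAR deletion in BigQuery driven by approved requests
DELETE FROM prod.users
WHERE user_id IN (
  SELECT user_id FROM ops.dsar_requests WHERE status = 'APPROVED'
);
-- Follow with jobs that rebuild downstream aggregates from the cleaned table.
```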
```sql
-- Block direct SELECT on PII columns without tag permission
SELECT email FROM prod.users; -- fails if the caller lacks the PII.Email policy-tag permission
```
Databricks (Unity Catalog)
- Centralize permissions in UC; use catalogs for environments (prod, staging) and schemas for domains.
- Enable lineage in UC; require clusters to run with table access control.
- Use Delta Sharing with redaction on sensitive columns for partners.
Proving it works: the only metrics leadership and auditors care about
Track outcomes, not vibes.
- Coverage: % of datasets with policy tags; target >95% in 60 days.
- Access lead time: median time to approve compliant access; target <1 hour with self-service groups.
- DQ pass rate: dbt/Great Expectations tests; target >98% green in prod.
- PII exposure: count of queries touching PII by non-privileged roles; target zero within 30 days.
- Incident MTTR: time to detect and fix privacy regressions; target <4 hours with lineage + alerts.
- DSAR SLA: time to fulfill deletion/export; target <7 days end-to-end.
Put these on a shared dashboard. When the auditor arrives, you export evidence; when finance asks “is this slowing us down?”, you show faster access approvals and fewer fire drills.
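Coverage, for instance, can come straight from warehouse metadata. A minimal Snowflake sketch; the PROD database name is an assumption:
```sql
-- Sketch: tagged vs. total columns in PROD (divide for coverage %)
select
  count(distinct iff(t.column_name is not null,
                     c.table_name || '.' || c.column_name, null)) as tagged_columns,
  count(distinct c.table_name || '.' || c.column_name)            as total_columns
from prod.information_schema.columns c
left join snowflake.account_usage.tag_references t
  on  c.table_name  = t.object_name
  and c.column_name = t.column_name
  and t.object_database = 'PROD';
```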
Traps I’ve seen (and how to avoid them)
- Tag sprawl without enforcement: everyone tags, nothing applies. Fix: wire tags to platform policies that actually block reads.
- Masking in ETL only: downstream copies re-introduce PII. Fix: enforce at the warehouse boundary; transformations should assume controls exist.
- One-size-fits-all anonymization: hashing emails breaks joinability where business needs it. Fix: tokenization (Vault Transform) with role-based detokenization.
- Orphaned DSAR logic: scripts live on one engineer’s laptop. Fix: productionize in Airflow/Dagster with tests and runbooks.
- Governance by spreadsheet: auditors won’t buy it. Fix: policy-as-code, logs, and reproducible diffs.
- “We’ll do it after GA”: privacy debt compounds. Fix: lightweight controls first (tags + masking), iterate to ABAC and lineage.
What we do when GitPlumbers parachutes in
- Inventory and classify in two weeks: scanners + human review, policy taxonomy that makes sense to your business.
- Implement platform-native controls fast: Snowflake masking/row access or BigQuery policy tags, all in Terraform with GitOps.
- Wire tests and contracts into your CI and orchestrator, not just dbt docs.
- Stand up lineage and access logging with a minimal-but-sufficient stack.
- Ship a DSAR runbook and retention policies that run on a schedule, not hope.
- Leave you with dashboards that show coverage, incidents, and audit evidence on demand.
No silver bullets. Just battle-tested guardrails that make your data safer and your releases less terrifying.
Key takeaways
- Privacy controls are data quality controls—build them into the pipeline, not the ticket queue.
- Tag, test, enforce, and audit: the four-part architecture that scales and satisfies auditors.
- Use policy-as-code (OPA/Rego, Terraform) to make controls reviewable, diffable, and automatable.
- Pick platform-native primitives first (BigQuery policy tags, Snowflake masking, Unity Catalog) before custom code.
- Measure coverage and outcomes: % datasets tagged, PII exposure reductions, MTTR for privacy incidents, and access lead time.
Implementation checklist
- Inventory and tag PII/PHI with automated scanners and human review.
- Enforce masking and row-access at the warehouse layer (Snowflake/BigQuery/Databricks).
- Validate contracts and DQ tests pre-merge and pre-deploy (dbt + Great Expectations/Soda).
- Apply policy-as-code to block untagged PII and dangerous access patterns (OPA/Rego).
- Tokenize or encrypt at source when needed (Vault Transform/KMS).
- Set retention and deletion SLAs in storage with lifecycle policies.
- Instrument lineage and access logs; publish control coverage dashboards.
- Run canary/shadow deployments for schema changes; alert on privacy regressions.
Questions we hear from teams
- Do we need a full data governance platform before we start?
- No. Start with tags in your warehouse (BigQuery policy tags, Snowflake masking policies, Unity Catalog) and wire them via Terraform. Add a lightweight catalog like DataHub or OpenMetadata for lineage and tagging. You can layer Collibra/Alation later if needed.
- Will masking kill analyst productivity?
- Not if you design for it. Use role-based dynamic masking so analysts see what they need (e.g., hashed email for joins) and only privileged roles can detokenize. Provide privacy-safe views and materialized aggregates for common workflows.
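For example, a minimal privacy-safe view (Snowflake syntax; table and column names illustrative):
```sql
-- Sketch: analysts join on a stable hash, never the raw email
create or replace view analytics.users_safe as
select
  user_id,
  sha2(email) as email_hash,  -- deterministic, so joins across tables still work
  created_at
from prod.users;
```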
- How do we handle DSARs without breaking referential integrity?
- Use stable, privacy-safe identifiers (tokenized keys) to join, and implement deletion pipelines that remove or redact raw PII while re-materializing aggregates. Test DSAR pipelines in CI with seeded data to avoid surprises.
- What about re-identification risk in aggregates?
- Set minimum group sizes (k-anonymity thresholds), drop high-cardinality quasi-identifiers, and consider noise injection for sensitive metrics. Most analytics don’t need individual-level data—enforce purpose limitation via views.
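A minimal sketch of a k-anonymity threshold in SQL (k = 10 is illustrative; tune per dataset):
```sql
-- Sketch: suppress small groups before aggregates leave the warehouse
select region, plan_tier, count(*) as users
from analytics.users_safe
group by region, plan_tier
having count(*) >= 10;  -- groups under the k threshold are never exposed
```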
- We’re multi-cloud. Is this portable?
- Yes. Tag-test-enforce-audit works across AWS/GCP/Azure. Use cloud-native primitives where possible and centralize policy-as-code (OPA/Rego, Terraform) and lineage (OpenLineage/DataHub) for portability.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
