Ship Dashboards, Not Subpoenas: Standing Up Privacy Controls Without Killing Your Data Pipeline

The playbook we've deployed in regulated shops to meet GDPR/CCPA/HIPAA while keeping reliability, quality, and velocity high.

Privacy isn’t a platform; it’s a pipeline property you can test, version, and ship.

The Tuesday 9AM Fire Drill You’ve Lived Through

Last year, a fintech client pinged us at 9:03 AM: legal halted their monthly revenue dashboard because a new marketing_attribution model was joining user_email and device_id. The auditor asked for “evidence of masking and purpose limitation.” Engineering had a pile of LookML, a spaghetti of Airflow DAGs, and a lakehouse that was supposed to make this easy. It didn’t.

We got them compliant in four weeks without freezing development. Not with a new platform, but with metadata-first controls, warehouse-native policies, and CI that treats privacy like a failing test. Here’s the pattern that actually works, with the knobs and dials you can copy-paste today.

What “Privacy Controls” Actually Mean for Data Teams

Forget the legalese. For engineers, privacy translates to enforceable behaviors:

  • Classification: know which columns are PII, Sensitive, PHI, etc.
  • Purpose limitation: only certain roles/jobs can use PII for approved purposes (e.g., fraud, not marketing).
  • Minimization: don’t move raw PII unless necessary; mask or tokenize by default.
  • Retention: drop or archive on schedule; prove you did.
  • Auditability: show who accessed what, when, and whether it was masked.

Practical KPIs we hold ourselves to:

  • PII coverage: percent of PII columns with tags and policies applied (target > 98%).
  • Access latency: time from request to approved access (target < 24 hours with automation).
  • Data reliability SLOs: freshness and success rates on privacy-critical models (> 99% schedule success; < 2 hours lag).
  • Incident MTTR: time to detect and remediate privacy violations (target < 2 hours).
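The coverage KPI is easy to compute once the catalog is your system of record. A minimal sketch, assuming a hypothetical catalog export shaped as a list of column dicts (the field names here are our convention, not any specific tool's format):

```python
# Sketch: compute the PII coverage KPI from a catalog export.
# The input shape is hypothetical -- adapt to your metadata system of record.
def pii_coverage(columns):
    """columns: list of dicts like {"name": ..., "tags": [...], "policy": ... or None}."""
    pii = [c for c in columns if "PII" in c.get("tags", [])]
    if not pii:
        return 100.0  # nothing tagged PII means nothing uncovered
    covered = [c for c in pii if c.get("policy")]
    return 100.0 * len(covered) / len(pii)

cols = [
    {"name": "user_email", "tags": ["PII"], "policy": "MASK_DEFAULT"},
    {"name": "device_id", "tags": ["PII"], "policy": None},
    {"name": "event_name", "tags": [], "policy": None},
]
print(f"PII coverage: {pii_coverage(cols):.1f}%")  # 50.0% here
```

Run it nightly against the catalog and trend the number; a drop below target is an alert, not a quarterly surprise.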

Classify Once, Enforce Everywhere: Tags, Contracts, and CI

You can’t enforce what you haven’t labeled. Stop maintaining three spreadsheets. Pick a system of record for metadata: we’ve used dbt + the warehouse catalog, or OpenMetadata/Amundsen. Then make the tags stick via code.

  • Tag in dbt: annotate columns in schema.yml and auto-propagate tags to downstream models.
  • Block merges in CI: if a model with PII lacks masking policy or a purpose annotation, fail the PR.

Example: dbt macro to propagate PII tags and set a default policy name in model metadata.

-- macros/propagate_pii_tags.sql
{% macro propagate_pii(model_name) %}
  {# Column tags live in the dbt graph (from schema.yml), not on warehouse
     columns, so read them from graph.nodes rather than the adapter #}
  {%- set node = graph.nodes.values() | selectattr('name', 'equalto', model_name) | first -%}
  {%- for col_name, col in node['columns'].items() -%}
    {%- if 'PII' in (col['tags'] or []) -%}
      {{ log('PII column detected: ' ~ col_name, info=True) }}
      -- attach meta used by the warehouse deploy step
      {% do col['meta'].update({'privacy_policy': 'MASK_DEFAULT'}) %}
    {%- endif -%}
  {%- endfor -%}
{% endmacro %}

schema.yml snippet with explicit tags:

version: 2
models:
  - name: marketing_attribution
    columns:
      - name: user_email
        tags: [PII, IDENTIFIER]
      - name: device_id
        tags: [PII, SENSITIVE]

CI guardrail (pseudo) that fails if PII exists without privacy_policy:

#!/usr/bin/env bash
set -euo pipefail
# dbt ls emits one JSON object per line; columns appear only when requested
PII_WITHOUT_POLICY=$(dbt ls --resource-type model --output json --output-keys name --output-keys columns \
  | jq -r 'select(.columns != null) | .columns[] | select((.tags // []) | index("PII")) | select((.meta.privacy_policy // null) == null) | .name' | wc -l)
if [ "$PII_WITHOUT_POLICY" -gt 0 ]; then
  echo "Refusing to merge: PII columns missing privacy_policy metadata" >&2
  exit 1
fi

This is boring plumbing. Boring is good. It’s also the only way to scale beyond hero spreadsheets.

Enforce at the Warehouse: Masking, Policy Tags, and Terraform

I’ve seen teams reinvent masking in application code. It drifts instantly. Use your platform’s native controls and manage them as code.

  • Snowflake: dynamic data masking and row access policies tied to roles and tags.
  • BigQuery: policy tags with column-level access tied to IAM.
  • Terraform: golden path to make it repeatable and reviewable.

Snowflake masking policy based on purpose-specific roles:

-- Snowflake: create a masking policy once, apply to tagged columns
CREATE MASKING POLICY MASK_DEFAULT AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_FRAUD_ANALYST', 'SYSADMIN') THEN val
    WHEN CURRENT_ROLE() = 'DATA_MKT_ANALYST' THEN REGEXP_REPLACE(val, '(^.).*(.@.*$)', '\\1***\\2')
    ELSE NULL
  END;

-- Apply the masking policy directly to a column...
ALTER TABLE ANALYTICS.MKT.MARKETING_ATTRIBUTION MODIFY COLUMN USER_EMAIL
  SET MASKING POLICY MASK_DEFAULT;

-- ...or bind it to a tag so every PII-tagged column inherits it
-- (tag name is an example; use whatever your catalog propagates)
ALTER TAG GOVERNANCE.TAGS.PII SET MASKING POLICY MASK_DEFAULT;

BigQuery policy tags via Terraform:

# terraform/main.tf
resource "google_data_catalog_taxonomy" "privacy" {
  display_name           = "privacy"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "pii" {
  taxonomy     = google_data_catalog_taxonomy.privacy.id
  display_name = "PII"
}

resource "google_bigquery_table" "marketing_attr" { 
  dataset_id = var.dataset
  table_id   = "marketing_attribution"
  schema     = file("schemas/marketing_attribution.json")
}

# In schema JSON, add policyTags:
# { "name": "user_email", "type": "STRING", "policyTags": { "names": ["${google_data_catalog_policy_tag.pii.name}"] } }

resource "google_data_catalog_policy_tag_iam_binding" "pii_reader" {
  policy_tag = google_data_catalog_policy_tag.pii.name
  role       = "roles/datacatalog.categoryFineGrainedReader"
  members    = ["group:fraud-analysts@company.com"]
}

With tags and policies in place, downstream tools (Looker, Sigma) inherit masking automatically. No more “oops, that CSV export had emails.”

Streams and Contracts: Stop Leaking PII in Kafka and Files

Most privacy blowups we fix happen upstream—schemas change in Kafka or batch files grow new columns. You need contracts and gates.

  • Schema Registry (Confluent) with compatibility rules; reject producers that add PII without annotation.
  • Pre-commit linter for Avro/JSON/Protobuf to require privacy_class.
  • OPA/Rego to enforce purpose-based access at query time in services.

Simple Python linter for Avro files in CI:

# tools/avro_privacy_lint.py
import json, sys

def is_string(avro_type):
    # Handle plain "string" and nullable unions like ["null", "string"]
    if isinstance(avro_type, list):
        return "string" in avro_type
    return avro_type == "string"

bad = []
for path in sys.argv[1:]:
    with open(path) as f:
        schema = json.load(f)
    for field in schema.get("fields", []):
        if is_string(field["type"]) and field["name"].endswith("email"):
            # Avro allows custom attributes directly on the field object
            if "privacy_class" not in field:
                bad.append((path, field["name"]))
if bad:
    print("Missing privacy_class:")
    for p, n in bad:
        print(f" - {p}:{n}")
    sys.exit(1)

An example Rego snippet to require a purpose claim for PII tables:

# policy/privacy.rego
package privacy

default allow = false

# input: { role, purpose, table, tags }
# Helper rule: negating a wildcard expression directly is unsafe in Rego,
# so name the condition and negate the rule instead.
pii_tagged {
  input.tags[_] == "PII"
}

# Non-PII tables: pass through
allow {
  not pii_tagged
}

# PII tables: purpose-bound roles only
allow {
  pii_tagged
  input.purpose == "fraud"
  input.role == "DATA_FRAUD_ANALYST"
}

Gate query services through OPA/Envoy; pass table tags from the catalog. It’s not perfect, but it’s a cheap circuit breaker before data escapes to the wrong workload.
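A minimal sketch of that circuit breaker: the query service asks OPA's REST Data API for a decision before running anything. The OPA URL and the input shape are assumptions about your deployment; table tags would come from your catalog lookup.

```python
# Sketch: gate a query service through OPA's Data API before execution.
# OPA_URL is a hypothetical endpoint for the "privacy" package's allow rule.
import json
from urllib import request

OPA_URL = "http://opa:8181/v1/data/privacy/allow"  # assumed deployment address

def decision_input(role, purpose, table, tags):
    # Mirrors the input document the Rego policy expects.
    return {"input": {"role": role, "purpose": purpose, "table": table, "tags": tags}}

def is_allowed(opa_response):
    # OPA returns {"result": true} when the allow rule evaluates to true.
    return opa_response.get("result") is True

def check(role, purpose, table, tags, url=OPA_URL):
    body = json.dumps(decision_input(role, purpose, table, tags)).encode()
    req = request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return is_allowed(json.load(resp))

if __name__ == "__main__":
    payload = decision_input("DATA_FRAUD_ANALYST", "fraud",
                             "marketing_attribution", ["PII"])
    print(json.dumps(payload))
```

Fail closed: if OPA is unreachable, deny and alert rather than defaulting to open.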

Reliability Isn’t Optional: Tests, Lineage, and Evidence

Auditors don’t accept vibes. Neither should you. Add quality checks and lineage so you can prove controls worked and quickly triage when they didn’t.

  • Great Expectations/Soda/Deequ for nulls, uniqueness, value ranges, and freshness.
  • OpenLineage + Marquez for lineage and PII tag propagation.
  • SLOs wired to alerts (PagerDuty/Slack) when privacy-critical models break.

Great Expectations example:

# great_expectations/expectations/marketing_attr.yml
expect_table_row_count_to_be_between:
  min_value: 1
  max_value: 50000000
expect_column_values_to_not_be_null:
  column: user_email
  mostly: 0.999
expect_column_to_exist:
  column: device_id
# Freshness check via custom expectation or monitor job age < 2h

Airflow with OpenLineage so PII tags flow into Marquez:

# airflow.cfg
[openlineage]
namespace = analytics-prod
transport = {"type": "http", "url": "http://marquez:5000"}

# In DAGs, set dataset facets with tags from dbt/warehouse catalog

Now you can answer the two questions every auditor asks: “Show me which jobs touch PII” and “Show me where masking is applied.” Also, you get faster MTTR because you can see exactly which DAG introduced an untagged column.

Retention, Deletion, and Tokenization Without Breaking BI

You can be compliant and keep your dashboards fast.

  • Tokenize at the edge: store irreversible tokens in analytics; keep a secure lookup for operational needs.
  • Time-bound views: expose only last N days of PII for approved roles; aggregate older data.
  • Automated retention: scheduled jobs that drop partitions and emit deletion receipts to a log bucket.
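Tokenization at the edge can be as small as a keyed hash. A sketch, assuming the key lives in your KMS and "edge" means whatever process first touches raw PII (collector, CDC job, ingest function):

```python
# Sketch: deterministic keyed tokenization with HMAC-SHA256.
# SECRET_KEY is a placeholder -- in practice it comes from KMS and rotates.
import hashlib, hmac

SECRET_KEY = b"rotate-me-from-kms"

def tokenize(value: str) -> str:
    # Deterministic, so joins on the token still work in analytics;
    # irreversible without the key, unlike plain unsalted hashing.
    return hmac.new(SECRET_KEY, value.lower().encode(), hashlib.sha256).hexdigest()

print(tokenize("Jane.Doe@example.com") == tokenize("jane.doe@example.com"))  # True
```

Keep the reverse lookup (token to raw value), if you need one at all, in a separate operational store with its own access policy; analytics never sees it.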

Partitioned retention in BigQuery:

-- Partitioned table with automatic expiration
CREATE TABLE analytics.raw_events (
  user_pseudo_id STRING,
  event_name STRING,
  event_ts TIMESTAMP OPTIONS (description = "PII-adjacent")
) PARTITION BY DATE(event_ts)
  OPTIONS (partition_expiration_days = 365);

For Snowflake, use DATA_RETENTION_TIME_IN_DAYS and a daily task that purges beyond legal requirements. Log rows affected; ship to your audit storage account with immutability enabled (Object Lock/WORM).
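The deletion receipt itself is just a structured log record. A sketch, with field names that are our convention rather than any standard, assuming write-once audit storage downstream:

```python
# Sketch: emit a deletion receipt after a retention purge.
# The shape is illustrative; ship the JSON to your WORM-enabled audit bucket.
import hashlib, json
from datetime import datetime, timezone

def deletion_receipt(table, predicate, rows_deleted, job_id):
    body = {
        "table": table,
        "predicate": predicate,          # e.g. "event_ts < current_date - 365"
        "rows_deleted": rows_deleted,
        "job_id": job_id,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
    }
    # Content hash lets auditors verify the receipt wasn't edited after the fact
    payload = json.dumps(body, sort_keys=True)
    body["receipt_sha256"] = hashlib.sha256(payload.encode()).hexdigest()
    return body

receipt = deletion_receipt("analytics.raw_events",
                           "event_ts < current_date - 365",
                           1_204_991, "purge-2024-06-01")
print(json.dumps(receipt, indent=2))
```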

Rollout Plan With Real Numbers (What We Actually Delivered)

You don’t boil the ocean. You sequence it.

  1. Pick one domain (we chose marketing). Inventory 12 models, 4 tables, 2 Kafka topics. Tag PII in dbt and the warehouse catalog.
  2. Apply Snowflake masking and BigQuery policy tags via Terraform. Restrict Looker explores to masked views.
  3. Add GE tests for nulls/freshness and wire OpenLineage to Marquez. Publish a PII lineage graph.
  4. Add schema linter to the Kafka CI. Block merges that introduce untagged PII fields.
  5. Publish SLOs (99% DAG success, <2h freshness). Alert on violations. Add weekly compliance report: PII coverage, access latency, incident count.

Results in 4 weeks (fintech, Series D):

  • PII coverage: 63% -> 99.2%
  • Access latency (marketing analysts): 10 days -> 6 hours (self-serve roles with purpose requests)
  • Incident MTTR: 26 hours -> 1.8 hours (lineage + alerts)
  • Data quality: null rate on user_email down 88% after enforcing contracts upstream
  • Business impact: revenue dashboard shipped on schedule; legal signed off before month-end close

We didn’t rip-and-replace. We tightened bolts already in the stack. If you need a sparring partner to do this under audit pressure, GitPlumbers has done it in fintech, healthtech, and adtech without turning your lake into a museum.


Key takeaways

  • Classify and tag PII at the source; propagate tags automatically through your pipelines.
  • Enforce access with warehouse-native controls (masking, row/column policies) and policy tags, managed via Terraform.
  • Treat privacy rules as code: CI should block schemas, queries, or models that violate contracts.
  • Instrument observability to prove compliance and reliability with auditable metrics.
  • Roll out incrementally: one domain, one policy, measurable deltas in MTTR, PII coverage, and time-to-access.

Implementation checklist

  • Inventory and tag PII fields with a single metadata system of record.
  • Automate tag propagation in `dbt` and enforce in warehouse/lake.
  • Provision masking/row policies and policy tags via Terraform.
  • Gate schemas with CI (Schema Registry + linters) and queries with purpose binding.
  • Add data quality tests (freshness, nulls, uniqueness) and lineage (OpenLineage).
  • Publish SLOs and dashboards for PII coverage, access latency, and incident MTTR.
  • Run a limited-scope pilot; iterate before scaling.

Questions we hear from teams

Do we need a new platform to get compliant?
No. Start with metadata and policy-as-code. Use your warehouse’s native features (Snowflake masking, BigQuery policy tags) and add CI guardrails. We rarely recommend ripping out your lake/warehouse unless it’s fundamentally missing column-level controls.
How do we handle free-text PII in event payloads?
Don’t. Tokenize or hash at the edge. If you must ingest, route into a quarantine dataset with strict policies and a scrubbing job. Flag via lints and pattern scans (DLP) and block the fields from flowing into analytics models.
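A pattern scan for the quarantine gate can start as a handful of regexes. A sketch only; the patterns here are illustrative, and a managed DLP service will catch far more than this:

```python
# Sketch: cheap pattern scan for free-text fields before they reach analytics.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    # Return the sorted set of pattern names that matched
    return sorted({name for name, rx in PATTERNS.items() if rx.search(text)})

print(scan("contact jane.doe@example.com, ssn 123-45-6789"))  # ['email', 'us_ssn']
```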
What about data scientists who need raw PII for model training?
Create purpose-bound sandboxes with time-boxed access and masked defaults. Require a ticket with documented legal basis. Snapshot datasets with row/column policies applied; log all queries. Revoke automatically after the window ends.
Will masking slow down our dashboards?
Usually not. Native masking policies are evaluated at query time and push down efficiently. If you see hotspots, precompute masked views for BI and reserve unmasked tables for narrow roles.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Bring in GitPlumbers for a privacy controls assessment
Read the regulated fintech case study
