The Data Governance Playbook That Survived an Audit and Shipped Features

Governance that engineers don’t hate: policy-as-code, quality SLOs, and locked-down PII that still lets marketing run campaigns.


The audit that woke us up

A few years back, a regulated fintech called us 48 hours before a SOC 2 Type II audit. The examiner asked for column-level lineage for email addresses, proof of masking in non-prod, and evidence that failed quality checks blocked downstream ML scoring. Their stack was familiar: Fivetran -> Kafka -> S3 (Delta) -> Databricks + dbt -> Snowflake -> Hightouch. Half the governance “docs” lived in a Confluence graveyard. Sound familiar?

We got them through the audit, but more importantly, we left them with a governance framework that didn’t grind delivery to a halt. The trick wasn’t buying another catalog or scheduling more committees. It was turning governance into code developers could run, test, and deploy.

Governance that can’t be expressed as code is just theater.

Governance that ships, not slows

Forget the slideware. The governance that works in production does three things:

  • Improves reliability: Fewer broken dashboards, lower MTTR when pipelines fail.
  • Protects data: PII/PHI is discoverable, masked, and access is auditable.
  • Delivers business value: Marketing, finance, and product get trustworthy data fast.

Common failure patterns I’ve seen:

  • A catalog without enforcement. Metadata is stale by the next sprint.
  • Quality checks living in notebooks. No one sees failures until earnings day.
  • IAM chaos. Over-privileged roles because “the query must run today.”
  • Shadow ETL and rogue reverse-ETL breaking lineage.

What actually works is a minimum viable governance stack implemented as code and aligned to SLOs you can defend to an auditor and a CFO.

The minimum viable governance stack (with code)

  1. Data contracts at the edges
  • Use schema-registry-backed formats (Avro, Protobuf, or JSON Schema).
  • Reject bad events at ingress; fail closed, not open.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://schemas.acme.com/user_event.schema.json",
  "title": "user_event",
  "type": "object",
  "properties": {
    "user_id": {"type": "string"},
    "email": {"type": "string", "format": "email"},
    "created_at": {"type": "string", "format": "date-time"}
  },
  "required": ["user_id", "created_at"],
  "additionalProperties": false
}
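To make "fail closed" concrete, here is a minimal Python sketch of ingress validation against that contract. It is hand-rolled for illustration; in production the schema registry's serializer enforces this for you.

```python
import json

# Minimal fail-closed gate mirroring the user_event contract above:
# required fields must be present and unknown fields are rejected.
REQUIRED = {"user_id", "created_at"}
ALLOWED = {"user_id", "email", "created_at"}

def accept(raw: str) -> bool:
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed payloads fail closed, not open
    keys = set(event)
    return REQUIRED <= keys and keys <= ALLOWED

good = '{"user_id": "u1", "created_at": "2024-01-01T00:00:00Z"}'
bad = '{"user_id": "u1", "created_at": "2024-01-01T00:00:00Z", "utm": "x"}'
print(accept(good), accept(bad))  # True False
```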
  2. Lineage you can trust
  • Instrument OpenLineage via Airflow/Dagster or Spark listeners. Store to Marquez or DataHub.
export OPENLINEAGE_URL=https://marquez.yourco.local
export OPENLINEAGE_API_KEY=$OPENLINEAGE_KEY
# Airflow (OpenLineage provider): the integration is on unless disabled
export AIRFLOW__OPENLINEAGE__DISABLED=False
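For orientation, this is roughly the RunEvent shape those integrations emit on your behalf, per the OpenLineage spec. The producer URI, namespaces, and job name below are made up.

```python
import json
import uuid

# Hand-built OpenLineage RunEvent payload, the shape the Airflow/Spark
# integrations POST to the lineage backend for you. Names are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "producer": "https://github.com/acme/pipelines",  # hypothetical producer URI
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "load_fct_orders"},
    "inputs": [{"namespace": "s3://lake", "name": "raw.orders"}],
    "outputs": [{"namespace": "snowflake://acct", "name": "analytics.fct_orders"}],
}
payload = json.dumps(event)
print(json.loads(payload)["job"]["name"])  # load_fct_orders
```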
  3. Quality tied to SLOs
  • Use dbt tests for model-level checks; Great Expectations for dataset-level expectations.
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: email
        meta: { pii: true }
        tests:
          - relationships:
              to: ref('dim_users')
              field: email
              config:
                severity: error
    config:
      tags: [pii]
# great_expectations/expectations/fct_orders.json (abridged)
{
  "expectation_suite_name": "fct_orders",
  "expectations": [
    {"expectation_type": "expect_table_row_count_to_be_between", "kwargs": {"min_value": 10000}},
    {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "order_id"}},
    {"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "email", "regex": "^.+@.+$"}}
  ]
}
  4. Access control and masking
  • Centralize in Terraform. Use tag-based policies where available.
# Terraform: AWS Lake Formation tag-based access
resource "aws_lakeformation_lf_tag" "pii" {
  key    = "pii"
  values = ["true", "false"]
}

resource "aws_lakeformation_resource_lf_tags" "pii_tag" {
  database {
    name = "analytics"
  }
  lf_tag {
    key   = "pii"
    value = "true"
  }
}

resource "aws_lakeformation_permissions" "analyst" {
  principal   = aws_iam_role.analyst.arn
  permissions = ["SELECT"]

  lf_tag_policy {
    resource_type = "TABLE"
    expression {
      key    = "pii"
      values = ["false"]
    }
  }
}
  5. Audit and evidence
  • All changes via PRs, with OPA/Rego guards; nightly dumps of policy states to S3/GCS for audits.
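A sketch of what one evidence artifact can look like, assuming a nightly job that snapshots policy state and content-hashes it before upload; the policy names here are illustrative.

```python
import hashlib
import json

# Snapshot of current policy state, hashed so auditors can verify the
# artifact wasn't edited after capture. A nightly job would upload both
# the JSON and its digest to S3/GCS. Policy names are illustrative.
snapshot = {
    "captured_at": "2024-01-01T02:00:00Z",  # nightly job timestamp
    "masking_policies": ["mask_email"],
    "row_access_policies": ["restrict_eu"],
}
body = json.dumps(snapshot, sort_keys=True).encode()
digest = hashlib.sha256(body).hexdigest()
print(len(digest))  # 64
```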

Policy as code and GitOps for data

You don’t need another committee. You need merge checks. Encode governance rules in CI.

  • Terraform for IAM/policies: Snowflake roles/warehouses, BigQuery policy tags, S3/KMS.
  • OPA/Rego: Prevent unreviewed PII exposure or models without tests.
  • GitOps: ArgoCD/Flux for platform configs; dbt deployments through CI with checks.

Rego example: block dbt PRs if models with pii: true lack a masking policy reference.

package gitplumbers.dq

violation["pii model missing masking policy"] {
  some i
  input.pull_request.changed_files[i].path == "models/fct_users.yml"
  contains(input.pull_request.body, "pii: true")
  not contains(input.pull_request.body, "masking_policy:")
}

Snowflake masking policy as code:

create or replace masking policy mask_email as (val string) returns string ->
  case when current_role() in ('DATA_SCIENTIST') then val
       when val is null then null
       -- keep the first character, mask the rest of the local part
       else regexp_replace(val, '^(.)[^@]*', '\\1***') end;

alter table analytics.fct_orders modify column email set masking policy mask_email;

BigQuery policy tag via Terraform:

resource "google_data_catalog_taxonomy" "pii" {
  display_name           = "PII"
  region                 = "us"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "pii_email" {
  taxonomy     = google_data_catalog_taxonomy.pii.id
  display_name = "pii_email"
}

resource "google_bigquery_datapolicy_data_policy" "email_mask" {
  location         = "us"
  data_policy_id   = "mask_email"
  policy_tag       = google_data_catalog_policy_tag.pii_email.name
  data_policy_type = "DATA_MASKING_POLICY"
  data_masking_policy { predefined_expression = "SHA256" }
}

The result: governance is reviewable, testable, and auditable. No more “tribal knowledge.”

Quality SLOs you can defend

Don’t boil the ocean. Pick 3 SLOs and wire alerts to PagerDuty/Slack with owner runbooks.

  • Freshness: 99% of fct_orders rows less than 2 hours old during business hours.
  • Completeness: > 99.5% of events conform to contract; reject otherwise.
  • Accuracy/Consistency: Order revenue within ±0.5% of Stripe settlement daily.
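The accuracy SLO reduces to a simple tolerance check; a sketch with made-up totals:

```python
# Daily accuracy check: warehouse revenue vs. Stripe settlement, within
# +/-0.5%. The figures below are made up for illustration.
def within_tolerance(warehouse_total: float, settlement_total: float,
                     tolerance: float = 0.005) -> bool:
    if settlement_total == 0:
        return warehouse_total == 0
    drift = abs(warehouse_total - settlement_total) / settlement_total
    return drift <= tolerance

print(within_tolerance(100_300.0, 100_000.0))  # 0.3% drift -> True
print(within_tolerance(101_000.0, 100_000.0))  # 1.0% drift -> False
```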

Prometheus alert for freshness:

# Using Sloth to define SLOs
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: fct-orders-freshness
spec:
  service: data-pipelines
  slos:
    - name: freshness-2h
      objective: 99
      sli:
        events:
          errorQuery: |
            sum(pipeline_freshness_seconds{table="fct_orders"} > bool 7200)
          totalQuery: |
            count(pipeline_freshness_seconds{table="fct_orders"})
      alerting:
        name: FctOrdersFreshness
        labels: {severity: critical}
        annotations:
          summary: fct_orders freshness SLO breach

Airflow DAG snippet to fail early on SLO breach:

from airflow.sensors.base import BaseSensorOperator
from prometheus_api_client import PrometheusConnect

class FreshnessSensor(BaseSensorOperator):
    """Succeeds only while no series breaches the 2h freshness threshold."""

    def poke(self, context):
        pc = PrometheusConnect(url="https://prom.yourco.local", disable_ssl=True)
        q = 'pipeline_freshness_seconds{table="fct_orders"} > 7200'
        # custom_query returns the matching series; an empty result means fresh
        return pc.custom_query(query=q) == []

Tie these to budgets: if freshness falls below 99% weekly, auto-open an incident with an owner and a time-boxed fix.
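The budget gate itself is small; a sketch with a stubbed incident hook (swap in PagerDuty or whatever your on-call tooling is):

```python
# Error-budget gate: if weekly freshness attainment drops below the 99%
# objective, open an incident. open_incident is stubbed for illustration.
OBJECTIVE = 0.99

def check_budget(good_minutes: int, total_minutes: int, open_incident) -> bool:
    attainment = good_minutes / total_minutes
    if attainment < OBJECTIVE:
        open_incident(f"freshness attainment {attainment:.2%} is below {OBJECTIVE:.0%}")
        return False
    return True

incidents = []
ok = check_budget(good_minutes=9_900, total_minutes=10_080,
                  open_incident=incidents.append)
print(ok, len(incidents))  # False 1
```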

Lock down PII without killing analytics

I’ve seen teams go from “everyone is sysadmin” to “no one can query anything.” The middle path:

  • Classify once with tags: pii, phi, confidential, public.
  • Mask by default, unmask by role and purpose.
  • Row access policies for geography/legal holds.

Snowflake example with row access:

create or replace row access policy restrict_eu as (country string) returns boolean ->
  case when current_role() in ('EU_ANALYST') then true else country != 'EU' end;

alter table analytics.customer set row access policy restrict_eu on (country);

BigQuery column policy tag binding:

-- Assuming a policy tag projects/p/locations/us/taxonomies/tax/policyTags/pii_email exists
alter table `project.analytics.fct_orders`
  alter column email
  set options (policy_tags = ['projects/p/locations/us/taxonomies/tax/policyTags/pii_email']);

Kafka contract enforcement (Confluent Schema Registry):

# Subject is <topic>-value under the default TopicNameStrategy
curl -X PUT \
  -H "Content-Type: application/json" \
  --data '{"compatibility": "BACKWARD"}' \
  http://schema-registry:8081/config/user_event-value

Net effect in a recent rollout: access approvals dropped from 5 days to 2 hours, and non-prod PII exposure incidents went to zero after we enforced masking policies in CI.

Prove value in 90 days: a realistic rollout plan

Phase 0 (Week 0): Baseline and risk map

  • Inventory top 20 datasets by business impact. Tag PII/PHI. Identify owners.
  • Measure current incident rate, MTTR, and access-request SLA.

Phase 1 (Weeks 1–4): Contracts, lineage, and basic tests

  1. Enforce schemas at ingestion (Avro/JSON Schema, Schema Registry).
  2. Add OpenLineage to Airflow/Spark; publish to Marquez/DataHub.
  3. Add dbt tests to Tier-1 models; wire CI to block on failures.

Phase 2 (Weeks 5–8): Access control and masking

  1. Implement Terraform-managed roles and tag-based policies.
  2. Add masking/row policies for PII in Snowflake/BigQuery.
  3. Create an “access request” GitOps flow: PR-driven with auto-approvals for non-sensitive.

Phase 3 (Weeks 9–12): SLOs and observability

  1. Define 3 SLOs (freshness, completeness, accuracy) and alerts.
  2. Add Great Expectations suites to critical tables; publish results to ge_docs or DataHub.
  3. Run a game day: break a contract and verify the blast radius is contained.

Expected outcomes we’ve delivered at clients (real numbers):

  • 60–80% reduction in data-quality incidents on Tier-1 models within 60 days.
  • MTTR down from ~6h to <45m on failed loads.
  • Access request cycle time down from days to hours.
  • Audit prep time cut by ~70% because evidence is exported from code.

What I’d do differently next time

  • Don’t lead with a catalog. Lead with contracts, tests, and access control. Catalog comes alive once metadata is real.
  • Keep your policy surface area small. Two data classes to start (PII vs. non-PII) beats six that no one remembers.
  • Make product managers sign off on SLOs. If it’s not tied to a KPI, it will be ignored.
  • Budget for a platform owner. Governance fails when it’s “nobody’s job.”

If you want help wiring this without derailing your roadmap, this is exactly the sort of trench work GitPlumbers does—fixing the messy middle so you can pass audits and ship features.


Key takeaways

  • Governance that matters is executable: treat policies as code, versioned and tested in CI.
  • Start with a minimum viable governance stack: data contracts, lineage, quality SLOs, access control, and audit.
  • Use dbt and Great Expectations to enforce tests that map directly to business SLOs (freshness, completeness, accuracy).
  • Lock down PII with column- and row-level controls (Snowflake/BigQuery) while preserving legitimate use via masked roles.
  • Prove value in 90 days with measurable outcomes: fewer broken dashboards, faster access requests, lower MTTR.

Implementation checklist

  • Define your data classes (PII/PHI/confidential/public) and map to controls.
  • Codify access with Terraform/IaC and tag-based policies (Snowflake, BigQuery, Lake Formation).
  • Adopt data contracts at ingestion (Avro/JSON Schema) and fail closed on violations.
  • Instrument lineage with OpenLineage and wire alerts to SLOs.
  • Add dbt tests + Great Expectations suites tied to KPIs; block deploys on regressions.
  • Create masking/row-access policies for PII and enforce via roles and policy tags.
  • Run audits from code: export configs, run drift detection, and store evidence artifacts.

Questions we hear from teams

How is this different from just buying a data catalog?
Catalogs are useful for discovery and documentation, but they don’t enforce behavior. This approach treats governance as code—contracts at ingestion, tests in CI, access via Terraform, and lineage instrumentation—so the system itself prevents violations and produces audit evidence automatically.
Where should we start if we have nothing?
Start at the edges: enforce data contracts on ingestion and add dbt tests to Tier-1 models. In parallel, classify PII columns and apply a single masking policy. That alone reduces breakage and risk, and it doesn’t require boiling the ocean.
What if we’re primarily on Databricks or Snowflake?
Same patterns. On Databricks, use Delta Lake with table properties for classification, Unity Catalog for permissions, and OpenLineage Spark listener. On Snowflake, use masking and row access policies with Terraform-managed roles. dbt and Great Expectations work across both.
How do we measure success?
Track DQ incident rate, SLO attainment (freshness/completeness/accuracy), MTTR, access-request cycle time, and audit prep time. If those trend in the right direction within 90 days, your governance is working.
What about notebooks and ad-hoc data pulls?
Treat them as products. Run notebooks against governed compute (e.g., Databricks with Unity Catalog or Snowflake with roles), restrict access via roles, and require checked-in artifacts for anything promoted beyond exploratory work.

