The Day the Auditor Found Your S3 Bucket: A Data Governance Framework Engineers Don’t Hate

A pragmatic, engineering-first approach to data governance that boosts reliability, tightens security, and actually ships.

Governance isn’t a committee; it’s a merge check that blocks bad data before it hits prod.

The day the audit letter hit Slack

I’ve lived the 7 a.m. Slack ping from Legal: “We have a GDPR DPIA due Friday. Also, why is the s3://prod-customer-exports bucket public?” Meanwhile, Finance couldn’t close because a “harmless” schema drift broke the revenue model. Classic: CSVs in S3, Snowflake roles nobody remembers, shadow ETLs in Airflow, and PII sprinkled across Parquet like confetti.

We were asked to “stand up governance” without freezing delivery. We didn’t do a six-month catalog project. We cut a thin slice: identity + catalog + lineage + contracts + policy-as-code. In three weeks, incidents dropped, access got faster, and the auditor left with more evidence than they asked for.

Governance that works is a merge check, not a meeting.

Governance engineers don’t hate: the minimum viable control plane

Skip the buzzwords. If you’re running Snowflake/Databricks, Kafka, dbt, and an orchestrator (Airflow/Dagster), you need four pillars:

  • Identity and authorization: Centralized roles, tag-based access, column masking. Use Apache Ranger (on-prem) or AWS Lake Formation (cloud). For SaaS DWH, lean on Snowflake roles and dynamic masking.
  • Catalog and lineage: One source of truth for datasets, PII tags, stewards. DataHub or OpenMetadata for the catalog; OpenLineage for runtime lineage.
  • Data contracts and tests: Producers own schemas; consumers get guarantees. Enforce with Schema Registry (Avro/Protobuf/JSON), dbt tests, Great Expectations/Soda checks.
  • Policy-as-code and GitOps: Governance lives in Git. Approvals and deployments via ArgoCD/CI. OPA/Rego to enforce patterns.

Everything else (risk registers, steercos) hangs off these.

Build the control plane: identity, catalog, lineage

You can deploy this with tools you already run.

  • Identity and access
    • Standardize on short-lived creds (OIDC to Snowflake/Databricks). No more warehouse users with permanent passwords.
    • Use tag-based access for PII. On AWS, Lake Formation + Glue table/column tags works well. On-prem, Ranger tag-based policies.
# Terraform: Lake Formation LF-Tag for PII, plus a column-scoped grant
resource "aws_lakeformation_lf_tag" "pii" {
  key    = "classification"
  values = ["pii", "restricted"]
}

resource "aws_lakeformation_permissions" "marketing_reader" {
  principal   = aws_iam_role.marketing_analyst.arn
  permissions = ["DESCRIBE", "SELECT"]

  # no grant option: analysts can't re-share access
  permissions_with_grant_option = []

  table_with_columns {
    database_name = "warehouse"
    name          = "customers"

    # deny the PII columns; use column_names = ["id", "country", "signup_date"]
    # instead if you prefer an explicit allow-list
    wildcard              = true
    excluded_column_names = ["email", "phone", "ssn"]
  }

  # for fully tag-based grants, attach tags via aws_lakeformation_resource_lf_tags
  # and grant on an lf_tag_policy block instead of naming columns
}
  • Snowflake masking policies
create or replace masking policy mask_email as (val string) returns string ->
  case when current_role() in ('SECURITY_ADMIN','PII_READER') then val
       else regexp_replace(val, '(^.).+(@.+$)', '\\1***\\2') end;

alter table analytics.customers modify column email set masking policy mask_email;
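  • Short-lived credentials: the OIDC bullet above typically maps to an External OAuth security integration in Snowflake. A minimal sketch (the integration name, issuer, and JWKS URL are placeholders for your IdP):
-- Sketch: let analysts authenticate with IdP-issued tokens instead of
-- long-lived warehouse passwords. Values below are illustrative.
create security integration if not exists idp_external_oauth
  type = external_oauth
  enabled = true
  external_oauth_type = okta
  external_oauth_issuer = 'https://acme.okta.com/oauth2/default'
  external_oauth_jws_keys_url = 'https://acme.okta.com/oauth2/default/v1/keys'
  external_oauth_token_user_mapping_claim = 'sub'
  external_oauth_snowflake_user_mapping_attribute = 'login_name';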
  • Catalog and lineage
    • Stand up DataHub quickly via Helm; ingest Snowflake, dbt, and Airflow.
    • Tag PII at the column level. Make “owner” and “steward” mandatory metadata.
# DataHub ingestion recipes for dbt + Snowflake (snippets; one recipe per source)
source:
  type: dbt
  config:
    manifest_path: ./target/manifest.json
    catalog_path: ./target/catalog.json
    target_platform: snowflake
---
source:
  type: snowflake
  config:
    account_id: ACME
    username: ${SNOW_USER}
    password: ${SNOW_PWD}
    role: DATA_READER
    include_column_lineage: true
  • Runtime lineage with OpenLineage
# Airflow: enable OpenLineage
# Install the openlineage-airflow package and set OPENLINEAGE_URL (plus
# OPENLINEAGE_API_KEY) in the environment. On Airflow 2.3+ the listener emits
# lineage automatically with no DAG code changes; on older Airflow, swap the
# DAG import so runs are wrapped:
from openlineage.airflow import DAG

Outcome: who can see what is explicit, PII is masked by default, and you can answer “where did this column come from?” in seconds.

Guardrails in the data path: contracts, tests, and SLOs

Most “data unreliability” is just schema drift and bad assumptions.

  • Data contracts at the edges
    • For event streams, enforce compatibility with Schema Registry.
# Confluent Cloud: enforce BACKWARD compatibility on topic
confluent schema-registry subject update my-topic-value --compatibility BACKWARD
  • For batch interfaces, define contract docs and dbt sources with constraints.
# dbt source + tests (the regex check uses the dbt-expectations package)
version: 2
sources:
  - name: app
    tables:
      - name: customers
        columns:
          - name: id
            tests: [not_null, unique]
          - name: email
            tests:
              - not_null
              - dbt_expectations.expect_column_values_to_match_regex:
                  regex: '^[^@\s]+@[^@\s]+$'
  • Quality tests as merge gates
    • Run Great Expectations or Soda in CI and in Airflow.
# Great Expectations: expectation suite (excerpt)
expectations:
  - expectation_type: expect_table_row_count_to_be_between
    kwargs: {min_value: 1}
  - expectation_type: expect_column_values_to_not_be_null
    kwargs: {column: email}
  - expectation_type: expect_column_values_to_match_regex
    kwargs: {column: country, regex: "^[A-Z]{2}$"}
  • SLOs for data reliability

    • Define SLOs like:
      • Freshness: 99% of analytics.orders loads finish by 06:00 UTC.
      • Accuracy: 99.5% of orders.amount within ±0.5% vs source-of-truth.
      • Schema stability: <1 breaking change/month per domain.
    • Wire alerts with Prometheus exporting Airflow DAG metrics and test pass rates; page on SLO burn.
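    • A sketch of a Prometheus alert for the freshness SLO (the metric name is an assumption; use whatever gauge your Airflow exporter actually emits):
# Prometheus rule: page when analytics.orders hasn't loaded for 6h
# 'airflow_dag_last_success_timestamp' is a hypothetical metric name
groups:
  - name: data-slos
    rules:
      - alert: OrdersFreshnessSLOAtRisk
        expr: time() - airflow_dag_last_success_timestamp{dag_id="load_analytics_orders"} > 6 * 3600
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "analytics.orders freshness SLO at risk: no successful load in 6h"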
  • Block bad changes

# OPA policy: block PRs adding PII columns without masking
package governance

violation[msg] {
  input.change.type == "add_column"
  input.change.tags[_] == "pii"
  not input.change.masking_policy
  msg := sprintf("PII column %s requires masking policy", [input.change.name])
}

Outcome: Producers can evolve, consumers don’t get surprised, and tests fail fast—before a CFO yells.

Policy-as-code and GitOps: governance that ships the same day

Governance breaks down when it’s hidden in wikis. Put it in Git with owners and reviews.

  1. Create a governance repo: policies (Rego), access maps, dataset tags, masking rules, and SLOs.
  2. CI checks:
    • Validate catalog metadata (owners, tags).
    • Run opa test on policy pack.
    • Dry-run terraform plan for IAM/Ranger/Lake Formation.
  3. Deploy via ArgoCD or your CI to apply policies and metadata.
# ArgoCD app for governance definitions
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: governance
spec:
  project: default
  source:
    repoURL: https://github.com/acme/governance
    path: envs/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: governance
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
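
For the CI checks in step 2, a minimal GitHub Actions sketch (the paths, metadata script, and Terraform directory are illustrative, not a prescribed layout):
# .github/workflows/governance-ci.yml (sketch)
name: governance-ci
on: [pull_request]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: open-policy-agent/setup-opa@v2
      - name: Policy unit tests
        run: opa test policies/ -v
      - name: Validate catalog metadata (owner, steward, pii tags)
        run: python scripts/validate_metadata.py   # hypothetical helper script
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform dry run (cloud creds come from CI secrets)
        run: |
          terraform -chdir=lakeformation init
          terraform -chdir=lakeformation plan -lock=false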

Automate evidence too: nightly export Snowflake.ACCOUNT_USAGE, CloudTrail, Ranger audit logs into your SIEM. The next time Audit asks “who read email last quarter?”, you query, don’t scramble.
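
When that question lands, a sketch of the ACCESS_HISTORY query (Enterprise edition; table and column names are illustrative):
-- Who read analytics.customers.email in the last quarter?
-- ACCESS_HISTORY lags by up to a few hours and requires Enterprise edition.
select ah.user_name,
       ah.query_start_time,
       obj.value:"objectName"::string as object_name
from snowflake.account_usage.access_history ah,
     lateral flatten(input => ah.base_objects_accessed) obj,
     lateral flatten(input => obj.value:"columns") col
where obj.value:"objectName"::string ilike '%ANALYTICS.CUSTOMERS'
  and col.value:"columnName"::string = 'EMAIL'
  and ah.query_start_time >= dateadd(month, -3, current_timestamp());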

Proving value: reliability, speed, and compliance you can measure

When we rolled this out at a fintech scale-up (Snowflake + Kafka + dbt + Airflow), we tracked five KPIs:

  • Data incident rate: -68% in 60 days (schema drift caught pre-merge).
  • MTTR for data incidents: 5h -> 1h 20m (lineage + ownership + alerts).
  • DQ pass rate: 89% -> 98.3% for tier-1 models.
  • P95 pipeline latency: Improved 35% after pruning flaky retries.
  • Access request SLA: 5 business days -> 2 hours (tag-based policies + self-service catalog).

Compliance outcomes:

  • GDPR DSAR fulfillment time from 10 days to 36 hours (lineage + PII tags + masking).
  • Audit prep from four panic weeks to two calm days (automated evidence exports).
  • Zero critical findings on least-privilege and key rotation (Vault + KMS + policy-as-code).

Business value:

  • Finance closed two days earlier because revenue marts stabilized.
  • Marketing attribution went live on schedule—no cross-team schema fights.
  • Cloud storage egress costs dropped 18% after lineage revealed unused feeds.

The playbook: what works (and what we’d do differently)

If I had to land this in a quarter at a new shop, here’s the sequence that’s actually stuck for us at GitPlumbers:

  1. Scope the blast radius: pick two domains (e.g., Payments, Growth) and 10 tier-1 tables.
  2. Stand up the catalog (DataHub/OpenMetadata). Require owner, steward, pii tags in PRs.
  3. Centralize identity: short-lived creds; map business roles -> warehouse roles; kill Snowflake-local users.
  4. Tag PII at column level. Mask by default; explicit allow for privileged roles.
  5. Enforce contracts: Schema Registry BACKWARD compatibility on Kafka; dbt constraints on batch.
  6. Wire tests: Great Expectations/Soda in CI; fail merges on red.
  7. Emit lineage (OpenLineage in Airflow). Visualize upstream/downstream in the catalog.
  8. Define SLOs for freshness/accuracy; page on burn rates.
  9. Policy-as-code: OPA + Terraform, deployed via ArgoCD/CI. Everything reviewed.
  10. Automate evidence exports to SIEM. Build saved queries for the top 10 audit asks.

What I’d skip early: company-wide data council, custom metadata UIs, and hunting the perfect tool. Use what integrates cleanly; replace later if you must. Governance that bites on day one earns you political capital to refine in Q2.

If you need a crew that ships guardrails without slowing your teams, this is exactly the kind of mess GitPlumbers fixes—legacy, AI-assisted, or somewhere in between.

Key takeaways

  • Governance that works is a pipeline feature, not a committee meeting: bake it into CI/CD and runtime.
  • Identity, catalog/lineage, data contracts, and policy-as-code are the minimum viable control plane.
  • Measure data reliability like SRE: SLOs for freshness/accuracy and MTTR for data incidents.
  • Use tag-based access and masking policies to secure PII without blocking delivery.
  • Prove business value: faster access approvals, fewer incidents, and audit evidence on demand.

Implementation checklist

  • Inventory PII and tag it at the column level in your catalog.
  • Centralize identity with role-based access and tag-based policies (Ranger or Lake Formation).
  • Enforce data contracts with `Schema Registry` and dbt/Great Expectations tests.
  • Make policies code-reviewed and versioned; deploy with `ArgoCD` or your CI.
  • Instrument lineage (`OpenLineage`) and quality SLOs; alert via `Prometheus`/`PagerDuty`.
  • Automate evidence: export audit logs (`CloudTrail`, `Snowflake ACCOUNT_USAGE`) nightly into your SIEM.
  • Define access request SLAs in minutes, not days—automate approvals for low-risk datasets.

Questions we hear from teams

Do we need to buy an enterprise catalog before we start?
No. Start with DataHub or OpenMetadata; you can migrate metadata later if you outgrow them. The value comes from consistent tags, ownership, and lineage—tools are interchangeable if you keep the metadata model clean.
How do we avoid slowing down engineers with access controls?
Use tag-based policies and short-lived credentials. Default to masked views and automate low-risk access approvals. Measure access SLA as a product KPI; aim for hours, not days.
What if our data lives across Snowflake, S3, and Databricks?
Unify governance at the metadata and policy layers: catalog for tags/ownership, OpenLineage for flows, and policy-as-code (OPA/Terraform) to push consistent rules into Snowflake, Lake Formation, and Ranger.
Can we quantify ROI on governance?
Yes: fewer incidents (rate/MTTR), higher DQ pass rate, faster audit cycles, reduced time-to-access, and cost savings from pruning unused data flows identified via lineage.
Where does AI fit into this?
Treat model inputs/outputs as governed datasets. Tag features with sensitivity, enforce contracts on feature schemas, and log lineage into your catalog. Mask or tokenize PII before features hit training or inference.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Talk to GitPlumbers about data governance that ships | Get the governance playbook (PDF)
