The Data Lake That Stopped Drowning Us: Designing a Lakehouse That Scales Without Torching Trust
You don’t need another petabyte. You need a lakehouse that keeps promises as it scales: reliable data, quality you can prove, and business value delivered on schedule.
A data lake that scales volume without scaling trust is just an expensive junk drawer.
The data lake that almost drowned us
I walked into a retailer’s data platform that was “cloud scale” on paper and an incident generator in reality. 6 PB on S3, 400+ Athena users, and nightly reports missing half their KPIs. Marketing built lookalikes on stale data; Finance couldn’t reconcile revenue; engineering got paged at 2am for yet another small-files blowup. Tools weren’t the issue—they had Spark, Kafka, dbt, Airflow, and a dozen curated buckets. The issue was simple: volume scaled, trust didn’t.
What fixed it wasn’t a forklift migration. We standardized on a transactional table format, enforced schemas, added quality gates, introduced actual SLOs, and automated the boring operational stuff. In 90 days, MTTR dropped from 9h → 45m, quality pass rate rose from 71% → 98%, and Athena spend dropped 42%—with faster dashboards. Here’s the playbook we used at the retailer, and frankly, across fintech, B2B SaaS, and healthtech since the Hadoop days.
Architecture that scales: lakehouse with guardrails
If your “lake” is just Parquet files in a bucket, you’re paying cloud storage prices for a junk drawer. You need ACID, schema evolution, time travel, and compaction.
- Pick one: Delta Lake, Apache Iceberg, or Apache Hudi. Don’t multi-format unless you enjoy chaos. We typically default:
  - Delta on Databricks-heavy teams
  - Iceberg for multi-engine (Trino/Athena/Spark/Snowflake external) shops
  - Hudi when upsert-heavy change capture dominates
- Catalog it: Unity Catalog, AWS Glue Catalog, or Iceberg REST/Nessie. Everything registers here.
- Organize data using Bronze/Silver/Gold. Treat promotions like artifact promotion in software.
- Keep files healthy: target 128–512 MB file sizes, scheduled compaction/OPTIMIZE, and clustering (`Z-ORDER` in Delta, `sort.order` in Iceberg).
Example: Spark writing to Delta with schema enforcement and auto-optimize:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("orders-stream")
    # Keep files healthy at write time: optimized writes + auto compaction
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .config("spark.databricks.delta.autoCompact.enabled", "true")
    .getOrCreate())

# Contract-style DDL schema applied at parse time
schema = "order_id BIGINT, user_id BIGINT, amount DECIMAL(10,2), ts TIMESTAMP, country STRING"

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
    .selectExpr("from_json(json, '{}') AS data".format(schema))
    .select("data.*"))

(parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/bronze/orders/_checkpoints")
    .option("path", "s3://lake/bronze/orders")
    .start())

If you’re team Iceberg, the equivalent works well: write with `format("iceberg")` and handle sort/compaction via `rewrite_data_files` + `expire_snapshots`.
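For instance, here is a minimal PySpark sketch of the same stream landing in an Iceberg table. The catalog and table names (`glue_catalog`, `bronze.orders`) are illustrative assumptions, and the maintenance calls would run from a separate scheduled job, not inside the stream:

# Sketch only: reuses `parsed` and `spark` from the Delta example above and assumes
# an Iceberg catalog named glue_catalog is configured on the Spark session.
(parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/bronze/orders_iceberg/_checkpoints")
    .toTable("glue_catalog.bronze.orders"))

# Periodic maintenance via Iceberg's Spark procedures (run from a scheduled batch job)
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'bronze.orders', strategy => 'binpack')")
spark.sql("CALL glue_catalog.system.expire_snapshots(table => 'bronze.orders', older_than => TIMESTAMP '2025-11-01 00:00:00')")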
Ingestion that won’t melt: batch, streaming, contracts
I’ve seen teams YOLO CSVs into S3 and pray. That works—until governance and growth show up. Do three things:
- Standardize landing layout: `s3://landing/<domain>/<table>/ingest_date=YYYY-MM-DD/part-*.snappy.parquet`. Partition first by ingest date, then by a low-cardinality column like `country` or `region`. Avoid cardinality bombs (user_id, device_id).
- Use CDC where it matters: Debezium → Kafka/Kinesis → upserts into Hudi/Delta/Iceberg (see the merge sketch after this list). Keeps facts accurate without nightly full reloads.
- Define data contracts at the edge: publish a `yaml` schema with types, allowed values, and nullability. Validate before Bronze. Bad payloads go to a DLQ.
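To make the CDC bullet concrete, here’s a minimal Delta-flavored merge sketch. The paths, the Debezium-style `op` column, and the single-key match are assumptions, not a drop-in implementation:

from delta.tables import DeltaTable

# Sketch only: reuses `spark` from the earlier example; paths and the Debezium-style
# `op` column ('c' create, 'u' update, 'd' delete) are assumptions.
changes = spark.read.parquet("s3://landing/orders/ingest_date=2025-12-03/")

target = DeltaTable.forPath(spark, "s3://lake/silver/orders")

(target.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'd'")
    .whenMatchedUpdateAll(condition="s.op != 'd'")
    .whenNotMatchedInsertAll(condition="s.op != 'd'")
    .execute())

The same pattern maps to Hudi’s upsert write operation or Iceberg’s MERGE INTO.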
Example contract (yaml) for orders:
name: orders_v1
owner: data-platform@company.com
schema:
  - name: order_id
    type: long
    required: true
  - name: user_id
    type: long
    required: true
  - name: amount
    type: decimal(10,2)
    required: true
    constraints:
      min: 0
  - name: ts
    type: timestamp
    required: true
  - name: country
    type: string
    required: true
    constraints:
      one_of: [US, CA, GB, DE, FR]
reject_on_unknown_fields: true

Put this contract in Git, version it, and fail the ingest job if violated. This single move kills a whole class of downstream “why is everything NULL?” incidents.
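A minimal sketch of enforcing that contract at the edge, assuming PyYAML and this exact file layout; a real pipeline would lean on a schema registry or a validation framework rather than hand-rolled checks:

import yaml  # PyYAML

def load_contract(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def violations(record: dict, contract: dict) -> list:
    """Return a list of human-readable contract violations for one record."""
    errors = []
    fields = {f["name"]: f for f in contract["schema"]}
    unknown = set(record) - set(fields)
    if contract.get("reject_on_unknown_fields") and unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            if spec.get("required"):
                errors.append(f"{name}: required but missing")
            continue
        constraints = spec.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            errors.append(f"{name}: {value} below min {constraints['min']}")
        if "one_of" in constraints and value not in constraints["one_of"]:
            errors.append(f"{name}: {value!r} not in allowed set")
    return errors

contract = load_contract("contracts/orders_v1.yaml")
record = {"order_id": 1, "user_id": 42, "amount": -5, "ts": "2025-12-03T10:00:00Z", "country": "BR"}
bad = violations(record, contract)
if bad:
    # In a real ingest job: publish the payload + reasons to a DLQ topic or quarantine prefix, then fail loudly
    raise ValueError(f"contract orders_v1 violated: {bad}")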
Quality as code: tests, quarantine, and circuit breakers
Shipping petabytes is easy. Shipping truth is hard. Treat quality like unit tests + run-time guards.
- Add Great Expectations or Soda suites for every table. Run at Bronze→Silver and Silver→Gold promotions.
- Implement a quarantine zone (`s3://lake/quarantine/<table>/...`) for failing partitions (sketch below).
- Use circuit breakers: if completeness or accuracy SLIs fall below threshold, stop promotion, page on-call.
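A minimal sketch of the quarantine move, assuming the paths above and a toy completeness check; real jobs would record the failure reason alongside the data and run inside the orchestrator:

from pyspark.sql import functions as F

# Sketch only: reuses `spark` from the ingestion example; paths and thresholds are illustrative.
day = "2025-12-03"
batch = (spark.read.format("delta")
    .load("s3://lake/bronze/orders")
    .where(F.to_date("ts") == day))

# Toy completeness check: too many NULL order_ids => quarantine the day and halt promotion
total = batch.count()
null_ratio = batch.where(F.col("order_id").isNull()).count() / max(total, 1)
if null_ratio > 0.005:
    (batch.write.format("delta")
        .mode("append")
        .save("s3://lake/quarantine/orders"))
    raise RuntimeError(f"orders {day} quarantined: null order_id ratio {null_ratio:.2%}")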
Great Expectations example for orders completeness and range checks:
expectation_suite_name: orders_suite
expectations:
  - expect_table_row_count_to_be_between:
      min_value: 100
  - expect_column_values_to_not_be_null:
      column: order_id
  - expect_column_values_to_be_between:
      column: amount
      min_value: 0
  - expect_column_values_to_be_in_set:
      column: country
      value_set: ["US", "CA", "GB", "DE", "FR"]

Wire this into your orchestrator. Airflow DAG fragment with a breaker:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

def breaker(**context):
    # Halt promotion (and page) if the GE checkpoint reported failure
    result = context["ti"].xcom_pull(task_ids="ge_orders")
    if not result or result["success"] is False:
        raise Exception("Quality gate failed; halting promotion")

with DAG("silver_promote_orders", start_date=datetime(2024, 1, 1), schedule_interval="@hourly", catchup=False) as dag:
    bronze_done = EmptyOperator(task_id="bronze_done")
    ge_check = GreatExpectationsOperator(
        task_id="ge_orders",
        data_context_root_dir="/opt/gx",
        checkpoint_name="orders_checkpoint",
    )
    circuit_breaker = PythonOperator(task_id="circuit_breaker", python_callable=breaker)
    bronze_done >> ge_check >> circuit_breaker

Tie alerts to PagerDuty. Publish pass/fail and coverage to a dashboard. Aim for 95%+ test coverage on critical Gold entities by column-weighted importance, not just table count.
Governance, catalog, and lineage you can actually use
Security theater is common: encryption at rest, public buckets “blocked,” and then a service account that can read everything. Fix it properly:
- Centralize metadata in a real catalog (Unity/Glue/Iceberg REST). Enforce table registration. No table, no query.
- Row/column-level security:
  - Unity Catalog grants with attribute-based access (ABAC)
  - Iceberg + Trino/Starburst with Ranger/Apache Sentry or engine-native policies
- PII handling: tokenize in Bronze, expose only masked versions in Silver/Gold. Keep raw restricted.
- Lineage: instrument OpenLineage (Marquez) from Airflow/Dagster/Spark. Make it visible in the catalog.
- Retention and GDPR: use table-level delete semantics (`DELETE FROM table WHERE user_id = ...`) with snapshot expiry (see the sketch after this list).
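To ground the retention bullet, a minimal Delta-flavored sketch (table names are illustrative; Iceberg supports the same `DELETE FROM` plus `expire_snapshots`):

# Sketch only: reuses `spark`; assumes the tables are registered in the catalog.
# Logical delete in every layer that carries the user's rows...
for table in ["bronze.orders", "silver.orders"]:
    spark.sql(f"DELETE FROM {table} WHERE user_id = 42")

# ...then physically remove the deleted files once the retention window allows it (Delta)
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")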
A minimal Terraform snippet for S3 with sane defaults:
resource "aws_s3_bucket" "lake" {
bucket = "company-lake"
force_destroy = false
}
resource "aws_s3_bucket_versioning" "lake" {
bucket = aws_s3_bucket.lake.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
bucket = aws_s3_bucket.lake.id
rule { apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" } }
}
resource "aws_s3_bucket_public_access_block" "lake" {
bucket = aws_s3_bucket.lake.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}Pair this with least-privilege IAM, per-domain roles, and temporary credentials only. If your catalog doesn’t enforce access, your lake will leak.
Operate like software: CI/CD, GitOps, and SLOs
Most data lakes rot because changes are ad hoc. Treat the platform like any other production system.
- GitOps for infra with Terraform; deploy via ArgoCD or CI runners. No click-ops.
- dbt for transformations with tests and docs. Use exposures for downstream SLAs.
- Promotion via PRs only: Bronze→Silver→Gold moves are build artifacts, not bash scripts.
- SLOs/SLIs: define freshness, completeness, and accuracy per table. Track in Prometheus and alert in Grafana.
Example SLIs to track:
- Freshness (max `ts` in Silver vs wall clock) < 15 min (P95)
- Completeness (rows in Silver / rows in Bronze) > 0.995 (daily)
- Accuracy (sample recon vs source or reconciliation totals) > 0.99
- MTTR for data incidents < 60 min; change failure rate < 10%
Prometheus-style alert rule (pseudo):
groups:
  - name: data-slo
    rules:
      - alert: OrdersFreshnessBreached
        expr: (time() - orders_silver_max_ts_seconds) > 900
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Orders freshness SLO breached"

Add OpenLineage events from orchestrators so you can pinpoint exactly which upstream table broke your KPI in seconds, not hours.
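The `orders_silver_max_ts_seconds` metric has to come from somewhere. A minimal sketch using `prometheus_client` and a Pushgateway from the scheduled promotion job — the gateway address and table path are assumptions:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql import functions as F

# Sketch only: reuses `spark`; run this from the same scheduled job that promotes Silver.
max_ts = (spark.read.format("delta")
    .load("s3://lake/silver/orders")
    .agg(F.max(F.unix_timestamp("ts")).alias("max_ts"))
    .collect()[0]["max_ts"])

registry = CollectorRegistry()
gauge = Gauge("orders_silver_max_ts_seconds",
              "Unix timestamp of the newest order in Silver", registry=registry)
gauge.set(max_ts or 0)

# Push to a Pushgateway that Prometheus scrapes (address is an assumption)
push_to_gateway("pushgateway:9091", job="orders_freshness", registry=registry)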
Cost and performance tuning that pays for itself
If your CFO is asking why Athena/Trino cost doubled, it’s usually metadata or small-file hell.
- Fix partitioning. Use dates and a coarse domain dimension. Avoid high-cardinality columns.
- Enforce target file size. Compact hourly for streaming, daily for batch.
- Optimize table layout:
  - Delta: `OPTIMIZE ... ZORDER BY (ts, country)`
  - Iceberg: enable `write.target-file-size-bytes`, use `rewrite_data_files` and `sort.order` on query keys
- Prune data early: projections in dbt/Spark, not SELECT *
- Cache hot datasets in your engine (Databricks SQL cache, Trino memory)
- Monitor $/TB scanned, query latency, small-file ratio, and partitions-per-query.
Athena compaction with Iceberg via Trino SQL example:
-- Compact small files for yesterday’s partition
ALTER TABLE sales_iceberg EXECUTE REWRITE DATA USING BIN_PACK WHERE day = date '2025-12-03';
-- Expire old snapshots to keep metadata in check
CALL system.expire_snapshots('sales_iceberg', date '2025-11-01');Expect 20–50% cost drops just by compaction + better partitioning. I’ve seen 10x speed-ups on critical dashboards after Z-Ordering the right columns.
What good looks like in 90 days
When GitPlumbers is asked to rescue a lake that’s “big but brittle,” we aim for measurable, boring reliability:
- Week 1–2: choose table format + catalog, stand up promotion paths, define 3–5 critical data contracts
- Week 3–4: implement Great Expectations on top 10 tables, add quarantine + circuit breakers, wire Prometheus SLIs
- Week 5–6: optimize 3 heavy queries; compaction jobs; fix partitioning; publish cost dashboard
- Week 7–8: dbt tests/doc coverage 80% of Gold; lineage visible; least-privilege IAM enforced
- Week 9–12: expand contracts; streaming for latency-sensitive domains; incident runbooks and on-call
Outcomes we typically deliver:
- MTTR: 9h → 45m
- Data quality pass rate: 70–80% → 95–99%
- Freshness: hours → minutes (P95 < 15m on key entities)
- Cost: 30–50% reduction in $/TB scanned; 2–5x faster BI queries
The point isn’t buzzwords. It’s trust you can scale. When the business asks “Can we launch the new pricing model this quarter?”, your answer shouldn’t depend on whether last night’s CSV had an extra comma.
Key takeaways
- Scale volume and trust together by choosing a transactional table format (Delta/Iceberg/Hudi) and enforcing schemas from day one.
- Treat quality as code: run expectations, quarantine bad data, and fail fast with circuit breakers tied to SLOs.
- Design ingestion for growth: standardized file layout, partitioning you can live with, and streaming where latency matters.
- Govern with a real catalog and lineage; lock down access at the column/row level and version your data like code.
- Operate the lakehouse with GitOps, dbt tests, and SLOs in Prometheus; automate compaction and optimize to kill small files and cost bleed.
- Prove value with numbers: reduce MTTR, raise quality pass rates, cut query cost/latency, and accelerate time-to-insight.
Implementation checklist
- Pick one table format that gives you ACID + time travel: Delta, Iceberg, or Hudi. Standardize.
- Stand up a proper catalog (Unity/Glue/Iceberg REST/Nessie) and require table registration.
- Define data contracts in `yaml`. Reject non-conformant payloads at the edge.
- Add a Bronze/Silver/Gold layout with quality gates between layers.
- Instrument SLIs: freshness, completeness, accuracy, and schema drift; alert on SLO breaches.
- Automate compaction/OPTIMIZE and enforce target file sizes (128–512 MB).
- Adopt GitOps: infra in Terraform, pipelines in code, promotion via PRs only.
- Publish a cost dashboard: $/TB scanned, query latency, small-file ratio, storage growth.
Questions we hear from teams
- Delta vs. Iceberg vs. Hudi—what should I pick?
- If you’re on Databricks and want simplicity, Delta is great. If you need multi-engine interoperability (Trino, Athena, Spark, Snowflake external tables), Iceberg is the safe bet. If your workload is heavy on CDC upserts and incremental merges, Hudi shines. Pick one and standardize.
- When do I need streaming instead of batch?
- If business value depends on freshness under ~30 minutes (fraud scoring, near-real-time ops), go streaming. Otherwise, high-frequency micro-batch often wins on simplicity and cost. Avoid cargo-cult streaming for daily KPIs.
- How do I handle GDPR deletion with immutable logs?
- Use table-level deletes in your lakehouse format (DELETE FROM …) and expire snapshots/compaction to physically remove data within policy. Encrypt and restrict raw Bronze access; propagate deletes through Silver/Gold via CDC or change tables.
- Do I still need a warehouse if I have a lakehouse?
- Often yes, for semantics, governance, and BI concurrency. Many teams run a warehouse (Snowflake/BigQuery) over curated Gold or use Trino/Databricks SQL. The trick is consistent tables and one source of truth in your catalog.
- What about AI-generated pipelines—safe or not?
- AI can draft boilerplate Spark/dbt code fast, but we always run a **vibe code cleanup**: enforce the contract, add tests, set file sizes, and wire SLOs. Otherwise you get cute demos and nasty 2am incidents.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
