The Data Lake That Stopped Drowning Us: Designing a Lakehouse That Scales Without Torching Trust
You don’t need another petabyte. You need a lakehouse that keeps promises as it scales: reliable data, quality you can prove, and business value delivered on schedule.
A data lake that scales volume without scaling trust is just an expensive junk drawer.
The data lake that almost drowned us
I walked into a retailer’s data platform that was “cloud scale” on paper and an incident generator in reality. 6 PB on S3, 400+ Athena users, and nightly reports missing half their KPIs. Marketing built lookalikes on stale data; Finance couldn’t reconcile revenue; engineering got paged at 2am for yet another small-files blowup. Tools weren’t the issue—they had Spark, Kafka, dbt, Airflow, and a dozen curated buckets. The issue was simple: volume scaled, trust didn’t.
What fixed it wasn’t a forklift migration. We standardized on a transactional table format, enforced schemas, added quality gates, introduced actual SLOs, and automated the boring operational stuff. In 90 days, MTTR dropped from 9h → 45m, quality pass rate rose from 71% → 98%, and Athena spend dropped 42%—with faster dashboards. Here’s the playbook we used at the retailer, and frankly, across fintech, B2B SaaS, and healthtech since the Hadoop days.
Architecture that scales: lakehouse with guardrails
If your “lake” is just Parquet files in a bucket, you’re paying cloud storage prices for a junk drawer. You need ACID, schema evolution, time travel, and compaction.
- Pick one: Delta Lake, Apache Iceberg, or Apache Hudi. Don’t multi-format unless you enjoy chaos. We typically default:
  - Delta on Databricks-heavy teams
  - Iceberg for multi-engine (Trino/Athena/Spark/Snowflake external) shops
  - Hudi when upsert-heavy change capture dominates
- Catalog it: Unity Catalog, AWS Glue Catalog, or Iceberg REST/Nessie. Everything registers here.
- Organize data using Bronze/Silver/Gold. Treat promotions like artifact promotion in software.
- Keep files healthy: target 128–512 MB file sizes, scheduled compaction/OPTIMIZE, and clustering (`Z-ORDER` in Delta, `sort.order` in Iceberg).
Example: Spark writing to Delta with schema enforcement and auto-optimize:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("orders-stream")
    # Keep files healthy at write time: optimized writes + auto compaction
    .config("spark.databricks.delta.optimizeWrite.enabled", "true")
    .config("spark.databricks.delta.autoCompact.enabled", "true")
    .getOrCreate())

# Contract-style DDL schema applied at parse time
schema = "order_id BIGINT, user_id BIGINT, amount DECIMAL(10,2), ts TIMESTAMP, country STRING"

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
    .selectExpr("from_json(json, '{}') AS data".format(schema))
    .select("data.*"))

(parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/bronze/orders/_checkpoints")
    .option("path", "s3://lake/bronze/orders")
    .start())

If you’re team Iceberg, the equivalent works well: write with `format("iceberg")` and handle sort/compaction via `rewrite_data_files` + `expire_snapshots`.
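For instance, here is a minimal PySpark sketch of the same stream landing in an Iceberg table. The catalog and table names (`glue_catalog`, `bronze.orders`) are illustrative assumptions, and the maintenance calls would run from a separate scheduled job, not inside the stream:

# Sketch only: reuses `parsed` and `spark` from the Delta example above and assumes
# an Iceberg catalog named glue_catalog is configured on the Spark session.
(parsed.writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://lake/bronze/orders_iceberg/_checkpoints")
    .toTable("glue_catalog.bronze.orders"))

# Periodic maintenance via Iceberg's Spark procedures (run from a scheduled batch job)
spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'bronze.orders', strategy => 'binpack')")
spark.sql("CALL glue_catalog.system.expire_snapshots(table => 'bronze.orders', older_than => TIMESTAMP '2025-11-01 00:00:00')")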
Ingestion that won’t melt: batch, streaming, contracts
I’ve seen teams YOLO CSVs into S3 and pray. That works—until governance and growth show up. Do three things:
- Standardize landing layout: `s3://landing/<domain>/<table>/ingest_date=YYYY-MM-DD/part-*.snappy.parquet`. Partition first by ingest date, then by a low-cardinality column like `country` or `region`. Avoid cardinality bombs (user_id, device_id).
- Use CDC where it matters: Debezium → Kafka/Kinesis → upserts into Hudi/Delta/Iceberg (see the merge sketch after this list). Keeps facts accurate without nightly full reloads.
- Define data contracts at the edge: publish a `yaml` schema with types, allowed values, and nullability. Validate before Bronze. Bad payloads go to a DLQ.
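To make the CDC bullet concrete, here’s a minimal Delta-flavored merge sketch. The paths, the Debezium-style `op` column, and the single-key match are assumptions, not a drop-in implementation:

from delta.tables import DeltaTable

# Sketch only: reuses `spark` from the earlier example; paths and the Debezium-style
# `op` column ('c' create, 'u' update, 'd' delete) are assumptions.
changes = spark.read.parquet("s3://landing/orders/ingest_date=2025-12-03/")

target = DeltaTable.forPath(spark, "s3://lake/silver/orders")

(target.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.op = 'd'")
    .whenMatchedUpdateAll(condition="s.op != 'd'")
    .whenNotMatchedInsertAll(condition="s.op != 'd'")
    .execute())

The same pattern maps to Hudi’s upsert write operation or Iceberg’s MERGE INTO.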
Example contract (yaml) for orders:
name: orders_v1
owner: data-platform@company.com
schema:
  - name: order_id
    type: long
    required: true
  - name: user_id
    type: long
    required: true
  - name: amount
    type: decimal(10,2)
    required: true
    constraints:
      min: 0
  - name: ts
    type: timestamp
    required: true
  - name: country
    type: string
    required: true
    constraints:
      one_of: [US, CA, GB, DE, FR]
reject_on_unknown_fields: true

Put this contract in Git, version it, and fail the ingest job if violated. This single move kills a whole class of downstream “why is everything NULL?” incidents.
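A minimal sketch of enforcing that contract at the edge, assuming PyYAML and this exact file layout; a real pipeline would lean on a schema registry or a validation framework rather than hand-rolled checks:

import yaml  # PyYAML

def load_contract(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)

def violations(record: dict, contract: dict) -> list:
    """Return a list of human-readable contract violations for one record."""
    errors = []
    fields = {f["name"]: f for f in contract["schema"]}
    unknown = set(record) - set(fields)
    if contract.get("reject_on_unknown_fields") and unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            if spec.get("required"):
                errors.append(f"{name}: required but missing")
            continue
        constraints = spec.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            errors.append(f"{name}: {value} below min {constraints['min']}")
        if "one_of" in constraints and value not in constraints["one_of"]:
            errors.append(f"{name}: {value!r} not in allowed set")
    return errors

contract = load_contract("contracts/orders_v1.yaml")
record = {"order_id": 1, "user_id": 42, "amount": -5, "ts": "2025-12-03T10:00:00Z", "country": "BR"}
bad = violations(record, contract)
if bad:
    # In a real ingest job: publish the payload + reasons to a DLQ topic or quarantine prefix, then fail loudly
    raise ValueError(f"contract orders_v1 violated: {bad}")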
Quality as code: tests, quarantine, and circuit breakers
Shipping petabytes is easy. Shipping truth is hard. Treat quality like unit tests + run-time guards.
- Add Great Expectations or Soda suites for every table. Run at Bronze→Silver and Silver→Gold promotions.
- Implement a quarantine zone (`s3://lake/quarantine/<table>/...`) for failing partitions (sketch below).
- Use circuit breakers: if completeness or accuracy SLIs fall below threshold, stop promotion, page on-call.
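A minimal sketch of the quarantine move, assuming the paths above and a toy completeness check; real jobs would record the failure reason alongside the data and run inside the orchestrator:

from pyspark.sql import functions as F

# Sketch only: reuses `spark` from the ingestion example; paths and thresholds are illustrative.
day = "2025-12-03"
batch = (spark.read.format("delta")
    .load("s3://lake/bronze/orders")
    .where(F.to_date("ts") == day))

# Toy completeness check: too many NULL order_ids => quarantine the day and halt promotion
total = batch.count()
null_ratio = batch.where(F.col("order_id").isNull()).count() / max(total, 1)
if null_ratio > 0.005:
    (batch.write.format("delta")
        .mode("append")
        .save("s3://lake/quarantine/orders"))
    raise RuntimeError(f"orders {day} quarantined: null order_id ratio {null_ratio:.2%}")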
Great Expectations example for orders completeness and range checks:
expectation_suite_name: orders_suite
expectations:
  - expect_table_row_count_to_be_between:
      min_value: 100
  - expect_column_values_to_not_be_null:
      column: order_id
  - expect_column_values_to_be_between:
      column: amount
      min_value: 0
  - expect_column_values_to_be_in_set:
      column: country
      value_set: ["US", "CA", "GB", "DE", "FR"]

Wire this into your orchestrator. Airflow DAG fragment with a breaker:
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator

def breaker(**context):
    # Halt promotion (and page) if the GE checkpoint reported failure
    result = context["ti"].xcom_pull(task_ids="ge_orders")
    if not result or result["success"] is False:
        raise Exception("Quality gate failed; halting promotion")

with DAG("silver_promote_orders", start_date=datetime(2024, 1, 1), schedule_interval="@hourly", catchup=False) as dag:
    bronze_done = EmptyOperator(task_id="bronze_done")
    ge_check = GreatExpectationsOperator(
        task_id="ge_orders",
        data_context_root_dir="/opt/gx",
        checkpoint_name="orders_checkpoint",
    )
    circuit_breaker = PythonOperator(task_id="circuit_breaker", python_callable=breaker)
    bronze_done >> ge_check >> circuit_breaker

Tie alerts to PagerDuty. Publish pass/fail and coverage to a dashboard. Aim for 95%+ test coverage on critical Gold entities by column-weighted importance, not just table count.
Governance, catalog, and lineage you can actually use
Security theater is common: encryption at rest, public buckets “blocked,” and then a service account that can read everything. Fix it properly:
- Centralize metadata in a real catalog (Unity/Glue/Iceberg REST). Enforce table registration. No table, no query.
- Row/column-level security:
  - Unity Catalog grants with attribute-based access (ABAC)
  - Iceberg + Trino/Starburst with Ranger/Apache Sentry or engine-native policies
- PII handling: tokenize in Bronze, expose only masked versions in Silver/Gold. Keep raw restricted.
- Lineage: instrument OpenLineage (Marquez) from Airflow/Dagster/Spark. Make it visible in the catalog.
- Retention and GDPR: use table-level delete semantics (`DELETE FROM table WHERE user_id = ...`) with snapshot expiry (see the sketch after this list).
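To ground the retention bullet, a minimal Delta-flavored sketch (table names are illustrative; Iceberg supports the same `DELETE FROM` plus `expire_snapshots`):

# Sketch only: reuses `spark`; assumes the tables are registered in the catalog.
# Logical delete in every layer that carries the user's rows...
for table in ["bronze.orders", "silver.orders"]:
    spark.sql(f"DELETE FROM {table} WHERE user_id = 42")

# ...then physically remove the deleted files once the retention window allows it (Delta)
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")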
A minimal Terraform snippet for S3 with sane defaults:
resource "aws_s3_bucket" "lake" {
bucket = "company-lake"
force_destroy = false
}
resource "aws_s3_bucket_versioning" "lake" {
bucket = aws_s3_bucket.lake.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
bucket = aws_s3_bucket.lake.id
rule { apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" } }
}
resource "aws_s3_bucket_public_access_block" "lake" {
bucket = aws_s3_bucket.lake.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}Pair this with least-privilege IAM, per-domain roles, and temporary credentials only. If your catalog doesn’t enforce access, your lake will leak.
Operate like software: CI/CD, GitOps, and SLOs
Most data lakes rot because changes are ad hoc. Treat the platform like any other production system.
- GitOps for infra with Terraform; deploy via ArgoCD or CI runners. No click-ops.
- dbt for transformations with tests and docs. Use exposures for downstream SLAs.
- Promotion via PRs only: Bronze→Silver→Gold moves are build artifacts, not bash scripts.
- SLOs/SLIs: define freshness, completeness, and accuracy per table. Track in Prometheus and alert in Grafana.
Example SLIs to track:
- Freshness (max `ts` in Silver vs wall clock) < 15 min (P95)
- Completeness (rows in Silver / rows in Bronze) > 0.995 (daily)
- Accuracy (sample recon vs source or reconciliation totals) > 0.99
- MTTR for data incidents < 60 min; change failure rate < 10%
Prometheus-style alert rule (pseudo):
groups:
  - name: data-slo
    rules:
      - alert: OrdersFreshnessBreached
        expr: (time() - orders_silver_max_ts_seconds) > 900
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Orders freshness SLO breached"

Add OpenLineage events from orchestrators so you can pinpoint exactly which upstream table broke your KPI in seconds, not hours.
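The `orders_silver_max_ts_seconds` metric has to come from somewhere. A minimal sketch using `prometheus_client` and a Pushgateway from the scheduled promotion job — the gateway address and table path are assumptions:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from pyspark.sql import functions as F

# Sketch only: reuses `spark`; run this from the same scheduled job that promotes Silver.
max_ts = (spark.read.format("delta")
    .load("s3://lake/silver/orders")
    .agg(F.max(F.unix_timestamp("ts")).alias("max_ts"))
    .collect()[0]["max_ts"])

registry = CollectorRegistry()
gauge = Gauge("orders_silver_max_ts_seconds",
              "Unix timestamp of the newest order in Silver", registry=registry)
gauge.set(max_ts or 0)

# Push to a Pushgateway that Prometheus scrapes (address is an assumption)
push_to_gateway("pushgateway:9091", job="orders_freshness", registry=registry)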
Cost and performance tuning that pays for itself
If your CFO is asking why Athena/Trino cost doubled, it’s usually metadata or small-file hell.
- Fix partitioning. Use dates and a coarse domain dimension. Avoid high-cardinality columns.
- Enforce target file size. Compact hourly for streaming, daily for batch.
- Optimize table layout:
  - Delta: `OPTIMIZE ... ZORDER BY (ts, country)`
  - Iceberg: enable `write.target-file-size-bytes`, use `rewrite_data_files` and `sort.order` on query keys
- Prune data early: projections in dbt/Spark, not SELECT *
- Cache hot datasets in your engine (Databricks SQL cache, Trino memory)
- Monitor $/TB scanned, query latency, small-file ratio, and partitions-per-query.
Athena compaction with Iceberg via Trino SQL example:
-- Compact small files for yesterday’s partition
ALTER TABLE sales_iceberg EXECUTE REWRITE DATA USING BIN_PACK WHERE day = date '2025-12-03';
-- Expire old snapshots to keep metadata in check
CALL system.expire_snapshots('sales_iceberg', date '2025-11-01');Expect 20–50% cost drops just by compaction + better partitioning. I’ve seen 10x speed-ups on critical dashboards after Z-Ordering the right columns.
What good looks like in 90 days
When GitPlumbers is asked to rescue a lake that’s “big but brittle,” we aim for measurable, boring reliability:
- Week 1–2: choose table format + catalog, stand up promotion paths, define 3–5 critical data contracts
- Week 3–4: implement Great Expectations on top 10 tables, add quarantine + circuit breakers, wire Prometheus SLIs
- Week 5–6: optimize 3 heavy queries; compaction jobs; fix partitioning; publish cost dashboard
- Week 7–8: dbt tests/doc coverage 80% of Gold; lineage visible; least-privilege IAM enforced
- Week 9–12: expand contracts; streaming for latency-sensitive domains; incident runbooks and on-call
Outcomes we typically deliver:
- MTTR: 9h → 45m
- Data quality pass rate: 70–80% → 95–99%
- Freshness: hours → minutes (P95 < 15m on key entities)
- Cost: 30–50% reduction in $/TB scanned; 2–5x faster BI queries
The point isn’t buzzwords. It’s trust you can scale. When the business asks “Can we launch the new pricing model this quarter?”, your answer shouldn’t depend on whether last night’s CSV had an extra comma.
Key takeaways
- Scale volume and trust together by choosing a transactional table format (Delta/Iceberg/Hudi) and enforcing schemas from day one.
- Treat quality as code: run expectations, quarantine bad data, and fail fast with circuit breakers tied to SLOs.
- Design ingestion for growth: standardized file layout, partitioning you can live with, and streaming where latency matters.
- Govern with a real catalog and lineage; lock down access at the column/row level and version your data like code.
- Operate the lakehouse with GitOps, dbt tests, and SLOs in Prometheus; automate compaction and optimize to kill small files and cost bleed.
- Prove value with numbers: reduce MTTR, raise quality pass rates, cut query cost/latency, and accelerate time-to-insight.
Implementation checklist
- Pick one table format that gives you ACID + time travel: Delta, Iceberg, or Hudi. Standardize.
- Stand up a proper catalog (Unity/Glue/Iceberg REST/Nessie) and require table registration.
- Define data contracts in `yaml`. Reject non-conformant payloads at the edge.
- Add a Bronze/Silver/Gold layout with quality gates between layers.
- Instrument SLIs: freshness, completeness, accuracy, and schema drift; alert on SLO breaches.
- Automate compaction/OPTIMIZE and enforce target file sizes (128–512 MB).
- Adopt GitOps: infra in Terraform, pipelines in code, promotion via PRs only.
- Publish a cost dashboard: $/TB scanned, query latency, small-file ratio, storage growth.
Questions we hear from teams
- Delta vs. Iceberg vs. Hudi—what should I pick?
- If you’re on Databricks and want simplicity, Delta is great. If you need multi-engine interoperability (Trino, Athena, Spark, Snowflake external tables), Iceberg is the safe bet. If your workload is heavy on CDC upserts and incremental merges, Hudi shines. Pick one and standardize.
- When do I need streaming instead of batch?
- If business value depends on freshness under ~30 minutes (fraud scoring, near-real-time ops), go streaming. Otherwise, high-frequency micro-batch often wins on simplicity and cost. Avoid cargo-cult streaming for daily KPIs.
- How do I handle GDPR deletion with immutable logs?
- Use table-level deletes in your lakehouse format (DELETE FROM …) and expire snapshots/compaction to physically remove data within policy. Encrypt and restrict raw Bronze access; propagate deletes through Silver/Gold via CDC or change tables.
- Do I still need a warehouse if I have a lakehouse?
- Often yes, for semantics, governance, and BI concurrency. Many teams run a warehouse (Snowflake/BigQuery) over curated Gold or use Trino/Databricks SQL. The trick is consistent tables and one source of truth in your catalog.
- What about AI-generated pipelines—safe or not?
- AI can draft boilerplate Spark/dbt code fast, but we always run a **vibe code cleanup**: enforce the contract, add tests, set file sizes, and wire SLOs. Otherwise you get cute demos and nasty 2am incidents.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
