Stop Hoarding, Start Shipping: A Scalable Data Lake Playbook for Reliability and ROI

What we deploy when your data volume 10x’s and the business still expects yesterday’s dashboards to load.

Volume doesn’t break data lakes—undisciplined formats, tiny files, and missing contracts do.

The week the data lake broke

I’ve seen the movie. Marketing launches a successful campaign, traffic 5x’s, and your “lake” on s3://data/landing becomes a swamp. Airflow backfills thrash EC2, dbt jobs time out, and the CFO asks why the revenue dashboard is three hours late on quarter close. A client called us after their Spark 3.2 jobs started OOM’ing because a single hour of event data created 400k tiny files. Trino queries that used to run in 30 seconds were taking 15 minutes. No one changed a line of code; volume did.

When volume explodes, reliability debt comes due.

This is where data lakes either grow up—or bleed money. Here’s what we deploy at GitPlumbers when the business needs yesterday’s metrics today, even as data volume takes the elevator.

What actually scales (and what doesn’t)

Let’s skip the hype:

  • Table formats matter: Iceberg, Delta, Hudi are not interchangeable. For multi-engine lakes (Spark/Flink/Trino/Snowflake) and long-term evolution, Iceberg’s snapshot isolation and metadata scaling have been more predictable for us.
  • Object stores are not file systems: S3/ADLS/GCS make listing and small files your hidden tax. You must optimize layout and compaction.
  • Catalogs are the backbone: Pick one (Glue, Nessie, Unity Catalog) and standardize. Schema evolution without a consistent catalog is how you strand tables.
  • Reliability isn’t “we have retries”: You need SLOs, contracts, and DQ gates that block bad data from leaking into finance.
  • Compute separation is a gift—use it: Mix Spark 3.5 for heavy writes, Flink 1.18 for streaming upserts, and Trino 430 for BI without stepping on each other.

What doesn’t scale: directory-per-day partitions with unbounded small files, write-only pipelines without compaction, and treating “schema on read” as a license to skip contracts.

The reference architecture that holds up at 10x

The stack we roll out when the house is on fire:

  • Storage: S3 or ADLS with bucket policies; tier cold data to IA/Archive.
  • Table format: Apache Iceberg v1 tables (v2 for row-level deletes when required). Catalog via AWS Glue or Project Nessie.
  • Ingestion: Kafka 3.6 / Redpanda with Debezium for CDC. Batch via Spark or object copy.
  • Compute: Spark 3.5 (EMR/Databricks) for batch, Flink 1.18 for streaming upserts + compaction, Trino 430 for interactive/BI.
  • Transform: dbt Core 1.7 on Trino/Spark for curated marts.
  • Orchestration: Airflow 2.9 or Dagster with OpenLineage enabled; lineage to Marquez.
  • Quality: Great Expectations or Soda Core; Deequ if you prefer JVM.
  • Observability: Emit metrics to Prometheus; alert in Grafana. Data incident paging like SRE.
  • Governance: Tags with Apache Atlas; access via AWS Lake Formation / Apache Ranger / Unity Catalog.

A minimal Iceberg table with sane defaults:

CREATE TABLE analytics.page_views (
  user_id BIGINT,
  page STRING,
  ts TIMESTAMP,
  country STRING
)
USING iceberg
PARTITIONED BY (bucket(64, user_id), hours(ts))
TBLPROPERTIES (
  'write.distribution-mode'='hash',
  'write.target-file-size-bytes'='134217728',  -- 128MB
  'commit.manifest.target-size-bytes'='8388608',
  'format-version'='2'  -- enable row-level deletes if you need GDPR erasure
);

Why this matters:

  • Bucketing user_id plus hourly partitioning balances file sizes and reduces skew.
  • 128MB target files keep S3 listings reasonable and Trino scans fast (you can verify what the writers actually produce from the files metadata table, as sketched below).
  • Format v2 unlocks privacy-compliant deletes without re-writing the world.
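
To confirm the layout is landing where you want it, query Iceberg's files metadata table. A minimal Spark SQL sketch against the table above (the 32MB "small file" threshold is illustrative):

-- Average data file size plus a count of small-file stragglers
SELECT
  count(*)                                    AS data_files,
  round(avg(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,
  sum(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS files_under_32mb
FROM analytics.page_views.files;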

Reliability as a product: SLOs, contracts, and gates

If the lake powers revenue reports, treat it like a product.

  • Define SLOs you can defend:
    • Freshness: T+15m for operational metrics, T+2h for finance aggregates.
    • Completeness: >= 99.5% of expected records per hour.
    • Accuracy: contract-based checks on critical dimensions (e.g., non-null order_id, valid currency).
  • Publish data contracts using Avro or Protobuf in a Schema Registry (Confluent or Redpanda). Version them and require producer CI to validate changes.
  • Gate pipelines with DQ tests before publishing to curated zones; a minimal freshness gate is sketched right after this list.
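
Freshness gates don't need a framework to start. A minimal sketch in Trino-flavored SQL against the page_views table above, wired into the orchestrator to fail the publish task on BREACH (the 15-minute threshold mirrors the operational SLO; the query itself is illustrative):

-- Fail the publish step when curated data breaches the 15-minute freshness SLO
SELECT
  max(ts) AS latest_event,
  CASE WHEN max(ts) < localtimestamp - INTERVAL '15' MINUTE
       THEN 'BREACH' ELSE 'OK' END AS freshness_status
FROM analytics.page_views;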

Example: simple Great Expectations suite for a curated Iceberg table:

# checkpoints/page_views.json
{
  "data_asset_name": "iceberg.analytics.page_views",
  "validations": [
    {"expectation_suite_name": "page_views_curated"}
  ]
}
# expectations/page_views_curated.json
{
  "expectations": [
    {"expectation_type": "expect_table_row_count_to_be_between", "kwargs": {"min_value": 100000}},
    {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "user_id"}},
    {"expectation_type": "expect_column_values_to_match_regex", "kwargs": {"column": "country", "regex": "^[A-Z]{2}$"}}
  ]
}

Wire lineage so incidents aren’t scavenger hunts:

AIRFLOW__LINEAGE__BACKEND=openlineage.lineage_backend.OpenLineageBackend
OPENLINEAGE_URL=http://marquez:5000
OPENLINEAGE_NAMESPACE=prod-data

Now your on-call can answer “what broke and who’s downstream?” in minutes, not hours.

Kill small files, fix partitions, cut costs

Nine times out of ten, reliability issues hide inside file layout. Here’s what actually works:

  1. Write bigger files
    • Spark write tuning:
spark.sql.files.maxRecordsPerFile=500000
spark.sql.shuffle.partitions=512
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
  • Iceberg catalog and I/O config per Spark job (Glue catalog + S3FileIO):
--conf spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.glue.warehouse=s3://your-warehouse/ \
--conf spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.iceberg.handle-timestamp-without-timezone=true
  2. Schedule compaction
    • Iceberg table service:
CALL glue.system.rewrite_data_files(
  table => 'analytics.page_views',
  strategy => 'binpack',
  options => map('min-input-files','50','max-file-size-bytes','134217728')
);
  • Or continuous compaction with Flink:
'write.upsert.enabled' = 'true',
'compaction.trigger.strategy' = 'num_or_size',
'compaction.target-file-size' = '134217728'
  3. Partition with intent
    • Avoid date-only partitions for high-velocity topics; use hours(ts) + buckets on high-cardinality keys.
    • For point lookups, add bucket(32, id); for range scans, avoid over-bucketing.
  4. Prune and cluster
    • Iceberg’s sort-order helps Trino/Spark prune:
ALTER TABLE analytics.page_views WRITE ORDERED BY (ts, country);
  5. Tier storage
    • Lifecycle policies: move snapshots older than 30 days to S3 IA and manifests older than 90 days to Glacier. Keep metadata compaction enabled. A snapshot-expiration sketch follows this list.
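
Snapshot and orphan-file hygiene is what keeps the metadata tier from growing forever. A minimal sketch using Iceberg's Spark maintenance procedures (the cutoff timestamp is a placeholder for "now minus your retention window"):

-- Expire old snapshots but keep the last few for time travel and rollback
CALL glue.system.expire_snapshots(
  table => 'analytics.page_views',
  older_than => TIMESTAMP '2024-01-01 00:00:00',  -- placeholder: now minus 30 days
  retain_last => 5
);
-- Remove data files no longer referenced by any snapshot
CALL glue.system.remove_orphan_files(table => 'analytics.page_views');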

We’ve cut Trino query latency 35–50% and S3 read costs 20–40% with these steps—consistently.

CDC and backfills without tears

CDC is where a lake either graduates into a platform or stalls as a one-off project. The pattern that survives on-call:

  • Source: Debezium streams MySQL/Postgres changes to Kafka with Avro schemas and a Schema Registry.
  • Sink: Flink 1.18 upserts to Iceberg tables with primary keys. This gives idempotency and correct late-arrival handling.
  • Schema evolution: Require producers to bump versions; reject breaking changes. Make consumers tolerant: add-only by default (see the add-only DDL sketch after this list).
  • Backfills: Batch load historical data to a staging Iceberg table, then MERGE into the main table using the same keys as CDC.
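
On the lake side, add-only evolution is a one-line DDL thanks to Iceberg's column IDs. A minimal Spark SQL sketch against the page_views table from earlier (the referrer column is illustrative):

-- Additive change: safe for existing readers and downstream models
ALTER TABLE analytics.page_views ADD COLUMNS (referrer STRING);
-- Renames and type narrowing are the breaking changes your contracts should reject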

Flink SQL sink example:

CREATE TABLE ods_orders (
  order_id BIGINT,
  status STRING,
  amount DECIMAL(12,2),
  updated_at TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'='iceberg',
  'catalog-name'='glue',
  'catalog-type'='hive',
  'warehouse'='s3://warehouse/',
  'format-version'='2',
  'write.upsert.enabled'='true',
  'write.target-file-size-bytes'='134217728'
);

Backfill via Spark using the same keys:

MERGE INTO glue.analytics.ods_orders t
USING glue.staging.ods_orders_backfill s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

If your CDC and backfills share keys and ordering, you avoid the “last writer wins” roulette.

Results we’ve delivered (and how fast)

Recent client outcomes after this playbook:

  • A global retailer moving from Hive tables to Iceberg on S3:
    • Reduced small files by 93% (avg file 12MB -> 162MB) using Flink compaction and scheduled rewrite_data_files.
    • Trino P95 query latency: 11m -> 3m 50s (65% faster) on core dashboards.
    • Data freshness: T+45m -> T+8m for orders; on-call MTTR for data incidents: 2h+ -> 18m with lineage and SLO alerts.
    • S3 read cost down 37% month-over-month; storage growth flat due to lifecycle policies.
  • A fintech streaming CDC from Postgres:
    • Contracted producers via Protobuf and a schema registry; breaking changes dropped to near-zero.
    • Finance close hit T+15m freshness SLO for the first time in company history.

It wasn’t magic. It was table format discipline, contracts, and file hygiene.

A 90-day rollout plan you can defend

  1. Days 1–15: Prove the spine
    • Pick Iceberg + catalog; stand up a non-prod warehouse bucket and catalog.
    • Migrate a single high-value table (read-heavy) from Hive/Parquet to Iceberg; enable compaction (a migration sketch follows this plan).
    • Add OpenLineage to orchestration; emit lineage to Marquez.
    • Define freshness/completeness SLOs and publish them.
  2. Days 16–45: Standardize ingestion and quality
    • Implement Debezium -> Kafka -> Flink upsert for one operational table.
    • Add Great Expectations checks, block publish on failure.
    • Convert 3–5 core BI models to dbt on Trino; add tests and docs.
  3. Days 46–90: Scale and cut cost
    • Roll compaction schedules to top 20 tables; enforce write options in CI for Spark jobs.
    • Implement storage lifecycle policies; measure cost per query and P95 latency.
    • Tag PII in Atlas and wire access controls via Lake Formation or Ranger.
    • Document backfill process and run one historical load end-to-end.
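
For the day 1-15 table migration, Iceberg's Spark procedures let you validate side by side before committing. A minimal sketch (legacy.page_views and the target name are illustrative):

-- snapshot: creates an Iceberg table over the existing data files; the Hive table stays untouched
CALL glue.system.snapshot(source_table => 'legacy.page_views', table => 'analytics.page_views_iceberg');
-- migrate: replaces the Hive table in place once reads are validated
CALL glue.system.migrate('legacy.page_views');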

If you can’t demonstrate better freshness, fewer incidents, and lower cost by day 90, change the plan or the team.


If you want a partner who’s cleaned up these messes before, GitPlumbers has done this in retail, fintech, and logistics. We’ll help you land Iceberg, lock in reliability, and stop paying the small-files tax without pausing delivery.


Key takeaways

  • Pick a table format first (we prefer Iceberg for large multi-engine lakes) and enforce it everywhere.
  • Define SLOs and data contracts before pipelines—then gate deployments with tests and lineage.
  • Solve small files early with write tuning and compaction; it crushes cost and latency.
  • Standardize CDC ingestion with idempotent upserts and versioned schemas; backfills must be boring.
  • Measure outcomes: freshness, data quality issue rate, cost per query, and MTTR for data incidents.

Implementation checklist

  • Choose `Iceberg` + a consistent catalog (`Glue` or `Nessie`) and lock the decision.
  • Publish SLOs: freshness, completeness, and accuracy; wire them into alerts and dashboards.
  • Implement data contracts with `Avro/Protobuf` + `Schema Registry`; enforce validation in ingestion.
  • Add DQ checks (`Great Expectations`/`Soda`) and block downstream if they fail.
  • Fix small files: tune `Spark` writes, enable Iceberg compaction, and run scheduled `rewrite_data_files`.
  • Adopt `OpenLineage` with Marquez for lineage; alert on broken dependencies.
  • Automate infra with `Terraform` and deployments with `GitOps` (`ArgoCD` or CI).
  • Pilot CDC (`Debezium` -> `Kafka` -> `Flink/Spark` upsert to Iceberg); prove idempotency and backfills.
  • Tag PII with `Atlas` and gate access via `Lake Formation`/`Ranger`/`Unity Catalog`.
  • Track metrics that matter: cost/query, P95 read latency, DQ failure rate, freshness SLOs.

Questions we hear from teams

Iceberg vs Delta vs Hudi—what should I pick?
If you need multi-engine access (Spark, Flink, Trino, Snowflake), Iceberg has been the safest long-term bet for us due to snapshot isolation, metadata scaling, and broad engine support. Delta is great inside Databricks, especially for DLT/Unity workflows. Hudi shines for streaming upserts with record-level indexing but can be trickier across engines. Pick one, standardize, and don’t mix formats casually.
Do I really need a catalog?
Yes. Glue, Nessie, or Unity Catalog anchors schemas, snapshots, and permissions. Without a consistent catalog, you’ll strand tables, botch schema evolution, and make lineage unreliable. We typically use Glue on AWS and Nessie when we want Git-like catalog versioning.
How do I enforce data contracts in a lake?
Define Avro/Protobuf schemas and publish to a Schema Registry. Producers validate in CI; consumers enforce compatibility at read and write. Add DQ gates (Great Expectations/Soda) before promoting to curated layers. Version everything and treat breaking changes like API changes—because they are.
What about GDPR/CCPA deletes on object storage?
Use Iceberg v2 row-level deletes or position deletes to surgically remove records without rewriting all data. Keep a deletion log, validate with DQ checks, and schedule metadata compaction. Test on a subset before enabling globally.
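A minimal sketch of the erasure path on a v2 table, in Spark SQL (compliance.erasure_requests is an illustrative table of pending requests):

-- With merge-on-read deletes (write.delete.mode='merge-on-read'), this writes delete files
-- instead of rewriting whole partitions
DELETE FROM glue.analytics.page_views
WHERE user_id IN (SELECT user_id FROM compliance.erasure_requests);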
We’re a Snowflake shop—does this still apply?
Yes. You can use Snowflake’s Iceberg Tables to query your lake data with Snowflake compute, or continue to curate marts in Snowflake while using Iceberg for raw/bronze/silver layers. The reliability practices—SLOs, contracts, compaction—still pay dividends.
