Your Data Lake Didn’t “Scale” — It Just Got Slower and More Expensive
A scalable data lake is boring on purpose: predictable reliability, enforceable quality, and costs you can explain in a QBR. Here’s the architecture pattern that actually holds up when volumes 10x.
The moment your “data lake” becomes a liability
I’ve watched this movie too many times: the lake starts as a scrappy S3 bucket with “raw” parquet, a couple of Spark jobs, and a BI dashboard everyone loves. Then data volumes 10x, the number of producers doubles, and suddenly:
- Finance asks why storage is cheap but query costs are exploding
- Analysts complain the “gold” tables are wrong (again)
- Engineers can’t answer basic questions like “what changed?” or “who owns this dataset?”
- A single bad backfill turns into a week-long incident with a 40-message Slack thread
That’s not a scaling problem. That’s an architecture and operating model problem.
A lake that scales is deliberately boring: repeatable ingestion, enforceable schema, predictable performance, and visible reliability. Let’s talk about what actually works.
The architecture pattern that survives 10x growth
If you remember the bad old days of Hive-on-HDFS, you already know the core lesson: unmanaged files become a junk drawer. Modern lakes scale when you treat them like table systems, not file dumps.
At a high level, the pattern looks like this:
- Object storage: `S3` / `ADLS Gen2` / `GCS` for durability and low-cost storage
- Open table format: `Apache Iceberg` (or `Delta Lake` / `Apache Hudi`) to get ACID-ish behavior, schema evolution, snapshots, and time travel
- Catalog: `AWS Glue Data Catalog`, `Unity Catalog`, or `Hive Metastore` to make tables discoverable and governable
- Decoupled compute:
  - `Spark` (batch/stream transforms, heavy lifting)
  - `Trino`/`Presto` (fast SQL for BI and ad-hoc)
  - Optional warehouse integration: `Snowflake` external tables or lakehouse engines
- Orchestration + CI: `Airflow`/`Dagster` + `dbt` with tests and deployment controls
The outcomes you’re aiming for are measurable:
- Freshness SLO (e.g., 95% of partitions available within 30 minutes)
- Incident MTTR drops (hours → minutes) because you can pinpoint which job and which change caused the regression
- Cost per TB scanned decreases because you stop forcing BI tools to read 10M tiny files
Reliability is an SLO, not a hope and a prayer
Most “data lake reliability” problems are self-inflicted:
- No clear ownership per dataset
- No explicit SLOs, so everything is “important” but nothing is monitored
- Pipelines that silently succeed while producing garbage
Here’s what we do at GitPlumbers when teams want reliability without turning the data platform into a science project:
- Define dataset-level SLOs (not platform-level hand-waving)
- Measure them automatically
- Alert like you mean it (and stop alerting on noise)
A simple SLO spec can live next to the transformation code:
```yaml
# slo.yaml
datasets:
  - name: finance.fact_invoices
    freshness:
      objective: "p95 < 30m"
      alert_after: "45m"
    completeness:
      objective: "missing_partitions_per_day == 0"
    availability:
      objective: "query_error_rate < 0.5%"
```

Then wire it into your observability. If you’re on open tooling, OpenLineage + Marquez gives you lineage events you can correlate with failures. If you’re in Datadog/New Relic/CloudWatch, ship job metrics and table stats.
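If you ship to CloudWatch, a minimal sketch of publishing dataset freshness could look like the snippet below. The namespace, metric name, and the `latest_partition_ts` argument (however you derive the newest partition's landing time) are assumptions, not part of the SLO spec above.

```python
# Minimal sketch (assumes boto3 credentials and some way to look up the newest
# partition's landing time for a dataset, e.g. from your catalog or an audit table).
from datetime import datetime, timezone

import boto3


def publish_freshness(dataset: str, latest_partition_ts: datetime) -> None:
    """Publish freshness lag in minutes so you can alarm on the SLO thresholds."""
    lag_minutes = (datetime.now(timezone.utc) - latest_partition_ts).total_seconds() / 60
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataPlatform/SLO",  # hypothetical namespace
        MetricData=[{
            "MetricName": "freshness_lag_minutes",
            "Dimensions": [{"Name": "dataset", "Value": dataset}],
            "Value": lag_minutes,
        }],
    )
```

Alarm on `freshness_lag_minutes` against the 30m objective and 45m alert threshold from `slo.yaml`.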
What changes when you do this:
- You stop debating “is data late?” and start answering it with p95 freshness
- Incidents become actionable: “`job_transform_invoices` failed on `2025-12-14` after deployment `abc123`”
- Your execs stop hearing “data is flaky” and start seeing reliability trend lines
Quality at scale: contracts + gates (not manual spot checks)
I’ve seen teams spend millions on a lake and still run their quality process as “ask Sarah if the dashboard looks weird.” That doesn’t scale.
You need two layers:
- Data contracts between producers and the lake (schema, keys, semantic expectations); a minimal contract sketch follows this list
- Quality gates that fail fast before bad data pollutes curated tables
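There’s no single standard format for the contract layer; a minimal sketch of what a producer and the platform might agree on is below. The file name, field names, and policy text are illustrative, not a specific tool’s schema.

```yaml
# contracts/finance_invoices.yaml (illustrative, not a specific tool's format)
dataset: finance.invoices
owner: team-billing@acme.example
schema:
  - {name: invoice_id,   type: string,        required: true, unique: true}
  - {name: invoice_date, type: date,          required: true}
  - {name: amount,       type: decimal(18,2), required: true, min: 0}
semantics:
  - "amount is the gross invoice total in USD"
  - "late-arriving invoices may appear up to 7 days after invoice_date"
breaking_change_policy: "30-day notice plus dual-publish during migration"
```

The point is that expectations are versioned and reviewable in the producer’s repo, not tribal knowledge.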
For teams already using dbt, start with tests that catch 80% of real-world breakages:
```yaml
# models/finance/schema.yml
version: 2
models:
  - name: fact_invoices
    columns:
      - name: invoice_id
        tests:
          - not_null
          - unique
      - name: invoice_date
        tests:
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

If you need richer expectations (distribution shifts, regex checks, referential integrity across domains), add Great Expectations at ingestion boundaries:
```python
# great_expectations checkpoint snippet
checkpoint_config = {
    "name": "raw_invoices_checkpoint",
    "validations": [{
        "batch_request": {"datasource_name": "raw", "data_asset_name": "invoices"},
        "expectation_suite_name": "raw_invoices_suite"
    }],
    "action_list": [{
        "name": "store_validation_result",
        "action": {"class_name": "StoreValidationResultAction"}
    }]
}
```

Two practical rules I wish more teams followed:
- Quarantine, don’t overwrite: route failing partitions to `quarantine/` with metadata about why they failed (a minimal sketch follows this list)
- Block promotion: raw can be messy; curated cannot. Promotion to “gold” must be gated
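A minimal PySpark sketch of the quarantine rule is below; the staging and quarantine bucket names are hypothetical, and the inline checks stand in for whatever your Great Expectations or `dbt` results actually report.

```python
# Minimal sketch: keep failing rows, but keep them out of the curated path.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("s3://acme-data-raw/finance/invoices/dt=2025-12-14/")

# Stand-in for your real validation results (same constraints the contract declares).
checked = raw.withColumn(
    "quality_ok",
    F.col("invoice_id").isNotNull() & F.col("amount").isNotNull() & (F.col("amount") >= 0),
)

passed = checked.filter(F.col("quality_ok")).drop("quality_ok")
failed = (
    checked.filter(~F.col("quality_ok"))
    .withColumn("quarantined_at", F.current_timestamp())
    .withColumn("failed_reason", F.lit("not_null/accepted_range violation"))
)

# Hypothetical bucket names; only `passed` is eligible for promotion to curated.
passed.write.mode("append").parquet("s3://acme-data-staged/finance/invoices/dt=2025-12-14/")
failed.write.mode("append").parquet("s3://acme-data-quarantine/finance/invoices/dt=2025-12-14/")
```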
Measurable outcomes we commonly see:
- 50–80% reduction in “silent wrong” incidents
- Faster RCA because you can answer “what constraint failed?” instead of eyeballing rows
Performance and cost: the unsexy stuff that makes the lake usable
When a lake slows down, it’s rarely because “Spark is slow.” It’s because the lake is full of:
- Tiny files (the classic “small files problem”; a quick diagnostic follows this list)
- Over-partitioning (`dt=.../hour=.../minute=...` because someone read a blog once)
- No compaction, no clustering, no stats
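If you’re already on Iceberg (recommended next), its `files` metadata table makes the first problem easy to confirm; a rough Spark SQL diagnostic, using the same example table as the compaction job below, might look like:

```sql
-- Rough sketch: find partitions drowning in tiny files via Iceberg's files metadata table.
SELECT
  partition,
  COUNT(*)                                        AS file_count,
  ROUND(AVG(file_size_in_bytes) / 1024 / 1024, 1) AS avg_file_mb
FROM analytics.finance.fact_invoices.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;
```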
If you standardize on Iceberg, you get sane primitives: snapshots, manifest files, partition evolution, and maintenance procedures.
A simple Iceberg compaction/optimization job (Spark) looks like:
```sql
-- Spark SQL with Iceberg
CALL catalog.system.rewrite_data_files(
  table => 'analytics.finance.fact_invoices',
  options => map('target-file-size-bytes', '268435456')
);

CALL catalog.system.rewrite_manifests('analytics.finance.fact_invoices');
```

Operationally, the target is boring:
- Data files: 128–512MB
- Partitioning: aligned to the top 2–3 query predicates (usually `date` + one business key)
- Maintenance cadence: daily compaction for hot tables, weekly for colder tables (snapshot expiration and orphan-file cleanup sketched below)
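For that maintenance cadence, compaction alone isn’t enough: snapshot expiration and orphan-file cleanup keep metadata and storage from growing without bound. A sketch using Iceberg’s built-in Spark procedures (the retention window and count are illustrative):

```sql
-- Expire old snapshots (keep enough history for time travel and rollback).
CALL catalog.system.expire_snapshots(
  table => 'analytics.finance.fact_invoices',
  older_than => TIMESTAMP '2025-12-07 00:00:00',
  retain_last => 10
);

-- Remove files no snapshot references anymore (e.g. leftovers from failed writes).
CALL catalog.system.remove_orphan_files(table => 'analytics.finance.fact_invoices');
```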
For BI/SQL engines like Trino, make sure you’re not accidentally DDoS’ing your own lake. A minimal example of workload isolation:
```properties
# trino config (conceptual)
query.max-memory=20GB
query.max-total-memory=40GB
resource-groups.configuration-manager=file
```

Then define separate resource groups (a sketch follows this list) for:
- Scheduled transforms
- Interactive BI
- Ad-hoc exploration
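A sketch of the matching `etc/resource-groups.json`; the group names, limits, and user/source selectors are assumptions you’d tune to your own workloads:

```json
{
  "rootGroups": [
    { "name": "transforms", "softMemoryLimit": "50%", "hardConcurrencyLimit": 10, "maxQueued": 100 },
    { "name": "bi",         "softMemoryLimit": "30%", "hardConcurrencyLimit": 20, "maxQueued": 200 },
    { "name": "adhoc",      "softMemoryLimit": "20%", "hardConcurrencyLimit": 5,  "maxQueued": 50 }
  ],
  "selectors": [
    { "user": "airflow|dbt",      "group": "transforms" },
    { "source": "looker|tableau", "group": "bi" },
    { "group": "adhoc" }
  ]
}
```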
This is where your FinOps story gets real: you can attribute cost per workload class and stop blaming “the lake” for one team’s runaway queries.
A concrete reference implementation (AWS example) that scales
Here’s a pattern we’ve implemented repeatedly on AWS when teams want open formats and vendor flexibility:
- Storage: `S3` buckets per zone (`raw`, `curated`, `sandbox`) with lifecycle policies
- Catalog: `AWS Glue Data Catalog`
- Table format: `Apache Iceberg`
- Compute: `EMR Serverless` or `EKS` Spark for transforms; `Trino` on `EKS` for SQL
- Orchestration: `Airflow` (MWAA) or `Dagster`
- Transform layer: `dbt` (with `dbt-trino` or `dbt-spark`)
Lock down the bucket correctly (yes, this matters more as you scale):
```hcl
# terraform-ish example
resource "aws_s3_bucket" "curated" {
  bucket = "acme-data-curated"
}

resource "aws_s3_bucket_public_access_block" "curated" {
  bucket = aws_s3_bucket.curated.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

A simple Airflow DAG pattern that enforces “quality before promotion”:
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG("finance_invoices", start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False)

ingest = BashOperator(
    task_id="ingest_raw",
    bash_command="python ingest_invoices.py --out s3://acme-data-raw/finance/invoices/",
    dag=dag,
)

quality = BashOperator(
    task_id="quality_gate",
    bash_command="great_expectations checkpoint run raw_invoices_checkpoint",
    dag=dag,
)

promote = BashOperator(
    task_id="transform_to_curated",
    bash_command="dbt run -s finance.fact_invoices && dbt test -s finance.fact_invoices",
    dag=dag,
)

ingest >> quality >> promote
```

What this buys you in practice:
- Reproducibility: you can rebuild curated tables from raw using snapshots/time windows (time travel sketched after this list)
- Isolation: ingestion failures don’t corrupt curated
- Auditable changes: schema evolution happens through versioned code and table metadata
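Reproducibility in practice means Iceberg time travel: read the table as of a timestamp or snapshot, or roll back after a bad backfill. A Spark SQL sketch (the timestamp and snapshot ID are placeholders, and the session catalog is assumed to be the default):

```sql
-- Read curated data as of a point in time or a specific snapshot
SELECT * FROM analytics.finance.fact_invoices TIMESTAMP AS OF '2025-12-14 00:00:00';
SELECT * FROM analytics.finance.fact_invoices VERSION AS OF 8109744798576441000;

-- Roll the table back to a known-good snapshot after a bad backfill
CALL catalog.system.rollback_to_snapshot('analytics.finance.fact_invoices', 8109744798576441000);
```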
What to measure (and what “success” looks like in 90 days)
If you can’t measure it, you can’t defend it when budgets tighten.
The scorecard that’s worked best for leaders I’ve partnered with:
- Freshness: p50/p95 lag per critical dataset, in minutes (a sample query follows this list)
- Quality: failed expectations/tests per day, plus top offenders by source
- Reliability: incident count + MTTR for data pipeline incidents
- Cost:
  - $/TB stored (easy)
  - $/TB scanned by query engine (this is where the savings are)
  - compute hours by workload class (ingest/transform/BI)
- Business value delivery:
  - cycle time from source onboarding → first usable curated table (days)
  - number of trusted datasets adopted by downstream teams
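For the freshness line, here’s a sketch of the query behind the number; `ops.dataset_landing_log` is a hypothetical audit table your ingestion jobs append to (dataset, expected_at, landed_at), and the SQL is Trino-flavored:

```sql
-- Rough sketch: daily p50/p95 freshness lag per dataset from a hypothetical audit table.
SELECT
  dataset,
  date_trunc('day', expected_at)                                       AS day,
  approx_percentile(date_diff('minute', expected_at, landed_at), 0.50) AS p50_lag_min,
  approx_percentile(date_diff('minute', expected_at, landed_at), 0.95) AS p95_lag_min
FROM ops.dataset_landing_log
WHERE expected_at >= current_date - INTERVAL '30' DAY
GROUP BY 1, 2
ORDER BY 1, 2;
```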
In a realistic 90-day push (not a fantasy “platform rewrite”), the teams that execute this pattern typically see:
- 30–60% reduction in BI query spend after compaction + sane partitioning
- MTTR down from “half a day of archaeology” to under an hour with lineage + SLO alerting
- Fewer escalations because quality gates catch issues before dashboards do
A scalable data lake isn’t the one with the most tools. It’s the one that keeps its promises when the org gets bigger and messier.
If you’re staring at a lake that’s getting slower, costlier, and less trusted as it grows, GitPlumbers can help you stabilize it without a boil-the-ocean migration. We usually start with a short reliability and performance assessment, then implement the highest-leverage fixes (table format, compaction/partitioning, quality gates, and SLOs) in weeks—not quarters.
Key takeaways
- A “scalable” lake is a **table lake**: pick `Iceberg`/`Delta`/`Hudi`, not raw parquet sprawl.
- Reliability is a product feature: define **freshness/availability SLOs**, page on violations, and track MTTR.
- Quality at scale requires **contracts + gates** (e.g., `dbt` tests, Great Expectations) before data hits curated zones.
- Performance and cost hinge on **file sizing, compaction, partition strategy, and workload isolation** (separate compute for ingest vs BI).
- Governance that works uses **a catalog + least-privilege + row/column controls**, not tribal knowledge and wiki pages.
Implementation checklist
- Standardize on one table format (`Iceberg`, `Delta Lake`, or `Hudi`) for curated datasets
- Adopt a catalog (`Glue`, `Unity Catalog`, `Hive Metastore`) and treat it as production infrastructure
- Define dataset SLOs: freshness, completeness, and query availability
- Implement automated quality gates (`dbt` tests, Great Expectations/Deequ) in CI/CD and in pipelines
- Add lineage + observability (`OpenLineage`, `Marquez`, or vendor equivalent) and alert on regressions
- Enforce file size + compaction (target 128–512MB data files) and schedule maintenance jobs
- Partition for access patterns, not for ego; revisit quarterly with real query stats
- Separate compute by workload class (ingest, transform, ad-hoc, BI) with quotas and budgets
Questions we hear from teams
- Do we have to migrate everything to Iceberg/Delta/Hudi to scale?
- No. Start with the top 10–20 curated tables that drive most business value and cost. Convert those first, add compaction/maintenance, and enforce contracts. Leave long-tail raw data as files until it proves it deserves table treatment.
- What’s the fastest win for lake performance?
- Compaction and partition sanity. Get file sizes into the 128–512MB range, remove pathological partitions, and rerun the top BI queries. It’s common to cut query time and $/TB scanned dramatically without changing any dashboards.
- How do we stop “silent wrong” data without slowing delivery?
- Automate gates. Put `dbt test`/Great Expectations in the pipeline and quarantine failures. Teams move faster when they trust the curated layer and don’t have to manually validate every downstream report.