Your Data Lake Didn’t “Scale” — It Just Got Slower and More Expensive
A scalable data lake is boring on purpose: predictable reliability, enforceable quality, and costs you can explain in a QBR. Here’s the architecture pattern that actually holds up when volumes 10x.
The moment your “data lake” becomes a liability
I’ve watched this movie too many times: the lake starts as a scrappy S3 bucket with “raw” parquet, a couple of Spark jobs, and a BI dashboard everyone loves. Then data volumes 10x, the number of producers doubles, and suddenly:
- Finance asks why storage is cheap but query costs are exploding
- Analysts complain the “gold” tables are wrong (again)
- Engineers can’t answer basic questions like “what changed?” or “who owns this dataset?”
- A single bad backfill turns into a week-long incident with a 40-message Slack thread
That’s not a scaling problem. That’s an architecture and operating model problem.
A lake that scales is deliberately boring: repeatable ingestion, enforceable schema, predictable performance, and visible reliability. Let’s talk about what actually works.
The architecture pattern that survives 10x growth
If you remember the bad old days of Hive-on-HDFS, you already know the core lesson: unmanaged files become a junk drawer. Modern lakes scale when you treat them like table systems, not file dumps.
At a high level, the pattern looks like this:
- Object storage: `S3` / `ADLS Gen2` / `GCS` for durability and low-cost storage
- Open table format: `Apache Iceberg` (or `Delta Lake` / `Apache Hudi`) to get ACID-ish behavior, schema evolution, snapshots, and time travel
- Catalog: `AWS Glue Data Catalog`, `Unity Catalog`, or `Hive Metastore` to make tables discoverable and governable
- Decoupled compute:
  - `Spark` (batch/stream transforms, heavy lifting)
  - `Trino`/`Presto` (fast SQL for BI and ad-hoc)
  - Optional warehouse integration: `Snowflake` external tables or lakehouse engines
- Orchestration + CI: `Airflow`/`Dagster` + `dbt` with tests and deployment controls
The outcomes you’re aiming for are measurable:
- Freshness SLO (e.g., 95% of partitions available within 30 minutes)
- Incident MTTR drops (hours → minutes) because you can pinpoint which job and which change caused the regression
- Cost per TB scanned decreases because you stop forcing BI tools to read 10M tiny files
Reliability is an SLO, not a hope and a prayer
Most “data lake reliability” problems are self-inflicted:
- No clear ownership per dataset
- No explicit SLOs, so everything is “important” but nothing is monitored
- Pipelines that silently succeed while producing garbage
Here’s what we do at GitPlumbers when teams want reliability without turning the data platform into a science project:
- Define dataset-level SLOs (not platform-level hand-waving)
- Measure them automatically
- Alert like you mean it (and stop alerting on noise)
A simple SLO spec can live next to the transformation code:
```yaml
# slo.yaml
datasets:
  - name: finance.fact_invoices
    freshness:
      objective: "p95 < 30m"
      alert_after: "45m"
    completeness:
      objective: "missing_partitions_per_day == 0"
    availability:
      objective: "query_error_rate < 0.5%"
```

Then wire it into your observability. If you’re on open tooling, OpenLineage + Marquez gives you lineage events you can correlate with failures. If you’re in Datadog/New Relic/CloudWatch, ship job metrics and table stats.
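If you ship to CloudWatch, a minimal sketch of publishing dataset freshness could look like the snippet below. The namespace, metric name, and the `latest_partition_ts` argument (however you derive the newest partition's landing time) are assumptions, not part of the SLO spec above.

```python
# Minimal sketch (assumes boto3 credentials and some way to look up the newest
# partition's landing time for a dataset, e.g. from your catalog or an audit table).
from datetime import datetime, timezone

import boto3


def publish_freshness(dataset: str, latest_partition_ts: datetime) -> None:
    """Publish freshness lag in minutes so you can alarm on the SLO thresholds."""
    lag_minutes = (datetime.now(timezone.utc) - latest_partition_ts).total_seconds() / 60
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataPlatform/SLO",  # hypothetical namespace
        MetricData=[{
            "MetricName": "freshness_lag_minutes",
            "Dimensions": [{"Name": "dataset", "Value": dataset}],
            "Value": lag_minutes,
        }],
    )
```

Alarm on `freshness_lag_minutes` against the 30m objective and 45m alert threshold from `slo.yaml`.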
What changes when you do this:
- You stop debating “is data late?” and start answering it with p95 freshness
- Incidents become actionable: “`job_transform_invoices` failed on `2025-12-14` after deployment `abc123`”
- Your execs stop hearing “data is flaky” and start seeing reliability trend lines
Quality at scale: contracts + gates (not manual spot checks)
I’ve seen teams spend millions on a lake and still run their quality process as “ask Sarah if the dashboard looks weird.” That doesn’t scale.
You need two layers:
- Data contracts between producers and the lake (schema, keys, semantic expectations); a minimal contract sketch follows this list
- Quality gates that fail fast before bad data pollutes curated tables
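There’s no single standard format for the contract layer; a minimal sketch of what a producer and the platform might agree on is below. The file name, field names, and policy text are illustrative, not a specific tool’s schema.

```yaml
# contracts/finance_invoices.yaml (illustrative, not a specific tool's format)
dataset: finance.invoices
owner: team-billing@acme.example
schema:
  - {name: invoice_id,   type: string,        required: true, unique: true}
  - {name: invoice_date, type: date,          required: true}
  - {name: amount,       type: decimal(18,2), required: true, min: 0}
semantics:
  - "amount is the gross invoice total in USD"
  - "late-arriving invoices may appear up to 7 days after invoice_date"
breaking_change_policy: "30-day notice plus dual-publish during migration"
```

The point is that expectations are versioned and reviewable in the producer’s repo, not tribal knowledge.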
For teams already using dbt, start with tests that catch 80% of real-world breakages:
```yaml
# models/finance/schema.yml
version: 2
models:
  - name: fact_invoices
    columns:
      - name: invoice_id
        tests:
          - not_null
          - unique
      - name: invoice_date
        tests:
          - not_null
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

If you need richer expectations (distribution shifts, regex checks, referential integrity across domains), add Great Expectations at ingestion boundaries:
```python
# great_expectations checkpoint snippet
checkpoint_config = {
    "name": "raw_invoices_checkpoint",
    "validations": [{
        "batch_request": {"datasource_name": "raw", "data_asset_name": "invoices"},
        "expectation_suite_name": "raw_invoices_suite"
    }],
    "action_list": [{
        "name": "store_validation_result",
        "action": {"class_name": "StoreValidationResultAction"}
    }]
}
```

Two practical rules I wish more teams followed:
- Quarantine, don’t overwrite: route failing partitions to `quarantine/` with metadata about why they failed (a minimal sketch follows this list)
- Block promotion: raw can be messy; curated cannot. Promotion to “gold” must be gated
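A minimal PySpark sketch of the quarantine rule is below; the staging and quarantine bucket names are hypothetical, and the inline checks stand in for whatever your Great Expectations or `dbt` results actually report.

```python
# Minimal sketch: keep failing rows, but keep them out of the curated path.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.parquet("s3://acme-data-raw/finance/invoices/dt=2025-12-14/")

# Stand-in for your real validation results (same constraints the contract declares).
checked = raw.withColumn(
    "quality_ok",
    F.col("invoice_id").isNotNull() & F.col("amount").isNotNull() & (F.col("amount") >= 0),
)

passed = checked.filter(F.col("quality_ok")).drop("quality_ok")
failed = (
    checked.filter(~F.col("quality_ok"))
    .withColumn("quarantined_at", F.current_timestamp())
    .withColumn("failed_reason", F.lit("not_null/accepted_range violation"))
)

# Hypothetical bucket names; only `passed` is eligible for promotion to curated.
passed.write.mode("append").parquet("s3://acme-data-staged/finance/invoices/dt=2025-12-14/")
failed.write.mode("append").parquet("s3://acme-data-quarantine/finance/invoices/dt=2025-12-14/")
```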
Measurable outcomes we commonly see:
- 50–80% reduction in “silent wrong” incidents
- Faster RCA because you can answer “what constraint failed?” instead of eyeballing rows
Performance and cost: the unsexy stuff that makes the lake usable
When a lake slows down, it’s rarely because “Spark is slow.” It’s because the lake is full of:
- Tiny files (the classic “small files problem”; a quick diagnostic follows this list)
- Over-partitioning (`dt=.../hour=.../minute=...` because someone read a blog once)
- No compaction, no clustering, no stats
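If you’re already on Iceberg (recommended next), its `files` metadata table makes the first problem easy to confirm; a rough Spark SQL diagnostic, using the same example table as the compaction job below, might look like:

```sql
-- Rough sketch: find partitions drowning in tiny files via Iceberg's files metadata table.
SELECT
  partition,
  COUNT(*)                                        AS file_count,
  ROUND(AVG(file_size_in_bytes) / 1024 / 1024, 1) AS avg_file_mb
FROM analytics.finance.fact_invoices.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;
```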
If you standardize on Iceberg, you get sane primitives: snapshots, manifest files, partition evolution, and maintenance procedures.
A simple Iceberg compaction/optimization job (Spark) looks like:
```sql
-- Spark SQL with Iceberg
CALL catalog.system.rewrite_data_files(
  table => 'analytics.finance.fact_invoices',
  options => map('target-file-size-bytes', '268435456')
);

CALL catalog.system.rewrite_manifests('analytics.finance.fact_invoices');
```

Operationally, the target is boring:
- Data files: 128–512MB
- Partitioning: aligned to the top 2–3 query predicates (usually `date` + one business key)
- Maintenance cadence: daily compaction for hot tables, weekly for colder tables (snapshot expiration and orphan-file cleanup sketched below)
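For that maintenance cadence, compaction alone isn’t enough: snapshot expiration and orphan-file cleanup keep metadata and storage from growing without bound. A sketch using Iceberg’s built-in Spark procedures (the retention window and count are illustrative):

```sql
-- Expire old snapshots (keep enough history for time travel and rollback).
CALL catalog.system.expire_snapshots(
  table => 'analytics.finance.fact_invoices',
  older_than => TIMESTAMP '2025-12-07 00:00:00',
  retain_last => 10
);

-- Remove files no snapshot references anymore (e.g. leftovers from failed writes).
CALL catalog.system.remove_orphan_files(table => 'analytics.finance.fact_invoices');
```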
For BI/SQL engines like Trino, make sure you’re not accidentally DDoS’ing your own lake. A minimal example of workload isolation:
```properties
# trino config (conceptual)
query.max-memory=20GB
query.max-total-memory=40GB
resource-groups.configuration-manager=file
```

Then define separate resource groups (a sketch follows this list) for:
- Scheduled transforms
- Interactive BI
- Ad-hoc exploration
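A sketch of the matching `etc/resource-groups.json`; the group names, limits, and user/source selectors are assumptions you’d tune to your own workloads:

```json
{
  "rootGroups": [
    { "name": "transforms", "softMemoryLimit": "50%", "hardConcurrencyLimit": 10, "maxQueued": 100 },
    { "name": "bi",         "softMemoryLimit": "30%", "hardConcurrencyLimit": 20, "maxQueued": 200 },
    { "name": "adhoc",      "softMemoryLimit": "20%", "hardConcurrencyLimit": 5,  "maxQueued": 50 }
  ],
  "selectors": [
    { "user": "airflow|dbt",      "group": "transforms" },
    { "source": "looker|tableau", "group": "bi" },
    { "group": "adhoc" }
  ]
}
```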
This is where your FinOps story gets real: you can attribute cost per workload class and stop blaming “the lake” for one team’s runaway queries.
A concrete reference implementation (AWS example) that scales
Here’s a pattern we’ve implemented repeatedly on AWS when teams want open formats and vendor flexibility:
- Storage: `S3` buckets per zone (`raw`, `curated`, `sandbox`) with lifecycle policies
- Catalog: `AWS Glue Data Catalog`
- Table format: `Apache Iceberg`
- Compute: `EMR Serverless` or `EKS` Spark for transforms; `Trino` on `EKS` for SQL
- Orchestration: `Airflow` (MWAA) or `Dagster`
- Transform layer: `dbt` (with `dbt-trino` or `dbt-spark`)
Lock down the bucket correctly (yes, this matters more as you scale):
```hcl
# terraform-ish example
resource "aws_s3_bucket" "curated" {
  bucket = "acme-data-curated"
}

resource "aws_s3_bucket_public_access_block" "curated" {
  bucket = aws_s3_bucket.curated.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

A simple Airflow DAG pattern that enforces “quality before promotion”:
```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

dag = DAG("finance_invoices", start_date=datetime(2025, 1, 1), schedule="@hourly", catchup=False)

ingest = BashOperator(
    task_id="ingest_raw",
    bash_command="python ingest_invoices.py --out s3://acme-data-raw/finance/invoices/",
    dag=dag,
)

quality = BashOperator(
    task_id="quality_gate",
    bash_command="great_expectations checkpoint run raw_invoices_checkpoint",
    dag=dag,
)

promote = BashOperator(
    task_id="transform_to_curated",
    bash_command="dbt run -s finance.fact_invoices && dbt test -s finance.fact_invoices",
    dag=dag,
)

ingest >> quality >> promote
```

What this buys you in practice:
- Reproducibility: you can rebuild curated tables from raw using snapshots/time windows (time travel sketched after this list)
- Isolation: ingestion failures don’t corrupt curated
- Auditable changes: schema evolution happens through versioned code and table metadata
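Reproducibility in practice means Iceberg time travel: read the table as of a timestamp or snapshot, or roll back after a bad backfill. A Spark SQL sketch (the timestamp and snapshot ID are placeholders, and the session catalog is assumed to be the default):

```sql
-- Read curated data as of a point in time or a specific snapshot
SELECT * FROM analytics.finance.fact_invoices TIMESTAMP AS OF '2025-12-14 00:00:00';
SELECT * FROM analytics.finance.fact_invoices VERSION AS OF 8109744798576441000;

-- Roll the table back to a known-good snapshot after a bad backfill
CALL catalog.system.rollback_to_snapshot('analytics.finance.fact_invoices', 8109744798576441000);
```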
What to measure (and what “success” looks like in 90 days)
If you can’t measure it, you can’t defend it when budgets tighten.
The scorecard that’s worked best for leaders I’ve partnered with:
- Freshness: p50/p95 lag per critical dataset, in minutes (a sample query follows this list)
- Quality: failed expectations/tests per day, plus top offenders by source
- Reliability: incident count + MTTR for data pipeline incidents
- Cost:
  - $/TB stored (easy)
  - $/TB scanned by query engine (this is where the savings are)
  - compute hours by workload class (ingest/transform/BI)
- Business value delivery:
  - cycle time from source onboarding → first usable curated table (days)
  - number of trusted datasets adopted by downstream teams
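For the freshness line, here’s a sketch of the query behind the number; `ops.dataset_landing_log` is a hypothetical audit table your ingestion jobs append to (dataset, expected_at, landed_at), and the SQL is Trino-flavored:

```sql
-- Rough sketch: daily p50/p95 freshness lag per dataset from a hypothetical audit table.
SELECT
  dataset,
  date_trunc('day', expected_at)                                       AS day,
  approx_percentile(date_diff('minute', expected_at, landed_at), 0.50) AS p50_lag_min,
  approx_percentile(date_diff('minute', expected_at, landed_at), 0.95) AS p95_lag_min
FROM ops.dataset_landing_log
WHERE expected_at >= current_date - INTERVAL '30' DAY
GROUP BY 1, 2
ORDER BY 1, 2;
```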
In a realistic 90-day push (not a fantasy “platform rewrite”), the teams that execute this pattern typically see:
- 30–60% reduction in BI query spend after compaction + sane partitioning
- MTTR down from “half a day of archaeology” to under an hour with lineage + SLO alerting
- Fewer escalations because quality gates catch issues before dashboards do
A scalable data lake isn’t the one with the most tools. It’s the one that keeps its promises when the org gets bigger and messier.
If you’re staring at a lake that’s getting slower, costlier, and less trusted as it grows, GitPlumbers can help you stabilize it without a boil-the-ocean migration. We usually start with a short reliability and performance assessment, then implement the highest-leverage fixes (table format, compaction/partitioning, quality gates, and SLOs) in weeks—not quarters.
Key takeaways
- A “scalable” lake is a **table lake**: pick `Iceberg`/`Delta`/`Hudi`, not raw parquet sprawl.
- Reliability is a product feature: define **freshness/availability SLOs**, page on violations, and track MTTR.
- Quality at scale requires **contracts + gates** (e.g., `dbt` tests, Great Expectations) before data hits curated zones.
- Performance and cost hinge on **file sizing, compaction, partition strategy, and workload isolation** (separate compute for ingest vs BI).
- Governance that works uses **a catalog + least-privilege + row/column controls**, not tribal knowledge and wiki pages.
Implementation checklist
- Standardize on one table format (`Iceberg`, `Delta Lake`, or `Hudi`) for curated datasets
- Adopt a catalog (`Glue`, `Unity Catalog`, `Hive Metastore`) and treat it as production infrastructure
- Define dataset SLOs: freshness, completeness, and query availability
- Implement automated quality gates (`dbt` tests, Great Expectations/Deequ) in CI/CD and in pipelines
- Add lineage + observability (`OpenLineage`, `Marquez`, or vendor equivalent) and alert on regressions
- Enforce file size + compaction (target 128–512MB data files) and schedule maintenance jobs
- Partition for access patterns, not for ego; revisit quarterly with real query stats
- Separate compute by workload class (ingest, transform, ad-hoc, BI) with quotas and budgets
Questions we hear from teams
- Do we have to migrate everything to Iceberg/Delta/Hudi to scale?
- No. Start with the top 10–20 curated tables that drive most business value and cost. Convert those first, add compaction/maintenance, and enforce contracts. Leave long-tail raw data as files until it proves it deserves table treatment.
- What’s the fastest win for lake performance?
- Compaction and partition sanity. Get file sizes into the 128–512MB range, remove pathological partitions, and rerun the top BI queries. It’s common to cut query time and $/TB scanned dramatically without changing any dashboards.
- How do we stop “silent wrong” data without slowing delivery?
- Automate gates. Put `dbt test`/Great Expectations in the pipeline and quarantine failures. Teams move faster when they trust the curated layer and don’t have to manually validate every downstream report.