Your S3 Isn’t a Data Lake: The Architecture That Survives 10x Growth Without Melting Down

I’ve watched “just dump it in S3” turn into million-dollar outages. Here’s the lake architecture that scales, stays reliable, and actually delivers business value.

Data lakes don’t fail because S3 runs out of space; they fail because teams run out of discipline.

I’ve seen more lakes drown teams than save them. The pattern is always the same: someone says “just land it in S3” and six months later analytics is running 12 hours behind, fraud models are training on ghost records, and the CFO wants to know why Athena now costs more than Snowflake. The fix isn’t another tool—it’s an architecture and operating model that respects reliability, quality, and business value.

The “infinite S3 bucket” lie

When we parachuted into a retail client last year, they had 3 PB in S3, 60k tiny Parquet files per day, and a nightly Spark job that took 9+ hours. A schema change in their POS system doubled the job time; a backfill pushed it to 18 hours. Their “data lake” was a folder convention plus ad hoc Glue crawlers. No table format, no compaction, no ownership.

What actually works is boring, repeatable, and proven:

  • Open table formats: Apache Iceberg or Delta Lake so you get ACID tables, schema evolution, and metadata scaling.
  • Domain-oriented bronze/silver/gold with explicit contracts.
  • CDC into bronze, dbt transforms into silver, curated marts in gold.
  • Automated quality gates and SLOs for freshness and correctness.
  • Compaction and partition hygiene so you don’t die by small files.
  • Governance via a real catalog and row-level ACLs.

You don’t need to be on Databricks to do this. We’ve built this on AWS S3 + Glue Catalog + EMR/Spark + Trino + Airflow. Same story on ADLS + Synapse/Spark + Purview or GCS + Dataproc + BigLake.

Architecture that scales: bronze/silver/gold on Iceberg or Delta

Pick a table format that your engines can speak. If you want portability across Spark/Trino/Athena/Snowflake, Iceberg is a safe bet. If you’re all-in on Databricks, Delta is fine. The rest of this works either way.

Folder layout that keeps humans sane and tools fast:

  • s3://company-lake/bronze/<domain>/<source>/<table>/ — raw, append-only, partitioned by ingest date.
  • s3://company-lake/silver/<domain>/<dataset>/ — cleaned, typed, deduped, partitioned by business keys.
  • s3://company-lake/gold/<domain>/<mart>/ — curated models ready for BI/ML; stable contracts.

Table catalog:

  • Use Glue Catalog or Hive Metastore for Iceberg; Unity Catalog for Delta.
  • Expose tables to Athena/Trino for ad hoc and to Spark for heavy transforms.

A minimal Terraform baseline for S3 helps avoid footguns everyone learns the hard way:

resource "aws_s3_bucket" "lake" {
  bucket = "company-lake"
}

resource "aws_s3_bucket_versioning" "lake" {
  bucket = aws_s3_bucket.lake.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id
  rule {
    id     = "bronze-retain-90d"
    status = "Enabled"
    filter {
      prefix = "bronze/"
    }
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket_policy" "deny_unencrypted" {
  bucket = aws_s3_bucket.lake.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Deny",
      Principal = "*",
      Action = ["s3:PutObject"],
      Resource = ["${aws_s3_bucket.lake.arn}/*"],
      Condition = { StringNotEquals = { "s3:x-amz-server-side-encryption" = "aws:kms" } }
    }]
  })
}

That lifecycle rule alone saved one client ~28% on storage for bronze without touching silver/gold.
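That number is easy to sanity-check with arithmetic. A rough monthly-cost model, using illustrative per-GB prices that are assumptions for the sketch (real S3 pricing varies by region and changes over time):

```python
# Rough monthly S3 storage cost under a lifecycle rule like the one above.
# Per-GB prices here are illustrative assumptions, not current AWS pricing.
STANDARD_PER_GB = 0.023
IA_PER_GB = 0.0125

def monthly_cost_gb(total_gb: float, ia_fraction: float,
                    standard_per_gb: float = STANDARD_PER_GB,
                    ia_per_gb: float = IA_PER_GB) -> float:
    """Cost when `ia_fraction` of the data has transitioned to Infrequent Access."""
    standard_gb = total_gb * (1 - ia_fraction)
    ia_gb = total_gb * ia_fraction
    return standard_gb * standard_per_gb + ia_gb * ia_per_gb

# With roughly two-thirds of bronze older than 30 days, the drop is material:
before = monthly_cost_gb(500_000, ia_fraction=0.0)   # everything in STANDARD
after = monthly_cost_gb(500_000, ia_fraction=0.66)   # lifecycle rule applied
savings_pct = 100 * (before - after) / before
```

Run your own volumes through a model like this before promising a savings number to finance.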

Make reliability boring: CDC, contracts, and schema evolution

“Full extracts” do not scale. Use CDC from OLTP and land immutable events in bronze. Debezium + Kafka Connect is reliable and cheap compared to nightly dumps.

Example Debezium config for MySQL to Kafka:

{
  "name": "mysql-inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql-prod",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "database.include.list": "sales,crm",
    "table.include.list": "sales.orders,crm.customers",
    "tombstones.on.delete": "false",
    "include.schema.changes": "true",
    "snapshot.mode": "initial_only",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "true"
  }
}

Stream bronze to Iceberg with Spark Structured Streaming. Iceberg handles schema evolution without rewriting the world, and snapshots make time travel and rollback sane.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .appName("bronze-orders-stream")
  .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue.warehouse", "s3://company-lake/")
  .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .getOrCreate())

orders = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "inventory.sales.orders")
  .load())

from pyspark.sql.functions import from_json, col

# With the unwrap SMT applied in the connector, the Kafka value is the flattened row JSON.
schema = "order_id STRING, customer_id STRING, amount DECIMAL(10,2), updated_at TIMESTAMP"
parsed = orders.select(from_json(col("value").cast("string"), schema).alias("r")).select("r.*")

query = (parsed.writeStream
  .format("iceberg")
  .outputMode("append")
  .option("checkpointLocation", "s3://company-lake/_checkpoints/bronze/orders/")
  .toTable("glue.bronze.sales_orders"))

query.awaitTermination()

Contracts: publish a gold table contract (columns, types, SLAs) and treat breaking changes like API changes. Consumers shouldn’t parse your logs to guess semantics. I’ve seen this alone cut MTTR from days to hours because teams stop arguing over “what does status mean?”
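Enforcing a contract doesn't need heavyweight tooling. A minimal sketch, with a hypothetical contract format (the dict shape and helper names are our convention for illustration, not a standard), that flags breaking changes while allowing additive ones:

```python
# Minimal schema-contract check: compare a table's actual columns/types
# against the published contract. The contract shape is a hypothetical
# convention for this sketch, not a standard format.
CONTRACT = {
    "table": "gold.sales_orders",
    "columns": {
        "order_id": "string",
        "customer_id": "string",
        "amount": "decimal(10,2)",
        "updated_at": "timestamp",
    },
}

def breaking_changes(contract: dict, actual: dict) -> list[str]:
    """Return violations: dropped columns or changed types.
    New columns are allowed, since additive changes are non-breaking."""
    problems = []
    for col, typ in contract["columns"].items():
        if col not in actual:
            problems.append(f"dropped column: {col}")
        elif actual[col] != typ:
            problems.append(f"type changed: {col} {typ} -> {actual[col]}")
    return problems

actual = {"order_id": "string", "customer_id": "string",
          "amount": "double", "updated_at": "timestamp", "channel": "string"}
# 'amount' changed type (breaking); 'channel' was added (allowed).
```

Run a check like this in CI against the live catalog and the producing team finds out before the consuming team does.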

Quality you can measure: tests, expectations, and freshness SLOs

You don’t need a research paper—just enforce a handful of checks everywhere and make failures visible by default.

  • Freshness: 99% of days, max(updated_at) < 15 minutes behind for gold.sales_orders.
  • Completeness: no nulls on business keys; no negative amounts.
  • Uniqueness: dedupe on natural keys in silver; primary keys in gold.
  • Volume deltas: detect sudden drops/spikes > 3σ from baseline.
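The volume-delta check is the one teams skip because it sounds statistical; it's about ten lines. A sketch of a 3σ detector over recent daily row counts (the 7-day minimum baseline and the sigma threshold are assumptions to tune per table):

```python
import statistics

def volume_anomaly(history: list[int], today: int, sigmas: float = 3.0) -> bool:
    """Flag today's row count if it falls outside mean +/- sigmas * stdev
    of the recent baseline. Needs a few days of history to be meaningful."""
    if len(history) < 7:
        return False  # not enough baseline; don't page anyone yet
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) > sigmas * stdev

baseline = [98_000, 101_500, 99_800, 100_200, 102_000, 97_500, 100_900]
volume_anomaly(baseline, 100_400)  # normal day -> False
volume_anomaly(baseline, 12_000)   # upstream drop -> True
```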

Add dbt tests on silver/gold models so they fail the build, not your CFO’s dashboard.

version: 2
models:
  - name: silver_sales_orders
    tests:
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: [order_id, updated_at]
    columns:
      - name: order_id
        tests:
          - not_null
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
  - name: gold_mrr
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique

For table-level data checks and freshness alerts, Soda is dead simple:

# soda/checks-sales.yml
checks for sales_orders:
  - freshness(updated_at) < 15m
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
  - row_count > 1000

Run in CI and in Airflow after loads:

soda scan -d trino -c soda/config.yml soda/checks-sales.yml

Publish SLOs on a shared dashboard (Grafana) with alerts to the owning team’s Slack/PagerDuty. Data is prod; page someone when it’s down.

Keep performance predictable: partitions, compaction, and metadata hygiene

Most “the lake is slow” tickets reduce to two issues: partitions that don’t match query patterns and the small files problem.

  • Partition by what you filter: don’t default to dt—if 90% of queries filter by customer_id, use Iceberg’s bucket(customer_id, 128) plus truncate() for high-cardinality strings.
  • Target file size: write batches to 128–512 MB parquet files. Set writer options and limit micro-batches.
  • Compact and VACUUM: schedule daily compaction on bronze and weekly on silver; vacuum old snapshots.
  • Statistics: make sure Parquet/ORC stats and Bloom filters are enabled for predicate pushdown.
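The file-size target reduces to a one-line calculation when deciding how many files a batch write should produce, e.g. to feed into a repartition before the write. A sketch (the 256 MB default follows the 128–512 MB guidance above):

```python
import math

def target_file_count(total_bytes: int, target_file_mb: int = 256) -> int:
    """How many output files a batch write should target so each file lands
    near target_file_mb, inside the 128-512 MB sweet spot for Parquet."""
    target = target_file_mb * 1024 * 1024
    return max(1, math.ceil(total_bytes / target))

# A 40 GB batch at a 256 MB target -> 160 files instead of thousands of tiny ones.
target_file_count(40 * 1024**3)
# A tiny batch still produces one file, never zero.
target_file_count(10 * 1024 * 1024)
```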

Airflow operator to run periodic Iceberg compaction with Spark:

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime

dag = DAG(
  "iceberg_compact_silver",
  start_date=datetime(2024, 1, 1),
  schedule="0 3 * * *",
  catchup=False,  # without this, Airflow backfills months of compaction runs
)

compact = SparkSubmitOperator(
  task_id="compact_sales_orders",
  dag=dag,
  application="s3://company-jobs/iceberg_compact.py",
  conf={
    "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue.warehouse": "s3://company-lake/",
    "spark.sql.catalog.glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
  },
  application_args=["--table", "glue.silver.sales_orders", "--target-file-size-mb", "256"]
)

We’ve cut Trino query times by 40–70% just by fixing partitions and running compaction. One fintech saw Athena spend drop 32% the first month after compaction and bucket(user_id, 256).

Governance without killing velocity: catalog, ACLs, and lineage

If your “catalog” is a wiki page, compliance will win and your velocity will die. Use the real stuff:

  • Catalog: Glue/Unity/Hive with Iceberg/Delta. Tag PII and critical tables.
  • Access control: Lake Formation (AWS) or Ranger (Hadoop). Enforce column/row-level security and use service principals—not humans with admin creds running ad hoc CTAS.
  • Lineage: instrument OpenLineage in Airflow/Spark and surface it in DataHub or Marquez. When gold.revenue breaks, you want to know the upstream bronze topic and which dbt model introduced the bug.

Minimum viable controls that don’t suffocate teams:

  • PII columns masked in bronze/silver; only visible in gold to allowed roles.
  • Table-level ownership metadata; on-call rotation per domain.
  • Access via groups (data-marketing, data-finance), least privilege by default.

This isn’t busywork. We’ve avoided at least two auditor escalations by showing row-level access policies in Lake Formation and lineage from gold back to a CDC topic.

Operating the lake like an SRE: observability, cost, and runbooks

You wouldn’t run Kubernetes without dashboards. Same here.

  • Metrics: pipeline success rate, MTTR, data freshness, small-file ratio, average file size, average partition size, query scan bytes, and cost per TB processed.
  • Logs/Tracing: Spark metrics to Prometheus; Airflow stats to StatsD; OpenLineage events collected and visualized in DataHub.
  • Runbooks: playbooks per domain—how to replay CDC, how to rebuild silver after a schema change, how to backfill gold, and the panic button for “turn off late-arriving updates”.
  • Budgets: set AWS Budgets for S3, Glue, EMR, and Athena with alerts. Use S3 Inventory + Athena to find hotspots (e.g., millions of 5 KB files in bronze). Rotate compression: zstd for cold, snappy for hot.

We set SLOs with error budgets on freshness. If a team burns the budget, they slow feature work and fix reliability. Sounds draconian, works in practice.
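Error budgets on freshness are just arithmetic. A minimal sketch, assuming a 99% freshness SLO measured per 5-minute window over a 30-day period (window size and period are choices, not prescriptions):

```python
def error_budget_remaining(slo: float, total_windows: int, bad_windows: int) -> float:
    """Fraction of the error budget left. A 0.99 SLO over 30 days of
    5-minute windows allows ~86 stale windows before the budget is gone."""
    budget = total_windows * (1 - slo)
    return (budget - bad_windows) / budget

WINDOWS_30D = 30 * 24 * 12  # 8640 five-minute windows in 30 days

error_budget_remaining(0.99, WINDOWS_30D, bad_windows=20)  # most budget left
error_budget_remaining(0.99, WINDOWS_30D, bad_windows=90)  # negative: freeze features
```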

A 90-day plan that actually delivers business value

This is how we land it at GitPlumbers without producing a 200‑page slide deck.

  1. Week 1–2: Pick format and catalog, stand up folders and Terraform baseline. Enable CDC for 1–2 critical sources. Define bronze/silver/gold for a single domain.
  2. Week 3–4: Build bronze streaming ingest to Iceberg/Delta. Implement first silver transform. Add dbt tests and Soda checks. Publish freshness SLOs.
  3. Week 5–6: Expose Trino/Athena. Compact bronze daily. Tune partitions. Lock down access with Lake Formation/Ranger. Wire OpenLineage.
  4. Week 7–8: Build first gold mart for a business KPI (e.g., MRR or CAC). Promise: <15 min freshness during business hours. Put a dashboard in front of finance.
  5. Week 9–12: Scale to two more domains. Add cost and performance dashboards. Drill MTTR down <2 hours for data incidents. Document runbooks.

What we typically see:

  • Freshness: 90%+ of tables meeting <15 min SLO by week 6.
  • Performance: 40–70% faster queries after compaction/partition tuning.
  • Cost: 20–35% lower S3/Athena/EMR spend from lifecycle + file sizing.
  • Reliability: MTTR down from days to hours because ownership, lineage, and tests localize failures fast.

You don’t need perfection to deliver value. You need an opinionated architecture and the discipline to operate it. If you want help cutting through the hype and laying real tracks, that’s what we do at GitPlumbers.


Key takeaways

  • Treat the lake as a product with domain ownership, SLOs, and clear contracts—not a dumping ground.
  • Use an open table format (Iceberg/Delta) with bronze/silver/gold to scale schema evolution, compaction, and metadata.
  • Automate quality and freshness checks (Soda/GE) in CI and orchestrator, fail fast, and page the right owner.
  • Control small files and partitions; schedule compaction and enforce write batch sizes.
  • Govern with a real catalog and row-level ACLs; capture lineage so you can trace breakage and blast radius.
  • Start with a 90-day plan tied to business metrics: freshness SLOs, cost per TB, and MTTR reductions.

Implementation checklist

  • Pick your table format (Iceberg or Delta) and catalog (Glue/Hive/Unity/REST)
  • Stand up bronze/silver/gold folders and retention rules
  • Enable CDC for key OLTP sources (Debezium/Kafka)
  • Build dbt tests and Soda checks for critical tables
  • Instrument lineage (OpenLineage) and data SLO dashboards
  • Schedule compaction and vacuum jobs; alert on small-file ratios
  • Lock down access with Lake Formation/Ranger and service principals
  • Publish SLAs/SLOs and on-call runbooks for each domain

Questions we hear from teams

Iceberg or Delta—does it matter?
Both work. If you need multi-engine portability (Spark, Trino, Athena, Snowflake External Tables), Iceberg edges out. If you’re standardized on Databricks with Unity Catalog, Delta is fine. Pick one and commit to its compaction and governance patterns.
Do I need streaming?
Not everywhere. Use CDC streaming into bronze for operational sources where freshness matters. Batch is fine for slow-moving data (ERP, SFTP). The key is consistent contracts and quality checks.
Will this replace my warehouse?
Not necessarily. Many teams run a lake for raw+transforms and push curated marts to Snowflake/BigQuery. Lakes shine for flexibility and ML; warehouses still win for governed BI at scale. Hybrid is normal.
How do we prevent the small files problem?
Control writer batch sizes, set target file size, compact on a schedule, and avoid over-partitioning. Iceberg/Delta compaction should be a first-class Airflow/Dagster job with alerts.
What’s the fastest path to value?
Pick one business KPI, wire CDC to bronze, build a single silver model with dbt tests, then a gold mart with freshness SLOs. Put a dashboard in front of a real stakeholder within 30–45 days.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

