Stop Paying for Shuffles: The ETL Tune-Up That Cut Runtime 40% and Spend 35%

You don’t need a rewrite. You need to stop moving bytes you don’t need, fix file layouts, and put quality gates where they matter. Here’s the playbook we use to make ETL faster, cheaper, and more reliable—without breaking business logic.

“We didn’t rewrite a single line of business logic. We just fixed the plumbing—and the bill dropped 35%.”

The ETL jobs that make your CFO cry

I walked into a team where the daily fact build on Spark took 2h15m, burned through a pile of S3 GETs, and still dropped late-arriving orders. Classic: small Parquet files, blind joins, and zero stats. No one wanted a rewrite. They just wanted it to stop paging the on-call and spiking the Snowflake bill.

We didn’t touch product logic. We cut wall time by 40% and spend by 35% in three weeks by fixing plumbing: storage layout, join strategies, and compute sizing—plus real data quality gates so we could sleep again.

Profile first: know exactly where time and money go

If you don’t measure, you’re cargo-culting configs. Profile before changing a line.

What to capture:

  • End-to-end wall time and critical path stages

  • Input/scan bytes, shuffle read/write, spill to disk

  • Per-query cost: slot_ms (BigQuery), credits (Snowflake), DBU (Databricks)

  • Table-level stats: row counts, partitions, file counts, average file size

  • Data freshness and failure MTTR
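A baseline like the above is just a per-run record you can diff week over week. A minimal sketch (field names are illustrative, not tied to any specific tool):

```python
from dataclasses import dataclass

@dataclass
class RunBaseline:
    """Per-DAG-run metrics to capture before optimizing (illustrative fields)."""
    dag_id: str
    wall_time_s: float
    input_bytes: int
    shuffle_read_bytes: int
    shuffle_write_bytes: int
    spill_bytes: int
    cost_units: float  # slot_ms, credits, or DBUs depending on engine

def shuffle_ratio(run: RunBaseline) -> float:
    # Shuffle bytes per input byte: a quick smell test for join/layout problems
    return (run.shuffle_read_bytes + run.shuffle_write_bytes) / max(run.input_bytes, 1)

baseline = RunBaseline("orders_etl", 8100.0, 2_000_000_000_000,
                       1_800_000_000_000, 1_800_000_000_000,
                       400_000_000_000, 5400.0)
print(round(shuffle_ratio(baseline), 2))  # shuffling ~2x the input is a red flag
```

Anything above ~1.0 on that ratio usually means a join or layout problem, not a compute-sizing problem.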

Examples:

-- BigQuery: get a plan and see estimated stages
EXPLAIN SELECT o.order_id, s.store_id
FROM `raw.orders` o
JOIN `dim.store` s USING(store_id)
WHERE o.event_date BETWEEN '2025-10-01' AND '2025-10-07';
# Spark: fetch the event log and open in Spark UI/History Server
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://org-bucket/spark-logs \
  job.py

Track these KPIs per DAG run:

  • Spark: shuffle_read_bytes, shuffle_write_bytes, bytes_skipped, task spill time

  • Snowflake: WAREHOUSE_CREDITS_USED, partitions scanned vs pruned

  • BigQuery: slot_ms, bytes processed vs billed, materialized view hits

Pipe lineage via OpenLineage or Marquez so you can tie cost to upstream changes.

Design for reliability: idempotent, late-data aware, and observable

Optimizing without reliability just makes bad data arrive faster. We ship pipelines that are restartable, with deterministic outputs.

Patterns that work:

  • Idempotent upserts with stable keys and MERGE semantics

  • Watermarks for late data; dedupe by (business_key, event_ts)

  • Exactly-once on CDC by using source_lsn/op_ts and tombstones

  • Contract tests near sources; fail fast on schema drift
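The dedupe-by-(business_key, event_ts) pattern is engine-independent. A pure-Python sketch of latest-wins semantics with a watermark cutoff (illustrative only; real pipelines do this inside the engine):

```python
from datetime import datetime, timedelta

def dedupe_latest(events, watermark=timedelta(hours=2), now=None):
    """Keep the newest event per business key; drop events older than the watermark.

    events: iterable of dicts with 'order_id' and 'event_ts' (datetime).
    """
    now = now or datetime.utcnow()
    cutoff = now - watermark
    latest = {}
    for e in events:
        if e["event_ts"] < cutoff:
            continue  # too late: route to a late-arrivals table, don't silently merge
        k = e["order_id"]
        if k not in latest or e["event_ts"] > latest[k]["event_ts"]:
            latest[k] = e  # latest-wins: newer event_ts replaces the older record
    return list(latest.values())

now = datetime(2025, 10, 1, 12, 0)
events = [
    {"order_id": 1, "event_ts": datetime(2025, 10, 1, 11, 0), "total": 10},
    {"order_id": 1, "event_ts": datetime(2025, 10, 1, 11, 30), "total": 12},  # newer wins
    {"order_id": 2, "event_ts": datetime(2025, 10, 1, 9, 0), "total": 5},     # beyond watermark
]
print(dedupe_latest(events, now=now))
```

The same two rules (latest event_ts wins, events older than the watermark are quarantined) are what the MERGE condition below encodes in SQL.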

PySpark example (batch or micro-batch) that dedupes and tolerates late arrivals:

from pyspark.sql import functions as F, Window

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders.cdc")
    .load())

orders = (raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", schema).alias("r")).select("r.*"))  # schema defined upstream

# Bound state for late data; dedupe happens per micro-batch below, because
# row_number() over a window isn't supported directly on a streaming DataFrame
watermarked = orders.withWatermark("event_ts", "2 hours")

# Idempotent upsert into Delta/Iceberg: latest event per business key wins
def upsert_batch(df, epoch_id):
    w = Window.partitionBy("order_id").orderBy(F.col("event_ts").desc())
    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter("rn = 1").drop("rn"))
    latest.createOrReplaceTempView("updates")
    latest.sparkSession.sql("""
        MERGE INTO silver.orders t
        USING updates u
        ON t.order_id = u.order_id
        WHEN MATCHED AND u.event_ts >= t.event_ts THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(watermarked.writeStream
    .option("checkpointLocation", "s3a://bucket/chk/orders")
    .foreachBatch(upsert_batch)
    .start())

This is the difference between safe retries and double-counted revenue.

Make storage do the heavy lifting: formats, partitions, and file sizing

90% of cost is moving bytes you didn’t need. Fix the lakehouse layout and watch costs fall.

What works in practice:

  • Use columnar: Parquet or ORC. Avoid CSV except at ingestion edges.

  • Partition on low-cardinality, high-pruning columns (e.g., event_date, not user_id).

  • Target file size 128–512MB to avoid the small-files tax; compact daily.

  • Prefer ZSTD for cold storage scans, Snappy for interactive jobs.

  • Adopt a table format: Delta Lake or Apache Iceberg for ACID, compaction, and time travel.
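The small-files math is worth doing explicitly before you compact. A quick helper (illustrative) to size a compaction target:

```python
def compaction_plan(table_bytes: int, current_files: int,
                    target_file_bytes: int = 256 * 1024**2):
    """Estimate file counts before/after compaction to a target file size."""
    target_files = max(1, -(-table_bytes // target_file_bytes))  # ceil division
    avg_now = table_bytes / max(current_files, 1)
    return {
        "current_files": current_files,
        "avg_file_mb_now": round(avg_now / 1024**2, 2),
        "target_files": target_files,
        "files_eliminated_pct": round(100 * (1 - target_files / current_files), 1),
    }

# An 800 GB table fragmented into 3.2M files (~0.25 MB each)
plan = compaction_plan(800 * 1024**3, 3_200_000)
print(plan)
```

Every eliminated file is one fewer S3 GET and one fewer footer read per scan, which is where the GET-request savings in the case study came from.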

Spark example to write sane files with pruning:

# For Delta, set compression session-wide: spark.sql.parquet.compression.codec=zstd
(silver_df.repartition("event_date")             # one task per date, fewer files
  .sortWithinPartitions("store_id", "event_ts")  # helps z-order-like locality
  .write
  .format("delta")
  .mode("append")
  .partitionBy("event_date")                     # directory partitioning enables pruning
  .option("maxRecordsPerFile", 3_000_000)        # tune to ~256MB files
  .save("s3a://lake/silver/orders"))

Delta compaction + data skipping:

-- Databricks / Delta
OPTIMIZE delta.`s3a://lake/silver/orders`
ZORDER BY (store_id, event_date);

Iceberg compaction via Spark SQL:

CALL catalog.system.rewrite_data_files(
  table => 'silver.orders',
  strategy => 'binpack',
  options => map('target-file-size-bytes','268435456'));

Immediate effect: fewer files opened, better predicate pushdown, less shuffle.

Push compute down, kill shuffles, and let the engine help you

Most of the runtime was a massive join with a skewed dimension. Fix the join strategy, stats, and let the optimizer earn its keep.

Tactics:

  • Enable adaptive query execution: spark.sql.adaptive.enabled=true

  • Collect stats on dimension tables so the planner can broadcast safely

  • Use join hints sparingly when the optimizer lies

  • Bucket or pre-sort where skew is chronic
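Chronic skew is easy to spot from key frequencies before you reach for salting or bucketing. A quick check (illustrative) for whether one join key dominates:

```python
from collections import Counter

def skew_report(keys, threshold=5.0):
    """Flag keys whose frequency exceeds `threshold` x the median key frequency."""
    counts = Counter(keys)
    sizes = sorted(counts.values())
    median = sizes[len(sizes) // 2]
    hot = {k: c for k, c in counts.items() if c > threshold * median}
    return {"median": median, "hot_keys": hot}

# One store generates ~100x the traffic of a typical store
keys = [42] * 1000 + [1] * 10 + [2] * 12 + [3] * 9
print(skew_report(keys))
```

In Spark you'd run the same idea as `df.groupBy("store_id").count()` on a sample; any key that dwarfs the median is a candidate for AQE skew handling or a salted join.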

Spark configs that consistently help:

# Spark 3.x
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.skewJoin.enabled=true \
--conf spark.sql.autoBroadcastJoinThreshold=64m \
--conf spark.sql.files.maxPartitionBytes=134217728 \
--conf spark.sql.shuffle.partitions=400

Spark SQL with a directed broadcast:

SELECT /*+ BROADCAST(dim_store) */ f.*
FROM fact_orders f
JOIN dim_store ON f.store_id = dim_store.store_id
WHERE f.event_date >= date_sub(current_date(), 7);

Snowflake and BigQuery equivalents:

  • Snowflake: define clustering keys for large facts and let pruning work; avoid SELECT * on wide tables.

  • BigQuery: partition and cluster; watch bytes billed vs bytes processed and use materialized views for hot aggregations.
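Whether pruning is actually working is simple arithmetic: partitions scanned over partitions total. A sketch with illustrative numbers:

```python
def pruning_efficiency(total_partitions: int, scanned_partitions: int,
                       bytes_per_partition: int):
    """Estimate bytes saved by partition pruning on a date-partitioned table."""
    saved = (total_partitions - scanned_partitions) * bytes_per_partition
    return {
        "scanned_pct": round(100 * scanned_partitions / total_partitions, 1),
        "bytes_saved_gb": round(saved / 1024**3, 1),
    }

# 730 daily partitions, a 7-day filter, ~4 GB per day
print(pruning_efficiency(730, 7, 4 * 1024**3))
```

Both Snowflake (partitions scanned vs total in the query profile) and BigQuery (bytes processed) report these inputs directly, so you can compute this per query.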

Snowflake warehouse hygiene:

ALTER WAREHOUSE ETL_WH SET AUTO_SUSPEND = 60;
ALTER WAREHOUSE ETL_WH SET AUTO_RESUME = TRUE;
ALTER WAREHOUSE ETL_WH SET MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 2;

Stop paying for idle: right-size and schedule like you mean it

I’ve seen Airflow backfills stampede clusters at 2 a.m. and melt S3. Control concurrency and scale to zero when idle.

Airflow DAG guardrails:

from datetime import timedelta
from airflow import DAG

dag = DAG(
  dag_id="orders_etl",
  max_active_runs=1,    # no overlapping runs
  max_active_tasks=8,   # cap task-level parallelism (was `concurrency` before Airflow 2.2)
  catchup=False,
  default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)

Cap backfills by date window:

import pendulum

# One task per day keeps backfills bounded and resumable (pendulum 2.x API)
for d in pendulum.period(start, end).range("days"):
    run_task.override(task_id=f"build_{d.to_date_string()}")(build_partition, d)

Kubernetes autoscaling for Spark operators (scale executors on demand):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: spark-executor-autoscale
spec:
  scaleTargetRef:
    name: spark-operator
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: spark_stage_active_tasks
      query: sum(spark_stage_active_tasks)  # required by the prometheus scaler
      threshold: '200'

BigQuery slot and reservation discipline with Terraform (reserve, don’t burst blindly):

resource "google_bigquery_reservation" "etl" {
  name          = "etl"
  location      = "US"
  slot_capacity = 200
}

resource "google_bigquery_reservation_assignment" "etl_default" {
  assignee    = "projects/my-project"
  job_type    = "QUERY"
  reservation = google_bigquery_reservation.etl.id
}

The point: cap parallelism, autosuspend warehouses, and use spot/preemptible nodes for batch. You’ll cut idle burn without slowing the path to green.
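The auto-suspend win is worth quantifying before you argue for it. A back-of-envelope estimator with illustrative rates (not Snowflake's actual pricing):

```python
def idle_savings(active_hours_per_day: float, suspend_after_s: int,
                 runs_per_day: int, credits_per_hour: float):
    """Daily credits saved by auto-suspending instead of running 24/7.

    Assumes the warehouse previously stayed up all day; each run now leaves
    one idle tail of `suspend_after_s` before the warehouse suspends.
    """
    always_on = 24 * credits_per_hour
    tails_hours = runs_per_day * suspend_after_s / 3600
    right_sized = (active_hours_per_day + tails_hours) * credits_per_hour
    return round(always_on - right_sized, 2)

# 3 active hours/day, 60s auto-suspend, 24 hourly runs, 2 credits/hour (illustrative)
print(idle_savings(3.0, 60, 24, 2.0))
```

With only three active hours a day, the always-on warehouse is mostly paying for nothing; the 60-second tails are noise by comparison.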

Validate like SREs: put gates where breakage actually happens

You can’t optimize what you can’t trust. We gate critical paths with dbt tests and Great Expectations, with alerts that wake the right person.

dbt model tests and freshness:

version: 2
models:
  - name: fct_orders
    config:
      materialized: incremental
      on_schema_change: fail
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests: [not_null]
sources:
  - name: stripe
    loaded_at_field: created_at  # column dbt uses to compute freshness
    freshness:
      warn_after: {count: 30, period: minute}
      error_after: {count: 120, period: minute}

Great Expectations quick check on a silver table:

import great_expectations as ge

# Great Expectations 0.x PandasDataset API; newer versions use validators/checkpoints
batch = ge.read_parquet("s3://lake/silver/orders/date=2025-10-01/")
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_between("order_total", min_value=0)
batch.expect_table_row_count_to_be_between(min_value=100000, max_value=200000)
results = batch.validate()
if not results.success:
    raise ValueError("DQ failure: orders silver out of spec")

Alert on contract breaks near sources (e.g., Debezium schema change) so you fail fast before the fan starts spinning. Tie alerts to runbooks and owners, not a generic Slack channel.
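Tying alerts to owners and runbooks can start as a simple ownership map keyed by dataset (all names hypothetical):

```python
# Dataset -> owning team and runbook; hypothetical names for illustration
OWNERS = {
    "silver.orders": {"team": "orders-data", "runbook": "runbooks/orders-dq.md"},
    "silver.payments": {"team": "payments-data", "runbook": "runbooks/payments-dq.md"},
}

def route_alert(dataset: str, message: str) -> dict:
    """Resolve a DQ failure to its owner; unknown datasets page the platform team."""
    owner = OWNERS.get(dataset, {"team": "data-platform", "runbook": "runbooks/default.md"})
    return {"page": owner["team"], "runbook": owner["runbook"], "message": message}

alert = route_alert("silver.orders", "row count out of bounds")
print(alert["page"])
```

The point isn't the dict; it's that every alert resolves to a pager target and a runbook link instead of dying in a channel.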

Case file: 40% faster, 35% cheaper, fewer pages in three weeks

A multi-brand retailer on Databricks + Delta with Snowflake for BI. Pain: 2h15m daily ETL, frequent late data, and cost spikes at month-end.

What we changed:

  1. Profiling: found 68% of time in a skewed orders×store join, with 1.8TB shuffle and 40% task spill.
  2. Storage: compacted 3.2M tiny files into ~256MB Parquet; partitioned on event_date; ZORDER on (store_id, event_date).
  3. Compute: enabled AQE, broadcasted dim_store (120MB), dropped shuffle partitions from 1200 to 400.
  4. Scheduling: capped Airflow concurrency; windowed backfills; autosuspended Snowflake warehouses to 60s.
  5. Quality: added dbt uniqueness/null checks and Great Expectations row count bounds; implemented 2h watermark in the CDC stream.

Measured outcomes after two weeks of guarded rollouts:

  • Wall time: 2h15m → 1h21m (−39.6%)
  • Shuffle I/O: 1.8TB → 720GB (−60%)
  • S3 GETs: −57% (fewer small files)
  • Snowflake credits/day: −32% (auto-suspend + pruning)
  • Incidents: 3/week → 0 in 30 days; MTTR down from 45m to 12m

We didn’t rewrite business logic. We just fixed the plumbing.

What to do Monday morning

  • Capture a one-week baseline on runtime, shuffle, scan bytes, and cost per DAG.
  • Choose one high-ROI path (usually a skewed join or a small-files swamp) and ship a canary optimization.
  • Turn on adaptive execution and set broadcast thresholds; add stats on dimensions.
  • Compact and partition one hot table; run a limited backfill.
  • Add three dbt tests and one Great Expectations suite on the most critical model; wire to alerts.
  • Right-size your compute: auto-suspend warehouses, cap Airflow concurrency, and schedule batch to cheap windows.

If you want a fast path: yes, GitPlumbers runs a two-week ETL Tune-Up where we do exactly this, with before/after metrics and a rollback plan. No silver bullets—just results.


Key takeaways

  • Measure first: profile wall time, shuffle bytes, scan volume, and `slot_ms`/credits before touching code.
  • Design for idempotency and late data so retries don’t corrupt truth tables.
  • Stop moving bytes: columnar formats, partition pruning, sane file sizes, and table formats (Delta/Iceberg) unlock huge wins.
  • Use pushdown and kill shuffles: broadcast joins, join hints, statistics, and `spark.sql.adaptive.enabled=true`.
  • Right-size compute: auto-suspend, autoscaling, spot instances, and schedule windows cut idle burn.
  • Gate with data quality checks close to sources and contracts; fail fast with actionable alerts.
  • You can get 30–50% runtime and cost reduction in weeks without a rewrite if you fix the plumbing.

Implementation checklist

  • Collect baseline metrics: wall time, input bytes, shuffle bytes, task spill, scan volume, `slot_ms`/credits.
  • Verify idempotency: deterministic keys, `MERGE` semantics, dedupe strategy, watermark policy.
  • Fix storage layout: Parquet/ORC, partitions on low-cardinality columns, target file size 128–512MB, compression (ZSTD/Snappy).
  • Turn on adaptive execution and pushdown: `spark.sql.adaptive.enabled`, broadcast thresholds, join hints, stats.
  • Right-size compute: auto-suspend/scale-to-zero, spot/preemptible, reservation sizing, DAG concurrency limits.
  • Add quality gates: dbt tests, Great Expectations, freshness SLAs, canary backfills, data contracts.
  • Instrument lineage and cost: OpenLineage, query profiles, tags/labels for chargeback/FinOps.
  • Run a controlled backfill with canary and compare row counts, null rates, and business KPIs.

Questions we hear from teams

What’s the fastest way to get meaningful ETL speedups without a rewrite?
Compact small files, enable adaptive execution, and fix one skewed join with a broadcast or bucketing. Add basic dbt/GE checks so you don’t ship garbage faster. You’ll usually see 20–40% wins in a week.
Parquet `zstd` vs `snappy`?
Use `zstd` for batch-heavy tables where CPU is cheaper than I/O; use `snappy` for interactive workloads and frequent read-modify-write. Measure wall time and CPU utilization to decide.
Delta vs Iceberg for optimization features?
Both give you ACID, compaction, and metadata for pruning. Delta has `OPTIMIZE ZORDER` and is great on Databricks; Iceberg is engine-agnostic with solid partition evolution and bin-packing compaction. Choose based on your engine/ecosystem.
How do I avoid double-counting with retries?
Design idempotent writes with deterministic keys and `MERGE`/upsert semantics. Use watermarks and dedupe windows. For CDC, order by `op_ts`/`lsn` and apply tombstones.
Can I get cost reductions on Snowflake/BigQuery without touching ETL code?
Yes: right-size warehouses/slots, auto-suspend aggressively, add clustering/partitioning, and replace hot queries with materialized views. Tag jobs for chargeback to expose runaway spend.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Book a 2-week ETL Tune-Up assessment · See our ETL optimization checklist
