Stop Paying for Slow ETL: The Playbook That Cut Our Snowflake Bill 42% and Ended 3 AM Pages
If your nightly ETL slips and your warehouse bill spikes, you don’t need another tool—you need a ruthless optimization pass that protects data quality and business SLAs.
Fast is only useful if it’s repeatable. Optimize for the happy path and the failure path—both cost money.
The 3 AM ETL That Tanked The CFO’s Mood
I walked into a client who had a familiar problem: a nightly Airflow DAG that was supposed to finish by 5 AM slid until 8:45, sales dashboards missed the standup, and Snowflake credits looked like a bonfire. They’d tried “just scale the warehouse” and “rewrite in Spark,” then got whiplash bills from both directions. Classic.
Here’s the uncomfortable truth: most ETL pipelines are slow and expensive because of a handful of fixable design choices—wasted I/O, full reloads, bad orchestration, and quality gates that either catch nothing or block everything. I’ll show you what we changed to cut runtime from 6.2 hours to 2.4 and reduce monthly warehouse spend by 42%, while improving data reliability.
Start With a Real Diagnosis, Not Voodoo Tuning
You can’t optimize what you don’t profile. Before touching code, establish baselines and SLOs.
- Define SLOs tied to business value:
- Freshness: e.g., “Orders data ready by 6:00 AM UTC, 99.5%.”
- Completeness: e.g., “>= 99.9% of events ingested within 24h.”
- Accuracy: e.g., “Finance reconciliation delta < 0.1%.”
- Instrument the pipeline:
- Runtime per task and end-to-end wall clock
- Bytes scanned vs. bytes returned (warehouse/lake)
- Rows processed and cost per run (Snowflake credits, BigQuery bytes, Databricks DBUs)
- MTTR and failure modes (schema drift, late data, throttling)
- Get lineage into a graph: OpenLineage + Marquez or your catalog (Atlan, DataHub). If you can’t see dependencies, you’ll optimize the wrong thing.
I’ve seen teams jump straight to “bigger cluster.” Don’t. In our case, 63% of time was wasted reading CSVs repeatedly and re-aggregating unchanged partitions.
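To make “attack the top offenders” concrete, here’s a minimal profiling sketch in plain Python. The task records and field names are assumptions—feed it whatever your warehouse’s query history exposes (bytes scanned and bytes returned per task):

```python
def wasted_io_ratio(bytes_scanned: int, bytes_returned: int) -> float:
    """Fraction of scanned bytes that never reached a result set."""
    if bytes_scanned == 0:
        return 0.0
    return 1 - bytes_returned / bytes_scanned

def top_offenders(tasks: list, n: int = 3) -> list:
    """Rank tasks by wasted I/O so the fix list writes itself."""
    return sorted(
        tasks,
        key=lambda t: wasted_io_ratio(t["bytes_scanned"], t["bytes_returned"]),
        reverse=True,
    )[:n]

# Hypothetical telemetry records for illustration only.
tasks = [
    {"name": "load_orders", "bytes_scanned": 900_000, "bytes_returned": 20_000},
    {"name": "dim_users", "bytes_scanned": 50_000, "bytes_returned": 45_000},
]
worst = top_offenders(tasks, n=1)[0]["name"]  # "load_orders"
```

Ranking by wasted ratio (not raw bytes) surfaces the tasks where pruning and pushdown will pay off fastest.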
Kill Wasted I/O: Columnar, Partitioning, and Pushdown
I/O is the silent budget killer. Move raw to columnar and let your engines do the cheap math where the data sits.
- Store data in columnar formats:
- Parquet for general; Delta Lake/Iceberg for ACID and time travel.
- Compress with `snappy` by default; avoid `gzip` on columnar unless you truly need it.
- Partition by time, and only by columns you actually filter on: a daily/hourly `ingest_date` beats high-cardinality traps.
- Enable pruning and predicate pushdown: cluster/Z-Order on heavy filter keys.
Example: Convert raw CSV to Delta and partition by ingest date with Spark 3.5 on Databricks:
from pyspark.sql import functions as F
raw = (spark.read.option("header", True)
.csv("s3://acme-raw/orders/*.csv"))
bronze = raw.withColumn("ingest_date", F.to_date("created_at"))
(bronze.write
.format("delta")
.mode("overwrite")
.partitionBy("ingest_date")
.option("compression", "snappy")
.save("s3://acme-bronze/orders"))

On Snowflake, avoid scanning entire tables. Use clustering on frequently filtered columns and right-size the warehouse:
ALTER WAREHOUSE ETL_WH SET
AUTO_SUSPEND = 60, AUTO_RESUME = TRUE,
MIN_CLUSTER_COUNT = 1, MAX_CLUSTER_COUNT = 2,
SCALING_POLICY = 'ECONOMY';
ALTER TABLE FACT_ORDERS CLUSTER BY (ORDER_DATE);

In BigQuery, partition and cluster in DDL, and make sure queries include partition filters:
CREATE TABLE `acme.analytics.fact_orders`
PARTITION BY DATE(order_timestamp)
CLUSTER BY customer_id, status AS
SELECT * FROM `acme.staging.orders_clean`;

The win: bytes scanned dropped by 58% on core facts because queries hit only relevant partitions.
Stop Full Reloads: Go Incremental With CDC + MERGE
Full reloads are for the first day and disaster recovery. After that, you’re burning money.
- Capture changes at the source with CDC:
- Debezium to Kafka, or native connectors (Fivetran/Hevo) if you must.
- Land changes as append-only (Bronze), then upsert into curated (Silver/Gold).
- Use watermarks and idempotent `MERGE` patterns.
- Handle late-arriving data with a sliding window and reprocessing policy.
dbt 1.7 incremental model example using a watermark:
{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}
WITH src AS (
SELECT *
FROM {{ source('staging', 'orders_cdc') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT coalesce(max(updated_at), '1900-01-01') FROM {{ this }})
{% endif %}
)
SELECT * FROM src

Snowflake upsert with MERGE (idempotent and watermark-aware):
MERGE INTO analytics.fact_orders t
USING (
SELECT *
FROM staging.orders_delta
WHERE updated_at >= DATEADD(day, -2, CURRENT_DATE())
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET amount = s.amount, status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (order_id, amount, status, updated_at)
VALUES (s.order_id, s.amount, s.status, s.updated_at);

That 48-hour window covers late events without reprocessing the entire table. On one retail workload, this change alone cut runtime 35%.
Put Compute Where It’s Cheapest (And Closest)
I’ve seen teams drag terabytes across the wire because “that’s where our ETL runs.” Don’t fight gravity.
- If your warehouse is Snowflake/BigQuery, do transforms there with dbt and SQL—avoid exporting to Spark unless you need custom UDFs or ML.
- If you’re lake-first (S3/ADLS + Databricks), keep joins in Spark where your Delta/Iceberg tables live.
- Right-size compute:
- Snowflake: prefer smaller warehouses with `ECONOMY` scaling; use Tasks for scheduled SQL.
- Databricks: size executors for I/O, not CPU; enable AQE and broadcast joins on small dims.
Spark 3.5 join hint for a small dimension table:

fact = spark.read.table("bronze.clicks")
dim = spark.read.table("silver.users_dim")
joined = fact.join(dim.hint("broadcast"), "user_id")

Rule of thumb: co-locate compute with data, and prefer pushing filters and aggregations down to the storage engine. We shaved 20 minutes by moving a “top N products by region” from Spark to a single Snowflake SQL task.
Orchestrate Like an Adult: Idempotent, Backpressured, Observable
Airflow (2.8) is fine—if you use it right. Most pipelines die from orchestration debt, not engine choice.
- Make tasks idempotent. Writes should be safe to re-run; use temp tables + swaps or `MERGE`.
- Apply backpressure and concurrency caps. Don’t DDoS your own warehouse.
- Use retries with jitter; tag and pool heavy tasks.
Minimal Airflow snippet:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
with DAG(
    dag_id="etl_orders",
    start_date=datetime(2024, 1, 1),
    schedule="0 3 * * *",
    max_active_runs=1,
    max_active_tasks=32,
    dagrun_timeout=timedelta(hours=6),
    catchup=False,
) as dag:
    def load():
        # idempotent load here
        pass

    PythonOperator(
        task_id="load_fact_orders",
        python_callable=load,
        retries=3,
        retry_delay=timedelta(minutes=5),
        retry_exponential_backoff=True,  # jittered backoff between retries
        pool="etl-pool",
    )

Also: emit OpenLineage events, and wire alerts to Slack/PagerDuty on SLO breaches, not just task failures.
Quality Gates That Don’t Become Toll Booths
Data reliability must improve while costs drop. That’s achievable if your checks are fast, focused, and tiered.
- Contracts at the edges: validate schemas on ingress. Reject or quarantine bad payloads.
- Lightweight blocking tests on critical models; non-blocking telemetry for the rest.
- Canary on a sample before running full workloads.
dbt tests that actually matter:
version: 2
models:
  - name: fct_orders
    tests:
      - dbt_utils.recency:
          field: order_timestamp
          datepart: hour
          interval: 2
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0

And a quick canary pattern in SQL to fail fast if row counts go sideways by >20%:
WITH last_7 AS (
  SELECT COUNT(*) / 7 AS daily_avg FROM analytics.fact_orders
  WHERE order_timestamp >= DATEADD(day, -7, CURRENT_DATE())
), last_1 AS (
  SELECT COUNT(*) AS c FROM analytics.fact_orders
  WHERE order_timestamp >= DATEADD(day, -1, CURRENT_DATE())
)
-- Returns a row only when today's count drops >20% below the trailing daily average;
-- run it as a dbt singular test, which fails the build if any rows come back.
SELECT 'Anomalous drop in orders' AS failure
FROM last_1 l1 CROSS JOIN last_7 l7
WHERE l1.c < l7.daily_avg * 0.8;

Zero-trust your upstreams, but don’t strangle throughput with heavyweight checks on every table.
What “Good” Looked Like After We Fixed It
We didn’t add a new platform. We removed waste and added guardrails. Results from a recent retail client (Snowflake + dbt + Airflow, with CDC landing in S3/Delta):
- Runtime: 6.2h → 2.4h (-61% critical path)
- Snowflake credits: -42% monthly on ETL workloads
- Storage: -28% scanned due to pruning and clustering
- SLA attainment: 94.1% → 99.7% for 6 AM dashboards
- MTTR: 84 min → 19 min post-incident (better idempotency + alerts)
- Business impact: Finance had clean daily close by 7 AM, Marketing shifted to hourly campaign pivots without extra infra
We tracked two KPIs that aligned us with leadership:
- Cost per million rows processed (warehouse credits + compute DBUs)
- SLA adherence rate (freshness and completeness)
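Both KPIs are trivial to compute once the telemetry exists. A minimal sketch in plain Python—the figures below are placeholders for illustration, not client data:

```python
def cost_per_million_rows(total_cost_usd: float, rows_processed: int) -> float:
    """Warehouse credits + compute spend, normalized per million rows."""
    return total_cost_usd / (rows_processed / 1_000_000)

def sla_adherence(on_time_runs: int, total_runs: int) -> float:
    """Share of runs that hit the freshness/completeness SLO."""
    return on_time_runs / total_runs if total_runs else 0.0

# Placeholder numbers: $360 of monthly ETL spend over 48M rows, 29/30 on-time runs.
monthly_cost = cost_per_million_rows(360.0, 48_000_000)  # 7.5 USD per 1M rows
adherence = sla_adherence(29, 30)
```

Trend both month over month; a rising cost-per-row with flat volume is your early warning that waste is creeping back in.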
When you report these, nobody asks “why not Spark vs. SQL?” They see value.
A Pragmatic Migration Plan You Can Start This Sprint
- Establish SLOs and baselines. Turn on cost and lineage telemetry.
- Convert hot paths to columnar with partitioning and pushdown.
- Replace full reloads with CDC + incremental models and
MERGE. - Right-size compute and co-locate heavy joins.
- Add lightweight quality gates and a canary stage.
- Tighten orchestration: idempotency, backpressure, retries, alerts.
- Review weekly: top 3 cost offenders and SLO breaches; fix ruthlessly.
If your ETL “optimization” makes you faster but less reliable, you didn’t optimize—you gambled. The playbook above gives you both speed and safety.
If you want a second set of eyes, this is exactly the kind of mess we clean up at GitPlumbers. We don’t push tools; we fix pipelines so your team can ship safely and your CFO stops side-eyeing your credit burn.
Key takeaways
- Profile before you optimize—baseline runtime, I/O, and spend, then attack the top three offenders.
- Convert raw to columnar (Parquet/Delta/Iceberg), partition by coarse time keys (daily/hourly) rather than high-cardinality columns, and let engines push filters.
- Stop full reloads: adopt CDC + incremental models with idempotent MERGE patterns and watermarks.
- Right-size compute: auto-suspend, economy scaling, and co-locate joins to avoid cross-system shuffles.
- Build guardrails that don’t throttle: fast contract checks, dbt tests, and canary runs on samples.
- Orchestrate for resilience: idempotent tasks, backpressure, retries with jitter, and concurrency caps.
- Measure business impact: track SLA attainment, cost per row processed, and MTTR—not just wall-clock time.
Implementation checklist
- Define data SLOs (freshness, completeness, accuracy) tied to business SLAs.
- Instrument pipelines with runtime, bytes scanned, rows processed, and cost per run.
- Convert raw CSV/JSON to Parquet/Delta; enforce partitioning and clustering.
- Implement incremental loads via CDC (Debezium) or watermarks; use `MERGE` for upserts.
- Tune compute: auto-suspend warehouses, use economy scaling, size Spark executors for I/O.
- Add lightweight quality gates (dbt tests, canaries) that fail fast and alert precisely.
- Cap concurrency and set backpressure in Airflow; make tasks idempotent.
- Report outcomes monthly: runtime reduction, spend delta, SLA attainment, MTTR.
Questions we hear from teams
- How do I know if I should keep transforms in the warehouse vs. Spark?
- If 80% of your transforms are SQL-friendly (aggregations, joins, windows) and your data already lives in Snowflake/BigQuery, keep them there. Move to Spark/Databricks when you need custom UDFs, complex sessionization, or ML that benefits from the lakehouse. Always co-locate compute with data to avoid network egress and shuffles.
- What’s the quickest win to cut ETL costs without a rewrite?
- Stop full reloads. Implement incremental loads with a watermark and `MERGE`. Pair that with partitioning and predicate pushdown. These two changes usually deliver 30–50% reductions in runtime and scanned bytes.
- Won’t more quality checks slow my pipeline down?
- Not if you tier them. Put fast, blocking checks on critical tables (uniqueness, not-null, recency). Make heavier profiling or anomaly detection non-blocking and run it out-of-band. Use canary runs on samples before full execution to fail fast.
- How do I handle late-arriving data and still stay fast?
- Use a sliding window strategy. Reprocess the last N days (e.g., 2–3) with idempotent `MERGE`s. Track watermarks per source and only re-open partitions that need it. This keeps the main flow fast while maintaining accuracy.
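The per-source watermark idea can be sketched in a few lines of Python. The in-memory store is a stand-in for illustration—persist watermarks in a real table in production, and the names here are assumptions:

```python
from datetime import datetime, timedelta

class WatermarkStore:
    """In-memory stand-in for a watermark table (use a durable table in prod)."""
    def __init__(self):
        self._marks = {}

    def get(self, source: str) -> datetime:
        # Unknown sources start at an epoch far in the past (full first load).
        return self._marks.get(source, datetime(1900, 1, 1))

    def set(self, source: str, ts: datetime) -> None:
        # Only advance, never regress: re-runs stay idempotent.
        self._marks[source] = max(self.get(source), ts)

def incremental_window(store: WatermarkStore, source: str,
                       late_days: int = 2) -> datetime:
    """Start of the extraction window: watermark minus a late-data buffer."""
    return store.get(source) - timedelta(days=late_days)
```

Each run extracts from `incremental_window(...)` forward and advances the watermark on success, so only the trailing buffer is ever reprocessed.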
- What metrics should I report to leadership?
- Report SLA attainment (freshness/completeness), runtime on the critical path, cost per million rows processed, and MTTR. Leaders don’t care about engine flame graphs; they care that the dashboard is on time and the bill is predictable.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
