The ETL That Ate Your Cloud Bill: How We Cut 68% Runtime and 45% Cost Without Rewriting Everything
Real-world ETL optimization that shrinks runtime and spend while boosting reliability, data quality, and business impact.
You don’t scale your way out of bad ETL; you design your way out of it.
The 3 a.m. ETL That Ate Your Cloud Bill
I’ve walked into more than one shop where a “temporary” ETL turned into a nightly seven-hour money fire. One client had dbt on Snowflake, Airflow, and Fivetran. Good tools, bad defaults: full-table refreshes, no partitioning, Snowflake warehouses idling at LARGE all night. We didn’t rewrite. We profiled, flipped a few switches, and rewired the worst offenders.
- Result: runtime down 68% (7h → 2h15m), cost down 45%, freshness SLO (2h) met consistently.
- No heroics, no platform migration, no 6-month program. Two sprints. That’s the playbook below.
If your ETL needs a hug and a coffee to finish on time, it’s not a scaling problem—it’s a layout and control problem.
Find the bottlenecks and the dollar leaks
Before touching code, get visibility. The cost is hiding in wide joins, full refreshes, and small-file hell.
Profile queries and jobs
- Snowflake: `QUERY_HISTORY` and `WAREHOUSE_LOAD_HISTORY` to find long scans and queued time.
- BigQuery: `INFORMATION_SCHEMA.JOBS` for bytes processed; `--maximum_bytes_billed` locally.
- Spark/Databricks: Spark UI stages with skew and shuffle read metrics.
- Airflow/Dagster: longest DAG tasks, retries, backfill stacks.
Attribute cost and detect anomalies
- Tag workloads: project/warehouse labels and an Airflow `pool` per domain.
- Pipe query metadata to Prometheus/Datadog; alert on cost per run and freshness SLOs.
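The alerting side of that can be surprisingly simple. Here's a minimal sketch of a cost-per-run anomaly check — function and threshold names are illustrative, not any specific Datadog/Prometheus API:

```python
from statistics import median

def cost_anomaly(history_usd: list, latest_usd: float, tolerance: float = 2.0) -> bool:
    """Flag a run whose cost exceeds `tolerance` times the recent median.

    `history_usd` holds cost-per-run for recent successful runs of the
    same pipeline; a median baseline is robust to one-off spikes.
    """
    if not history_usd:
        return False  # no baseline yet; don't page anyone
    baseline = median(history_usd)
    return latest_usd > tolerance * baseline

# A nightly job that usually costs ~$4 suddenly costs $11
print(cost_anomaly([3.8, 4.1, 4.0, 4.3, 3.9], 11.0))  # True
```

Wire the boolean into whatever alerting you already have; the point is to alert on cost per run, not monthly totals, so you catch the regression the morning after it ships.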
```sql
-- BigQuery: top 10 spenders yesterday
SELECT job_id, user_email, total_bytes_processed/1e9 AS gb,
  (total_bytes_processed/1e12)*5 AS est_cost_usd,  -- assumes ~$5/TB on-demand
  statement_type
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND statement_type IN ('SELECT','CREATE_TABLE_AS_SELECT')
ORDER BY total_bytes_processed DESC
LIMIT 10;
```

What we usually find:
- Full refreshes reprocessing years of history for a 1% delta.
- No partition pruning, so every query scans everything.
- Spark writing thousands of tiny files because `maxPartitionBytes` is left at its default.
- Orchestrators launching all backfills at once, DDoS’ing the warehouse.
Stop full refreshes: go incremental, CDC-first
I’ve seen teams cut runtime in half just by switching to `materialized='incremental'` in dbt and using `MERGE`. Add CDC and you’re done.
- Use `is_incremental()` with a watermarked column (`updated_at`).
- Prefer `MERGE` to `delete+insert` for correctness and performance.
- Handle late-arriving updates and soft deletes explicitly.
```sql
-- models/fct_orders.sql (dbt on Snowflake/BigQuery)
{{ config(materialized='incremental', unique_key='order_id', incremental_strategy='merge') }}

with src as (
    select * from {{ source('app', 'orders') }}
    {% if is_incremental() %}
    where updated_at > (
        select coalesce(max(updated_at), '1970-01-01') from {{ this }}
    )
    {% endif %}
)

select
    order_id,
    customer_id,
    total_amount,
    status,
    updated_at
from src
```

CDC beats batch diffing. If you can, stream changes from Debezium → Kafka → object storage table, then `MERGE`.
```sql
-- Snowflake MERGE pattern with deletes handled
MERGE INTO analytics.orders t
USING staging.orders_delta s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op IN ('U','I') THEN UPDATE SET
    customer_id = s.customer_id,
    total_amount = s.total_amount,
    status = s.status,
    updated_at = s.updated_at
WHEN NOT MATCHED AND s.op IN ('I','U') THEN INSERT (
    order_id, customer_id, total_amount, status, updated_at
) VALUES (
    s.order_id, s.customer_id, s.total_amount, s.status, s.updated_at
);
```

Tip: if you inherited vibe-coded ELT that reselects everything “just to be safe,” replace it with a CDC-backed incremental model. Less compute, fewer races, better SLAs.
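If the MERGE semantics feel abstract, here's the same upsert/delete logic as a tiny Python sketch — the `op` codes and `order_id` key mirror the SQL above, and the staleness check is an extra guard (my addition, not in the SQL) that makes re-runs with late-arriving rows idempotent:

```python
def apply_delta(table: dict, delta: list) -> dict:
    """Apply a CDC delta to a keyed table, mirroring the MERGE above.

    Each delta row carries op ('I', 'U', 'D') plus a full row image
    keyed by order_id. Stale (late-arriving) versions are skipped so
    replaying the same delta twice is harmless.
    """
    for row in delta:
        key = row["order_id"]
        current = table.get(key)
        if row["op"] == "D":
            table.pop(key, None)
            continue
        # 'I' and 'U' both upsert; skip versions older than what we hold
        if current is None or row["updated_at"] >= current["updated_at"]:
            table[key] = {k: v for k, v in row.items() if k != "op"}
    return table

orders = {1: {"order_id": 1, "status": "pending", "updated_at": 1}}
delta = [
    {"op": "U", "order_id": 1, "status": "paid", "updated_at": 2},
    {"op": "I", "order_id": 2, "status": "pending", "updated_at": 2},
    {"op": "D", "order_id": 2},
]
print(apply_delta(orders, delta))
# → {1: {'order_id': 1, 'status': 'paid', 'updated_at': 2}}
```

Ten lines of logic, but it's exactly the contract your warehouse MERGE has to honor — which is why a property test over it is cheap insurance.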
Storage layout that lets compute breathe
Your warehouse is fast; your layout is not. Fix it.
- Partition by date/time you filter on. Cluster by high-cardinality columns used in joins.
- Use columnar formats (Parquet/ORC) with ZSTD/Snappy. Target 128–512MB file sizes.
- Compact small files regularly. In Delta/Iceberg, use maintenance calls.
```sql
-- BigQuery table with pruning
CREATE OR REPLACE TABLE analytics.orders
PARTITION BY DATE(updated_at)
CLUSTER BY customer_id AS
SELECT * FROM staging.orders_clean;
```

```sql
-- Databricks Delta: compact and improve skipping
OPTIMIZE analytics.orders ZORDER BY (customer_id);
```

```sql
-- Apache Iceberg compaction (Spark SQL)
CALL system.rewrite_data_files(table => 'analytics.orders');
```

Spark write tuning matters:
```shell
# Spark SQL configs (Databricks or spark-submit)
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.adaptive.skewJoin.enabled=true \
--conf spark.sql.autoBroadcastJoinThreshold=104857600 \
--conf spark.sql.files.maxPartitionBytes=134217728 \
--conf spark.sql.optimizer.dynamicPartitionPruning.enabled=true
```

- AQE + dynamic partition pruning kills wasted shuffles.
- `maxPartitionBytes` avoids a blizzard of 2MB files.
- Broadcast small dimension tables; don’t hash-join everything.
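The file-size math behind those targets is worth doing explicitly before a big write. A minimal sketch (the function name and 256 MiB default are my choices, within the 128–512 MB band above):

```python
def plan_output_partitions(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """Partitions to coalesce to for a ~256 MiB output file size.

    Pair the result with df.repartition(n) or rely on AQE coalescing;
    either way, decide file count from data volume, not executor count.
    """
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# A 10 GiB write at ~256 MiB per file → 40 files instead of thousands
print(plan_output_partitions(10 * 1024**3))  # 40
```

If you don't know `total_bytes` up front, estimate it from the input scan minus expected filter selectivity — a rough guess still beats the default behavior of one file per shuffle partition.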
Compute tuning: settings that actually move the needle
You don’t need a bigger cluster. You need the right switches.
- Snowflake
  - Turn on `auto_suspend` (30–120s) and `auto_resume`.
  - Use `ECONOMY` scaling; right-size to `XSMALL`/`SMALL` and scale out only under load.
  - Separate ELT and BI into warehouses to avoid noisy neighbors.
```hcl
# Terraform: right-size Snowflake warehouse
resource "snowflake_warehouse" "etl_wh" {
  name                                = "ETL_WH"
  warehouse_size                      = "XSMALL"
  auto_suspend                        = 60
  auto_resume                         = true
  min_cluster_count                   = 1
  max_cluster_count                   = 1
  scaling_policy                      = "ECONOMY"
  statement_queued_timeout_in_seconds = 60
  comment                             = "Right-sized for nightly dbt; auto-suspend saves $$$"
  tags                                = { env = "prod", owner = "data-platform" }
}
```

- BigQuery
- Use materialized views for stable aggregations.
  - Cap ad-hoc queries with `--maximum_bytes_billed`.
  - Clustered/partitioned tables cut scanned bytes dramatically.

```shell
bq query --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000 \
  'SELECT customer_id, SUM(total_amount) FROM analytics.orders
   WHERE updated_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
   GROUP BY customer_id'
```

- Spark/Databricks
- Use spot/preemptible for non-critical batch.
- Cache only genuinely reused datasets; evict aggressively.
- Kill long-tail skew with salting or AQE skew join.
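Salting is the least obvious of those, so here's the core trick as a sketch (pure-Python for clarity; key names, bucket count, and the `#` separator are illustrative). The fact side scatters each hot key across N variants; the dimension side is exploded to all N so the join still lines up:

```python
import random

def salt_fact_key(key: str, hot_keys: set, buckets: int = 8) -> str:
    """Fact side: scatter a skewed join key across `buckets` salted variants."""
    if key in hot_keys:
        return f"{key}#{random.randrange(buckets)}"
    return key

def explode_dim_key(key: str, hot_keys: set, buckets: int = 8) -> list:
    """Dimension side: emit every salted variant so every fact row finds a match."""
    if key in hot_keys:
        return [f"{key}#{i}" for i in range(buckets)]
    return [key]
```

In Spark you'd express the same thing with a `rand()`-based salt column and an exploded dimension, but AQE's skew-join handling often gets you most of the benefit without touching the query — try the config flag first.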
Rule of thumb: if a change doesn’t reduce bytes scanned, shuffle written, or idle compute, it probably won’t move your bill.
Reliability and quality guardrails (so optimizations don’t bite you)
Speed without reliability is a rollback waiting to happen. Bake in checks as you optimize.
- Data contracts at boundaries: define schemas and SLAs with producers.
- DQ tests on critical tables: nulls, ranges, referential integrity.
- Canary runs: run optimized job in parallel for a week; diff outputs before cutover.
- Lineage: OpenLineage/Marquez or built-in Dagster asset lineage to spot blast radius.
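For the canary diff, you don't need a framework — row count plus an order-independent checksum catches most regressions. A minimal sketch (names are mine; the XOR trick means identical duplicate pairs cancel out, so treat this as a sanity check, not a proof of equality):

```python
import hashlib

def table_fingerprint(rows: list) -> tuple:
    """(row_count, order-independent checksum) over canonicalized rows.

    Each row is hashed on sorted key/value pairs; XOR-folding the
    digests makes the result insensitive to row order.
    """
    acc = 0
    for row in rows:
        canonical = repr(sorted(row.items())).encode()
        acc ^= int.from_bytes(hashlib.sha256(canonical).digest()[:8], "big")
    return len(rows), f"{acc:016x}"

def canary_matches(old_rows: list, new_rows: list) -> bool:
    """True if the legacy and optimized outputs fingerprint identically."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)
```

Run it nightly over the old and optimized outputs for a week; a clean streak is your cutover evidence, and the first mismatch gives you a concrete row set to debug.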
```python
# Great Expectations: quick guardrails
# (legacy PandasDataset API, Great Expectations < 0.16)
from great_expectations.dataset import PandasDataset

def validate_orders(df):
    ds = PandasDataset(df)
    ds.expect_column_values_to_not_be_null("order_id")
    ds.expect_column_values_to_be_between("total_amount", min_value=0, strict_min=True)
    ds.expect_column_values_to_match_regex("status", "^(paid|pending|failed|refunded)$")
    result = ds.validate()
    if not result["success"]:
        raise ValueError("DQ failed: orders validation")
```

In dbt, add tests:
```yaml
# models/schema.yml
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```

Orchestration and backfills without cluster meltdowns
Most outages I’ve seen come from orchestration misconfig, not bad SQL.
- Use `max_active_runs=1` for heavy DAGs; throttle with `pools`.
- Make sensors deferrable or use `reschedule` mode—don’t block a worker thread.
- Apply jittered retries to avoid synchronized thundering herds.
- Backfill in slices (per day/week) with concurrency caps.
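What "jittered retries" means concretely: cap the exponential backoff, then draw the actual delay uniformly from it, so a hundred tasks that fail together don't all retry together. A minimal sketch (function name and defaults are illustrative; Airflow's `retry_exponential_backoff` adds its own jitter, so this is the idea, not a replacement):

```python
import random
from datetime import timedelta

def jittered_delay(attempt: int,
                   base: timedelta = timedelta(minutes=5),
                   cap: timedelta = timedelta(minutes=30)) -> timedelta:
    """Capped exponential backoff with full jitter.

    attempt 0 → up to 5 min, attempt 1 → up to 10 min, ... capped at
    30 min; the uniform draw desynchronizes retries across tasks.
    """
    exp = min(cap, base * (2 ** attempt))
    return timedelta(seconds=random.uniform(0, exp.total_seconds()))
```

The cap matters as much as the jitter: without it, attempt 10 would schedule a retry five thousand minutes out and your SLO is gone.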
```python
# Airflow: sane defaults
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "data",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
}

with DAG(
    dag_id="etl_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 * * * *",
    catchup=False,
    max_active_runs=1,
    default_args=default_args,
    tags=["etl"],
) as dag:
    start = EmptyOperator(task_id="start", pool="etl_pool")
    transform = EmptyOperator(task_id="transform_orders", pool="etl_pool")
    end = EmptyOperator(task_id="end", pool="etl_pool")

    start >> transform >> end
```

If Airflow backfills still swamp you, stagger them by date window and use dedicated, smaller warehouses for backfill work.
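Staggering by date window is mechanical once you generate the windows up front. A minimal sketch of the slicing step (names are mine; feed the windows to your backfill command or a mapped task with a concurrency cap):

```python
from datetime import date, timedelta

def backfill_slices(start: date, end: date, days_per_slice: int = 1) -> list:
    """Split [start, end] inclusive into bounded (start, end) windows.

    Run the windows under a pool/concurrency cap instead of launching
    one giant backfill that swamps the warehouse.
    """
    windows = []
    cursor = start
    while cursor <= end:
        stop = min(cursor + timedelta(days=days_per_slice - 1), end)
        windows.append((cursor, stop))
        cursor = stop + timedelta(days=1)
    return windows

# One week, daily slices → seven bounded runs
print(len(backfill_slices(date(2024, 1, 1), date(2024, 1, 7))))  # 7
```

Per-day slices keep each run small enough to retry cheaply; per-week slices cut scheduler overhead when the table is well partitioned. Pick based on how expensive a single failed window is to redo.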
Prove it worked: the scoreboard
No one believes an optimization without numbers. Track these before/after:
- Runtime per DAG and per model (p50/p95)
- Cost per run and per TB processed
- Freshness SLO adherence (e.g., 95% under 2 hours)
- DQ pass rate and incidents per month
- MTTR for failed pipelines
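Computing the scoreboard takes a few lines once you export run metadata. A minimal sketch using the nearest-rank percentile and the 2-hour freshness SLO from above (function names are mine):

```python
import math

def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile, p in (0, 100] — enough for a scoreboard."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def slo_adherence(runtimes_min: list, slo_min: float = 120.0) -> float:
    """Fraction of runs that finished inside the freshness SLO."""
    return sum(r <= slo_min for r in runtimes_min) / len(runtimes_min)

runs = [80, 85, 90, 95, 100, 105, 110, 115, 125, 130]  # minutes per run
print(percentile(runs, 95))   # 130
print(slo_adherence(runs))    # 0.8
```

Snapshot these weekly, before and after each change, and attribution arguments mostly evaporate — the p95 line either moved or it didn't.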
What we’ve delivered at GitPlumbers in two sprints, repeatedly:
- 45–60% reduction in Snowflake credits by right-sizing + incremental models.
- 50–70% runtime reductions by partitioning + compaction + AQE.
- Freshness SLO from 62% → 98% by stopping full refreshes and taming backfills.
- DQ incident rate down 40% with contracts + tests + canary releases.
If you’re fighting AI-generated “vibe code” in your pipelines, the same rules apply: prune data early, prove correctness with tests, and keep orchestration on a short leash. That’s how you ship safely and cheaply.
Where to start this week
- Build a top-10 cost/time list from warehouse and orchestrator metadata.
- Convert two biggest full refreshes to `MERGE`-based incrementals.
- Partition/cluster their target tables; compact small files.
- Enable Spark AQE or set Snowflake auto-suspend; cap BigQuery bytes billed.
- Add two DQ tests and a canary diff before flipping traffic.
Ping us when you hit the weird edge cases—we’ve seen them. GitPlumbers lives for code rescue and vibe code cleanup when AI “helpers” leave landmines in production.
Key takeaways
- Profile first: 80% of cost and time sit in 20% of jobs (wide joins, full refreshes, small-file hell).
- Go incremental with proper `MERGE` semantics and CDC; stop reprocessing history.
- Fix storage layout: partition, cluster, compact, and choose sane file sizes/compression.
- Right-size compute: enable Spark AQE, Snowflake auto-suspend, BigQuery materialized views and bytes caps.
- Bake in reliability: data contracts, DQ tests, canary runs, lineage, and idempotency.
- Throttle orchestration: pools, max_active_runs, deferrable sensors to avoid backfill meltdowns.
- Prove outcomes with metrics: runtime, cost/job, freshness, DQ pass rate, MTTR, and SLOs.
Implementation checklist
- Map top 10 slowest/costliest jobs from warehouse and orchestrator metadata.
- Convert obvious full refreshes to incremental (`MERGE`) with late-arriving data handled.
- Apply partitioning + clustering where supported; compact small files to 128–512MB.
- Enable Spark AQE + dynamic partition pruning; tune Snowflake/BigQuery concurrency and limits.
- Add DQ tests (Great Expectations/dbt) on critical tables; adopt simple data contracts.
- Throttle Airflow with pools and `max_active_runs`; make sensors deferrable/reschedule.
- Publish a FinOps dashboard for cost per pipeline and freshness SLO adherence.
Questions we hear from teams
- Incremental vs CDC: which should I prioritize?
- CDC. If you can get reliable change streams (Debezium, Fivetran CDC, native logs), you cut compute and improve correctness. If CDC isn’t available, implement dbt incrementals with `MERGE` and a robust watermark, then backfill occasionally.
- We have many small tables—do partitioning and clustering still help?
- Yes, but be selective. Partition only where pruning occurs on most queries (e.g., `event_date`). Otherwise, cluster by a join/filter column with good cardinality. Avoid over-partitioning; too many partitions can hurt performance.
- Are materialized views worth it in BigQuery/Snowflake?
- For stable aggregates queried frequently, yes. They reduce scanned bytes and latency. Watch refresh costs and staleness windows; use them for hot paths, not everything.
- How do I prevent cost spikes from backfills?
- Throttle with Airflow pools and `max_active_runs`, use smaller dedicated warehouses (Snowflake) or limited slots (BigQuery reservations), and slice backfills by date. Monitor cost/run and stop-on-budget guards.
- Delta vs Iceberg—does it matter for cost?
- Both can be cheap if you compact and prune. Cost differences usually come from table maintenance discipline and query patterns, not the table format itself.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
