The Nightly ETL That Ate Our Cloud Bill — And the Fix That Cut Runtime 85%
War story: a retail data platform burning money on full refreshes, fixed with incremental loads, storage layout, and right-sized compute — without sacrificing data trust.
You don’t need a new data platform. You need to stop scanning what you don’t use and prove the rest is correct.
The nightly ETL that torched the budget
I walked into a retail client where the nightly ETL chewed through seven hours and a five-figure monthly bill. BigQuery sat on the warehouse side, Airflow on GKE, and a fleet of Spark jobs on Dataproc stitched together years of “temporary” fixes. The killer? Full refreshes everywhere. Fact tables scanning 300 TB per run, because “it’s safer.” It wasn’t safe — it was slow, expensive, and unreliable. Missed SLAs, stale dashboards, and a finance team sharpening knives.
We cut runtime from 7h → 42m and reduced cost 62% in six weeks. No new platform. We changed how the work got done: incremental loads, storage layout, right-sized compute, and quality gates that stop bad data early. Here’s the exact playbook.
Stop full refreshes: go incremental without losing trust
Full refreshes are a crutch. If you don’t trust your lineage, “just rebuild it” feels safe — until it isn’t. The alternative is incremental processing that’s provably correct and easy to backfill.
- Use CDC (Debezium, Fivetran, HVR) to land deltas instead of daily dumps.
- For append-only, use watermarks on updated_at or ingestion time.
- For upserts, use MERGE in the warehouse so you avoid cross-system joins.
- Make every incremental job idempotent and backfill-safe.
dbt makes this boring (that’s a compliment):
-- models/orders.sql
{{ config(materialized='incremental', unique_key='order_id', incremental_strategy='merge') }}
select
o.order_id,
o.customer_id,
o.status,
o.total_amount,
o.updated_at
from {{ source('staging', 'orders_delta') }} o
{% if is_incremental() %}
where o.updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}
Snowflake version of the upsert:
merge into analytics.orders as t
using staging.orders_delta as s
on t.order_id = s.order_id
when matched then update set
customer_id = s.customer_id,
status = s.status,
total_amount = s.total_amount,
updated_at = s.updated_at
when not matched then insert (order_id, customer_id, status, total_amount, updated_at)
values (s.order_id, s.customer_id, s.status, s.total_amount, s.updated_at);
BigQuery equivalent (partitioned table recommended):
merge `analytics.orders` t
using `staging.orders_delta` s
on t.order_id = s.order_id
when matched then update set
customer_id = s.customer_id,
status = s.status,
total_amount = s.total_amount,
updated_at = s.updated_at
when not matched then insert (order_id, customer_id, status, total_amount, updated_at)
values (s.order_id, s.customer_id, s.status, s.total_amount, s.updated_at);
- Target: 90–99% reduction in scanned bytes for large facts.
- SLA impact: Freshness moves from “overnight” to “hourly” without a cluster rewrite.
- Trust: Pair with data tests (next section) so incremental isn’t “hope and pray.”
Design for pruning: partitions, clustering, and file hygiene
Warehouses and engines are good at pruning if you give them something to prune.
- Partition by time (daily for facts; monthly for big facts) and cluster by 1–2 selective columns you actually filter on (e.g., customer_id, store_id).
- Compact small files. Aim for 128–512 MB Parquet objects. Thousands of 5 MB files will drown your job in overhead.
- Columnar formats (Parquet, ORC) with snappy compression for general use.
BigQuery DDL that prunes well:
create table if not exists analytics.events_partitioned
partition by date(event_ts)
cluster by customer_id, event_type as
select * from raw.events;
Delta Lake compaction and clustering for Databricks:
OPTIMIZE bronze.events ZORDER BY (customer_id, event_date);
S3/ADLS lake hygiene (Databricks, EMR, Spark):
# Weekly compaction job (Spark)
spark-submit \
--conf spark.sql.files.maxPartitionBytes=268435456 \
--conf spark.sql.parquet.mergeSchema=false \
compact_parquet.py
- Expect 30–70% runtime cuts on scan-heavy steps just from pruning/compaction.
- Bonus: lower shuffle volume and smaller memory footprints.
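The compact_parquet.py referenced above does not need to be clever. Here is a minimal sketch, assuming an S3 bronze layout partitioned by event_date; the bucket paths, partition column, and records-per-file cap are placeholders, not the client’s actual job:
# compact_parquet.py: minimal sketch of the weekly compaction job invoked above.
# Paths, the partition column, and the records-per-file cap are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_parquet").getOrCreate()

SOURCE = "s3://example-lake/bronze/events/"            # hypothetical input prefix
TARGET = "s3://example-lake/bronze/events_compacted/"  # hypothetical output prefix

df = spark.read.parquet(SOURCE)

# Rewrite each day as a handful of large files instead of thousands of tiny ones.
# maxRecordsPerFile caps rows per output file so sizes stay near the 128-512 MB target.
(
    df.repartition("event_date")
      .write.option("maxRecordsPerFile", 5_000_000)
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet(TARGET)
)
Run it off-peak and verify row counts before pointing readers at the compacted path.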
Push work down: compute where the data lives
I keep seeing Airflow tasks that yank terabytes out to Spark or pandas because “that’s how we’ve always done it” — or worse, AI-generated code stitched together from Stack Overflow. Stop moving data around to get work done.
- Prefer SQL pushdown. Let Snowflake/BigQuery/Databricks do the heavy lifting.
- Use Polars or DuckDB locally only for small/medium datasets (GBs) or dev loops (see the sketch after this list).
- Replace cross-warehouse joins with staged models in the destination warehouse.
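For those small/dev-loop cases, a local DuckDB query is plenty; this is a hypothetical sketch against a sample extract, not a production path:
# Hypothetical dev-loop check: profile a local sample without touching the warehouse.
import duckdb

con = duckdb.connect()  # in-memory database
summary = con.execute(
    """
    select status, count(*) as orders, sum(total_amount) as revenue
    from read_parquet('data/orders_sample/*.parquet')  -- local sample extract
    group by status
    order by revenue desc
    """
).df()
print(summary)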
Example: an Airflow task that used to export to Spark → rewritten as in-warehouse pushdown:
# airflow/dags/orders_pipeline.py
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.dates import days_ago
with DAG(
    dag_id="orders_pipeline",
    start_date=days_ago(1),
    schedule_interval="0 * * * *",
    catchup=False,
) as dag:
    merge_orders = SnowflakeOperator(
        task_id="merge_orders",
        sql="sql/merge_orders.sql",
        snowflake_conn_id="snowflake_default",
    )
- Result: eliminate egress, serialization, and auth thrash. Your scanned bytes and egress GB line items drop immediately.
Right-size the engine: stop paying for idle executors
Spark defaults are cruel. I’ve seen AI-generated DAGs that hardcode 2,000 shuffle partitions for 10 GB jobs. You’re paying for shuffle you don’t need.
- Turn on dynamic allocation and speculation. Keep shuffle partitions sane.
- Use spot/preemptible nodes where retries are safe; keep stateful ops on on-demand.
- Cache wisely (warehouse result caches are almost free; cluster memory is not).
Spark example:
spark-submit \
--conf spark.sql.shuffle.partitions=400 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=50 \
--conf spark.executor.memory=6g \
--conf spark.executor.cores=2 \
--conf spark.speculation=true \
jobs/txn_enrichment.py
Kubernetes autoscaling for Airflow workers (CeleryExecutor/Ray):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker
spec:
  scaleTargetRef:          # required: points the HPA at the worker Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- Target: 30–50% cluster cost reduction with fewer idle cores and right-sized shuffles.
- KPI: dollars per TB processed and dollars per successful run trend down week-over-week.
Ship with guardrails: contracts and quality gates
Speed without trust is a short-lived victory. The fix is machine-checkable contracts and fast tests that run with every load.
- Author data contracts for key producers (proto/JSON schema, Avro). Version them and fail fast on breaking changes (see the sketch after this list).
- Add Great Expectations or dbt tests to critical models.
- Enforce row-level idempotency and dedup on natural keys.
- Add SLOs: freshness (<60 min), completeness (>99%), and accuracy checks.
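A contract check does not need a platform. Here is a minimal sketch using JSON Schema and the jsonschema library for the orders producer; the field list and enum values are assumptions for illustration:
# Hypothetical orders contract, versioned alongside the producer's code.
import jsonschema

ORDERS_CONTRACT_V2 = {
    "type": "object",
    "required": ["order_id", "customer_id", "status", "total_amount", "updated_at"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "status": {"enum": ["placed", "paid", "shipped", "cancelled"]},
        "total_amount": {"type": "number"},
        "updated_at": {"type": "string", "format": "date-time"},
    },
    "additionalProperties": True,  # additive changes pass; removals and renames fail
}

def validate_batch(records: list[dict]) -> None:
    """Fail the load fast if any record breaks the contract."""
    for record in records:
        jsonschema.validate(instance=record, schema=ORDERS_CONTRACT_V2)
Wire it into the landing task so a breaking change fails at ingestion, not three models downstream.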
dbt example (tests + freshness):
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id

sources:
  - name: staging
    tables:
      - name: orders_delta
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 45, period: minute}
          error_after: {count: 90, period: minute}
Airflow gate using Great Expectations:
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator
quality = GreatExpectationsOperator(
    task_id="validate_orders",
    data_context_root_dir="/opt/gx",
    checkpoint_name="orders_checkpoint",
)
- KPI: MTTR for data incidents drops because you catch issues at the boundary, not in the CFO’s deck.
Orchestrate for freshness and backfills without drama
Late-arriving data and backfills cause most of the “we had to full refresh” folklore. Make backfills boring.
- Parameterize jobs by date/hour; never hardcode “today.”
- Use sensors sparingly; prefer explicit upstream signals (e.g., file counts + watermark).
- Build replay windows and dead-letter queues for bad batches (a sketch follows this list).
- Document SLA/SLOs and route alerts to the owning team.
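A dead-letter path can be as simple as splitting each batch on validation and parking the rejects for replay. A hypothetical sketch, where the reject location, validation rule, and pyarrow choice are all assumptions:
# Hypothetical dead-letter handling for one hourly batch of orders.
import pyarrow as pa
import pyarrow.parquet as pq

REJECTS_PATH = "s3://example-lake/dead_letter/orders/"  # replayed after upstream fixes

def split_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate loadable rows from rows that violate basic invariants."""
    good, bad = [], []
    for r in records:
        if r.get("order_id") and r.get("total_amount", 0) >= 0:
            good.append(r)
        else:
            bad.append(r)
    return good, bad

def park_rejects(bad: list[dict], run_hour: str) -> None:
    """Write rejects to the dead-letter area, keyed by run hour, for later replay."""
    if not bad:
        return
    table = pa.Table.from_pylist(bad)
    pq.write_table(table, f"{REJECTS_PATH}run_hour={run_hour}/rejects.parquet")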
Airflow backfill-safe DAG:
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
with DAG(
    dag_id="incremental_orders",
    schedule_interval="@hourly",
    start_date=days_ago(7),
    catchup=True,
) as dag:
    start = EmptyOperator(task_id="start")
    transform = BashOperator(
        task_id="dbt_run_incremental",
        bash_command="dbt run --select orders --vars 'run_hour={{ ts_nodash }}'",
    )
    validate = BashOperator(
        task_id="dbt_test_orders",
        bash_command="dbt test --select orders",
    )
    start >> transform >> validate
- Result: deterministic backfills, predictable reruns, and fewer midnight “hotfixes.”
Measure what matters: cost and reliability observability
If you can’t see it, you can’t fix it. Glue your lineage to your costs and SLOs.
- OpenLineage/Marquez to trace jobs and connect them to tables.
- Prometheus + Grafana for pipeline metrics (latency, rows processed, error rate); see the push sketch after this list.
- Warehouse billing exports (Snowflake usage history, BigQuery INFORMATION_SCHEMA.JOBS_*) to track cost per model.
- Budgets and alerts (GCP Budgets, AWS Budgets, Snowflake resource monitors) to prevent surprises.
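For the pipeline metrics, a small Pushgateway call at the end of each task is enough to start; the gateway address and metric names below are assumptions:
# Hypothetical per-run metrics push; call this at the end of each pipeline task.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_run(job_name: str, rows_processed: int, duration_s: float, failed: bool) -> None:
    registry = CollectorRegistry()
    Gauge("etl_rows_processed", "Rows processed in the last run", registry=registry).set(rows_processed)
    Gauge("etl_duration_seconds", "Wall-clock duration of the last run", registry=registry).set(duration_s)
    Gauge("etl_last_run_failed", "1 if the last run failed, else 0", registry=registry).set(int(failed))
    # Pushgateway lets short-lived batch jobs expose metrics Prometheus can scrape.
    push_to_gateway("pushgateway.monitoring:9091", job=job_name, registry=registry)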
BigQuery job cost view (scan bytes per model):
select
job_id,
statement_type,
total_bytes_processed/1e12 as tb_scanned,
end_time,
regexp_extract(query, r"ref\('([^']+)'\)") as dbt_model
from `region-us`.INFORMATION_SCHEMA.JOBS
where start_time >= timestamp_sub(current_timestamp(), interval 7 day)
order by end_time desc;
- KPIs: cost per TB, rows per dollar, freshness SLO attainment, and failed runs per week.
The boring truth: optimization is mostly deleting waste — not adding tech. If your ETL was “written” by an AI copilot, assume it defaulted to expensive choices. Do a vibe code cleanup before you scale it.
Key takeaways
- Stop full refreshes. Use CDC, watermarks, and MERGE to process only changes.
- Design storage for pruning. Partition by time, cluster by query keys, compact files.
- Right-size engines. Dynamic allocation, fewer shuffle partitions, and spot/preemptible nodes.
- Bake in quality. Contracts and tests block bad data before it propagates.
- Instrument costs and freshness. Set budgets, SLOs, and automated backpressure before you page the team.
Implementation checklist
- Inventory top 10 slowest/most expensive jobs with lineage and query logs.
- Add incremental strategies (CDC/watermarks) to those jobs first; prove ROI fast.
- Partition fact tables by date; cluster by 1–2 high-selectivity keys.
- Compact small files weekly; target 128–512 MB Parquet files.
- Turn on Spark dynamic allocation and set sane shuffle partitions (e.g., 200–800).
- Replace cross-warehouse copies with pushdown SQL in the destination warehouse.
- Add Great Expectations/dbt tests to critical models with SLAs and alerts.
- Introduce cost and freshness dashboards with budgets and alerting.
- Make jobs idempotent and backfill-safe; document replay windows.
- Kill zombie jobs; archive unused tables; delete stale materializations.
Questions we hear from teams
- How do I decide between streaming and micro-batch for incremental loads?
- If your freshness SLO is below ~5 minutes and the source supports event streams, use streaming (Kafka, Pub/Sub, Kinesis) with exactly-once semantics and checkpointed state. Otherwise, hourly micro-batches with CDC tables achieve 80–90% of the benefit at a fraction of the complexity. Start with micro-batch; move slices to streaming only where the business case is clear (fraud, pricing, alerts).
- When should I still do a full refresh?
- Use full refreshes for initial backfills, major model redesigns, or when upstreams are known-bad and you need to re-derive truth from raw logs. Make it explicit: a one-off job with a capped window and a post-refresh snapshot diff to validate parity. Never leave full refreshes as the default path in production DAGs.
- What about schema drift from upstream services?
- Use explicit data contracts and schema registries. Enforce backward-compatible changes; fail fast on breaking changes at ingestion. In warehouses, prefer permissive landing zones (VARIANT/JSON columns) with controlled projections into typed models, guarded by tests. Add alerts on column count and nullability deltas.
- Our codebase is a mess of AI-generated Airflow DAGs and notebooks. Where do we start?
- Triage the top 10 jobs by cost and runtime with lineage. Replace notebook glue with versioned jobs. Add incremental strategies, storage pruning, and right-sized compute first. In parallel, layer dbt tests and Great Expectations on business-critical models. We call this a vibe code cleanup — we keep what’s correct, delete the cargo cult, and instrument everything.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
