The Nightly ETL That Ate Our Cloud Bill — And the Fix That Cut Runtime 85%
War story: a retail data platform burning money on full refreshes, fixed with incremental loads, storage layout, and right-sized compute — without sacrificing data trust.
You don’t need a new data platform. You need to stop scanning what you don’t use and prove the rest is correct.
The nightly ETL that torched the budget
I walked into a retail client where the nightly ETL chewed through seven hours and a five-figure monthly bill. BigQuery sat on the warehouse side, Airflow on GKE, and a fleet of Spark jobs on Dataproc stitched together years of “temporary” fixes. The killer? Full refreshes everywhere. Fact tables scanning 300 TB per run, because “it’s safer.” It wasn’t safe — it was slow, expensive, and unreliable. Missed SLAs, stale dashboards, and a finance team sharpening knives.
We cut runtime from 7h → 42m and reduced cost 62% in six weeks. No new platform. We changed how the work got done: incremental loads, storage layout, right-sized compute, and quality gates that stop bad data early. Here’s the exact playbook.
Stop full refreshes: go incremental without losing trust
Full refreshes are a crutch. If you don’t trust your lineage, “just rebuild it” feels safe — until it isn’t. The alternative is incremental processing that’s provably correct and easy to backfill.
- Use CDC (Debezium, Fivetran, HVR) to land deltas instead of daily dumps.
- For append-only, use watermarks on updated_at or ingestion time.
- For upserts, use MERGE in the warehouse so you avoid cross-system joins.
- Make every incremental job idempotent and backfill-safe.
dbt makes this boring (that’s a compliment):
-- models/orders.sql
{{ config(materialized='incremental', unique_key='order_id', incremental_strategy='merge') }}
select
o.order_id,
o.customer_id,
o.status,
o.total_amount,
o.updated_at
from {{ source('staging', 'orders_delta') }} o
{% if is_incremental() %}
where o.updated_at > (select coalesce(max(updated_at), '1900-01-01') from {{ this }})
{% endif %}
Snowflake version of the upsert:
merge into analytics.orders as t
using staging.orders_delta as s
on t.order_id = s.order_id
when matched then update set
customer_id = s.customer_id,
status = s.status,
total_amount = s.total_amount,
updated_at = s.updated_at
when not matched then insert (order_id, customer_id, status, total_amount, updated_at)
values (s.order_id, s.customer_id, s.status, s.total_amount, s.updated_at);
BigQuery equivalent (partitioned table recommended):
merge `analytics.orders` t
using `staging.orders_delta` s
on t.order_id = s.order_id
when matched then update set
customer_id = s.customer_id,
status = s.status,
total_amount = s.total_amount,
updated_at = s.updated_at
when not matched then insert (order_id, customer_id, status, total_amount, updated_at)
values (s.order_id, s.customer_id, s.status, s.total_amount, s.updated_at);
- Target: 90–99% reduction in scanned bytes for large facts.
- SLA impact: Freshness moves from “overnight” to “hourly” without a cluster rewrite.
- Trust: Pair with data tests (next section) so incremental isn’t “hope and pray.”
Design for pruning: partitions, clustering, and file hygiene
Warehouses and engines are good at pruning if you give them something to prune.
- Partition by time (daily for facts; monthly for big facts) and cluster by 1–2 selective columns you actually filter on (e.g., customer_id, store_id).
- Compact small files. Aim for 128–512 MB Parquet objects. Thousands of 5 MB files will drown your job in overhead.
- Columnar formats (Parquet, ORC) with snappy compression for general use.
BigQuery DDL that prunes well:
create table if not exists analytics.events_partitioned
partition by date(event_ts)
cluster by customer_id, event_type as
select * from raw.events;
Delta Lake compaction and clustering for Databricks:
OPTIMIZE bronze.events ZORDER BY (customer_id, event_date);
S3/ADLS lake hygiene (Databricks, EMR, Spark):
# Weekly compaction job (Spark)
spark-submit \
--conf spark.sql.files.maxPartitionBytes=268435456 \
--conf spark.sql.parquet.mergeSchema=false \
compact_parquet.py
- Expect 30–70% runtime cuts on scan-heavy steps just from pruning/compaction.
- Bonus: lower shuffle volume and smaller memory footprints.
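The compact_parquet.py referenced above does not need to be clever. Here is a minimal sketch, assuming an S3 bronze layout partitioned by event_date; the bucket paths, partition column, and records-per-file cap are placeholders, not the client’s actual job:
# compact_parquet.py: minimal sketch of the weekly compaction job invoked above.
# Paths, the partition column, and the records-per-file cap are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_parquet").getOrCreate()

SOURCE = "s3://example-lake/bronze/events/"            # hypothetical input prefix
TARGET = "s3://example-lake/bronze/events_compacted/"  # hypothetical output prefix

df = spark.read.parquet(SOURCE)

# Rewrite each day as a handful of large files instead of thousands of tiny ones.
# maxRecordsPerFile caps rows per output file so sizes stay near the 128-512 MB target.
(
    df.repartition("event_date")
      .write.option("maxRecordsPerFile", 5_000_000)
      .partitionBy("event_date")
      .mode("overwrite")
      .parquet(TARGET)
)
Run it off-peak and verify row counts before pointing readers at the compacted path.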
Push work down: compute where the data lives
I keep seeing Airflow tasks that yank terabytes out to Spark or pandas because “that’s how we’ve always done it” — or worse, AI-generated code stitched together from Stack Overflow. Stop moving data around to get work done.
- Prefer SQL pushdown. Let Snowflake/BigQuery/Databricks do the heavy lifting.
- Use Polars or DuckDB locally only for small/medium datasets (GBs) or dev loops (see the sketch after this list).
- Replace cross-warehouse joins with staged models in the destination warehouse.
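For those small/dev-loop cases, a local DuckDB query is plenty; this is a hypothetical sketch against a sample extract, not a production path:
# Hypothetical dev-loop check: profile a local sample without touching the warehouse.
import duckdb

con = duckdb.connect()  # in-memory database
summary = con.execute(
    """
    select status, count(*) as orders, sum(total_amount) as revenue
    from read_parquet('data/orders_sample/*.parquet')  -- local sample extract
    group by status
    order by revenue desc
    """
).df()
print(summary)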
Example: an Airflow task that used to export to Spark → rewritten as in-warehouse pushdown:
# airflow/dags/orders_pipeline.py
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from airflow.utils.dates import days_ago
with DAG(
    dag_id="orders_pipeline",
    start_date=days_ago(1),
    schedule_interval="0 * * * *",
    catchup=False,
) as dag:
    merge_orders = SnowflakeOperator(
        task_id="merge_orders",
        sql="sql/merge_orders.sql",
        snowflake_conn_id="snowflake_default",
    )
- Result: eliminate egress, serialization, and auth thrash. Your scanned bytes and egress GB line items drop immediately.
Right-size the engine: stop paying for idle executors
Spark defaults are cruel. I’ve seen AI-generated DAGs that hardcode 2,000 shuffle partitions for 10 GB jobs. You’re paying for shuffle you don’t need.
- Turn on dynamic allocation and speculation. Keep shuffle partitions sane.
- Use spot/preemptible nodes where retries are safe; keep stateful ops on on-demand.
- Cache wisely (warehouse result caches are almost free; cluster memory is not).
Spark example:
spark-submit \
--conf spark.sql.shuffle.partitions=400 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=50 \
--conf spark.executor.memory=6g \
--conf spark.executor.cores=2 \
--conf spark.speculation=true \
jobs/txn_enrichment.py
Kubernetes autoscaling for Airflow workers (CeleryExecutor/Ray):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker
spec:
  scaleTargetRef:          # required: points the HPA at the worker Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
- Target: 30–50% cluster cost reduction with fewer idle cores and right-sized shuffles.
- KPI: dollars per TB processed and dollars per successful run trend down week-over-week.
Ship with guardrails: contracts and quality gates
Speed without trust is a short-lived victory. The fix is machine-checkable contracts and fast tests that run with every load.
- Author data contracts for key producers (proto/JSON schema, Avro). Version them and fail fast on breaking changes (see the sketch after this list).
- Add Great Expectations or dbt tests to critical models.
- Enforce row-level idempotency and dedup on natural keys.
- Add SLOs: freshness (<60 min), completeness (>99%), and accuracy checks.
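A contract check does not need a platform. Here is a minimal sketch using JSON Schema and the jsonschema library for the orders producer; the field list and enum values are assumptions for illustration:
# Hypothetical orders contract, versioned alongside the producer's code.
import jsonschema

ORDERS_CONTRACT_V2 = {
    "type": "object",
    "required": ["order_id", "customer_id", "status", "total_amount", "updated_at"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "status": {"enum": ["placed", "paid", "shipped", "cancelled"]},
        "total_amount": {"type": "number"},
        "updated_at": {"type": "string", "format": "date-time"},
    },
    "additionalProperties": True,  # additive changes pass; removals and renames fail
}

def validate_batch(records: list[dict]) -> None:
    """Fail the load fast if any record breaks the contract."""
    for record in records:
        jsonschema.validate(instance=record, schema=ORDERS_CONTRACT_V2)
Wire it into the landing task so a breaking change fails at ingestion, not three models downstream.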
dbt example (tests + freshness):
# models/schema.yml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id

sources:
  - name: staging
    tables:
      - name: orders_delta
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 45, period: minute}
          error_after: {count: 90, period: minute}
Airflow gate using Great Expectations:
from great_expectations_provider.operators.great_expectations import GreatExpectationsOperator
quality = GreatExpectationsOperator(
    task_id="validate_orders",
    data_context_root_dir="/opt/gx",
    checkpoint_name="orders_checkpoint",
)
- KPI: MTTR for data incidents drops because you catch issues at the boundary, not in the CFO’s deck.
Orchestrate for freshness and backfills without drama
Late-arriving data and backfills cause most of the “we had to full refresh” folklore. Make backfills boring.
- Parameterize jobs by date/hour; never hardcode “today.”
- Use sensors sparingly; prefer explicit upstream signals (e.g., file counts + watermark).
- Build replay windows and dead-letter queues for bad batches (a sketch follows this list).
- Document SLA/SLOs and route alerts to the owning team.
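A dead-letter path can be as simple as splitting each batch on validation and parking the rejects for replay. A hypothetical sketch, where the reject location, validation rule, and pyarrow choice are all assumptions:
# Hypothetical dead-letter handling for one hourly batch of orders.
import pyarrow as pa
import pyarrow.parquet as pq

REJECTS_PATH = "s3://example-lake/dead_letter/orders/"  # replayed after upstream fixes

def split_batch(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate loadable rows from rows that violate basic invariants."""
    good, bad = [], []
    for r in records:
        if r.get("order_id") and r.get("total_amount", 0) >= 0:
            good.append(r)
        else:
            bad.append(r)
    return good, bad

def park_rejects(bad: list[dict], run_hour: str) -> None:
    """Write rejects to the dead-letter area, keyed by run hour, for later replay."""
    if not bad:
        return
    table = pa.Table.from_pylist(bad)
    pq.write_table(table, f"{REJECTS_PATH}run_hour={run_hour}/rejects.parquet")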
Airflow backfill-safe DAG:
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
with DAG(
    dag_id="incremental_orders",
    schedule_interval="@hourly",
    start_date=days_ago(7),
    catchup=True,
) as dag:
    start = EmptyOperator(task_id="start")
    transform = BashOperator(
        task_id="dbt_run_incremental",
        bash_command="dbt run --select orders --vars 'run_hour={{ ts_nodash }}'",
    )
    validate = BashOperator(
        task_id="dbt_test_orders",
        bash_command="dbt test --select orders",
    )
    start >> transform >> validate
- Result: deterministic backfills, predictable reruns, and fewer midnight “hotfixes.”
Measure what matters: cost and reliability observability
If you can’t see it, you can’t fix it. Glue your lineage to your costs and SLOs.
- OpenLineage/Marquez to trace jobs and connect them to tables.
- Prometheus + Grafana for pipeline metrics (latency, rows processed, error rate); see the push sketch after this list.
- Warehouse billing exports (Snowflake usage history, BigQuery INFORMATION_SCHEMA.JOBS_*) to track cost per model.
- Budgets and alerts (GCP Budgets, AWS Budgets, Snowflake resource monitors) to prevent surprises.
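For the pipeline metrics, a small Pushgateway call at the end of each task is enough to start; the gateway address and metric names below are assumptions:
# Hypothetical per-run metrics push; call this at the end of each pipeline task.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_run(job_name: str, rows_processed: int, duration_s: float, failed: bool) -> None:
    registry = CollectorRegistry()
    Gauge("etl_rows_processed", "Rows processed in the last run", registry=registry).set(rows_processed)
    Gauge("etl_duration_seconds", "Wall-clock duration of the last run", registry=registry).set(duration_s)
    Gauge("etl_last_run_failed", "1 if the last run failed, else 0", registry=registry).set(int(failed))
    # Pushgateway lets short-lived batch jobs expose metrics Prometheus can scrape.
    push_to_gateway("pushgateway.monitoring:9091", job=job_name, registry=registry)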
BigQuery job cost view (scan bytes per model):
select
job_id,
statement_type,
total_bytes_processed/1e12 as tb_scanned,
end_time,
regexp_extract(query, r"ref\('([^']+)'\)") as dbt_model
from `region-us`.INFORMATION_SCHEMA.JOBS
where start_time >= timestamp_sub(current_timestamp(), interval 7 day)
order by end_time desc;
- KPIs: cost per TB, rows per dollar, freshness SLO attainment, and failed runs per week.
The boring truth: optimization is mostly deleting waste — not adding tech. If your ETL was “written” by an AI copilot, assume it defaulted to expensive choices. Do a vibe code cleanup before you scale it.
Key takeaways
- Stop full refreshes. Use CDC, watermarks, and MERGE to process only changes.
- Design storage for pruning. Partition by time, cluster by query keys, compact files.
- Right-size engines. Dynamic allocation, fewer shuffle partitions, and spot/preemptible nodes.
- Bake in quality. Contracts and tests block bad data before it propagates.
- Instrument costs and freshness. Set budgets, SLOs, and automated backpressure before you page the team.
Implementation checklist
- Inventory top 10 slowest/most expensive jobs with lineage and query logs.
- Add incremental strategies (CDC/watermarks) to those jobs first; prove ROI fast.
- Partition fact tables by date; cluster by 1–2 high-selectivity keys.
- Compact small files weekly; target 128–512 MB Parquet files.
- Turn on Spark dynamic allocation and set sane shuffle partitions (e.g., 200–800).
- Replace cross-warehouse copies with pushdown SQL in the destination warehouse.
- Add Great Expectations/dbt tests to critical models with SLAs and alerts.
- Introduce cost and freshness dashboards with budgets and alerting.
- Make jobs idempotent and backfill-safe; document replay windows.
- Kill zombie jobs; archive unused tables; delete stale materializations.
Questions we hear from teams
- How do I decide between streaming and micro-batch for incremental loads?
- If your freshness SLO is below ~5 minutes and the source supports event streams, use streaming (Kafka, Pub/Sub, Kinesis) with exactly-once semantics and checkpointed state. Otherwise, hourly micro-batches with CDC tables achieve 80–90% of the benefit at a fraction of the complexity. Start with micro-batch; move slices to streaming only where the business case is clear (fraud, pricing, alerts).
- When should I still do a full refresh?
- Use full refreshes for initial backfills, major model redesigns, or when upstreams are known-bad and you need to re-derive truth from raw logs. Make it explicit: a one-off job with a capped window and a post-refresh snapshot diff to validate parity. Never leave full refreshes as the default path in production DAGs.
- What about schema drift from upstream services?
- Use explicit data contracts and schema registries. Enforce backward-compatible changes; fail fast on breaking changes at ingestion. In warehouses, prefer permissive landing zones (VARIANT/JSON columns) with controlled projections into typed models, guarded by tests. Add alerts on column count and nullability deltas.
- Our codebase is a mess of AI-generated Airflow DAGs and notebooks. Where do we start?
- Triage the top 10 jobs by cost and runtime with lineage. Replace notebook glue with versioned jobs. Add incremental strategies, storage pruning, and right-sized compute first. In parallel, layer dbt tests and Great Expectations on business-critical models. We call this a vibe code cleanup — we keep what’s correct, delete the cargo cult, and instrument everything.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
