The 9:05 AM Dashboard Freeze: Warehouse Optimizations That Actually Move the Needle
Stop paying for more compute to hide bad modeling. Practical techniques to cut query latency, lift reliability, and ship business value.
You can’t out-compute a bad model. Fix the layout, then turn the knobs.
9:05 AM, dashboards freeze. Here's why your warehouse chokes
I’ve lived the 9:05 AM freeze at retailers, fintechs, and SaaS unicorns. Sales wakes up, the exec dashboard fans out a dozen high-cost queries, and your Snowflake credits or BigQuery slots light up like a Christmas tree. Latency goes from 4s to 45s. Then someone suggests “just add more compute.” That’s like fixing a leaky pipe by turning up the water pressure.
What actually works: treat performance as a product. Prioritize by business value, fix the physical layout, design joins on purpose, keep the table formats healthy, and materialize with discipline—without compromising data quality. Here’s the playbook we run at GitPlumbers when a warehouse is bleeding time and money.
Start with the queries that pay the bills
Before touching models, make it measurable and ruthlessly prioritized.
- Instrument query SLOs: e.g., 95th percentile dashboard < 5s during 8–10 AM.
- Tag workloads by business impact: revenue dashboards, risk ops, ad-hoc.
- Collect top offenders by cost and latency (per user, per hour) using INFORMATION_SCHEMA or platform observability (e.g., Snowflake Query History, BigQuery INFORMATION_SCHEMA.JOBS_BY_*).
- Build a simple Pareto: the top 10 queries usually cause 80% of the pain.
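The Pareto step can be sketched in a few lines of Python (the cost numbers and field names are illustrative; in practice you'd export them from whatever query history your platform exposes):

```python
# Hypothetical sketch: given per-query costs, rank queries and find how few
# of them cover 80% of total spend. Field names are illustrative, not an API.

def pareto_cutoff(costs, share=0.80):
    """Return the smallest top-ranked set of (query_id, cost) pairs covering `share` of total cost."""
    ranked = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(costs.values())
    covered, top = 0.0, []
    for query_id, cost in ranked:
        top.append((query_id, cost))
        covered += cost
        if covered >= share * total:
            break
    return top

# Toy peak-hour cost data (e.g., slot-seconds or credits)
peak_costs = {"exec_dashboard": 900, "risk_ops": 300,
              "adhoc_1": 50, "adhoc_2": 30, "adhoc_3": 20}
offenders = pareto_cutoff(peak_costs)
```

Here two queries out of five account for over 80% of spend, which is the shape you'll almost always find: a short offender list that becomes your backlog.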
Example: BigQuery to find expensive/slow queries during peak:

```sql
SELECT
  user_email,
  query,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_billed / POW(10, 12) AS tb_billed,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type = 'SELECT'
ORDER BY runtime_seconds DESC
LIMIT 50;
```

Outcome you want: a ranked backlog with owners, tied to business KPIs. Example target: reduce 95p latency from 28s to <6s and cut peak compute 30% within 60 days.
Model and storage layout: partition, cluster, sort, encode
Most teams leave 2–5x performance on the floor by ignoring physical design. Columnar warehouses love selective scans and sorted data.
- Partition on time for append-heavy facts; daily is fine unless you truly need hourly.
- Cluster/sort on frequently filtered columns and major join keys.
- Pick column encodings/compression (e.g., Redshift ENCODE zstd), and keep stats fresh.
- Avoid over-partitioning: 1-day partitions beat 1-hour unless your queries genuinely need it.
BigQuery example:

```sql
CREATE OR REPLACE TABLE analytics.fact_orders
PARTITION BY DATE(created_at)
CLUSTER BY customer_id, status AS
SELECT * FROM staging.orders_clean;
```

Redshift example:
```sql
CREATE TABLE fact_orders (
  order_id    BIGINT ENCODE zstd,
  customer_id BIGINT ENCODE zstd,
  created_at  TIMESTAMP ENCODE zstd,
  status      VARCHAR(16) ENCODE zstd,
  amount      NUMERIC(12,2) ENCODE zstd
)
DISTKEY(customer_id)
SORTKEY(created_at, customer_id);
ANALYZE fact_orders;
VACUUM SORT ONLY fact_orders;
```

Databricks Delta example:
```sql
OPTIMIZE analytics.fact_orders ZORDER BY (customer_id, created_at);
```

dbt incremental with partition + cluster:
```sql
-- models/fact_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns',
    partition_by={'field': 'created_at', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['customer_id', 'status']
  )
}}
SELECT ... FROM {{ ref('orders_clean') }}
{% if is_incremental() %}
WHERE created_at >= timestamp_sub(current_timestamp(), interval 7 day)
{% endif %}
```

What we see in practice: 40–70% scan reduction on heavy filters, 2–5x faster joins on sorted keys, and far more predictable latency during peak.
Win joins and aggregations before they start
You can’t out-compute a bad join. Design for the physical reality.
- Broadcast small dimensions; don’t shuffle a 10-row calendar table across nodes.
- Co-locate high-volume join keys (Redshift
DISTKEY, Snowflake clustering micro-partitions, Sparkbroadcast). - Pre-aggregate big facts for the shapes your BI uses; don’t roll up 5B rows at read time.
- Watch for skew (e.g.,
customer_id = NULLor dominant categories) and salt or filter.
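Salting can be sketched engine-agnostically in plain Python (the helper names are hypothetical): each fact row of a hot key gets a random bucket suffix, and the small dimension side is exploded to carry every bucket, so the join spreads across workers instead of piling onto one.

```python
import random

N_SALTS = 8  # number of buckets to spread a hot key across (tuning assumption)

def salt_fact_key(key, rng):
    # Each fact row lands in one random salt bucket.
    return f"{key}#{rng.randrange(N_SALTS)}"

def explode_dim_key(key):
    # The (small) dimension row is duplicated once per salt bucket,
    # so every salted fact key still finds its match.
    return [f"{key}#{i}" for i in range(N_SALTS)]

rng = random.Random(0)  # seeded for a deterministic demo
fact_keys = [salt_fact_key("hot_customer", rng) for _ in range(10_000)]
dim_keys = set(explode_dim_key("hot_customer"))
```

Every salted fact key matches an exploded dimension key, and the 10,000 rows of the hot customer now hash into 8 shuffle buckets instead of 1.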
Spark SQL broadcast hint:

```sql
SELECT /*+ BROADCAST(dim_customers) */ f.*
FROM fact_orders f
JOIN dim_customers ON f.customer_id = dim_customers.customer_id;
```

Materialized rollup for BI queries:
```sql
CREATE MATERIALIZED VIEW analytics.mv_orders_daily AS
SELECT DATE(created_at) AS order_date,
       customer_segment,
       COUNT(*) AS orders,
       SUM(amount) AS revenue
FROM analytics.fact_orders
GROUP BY 1, 2;
```

Two changes that routinely save our clients:
- Move “Top-N last 7 days” to a pre-aggregated table refreshed hourly → 6–10x latency reduction.
- Broadcast the 50K-row dimension instead of shuffling 200M-row fact → 3–5x runtime improvement.
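The first change is easy to picture in miniature (the names and row shapes here are illustrative): roll raw orders up once into a (day, segment) table, then answer Top-N questions from the tiny rollup instead of rescanning the fact table at read time.

```python
from collections import defaultdict

def build_daily_rollup(orders):
    """Aggregate raw (day, segment, amount) rows once into a small (day, segment) rollup."""
    rollup = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for day, segment, amount in orders:
        agg = rollup[(day, segment)]
        agg["orders"] += 1
        agg["revenue"] += amount
    return dict(rollup)

def top_segments(rollup, n):
    """Answer 'Top-N segments by revenue' from the rollup, never from raw rows."""
    by_segment = defaultdict(float)
    for (_, segment), agg in rollup.items():
        by_segment[segment] += agg["revenue"]
    return sorted(by_segment, key=by_segment.get, reverse=True)[:n]

orders = [("d1", "enterprise", 500.0), ("d1", "smb", 50.0),
          ("d2", "enterprise", 700.0), ("d2", "smb", 80.0), ("d2", "smb", 60.0)]
rollup = build_daily_rollup(orders)
```

At warehouse scale the same move turns a 5B-row scan into a lookup over a few thousand rollup rows, which is where the 6–10x latency wins come from.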
Kill the small-file problem and keep tables healthy
If you’re on Parquet with Delta Lake, Iceberg, or Hudi, the small-file problem will quietly wreck performance and reliability. Lots of 10KB–200KB files mean high planning overhead and poor predicate pushdown.
- Compact regularly: OPTIMIZE (Delta), the rewrite_data_files procedure (Iceberg), or a dedicated compaction job.
- Vacuum old snapshots and tombstones so your metadata doesn’t balloon.
- Set write batch sizes to produce 128MB–1GB files depending on engine.
- Ensure ANALYZE/statistics are up to date after compaction.
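The file-size targets above reduce to simple arithmetic you can automate (the thresholds below are assumptions to tune per engine, not defaults of any tool):

```python
# Back-of-envelope compaction planner: flag tables whose average file size has
# fallen below a floor, and compute how many files a compaction should produce.

MB = 1024 * 1024
TARGET_FILE_BYTES = 512 * MB   # midpoint of the 128MB-1GB range; tuning assumption
SMALL_FILE_FLOOR = 32 * MB     # below this average, schedule compaction; assumption

def compaction_plan(total_bytes, file_count):
    avg = total_bytes / file_count
    return {
        "avg_file_mb": avg / MB,
        "needs_compaction": avg < SMALL_FILE_FLOOR,
        "target_files": max(1, round(total_bytes / TARGET_FILE_BYTES)),
    }

# 100 GB of data fragmented into 80,000 files: a classic small-file mess
plan = compaction_plan(total_bytes=100 * 1024 * MB, file_count=80_000)
```

An average file of ~1.3 MB trips the floor, and the planner says the same 100 GB should live in about 200 files, which is why planning time collapses after compaction.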
Delta Lake compaction and housekeeping:

```sql
OPTIMIZE analytics.fact_orders;  -- compacts small files
OPTIMIZE analytics.fact_orders ZORDER BY (created_at, customer_id);
VACUUM analytics.fact_orders RETAIN 168 HOURS;  -- 7 days
```

Airflow task to automate weekly compaction:
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

dag = DAG('delta_optimize', start_date=datetime(2024, 1, 1),
          schedule='@weekly', catchup=False)

optimize = DatabricksRunNowOperator(
    task_id='optimize_fact_orders',
    databricks_conn_id='databricks_default',
    job_id=1234,  # a pre-defined Databricks job that runs OPTIMIZE/VACUUM
    dag=dag,
)
```

One retail client saw planning time drop from 9s to 1.5s and end-to-end latency from 22s to 6s on key dashboards after a month of automated compaction and Z-Ordering.
Cache and materialize with discipline
Caching is not a strategy; it’s a tool. Used well, it covers the 80% use-cases and keeps SLAs green. Used poorly, it lies to you.
- Use result cache only for repeatable, deterministic queries; invalidate on upstream changes.
- Prefer materialized views or dbt incrementals for hot aggregates with explicit freshness SLOs.
- Version and document: what query does this MV accelerate, what is the SLA, and when does it refresh/fail?
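The invalidation rule above can be sketched as a tiny freshness-gated cache (all class and field names here are hypothetical): serve a cached result only while the upstream watermark is unchanged and the entry is still within its freshness SLO.

```python
import time

class FreshCache:
    """Result cache that invalidates on upstream change or freshness-SLO breach."""

    def __init__(self, slo_seconds):
        self.slo = slo_seconds
        self.entries = {}  # key -> (result, upstream_watermark, cached_at)

    def put(self, key, result, upstream_watermark, now=None):
        now = time.time() if now is None else now
        self.entries[key] = (result, upstream_watermark, now)

    def get(self, key, upstream_watermark, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get(key)
        if hit is None:
            return None
        result, watermark, cached_at = hit
        if watermark != upstream_watermark or now - cached_at > self.slo:
            del self.entries[key]  # stale: upstream moved or SLO exceeded
            return None
        return result

cache = FreshCache(slo_seconds=3600)
cache.put("revenue_7d", 42_000, upstream_watermark="2024-06-01T09:00", now=0)
```

A hit is only a hit while the upstream watermark matches; the moment the pipeline publishes new data, the old entry is evicted rather than served as a stale lie.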
Snowflake example with freshness guardrails (Snowflake materialized views can't reference non-deterministic functions like CURRENT_TIMESTAMP, so model a rolling window as a task-refreshed table):

```sql
-- Rolling 7-day revenue, rebuilt hourly by a task
CREATE OR REPLACE TASK analytics.refresh_mv_revenue_7d
  WAREHOUSE = ANALYTICS_MEDIUM
  SCHEDULE = 'USING CRON 0 * * * * UTC'
AS
  CREATE OR REPLACE TABLE analytics.mv_revenue_7d AS
  SELECT customer_segment,
         DATE_TRUNC('day', created_at) AS d,
         SUM(amount) AS revenue
  FROM analytics.fact_orders
  WHERE created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP())
  GROUP BY 1, 2;

ALTER TASK analytics.refresh_mv_revenue_7d RESUME;
```

dbt freshness and SLA-aligned checks:
```yaml
# models/schema.yml
models:
  - name: mv_revenue_7d
    config:
      materialized: table
      tags: ['hotpath']
    tests:
      - dbt_expectations.expect_row_values_to_have_recent_data:
          column_name: d
          datepart: day
          interval: 1
```

Don’t hide bad lineage with a cache. If upstream is flaky, fix the pipeline. Caches should accelerate, not launder stale data.
Bake quality and governance into the performance plan
Performance without reliability is theater. Every optimization should come with tests, contracts, and guardrails.
- Add not-null/unique checks on keys and join columns; null-heavy columns kill join performance and correctness.
- Enforce schema contracts at ingestion (e.g.,
Iceberg/Deltaconstraints, dbt contracts). - Monitor freshness and anomaly detection; alert on regression in 95p runtime or bytes scanned.
Great Expectations quick check on a dimension table:

```python
from great_expectations.dataset import PandasDataset  # legacy dataset API

df = load_customers()  # your loader returning a pandas DataFrame
cd = PandasDataset(df)
cd.expect_column_values_to_not_be_null('customer_id')
cd.expect_column_values_to_be_unique('customer_id')
cd.expect_column_values_to_match_regex('email', r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
results = cd.validate()
if not results['success']:
    raise Exception('Quality gate failed for dim_customers')
```

Warehouse-level governance:
- Snowflake: separate warehouses for ELT, BI, and data science; set resource monitors and auto-suspend.
- BigQuery: reservations and assignments per workload; set slot caps for ad-hoc.
- Redshift: WLM queues and concurrency scaling; pin BI to a queue with a short query timeout.
Measure what matters: availability of core dashboards, 95p latency during peak, and MTTR for failing refreshes.
The 90-day optimization playbook (what we actually run)
We’ve done this at banks, SaaS unicorns, and public retailers. The order matters.
- Instrument and prioritize
- Define SLOs, tag workloads, build the top-20 offender list.
- Target: identify 60% of peak cost within two weeks.
- Storage layout fixes
- Partition/cluster/sort key changes on 3–5 core facts; rebuild with compression and stats.
- Target: 30–50% scan reduction on key queries.
- Join and pre-aggregation design
- Broadcast small dims, co-locate keys, build 2–3 hot MVs.
- Target: 3–6x speedups on top dashboards.
- Compaction and health
- Automate OPTIMIZE/VACUUM weekly; set file size thresholds; update stats.
- Target: planning time < 2s for hot tables.
- Quality and governance
- Add tests on keys/freshness; implement workload isolation and cost guards.
- Target: zero critical incidents from stale/bad data.
- Tight feedback loop
- Weekly review of SLOs, cost, and regressions; tune or roll back.
Typical outcome in 60–90 days:
- 95p dashboard latency: 28s → 4–6s
- Peak compute cost: -25% to -40%
- Incident MTTR (data freshness): 3h → <30m
- Analyst happiness (measured via survey/BI adoption): up 20–30%
And yes, we’ve seen teams turn off entire “accelerator” features after realizing a well-modeled table plus compaction beats magic knobs.
Key takeaways
- Optimize where the money is: rank queries by business impact, not by who yells loudest.
- Storage layout wins are cheap and compounding: partition, cluster/sort, encode, and keep stats fresh.
- Fix joins by design: distribution/broadcast strategies beat brute-force compute.
- Small files kill performance; compaction should be a routine job, not an incident response.
- Materialize with intent and enforce freshness and invalidation rules to avoid stale lies.
- Performance and reliability are inseparable—add tests, constraints, and monitoring as you optimize.
Implementation checklist
- Instrument query SLOs and tie them to revenue or operational KPIs
- Partition on time, cluster/sort on high-cardinality filters and join keys
- Choose column encodings and compression; keep statistics current
- Adopt Delta/Iceberg with compaction; schedule OPTIMIZE/VACUUM
- Design joins: broadcast small dims, avoid skew, pre-aggregate facts
- Use materialized views or dbt incrementals with explicit freshness rules
- Add data quality tests and schema contracts; fail fast on bad data
- Set resource governance: queues/warehouses/slots per workload
Questions we hear from teams
- How do I know if partitioning or clustering actually helped?
- Measure before/after on the same queries: bytes scanned, partitions/files touched, and 95p runtime during peak. On BigQuery, check bytes processed and partitions processed in the job statistics. On Snowflake, look at micro-partitions pruned in the Query Profile. You should see fewer partitions/data files read and a meaningful drop in scan bytes and runtime. If not, your filter selectivity, clustering key choice, or stats may be off.
- Are materialized views worth it versus dbt incrementals?
- Use MVs when the engine can maintain them efficiently for your pattern (simple group-bys, filters) and you need sub-minute freshness with automatic invalidation. Use dbt incrementals when logic is complex, data volumes are large, or you need full control over refresh windows and testing. We often run both: MV for hot, simple aggregates; dbt for shaped marts. Always set freshness SLOs and alerting.
- Can I just buy query acceleration features to fix this?
- Acceleration (e.g., Snowflake Query Acceleration Service, BigQuery BI Engine, Databricks Serverless) can help, but they’re multipliers on good design, not substitutes. If your tables are poorly partitioned, joins skewed, and files tiny, accelerators will be expensive band-aids. We install them after we’ve fixed layout and joins so the dollars translate to predictable wins.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
