Self‑Service Analytics Without the Monday Morning Pager: Building a Data Viz Platform That Actually Holds Up
Dashboards people trust, without a hero culture. The blueprint we use to ship self‑service analytics that don’t burn the team.
“Stop treating dashboards as art projects; treat them like products with SLOs.”
The Scene You’ve Lived Through
Two quarters into your “self‑service” push, Finance has three versions of revenue, Product can’t find churn, and the Monday AM Looker dashboard 500s because a downstream model changed a column from INT to STRING. I’ve seen this movie at a Series C fintech on BigQuery and at a Fortune 100 on Snowflake. Same plot: heroic analysts, vibe dashboards, and a data team playing whack‑a‑mole.
Here’s the version that doesn’t page you: treat dashboards as products, data as APIs with contracts, and your platform as code. No silver bullets—just the boring, repeatable stuff that works.
The Platform Pattern That Doesn’t Flake
Self‑service analytics that stick share the same spine:
- Contracts at the edges: producers publish schemas with SLAs; consumers rely on stable shapes. Use data contracts and CDC (e.g., Debezium + Kafka) into Delta Lake/Iceberg.
- Transformations as code: dbt models with tests, versioned in Git, deployed via Argo CD/GitOps or CI.
- Quality gates: Great Expectations + dbt tests as blockers, not FYIs.
- Observability and lineage: OpenLineage + Marquez or DataHub; pipeline and dataset metrics to Prometheus and Grafana.
- Metrics/semantic layer: fix definitions in code, not in slide decks (dbt metrics, Lightdash, or Looker’s semantic model).
- Thin BI: tools like Apache Superset, Metabase, or Looker consuming governed models; no rogue SQL against raw tables.
- Security at the warehouse: row-level security and column masking in Snowflake/BigQuery/Databricks, not per-dashboard.
You can swap vendors, but the contract→quality→metrics→BI flow is non‑negotiable if you want reliability.
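Data contracts don’t require special tooling to start. A minimal contract can be a versioned YAML file that producers own and review in PRs; the sketch below is hypothetical (the field names and SLA keys are assumptions, not a standard), but it captures the shape most teams converge on:

```yaml
# contracts/orders_events.yaml -- hypothetical contract file; keys are illustrative
dataset: orders_events
owner: payments-team@yourorg.com
version: 1.2.0
sla:
  freshness_minutes: 15    # max lag before consumers should alert
  completeness_pct: 99     # required share of non-null critical fields
schema:
  - name: order_id
    type: STRING
    constraints: [not_null, unique]
  - name: order_total
    type: NUMERIC
    constraints: [not_null, non_negative]
  - name: order_status
    type: STRING
    allowed_values: [pending, paid, shipped, canceled]
```

Because it lives in Git, a schema change (like that INT-to-STRING column from the intro) becomes a reviewable diff instead of a Monday-morning surprise.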
Make It Measurable: SLOs for Data Products
If you don’t measure reliability, “self‑service” will regress to “ping the data team.” Define 3-4 SLOs and wire alerts.
- Freshness SLO: e.g., the orders model updates within 15 minutes, 95% of the time.
- Completeness SLO: <1% missing critical fields per day.
- Accuracy SLO: reconciles to source within 0.5% daily.
- Timeliness SLO: key dashboards render in <5s at p95.
Expose metrics from your jobs and datasets to Prometheus. Don’t be fancy at first—export “last load timestamp” and “row count” labels, then write an alert. Example alert rule:
```yaml
# prometheus/alerts/data-freshness.yaml
groups:
  - name: data-freshness
    rules:
      - alert: OrdersModelStale
        expr: (time() - dataset_last_load_timestamp_seconds{dataset="orders_model"}) > 900
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Orders model freshness SLO violation"
          description: "Orders model has not updated in >15m. Check Airflow DAG and upstream CDC."
```

Keep SLOs visible in Grafana, next to run logs and lineage. Your on‑call will thank you.
Quality Gates That Actually Block Bad Data
I’ve lost count of teams that “monitor quality” but ship broken dashboards because tests don’t fail the build. Fix that.
- Write dbt schema tests for shape and known constraints.
- Use Great Expectations for richer field‑level expectations and distribution checks.
- Fail fast: wire both into CI so bad data never reaches BI.
Example dbt tests:
```yaml
# models/orders.yml
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: order_total
        tests:
          - not_null
          # accepted_values checks set membership, not ranges;
          # use dbt_utils.accepted_range for "order_total >= 0"
          - dbt_utils.accepted_range:
              min_value: 0
      - name: order_status
        tests:
          - accepted_values:
              values: ["pending", "paid", "shipped", "canceled"]
```

A simple Great Expectations suite:
```json
{
  "dataset_name": "orders_model",
  "expectations": [
    {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "order_id"}},
    {"expectation_type": "expect_column_values_to_be_between", "kwargs": {"column": "order_total", "min_value": 0}},
    {"expectation_type": "expect_column_values_to_be_in_set", "kwargs": {"column": "order_status", "value_set": ["pending", "paid", "shipped", "canceled"]}}
  ]
}
```

And wire it into a pipeline that fails on violations. Airflow example with OpenLineage:
```python
# dags/orders_pipeline.py
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# OpenLineage emits run events to Marquez for lineage tracking
os.environ["OPENLINEAGE_URL"] = "http://marquez:5000"
os.environ["OPENLINEAGE_NAMESPACE"] = "analytics"

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt build --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
    ge_validate = BashOperator(
        task_id="ge_validate",
        bash_command="great_expectations checkpoint run orders_checkpoint",
    )
    dbt_run >> ge_validate
```

If ge_validate fails, nothing moves downstream. That’s the point.
Stop Debating “Revenue” in Slack: Freeze It in a Metrics Layer
I’ve seen entire quarters lost to semantic drift. A sane metrics layer stops it.
- Pick your layer: dbt metrics (with MetricFlow), Lightdash, or Looker’s semantic model.
- Put metrics in Git: names, grain, dimensions, filters—reviewed in PRs.
- Expose as APIs: BI tools query the layer, not raw tables.
A simple dbt metric:
```yaml
# models/metrics.yml
metrics:
  - name: revenue
    label: Revenue
    model: ref('orders')
    calculation_method: sum
    expression: order_total
    timestamp: order_date
    time_grains: [day, week, month]
    dimensions: [country, channel]
    filters:
      - field: order_status
        operator: is
        value: paid
```

Pair this with row‑level security at the warehouse. Example in Snowflake:
```sql
-- Restrict rows to a user's region. security.region_map is an assumed
-- role-to-region mapping table; ANALYST_GLOBAL bypasses the filter.
CREATE OR REPLACE ROW ACCESS POLICY region_rls AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ANALYST_GLOBAL'
  OR EXISTS (
    SELECT 1 FROM security.region_map m
    WHERE m.role_name = CURRENT_ROLE() AND m.region = region
  );

ALTER TABLE analytics.orders ADD ROW ACCESS POLICY region_rls ON (region);
```

Now Finance, Sales, and Product all pull “Revenue” with the same filter semantics and security guarantees.
GitOps the Whole Stack: Version, Review, Deploy
The day we put Superset dashboards, dbt models, GE suites, and alert rules under Git with review gates, the 6AM pages stopped.
- Version everything: SQL, metrics, dashboard JSON, alert rules, even Superset database config.
- Deploy with ArgoCD: Argo watches the repo; changes flow to staging then prod with approvals.
- Canary datasets: materialize orders_canary off a subset and run dashboards against it in parallel before flipping.
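The “flip” on a canary dataset should be a gate, not a gut call. A minimal sketch of a promotion check that compares a canary model to production on a couple of aggregates; `fetch_stats` is a stand-in for a warehouse query, and the table names and 0.5% tolerance are assumptions:

```python
# canary_check.py -- hypothetical promotion gate; fetch_stats() stands in for
# a warehouse query like: SELECT count(*), sum(order_total) FROM <table>

def fetch_stats(table: str) -> dict:
    """Stub: replace with a real warehouse query per table."""
    fake = {
        "analytics.orders":        {"row_count": 1_000_000, "revenue": 5_400_000.0},
        "analytics.orders_canary": {"row_count":   998_900, "revenue": 5_395_200.0},
    }
    return fake[table]

def within_tolerance(prod: float, canary: float, pct: float) -> bool:
    """True when canary is within pct percent of the production value."""
    return prod == 0 or abs(prod - canary) / prod <= pct / 100.0

def canary_ok(prod_table: str, canary_table: str) -> bool:
    prod, canary = fetch_stats(prod_table), fetch_stats(canary_table)
    return all([
        within_tolerance(prod["row_count"], canary["row_count"], pct=0.5),
        within_tolerance(prod["revenue"], canary["revenue"], pct=0.5),
    ])

if __name__ == "__main__":
    ok = canary_ok("analytics.orders", "analytics.orders_canary")
    print("promote" if ok else "hold")  # in CI, exit nonzero on "hold"
```

Run it as a CI step after the canary materializes; the pipeline promotes only when the check passes.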
Example ArgoCD Application for Superset config:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: superset
spec:
  project: default
  source:
    repoURL: 'https://github.com/yourorg/analytics-infra'
    path: k8s/superset
    targetRevision: main
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: analytics
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

Pair it with SRE basics: canary deployment of transformations, circuit breakers on flaky sources, and a rollback playbook you can run half‑asleep.
What “Good” Looks Like in 90 Days
This is the rollout we run at GitPlumbers when we’re called to rescue a “self‑service” effort that’s bleeding trust.
Days 0‑30: Contracts and Gates
- Map the top 10 datasets that drive the business (ARR, orders, churn).
- Define SLOs and publish them in Grafana.
- Add dbt tests + GE suites to those datasets; fail the build on violations.
- Turn on lineage (OpenLineage + Marquez/DataHub).
Days 31‑60: Metrics Layer and GitOps
- Stand up metrics in dbt/Lightdash/Looker; codify 5 core metrics with PR review.
- GitOps the BI tool and pipeline configs with ArgoCD.
- Implement RLS and masking in the warehouse; remove BI‑tool‑level hacks.
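For the masking half of that step, a warehouse-side policy can look like this in Snowflake (a sketch; the role name, table, and regex are assumptions for illustration):

```sql
-- Mask customer emails for everyone except a PII-cleared role
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'ANALYST_PII' THEN val
    ELSE REGEXP_REPLACE(val, '.+@', '*****@')  -- keep the domain, hide the user
  END;

ALTER TABLE analytics.customers
  MODIFY COLUMN email SET MASKING POLICY email_mask;
```

Because the policy lives on the column, every BI tool and ad‑hoc query inherits it; there is nothing to re‑implement per dashboard.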
Days 61‑90: Adoption and Cleanup
- Publish 10 “golden” dashboards; deprecate duplicates.
- Office hours, docs in the catalog (OpenMetadata/DataHub), and dashboard ownership.
- Instrument p95 render time; add alerts on slow queries.
- Tackle vibe code cleanup—replace LLM‑generated spaghetti SQL with tested models.
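The p95 render-time alert can mirror the freshness rule from earlier. This sketch assumes you export a render-duration histogram (the `dashboard_render_seconds_bucket` metric name is an assumption about your own instrumentation or a proxy in front of the BI tool):

```yaml
# prometheus/alerts/dashboard-latency.yaml -- sketch; metric name assumed
groups:
  - name: dashboard-latency
    rules:
      - alert: GoldenDashboardSlow
        expr: |
          histogram_quantile(0.95,
            sum by (le, dashboard) (rate(dashboard_render_seconds_bucket[10m]))
          ) > 5
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Dashboard p95 render time over 5s"
```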
KPIs we track:
- MTTR for broken dashboards: target <60 minutes.
- SLO compliance: >95% freshness, >99% completeness on tier‑1 datasets.
- Adoption: +30% MAU on golden dashboards, -50% zombie dashboard views.
- Time‑to‑answer for core questions (e.g., “yesterday’s revenue by channel”): <10s.
Receipts: A Real Outcome, Not Hype
At a consumer fintech (Snowflake + dbt + Superset), we:
- Reduced broken-dashboard incidents by 68% in 8 weeks by failing builds on dbt/GE tests.
- Cut MTTR from ~9h to 45m with OpenLineage + Prometheus run metrics in Grafana.
- Collapsed 14 revenue definitions into 1 metric in Git; Finance stopped arguing after two sprints.
- Improved dashboard p95 render time from 11.2s to 3.8s by materializing aggregates + warehouse tuning (virtual warehouse auto‑suspend 60s, result cache on).
No heroics. Just good engineering and a platform that enforces reality over vibes.
What I’d Do Differently (and What to Do Tomorrow)
- Don’t start with the BI tool. Start with contracts, quality, and semantics.
- Keep the first SLOs blunt and achievable. Fancy comes later.
- Treat LLM‑generated “starter” SQL as scaffolding, not production. Do a vibe code cleanup pass.
- Make product owners own their metrics. Engineering can’t arbitrate “ARR” forever.
Tomorrow morning:
- Pick one dataset and one dashboard. Add tests, wire a freshness alert, define a metric. Ship the change via Git.
- In two weeks, measure MTTR and adoption. If they’re not moving, call us. We’ve pulled a lot of teams out of this ditch.
Key takeaways
- Self‑service works only when data products have explicit contracts and SLOs for freshness, completeness, and accuracy.
- Quality gates belong in the pipeline, not in a PM’s calendar—use dbt tests and Great Expectations to block bad data.
- A metrics layer (dbt metrics, Lightdash, or Looker’s semantic layer) prevents “N different definitions of revenue.”
- GitOps your analytics: version dashboards, tests, and alerts; deploy with ArgoCD; observe with Prometheus and lineage tooling.
- Track business value: adoption, time‑to‑insight, and MTTR. Kill zombie dashboards and celebrate the ones that move the needle.
Implementation checklist
- Define 3-4 data SLOs (freshness, completeness, accuracy, timeliness) and wire Prometheus alerts.
- Add dbt schema tests + Great Expectations suites; fail the pipeline on contract violations.
- Stand up a metrics layer and freeze business definitions in code.
- Version dashboards and semantic configs; deploy via GitOps (ArgoCD).
- Implement row‑level security and PII policies at the warehouse, not in the BI tool.
- Instrument lineage (OpenLineage + Marquez/DataHub) to cut MTTR on incidents.
- Run a 90‑day adoption program: office hours, golden datasets, and dashboard deprecation.
Questions we hear from teams
- Which BI tool should we pick for self‑service?
- Pick the one that best fits your metrics layer and security model. Superset and Metabase are great for open tooling and GitOps; Looker is strong if you’ll commit to its semantic model. The tool is secondary to contracts, tests, and a metrics layer. Thin BI on top of governed models beats feature‑rich BI on raw data every time.
- dbt tests or Great Expectations?
- Both. Use dbt tests for structural guarantees (nulls, uniques, referential integrity). Use Great Expectations for richer distribution and field‑level checks. Run both in CI and fail the pipeline on violations.
- How do we measure if self‑service is delivering business value?
- Track adoption (MAU on golden dashboards), time‑to‑answer for key questions, MTTR for incidents, and SLO compliance on tier‑1 datasets. Tie dashboards to OKRs—if a dashboard doesn’t support a decision, it’s a candidate for deprecation.
- We have a lot of LLM‑generated SQL. Safe to use?
- Treat it as scaffolding. Run a vibe code cleanup: refactor into dbt models, add tests, and codify metrics. We routinely see 20–40% performance and reliability gains by replacing AI‑generated ad‑hoc SQL with modeled transformations.
- Do we need data catalog and lineage from day one?
- Turn on lineage early (OpenLineage + Marquez/DataHub). A catalog (DataHub/OpenMetadata) becomes important as you scale past ~50 models or multiple teams. Lineage is crucial for cutting MTTR during incidents and for safe deprecations.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
