The A/B Pipeline That Lied To Us (And How We Stopped Shipping Fake Wins)

If your A/B results swing wildly or arrive late, your pipeline—not your product—is the problem. Here’s the battle-tested design GitPlumbers uses to make experiment data boringly reliable.

Bad experiment data costs more than failed experiments; it teaches your org to ship fake wins.

The mess you’ve probably lived through

I walked into a team last year where every second experiment was a “+3% conversion” win—until the wins didn’t replicate in prod. Classic symptoms:

  • Late or missing exposure events, backfilled from app logs
  • Variant assignment based on a flaky cookie that rotated on login
  • Metric definitions spread across Looker, dbt, and a PM’s spreadsheet
  • No SRM checks; canaries silently biasing traffic after a deploy

We rebuilt the pipeline in four weeks. The result: decision time dropped from 3 days to 4 hours, SRM incidents fell by 80%, and the “wins” that shipped stopped reversing. Here’s exactly how we designed it.

Principles: make experiment data boring again

The only way A/B testing works is if the data is boring and predictable. That requires:

  • Exposure-first architecture: log assignment the moment it happens; everything else joins to that.
  • Deterministic bucketing: stable IDs and salts; no changing the hash function mid-flight.
  • Single source of metric truth: metrics defined in code (dbt), versioned, and tested.
  • Automated quality gates: SRM detection, outlier handling, and guardrail metrics.
  • Observability: lineage, freshness SLOs, anomaly alerts.

Tools that fit: GrowthBook or LaunchDarkly for assignment, Kafka/Kinesis or Segment/RudderStack for events, dbt for metric modeling, Airflow or Dagster for orchestration, Snowflake/BigQuery as warehouse. We’ve also integrated with Optimizely and homegrown assignment services; the principles stand.

Nail exposure and bucketing (or pack it up)

Non-negotiables:

  • Stable identity: user_id or device_id. If you must, stitch with a user_key map table; document the precedence.
  • Consistent hash: variant = hash(user_id + exp_key + exp_salt) % k. Log the salt and the algorithm version.
  • Synchronous exposure: the assignment event is emitted at decision time, not later.

Define a contract for the exposure event and version it.

# exposure_event.contract.yaml
name: experiment_exposure
version: 1
fields:
  - name: exp_key        # marketing_page_banner_v3
    type: string
    required: true
  - name: exp_salt       # immutable per experiment rollout
    type: string
    required: true
  - name: variant        # control|treatment|treatment_b
    type: string
    required: true
  - name: user_id
    type: string
    required: true
  - name: algo_version   # e.g., mmh3_v1
    type: string
    required: true
  - name: ts
    type: timestamp
    required: true
metadata:
  pii: false
  owner: growth-platform
  retention_days: 365

Deterministic bucketing with a salt and recorded algorithm version makes “replay the assignments” possible.

# bucketing.py
import mmh3

def assign_variant(user_id: str, exp_key: str, exp_salt: str, variants: list[str]):
    key = f"{user_id}:{exp_key}:{exp_salt}"
    bucket = mmh3.hash(key, signed=False) % len(variants)
    return variants[bucket]

# Example
variants = ["control", "treatment"]
assign_variant("u_1234", "pricing_btn_v2", "f3f2a8", variants)

Don’t let the app team “optimize” this. We’ve seen “sticky session” hacks and A/B cookies that rotate on login destroy experiment validity.
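One cheap defense: because the salt and algorithm version are logged, you can replay assignments offline and diff them against what was actually served. A minimal sketch — it uses a stdlib hash as a stand-in for illustration (production replay must use the exact algorithm recorded in algo_version, e.g. mmh3_v1), and `replay_mismatches` is a hypothetical helper name:

```python
import hashlib

def assign_variant(user_id: str, exp_key: str, exp_salt: str, variants: list[str]) -> str:
    # Stand-in hash for illustration; replay must use the algorithm
    # recorded in algo_version, not whatever is convenient.
    key = f"{user_id}:{exp_key}:{exp_salt}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(variants)
    return variants[bucket]

def replay_mismatches(logged: list[dict], variants: list[str]) -> list[dict]:
    """Recompute each logged assignment; return rows whose variant disagrees."""
    return [
        e for e in logged
        if assign_variant(e["user_id"], e["exp_key"], e["exp_salt"], variants) != e["variant"]
    ]
```

Run it daily; a nonzero mismatch count means someone changed the hash, the salt, or the identity mid-flight.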

Model metrics once (dbt) and test them like product code

We force metric definitions into dbt models and tests so everyone uses the same definition.

-- models/metrics/m_order_conversion.sql
{{ config(
    materialized='incremental',
    unique_key=['exp_key', 'variant', 'user_id', 'date']
) }}

with exposures as (
  select exp_key, variant, user_id, date(ts) as dt
  from raw.experiment_exposure
  {% if is_incremental() %}
  -- reprocess a short trailing window so late events still land
  where date(ts) >= date_sub(current_date(), interval 3 day)
  {% endif %}
),
orders as (
  select user_id, date(order_ts) as dt, 1 as has_order
  from raw.orders
  where status = 'paid'
)
select
  e.exp_key,
  e.variant,
  e.user_id,
  e.dt as date,
  coalesce(max(o.has_order), 0) as converted
from exposures e
left join orders o using (user_id, dt)
group by 1, 2, 3, 4

Add tests so bad data can’t sneak in.

# models/metrics/schema.yml
version: 2
models:
  - name: m_order_conversion
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [exp_key, variant, user_id, date]
    columns:
      - name: variant
        tests:
          - accepted_values:
              values: ["control", "treatment", "treatment_b"]
      - name: converted
        tests:
          - not_null
          - accepted_values:
              values: [0, 1]

We also wire in Great Expectations or Soda Core for raw-layer checks. Example GE suite for exposure volume and schema:

# great_expectations/expectations/exposure_suite.yml (simplified; GE stores suites as JSON)
expect_table_columns_to_match_set:
  column_set: [exp_key, exp_salt, variant, user_id, algo_version, ts]
expect_table_row_count_to_be_between:
  min_value: 10000
expect_column_values_to_not_be_null:
  column: user_id

Add lineage with OpenLineage/Marquez so when a test fails, you can see blast radius in seconds.

Quality gates: SRM, CUPED, and guardrails that auto-stop bad tests

If SRM or a guardrail fails, stop the pipeline from publishing “green” results.

  • SRM (Sample Ratio Mismatch): detects assignment bias; fail fast when variant share deviates from expected.
  • CUPED: uses pre-experiment data as a covariate to reduce variance; label results as CUPED-adjusted.
  • Guardrails: latency, error rate, support tickets—don’t ship a +1% conversion that burns your SLOs.

BigQuery SQL for SRM (chi-square) on daily assignments:

-- checks/srm_check.sql
with counts as (
  select exp_key, variant, count(*) as n
  from raw.experiment_exposure
  where date(ts) = current_date()
  group by 1, 2
),
expected as (
  -- assumes an equal planned split; join planned weights for unequal allocations
  select exp_key, sum(n) as total, count(*) as k
  from counts
  group by 1
),
chi as (
  select c.exp_key,
         any_value(e.k) as k,
         sum(power(c.n - e.total / e.k, 2) / (e.total / e.k)) as chi2
  from counts c
  join expected e using (exp_key)
  group by 1
)
select exp_key,
       chi2,
       -- chi_square_cdf is a UDF (BigQuery has no built-in chi-square CDF); df = k - 1
       1 - chi_square_cdf(k - 1, chi2) as p_value
from chi

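If you also compute SRM outside the warehouse — say, in the job that pushes p-values to your metrics endpoint — the two-variant case has a closed form that needs nothing beyond the standard library. A sketch; `srm_pvalue` is a hypothetical helper, and the one-degree-of-freedom formula below only covers two variants:

```python
import math

def srm_pvalue(n_control: int, n_treatment: int, expected_share: float = 0.5) -> float:
    """Chi-square SRM test for a two-variant experiment (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_c = total * expected_share
    exp_t = total * (1 - expected_share)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For df=1 the chi-square survival function reduces to erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

On a 50/50 split, `srm_pvalue(5200, 4800)` comes out far below 0.01 — exactly the kind of imbalance that should page someone.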
CUPED example using pre-period metric y_pre and post y_post:

-- models/analysis/cuped.sql
with base as (
  select user_id, exp_key, variant, y_pre, y_post
  from features.user_pre_post -- precomputed per-user aggregates
),
coef as (
  select exp_key,
         covar_pop(y_post, y_pre) / var_pop(y_pre) as theta
  from base group by 1
)
select b.exp_key, b.variant,
       (b.y_post - c.theta * b.y_pre) as y_cuped
from base b join coef c using (exp_key)

Prometheus alert when SRM p-value < 0.01 for any active experiment (you’ll push SRM results to a metrics endpoint):

# prometheus/alerts.yml
groups:
- name: experiment-alerts
  rules:
  - alert: SRMDetected
    expr: experiment_srm_pvalue{env="prod"} < 0.01
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "SRM detected in {{ $labels.exp_key }}"
      description: "Assignment imbalance. Investigate bucketing or traffic filters."

We’ve seen SRM catch CDN caching gone wild, a bot wave, and a feature flag rollout that skipped Safari. Worth every minute to automate.

Orchestrate for speed-to-decision (not batch vanity)

The anti-pattern is one giant nightly DAG. Optimize for decision latency.

  • Micro-batches: exposures every 5 minutes; outcomes hourly; metrics rolling aggregates.
  • Cache pre-aggregates: materialize variant-level metrics per experiment; store in a narrow “results” table.
  • Separate compute pools: warehouse resource groups for experiment jobs so BI doesn’t starve them (Snowflake warehouses / BigQuery reservations).

Airflow 2.8 sketch:

# dags/experiment_results.py
import re
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def dbt(selector: str) -> BashOperator:
    # Airflow task_ids allow only alphanumerics, dashes, dots, and underscores
    task_id = "dbt_" + re.sub(r"[^A-Za-z0-9_.-]", "_", selector)
    return BashOperator(task_id=task_id, bash_command=f"dbt run --select '{selector}'")

with DAG(
    dag_id="experiment_results",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
):
    exposures = dbt("+raw.experiment_exposure")
    metrics   = dbt("models/metrics+")
    srm       = dbt("checks.srm_check")
    publish   = dbt("models/public.experiment_results")

    exposures >> metrics >> srm >> publish

The published table should be boring and canonical:

  • exp_key, variant, start_ts, end_ts, n_users, metric means/SEs, CUPED flag, SRM p-value, guardrail statuses, power estimate, last_updated.
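As a sketch, that row can be pinned down as a typed record so producers and consumers agree on names and types (the field names here are assumptions drawn from the bullet above, not a settled schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ExperimentResult:
    """One row of the canonical published results table."""
    exp_key: str
    variant: str
    start_ts: datetime
    end_ts: datetime
    n_users: int
    metric_mean: float
    metric_se: float
    cuped_applied: bool
    srm_pvalue: float
    guardrails_passed: bool
    power_estimate: float
    last_updated: datetime
```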

Data contracts, identity, and reality checks that save your bacon

Hard-learned guardrails from real incidents:

  • Identity stitching: when user_id changes on login, keep a user_aliases table and pick the earliest id for bucketing; never rebucket mid-experiment.
  • Traffic filters: exclude bots and employees consistently. Pipe Cloudflare bot scores into the warehouse; filter upstream, not in the BI layer.
  • Time zones: pick one (UTC). Don’t mix device local time with server time in exposure/outcomes.
  • Late events: enforce max_lag_minutes; anything beyond goes to a quarantine table and is surfaced in alerts.
  • Version everything: event contracts, metric SQL, hash algorithm, and experiment params. Put them in Git; use PRs.
  • AI-generated code: we’ve seen “vibe coding” produce subtly different metric joins across models. Lock metric logic into a single dbt model and reference it; fail CI if someone re-implements it elsewhere.
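The late-event rule from the list above is mechanical to enforce at ingest. A hypothetical routing sketch, assuming a 60-minute max_lag policy:

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=60)  # assumed max_lag_minutes policy; tune per pipeline

def route_event(event_ts: datetime, received_ts: datetime) -> str:
    """Send on-time events to the main table, late ones to quarantine."""
    return "quarantine" if received_ts - event_ts > MAX_LAG else "main"
```

Quarantined rows stay queryable for debugging and backfill audits, but never feed live experiment results.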

Proving business value: the numbers that matter

What we measure when we rebuild an A/B pipeline:

  • Decision latency: time from exposure ingest to updated results. Target: < 1 hour for active experiments.
  • SRM incident rate: per-month incidents; target: 80% reduction after quality gates.
  • Replication rate: percentage of “wins” that hold up in ramp. Target: +20–30% improvement.
  • Cost per decision: warehouse credits per experiment per day. Target: -25% via pre-aggregates and cache.
  • MTTR on broken experiments: from alert to fix. Target: under 2 hours with lineage + Prometheus + clear ownership.

A recent engagement on BigQuery + GrowthBook + dbt v1.6:

  • Cut decision latency from 2.5 days to 4 hours
  • SRM alerts dropped by 80%
  • Reduced BigQuery cost for experiment workloads by 28%
  • Exec trust bounced back because “wins” stopped flipping at rollout

What we’d do again (and what to avoid)

Do this:

  1. Expose at assignment with a contract and tests.
  2. Deterministic bucketing with logged salt and hash version.
  3. Central metric models in dbt; don’t fork metric logic.
  4. Automated SRM/CUPED/guardrails with alerts that block publish.
  5. Micro-batch orchestration optimized for decision latency.
  6. Lineage + freshness SLOs so on-call isn’t guessing.

Avoid this:

  • Backfilling exposures from app logs (“we’ll fix it later”). You won’t.
  • Multiple bucketing implementations (web vs. iOS). Pick one library.
  • BI-layer metric definitions. They will diverge.
  • “We don’t need CUPED.” If variance is high, you do.
  • Letting AI assistants drift metric logic. Use tests and contracts to keep it honest.

If you want this to be boring and fast, GitPlumbers will wire it up with your stack—not a rip-and-replace. We’ve done it on Snowflake, BigQuery, and even old Redshift with duct tape and good alerts.


Key takeaways

  • Expose assignment before effects: an exposure-first architecture is non-negotiable.
  • Use deterministic bucketing with stable user identity and salts; log assignment events with data contracts.
  • Model metrics in dbt with tests and lineage; keep definitions in one place or expect disagreements forever.
  • Automate quality gates: SRM detection, CUPED variance reduction, and guardrail metrics with alerts.
  • Make the pipeline observable: lineage, freshness SLAs, anomaly detection, and Prometheus alerts reduce MTTR.
  • Optimize for decision latency, not batch vanity; cache and pre-aggregate to shorten the time-to-decision.

Implementation checklist

  • Define a versioned data contract for exposure and outcome events.
  • Implement deterministic bucketing with a stable user_id and experiment-specific salt.
  • Log exposure events synchronously at assignment; avoid retroactive backfills.
  • Centralize metric definitions in dbt with tests for nulls, ranges, and uniqueness.
  • Automate SRM checks and fail fast on assignment imbalance.
  • Apply CUPED for high-variance metrics and document it next to the metric code.
  • Instrument freshness, volume, and anomaly alerts in Prometheus (or your stack).
  • Publish a single canonical experiment result table with metadata (power, SRM, guardrails, CUPED used).

Questions we hear from teams

How do we retrofit exposure-first design if our app only emits outcome events?
Stand up an assignment service (GrowthBook SDKs or a tiny Go/Python service) that deterministically buckets with a logged salt and emits exposure events synchronously at decision time. Backfill only to support historical dashboards—never to calculate current experiment results.
Is CUPED always worth it?
If your metric is noisy (think revenue per user, time on site), CUPED typically reduces variance 10–30%. Compute the pre-period feature once daily and join. Be explicit in reporting (flags/notes) so PMs understand adjusted vs. raw.
Where should metric definitions live: dbt, BI, or the experiment tool?
dbt. The experiment tool consumes pre-aggregated tables; BI builds on the same models. We’ve seen BI-defined metrics drift. Keep SQL in Git with tests and versioning.
How do we detect identity churn or bot traffic biasing results?
Add diagnostics: user_id alias churn rate, exposure-to-outcome lag distribution, bot score distributions by variant. Alert on anomalies and exclude upstream via a shared filter model referenced by all metrics.
Can we do this on Redshift or Databricks?
Yes. Replace Snowflake/BigQuery with your warehouse, keep the same patterns: contracts, deterministic bucketing, dbt models, and quality gates. We’ve shipped this on Redshift RA3 and on Databricks SQL with Delta tables.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

  • Talk to GitPlumbers about stabilizing your A/B pipeline
  • See how we prevent vibe-coded metrics from shipping
