The A/B Test That Lied: Designing Data Pipelines That Stop Gaslighting Your Team

Reliable experimentation isn’t a stats problem. It’s an engineering problem: assignment, exposure, identity, late events, and metric drift. Here’s how to build an A/B testing pipeline you can defend in a postmortem—and trust in a board deck.

Reliable A/B testing isn’t about fancier statistics. It’s about treating assignment, exposure, and metrics like production systems—with contracts, tests, and incident response.

When your experiment dashboard is wrong, everyone loses

I’ve watched teams “win” an A/B test, ship the variant, and then spend the next quarter explaining why revenue dropped anyway. The postmortem always starts the same: “the stats looked significant.” And it ends the same: “turns out the data pipeline was… complicated.”

If your experimentation stack is built on a pile of client-side events, a best-effort identity graph, and a dashboard that quietly changes definitions, you don’t have an A/B testing program. You have confidence theater.

What reliable looks like in practice:

  • Assignment is a fact (who was eligible, what variant they got, when it happened).
  • Exposure is a fact (who actually saw the thing).
  • Outcomes are computed from versioned metrics (so “conversion” doesn’t mutate mid-quarter).
  • Pipelines are re-playable and testable (so backfills don’t rewrite history).

This is the stuff GitPlumbers gets called in for after the third “rerun the experiment” incident, when the CFO starts asking why the growth team can’t agree on a number.

Start with an experiment ledger: assignment + exposure as first-class data

The #1 reliability move is separating assignment, exposure, and outcome. Most teams blur these and then wonder why counts don’t match.

Assignment (authoritative)

Assignment should be written by something you trust:

  • Server-side in the API (/checkout, /search, etc.), or
  • An edge layer you control (e.g., Cloudflare Workers) if you must

Client-side-only assignment is where bots, ad blockers, retries, and “my tab crashed” go to ruin your day.

A minimal assignment event shape:

# experiment_assignment_v1.yaml
name: experiment_assignment
version: 1
primary_key: [assignment_id]
fields:
  assignment_id: { type: string, required: true }
  experiment_id:  { type: string, required: true }
  variant_id:     { type: string, required: true }
  unit_type:      { type: string, required: true, allowed: [user_id, device_id, account_id] }
  unit_id:        { type: string, required: true }
  assigned_at:    { type: timestamp, required: true }
  eligibility_hash: { type: string, required: false }
  reason:         { type: string, required: false } # e.g. "targeting", "holdout", "override"
  source:         { type: string, required: true }  # service/app version

Key properties:

  • Idempotent: retries don’t create multiple assignments (a sketch follows this list).
  • Immutable: you don’t “fix” assignments later; you append corrections.
  • Auditable: you can explain why someone got variant B.
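
To make the idempotent and auditable properties cheap, derive both the bucket and the assignment_id deterministically from the experiment and unit, so a retried request produces exactly the same row. A minimal sketch in Python; the function and defaults are illustrative, not a specific SDK:

# deterministic_assignment.py (illustrative)
import hashlib
import uuid

def assign(experiment_id: str, unit_id: str, variants=("A", "B")) -> dict:
    # Stable hash of (experiment, unit) -> bucket: same inputs, same variant, every time.
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % len(variants)
    # Deterministic assignment_id: a retry writes the same key, so dedupe is trivial.
    assignment_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"assignment:{experiment_id}:{unit_id}"))
    return {
        "assignment_id": assignment_id,
        "experiment_id": experiment_id,
        "variant_id": variants[bucket],
        "unit_type": "user_id",
        "unit_id": unit_id,
    }

Nothing in there depends on request count or ordering, which is exactly the property you want before the event ever reaches the warehouse.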

Exposure (separate, and messy by nature)

Exposure is “did they actually see it?” This is where late events and client weirdness live.

  • Exposure should reference the same experiment_id + unit_id.
  • It should have its own dedupe key, e.g. exposure_id or a hash of (unit_id, experiment_id, exposure_at, surface); see the sketch after this list.
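
A minimal sketch of that content-derived key, assuming a surface field on the exposure event (names are illustrative):

# exposure_dedupe.py (illustrative)
import hashlib

def exposure_dedupe_key(unit_id: str, experiment_id: str,
                        exposure_at_iso: str, surface: str) -> str:
    # The same exposure reported twice hashes to the same id, so downstream
    # dedupe collapses client retries without guessing.
    raw = "|".join([unit_id, experiment_id, exposure_at_iso, surface])
    return hashlib.sha256(raw.encode()).hexdigest()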

Why this split pays off

  • You can analyze intent-to-treat (assignment-based) vs treatment-on-treated (exposure-based); the sketch after this list shows both reads.
  • You can detect instrumentation failures: “assignments happened, exposures didn’t.”
  • You avoid the classic bug: counting users in a variant because they fired a conversion event, not because they were assigned.
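
Once the cohort table exists (built in the gold layer below), both reads come from the same rows. A rough pandas sketch, assuming columns named like the experiment_cohorts model later in this post:

# itt_vs_tot.py (illustrative)
import pandas as pd

def itt_vs_tot(cohorts: pd.DataFrame) -> pd.DataFrame:
    # Assigned-but-never-purchasing units count as zero revenue.
    cohorts = cohorts.assign(revenue_14d=cohorts["revenue_14d"].fillna(0))
    itt = cohorts.groupby("variant_id")["revenue_14d"].mean().rename("itt_avg_revenue_14d")
    tot = (cohorts[cohorts["exposed"] == 1]
           .groupby("variant_id")["revenue_14d"].mean().rename("tot_avg_revenue_14d"))
    return pd.concat([itt, tot], axis=1)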

Build a pipeline that’s boring in the right places (bronze/silver/gold + backfills)

I’m allergic to clever experimentation pipelines. The reliable ones look boring:

  • Bronze (raw): append-only events exactly as received
  • Silver (clean): deduped, typed, contract-validated tables
  • Gold (analytics): experiment-ready fact tables (assignment/exposure/outcomes)

This pattern works in Snowflake, BigQuery, Redshift, Databricks Delta, Iceberg—doesn’t matter. The point is: raw is sacred; curated is derived.

Bronze: keep the original envelope

Store:

  • received_at (warehouse ingest time)
  • sent_at (client/server time)
  • event_id (producer-generated UUID)
  • raw payload (VARIANT in Snowflake, JSON in BigQuery)

Late arrivals are normal. Your job is to make them visible, not pretend they don’t exist.
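
On the producer side, the envelope is cheap to attach; a rough sketch (the wrapper is illustrative, and received_at gets stamped at ingest, not here):

# event_envelope.py (illustrative)
import json
import uuid
from datetime import datetime, timezone

def wrap_event(event_name: str, payload: dict) -> str:
    # event_id + sent_at travel with the payload; the warehouse adds received_at.
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "event_name": event_name,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    })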

Silver: deterministic dedupe + typing

In dbt, a typical silver model for assignments:

-- models/silver/assignments.sql
{{ config(materialized='incremental', unique_key='assignment_id') }}

with src as (
  select
    payload:assignment_id::string  as assignment_id,
    payload:experiment_id::string  as experiment_id,
    payload:variant_id::string     as variant_id,
    payload:unit_type::string      as unit_type,
    payload:unit_id::string        as unit_id,
    payload:assigned_at::timestamp as assigned_at,
    received_at
  from {{ source('bronze', 'events') }}
  where event_name = 'experiment_assignment'
  {% if is_incremental() %}
    -- sliding window: reprocess only the last 3 days so late events still merge in
    and received_at >= dateadd('day', -3, current_timestamp())
  {% endif %}
)

select *
from src
qualify row_number() over (
  partition by assignment_id
  order by received_at desc
) = 1

Notes from the trenches:

  • Incremental + sliding window handles late events without full rebuilds.
  • Dedupe uses received_at so retries collapse deterministically.

Gold: experiment-ready joins (and guardrails)

A “gold” table often looks like: one row per assigned unit, with exposure flags and outcomes.

-- models/gold/experiment_cohorts.sql
with a as (
  select * from {{ ref('assignments') }}
),

x as (
  select
    experiment_id,
    unit_id,
    min(exposure_at) as first_exposure_at
  from {{ ref('exposures') }}
  group by 1,2
),

o as (
  -- window outcomes to 14 days after assignment so revenue_14d means what it says
  select
    asg.experiment_id,
    asg.unit_id,
    min(ord.order_at) as first_order_at,
    sum(ord.revenue)  as revenue_14d
  from {{ ref('assignments') }} as asg
  join {{ ref('orders_clean') }} as ord
    on ord.unit_id = asg.unit_id
   and ord.order_at between asg.assigned_at
                        and dateadd('day', 14, asg.assigned_at)
  group by 1, 2
)

select
  a.experiment_id,
  a.variant_id,
  a.unit_type,
  a.unit_id,
  a.assigned_at,
  x.first_exposure_at,
  case when x.first_exposure_at is not null then 1 else 0 end as exposed,
  o.first_order_at,
  o.revenue_14d
from a
left join x
  on a.experiment_id = x.experiment_id
 and a.unit_id = x.unit_id
left join o
  on a.experiment_id = o.experiment_id
 and a.unit_id = o.unit_id

The business value: when someone asks, “Are we measuring the right cohort?”, you can answer with SQL, not vibes.

Bake in data quality: contracts, dbt tests, and SRM alarms

If your pipeline can silently produce bad experiment data, it will—usually on the one test the CEO cares about.

Data contracts (stop schema drift before it lands)

Use contracts at the ingestion boundary:

  • Protobuf/Avro + schema registry if you’re on Kafka
  • JSON schema validation if you’re on Segment/Snowplow-esque payloads (sketch below)

At minimum, enforce:

  • required fields (experiment_id, variant_id, unit_id, timestamps)
  • allowed enumerations (unit_type)
  • stable semantics (what does “exposure” mean?)
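
For the JSON-payload route, the boundary check can be a few lines. A minimal sketch using the jsonschema library; the schema mirrors experiment_assignment_v1.yaml above and is an illustration, not your real contract:

# validate_assignment.py (illustrative)
from jsonschema import ValidationError, validate

ASSIGNMENT_SCHEMA = {
    "type": "object",
    "required": ["assignment_id", "experiment_id", "variant_id",
                 "unit_type", "unit_id", "assigned_at", "source"],
    "properties": {
        "unit_type": {"enum": ["user_id", "device_id", "account_id"]},
    },
}

def accept(payload: dict) -> bool:
    try:
        validate(instance=payload, schema=ASSIGNMENT_SCHEMA)
        return True
    except ValidationError:
        # Route to a dead-letter table/topic so drift is visible, not silently dropped.
        return False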

dbt tests that actually catch experiment issues

# models/silver/assignments.yml
version: 2
models:
  - name: assignments
    columns:
      - name: assignment_id
        tests:
          - not_null
          - unique
      - name: experiment_id
        tests: [not_null]
      - name: variant_id
        tests: [not_null]
      - name: unit_id
        tests: [not_null]

  - name: experiment_cohorts
    tests:
      - dbt_utils.expression_is_true:
          expression: "assigned_at <= coalesce(first_exposure_at, assigned_at)"

That last test catches a real class of bugs: exposure timestamps accidentally backfilled earlier than assignment due to timezone parsing or event replay order.

SRM (Sample Ratio Mismatch) detection isn’t optional

If your allocation is 50/50 and you see 55/45, you don’t have an experiment result—you have an instrumentation incident.

You can compute SRM daily and page on it. A lightweight approach:

-- srm_check.sql
with counts as (
  select
    experiment_id,
    count_if(variant_id = 'A') as n_a,
    count_if(variant_id = 'B') as n_b
  from analytics.experiment_cohorts
  where assigned_at >= dateadd('day', -1, current_date())
  group by 1
)

select
  experiment_id,
  n_a,
  n_b,
  n_a::float / nullif(n_a + n_b, 0) as pct_a
from counts
where abs(n_a::float / nullif(n_a + n_b, 0) - 0.5) > 0.02

Wire it into your orchestration layer (Airflow/Dagster) with a hard fail; the query helper in the snippet is a placeholder for however you execute SQL against your warehouse:

# airflow snippet
from airflow.decorators import task

@task
def check_srm():
    # run_srm_query() is a placeholder: run srm_check.sql against the warehouse
    # and return any rows that breach the allocation threshold.
    rows = run_srm_query()
    if rows:
        raise ValueError(f"SRM breach in experiments: {[r['experiment_id'] for r in rows]}")

What I’ve seen work: treat SRM like an SLO. If SRM > threshold for > N hours, the experiment is paused automatically.
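
If you want a principled threshold instead of a flat percentage band, the standard SRM check is a chi-square goodness-of-fit test against the configured allocation. A minimal sketch, assuming scipy is available (illustrative, not part of the pipeline above):

# srm_stat_check.py (illustrative)
from scipy.stats import chisquare

def srm_p_value(n_a: int, n_b: int, expected_split=(0.5, 0.5)) -> float:
    # Chi-square goodness-of-fit p-value for the observed variant counts.
    total = n_a + n_b
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p = chisquare(f_obs=[n_a, n_b], f_exp=expected)
    return p

# A 50/50 allocation that shows up as 10,500 vs 9,500 is wildly improbable under
# correct bucketing; page on p < 0.001 rather than debating the variant.
assert srm_p_value(10_500, 9_500) < 0.001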

Metric truth: one definition, versioned, and reproducible

This is where experiments go to die: “conversion” means three different things in three dashboards.

Fixing it is less glamorous than Bayesian debates:

  • Define metrics once (dbt models, a semantic layer like dbt metrics / MetricFlow, or a service like Transform).
  • Version them (breaking changes get conversion_rate_v2).
  • Store the analysis dataset snapshot inputs (tables + partitions + code SHA).

A practical pattern: curated metric tables + analysis views

  • Gold tables produce clean facts (orders_clean, sessions_clean, experiment_cohorts).
  • Metric models compute outcomes with explicit windows.

Example: revenue within 14 days of assignment, exposure-aware.

-- models/gold/metrics/revenue_14d.sql
select
  experiment_id,
  variant_id,
  count(*) as assigned_units,
  sum(case when exposed = 1 then 1 else 0 end) as exposed_units,
  avg(coalesce(revenue_14d, 0)) as avg_revenue_14d  -- non-purchasers count as zero
from {{ ref('experiment_cohorts') }}
where assigned_at >= dateadd('day', -30, current_date())
group by 1,2

Reproducibility tip that saves careers

When an exec asks, “Why did this number change since last week?”, you want to say:

“Because we shipped conversion_rate_v2 on Jan 12, and this report is pinned to v1.”

Not: “Uh… maybe the dashboard refreshed?”
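
One low-tech way to get there is to write a small manifest next to every analysis run; a rough sketch (the helper and file layout are illustrative):

# analysis_manifest.py (illustrative)
import json
import subprocess
from datetime import datetime, timezone

def write_manifest(experiment_id: str, inputs: dict, metric_versions: dict, path: str) -> None:
    manifest = {
        "experiment_id": experiment_id,
        "computed_at": datetime.now(timezone.utc).isoformat(),
        "code_sha": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
        "inputs": inputs,                    # e.g. table names -> pinned partitions
        "metric_versions": metric_versions,  # e.g. {"conversion_rate": "v1"}
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)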

Concrete reliability wins (the metrics your leadership will actually care about)

When we rebuild experimentation pipelines at GitPlumbers, the before/after is usually measurable within a month:

  • Experiment reruns drop by 30–60% (because SRM and join issues get caught in CI, not after launch).
  • Time-to-decision drops from ~10–14 days to ~3–7 days (less arguing, fewer “data looks off” freezes).
  • Incident MTTR on data issues drops by 40%+ (ledger + bronze retention makes debugging possible).
  • Revenue-impacting rollouts get safer (exposure logging + guardrails prevents “we shipped to nobody” and “we shipped to everyone” disasters).

The unsexy secret: reliability is compounding. The second experiment is cheaper than the first. The tenth experiment is dramatically cheaper—if you don’t let the pipeline rot.

A minimal implementation plan you can execute without boiling the ocean

  1. Add assignment logging server-side and make it idempotent.
  2. Add exposure logging (even if messy) and keep it separate from assignment.
  3. Stand up bronze/silver/gold with append-only raw retention.
  4. Add dbt tests + contract checks in CI, and fail deployments on violations.
  5. Implement SRM alerts and define who gets paged.
  6. Move metrics into a versioned, centralized definition (dbt/semantic layer) and pin analyses.

Run the first A/A test after step 3. If your A/A can’t stay flat, stop and fix the pipeline. I’ve seen teams waste six months optimizing product changes when the real problem was a broken join.

# Typical “make it real” CI sequence
dbt deps
dbt seed
dbt run --select silver+ gold+
dbt test --select silver+ gold+

When you inherit a haunted experimentation stack

If you’re sitting on a pile of legacy tracking code, half-migrated GA360, maybe some Segment events, and a warehouse full of “final_v7” tables… yeah. Been there.

GitPlumbers usually starts with:

  • A data lineage walk (what feeds the experiment dashboard, actually)
  • A ledger retrofit (assignment/exposure as immutable facts)
  • A quality harness (contracts + dbt/GE/Soda + SRM paging)
  • A metric freeze + versioning plan so you can ship changes without rewriting history

If you want a second set of eyes, we can do a short pipeline reliability review and tell you—plainly—where your experiments are lying and how to stop it.


Key takeaways

  • Treat experiment assignment and exposure as first-class, immutable facts (a ledger), not derived artifacts.
  • Separate raw ingestion from curated experimentation models (bronze/silver/gold) so backfills don’t rewrite history silently.
  • Automate data contracts + tests (schema, uniqueness, referential integrity, SRM alarms) in CI so bad data can’t ship.
  • Version metrics and experiment analyses; a “metric drift” incident is still an incident.
  • Measure outcomes: fewer reruns, lower incident MTTR, shorter time-to-decision, and auditability for revenue-impacting changes.

Implementation checklist

  • Raw events are immutable and re-playable (with `event_id`, `sent_at`, `received_at`).
  • Assignment is logged server-side (or via a trusted edge) and stored as a ledger.
  • Exposure is logged separately from assignment; analysis uses exposure-aware cohorts.
  • Identity resolution is deterministic and documented (no silent backfills changing user counts).
  • Deduping and late-event handling are explicit and tested.
  • dbt/Great Expectations/Soda tests cover schema, uniqueness, not-null, and joins.
  • SRM detection runs automatically with paging thresholds.
  • Metrics are versioned and computed in a single place (semantic layer or curated dbt models).
  • Every experiment result is reproducible from a pinned dataset snapshot + code SHA.

Questions we hear from teams

Do we really need exposure logging? Isn’t assignment enough?
Assignment-only (intent-to-treat) is valid, but exposure logging is how you detect delivery failures and interpret effect sizes. Without exposure, you can’t tell “variant didn’t work” from “variant wasn’t shown.” In practice, teams use assignment for primary analysis and exposure for diagnostics/secondary reads.
What’s the fastest way to catch broken experiments in production?
Automate SRM detection and page on it. SRM is the canary for bucketing bugs, targeting drift, and missing events. Pair it with a simple assignment-to-exposure rate check by platform/app version.
Which tools should we use: dbt, Great Expectations, Soda, Monte Carlo?
dbt tests cover a lot (schema/uniqueness/relationships) and are easy to operationalize. Great Expectations or Soda helps when you need richer validations and reporting. Monte Carlo-style observability is great once you have stable models—don’t use it as a substitute for contracts and deterministic pipelines.

Ready to modernize your codebase?

Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.

Book an experimentation pipeline reliability review, or see how we fix broken analytics pipelines.
