The A/B Test Pipeline That Lied to Product: Designing Experiment Data You Can Trust
Your A/B tests aren’t failing because statistics are hard. They’re failing because your data pipeline is lying. Here’s the design that keeps experiment results reliable, fast, and tied to business value.
Your stats aren’t wrong—your pipeline is. Fix the exposure truth and the wins will follow.
The A/B test that lied
I’ve watched a pricing test “win” by +4% revenue at a consumer marketplace—until we realized retries were double-counting exposures and bots were inflating clicks. Product shipped the variant, conversion cratered, and we spent a quarter earning back trust. The pipeline wasn’t evil; it was sloppy:
- Non-deterministic assignment after a redeploy changed a salt.
- Duplicate exposures on mobile retries because idempotency keys were an afterthought.
- Metrics stitched with elastic windows so variant A got more late conversions than B.
- Analysts peeking with vanilla t-tests and 30 concurrent experiments, no FDR control.
If this feels familiar, you don’t need new statistics first—you need a pipeline that stops lying. Here’s the design that’s held up for us at scale (and what GitPlumbers implements when we clean up vibe-coded experiment stacks).
What reliable experiment pipelines actually need
Skip the buzzwords. These are the non-negotiables:
- Deterministic assignment: stable hashing of a canonical identity with a fixed experiment salt. Assignment must never change mid-flight.
- Idempotent, deduped exposures: every exposure event has a unique key; ingestion and downstream sinks enforce uniqueness.
- Metric definitions in code: dbt or a semantic layer; no spreadsheet logic.
- Join rules that survive reality: late events, clock skew, retries, and partial identities (cookie to user merge).
- Data quality gates: contracts on the edge, automated tests, lineage, and SLOs for freshness.
- Statistical guardrails: pre-registration, CUPED or variance reduction, and false discovery control.
- A registry: a Git-controlled experiment catalog with ownership, metrics, audiences, and exposure source.
If any one of these is “TBD,” your test results are a coin flip with a confidence interval.
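The deterministic-assignment rule above can be sketched in a few lines. This is a hypothetical helper, not a specific library: it assumes a 50/50 split on a sha256 hash, but any stable hash works as long as the salt is fixed per experiment and never derived from deploy-time state.

```python
import hashlib

def assign_variant(unit_id: str, experiment_key: str, salt: str,
                   variants=("A", "B")) -> str:
    """Deterministic assignment: the same inputs always yield the same variant.

    Hash the canonical identity with a fixed, per-experiment salt.
    No RNG, no server state, nothing that changes on redeploy.
    """
    key = f"{unit_id}:{experiment_key}:{salt}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    # 50/50 split; adjust the cutoff for other allocations
    return variants[0] if bucket < 50 else variants[1]

# Stable across calls, processes, and deploys
assert assign_variant("12345", "pricing_v2", "1f31a7e0") == \
       assign_variant("12345", "pricing_v2", "1f31a7e0")
```

The point of routing every caller through one versioned function is that "assignment must never change mid-flight" becomes a code-review property, not a hope.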
Reference architecture: stream first, batch consistent
You don’t need FAANG budgets. You need the right few components configured correctly.
- Ingestion: Snowplow or custom collectors pushing to Kafka/Kinesis.
- Stream processing: Flink or Spark Structured Streaming for dedupe, enrichment, and exactly-once sinks.
- Storage: BigQuery/Snowflake for analysis; ClickHouse if you need sub-second slice-and-dice.
- Orchestration: Airflow or Dagster for batch metric builds; Argo Workflows if you’re deep in Kubernetes.
- Transformations: dbt for metrics and semantic consistency.
- Quality: Great Expectations + OpenLineage.
- Observability: Prometheus + OpenTelemetry + logs; alerts in PagerDuty.
Event schema (exposure) with idempotency baked in:
{
  "event_type": "experiment_exposure",
  "event_version": 1,
  "exposure_id": "uuid-v4",
  "experiment_key": "pricing_v2",
  "variant": "B",
  "assigned_at": "2025-11-14T12:41:00Z",
  "user_id": "12345",
  "anonymous_id": "a1b2c3",
  "request_id": "r-9f2...",
  "app_version": "6.14.2",
  "device_ts": 1731588060000,
  "schema_ts": 1731588060123
}

Kafka sink with exactly-once semantics (Flink -> Snowflake using transactional sinks):
# flink-connector.yaml (simplified)
job:
  parallelism: 8
  checkpointing:
    intervalMs: 30000
    exactlyOnce: true
sinks:
  - type: snowflake
    table: EXPOSURES
    keyColumns: ["EXPOSURE_ID"]
    upsert: true
    transactionMode: EXACTLY_ONCE

Airflow batch build (exposures + metrics, hourly):
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from datetime import datetime

with DAG(
    dag_id="experiment_metrics",
    start_date=datetime(2024, 1, 1),
    schedule_interval="15 * * * *",
    catchup=False,
    max_active_runs=1,
) as dag:
    exposures = BigQueryInsertJobOperator(
        task_id="load_exposures",
        configuration={
            "query": {
                "query": """
                    create or replace table mart.exposures as
                    select exposure_id, experiment_key, variant, user_id, assigned_at
                    from raw.exposures_stream_deduped;
                """,
                "useLegacySql": False,
            }
        },
    )

    metrics = BigQueryInsertJobOperator(
        task_id="compute_metrics",
        configuration={
            "query": {
                "query": """
                    create or replace table mart.experiment_metrics as
                    select
                      e.experiment_key,
                      e.variant,
                      count(distinct e.exposure_id) as exposures,
                      sum(m.order_revenue) as revenue,
                      sum(case when m.converted then 1 else 0 end) as conversions
                    from mart.exposures e
                    left join mart.metrics m
                      on m.user_id = e.user_id
                      and m.event_time between e.assigned_at and timestamp_add(e.assigned_at, interval 14 day)
                    group by 1, 2;
                """,
                "useLegacySql": False,
            }
        },
    )

    exposures >> metrics

Guardrails: data quality, lineage, and SLOs
Dashboards don’t catch pipeline lies—tests do.
- Contracts at the edge: schema versioning and required fields enforced at ingestion. Reject bad events fast.
- Great Expectations/dbt tests on exposures and metrics.
- OpenLineage tags tie every table back to experiment keys.
- SLOs for freshness and join latency; alert on breach.
Great Expectations for exposure uniqueness and null checks:
# great_expectations/expectations/exposures.py
from great_expectations.dataset import PandasDataset

class ExposureDataset(PandasDataset):
    _data_asset_type = "ExposureDataset"

    def validate_schema(self):
        self.expect_column_values_to_not_be_null("exposure_id")
        self.expect_column_values_to_not_be_null("experiment_key")
        self.expect_column_values_to_be_in_set("variant", ["A", "B", "control", "treatment"])
        self.expect_column_values_to_be_unique("exposure_id")

dbt tests for assignment stability and exposure dedupe:
# models/experiments/schema.yml
version: 2
models:
  - name: exposures
    columns:
      - name: exposure_id
        tests:
          - unique
          - not_null
      - name: assigned_at
        tests:
          - not_null
      - name: experiment_key
        tests:
          - relationships:
              to: ref('experiment_registry')
              field: experiment_key
  - name: assignment_audit
    description: "Detects assignment drift by comparing hash(bucketing inputs) vs stored variant"
    tests:
      - dbt_utils.expression_is_true:
          expression: "expected_variant = variant"

Prometheus SLOs to keep you honest:
- p95 exposure-to-availability < 2 minutes.
- Exposure uniqueness violations = 0 (not a percentile—any duplicate is a defect).
- p95 exposure-to-metric join completed < 15 minutes.
Example PromQL alert:
# Alert if freshness breaches 15 min for metrics
(max by(dataset) (time() - dataset_last_updated_timestamp_seconds{dataset="mart_experiment_metrics"})) > 900

If you can’t answer “Are exposures unique, on time, and schema-valid?” in a graph, you’re guessing.
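A toy version of that answer, run against one batch of events, looks like this. It assumes rows shaped like the exposure schema earlier, with `assigned_at` as an ISO-8601 UTC string; your real check would run against the warehouse tables.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def exposure_health(rows, now, freshness_slo=timedelta(minutes=2)):
    """Answer 'unique, on time, schema-valid?' for a batch of exposure events."""
    required = {"exposure_id", "experiment_key", "variant", "assigned_at"}
    counts = Counter(r.get("exposure_id") for r in rows)
    duplicates = sorted(k for k, c in counts.items() if c > 1)
    schema_invalid = sum(1 for r in rows if not required <= r.keys())
    stale = sum(
        1 for r in rows
        if "assigned_at" in r
        and now - datetime.fromisoformat(r["assigned_at"].replace("Z", "+00:00")) > freshness_slo
    )
    return {"duplicates": duplicates, "schema_invalid": schema_invalid, "stale": stale}
```

Any nonzero number here should page someone before an analyst ever queries the table.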
Assignment, identity, and exposure: where truth is made
This is where most platforms trip.
- Deterministic bucketing with a stable salt. Never change the salt mid-experiment.
-- BigQuery: stable assignment per user and experiment
-- (bucket computed in a subquery; BigQuery can't reference a select alias in the same list)
select
  user_id,
  bucket,
  case when bucket < 50 then 'A' else 'B' end as variant
from (
  select
    user_id,
    mod(abs(farm_fingerprint(concat(cast(user_id as string), ':', 'pricing_v2_salt'))), 100) as bucket
  from source.users
);

Bake the assignment into a function/package and version it. Audit by recomputing expected buckets from raw inputs and comparing to stored variant.
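The audit job itself can be sketched like this. The row shape and the hash rule are assumptions (a 50/50 A/B split on sha256); swap in whatever deterministic rule your serving path actually ships.

```python
import hashlib

def expected_variant(user_id: str, experiment_key: str, salt: str) -> str:
    # Recompute the bucket exactly as the serving path does.
    # Assumption: 50/50 A/B on a sha256 hash; replace with your real rule.
    h = int(hashlib.sha256(f"{user_id}:{experiment_key}:{salt}".encode()).hexdigest(), 16)
    return "A" if h % 100 < 50 else "B"

def audit(rows):
    """Flag stored assignments that disagree with the recomputed bucket.

    Each row carries the raw bucketing inputs plus the stored variant
    (hypothetical shape). Nonzero drift means a salt or library change.
    """
    return [
        r for r in rows
        if expected_variant(r["user_id"], r["experiment_key"], r["salt"]) != r["variant"]
    ]
```

Run it nightly over active experiments; a redeploy that changes a salt shows up as a wall of drifted rows instead of a quietly corrupted test.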
- Exposure idempotency: retries happen. Carry a unique exposure_id from client/server. Upsert downstream.
-- Snowflake: dedupe stream into exposure table
create or replace table mart.exposures as
select * from (
  select *, row_number() over (partition by exposure_id order by ingested_at asc) as rn
  from raw.exposure_stream
) where rn = 1;

- Identity resolution: cookie to user merges will move traffic across variants if you assign on the wrong key. Pick a primary bucketing key and stick with it. If you must migrate, freeze the cohort: once exposed under anonymous_id, carry that variant forward after login.
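The cohort freeze can be sketched as follows. The `frozen` store here is a plain dict standing in for whatever keyed table or cache you actually use, and `assign` is your deterministic assignment function.

```python
def resolve_variant(user_id, anonymous_id, frozen, assign):
    """Cohort freeze on identity merge.

    If either identity was already exposed, its variant wins and the
    merged user_id is pinned to it; only first-time units get assigned
    on the primary bucketing key.
    """
    for key in (user_id, anonymous_id):
        if key in frozen:
            frozen[user_id] = frozen[key]  # pin the merged identity
            return frozen[key]
    frozen[user_id] = assign(user_id)  # first exposure: assign on the primary key
    return frozen[user_id]

# Exposed as an anonymous cookie first, then logs in: the variant must not
# flip, even if hashing the new user_id would have produced 'A'.
frozen = {"a1b2c3": "B"}
assert resolve_variant("12345", "a1b2c3", frozen, lambda uid: "A") == "B"
```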
- Holdouts and mutual exclusion: define audiences and exclusions in the registry; don’t let overlapping tests fight.
Registry entry (Git, reviewed via PR):
# registry/experiments/pricing_v2.yml
experiment_key: pricing_v2
owner: pricing-team@company.com
variants:
  - A
  - B
traffic_allocation: 1.0
bucket_key: user_id
salt: "1f31a7e0"
audience:
  country_in: [US, CA]
exclusions:
  - experiment_key: onboarding_v3
metrics:
  primary: [revenue_per_user]
  guardrail: [refund_rate, latency_p95]
start: 2025-11-01
end: null
exposure_source: web-server
analysis_window_days: 14

Stats that won’t embarrass you
I’ve seen teams argue Bayesian vs frequentist while their exposure logs are Swiss cheese. Fix the pipeline first. Then:
- Pre-register metrics in the registry; analysis jobs read from the same source as dbt models.
- Variance reduction: use CUPED or stratification when you’ve got strong pre-period features.
- Sequential monitoring: stop-peeking sins are expensive. Use mSPRT/always-valid tests or commit to fixed horizons.
- Multiple comparisons: control FDR (Benjamini–Hochberg) across simultaneous tests.
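The FDR step above is small enough to show in full: a plain Benjamini–Hochberg pass over the p-values of the active experiment set (a toy implementation; stats libraries ship equivalents).

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices of hypotheses rejected under Benjamini-Hochberg FDR control."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha,
    # then reject the k smallest p-values (the step-up rule).
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

# 30 concurrent experiments: only the strongest signal survives
pvals = [0.001, 0.009, 0.04, 0.20] + [0.5] * 26
print(benjamini_hochberg(pvals))  # -> [0]
```

With 30 concurrent tests and naive per-test alpha of 0.05, you expect more than one false "winner" per cycle; BH is the cheapest way to stop shipping them.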
Example: CUPED adjustment in BigQuery using a pre-period metric:
with pre as (
  select user_id, sum(pre_revenue) as pre_rev
  from mart.preperiod_revenue
  group by 1
),
exp as (
  select e.user_id, e.variant, sum(post_revenue) as post_rev
  from mart.experiment_user_revenue e
  group by 1, 2
),
coef as (
  select corr(pre.pre_rev, exp.post_rev) * stddev(exp.post_rev) / nullif(stddev(pre.pre_rev), 0) as theta
  from pre join exp using(user_id)
),
adjusted as (
  select exp.variant,
         avg(exp.post_rev - (select theta from coef) * coalesce(pre.pre_rev, 0)) as cuped_mean
  from exp left join pre using(user_id)
  group by 1
)
select * from adjusted;

And yes, if you need an off-the-shelf engine, Optimizely’s Stats Engine and LinkedIn’s LiX papers are good references. But your platform still needs deterministic assignment, idempotent exposures, and fixed metrics, or the stats don’t matter.
The 30/60-day implementation playbook
You don’t need a 12-month platform rebuild. Here’s how we land this without boiling the ocean.
- Week 1–2: Stabilize the edges
- Ship schema contracts for exposures; reject on missing exposure_id, experiment_key, variant.
- Add dedupe to your stream sink; upsert on exposure_id.
- Implement Prometheus freshness SLOs for exposures and metrics.
- Week 3–4: Deterministic assignment + registry
- Move assignment to a single library with a fixed salt; add an audit job comparing expected vs actual for active experiments.
- Create a minimal Git-backed experiment registry (YAML) with owner, audience, salt, metrics, and window.
- Lock analysis jobs to the registry; no registry, no test.
- Week 5–6: Metrics hardening
- Migrate primary metrics into dbt models; add not-null, uniqueness, and referential tests.
- Implement CUPED or stratified analysis where it pays off (high variance metrics).
- Add FDR control across the active experiment set.
- Week 7–8: Observability + backfills
- Wire OpenLineage from ingestion -> exposures -> metrics -> dashboards.
- Add a backfill path for late events with idempotent upserts.
- Document on-call runbooks; measure MTTR for broken experiments.
Roll this out behind feature flags. Canary it on one product funnel before you touch revenue-critical tests.
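The "no registry, no test" gate from weeks 3–4 can be a few lines. This sketch assumes the entry has already been parsed from YAML (e.g. via `yaml.safe_load` of the pricing_v2 file above) into a dict; the required-field set is illustrative, not exhaustive.

```python
REQUIRED = {"experiment_key", "owner", "variants", "bucket_key", "salt", "metrics"}

def registry_gate(entry: dict) -> list[str]:
    """Return the list of blocking problems for an experiment registry entry.

    Analysis jobs refuse to run unless this comes back empty.
    """
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - entry.keys())]
    if not entry.get("metrics", {}).get("primary"):
        problems.append("no pre-registered primary metric")
    if len(entry.get("variants", [])) < 2:
        problems.append("need at least two variants")
    return problems
```

Wire it into CI on the registry repo and into the analysis job's startup path, so a test without an owner, salt, or pre-registered metric simply never produces a readout.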
Outcomes you can take to the exec meeting
What we’ve delivered with this design at mid-market scale:
- False-positive rate down 60–80%, measured by synthetic A/A tests over 30 days.
- Time-to-decision cut from days to hours (p95 exposure freshness < 2 minutes; metric join < 15 minutes).
- MTTR for broken experiments < 2 hours with SLO-backed alerts and runbooks.
- Assignment consistency > 99.99%, confirmed by nightly audits.
- Real business impact: a retail client avoided shipping a “+3%” variant that would’ve cost ~$1.2M/quarter after dedupe fixed a silent retry bug.
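The A/A measurement behind that first number can be sketched as a simulation: feed identically distributed groups through a plain two-sample z-test and confirm the rejection rate stays near alpha. This is a toy harness; the real version replays production exposures through your actual analysis path, where dupes and drift push the rate well above alpha.

```python
import math
import random
import statistics

def aa_false_positive_rate(n_tests=200, n_users=2000, seed=7):
    """Run synthetic A/A tests on identically distributed groups.

    With a healthy pipeline the two-sided z-test at alpha = 0.05 should
    reject about 5% of the time; a broken pipeline rejects far more.
    """
    rng = random.Random(seed)
    rejects = 0
    for _ in range(n_tests):
        a = [rng.gauss(10, 3) for _ in range(n_users)]
        b = [rng.gauss(10, 3) for _ in range(n_users)]
        se = math.sqrt(statistics.variance(a) / n_users + statistics.variance(b) / n_users)
        z = (statistics.mean(a) - statistics.mean(b)) / se
        if abs(z) > 1.96:  # two-sided, alpha = 0.05
            rejects += 1
    return rejects / n_tests
```

Track this number weekly; it is the single best regression test for the whole pipeline, stats included.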
If your current stack was assembled by interns, “vibe coded” by AI, or accreted through five PMs and two reorgs, don’t scrap it. Put in the guardrails, add the registry, and measure the hell out of it. That’s how you make A/B results trustworthy—and worth real money.
Key takeaways
- Your experiment pipeline fails first at exposure logging and identity—fix those before arguing about p-values.
- Design for exactly-once semantics with idempotent exposures, deterministic assignment, and late event tolerance.
- Enforce data quality at the edges using contracts, tests, and SLOs; don’t rely on dashboards to catch corruption.
- Freeze metric definitions in code (dbt + semantic layer), not in slide decks.
- Adopt a Git-based experiment registry; no test goes live without a registry entry and data checks passing.
- Measure pipeline business impact with time-to-decision, false-positive rate, and MTTR for broken experiments.
Implementation checklist
- Create an experiment registry with YAML-backed metadata checked into Git.
- Implement deterministic bucketing with a stable salt and audit for drift.
- Log idempotent exposures with a unique `exposure_id` and retry-safe ingestion.
- Add Great Expectations/dbt tests for exposure uniqueness, assignment consistency, and metric completeness.
- Set Prometheus SLOs for pipeline freshness and exposure-to-metric join latency.
- Introduce sequential testing or pre-registration to control false discoveries.
- Instrument lineage with OpenLineage and tag experiments across the DAG.
Questions we hear from teams
- Batch or streaming for exposures?
- Stream the exposure log with exactly-once or idempotent upserts so analysis stays fresh and deduped. Batch the heavy metrics if you must, but keep exposure truth in a low-latency stream so you can alert on drift and dedupe in near-real time.
- How do we backfill without corrupting analysis?
- Backfill with the same idempotent keys and write paths; tag backfilled rows and exclude them from freshness SLOs. Re-run dbt models with a controlled `valid_from` watermark and do a dual-run diff (shadow table) before swapping.
- What about bots and internal traffic?
- Filter at ingestion using bot lists and UA heuristics; mark internal IP ranges and authenticated staff. Keep these rules versioned and testable, and store the flags so analysts can audit exclusions.
- We use a vendor (Optimizely/LaunchDarkly/Statsig). Still relevant?
- Yes. Vendors help with assignment and stats, but you still own exposure truth, metric definitions, joins, and quality SLOs. We routinely integrate vendor exposure streams into the same guardrails and registry.
- We have AI-generated analytics code. Is that a problem?
- Only if you trust it without tests. We see LLM-written SQL that ignores idempotency and windows. Wrap it with contracts, dbt tests, and lineage, or bring in a cleanup pass (we call it vibe code cleanup) before it touches production decisions.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
