Stop Shipping Fake Wins: The A/B Pipeline That Doesn’t Lie
Design A/B testing data pipelines that produce trustworthy results and business value, not ghost wins.
“Most A/B test failures aren’t statistical—they’re pipeline failures. Fix the plumbing, and the truth gets a lot quieter and a lot clearer.”
The experiment that “won” and cost us 7% revenue
A few years back, a DTC retailer saw a “+3.2% lift” in AOV from a checkout tweak. Execs celebrated. Two weeks later, finance flagged a 7% revenue dip on iOS. We traced it to the A/B pipeline: exposure events lagged page renders by ~6 seconds on mobile. Safari’s backgrounding killed the request half the time. Our metric model joined outcomes to the earliest exposure in a 7-day window, so many buyers were counted as exposed to treatment when they actually saw control. Classic ghost win.
I’ve seen this movie at SaaS, marketplaces, and media companies running Optimizely, LaunchDarkly, homegrown flag systems, even Amplitude experiments. The root cause isn’t p-values—it’s unreliable data plumbing. If your pipeline lies, your p-values are just making the lies look scientific.
The usual A/B data failures (and why leaders should care)
What goes wrong, repeatedly:
- SRM (sample ratio mismatch): Control 50% / Treatment 50% expected, you get 57/43 observed. Causes: bot filters, geo gating, client SDK bugs, late exposure logs. If you’re not auto-alerting on SRM within minutes, you’re making decisions on biased samples.
- Nondeterministic assignment: Randomization in the client with `Math.random()` (yes, still happening) or per-request assignment. Users pinball between buckets across devices or sessions.
- Missing or late exposures: SPA route changes, ad blockers, mobile backgrounding. Exposure gets recorded after the conversion or not at all.
- Double counting: Retries without idempotency keys, ETL duplications, at-least-once delivery without dedupe.
- Inconsistent units: You randomize by `user_id` but analyze by `device_id` or cookie. Or vice versa.
- Schema drift: Fields disappear/rename silently. Nullable booleans change to strings. Warehouse joins start skewing without anyone noticing.
- Attribution window drift: Product thinks 7 days, data model uses 14. Re-run backfills and you “move the goalposts” after the fact.
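SRM is the cheapest of these failures to catch automatically. A minimal chi-squared check, as a sketch (stdlib only; the counts and 50/50 split are illustrative):

```python
# Sketch: SRM check for a two-variant experiment. Counts and the planned
# 50/50 split below are illustrative, not from a real experiment.

def srm_chi2(observed: dict, planned: dict) -> float:
    """Chi-squared statistic comparing observed bucket counts to planned allocation."""
    total = sum(observed.values())
    return sum(
        (observed[v] - total * planned[v]) ** 2 / (total * planned[v])
        for v in planned
    )

# The 57/43 split from the bullet above, at 10k users, is a screaming mismatch:
chi2 = srm_chi2({"control": 5700, "treatment": 4300},
                {"control": 0.5, "treatment": 0.5})
srm_detected = chi2 > 3.841  # p < 0.05 at 1 degree of freedom
```

In production you would run this per experiment on a schedule and page when it trips, rather than eyeballing dashboards.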
Business impact:
- Wasted roadmap capacity (I’ve seen quarters burned chasing false lifts)
- SLO and MTTR hits when incidents come from “bad analysis” instead of observable failures
- Loss of trust: product rolls their own metrics (shadow IT) and you’re back in data anarchy
An architecture that actually works
Here’s the pattern we implement at GitPlumbers. It’s boring in the best way: deterministic, testable, and observable.
- Ingestion: `Kafka`/`Kinesis` with a schema registry (Confluent) backing `Avro`/`Protobuf` or JSON Schema. Require `event_id` (UUID), `event_timestamp` (server time in UTC), `user_id`, `experiment_key`, `variant_key` when relevant.
- Contracts: Versioned schemas with compatibility rules. Violations get parked in a dead-letter topic with alerts—don’t let them leak into the warehouse.
- Storage: Land raw events in object storage (`s3://`/`gs://`) partitioned by `dt=YYYY-MM-DD`. Load into BigQuery/Snowflake staging tables with ingestion-time partitions.
- Transform: `dbt` builds idempotent models: `stg_events` → `fct_exposures`, `fct_outcomes`, and finally `fct_experiment_metrics`.
- Assignment service: Server-side, deterministic hash. Log both `assignment` and `exposure` explicitly. Don’t infer exposure from clicks.
- Quality gates: `Great Expectations`/`Soda` + `dbt tests` on freshness, uniqueness, relationships. Fail fast in CI.
- Monitoring: SRM checks, freshness SLOs, unknown-variant rate. Alert in Slack/PagerDuty with links to lineage (OpenLineage/Marquez or Monte Carlo).
Principle: exposure and assignment are first-class facts, not derived guesses.
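The contract gate in front of the dead-letter topic can start very small. A sketch, assuming plain lists stand in for the good topic and the dead-letter queue:

```python
import re

# Required fields from the ingestion contract described above
REQUIRED = ("event_id", "event_timestamp", "user_id", "experiment_key", "variant_key")
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")

def route(event: dict, good: list, dead_letter: list) -> None:
    """Append the event to `good` if it satisfies the contract, else park it with a reason."""
    missing = [f for f in REQUIRED if f not in event]
    if missing or not UUID_RE.match(str(event.get("event_id", ""))):
        dead_letter.append({"event": event, "reason": missing or ["bad event_id"]})
    else:
        good.append(event)

good, dlq = [], []
route({"event_id": "123e4567-e89b-12d3-a456-426614174000",
       "event_timestamp": "2025-01-01T00:00:00Z", "user_id": "u1",
       "experiment_key": "checkout_copy", "variant_key": "treatment"}, good, dlq)
route({"user_id": "u2"}, good, dlq)  # missing fields: parked, not leaked
```

A real deployment enforces this at the schema registry, but the routing logic is the same: bad events never reach the warehouse, and the parked reason makes triage fast.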
Idempotent dedupe in the warehouse:
```sql
-- BigQuery/Snowflake: dedupe raw events by event_id, keeping the earliest-timestamped copy
create or replace table mart.stg_events_dedup as
select
  event_id,
  user_id,
  event_name,
  experiment_key,
  variant_key,
  amount,  -- carried through for the outcome joins downstream
  event_timestamp
from raw.events
where dt between '{{ var("start_date") }}' and '{{ var("end_date") }}'
qualify row_number() over (partition by event_id order by event_timestamp) = 1;
```

Deterministic assignment (server-side):
```python
import mmh3  # third-party: pip install mmh3

def assign_variant(user_id: str, experiment_key: str, traffic_pct: int, seed: int = 13) -> str:
    """Deterministic: the same user always lands in the same bucket for a given experiment."""
    key = f"{seed}:{experiment_key}:{user_id}"
    bucket = mmh3.hash(key, signed=False) % 100
    return "treatment" if bucket < traffic_pct else "control"
```

Or in BigQuery for audits:
```sql
-- BigQuery: recompute buckets to audit assignments.
-- Note: farm_fingerprint is not murmur3; to audit a service, use the same
-- hash function on both sides.
with params as (
  select 13 as seed, 'checkout_copy' as experiment_key
), bucketed as (
  select
    u.user_id,
    abs(mod(farm_fingerprint(concat(cast(p.seed as string), ':', p.experiment_key, ':', u.user_id)), 100)) as bucket
  from users u
  cross join params p
)
select
  user_id,
  bucket,
  case when bucket < 50 then 'treatment' else 'control' end as variant_key
from bucketed;
```

Make exposures impossible to fake
You need two separate events:
- assignment: when the server decides the variant
- exposure: when the user could reasonably perceive the treatment (e.g., component rendered)
Both must carry user_id, experiment_key, variant_key, and timestamps.
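A sketch of what the two events might look like from a server-side service (field names follow the contract above; the helper is hypothetical):

```python
import uuid
from datetime import datetime, timezone

def make_event(name: str, user_id: str, experiment_key: str, variant_key: str) -> dict:
    """Both events share one shape; event_id gives downstream dedupe an idempotency key."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_name": name,
        "event_timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "experiment_key": experiment_key,
        "variant_key": variant_key,
    }

# Server decides -> assignment; component actually renders -> exposure.
# Two distinct facts, logged separately, never inferred from clicks.
assignment = make_event("experiment_assignment", "u1", "checkout_copy", "treatment")
exposure = make_event("experiment_exposure", "u1", "checkout_copy", "treatment")
```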
Build canonical exposure and outcome tables:
```sql
-- One exposure per (user, experiment) at the earliest true exposure time
create or replace table mart.fct_exposures as
select
  experiment_key,
  user_id,
  any_value(variant_key) as variant_key,
  min(event_timestamp) as exposed_at
from mart.stg_events_dedup
where event_name = 'experiment_exposure'
group by 1, 2;

-- Outcomes joined in a fixed window (document this!)
create or replace table mart.fct_outcomes as
select
  e.experiment_key,
  e.user_id,
  e.variant_key,
  e.exposed_at,
  sum(case when o.event_name = 'purchase' then o.amount else 0 end) as revenue_7d,
  countif(o.event_name = 'purchase') as orders_7d
from mart.fct_exposures e
left join mart.stg_events_dedup o
  on o.user_id = e.user_id
  and o.event_timestamp between e.exposed_at and timestamp_add(e.exposed_at, interval 7 day)
group by 1, 2, 3, 4;
```

Then metrics (per experiment):
```sql
create or replace table mart.fct_experiment_metrics as
select
  experiment_key,
  variant_key,
  count(*) as users,
  sum(orders_7d) as orders,
  sum(revenue_7d) as revenue,
  avg(revenue_7d) as revenue_per_user,
  sum(revenue_7d) / nullif(sum(orders_7d), 0) as aov  -- AOV is revenue per order, not per user
from mart.fct_outcomes
group by 1, 2;
```

A few hard-won rules:
- Unit consistency: Randomize and analyze on the same ID. If you must use devices, create a stable `household_id` or `account_id` and accept trade-offs.
- Time discipline: All timestamps UTC; conversions use a documented window. Don’t silently change windows between runs.
- Mobile realities: Fire exposure server-side where possible (e.g., render API response includes exposure log). For client-only exposures, queue to a durable channel and flush on app foreground events.
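The queue-and-flush idea for client-only exposures can be sketched like this (an append-only file stands in for whatever durable channel the client platform offers; retries rely on `event_id` dedupe in the warehouse):

```python
import json
import os
import tempfile

class ExposureQueue:
    """Sketch: durable exposure queue flushed on app foreground."""
    def __init__(self, path: str):
        self.path = path

    def enqueue(self, event: dict) -> None:
        with open(self.path, "a") as f:   # append-only: survives crashes/backgrounding
            f.write(json.dumps(event) + "\n")

    def flush(self, send) -> int:
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            events = [json.loads(line) for line in f]
        for e in events:
            send(e)                        # duplicates are safe: event_id dedupes downstream
        os.remove(self.path)               # clear only after a successful send
        return len(events)

path = os.path.join(tempfile.mkdtemp(), "exposures.ndjson")
q = ExposureQueue(path)
q.enqueue({"event_id": "e1", "event_name": "experiment_exposure"})
sent = []
flushed = q.flush(sent.append)  # call this on app-foreground events
```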
Contracts, tests, and SRM checks: your early warning system
Stop schema drift and silent metric skew with contracts and automated tests.
Data contract example (JSON Schema):
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "experiment_exposure",
  "type": "object",
  "required": ["event_id", "event_timestamp", "user_id", "experiment_key", "variant_key"],
  "properties": {
    "event_id": {"type": "string", "pattern": "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"},
    "event_timestamp": {"type": "string", "format": "date-time"},
    "user_id": {"type": "string"},
    "experiment_key": {"type": "string"},
    "variant_key": {"type": "string", "enum": ["control", "treatment"]}
  }
}
```

dbt tests that actually catch issues:
```yaml
# models/experiments.yml
version: 2
models:
  - name: fct_exposures
    columns:
      - name: user_id
        tests:
          - not_null
      - name: experiment_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_experiments')
              field: experiment_key
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: [experiment_key, user_id]

sources:
  - name: raw
    tables:
      - name: events
        loaded_at_field: event_timestamp
        freshness:
          warn_after: {count: 15, period: minute}
          error_after: {count: 60, period: minute}
```

SRM monitoring (compute chi-squared). If it trips, you stop the analysis and page someone.
```sql
with counts as (
  select experiment_key, variant_key, count(*) as n
  from mart.fct_exposures
  where exposed_at >= timestamp_sub(current_timestamp(), interval 1 hour)
  group by 1, 2
), totals as (
  select experiment_key, sum(n) as total_n from counts group by 1
), expected as (
  select c.experiment_key, c.variant_key,
    t.total_n * 0.5 as expected_n  -- adjust for the planned allocation
  from counts c join totals t using (experiment_key)
)
select c.experiment_key,
  sum(power(c.n - e.expected_n, 2) / e.expected_n) as chi2
from counts c join expected e using (experiment_key, variant_key)
group by 1
having chi2 > 3.841;  -- p < 0.05 with 1 degree of freedom (two variants)
```

Add variance reduction (CUPED) to shrink confidence intervals:
```sql
-- Pre-period spend as covariate; coalesce so users without pre-period data aren't dropped
with base as (
  select o.user_id, o.experiment_key, o.variant_key, o.revenue_7d,
    coalesce(pre.revenue_7d, 0) as pre_rev
  from mart.fct_outcomes o
  left join mart.fct_outcomes_preperiod pre
    using (user_id, experiment_key)
), theta as (
  select corr(revenue_7d, pre_rev) * stddev(revenue_7d) / nullif(stddev(pre_rev), 0) as t
  from base
)
select b.experiment_key, b.variant_key,
  avg(b.revenue_7d - theta.t * b.pre_rev) as cuped_revenue
from base b cross join theta
group by 1, 2;
```

Quality guardrails to page on:
- Freshness SLO violated for exposures or outcomes
- SRM p-value < 0.05 for >15 minutes
- Unknown `variant_key` rate > 0.5%
- Sudden drop in exposure→outcome linkage rate
Ship it like software: CI/CD, backfills, lineage, and MTTR
Your experiment pipeline deserves the same DevOps discipline as production services.
- GitOps for analytics: Version everything: `dbt` models, contracts, Airflow/Dagster code, threshold configs. Use PR checks with unit datasets.
- CI checks: Run `dbt build` with seed data, execute Great Expectations suites, and lint SQL.
- Idempotent backfills: Partition by date and experiment. Never mutate history without a migration note.
- Lineage and observability: Emit OpenLineage events from Airflow to Marquez/Monte Carlo. Tie alerts to owners.
Airflow DAG with idempotent daily partitions and SLA:
```python
# airflow
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

dag = DAG(
    'experiment_metrics',
    start_date=datetime(2024, 1, 1),
    schedule='@hourly',
    catchup=True,
    default_args={
        'retries': 2,
        'retry_delay': timedelta(minutes=5),
        'sla': timedelta(minutes=20),
    },
)

# "{{ ds }}" is rendered per run, so reruns of the same partition are idempotent
build = BashOperator(
    task_id='dbt_build',
    bash_command="dbt build --select tag:experiments --vars 'run_date: {{ ds }}'",
    dag=dag,
)
```

Backfill safely:
```shell
dbt build --select tag:experiments --vars '{"start_date": "2025-01-01", "end_date": "2025-01-31"}'
```

SLOs we track with leaders:
- Freshness SLO: exposures and outcomes <15 minutes lag, 99% monthly
- SRM detection MTTA <10 minutes; MTTR <60 minutes
- Unknown-variant rate <0.2% per experiment
- Re-run determinism: repeated builds produce identical aggregates (checksum)
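The determinism SLO is easy to mechanize: checksum the aggregates after every rebuild and compare. A sketch (the row data is illustrative):

```python
import hashlib
import json

def aggregate_checksum(rows: list) -> str:
    """Stable checksum: sort rows and keys so row order can't change the hash."""
    canonical = json.dumps(
        sorted(rows, key=lambda r: json.dumps(r, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

run1 = [{"variant_key": "control", "users": 5000},
        {"variant_key": "treatment", "users": 5001}]
run2 = list(reversed(run1))  # same aggregates, different row order
deterministic = aggregate_checksum(run1) == aggregate_checksum(run2)
```

If two builds of the same partition produce different checksums, that is a data incident, not a rounding quirk.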
Results in the wild + a one-page checklist
What happens when you get this right:
- A consumer subscription company migrating from Amplitude experiments to LaunchDarkly + Snowflake + dbt cut SRM incidents by 83% and reduced experiment decision time from 7 days to 3 days (variance reduction + reliable data). Unknown-variant rate dropped from 1.6% to 0.1% in two sprints.
- A marketplace with BigQuery + Dataflow + Kafka saw MTTR on experiment data incidents fall from 6 hours to 35 minutes after adding schema contracts and SRM paging. Finance stopped finding “surprise” reversals post-launch.
Use this checklist this week:
- Confirm the unit of randomization and audit that it matches analysis joins.
- Move assignment server-side and make it deterministic with a seeded hash.
- Emit explicit `assignment` and `exposure` events with UTC timestamps.
- Add `event_id` and build warehouse dedupe.
- Stand up contracts (schema registry) and park violations.
- Materialize canonical `fct_exposures`, `fct_outcomes`, and `fct_experiment_metrics`.
- Automate SRM checks and set thresholds; page on-call when tripped.
- Add CUPED pre-period covariates for your top 1–2 KPIs.
- Wire CI to run dbt tests and QE suites on every PR.
- Define freshness and quality SLOs; report them monthly to product leadership.
If you want a second set of eyes, GitPlumbers has rebuilt these pipelines for teams on Snowflake, BigQuery, and Databricks, with feature flags from LaunchDarkly, Optimizely, and homegrown systems. We’ll help you stop shipping fake wins and start shipping confidently.
Key takeaways
- Your A/B pipeline fails when assignment and exposure are nondeterministic—fix that first.
- Use data contracts and schema registry to stop silent event drift.
- Build dedicated, idempotent exposure, assignment, and outcome tables; don’t join raw clickstream on the fly.
- Continuously monitor SRM and freshness SLOs; alert within minutes, not days.
- Bake in variance reduction (CUPED) and guardrail metrics so product doesn’t cherry-pick wins.
- Operationalize with CI/CD (dbt tests, data quality checks) and safe backfills to avoid reintroducing lies.
Implementation checklist
- Decide the unit of randomization (user, account) and enforce it end-to-end.
- Implement deterministic assignment with a seeded hash; log an explicit `assignment` and `exposure` event.
- Create idempotent event ingestion with `event_id` and dedupe windows.
- Stand up contracts (JSON Schema/Avro) in a schema registry; reject/park bad events.
- Materialize `dim_experiments`, `fct_exposures`, `fct_outcomes`, and `fct_experiment_metrics` in the warehouse.
- Add dbt tests for not-null, unique combinations, relationships, and freshness.
- Automate SRM checks and metric anomaly detection; page on-call if violated.
- Add CUPED columns (pre-period covariate) to reduce variance; document windows.
- Wire Airflow/Dagster for daily recompute with idempotent partitions and backfill scripts.
- Track SLOs: freshness, SRM false-positive rate, unknown-variant rate, MTTR.
Questions we hear from teams
- Do I need a feature flag vendor to do this right?
- No, but it helps. The critical piece is deterministic, server-side assignment and explicit exposure logging. You can implement that with LaunchDarkly/Optimizely, or with a small in-house service. What matters: a seeded hash, consistent unit IDs, and durable logging.
- How do I handle bots and crawlers in experiments?
- Filter them at ingestion (user-agent lists, behavior heuristics) and maintain an allowlist for analysis. Don’t rely only on UA strings—add rate and path-pattern heuristics. Exclude flagged traffic from SRM and metrics by default, but report the excluded proportion as a guardrail.
- What about sequential peeking?
- If you peek, you need sequential tests (e.g., SPRT, alpha spending). Alternatively, fix your horizon (e.g., 7 days) and don’t peek. Whatever you choose, encode it in your pipeline configs so analysts can’t accidentally change the rules mid-flight.
- We have cross-device users. Can we still trust results?
- Yes, if your identity graph is good enough and consistent with assignment. Prefer account-level randomization when feasible. If not, constrain the experiment to contexts where ID is stable (e.g., post-login flows) or accept dilution and quantify it in pre-experiment power analysis.
- How do I backfill without breaking decisions already made?
- Version your metrics. Write new results to a new table/version with a migration note and comparison report. Don’t rewrite history in-place. If the backfill changes conclusions, escalate with a formal experiment audit—it’s a data incident, treat it like one.
Ready to modernize your codebase?
Let GitPlumbers help you transform AI-generated chaos into clean, scalable applications.
