Your A/B Test “Won” — Then Finance Asked Why Revenue Didn’t Move

Design experiment data pipelines that survive late events, dupes, schema drift, and metric ambiguity—so your decisions hold up in the board deck and the postmortem.

If your A/B tests don’t have deterministic assignment, real exposure logs, and automated SRM gates, you’re not running experiments—you’re running opinions with charts.

The day your “stat sig” result meets the real world

I’ve watched teams celebrate a clean p-value, ship the winner, and then get blindsided when revenue doesn’t move—or worse, moves the other direction. The postmortem is always the same:

  • “Maybe seasonality?”
  • “Maybe novelty effects?”
  • “Maybe the metric was wrong?”

Nine times out of ten, it’s not the statistics. It’s the data pipeline: duplicated events, inconsistent assignment, exposure tracked in the wrong place, late-arriving conversions, or a metric definition that changed halfway through the experiment.

If you want experiment results that survive scrutiny from finance, growth, and your future investors, you need a pipeline designed for reproducibility, data quality, and business value delivery—not just a dashboard that updates.

What actually breaks experiment results (the unsexy failure modes)

Here are the repeat offenders I’ve seen across B2C apps, PLG SaaS, and marketplace stacks:

  • Sample Ratio Mismatch (SRM): expected 50/50, observed 53/47. Often caused by bucketing bugs, caching, client-side assignment, or filtering that disproportionately drops one variant.
  • Exposure ≠ assignment: you assign users, but you don’t reliably log when they could see the treatment. Then you “analyze” a mix of exposed and unexposed users and call it causal.
  • Join leakage: user identifiers differ across systems (anonymous_id vs user_id vs account_id). Your join drops mobile users, logged-out traffic, or anyone whose cookies are capped by Safari’s Intelligent Tracking Prevention (ITP).
  • Duplicates and retries: mobile SDK retries, webhook retries, Kafka replays, or backfills inflate conversions in one cohort.
  • Late events: payments arrive days later, CRM writes lag, attribution windows get mangled, and your “final” result keeps shifting.
  • Metric definition drift: someone “fixes” revenue to exclude refunds midway through the test, but the experiment comparison mixes old/new logic.

Two terms in plain English before we lean on them:

  • Technical debt is the accumulated cost of shortcuts (like ad-hoc experiment SQL) that turns into slower delivery and higher incident risk later.
  • An SLO (Service Level Objective) is a reliability target, e.g., “95% of experiment metrics updated by 9am UTC with <0.5% missing exposures.”
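
To make that SLO example measurable, here is a minimal Python sketch of how it could be scored. The `DailyRun` record, the `slo_report` function, and the idea of comparing expected against modeled exposure counts are all assumptions for illustration; the 95% and 0.5% targets come from the example above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DailyRun:
    """One day's pipeline run (hypothetical record shape)."""
    on_time: bool            # metrics landed by the 9am UTC deadline
    expected_exposures: int  # e.g. counted from raw events
    modeled_exposures: int   # what actually reached the modeled table


def slo_report(runs: Iterable[DailyRun],
               freshness_target: float = 0.95,
               max_missing_rate: float = 0.005) -> dict:
    """Score freshness and completeness against the SLO targets."""
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run to evaluate the SLO")
    on_time_share = sum(r.on_time for r in runs) / len(runs)
    expected = sum(r.expected_exposures for r in runs)
    # Only count shortfalls; extra modeled rows are a dedup problem, not a gap.
    missing = sum(max(0, r.expected_exposures - r.modeled_exposures) for r in runs)
    missing_rate = missing / expected if expected else 0.0
    return {
        "on_time_share": on_time_share,
        "missing_rate": missing_rate,
        "slo_met": on_time_share >= freshness_target and missing_rate < max_missing_rate,
    }
```

Feeding it the last 20 to 30 daily runs gives you a number to put on a dashboard instead of a feeling.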

A pipeline shape that holds up: assignment → exposure → outcomes

The core design principle: experiments are a first-class dataset with three canonical facts.

  1. Assignment (who was supposed to get what)
  2. Exposure (who actually had a chance to experience it)
  3. Outcomes (the business events you measure)

If you blur these, you end up with “trust me” analytics.

Canonical tables (minimum viable)

  • experiment_assignments
    • keys: experiment_id, unit_id, assignment_id
    • columns: variant, assigned_at, bucketing_version, config_snapshot_hash
  • experiment_exposures
    • keys: experiment_id, unit_id, exposure_id
    • columns: variant, exposed_at, surface (where exposure happened)
  • experiment_outcomes
    • keys: unit_id, event_id
    • columns: event_type, event_at, value (e.g., revenue), dimensions

Business outcome: when these are clean, you can answer “Did the treatment change X?” without arguing about the plumbing.
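
As a sketch of what those three facts look like in code, the record types below mirror the keys and columns listed above. The Python types are assumptions (the list above does not specify them), and `variant_conflicts` is a hypothetical helper showing one cheap consistency check between assignment and exposure.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Assignment:
    experiment_id: str
    unit_id: str
    assignment_id: str
    variant: str
    assigned_at: datetime
    bucketing_version: str
    config_snapshot_hash: str


@dataclass(frozen=True)
class Exposure:
    experiment_id: str
    unit_id: str
    exposure_id: str
    variant: str
    exposed_at: datetime
    surface: Optional[str] = None  # where the exposure happened


@dataclass(frozen=True)
class Outcome:
    unit_id: str
    event_id: str
    event_type: str
    event_at: datetime
    value: float = 0.0             # e.g. revenue
    dimensions: dict = field(default_factory=dict)


def variant_conflicts(assignments: List[Assignment],
                      exposures: List[Exposure]) -> List[Exposure]:
    """Exposures whose variant disagrees with the assignment, or has none."""
    assigned = {(a.experiment_id, a.unit_id): a.variant for a in assignments}
    return [e for e in exposures
            if assigned.get((e.experiment_id, e.unit_id)) != e.variant]
```

If `variant_conflicts` ever returns rows, something upstream is re-bucketing users or logging the wrong variant, and that is worth knowing before anyone reads a lift number.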

One non-negotiable: deterministic unit identity

Pick the unit (unit_id) you’re randomizing on and document it:

  • Consumer app: usually user_id, with a fallback to device_id until login
  • B2B SaaS: often account_id (or workspace_id) if the feature impacts teams

If you need a hierarchy, define it explicitly (e.g., account_id > user_id > device_id) and be consistent across assignment and exposure. Otherwise SRM and leakage become a lifestyle.
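
A minimal sketch of deterministic assignment, assuming hash-based bucketing with a per-experiment salt. The function names, the `prefix:value` unit format, and the 10,000-bucket granularity are illustrative choices, not a prescribed design.

```python
import hashlib
from typing import Optional


def resolve_unit_id(account_id: Optional[str] = None,
                    user_id: Optional[str] = None,
                    device_id: Optional[str] = None) -> str:
    """Pick the first available identifier: account_id > user_id > device_id."""
    for prefix, value in (("account", account_id),
                          ("user", user_id),
                          ("device", device_id)):
        if value:
            # Prefix the type so "123" as a user never collides with device "123".
            return f"{prefix}:{value}"
    raise ValueError("no identifier available for bucketing")


def assign_variant(experiment_id: str, unit_id: str, salt: str,
                   treatment_share: float = 0.5) -> str:
    """Same inputs always give the same variant; no stored state needed."""
    key = f"{experiment_id}|{salt}|{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < int(treatment_share * 10_000) else "control"
```

One caveat this sketch does not solve: if a unit moves from `device:` to `user:` at login, it hashes to a new bucket. Decide up front whether the first resolved identifier sticks for the life of the experiment, and apply the same rule in assignment and exposure.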

Ingestion: idempotency beats “exactly-once” fairy tales

I’ve been doing this long enough to tell you: exactly-once delivery is a lie the moment you have mobile clients, retries, or backfills. Design for idempotent ingestion.

Practical pattern: event fingerprint + dedup window

Generate a deterministic fingerprint (hash) from stable fields and dedup within a window appropriate to the source.

  • For server-side assignment/exposure: dedup on assignment_id/exposure_id
  • For client events: fingerprint unit_id + event_type + event_at + properties (careful with timestamps—round if needed)

Example (BigQuery SQL) to build a deduped staging table:

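create or replace table analytics_stg.experiment_exposures_dedup as
with ranked as (
  select
    *,
    -- Deterministic fingerprint over the fields that define "the same exposure"
    to_hex(sha256(concat(
      cast(experiment_id as string), '|',
      cast(unit_id as string), '|',
      cast(variant as string), '|',
      -- Truncate to the second so retries with sub-second jitter collapse
      cast(timestamp_trunc(exposed_at, second) as string), '|',
      coalesce(surface, '')
    ))) as exposure_fingerprint,
    -- Rank copies of the same exposure; the most recently ingested row wins
    row_number() over (
      partition by
        experiment_id, unit_id, variant,
        -- coalesce here too, so the partition matches the fingerprint exactly
        timestamp_trunc(exposed_at, second), coalesce(surface, '')
      order by ingested_at desc
    ) as rn
  from raw.experiment_exposures
)
select * except(rn)
from ranked
where rn = 1;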
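
The SQL above dedups exposures in the warehouse. For client events, the same idea can be applied at ingestion. Here is a rough Python sketch, assuming timezone-aware timestamps, JSON-serializable properties, and events processed roughly in arrival order; `event_fingerprint` and `WindowedDeduper` are hypothetical names, and the in-memory dict stands in for whatever state store you actually use.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def event_fingerprint(unit_id: str, event_type: str,
                      event_at: datetime, properties: dict) -> str:
    """Stable hash of the fields that define "the same event"."""
    # Round to the second so retries with sub-second jitter match.
    truncated = event_at.astimezone(timezone.utc).replace(microsecond=0)
    payload = json.dumps(
        [unit_id, event_type, truncated.isoformat(), properties],
        sort_keys=True, separators=(",", ":"), default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class WindowedDeduper:
    """Remembers fingerprints for `window`, then forgets them."""

    def __init__(self, window: timedelta):
        self.window = window
        self._seen = {}  # fingerprint -> received_at of first sighting

    def is_duplicate(self, fingerprint: str, received_at: datetime) -> bool:
        cutoff = received_at - self.window
        # Evict fingerprints older than the window to bound memory.
        self._seen = {fp: ts for fp, ts in self._seen.items() if ts >= cutoff}
        if fingerprint in self._seen:
            return True
        self._seen[fingerprint] = received_at
        return False
```

The window is a trade-off: a replay that arrives after eviction gets through, so size it to the longest retry or backfill delay you expect from that source, and keep the warehouse-side dedup as the backstop.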

Late events: pick a watermark policy and stick to it

Define a watermark like:

  • “We consider outcomes final 7 days after exposure.”
  • “We recompute the last 14 days nightly to catch late payments/refunds.”

That’s a business decision (cash cycle, refund window) expressed as data policy. Without it, your experiment readouts become moving targets and nobody trusts them.
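
Written down as code, the two example policies above could look like this. It is a sketch, assuming a nightly job that reprocesses complete days only; the constant and function names are hypothetical, and the 7 and 14 day values are the examples from the bullets, not recommendations.

```python
from datetime import date, datetime, timedelta
from typing import List

FINALIZATION_DAYS = 7   # outcomes are final this long after exposure
RECOMPUTE_DAYS = 14     # nightly lookback for late payments and refunds


def readout_status(last_exposed_at: datetime, as_of: datetime) -> str:
    """Label a readout so nobody mistakes a moving number for a final one."""
    if as_of >= last_exposed_at + timedelta(days=FINALIZATION_DAYS):
        return "final"
    return "preliminary"


def recompute_partitions(run_date: date,
                         lookback_days: int = RECOMPUTE_DAYS) -> List[date]:
    """Daily partitions to rebuild tonight, oldest first, excluding run_date."""
    return [run_date - timedelta(days=offset)
            for offset in range(lookback_days, 0, -1)]
```

Keeping both numbers as named constants in one place means a change to the policy is a reviewed diff, not a quiet edit inside a query.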

Measurable outcome: teams that implement dedup + watermarks typically cut “metric volatility” (day-to-day result swings) by 50–80% and reduce reruns/backfill incidents.

Transform layer: version your metrics or your results won’t reproduce

If your experiment analysis is “some SQL in a notebook” you’re one edit away from rewriting history.

Here’s what actually works:

  • Use dbt (or equivalent) to model exposures, attribution windows, and metric definitions
  • Version your metrics: metric_revenue_v1, metric_revenue_v2
  • Snapshot experiment configs (allocation, targeting rules, bucketing salt) so you can reproduce the cohort
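
One way to snapshot a config is to hash a canonical serialization of it and store that hash with each assignment. A minimal Python sketch, assuming the config is JSON-serializable; the function name and the example field names are hypothetical.

```python
import hashlib
import json


def config_snapshot_hash(config: dict) -> str:
    """SHA-256 of the config with keys sorted, so key order never matters."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical config shape, for illustration only.
example_config = {
    "experiment_id": "exp_checkout",
    "allocation": {"control": 0.5, "treatment": 0.5},
    "targeting": {"platforms": ["ios", "web"]},
    "bucketing_salt": "s1",
}
example_hash = config_snapshot_hash(example_config)
```

Any change to allocation, targeting, or salt produces a different hash, so a readout that spans two hashes is a readout that spans two experiments.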

A minimal dbt model test setup (Snowflake/BigQuery compatible) that prevents silent breakage:

# models/experiments/schema.yml
version: 2

models:
  - name: fct_experiment_exposures
    columns:
      - name: exposure_id
        tests: [unique, not_null]
      - name: experiment_id
        tests: [not_null]
      - name: unit_id
        tests: [not_null]
      - name: variant
        tests:
          - accepted_values:
              values: ['control', 'treatment']

  - name: mart_experiment_metrics_daily
    columns:
      - name: experiment_id
        tests: [not_null]
      - name: metric_name
        tests: [not_null]
      - name: as_of_date
        tests: [not_null]
      - name: exposed_units
        tests:
          # Column-level form: dbt_utils prepends the column name,
          # so this renders as "exposed_units > 0".
          - dbt_utils.expression_is_true:
              expression: "> 0"

Guardrail: exposure-based attribution

When you compute outcomes, attribute from exposed_at, not assigned_at.

  • Avoids counting users who were assigned but never saw the feature
  • Makes ramp decisions safer (you’re measuring the change people experienced)

If leadership is debating “why do we need exposures?”, the answer is simple: it prevents shipping based on an effect that only exists because instrumentation lied.
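
Here is a small Python sketch of exposure-based attribution, assuming one variant per unit, attribution from the first exposure, and a 7-day window. The tuple shapes and the `attribute_outcomes` name are made up for the example; in practice this logic would live in your modeled SQL.

```python
from datetime import datetime, timedelta


def attribute_outcomes(exposures, outcomes, window=timedelta(days=7)):
    """Sum outcome value per variant, counting only post-exposure events.

    exposures: iterable of (unit_id, variant, exposed_at)
    outcomes:  iterable of (unit_id, event_at, value)
    """
    # Keep each unit's earliest exposure; later ones are repeats.
    first_exposure = {}
    for unit_id, variant, exposed_at in exposures:
        current = first_exposure.get(unit_id)
        if current is None or exposed_at < current[1]:
            first_exposure[unit_id] = (variant, exposed_at)

    totals = {}
    for variant, _ in first_exposure.values():
        stats = totals.setdefault(variant, {"exposed_units": 0, "value": 0.0})
        stats["exposed_units"] += 1

    for unit_id, event_at, value in outcomes:
        exposure = first_exposure.get(unit_id)
        if exposure is None:
            continue  # never exposed: excluded from the comparison
        variant, exposed_at = exposure
        # Count only events inside [exposed_at, exposed_at + window).
        if exposed_at <= event_at < exposed_at + window:
            totals[variant]["value"] += value
    return totals
```

Note what gets dropped: purchases before the user saw anything, purchases after the window closed, and every outcome from units that were assigned but never exposed.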

Quality gates: catch bad experiments before they hit a decision meeting

Experiment pipelines need the same discipline as production services: observability (knowing what’s happening) and gates (blocking bad outputs).

1) SRM check (automated)

SRM is the canary. Run it daily per experiment and alert if it crosses threshold.

-- SRM check: chi-square statistic for an expected 50/50 split
with counts as (
  select
    experiment_id,
    variant,
    count(distinct unit_id) as n
  from fct_experiment_exposures
  where experiment_id = @experiment_id
  group by 1,2
), pivoted as (
  select
    experiment_id,
    -- coalesce so a variant with zero exposures shows up as 0, not null
    coalesce(max(case when variant = 'control' then n end), 0) as n_control,
    coalesce(max(case when variant = 'treatment' then n end), 0) as n_treatment
  from counts
  group by 1
)
select
  experiment_id,
  n_control,
  n_treatment,
  -- expected count per variant under 50/50 is (n_control + n_treatment) / 2.0
  -- pow() rather than ^, which is bitwise XOR in BigQuery
  pow(n_control - (n_control + n_treatment) / 2.0, 2) / ((n_control + n_treatment) / 2.0)
  +
  pow(n_treatment - (n_control + n_treatment) / 2.0, 2) / ((n_control + n_treatment) / 2.0)
  as chi_square
from pivoted;

Operationally:

  • Set a threshold like chi_square > 10.83 (roughly p<0.001 for df=1)
  • When it triggers, freeze the readout and investigate bucketing/instrumentation
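
If you want a p-value next to the statistic, or an allocation other than 50/50, the check is easy to sketch in Python. This version assumes exactly two variants (one degree of freedom), where the chi-square tail probability reduces to `erfc(sqrt(chi_square / 2))`. The `srm_check` name and its return shape are illustrative.

```python
import math


def srm_check(n_control: int, n_treatment: int,
              control_share: float = 0.5, alpha: float = 0.001) -> dict:
    """Chi-square goodness-of-fit test for a two-variant split (df = 1)."""
    total = n_control + n_treatment
    if total == 0:
        raise ValueError("no exposed units to test")
    expected_control = total * control_share
    expected_treatment = total * (1.0 - control_share)
    chi_square = (
        (n_control - expected_control) ** 2 / expected_control
        + (n_treatment - expected_treatment) ** 2 / expected_treatment
    )
    # For df = 1, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return {"chi_square": chi_square, "p_value": p_value, "srm": p_value < alpha}
```

Take the 53/47 example from earlier with 10,000 exposed units: `srm_check(5300, 4700)` gives a chi-square of 36, far past the 10.83 threshold. The same 53/47 split on 100 units gives 0.36 and is nothing to worry about, which is why you alert on the statistic and not on the raw percentages.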

2) Instrumentation health checks

Run reconciliation checks that answer:

  • Are exposures down >X% day-over-day?
  • Did the ratio of exposures to assignments change?
  • Did key outcome events drop to zero for a segment?

This is where tools like Great Expectations or dbt tests + alerting (PagerDuty/Slack) earn their keep.
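
Before reaching for a framework, the three questions above fit in a few lines of Python. This is a sketch: the input dict shape, the `health_alerts` name, and both default thresholds are placeholders you would tune per surface.

```python
def health_alerts(yesterday: dict, today: dict,
                  max_drop: float = 0.30,
                  max_ratio_shift: float = 0.05) -> list:
    """Compare two daily snapshots and return human-readable alerts.

    Each snapshot: {"assignments": int, "exposures": int,
                    "outcomes_by_segment": {segment: count}}
    """
    alerts = []

    # 1) Exposure volume down more than max_drop day-over-day?
    if yesterday["exposures"] > 0:
        drop = (yesterday["exposures"] - today["exposures"]) / yesterday["exposures"]
        if drop > max_drop:
            alerts.append(f"exposures down {drop:.0%} day-over-day")

    # 2) Did the exposure-to-assignment ratio move?
    if yesterday["assignments"] > 0 and today["assignments"] > 0:
        before = yesterday["exposures"] / yesterday["assignments"]
        after = today["exposures"] / today["assignments"]
        if abs(after - before) > max_ratio_shift:
            alerts.append(f"exposure/assignment ratio moved {before:.2f} -> {after:.2f}")

    # 3) Any segment that had outcomes yesterday and has none today?
    for segment, count in sorted(yesterday["outcomes_by_segment"].items()):
        if count > 0 and today["outcomes_by_segment"].get(segment, 0) == 0:
            alerts.append(f"outcome events dropped to zero for segment '{segment}'")

    return alerts
```

Route a non-empty list to whatever already pages you. The point is that the check runs every day without someone remembering to look.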

3) “A/A tests” in the background

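An A/A test assigns users to two identical variants. You should see no systematic lift: across many A/A runs, only about 5% should come out “significant” at a 0.05 threshold, and the lifts should not lean toward one variant. If you see more than that, your pipeline or metric is biased.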

Teams that run continuous A/A often catch:

  • device/platform join bias
  • bot traffic leaks
  • timezone truncation bugs
  • “helpful” dedup that deletes only one variant’s events
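
To know what “no systematic lift” should look like, it helps to simulate a clean A/A baseline. The Python sketch below assumes a binary conversion metric and a pooled two-proportion z-test; the function names, run count, arm size, and base rate are arbitrary choices for illustration.

```python
import math
import random


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance at all: nothing to distinguish
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(runs: int = 1000, units_per_arm: int = 500,
                           base_rate: float = 0.10, alpha: float = 0.05,
                           seed: int = 7) -> float:
    """Share of simulated A/A runs that look "significant" by chance."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both arms draw from the same conversion rate: a true A/A.
        conv_a = sum(rng.random() < base_rate for _ in range(units_per_arm))
        conv_b = sum(rng.random() < base_rate for _ in range(units_per_arm))
        if two_proportion_p_value(conv_a, units_per_arm, conv_b, units_per_arm) < alpha:
            false_positives += 1
    return false_positives / runs
```

With a healthy pipeline the returned rate should sit near `alpha`. If your real A/A tests flag noticeably more often than the simulation does, the gap is coming from the plumbing, not from chance.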

Measurable outcome: putting SRM + instrumentation checks in place typically reduces “invalid experiments” by 30–60% and cuts time wasted in analysis churn.

Shipping business value: a decision framework that doesn’t burn runway

Founders care about runway and credibility. Engineering leaders care about MTTR (mean time to recovery) and not getting paged over dashboards. Here’s the decision framework I use.

Fix vs rebuild your experiment pipeline

Fix if:

  • Your assignment/exposure/outcome data exists but is messy
  • You can’t reproduce results, but the core system works
  • Most failures are data quality and modeling, not architecture

Rebuild if:

  • Assignment is client-side only and can’t be trusted
  • You can’t link exposure to outcome without fragile multi-hop joins
  • Schema drift is constant and nobody owns it
  • You’re blocked on basic questions like “who was exposed?”

A targeted remediation (often 2–6 weeks) can get you to:

  • Daily experiment metrics by 9am with a defined SLO
  • <0.5% missing exposures on key surfaces
  • Auto-flagged SRM and instrumentation regressions
  • Reproducible results after backfills and refunds

Where GitPlumbers fits (without the theater)

Most teams don’t need a 6-month “data platform initiative.” They need someone to read the crime scene.

  • Book a GitPlumbers code audit focused on your experiment pipeline (assignment service, event schemas, dbt models, Airflow/Dagster jobs). You get a prioritized risk list, quick wins, and a realistic fix-vs-rebuild plan.
  • If you want signal fast, run GitPlumbers Automated Insights on the repos that touch experiments (SDKs, ingestion, transformations). It surfaces structural risks, brittle joins, missing tests, and security gaps that often correlate with bad data.
  • If you’re short on senior bandwidth, assemble a fractional team for remediation (data engineering + analytics engineering + backend) matched to what the audit finds.

Practical next steps (what I’d do this week)

  1. Write down your canonical facts: assignment, exposure, outcomes. If you can’t point to the tables, you don’t have them.
  2. Implement SRM alerts for every live experiment. Treat SRM like a failing health check.
  3. Add idempotent dedup at ingestion (fingerprints + window) and document your late-event watermark.
  4. Move metric logic into versioned models (dbt is fine) and stop copying SQL into notebooks.
  5. Define one SLO for experiment freshness and completeness, and measure it.

If you do nothing else: fix exposure logging and SRM gates. That’s where the biggest trust gains come from, fast.

Key takeaways

  • Treat experiment data as a product: one canonical assignment + exposure model, versioned metrics, and hard quality gates.
  • Most “invalid tests” come from pipeline bugs: duplicates, late events, mismatched joins, and metric definition drift.
  • Add automated SRM + instrumentation health checks to stop bad experiments before decisions get made.
  • Use idempotent ingestion and deterministic keys so backfills don’t rewrite history.
  • Version metric definitions and log experiment configuration snapshots to keep results reproducible.
  • If your experiment system is held together by ad-hoc SQL, a targeted audit will save you from a rebuild later.

Implementation checklist

  • One deterministic `unit_id` per experiment (user_id/device_id/account_id) and a documented hierarchy
  • Assignment logged server-side with `assignment_id` and `variant`, plus config snapshot/version
  • Exposure logged at the moment the user could experience the change (not just page view)
  • Idempotent ingestion with event fingerprints and dedup windows
  • Late-event handling policy (watermarks) and backfill playbook
  • Metric layer with versioned definitions and experiment-safe attribution windows
  • Automated SRM checks + alerting, wired into the experiment dashboard
  • Daily reconciliation (raw events vs modeled exposures/conversions) with thresholds and owners

Questions we hear from teams

Do we really need exposure events if we already have assignments?

Yes, if you care about causal interpretation. Assignment says who *should* get the treatment; exposure says who *could have experienced* it. Without exposures, you’ll mis-measure impact when features aren’t rendered, clients cache, users bounce, or rollout flags behave differently across platforms.

How often should we recompute experiment metrics to handle late events?

Pick a watermark aligned to your business reality (payments/refunds/CRM lag). Common patterns: recompute the last 7–14 days nightly, and declare results final after a fixed window (e.g., 7 days post-exposure) for decision-making consistency.

What’s the fastest way to detect broken experiments in production?

Automated SRM checks + exposure volume anomaly detection. SRM catches bucketing/assignment issues; volume checks catch instrumentation drops. Wire both into alerts and block dashboards from showing “final” conclusions when gates fail.

We have ad-hoc SQL powering our experiment dashboard: what should we fix first?

Start with (1) canonical exposure modeling, (2) dedup/idempotency, and (3) SRM alerts. Then move metric definitions into versioned dbt models with tests so results are reproducible and changes are reviewed like code.
