Sequential A/B Testing in Production: Always-Valid Inference Playbook

2026-03-08 · computation

Category: knowledge
Domain: computation / statistics / experimentation systems

Why this matters

Most product teams peek at experiments before the planned end date.

If you run a fixed-horizon t-test every few hours and stop the first time p < 0.05, your false-positive rate can blow well past 5%: with enough looks at null data, the chance of at least one spurious "significant" read approaches 1. In practice this means a steady stream of "winning" experiments that are pure noise.

Sequential methods let you keep fast iteration without sacrificing inference validity.
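The inflation is easy to see in a seeded A/A simulation (the sample size, peek cadence, and threshold below are illustrative):

```python
import math
import random

def peeking_fpr(n_exp=500, n_obs=200, peek_every=10, z_crit=1.96, seed=0):
    """Monte Carlo over A/A experiments (true effect = 0).

    Returns (fpr_peeking, fpr_final): the fraction of null experiments
    declared "significant" when stopping at the first interim |z| > z_crit,
    versus testing only once at the planned final look.
    """
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(n_exp):
        total = 0.0
        flagged = False
        for i in range(1, n_obs + 1):
            total += rng.gauss(0.0, 1.0)  # unit variance, known here
            if i % peek_every == 0 and abs(total) / math.sqrt(i) > z_crit:
                flagged = True            # a naive peeker would stop and ship
        peek_hits += flagged
        final_hits += abs(total) / math.sqrt(n_obs) > z_crit
    return peek_hits / n_exp, final_hits / n_exp

fpr_peek, fpr_final = peeking_fpr()
```

With 20 interim looks per experiment, the peeking false-positive rate lands well above the nominal 5%, while the single final look stays near it.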


Core mental model

There are three common regimes:

  1. Fixed horizon

    • One planned final look.
    • Valid only if no decisions are made on interim peeks.
  2. Group-sequential (planned looks)

    • Predefine K interim looks (e.g., 25/50/75/100% info).
    • Use alpha spending / boundaries (O'Brien-Fleming, Pocock, etc.).
  3. Always-valid / continuous monitoring

    • Decision can happen at any time with valid error control.
    • Implemented via mSPRT / e-values / always-valid p-values.

If your org checks dashboards continuously, design (3) is usually the operationally honest choice.
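As a concrete (simplified) instance of design (3): a mixture SPRT for the mean of known-variance observations under H0: mean = 0, with a Gaussian N(0, tau^2) mixture over alternatives. The running minimum of 1/Lambda_n is an always-valid p-value that can be checked after every observation. Parameter defaults are illustrative:

```python
import math

def msprt_always_valid_p(xs, sigma2=1.0, tau2=1.0):
    """Mixture SPRT for H0: mean = 0, known variance sigma2.

    Lambda_n = sqrt(sigma2 / (sigma2 + n*tau2))
               * exp(tau2 * S_n**2 / (2 * sigma2 * (sigma2 + n*tau2)))
    where S_n is the running sum. The running min of 1/Lambda_n is an
    always-valid p-value: monitoring it continuously does not inflate
    type I error.
    """
    s, p = 0.0, 1.0
    p_path = []
    for n, x in enumerate(xs, start=1):
        s += x
        log_lam = (0.5 * math.log(sigma2 / (sigma2 + n * tau2))
                   + tau2 * s * s / (2 * sigma2 * (sigma2 + n * tau2)))
        p = min(p, math.exp(-log_lam))  # p is monotone non-increasing
        p_path.append(p)
    return p_path

# Strong true effect: evidence accumulates and p drops below 0.05.
p_signal = msprt_always_valid_p([1.0] * 30)
# Null data with zero sample sum: Lambda_n <= 1, so p stays at 1.
p_null = msprt_always_valid_p([0.0] * 30)
```

Platform implementations differ in the test statistic and mixture, but the shape is the same: an evidence process you may stop on at any time.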


Practical architecture blueprint

1) Choose your stopping policy before launch

Every experiment config should include:

  • the stopping regime (fixed horizon, group-sequential, or always-valid),
  • the overall alpha and, for planned looks, the spending function,
  • the pre-registered primary endpoint and minimum effect of interest,
  • the guardrail metrics that can veto a ship.

No ad-hoc switching midstream unless it is explicitly logged as a protocol change.
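A config capturing this pre-registration could look like the following (field names and values are illustrative, not a real platform schema):

```python
# Illustrative experiment config: stopping policy, alpha, endpoint, and
# guardrails are all fixed before launch; any later edit is a logged
# protocol change, never a silent change.
EXPERIMENT_CONFIG = {
    "experiment_id": "checkout-flow-test",       # hypothetical experiment
    "stopping_policy": "always_valid",           # "fixed" | "group_sequential" | "always_valid"
    "alpha": 0.05,
    "primary_endpoint": "conversion_rate",
    "min_effect_of_interest": 0.01,              # practical-significance floor
    "guardrails": ["latency_p95", "error_rate"], # any breach vetoes a ship
    "spending_function": None,                   # e.g. "obrien_fleming" for planned looks
}
```

Storing this in an immutable registry gives the audit log described later something concrete to diff against.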

2) Use alpha budget accounting as a platform primitive

For group-sequential: account for how much of the total alpha each planned look spends via the chosen spending function, and surface the remaining budget.

For always-valid: the e-value / mSPRT machinery is valid at any stopping time, so the relevant budget is the evidence threshold itself (e.g., stop for efficacy when the e-value exceeds 1/alpha).

Treat "error budget" like SRE treats reliability budget.
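For the planned-looks case, the accounting can be sketched with the Lan-DeMets spending-function approximations; the look fractions below are illustrative:

```python
import math
from statistics import NormalDist

_N = NormalDist()

def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative alpha used at info fraction t.

    Spends almost nothing early and saves most of the budget for the end.
    """
    if t <= 0:
        return 0.0
    z = _N.inv_cdf(1 - alpha / 2)
    return 2 - 2 * _N.cdf(z / math.sqrt(t))

def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: spends alpha much faster at early looks."""
    return alpha * math.log(1 + (math.e - 1) * max(t, 0.0))

# Per-look alpha increments at 25/50/75/100% information:
looks = [0.25, 0.50, 0.75, 1.00]
obf_increments = [obf_spend(t) - obf_spend(s) for s, t in zip([0.0] + looks, looks)]
```

The increments telescope to the full alpha at 100% information, and the first O'Brien-Fleming look spends a nearly negligible slice of the budget, which is why early stops under it require overwhelming evidence.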

3) Add variance reduction early (CUPED/CUPAC)

Sequential validity does not remove variance pain.

Use pre-experiment covariates:

[ Y'_i = Y_i - \theta (X_i - \bar X) ]

where:

  • Y_i is the in-experiment metric for unit i,
  • X_i is a pre-experiment covariate for the same unit, with mean \bar X,
  • \theta = Cov(Y, X) / Var(X), the choice that minimizes Var(Y'_i).

Good covariates reduce required sample size and decision time significantly.
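The adjustment above can be sketched in a few lines, with theta estimated as Cov(Y, X)/Var(X); the data here is synthetic and only for illustration:

```python
def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes Y'_i = Y_i - theta * (X_i - mean(X)),
    with theta = Cov(Y, X) / Var(X), the variance-minimizing choice."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    theta = cov_xy / var_x
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def sample_var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

# Synthetic data: outcome strongly driven by a pre-experiment covariate,
# plus a small alternating "noise" term.
x = [float(i) for i in range(40)]                            # pre-period metric
y = [2.0 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
y_adj = cuped_adjust(y, x)
```

Because the covariate is centered, the adjustment leaves the mean (and hence the estimated treatment effect) unchanged while the variance collapses to roughly that of the residual noise.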

4) Separate "statistical significance" from "ship decision"

Production decisions should require both:

  • statistical evidence: the sequential boundary is crossed (or the always-valid p-value falls below alpha), and
  • business evidence: the lift clears the practical-significance threshold and every guardrail metric is healthy.

Example ship gate:
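A minimal sketch of such a gate; the parameter names and thresholds are illustrative:

```python
def ship_decision(always_valid_p, observed_lift, guardrails_healthy,
                  alpha=0.05, min_practical_lift=0.01):
    """Dual-key ship gate: statistical evidence alone is never sufficient.

    Requires (1) the sequential test to reject at level alpha, (2) a lift
    that clears a pre-registered practical-significance floor, and (3) all
    guardrail metrics to be healthy.
    """
    statistically_significant = always_valid_p < alpha
    practically_significant = observed_lift >= min_practical_lift
    return (statistically_significant
            and practically_significant
            and guardrails_healthy)
```

A statistically significant but practically tiny lift, or a lift that trips a guardrail, fails the gate even though the sequential test has rejected.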

5) Enforce one primary endpoint and pre-registered hierarchy

Sequential flexibility + many metrics = multiplicity explosion.

Use:

  • exactly one pre-registered primary endpoint,
  • a fixed, pre-registered hierarchy for secondary metrics, each tested only if the previous one passed,
  • explicit multiplicity control for anything outside the hierarchy.
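One standard way to enforce the hierarchy is the fixed-sequence procedure: endpoints are tested in pre-registered order, each at the full alpha, and testing stops at the first failure, which controls the family-wise error rate. A sketch (the orderings and p-values are illustrative):

```python
def fixed_sequence_test(ordered_p_values, alpha=0.05):
    """Fixed-sequence (hierarchical) testing.

    Each endpoint is tested at the full alpha, but only while every earlier
    endpoint in the pre-registered order has been rejected; once one fails,
    later endpoints are not tested and cannot be cherry-picked.
    """
    decisions = []
    still_testing = True
    for p in ordered_p_values:
        rejected = still_testing and p < alpha
        decisions.append(rejected)
        still_testing = rejected
    return decisions

# Primary passes, first secondary fails, so the third endpoint is never
# tested -- even though its raw p-value happens to be small.
```

This is exactly the "immutable primary endpoint plus ordered secondaries" contract expressed as code.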


Decision framework: which method to use?

Use fixed horizon when

  • interim results are genuinely hidden and the single planned look will be the only decision point.

Use group-sequential when

  • you can commit to a small number of predefined interim looks (e.g., 25/50/75/100% of information) with alpha spending.

Use always-valid when


Recommended defaults for product experimentation platforms

  • Always-valid inference as the default stopping regime, since dashboards are watched continuously.
  • CUPED/CUPAC variance reduction on by default, with covariates frozen before launch.
  • One immutable primary endpoint per experiment, registered up front.
  • Dual-key ship gates: statistical evidence plus practical lift and guardrail health.
  • Alpha budget accounting surfaced directly in the experiment UI.


Common failure modes (and fixes)

  1. Peeking with fixed-horizon tests

    • Fix: disallow interim significance display, or move to always-valid.
  2. Boundary confusion in group-sequential setups

    • Fix: show "current critical threshold" directly in experiment UI.
  3. Variance reduction leakage

    • Fix: only use pre-treatment covariates; freeze covariate definition before start.
  4. Metric shopping after interim reads

    • Fix: registry + immutable primary endpoint + audit log.
  5. Stopping on significance but ignoring harm guardrails

    • Fix: dual-key decision contract (lift + safety).
  6. Overtriggering on tiny but practically irrelevant effects

    • Fix: practical significance threshold / decision-theoretic utility gate.

Rollout plan (minimal-regret)

Phase 1 — Shadow compute

Run the sequential statistics alongside the existing fixed-horizon results without surfacing them, and log where the two regimes would have decided differently.

Phase 2 — Guarded adoption

Let opt-in teams stop on sequential boundaries, with guardrails, the dual-key ship gate, and audit logging enabled.

Phase 3 — Platform default

Make the sequential method the default for new experiments; fixed horizon becomes the exception that must be justified.

Phase 4 — Continuous governance

Review alpha spend, protocol changes, and the KPI dashboard on a regular cadence.


Minimal KPI dashboard

  • Share of experiments stopped before the planned horizon.
  • A/A (null-experiment) false-positive rate versus the nominal alpha.
  • Alpha / error budget spent, per experiment and per team.
  • Time-to-decision, before and after variance reduction.
  • Guardrail breaches caught before ship, and protocol changes logged.

If you cannot observe these, you cannot govern sequential experimentation safely.


12-point implementation checklist

  1. Stopping policy chosen and registered before launch.
  2. Protocol changes allowed only via an explicit, logged exception.
  3. Alpha / error budget accounting built in as a platform primitive.
  4. Group-sequential looks predefined with a named spending function.
  5. Always-valid statistics used wherever monitoring is continuous.
  6. CUPED covariates pre-treatment only and frozen before start.
  7. Exactly one immutable primary endpoint per experiment.
  8. Pre-registered hierarchy for secondary metrics.
  9. Dual-key ship gate: statistical evidence plus guardrail health.
  10. Practical-significance threshold set per primary metric.
  11. Interim significance hidden for fixed-horizon experiments.
  12. Current critical threshold displayed in the experiment UI.


One-line takeaway

If your org peeks in real time, use methods that are valid in real time—sequential inference is not a stats luxury, it is experimentation hygiene.

