Sequential A/B Testing in Production: Always-Valid Inference Playbook
Date: 2026-03-08
Category: knowledge
Domain: computation / statistics / experimentation systems
Why this matters
Most product teams peek at experiments before the planned end date.
If you run a fixed-horizon t-test every few hours and stop when p < 0.05, your false-positive rate can blow past 5%. In practice this means:
- shipping neutral/harmful changes too often,
- overestimating lift,
- learning loops polluted by noise.
Sequential methods let you keep fast iteration without sacrificing inference validity.
Core mental model
There are three common regimes:
Fixed horizon
- One planned final look.
- Valid only if you truly avoid peeking decisions.
Group-sequential (planned looks)
- Predefine K interim looks (e.g., 25/50/75/100% info).
- Use alpha spending / boundaries (O'Brien-Fleming, Pocock, etc.).
Always-valid / continuous monitoring
- Decision can happen at any time with valid error control.
- Implemented via mSPRT / e-values / always-valid p-values.
If your org checks dashboards continuously, the always-valid regime is usually the operationally honest choice.
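As a concrete (and deliberately simplified) sketch of the always-valid regime: the mSPRT for a known-variance normal mean under H0: mean = 0, with a N(0, tau2) mixing distribution. The function name and the `sigma2`/`tau2` defaults are illustrative choices, not values prescribed by this playbook; production metrics need variance estimation and two-sample handling.

```python
import math

def msprt_always_valid_p(xs, sigma2=1.0, tau2=1.0):
    """Stream of always-valid p-values for H0: mean = 0 (known variance sigma2),
    via the mixture SPRT with a N(0, tau2) mixing distribution.

    The running minimum of 1/Lambda_n is a valid p-value at ANY stopping time,
    which is what makes continuous dashboard monitoring honest.
    """
    s, p = 0.0, 1.0
    out = []
    for n, x in enumerate(xs, start=1):
        s += x  # running sum of observations
        # Closed-form mixture likelihood ratio for the normal/normal conjugate pair
        lam = math.sqrt(sigma2 / (sigma2 + n * tau2)) * math.exp(
            tau2 * s * s / (2.0 * sigma2 * (sigma2 + n * tau2))
        )
        p = min(p, 1.0 / lam)  # monotone: once small, it stays small
        out.append(p)
    return out
```

Because the p-value stream is non-increasing, "stop the first time p < alpha" controls type-I error regardless of how often anyone peeks.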
Practical architecture blueprint
1) Choose your stopping policy before launch
Every experiment config should include:
- primary metric,
- minimum detectable effect or practical significance threshold,
- max runtime (calendar cap),
- statistical engine (fixed, group_seq, always_valid),
- decision rule.
No ad-hoc switching midstream unless explicitly logged as protocol change.
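A minimal sketch of such a config as an immutable record; all field names and the example values are hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)  # frozen: the stopping policy is immutable after launch
class ExperimentConfig:
    primary_metric: str
    mde: float                     # minimum detectable effect (relative)
    max_runtime_days: int          # calendar cap
    engine: Literal["fixed", "group_seq", "always_valid"]
    min_samples_per_arm: int       # sample floor before any stop is allowed
    ship_threshold: float          # practical-significance gate for the ship decision

# Example instance (illustrative values only)
cfg = ExperimentConfig(
    primary_metric="checkout_conversion",
    mde=0.003,
    max_runtime_days=14,
    engine="always_valid",
    min_samples_per_arm=5000,
    ship_threshold=0.003,
)
```

Freezing the dataclass makes "no ad-hoc switching midstream" a property of the type: any midstream change must create a new config object, which is a natural hook for protocol-change logging.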
2) Use alpha budget accounting as a platform primitive
For group-sequential:
- track cumulative information fraction,
- compute spent alpha at each look,
- expose remaining alpha in UI.
For always-valid:
- expose confidence sequence or always-valid p-value stream,
- show current stop/no-stop status and expected run-length diagnostics.
Treat "error budget" like SRE treats reliability budget.
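For the group-sequential case, spent and remaining alpha at each look can be computed directly from a spending function. A sketch using the Lan-DeMets O'Brien-Fleming-type spending function (function names are illustrative):

```python
import math
from statistics import NormalDist

_N = NormalDist()

def obf_alpha_spent(info_frac, alpha=0.05):
    """Cumulative alpha spent at information fraction t in (0, 1],
    Lan-DeMets O'Brien-Fleming-type: alpha*(t) = 2 - 2*Phi(z_{alpha/2} / sqrt(t))."""
    z = _N.inv_cdf(1 - alpha / 2)
    return 2.0 * (1.0 - _N.cdf(z / math.sqrt(info_frac)))

def incremental_spend(fracs, alpha=0.05):
    """Alpha newly spent at each planned look; the running remainder is
    what a platform UI would surface as 'remaining alpha'."""
    cum = [obf_alpha_spent(t, alpha) for t in fracs]
    return [cum[0]] + [b - a for a, b in zip(cum, cum[1:])]
```

Note the OBF shape spends almost nothing early (very strict interim thresholds) and releases most of the budget near full information, which matches how most governance bodies want interim looks to behave.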
3) Add variance reduction early (CUPED/CUPAC)
Sequential validity does not remove variance pain.
Use pre-experiment covariates:
$$ Y'_i = Y_i - \theta (X_i - \bar X) $$

where:
- $Y_i$: outcome,
- $X_i$: pre-period covariate,
- $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$: covariance-based coefficient.
Good covariates reduce required sample size and decision time significantly.
4) Separate "statistical significance" from "ship decision"
Production decisions should require both:
- statistical rule satisfied, and
- practical/guardrail rule satisfied.
Example ship gate:
- always-valid upper bound on error < 0.05,
- expected lift > +0.3%,
- p95 latency regression < +1%,
- no critical guardrail breach.
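The dual-key gate above reduces to a conjunction; a sketch with the example thresholds wired in as defaults (all parameter names are illustrative):

```python
def ship_decision(av_p, lift, latency_p95_delta, guardrail_breach,
                  alpha=0.05, min_lift=0.003, max_latency_delta=0.01):
    """Dual-key ship gate: the statistical rule AND the practical/guardrail
    rules must all pass. Defaults mirror the example gate in the text."""
    statistical_ok = av_p < alpha                      # always-valid error bound satisfied
    practical_ok = lift > min_lift                     # expected lift > +0.3%
    guardrails_ok = (latency_p95_delta < max_latency_delta  # p95 regression < +1%
                     and not guardrail_breach)         # no critical guardrail breach
    return statistical_ok and practical_ok and guardrails_ok
```

Encoding the gate as a pure function makes every ship/no-ship call reproducible from logged inputs, which is exactly what the audit-log requirement later in this playbook needs.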
5) Enforce one primary endpoint and pre-registered hierarchy
Sequential flexibility + many metrics = multiplicity explosion.
Use:
- one primary metric for stop decision,
- pre-registered secondary metrics,
- fallback multiplicity correction (Holm/online FDR) when needed.
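For the fallback correction on secondary metrics, Holm's step-down procedure is simple enough to implement directly (function name is illustrative):

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down correction: returns a reject/keep flag per p-value,
    controlling family-wise error rate across the secondary metrics."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):  # thresholds: alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break  # step-down: first failure stops all further rejections
    return reject
```

Holm is uniformly more powerful than plain Bonferroni and needs no independence assumptions, which makes it a safe default when secondary metrics are correlated.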
Decision framework: which method to use?
Use fixed horizon when
- experiments are short and cheap,
- team can operationally avoid peeking-based actions,
- low need for early stopping.
Use group-sequential when
- interim reviews are scheduled (e.g., weekly council),
- governance requires preplanned looks,
- auditability is high priority.
Use always-valid when
- dashboards are monitored continuously,
- teams need rapid stop/ship calls,
- platform scale supports many concurrent experiments.
Recommended defaults for product experimentation platforms
- Default engine: always-valid for primary binary/mean metrics.
- Calendar cap: 14 or 21 days (business-cycle coverage).
- Minimum sample floor before any stop (prevents ultra-early noise locks).
- Mandatory guardrail checks before auto-ship.
- CUPED on by default when pre-period signal quality is acceptable.
Common failure modes (and fixes)
Peeking with fixed-horizon tests
- Fix: disallow interim significance display, or move to always-valid.
Boundary confusion in group-sequential setups
- Fix: show "current critical threshold" directly in experiment UI.
Variance reduction leakage
- Fix: only use pre-treatment covariates; freeze covariate definition before start.
Metric shopping after interim reads
- Fix: registry + immutable primary endpoint + audit log.
Stopping on significance but ignoring harm guardrails
- Fix: dual-key decision contract (lift + safety).
Over-triggering on tiny but practically irrelevant effects
- Fix: practical significance threshold / decision-theoretic utility gate.
Rollout plan (minimal-regret)
Phase 1 — Shadow compute
- Keep existing fixed-horizon decisions.
- Compute sequential decisions in parallel.
- Compare disagreement patterns and false-stop simulations.
Phase 2 — Guarded adoption
- Enable sequential engine for low-risk surfaces.
- Require human approval for early stop decisions.
- Track run-length, stop reasons, and reversal rate.
Phase 3 — Platform default
- Make sequential engine default.
- Keep fixed-horizon as explicit opt-out only.
- Add policy linting in experiment config review.
Phase 4 — Continuous governance
- Quarterly calibration review:
- realized false discovery proxy,
- average decision latency,
- guardrail incident rate,
- variance-reduction effectiveness.
Minimal KPI dashboard
- Median days-to-decision
- Percent early stops
- False-positive proxy (holdout replay / AA tests)
- Effect-size shrinkage from interim to long-run
- Guardrail breach rate after ship
- CUPED variance reduction ratio
If you cannot observe these, you cannot govern sequential experimentation safely.
12-point implementation checklist
- Experiment config stores statistical engine explicitly
- Stop rules are pre-registered and immutable after launch
- Primary metric is unique and auditable
- Multiplicity policy exists for secondary metrics
- Sequential boundaries or always-valid logic is unit-tested
- AA tests validate empirical type-I error before rollout
- CUPED covariate pipeline is leakage-safe
- Practical significance threshold is enforced at decision time
- Guardrails are hard blockers, not warnings
- UI shows current decision state and rationale
- All interim views/actions are audit-logged
- Quarterly calibration review is operationalized
One-line takeaway
If your org peeks in real time, use methods that are valid in real time—sequential inference is not a stats luxury, it is experimentation hygiene.
References
- Johari, Pekelis, Walsh — Always Valid Inference: Bringing Sequential Analysis to A/B Testing (arXiv:1512.04922, 2015/2019)
  https://arxiv.org/abs/1512.04922
- Johari, Koomen, Pekelis, Walsh — Always Valid Inference: Continuous Monitoring of A/B Tests (Operations Research, 2022)
  https://pubsonline.informs.org/doi/10.1287/opre.2021.2135
- Deng, Xu, Kohavi, Walker — Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (WSDM, 2013; CUPED)
  https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf
- Pocock — Group Sequential Methods in the Design and Analysis of Clinical Trials (Biometrika, 1977)
- O'Brien, Fleming — A Multiple Testing Procedure for Clinical Trials (Biometrics, 1979)
- Lan, DeMets — Discrete Sequential Boundaries for Clinical Trials (Biometrika, 1983)