Reproducible Backtests: Determinism, Data Lineage, and Experiment Audit Playbook
Date: 2026-03-04
Category: systems
Purpose: A practical operating guide to make backtest results repeatable, auditable, and trustworthy in production quant workflows.
Why this matters
A strategy that looks great once but cannot be reproduced is not alpha—it is an accident.
Most “mysterious” backtest drift comes from a few operational issues:
- Data changed (vendor corrections, restatements, symbol mapping updates)
- Code changed (dependency bumps, default parameter changes)
- Runtime changed (random seeds, thread scheduling, hardware differences)
- Accounting changed (fees, slippage, borrow, FX handling)
If your team cannot answer "which exact inputs generated this equity curve?", your research stack is fragile.
The reproducibility contract (non-negotiables)
Every backtest run should produce and store:
- Code identity
  - Git commit SHA
  - Dirty/clean working tree state
  - Strategy module + parameter set
- Data identity
  - Dataset version ID (snapshot hash)
  - Universe definition version
  - Corporate actions / adjustments version
- Environment identity
  - OS, CPU architecture, Python/Rust/Node versions
  - Dependency lockfile hash
  - Container image digest (if used)
- Execution identity
  - Random seed policy
  - Thread/concurrency config
  - Start/end timestamps and wall-clock duration
- Accounting identity
  - Fee/slippage model version
  - Borrow/funding assumptions
  - FX conversion policy
A run artifact missing any of these fields should not be considered “research grade.”
Determinism first: where randomness sneaks in
1) Pseudorandomness
- Set explicit seeds for all RNG sources (numpy, torch, custom simulators)
- Log seed values per run
- Avoid implicit global RNG usage in helper utilities
Pattern:
- Derive a per-run master seed
- Derive sub-seeds for modules (signal, execution, bootstrap) via a deterministic splitter
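The master-seed/sub-seed pattern above can be sketched with numpy's `SeedSequence`; the derivation from module names is one illustrative choice, not a prescribed standard, and the module names and seed value are assumptions:

```python
import numpy as np

MASTER_SEED = 4201337  # logged in the run manifest

def module_rng(module_name: str) -> np.random.Generator:
    """Derive an independent, reproducible RNG stream for a named module."""
    # Deriving the spawn_key from the module name keeps each stream stable
    # even if modules are added, removed, or initialized in a different order.
    key = [ord(c) for c in module_name]
    seq = np.random.SeedSequence(MASTER_SEED, spawn_key=key)
    return np.random.default_rng(seq)

# The same module name always yields the same stream:
a = module_rng("signal").integers(0, 1_000_000, size=3)
b = module_rng("signal").integers(0, 1_000_000, size=3)
assert (a == b).all()

# Different modules get statistically independent streams:
c = module_rng("execution").integers(0, 1_000_000, size=3)
assert not (a == c).all()
```

Name-keyed derivation beats sequential `spawn()` calls here: a stream's identity no longer depends on module initialization order.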
2) Parallelism and ordering
Common pitfall:
- Parallel map/reduce where floating-point accumulation order changes by scheduling
Mitigations:
- Use stable reduction order (sorted keys, deterministic chunk merge)
- For critical metrics, prefer deterministic aggregators over speed-first paths
- Pin thread count in reproducibility mode
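A minimal sketch of a stable reduction order, assuming per-symbol PnL fragments arrive from parallel workers in nondeterministic order (data and names are illustrative):

```python
import math

# e.g. results collected from parallel workers, arrival order not guaranteed
partial_pnl = {"MSFT": 0.1, "AAPL": 0.2, "NVDA": -0.05}

# Non-deterministic: summing in arrival order can change the FP result run to run.
# Deterministic: fix the reduction order by sorting keys first; math.fsum
# additionally computes a correctly rounded sum regardless of order.
total = math.fsum(partial_pnl[k] for k in sorted(partial_pnl))
```

For critical metrics, the combination of a sorted key order and an exact accumulator (`math.fsum`, or Kahan summation in other languages) removes scheduling as a source of drift.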
3) Floating-point behavior
Small FP differences can snowball through path-dependent execution logic.
Mitigations:
- Use fixed precision for money and shares where possible (decimal or scaled integers)
- Define rounding rules explicitly (banker’s rounding vs floor/ceil)
- Keep consistent BLAS/math library versions for model-heavy pipelines
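A sketch of fixed-precision cash accounting with an explicit rounding rule; the two-decimal quantum and the HALF_EVEN (banker's rounding) choice are illustrative assumptions:

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")  # explicit quantum for all cash amounts

def to_cash(x: str) -> Decimal:
    """Round an amount to cents using banker's rounding (HALF_EVEN)."""
    return Decimal(x).quantize(CENT, rounding=ROUND_HALF_EVEN)

# Exact decimal arithmetic, unlike binary floats where 0.1 + 0.2 != 0.3:
assert to_cash("0.1") + to_cash("0.2") == Decimal("0.30")

# Ties round to the even neighbour, so rounding bias does not accumulate:
assert to_cash("2.675") == Decimal("2.68")
assert to_cash("2.665") == Decimal("2.66")
```

Constructing `Decimal` from strings (not floats) matters: `Decimal(0.1)` would import the binary rounding error you are trying to avoid.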
4) Clock/timezone semantics
Pitfalls:
- Mixed naive/aware timestamps
- DST handling inconsistencies
- Session boundary drift
Mitigations:
- Normalize internal event time to UTC
- Convert to venue-local time only for session logic/reporting
- Version and test the trading calendar package
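The UTC-internal / venue-local-at-the-edge split can be sketched as follows; the NYSE example and timestamps are illustrative:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Internal representation: always timezone-aware, always UTC.
event_utc = datetime(2026, 3, 4, 14, 30, tzinfo=timezone.utc)

# Convert to venue-local time only for session logic and reporting.
# On 2026-03-04 New York is on EST (UTC-5; DST starts later that month),
# so 14:30 UTC corresponds to the 09:30 session open.
nyc = event_utc.astimezone(ZoneInfo("America/New_York"))
assert (nyc.hour, nyc.minute) == (9, 30)
```

Keeping the conversion at the edge means a single versioned tz database (and trading calendar) governs all session logic, instead of ad-hoc offsets scattered through the code.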
Data lineage: treat datasets like code releases
Snapshot, don’t “latest”
Never backtest from mutable “latest” tables when making decisions.
Use immutable snapshots with IDs:
- market_data_snapshot_id
- fundamentals_snapshot_id
- corp_actions_snapshot_id
- symbol_master_snapshot_id
Point-in-time integrity
Enforce:
- No lookahead in fundamentals/events
- Delisting-aware universe history
- Split/dividend adjustments applied with effective dates
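A minimal point-in-time filter sketch: only rows whose knowledge date is at or before the as-of date are visible to the backtest. The field names (`knowledge_date`, `period`, `eps`) and the restatement example are assumptions for illustration:

```python
import datetime as dt

fundamentals = [
    {"symbol": "ACME", "period": "2025Q4", "eps": 1.10,
     "knowledge_date": dt.date(2026, 2, 15)},   # restated figure, known later
    {"symbol": "ACME", "period": "2025Q4", "eps": 1.05,
     "knowledge_date": dt.date(2026, 1, 20)},   # original release
]

def visible(rows, asof):
    """Rows knowable at `asof`; the latest knowledge_date per (symbol, period) wins."""
    known = [r for r in rows if r["knowledge_date"] <= asof]
    best = {}
    for r in sorted(known, key=lambda r: r["knowledge_date"]):
        best[(r["symbol"], r["period"])] = r  # later knowledge supersedes earlier
    return list(best.values())

# On 2026-02-01 only the original release is visible -- no lookahead:
assert visible(fundamentals, dt.date(2026, 2, 1))[0]["eps"] == 1.05
```

The same two-date discipline (effective date vs. knowledge date) applies to corporate actions and universe membership.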
Vendor corrections policy
When data vendor backfills/corrects history:
- Create a new snapshot version
- Run drift report vs previous snapshot
- Record impacted strategies and periods
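The drift report between two snapshot versions can be as simple as a keyed diff over the series; the symbols, dates, and values below are illustrative:

```python
# (symbol, date) -> close price, for two versions of the same snapshot series
old = {("ACME", "2026-02-10"): 101.2, ("ACME", "2026-02-11"): 99.8}
new = {("ACME", "2026-02-10"): 101.2, ("ACME", "2026-02-11"): 100.1}  # vendor correction

# Rows that changed between snapshots, with (old, new) values -- the raw
# material for recording which strategies and periods were impacted.
drift = {k: (old.get(k), new.get(k))
         for k in old.keys() | new.keys()
         if old.get(k) != new.get(k)}
```

Using `.get()` on both sides means added and deleted rows show up too, as `(None, value)` or `(value, None)` pairs.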
Minimal run manifest (copy/paste schema)
```json
{
  "run_id": "bt_2026-03-04_9f1a2c",
  "strategy": "mean_reversion_v3",
  "params_hash": "sha256:...",
  "code": {
    "git_sha": "abc123...",
    "dirty": false
  },
  "data": {
    "market_snapshot": "md_2026w09_v2",
    "corp_actions_snapshot": "ca_2026w09_v1",
    "universe_snapshot": "uni_2026w09_v3"
  },
  "environment": {
    "container_digest": "sha256:...",
    "lockfile_hash": "sha256:...",
    "cpu_arch": "arm64"
  },
  "execution": {
    "master_seed": 4201337,
    "threads": 1,
    "deterministic_mode": true
  },
  "cost_model": {
    "fees_model": "fees_v4",
    "slippage_model": "slip_v12",
    "borrow_model": "borrow_v2"
  }
}
```
Store this manifest next to all output artifacts (PNL, trades, diagnostics).
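One detail worth pinning down is how `params_hash` is computed; a sketch, assuming canonical JSON (sorted keys, no whitespace) so logically equal parameter sets hash identically (the parameter names are illustrative):

```python
import hashlib
import json

def params_hash(params: dict) -> str:
    """Hash a parameter dict via canonical JSON so key order cannot change the hash."""
    blob = json.dumps(params, sort_keys=True, separators=(",", ":")).encode()
    return "sha256:" + hashlib.sha256(blob).hexdigest()

# Key order does not matter; any value change does:
h1 = params_hash({"lookback": 20, "z_entry": 1.5})
h2 = params_hash({"z_entry": 1.5, "lookback": 20})
assert h1 == h2
assert params_hash({"lookback": 21, "z_entry": 1.5}) != h1
```

The same canonicalize-then-hash approach works for any manifest field that is derived from structured content rather than recorded directly.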
Reproducibility operating modes
Use two explicit modes:
- Fast exploration mode
- Parallelized, approximate, cheap
- Used for ideation only
- Audit mode (decision mode)
- Deterministic settings enforced
- Snapshot-pinned data + locked environment
- Required before promotion / allocation decisions
Do not mix the two when reporting investment-grade numbers.
Drift triage when results don’t match
When rerun output differs, follow this order:
- Manifest diff (code/data/env/cost model)
- Trade count diff (first divergence timestamp)
- Signal diff (feature values and thresholds)
- Execution diff (fill assumptions, latency/slippage settings)
- Accounting diff (fees/borrow/FX)
A “first divergence” tool (binary search on timeline) pays for itself quickly.
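A sketch of the bisection idea, assuming each run's trades are sorted by time and comparable as tuples (in practice each "compare prefix" step may mean rerunning up to a timestamp, which is why bisection beats a linear scan):

```python
def first_divergence(trades_a, trades_b):
    """Index of the first differing trade between two runs, or None if they match."""
    n = min(len(trades_a), len(trades_b))
    if trades_a[:n] == trades_b[:n]:
        # Common prefix matches entirely; any difference is a length mismatch.
        return n if len(trades_a) != len(trades_b) else None
    lo, hi = 0, n  # invariant: prefix of length lo matches, prefix of length hi does not
    while lo < hi:
        mid = (lo + hi) // 2
        if trades_a[:mid + 1] == trades_b[:mid + 1]:
            lo = mid + 1
        else:
            hi = mid
    return lo

a = [("09:30", "AAPL", 100), ("09:31", "MSFT", 50), ("09:32", "NVDA", 10)]
b = [("09:30", "AAPL", 100), ("09:31", "MSFT", 55), ("09:40", "NVDA", 10)]
assert first_divergence(a, b) == 1  # runs diverge at the second trade
```

The returned index gives the first divergence timestamp directly, which then scopes the signal/execution/accounting diffs that follow in the triage order above.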
Promotion gate checklist (before live or paper promotion)
- Two independent reruns produce identical key metrics in audit mode
- Run manifest complete and archived
- Data snapshots immutable and restorable
- Transaction-cost model versions explicitly pinned
- Calendar/timezone tests pass
- Parameter file and strategy code are commit-pinned
- Notebook-to-production parity confirmed (no hidden notebook state)
- Reviewer can reproduce results in clean environment
If any box is unchecked, strategy status remains research-only.
Practical architecture pattern
A robust setup usually has:
- Data Registry (snapshot IDs + metadata)
- Experiment Registry (manifests + metrics + artifacts)
- Backtest Runner (deterministic/audit mode toggle)
- Diff Service (run-to-run comparison)
- Promotion Policy (hard gate from research to deployment)
This can be implemented incrementally—start with manifest + snapshot pinning, then add automated diffs and policy gates.
Common anti-patterns
- “It ran from my notebook, trust me.”
- Mutable CSV/parquet overwrites without versioning
- Hidden default parameters in library upgrades
- Mixed timezone assumptions across data sources
- Backtests that cannot regenerate the exact trade list
Each anti-pattern increases false confidence more than it increases speed.
Rule of thumb
If you cannot recreate the same trades from the same manifest six months later, you do not have a backtest—you have a story.