Reproducible Backtests: Determinism, Data Lineage, and Experiment Audit Playbook

2026-03-04 · systems

Purpose: A practical operating guide to make backtest results repeatable, auditable, and trustworthy in production quant workflows.


Why this matters

A strategy that looks great once but cannot be reproduced is not alpha—it is an accident.

Most “mysterious” backtest drift comes from a few operational issues:

  • unseeded or globally shared randomness
  • mutable “latest” data instead of pinned snapshots
  • silent environment and dependency drift
  • inconsistent timezone and calendar handling
  • fee/slippage assumptions that change between runs

If your team cannot answer "which exact inputs generated this equity curve?", your research stack is fragile.


The reproducibility contract (non-negotiables)

Every backtest run should produce and store:

  1. Code identity
    • Git commit SHA
    • Dirty/clean working tree state
    • Strategy module + parameter set
  2. Data identity
    • Dataset version ID (snapshot hash)
    • Universe definition version
    • Corporate actions / adjustments version
  3. Environment identity
    • OS, CPU architecture, Python/Rust/Node versions
    • Dependency lockfile hash
    • Container image digest (if used)
  4. Execution identity
    • Random seed policy
    • Thread/concurrency config
    • Start/end timestamps and wall-clock duration
  5. Accounting identity
    • Fee/slippage model version
    • Borrow/funding assumptions
    • FX conversion policy

A run artifact missing any of these fields should not be considered “research grade.”


Determinism first: where randomness sneaks in

1) Pseudorandomness

Pattern: a single recorded master seed, from which every component derives its own random stream. Never rely on implicitly seeded global RNGs (library defaults, hash randomization, OS entropy), since these change silently between runs.

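A minimal sketch of the pattern in Python (helper names are illustrative, not from any particular framework): each component derives its stream from the recorded master seed via a stable digest of its name, so reruns reproduce identical draws.

```python
import hashlib

import numpy as np

MASTER_SEED = 4201337  # recorded in the run manifest

def component_rng(component: str) -> np.random.Generator:
    """Derive a deterministic per-component RNG from the master seed."""
    # Use a stable digest, NOT Python's built-in hash(), which is
    # randomized per process and would break reproducibility.
    digest = int.from_bytes(hashlib.sha256(component.encode()).digest()[:4], "big")
    return np.random.default_rng(np.random.SeedSequence([MASTER_SEED, digest]))

rng_signals = component_rng("signal_noise")

# Reruns with the same master seed and component name reproduce identical draws
assert component_rng("signal_noise").random() == rng_signals.random()
```

Distinct component names yield independent streams, so adding a new randomized component never perturbs the draws of existing ones.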
2) Parallelism and ordering

Common pitfall: results are aggregated in worker completion order, so the output depends on thread scheduling rather than on the inputs.

Mitigations:

  • pin worker/thread counts and record them in the manifest
  • assign work by stable keys and sort results before any reduction
  • prefer order-independent aggregations; otherwise enforce a canonical order
  • provide a single-threaded audit mode for decision runs
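The sorted-reduce mitigation can be sketched as follows (`score` is a stand-in for real per-symbol work): results are collected in whatever order workers finish, then put into a canonical key order before any aggregation touches them.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def score(symbol: str) -> tuple[str, float]:
    # Stand-in for real per-symbol signal work
    return symbol, float(len(symbol))

def deterministic_scores(symbols: list[str], workers: int = 4) -> list[tuple[str, float]]:
    """Run in parallel, then reduce in a canonical key order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(score, s) for s in symbols]
        # Completion order is scheduler-dependent and NOT reproducible
        results = [f.result() for f in as_completed(futures)]
    # Canonical sort by stable key before any downstream aggregation
    return sorted(results, key=lambda kv: kv[0])
```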

3) Floating-point behavior

Small FP differences can snowball through path-dependent execution logic.

Mitigations:

  • fix the reduction order (or use compensated summation) for path-critical aggregates
  • avoid fast-math style compiler flags and nondeterministic GPU kernels in audit runs
  • pin BLAS/linear-algebra library versions and thread counts
  • prefer bit-for-bit rerun comparison; where impossible, document explicit tolerances
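A small illustration of why reduction order matters, using Python's exactly rounded `math.fsum` as the order-independent alternative:

```python
import math

# Naive left-to-right summation is order dependent:
# 1.0 is absorbed when it is added to 1e16 first.
assert sum([1e16, 1.0, -1e16]) == 0.0
assert sum([1e16, -1e16, 1.0]) == 1.0

# Exactly rounded summation gives the same answer regardless of order
assert math.fsum([1e16, 1.0, -1e16]) == 1.0
assert math.fsum([1e16, -1e16, 1.0]) == 1.0
```

A parallel sum over shards whose combine order varies with scheduling has exactly this problem, which is why audit mode should fix the reduction order.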

4) Clock/timezone semantics

Pitfalls:

  • naive timestamps mixed with timezone-aware ones
  • exchange-local times joined against UTC feeds
  • daylight-saving transitions shifting bar and session boundaries
  • holiday/half-day calendars that differ across data sources

Mitigations:

  • normalize to UTC at ingestion; convert only at the display edge
  • store the source timezone alongside each dataset
  • treat session calendars as versioned inputs, not hardcoded assumptions
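A sketch of the normalize-at-ingestion rule using the standard-library `zoneinfo` (the function name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def to_utc(ts: datetime, source_tz: str) -> datetime:
    """Normalize a timestamp to UTC; naive inputs get the recorded source tz."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=ZoneInfo(source_tz))
    return ts.astimezone(UTC)

# A 16:00 New York close is 21:00 UTC in winter (EST) but 20:00 UTC in summer (EDT),
# which is exactly the kind of shift that silently misaligns bars
assert to_utc(datetime(2026, 1, 15, 16, 0), "America/New_York").hour == 21
assert to_utc(datetime(2026, 7, 15, 16, 0), "America/New_York").hour == 20
```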


Data lineage: treat datasets like code releases

Snapshot, don’t “latest”

Never backtest from mutable “latest” tables when making decisions.

Use immutable snapshots with IDs, for example:

  • md_2026w09_v2 (market data, week 09, revision 2)
  • ca_2026w09_v1 (corporate actions)
  • uni_2026w09_v3 (universe definition)
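One way to mint such IDs is content addressing, sketched here under the assumption that a dataset snapshot is a set of files on disk (names and label scheme are illustrative):

```python
import hashlib
from pathlib import Path

def snapshot_id(files: list[Path], label: str) -> str:
    """Content-address a dataset: identical bytes give the same ID, any edit a new one."""
    h = hashlib.sha256()
    for path in sorted(files):  # canonical order, so listing order never matters
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return f"{label}_{h.hexdigest()[:12]}"
```

Because the ID is derived from the bytes, a “corrected” table can never silently keep its old ID.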

Point-in-time integrity

Enforce:

  • features computed only from data available at the decision timestamp
  • as-of joins keyed on knowledge (availability) time, not event time
  • restatements appended as new rows, never in-place updates
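The knowledge-time rule, sketched with `pandas.merge_asof` on made-up fundamentals data: a decision taken before a figure became known must not see it.

```python
import pandas as pd

# Event time vs knowledge time: Q4 earnings become *known* weeks after quarter end
fundamentals = pd.DataFrame({
    "knowledge_ts": pd.to_datetime(["2026-01-20", "2026-04-21"]),
    "eps": [1.10, 1.35],
})
decisions = pd.DataFrame({"decision_ts": pd.to_datetime(["2026-01-10", "2026-02-01"])})

# Backward as-of join on knowledge time: the Jan 10 decision sees no EPS yet (NaN);
# the Feb 1 decision sees the figure published on Jan 20
joined = pd.merge_asof(
    decisions, fundamentals,
    left_on="decision_ts", right_on="knowledge_ts",
    direction="backward",
)
```

Note that `merge_asof` requires both frames to be sorted on their join keys, which is a useful forcing function for canonical ordering anyway.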

Vendor corrections policy

When the data vendor backfills or corrects history:

  • publish the correction as a new snapshot version; never mutate the old one
  • record a diff (affected symbols, date range) alongside the new snapshot
  • rerun and re-tag any decision-grade backtests that consumed the old snapshot
  • keep prior versions queryable so historical results remain explainable


Minimal run manifest (copy/paste schema)

{
  "run_id": "bt_2026-03-04_9f1a2c",
  "strategy": "mean_reversion_v3",
  "params_hash": "sha256:...",
  "code": {
    "git_sha": "abc123...",
    "dirty": false
  },
  "data": {
    "market_snapshot": "md_2026w09_v2",
    "corp_actions_snapshot": "ca_2026w09_v1",
    "universe_snapshot": "uni_2026w09_v3"
  },
  "environment": {
    "container_digest": "sha256:...",
    "lockfile_hash": "sha256:...",
    "cpu_arch": "arm64"
  },
  "execution": {
    "master_seed": 4201337,
    "threads": 1,
    "deterministic_mode": true
  },
  "cost_model": {
    "fees_model": "fees_v4",
    "slippage_model": "slip_v12",
    "borrow_model": "borrow_v2"
  }
}

Store this manifest next to all output artifacts (PNL, trades, diagnostics).
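Two of the manifest fields can be produced like this (a sketch: `params_hash` via canonical JSON, code identity via the git CLI; helper names are illustrative):

```python
import hashlib
import json
import subprocess

def params_hash(params: dict) -> str:
    """Hash parameters via canonical JSON so key order never changes the hash."""
    canon = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canon.encode()).hexdigest()

def git_identity() -> dict:
    """Capture the code-identity fields of the manifest from the working tree."""
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True).stdout.strip()
    dirty = subprocess.run(["git", "status", "--porcelain"],
                           capture_output=True, text=True).stdout != ""
    return {"git_sha": sha, "dirty": dirty}
```

Canonical serialization matters: hashing `str(params)` or non-sorted JSON would let dict ordering or repr changes silently alter `params_hash` between runs.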


Reproducibility operating modes

Use two explicit modes:

  1. Fast exploration mode
    • Parallelized, approximate, cheap
    • Used for ideation only
  2. Audit mode (decision mode)
    • Deterministic settings enforced
    • Snapshot-pinned data + locked environment
    • Required before promotion / allocation decisions

Do not mix the two when reporting investment-grade numbers.


Drift triage when results don’t match

When rerun output differs, follow this order:

  1. Manifest diff (code/data/env/cost model)
  2. Trade count diff (first divergence timestamp)
  3. Signal diff (feature values and thresholds)
  4. Execution diff (fill assumptions, latency/slippage settings)
  5. Accounting diff (fees/borrow/FX)

A “first divergence” tool (binary search on timeline) pays for itself quickly.
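When both trade lists fit in memory, the tool reduces to a first-mismatch scan; real systems typically bisect over checkpointed state instead, because regenerating full runs is expensive. A minimal sketch (names illustrative):

```python
def first_divergence(trades_a: list, trades_b: list):
    """Index of the first trade where two runs disagree, or None if identical."""
    for i, (a, b) in enumerate(zip(trades_a, trades_b)):
        if a != b:
            return i
    # One run produced extra trades beyond the common prefix
    if len(trades_a) != len(trades_b):
        return min(len(trades_a), len(trades_b))
    return None
```

In practice each trade would carry a timestamp, so the returned index maps directly to the first divergence time used in steps 3 to 5.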


Promotion gate checklist (before live or paper promotion)

  • Audit-mode rerun reproduces the trade list exactly (or within a documented tolerance)
  • Run manifest is complete: code, data, environment, execution, and cost-model identity
  • All data inputs are pinned to immutable snapshot IDs
  • Environment is captured as a container digest or lockfile hash
  • Cost/fee/slippage model versions match what live execution will use

If any box is unchecked, strategy status remains research-only.
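The manifest-completeness parts of this gate can be enforced mechanically; a sketch assuming the manifest schema shown earlier (the required-field table and helper name are illustrative):

```python
REQUIRED_MANIFEST_FIELDS = {
    "code": ["git_sha", "dirty"],
    "data": ["market_snapshot", "corp_actions_snapshot", "universe_snapshot"],
    "environment": ["container_digest", "lockfile_hash"],
    "execution": ["master_seed", "deterministic_mode"],
    "cost_model": ["fees_model", "slippage_model"],
}

def promotion_blockers(manifest: dict) -> list[str]:
    """Return human-readable reasons a run may not be promoted (empty list = pass)."""
    blockers = []
    for section, fields in REQUIRED_MANIFEST_FIELDS.items():
        for field in fields:
            if manifest.get(section, {}).get(field) in (None, "", []):
                blockers.append(f"missing {section}.{field}")
    if manifest.get("code", {}).get("dirty"):
        blockers.append("dirty working tree: commit before audit runs")
    if not manifest.get("execution", {}).get("deterministic_mode"):
        blockers.append("audit runs must set execution.deterministic_mode")
    return blockers
```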


Practical architecture pattern

A robust setup usually has:

  • an immutable, content-addressed snapshot store for datasets
  • a deterministic audit runner kept separate from the fast exploration path
  • a run manifest written next to every output artifact
  • automated manifest and trade-list diffs between reruns
  • a promotion gate that rejects runs with incomplete manifests

This can be implemented incrementally—start with manifest + snapshot pinning, then add automated diffs and policy gates.


Common anti-patterns

  1. “It ran from my notebook, trust me.”
  2. Mutable CSV/parquet overwrites without versioning
  3. Hidden default parameters in library upgrades
  4. Mixed timezone assumptions across data sources
  5. Backtests that cannot regenerate the exact trade list

Each anti-pattern increases false confidence more than it increases speed.


Rule of thumb

If you cannot recreate the same trades from the same manifest six months later, you do not have a backtest—you have a story.