Reproducible Backtests: Determinism, Data Lineage, and Experiment Audit Playbook

2026-03-04 · systems

Purpose: A practical operating guide to make backtest results repeatable, auditable, and trustworthy in production quant workflows.


Why this matters

A strategy that looks great once but cannot be reproduced is not alpha—it is an accident.

Most “mysterious” backtest drift comes from a few operational issues:

  • unseeded or globally shared randomness
  • mutable “latest” data instead of pinned snapshots
  • silent environment and dependency drift
  • inconsistent timezone and calendar handling
  • fee/slippage assumptions that change between runs

If your team cannot answer "which exact inputs generated this equity curve?", your research stack is fragile.


The reproducibility contract (non-negotiables)

Every backtest run should produce and store:

  1. Code identity
    • Git commit SHA
    • Dirty/clean working tree state
    • Strategy module + parameter set
  2. Data identity
    • Dataset version ID (snapshot hash)
    • Universe definition version
    • Corporate actions / adjustments version
  3. Environment identity
    • OS, CPU architecture, Python/Rust/Node versions
    • Dependency lockfile hash
    • Container image digest (if used)
  4. Execution identity
    • Random seed policy
    • Thread/concurrency config
    • Start/end timestamps and wall-clock duration
  5. Accounting identity
    • Fee/slippage model version
    • Borrow/funding assumptions
    • FX conversion policy

A run artifact missing any of these fields should not be considered “research grade.”


Determinism first: where randomness sneaks in

1) Pseudorandomness

Pattern: a single recorded master seed, from which every component derives its own random stream. Never rely on implicitly seeded global RNGs (library defaults, hash randomization, OS entropy), since these change silently between runs.

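A minimal sketch of the pattern in Python (helper names are illustrative, not from any particular framework): each component derives its stream from the recorded master seed via a stable digest of its name, so reruns reproduce identical draws.

```python
import hashlib

import numpy as np

MASTER_SEED = 4201337  # recorded in the run manifest

def component_rng(component: str) -> np.random.Generator:
    """Derive a deterministic per-component RNG from the master seed."""
    # Use a stable digest, NOT Python's built-in hash(), which is
    # randomized per process and would break reproducibility.
    digest = int.from_bytes(hashlib.sha256(component.encode()).digest()[:4], "big")
    return np.random.default_rng(np.random.SeedSequence([MASTER_SEED, digest]))

rng_signals = component_rng("signal_noise")

# Reruns with the same master seed and component name reproduce identical draws
assert component_rng("signal_noise").random() == rng_signals.random()
```

Distinct component names yield independent streams, so adding a new randomized component never perturbs the draws of existing ones.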
2) Parallelism and ordering

Common pitfall: results are aggregated in worker completion order, so the output depends on thread scheduling rather than on the inputs.

Mitigations:

  • pin worker/thread counts and record them in the manifest
  • assign work by stable keys and sort results before any reduction
  • prefer order-independent aggregations; otherwise enforce a canonical order
  • provide a single-threaded audit mode for decision runs
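The sorted-reduce mitigation can be sketched as follows (`score` is a stand-in for real per-symbol work): results are collected in whatever order workers finish, then put into a canonical key order before any aggregation touches them.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def score(symbol: str) -> tuple[str, float]:
    # Stand-in for real per-symbol signal work
    return symbol, float(len(symbol))

def deterministic_scores(symbols: list[str], workers: int = 4) -> list[tuple[str, float]]:
    """Run in parallel, then reduce in a canonical key order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(score, s) for s in symbols]
        # Completion order is scheduler-dependent and NOT reproducible
        results = [f.result() for f in as_completed(futures)]
    # Canonical sort by stable key before any downstream aggregation
    return sorted(results, key=lambda kv: kv[0])
```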

3) Floating-point behavior

Small FP differences can snowball through path-dependent execution logic.

Mitigations:

  • fix the reduction order (or use compensated summation) for path-critical aggregates
  • avoid fast-math style compiler flags and nondeterministic GPU kernels in audit runs
  • pin BLAS/linear-algebra library versions and thread counts
  • prefer bit-for-bit rerun comparison; where impossible, document explicit tolerances
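A small illustration of why reduction order matters, using Python's exactly rounded `math.fsum` as the order-independent alternative:

```python
import math

# Naive left-to-right summation is order dependent:
# 1.0 is absorbed when it is added to 1e16 first.
assert sum([1e16, 1.0, -1e16]) == 0.0
assert sum([1e16, -1e16, 1.0]) == 1.0

# Exactly rounded summation gives the same answer regardless of order
assert math.fsum([1e16, 1.0, -1e16]) == 1.0
assert math.fsum([1e16, -1e16, 1.0]) == 1.0
```

A parallel sum over shards whose combine order varies with scheduling has exactly this problem, which is why audit mode should fix the reduction order.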

4) Clock/timezone semantics

Pitfalls:

  • naive timestamps mixed with timezone-aware ones
  • exchange-local times joined against UTC feeds
  • daylight-saving transitions shifting bar and session boundaries
  • holiday/half-day calendars that differ across data sources

Mitigations:

  • normalize to UTC at ingestion; convert only at the display edge
  • store the source timezone alongside each dataset
  • treat session calendars as versioned inputs, not hardcoded assumptions
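A sketch of the normalize-at-ingestion rule using the standard-library `zoneinfo` (the function name is illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def to_utc(ts: datetime, source_tz: str) -> datetime:
    """Normalize a timestamp to UTC; naive inputs get the recorded source tz."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=ZoneInfo(source_tz))
    return ts.astimezone(UTC)

# A 16:00 New York close is 21:00 UTC in winter (EST) but 20:00 UTC in summer (EDT),
# which is exactly the kind of shift that silently misaligns bars
assert to_utc(datetime(2026, 1, 15, 16, 0), "America/New_York").hour == 21
assert to_utc(datetime(2026, 7, 15, 16, 0), "America/New_York").hour == 20
```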


Data lineage: treat datasets like code releases

Snapshot, don’t “latest”

Never backtest from mutable “latest” tables when making decisions.

Use immutable snapshots with IDs, for example:

  • md_2026w09_v2 (market data, week 09, revision 2)
  • ca_2026w09_v1 (corporate actions)
  • uni_2026w09_v3 (universe definition)
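One way to mint such IDs is content addressing, sketched here under the assumption that a dataset snapshot is a set of files on disk (names and label scheme are illustrative):

```python
import hashlib
from pathlib import Path

def snapshot_id(files: list[Path], label: str) -> str:
    """Content-address a dataset: identical bytes give the same ID, any edit a new one."""
    h = hashlib.sha256()
    for path in sorted(files):  # canonical order, so listing order never matters
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return f"{label}_{h.hexdigest()[:12]}"
```

Because the ID is derived from the bytes, a “corrected” table can never silently keep its old ID.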

Point-in-time integrity

Enforce:

  • features computed only from data available at the decision timestamp
  • as-of joins keyed on knowledge (availability) time, not event time
  • restatements appended as new rows, never in-place updates
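The knowledge-time rule, sketched with `pandas.merge_asof` on made-up fundamentals data: a decision taken before a figure became known must not see it.

```python
import pandas as pd

# Event time vs knowledge time: Q4 earnings become *known* weeks after quarter end
fundamentals = pd.DataFrame({
    "knowledge_ts": pd.to_datetime(["2026-01-20", "2026-04-21"]),
    "eps": [1.10, 1.35],
})
decisions = pd.DataFrame({"decision_ts": pd.to_datetime(["2026-01-10", "2026-02-01"])})

# Backward as-of join on knowledge time: the Jan 10 decision sees no EPS yet (NaN);
# the Feb 1 decision sees the figure published on Jan 20
joined = pd.merge_asof(
    decisions, fundamentals,
    left_on="decision_ts", right_on="knowledge_ts",
    direction="backward",
)
```

Note that `merge_asof` requires both frames to be sorted on their join keys, which is a useful forcing function for canonical ordering anyway.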

Vendor corrections policy

When the data vendor backfills or corrects history:

  • publish the correction as a new snapshot version; never mutate the old one
  • record a diff (affected symbols, date range) alongside the new snapshot
  • rerun and re-tag any decision-grade backtests that consumed the old snapshot
  • keep prior versions queryable so historical results remain explainable


Minimal run manifest (copy/paste schema)

{
  "run_id": "bt_2026-03-04_9f1a2c",
  "strategy": "mean_reversion_v3",
  "params_hash": "sha256:...",
  "code": {
    "git_sha": "abc123...",
    "dirty": false
  },
  "data": {
    "market_snapshot": "md_2026w09_v2",
    "corp_actions_snapshot": "ca_2026w09_v1",
    "universe_snapshot": "uni_2026w09_v3"
  },
  "environment": {
    "container_digest": "sha256:...",
    "lockfile_hash": "sha256:...",
    "cpu_arch": "arm64"
  },
  "execution": {
    "master_seed": 4201337,
    "threads": 1,
    "deterministic_mode": true
  },
  "cost_model": {
    "fees_model": "fees_v4",
    "slippage_model": "slip_v12",
    "borrow_model": "borrow_v2"
  }
}

Store this manifest next to all output artifacts (PNL, trades, diagnostics).
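Two of the manifest fields can be produced like this (a sketch: `params_hash` via canonical JSON, code identity via the git CLI; helper names are illustrative):

```python
import hashlib
import json
import subprocess

def params_hash(params: dict) -> str:
    """Hash parameters via canonical JSON so key order never changes the hash."""
    canon = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canon.encode()).hexdigest()

def git_identity() -> dict:
    """Capture the code-identity fields of the manifest from the working tree."""
    sha = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True).stdout.strip()
    dirty = subprocess.run(["git", "status", "--porcelain"],
                           capture_output=True, text=True).stdout != ""
    return {"git_sha": sha, "dirty": dirty}
```

Canonical serialization matters: hashing `str(params)` or non-sorted JSON would let dict ordering or repr changes silently alter `params_hash` between runs.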


Reproducibility operating modes

Use two explicit modes:

  1. Fast exploration mode
    • Parallelized, approximate, cheap
    • Used for ideation only
  2. Audit mode (decision mode)
    • Deterministic settings enforced
    • Snapshot-pinned data + locked environment
    • Required before promotion / allocation decisions

Do not mix the two when reporting investment-grade numbers.


Drift triage when results don’t match

When rerun output differs, follow this order:

  1. Manifest diff (code/data/env/cost model)
  2. Trade count diff (first divergence timestamp)
  3. Signal diff (feature values and thresholds)
  4. Execution diff (fill assumptions, latency/slippage settings)
  5. Accounting diff (fees/borrow/FX)

A “first divergence” tool (binary search on timeline) pays for itself quickly.
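When both trade lists fit in memory, the tool reduces to a first-mismatch scan; real systems typically bisect over checkpointed state instead, because regenerating full runs is expensive. A minimal sketch (names illustrative):

```python
def first_divergence(trades_a: list, trades_b: list):
    """Index of the first trade where two runs disagree, or None if identical."""
    for i, (a, b) in enumerate(zip(trades_a, trades_b)):
        if a != b:
            return i
    # One run produced extra trades beyond the common prefix
    if len(trades_a) != len(trades_b):
        return min(len(trades_a), len(trades_b))
    return None
```

In practice each trade would carry a timestamp, so the returned index maps directly to the first divergence time used in steps 3 to 5.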


Promotion gate checklist (before live or paper promotion)

  • Audit-mode rerun reproduces the trade list exactly (or within a documented tolerance)
  • Run manifest is complete: code, data, environment, execution, and cost-model identity
  • All data inputs are pinned to immutable snapshot IDs
  • Environment is captured as a container digest or lockfile hash
  • Cost/fee/slippage model versions match what live execution will use

If any box is unchecked, strategy status remains research-only.
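The manifest-completeness parts of this gate can be enforced mechanically; a sketch assuming the manifest schema shown earlier (the required-field table and helper name are illustrative):

```python
REQUIRED_MANIFEST_FIELDS = {
    "code": ["git_sha", "dirty"],
    "data": ["market_snapshot", "corp_actions_snapshot", "universe_snapshot"],
    "environment": ["container_digest", "lockfile_hash"],
    "execution": ["master_seed", "deterministic_mode"],
    "cost_model": ["fees_model", "slippage_model"],
}

def promotion_blockers(manifest: dict) -> list[str]:
    """Return human-readable reasons a run may not be promoted (empty list = pass)."""
    blockers = []
    for section, fields in REQUIRED_MANIFEST_FIELDS.items():
        for field in fields:
            if manifest.get(section, {}).get(field) in (None, "", []):
                blockers.append(f"missing {section}.{field}")
    if manifest.get("code", {}).get("dirty"):
        blockers.append("dirty working tree: commit before audit runs")
    if not manifest.get("execution", {}).get("deterministic_mode"):
        blockers.append("audit runs must set execution.deterministic_mode")
    return blockers
```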


Practical architecture pattern

A robust setup usually has:

  • an immutable, content-addressed snapshot store for datasets
  • a deterministic audit runner kept separate from the fast exploration path
  • a run manifest written next to every output artifact
  • automated manifest and trade-list diffs between reruns
  • a promotion gate that rejects runs with incomplete manifests

This can be implemented incrementally—start with manifest + snapshot pinning, then add automated diffs and policy gates.


Common anti-patterns

  1. “It ran from my notebook, trust me.”
  2. Mutable CSV/parquet overwrites without versioning
  3. Hidden default parameters in library upgrades
  4. Mixed timezone assumptions across data sources
  5. Backtests that cannot regenerate the exact trade list

Each anti-pattern increases false confidence more than it increases speed.


Rule of thumb

If you cannot recreate the same trades from the same manifest six months later, you do not have a backtest—you have a story.