Execution Simulator Fidelity Ladder Validation Playbook
Date: 2026-03-15
Category: knowledge
Audience: small quant teams deploying live execution logic with limited incident budget
Why this playbook exists
Most teams either:
- trust backtests too much (fills are unrealistically easy), or
- overbuild simulation too early (high complexity, low decision value).
The practical answer is a fidelity ladder: use the simplest simulator that can falsify the decision you are currently making, then climb only when error signals justify it.
Core principle
Use simulation as a risk filter, not a PnL oracle.
For each strategy/tactic candidate, ask:
- Does it survive plausible microstructure stress?
- Does it stay within p95 slippage / completion / reject budgets?
- Does it fail in ways we can detect and operate through?
If not, do not promote.
Fidelity ladder (L0 -> L4)
L0: Deterministic Cost Skeleton
What it is:
- static spread + fee + simple impact curve
- no queue dynamics
- no event-time uncertainty
Good for:
- rough ranking of high-level schedule families (TWAP/POV/IS variants)
- sanity-checking parameter scales
Do NOT use for:
- maker/taker policy tuning
- timeout/cancel/retry decisions
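As a rough illustration of how small an L0 model can be, the sketch below adds half the quoted spread, a flat fee, and a power-law impact term. The square-root impact form, the coefficient values, and the function name `l0_cost_bps` are assumptions made for this example and are not calibrated to any venue.

```python
def l0_cost_bps(spread_bps, fee_bps, participation,
                impact_coeff_bps=10.0, impact_exponent=0.5):
    """Deterministic taker cost estimate in bps (hypothetical L0 skeleton).

    participation is order size as a fraction of interval volume.
    The impact curve coeff * participation ** exponent is an assumed form.
    """
    if not 0.0 <= participation <= 1.0:
        raise ValueError("participation must be within [0, 1]")
    half_spread = spread_bps / 2.0  # cost of crossing from mid
    impact = impact_coeff_bps * participation ** impact_exponent
    return half_spread + fee_bps + impact


# Rank two schedule families by the participation rate they imply.
slow = l0_cost_bps(spread_bps=4.0, fee_bps=0.5, participation=0.01)
fast = l0_cost_bps(spread_bps=4.0, fee_bps=0.5, participation=0.09)
```

Because nothing here depends on queue state or timing, the output is only useful for ordering coarse alternatives, which matches the "Do NOT use for" list above.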
L1: Replay with Causal Latency Injection
What it is:
- historical event replay
- measured latency distributions injected into decision -> send -> ack path
- deterministic policy replay under realistic timing noise
Good for:
- validating deadline/urgency logic
- checking cancel/replace path stability under jitter
Promotion gate to L2 (error signals that justify the climb):
- repeated policy rank instability across jitter seeds
- p95 degradation concentrated in queue/latency interactions
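One possible way to inject latency causally is to resample measured delays with a seeded generator, so that each jitter seed gives a reproducible timeline. The sketch below assumes latencies are available as plain lists of microsecond samples; `TimedDecision`, `inject_latency`, and `replay_timestamps` are hypothetical names for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedDecision:
    decision_ts_us: int
    send_ts_us: int
    ack_ts_us: int


def inject_latency(decision_ts_us, send_samples_us, ack_samples_us, rng):
    """Draw one send delay and one ack delay from the measured samples."""
    send_ts = decision_ts_us + rng.choice(send_samples_us)
    ack_ts = send_ts + rng.choice(ack_samples_us)
    return TimedDecision(decision_ts_us, send_ts, ack_ts)


def replay_timestamps(decision_times_us, send_samples_us, ack_samples_us, seed):
    """Build a jittered timeline; the same seed always yields the same result."""
    rng = random.Random(seed)
    return [inject_latency(t, send_samples_us, ack_samples_us, rng)
            for t in decision_times_us]


timeline = replay_timestamps([1_000, 2_000, 3_000],
                             send_samples_us=[40, 55, 300],
                             ack_samples_us=[80, 120, 900],
                             seed=7)
```

Running the same policy over several seeds and comparing the resulting policy ranks is what surfaces the rank instability named in the gate above.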
L2: Queue-Aware MBP Simulator
What it is:
- market-by-price (top N levels) reconstruction
- probabilistic queue-position inference
- cancel/refill hazard models
Good for:
- passive vs aggressive switching
- queue-reset tax estimation (cancel/replace cost)
- fill probability realism for child-order timing
Known limitation:
- MBP data carries no per-order IDs, so queue rank is always an estimate, never an observation
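To make the queue-rank estimate concrete, here is a minimal sketch of one common simplification: trades consume volume ahead of our order first, and cancels are assumed to be spread proportionally over the resting volume ahead of and behind us. The proportional-cancel assumption and the function name `update_queue_ahead` are illustrative choices; a calibrated cancel hazard model would replace that step.

```python
def update_queue_ahead(ahead, behind, traded, cancelled):
    """Update estimated volume ahead of and behind our resting order.

    Returns (ahead, behind, reached_us), where reached_us is traded
    volume left over after the queue ahead of us was exhausted.
    """
    # Trades hit the front of the queue first.
    consumed = min(traded, ahead)
    ahead -= consumed
    reached_us = traded - consumed

    # Assumption: cancels fall proportionally on volume ahead and behind.
    resting = ahead + behind
    if resting > 0 and cancelled > 0:
        cancelled = min(cancelled, resting)
        ahead_cancelled = cancelled * ahead / resting
        ahead -= ahead_cancelled
        behind -= cancelled - ahead_cancelled
    return ahead, behind, reached_us


# 100 lots ahead, 100 behind, then 50 lots cancel at our price level.
ahead, behind, reached = update_queue_ahead(100.0, 100.0, traded=0.0, cancelled=50.0)
```

Cancelling and re-entering resets `ahead` to the full displayed size at the level, which is one way to quantify the queue-reset tax.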
L3: Agent-Based Regime Simulator
What it is:
- simulated LP/noise/informed flow agents
- endogenous spread/imbalance/cancel bursts
- explicit regime packs (calm/fragile/panic)
Good for:
- stress testing policy brittleness
- evaluating control-state transitions and hysteresis
- understanding worst-case branch behavior (not point forecasts)
Anti-footgun:
- do not trust absolute bps output
- trust relative robustness ordering and failure signatures
L4: Hybrid Digital Twin (Replay + Synthetic Shocks)
What it is:
- real replay backbone + injected counterfactual shocks:
  - quote fade bursts
  - reject storms
  - ack-delay pockets
  - venue outage windows
Good for:
- go/no-go for production rollout
- incident-runbook rehearsal
- verifying kill-switch and SAFE-mode behavior
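A hybrid run can be pictured as a pure function that overlays shock windows on replayed order events. The sketch below covers only two of the shock types listed above (ack-delay pockets and reject storms); the `OrderEvent` and `Shock` records and their fields are hypothetical, not a real replay schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OrderEvent:
    send_ts_us: int
    ack_ts_us: int
    rejected: bool = False


@dataclass(frozen=True)
class Shock:
    kind: str                    # "ack_delay" or "reject_storm"
    start_us: int                # window start, inclusive
    end_us: int                  # window end, exclusive
    extra_ack_delay_us: int = 0  # only used by "ack_delay"


def apply_shocks(events, shocks):
    """Return new events with every shock whose window covers the send time."""
    shocked = []
    for ev in events:
        for s in shocks:
            if not (s.start_us <= ev.send_ts_us < s.end_us):
                continue
            if s.kind == "ack_delay":
                ev = replace(ev, ack_ts_us=ev.ack_ts_us + s.extra_ack_delay_us)
            elif s.kind == "reject_storm":
                ev = replace(ev, rejected=True)
            else:
                raise ValueError(f"unknown shock kind: {s.kind}")
        shocked.append(ev)
    return shocked


replayed = [OrderEvent(100, 150), OrderEvent(500, 550), OrderEvent(900, 950)]
shocks = [Shock("ack_delay", 400, 600, extra_ack_delay_us=1_000),
          Shock("reject_storm", 800, 1_000)]
stressed = apply_shocks(replayed, shocks)
```

Keeping the original events immutable means the unshocked replay stays available as the baseline for the same run.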
Minimum metrics per ladder run
Track at least:
- IS_bps_q50/q90/q95
- completion_ratio
- deadline_miss_rate
- reject_retry_rate
- cancel_replace_churn
- panic_cross_share (late aggressive completion proportion)
For stress runs also track:
- max_drawdown_of_edge (execution-only edge erosion)
- time_in_defensive_state
- recovery_time_after_shock
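A few of the per-run metrics can be computed from per-order records as sketched below. The record fields (`is_bps`, `target_qty`, `filled_qty`, `missed_deadline`) are assumed for the example, and the nearest-rank quantile is one convention among several.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile: smallest value with at least q of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def run_metrics(orders):
    """Summarise one ladder run from a list of per-order dicts."""
    is_bps = [o["is_bps"] for o in orders]
    target = sum(o["target_qty"] for o in orders)
    filled = sum(o["filled_qty"] for o in orders)
    missed = sum(1 for o in orders if o["missed_deadline"])
    return {
        "IS_bps_q50": quantile(is_bps, 0.50),
        "IS_bps_q90": quantile(is_bps, 0.90),
        "IS_bps_q95": quantile(is_bps, 0.95),
        "completion_ratio": filled / target,
        "deadline_miss_rate": missed / len(orders),
    }


# Twenty synthetic orders with slippage 1..20 bps; the last two miss their deadline.
orders = [{"is_bps": float(i), "target_qty": 100, "filled_qty": 90,
           "missed_deadline": i > 18} for i in range(1, 21)]
summary = run_metrics(orders)
```

No-fill orders should still appear in `orders` with `filled_qty` of zero so that they pull `completion_ratio` down instead of vanishing from the sample.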
Promotion policy (practical)
A candidate moves up a tier only if all of the following are true:
- Tail control: p95 slippage within budget in current ladder tier
- Completion safety: completion ratio above floor
- Operational stability: no runaway retry/cancel loops
- Robustness: ranking remains acceptable across seed/regime perturbations
A candidate moves to limited live canary only if:
- L3/L4 stress behavior is explainable
- rollback conditions are pre-defined
- kill-switch path is dry-run validated
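The four tier-promotion conditions can be encoded as an explicit gate that reports which checks failed, which keeps promotion decisions auditable. This is a sketch under assumptions: the `Budgets` fields and the `rank_stability` metric (for example, the share of seed/regime runs in which the candidate keeps an acceptable rank) are hypothetical names, and the thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budgets:
    max_is_bps_q95: float
    min_completion_ratio: float
    max_reject_retry_rate: float
    max_cancel_replace_churn: float
    min_rank_stability: float


def promotion_failures(metrics, budgets):
    """Return the names of failed checks; an empty list means promote."""
    checks = [
        ("tail_control",
         metrics["IS_bps_q95"] <= budgets.max_is_bps_q95),
        ("completion_safety",
         metrics["completion_ratio"] >= budgets.min_completion_ratio),
        ("operational_stability",
         metrics["reject_retry_rate"] <= budgets.max_reject_retry_rate
         and metrics["cancel_replace_churn"] <= budgets.max_cancel_replace_churn),
        ("robustness",
         metrics["rank_stability"] >= budgets.min_rank_stability),
    ]
    return [name for name, passed in checks if not passed]


budgets = Budgets(max_is_bps_q95=12.0, min_completion_ratio=0.95,
                  max_reject_retry_rate=0.02, max_cancel_replace_churn=3.0,
                  min_rank_stability=0.8)
candidate = {"IS_bps_q95": 14.5, "completion_ratio": 0.97,
             "reject_retry_rate": 0.01, "cancel_replace_churn": 2.1,
             "rank_stability": 0.9}
failed = promotion_failures(candidate, budgets)
```

The canary conditions (explainable stress behavior, pre-defined rollback, dry-run kill switch) are review items and are deliberately left out of the automated gate.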
Data contract checklist
Without this, simulator fidelity claims are mostly theater:
- event-time normalized logs (decision/send/ack/fill/cancel)
- venue and tactic identifiers preserved end-to-end
- point-in-time reference prices (no hindsight corrections)
- explicit partial-fill and no-fill outcome records
- deterministic run manifest (code hash, config hash, data snapshot hash)
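The run manifest in the last checklist item could look like the sketch below, which hashes code, config, and data snapshot with SHA-256 from the standard library. Passing raw bytes, the `seed` field, and the derived `run_id` are assumptions for this example; a real setup might hash a git commit and file contents instead.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(code_bytes: bytes, config: dict, data_bytes: bytes, seed: int) -> dict:
    """Build a manifest that is identical whenever all inputs are identical."""
    # Canonical JSON so key order in the config dict cannot change the hash.
    config_json = json.dumps(config, sort_keys=True, separators=(",", ":"))
    manifest = {
        "code_hash": sha256_hex(code_bytes),
        "config_hash": sha256_hex(config_json.encode("utf-8")),
        "data_snapshot_hash": sha256_hex(data_bytes),
        "seed": seed,
    }
    # Short identifier derived from the manifest itself (hypothetical convention).
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = sha256_hex(body)[:16]
    return manifest


m1 = build_manifest(b"policy-v1", {"tier": "L1", "jitter": True}, b"snapshot-a", seed=7)
m2 = build_manifest(b"policy-v1", {"jitter": True, "tier": "L1"}, b"snapshot-a", seed=7)
```

Two runs with equal `run_id` values should produce identical metrics; if they do not, the simulator has a hidden source of nondeterminism.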
Common failure modes
Single-regime overfitting
- Fix: force calm/fragile/panic scenario suite in every promotion cycle.
Mean-only validation
- Fix: gate on q95/CVaR-style tails, not just average bps.
Ignoring no-fill branches
- Fix: include opportunity cost and deadline penalties explicitly.
Unreproducible simulator runs
- Fix: seed control + immutable manifests.
Policy complexity outpacing observability
- Fix: keep action space small until diagnostics are reliable.
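To illustrate the mean-only validation failure above, the sketch below compares the average slippage with a CVaR-style tail mean on a sample that has one very bad order. The tail definition used here (mean of the worst observations beyond the alpha quantile, with at least one observation) is one simple convention, and `cvar` is a hypothetical helper name.

```python
def cvar(values, alpha=0.95):
    """Mean of the worst (1 - alpha) share of observations, at least one."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    n = len(ordered)
    # Number of observations treated as the tail; never fewer than one.
    tail_count = max(1, n - int(alpha * n))
    tail = ordered[-tail_count:]
    return sum(tail) / len(tail)


# Nineteen orders cost 1 bp; one panic cross costs 101 bps.
slippage_bps = [1.0] * 19 + [101.0]
mean_bps = sum(slippage_bps) / len(slippage_bps)
tail_bps = cvar(slippage_bps, alpha=0.95)
```

Here the mean looks tolerable while the tail metric isolates the single order that would breach most slippage budgets.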
30-day rollout template
Week 1:
- baseline L1 replay with latency injection
- define hard KPI budgets and rollback thresholds
Week 2:
- add L2 queue-aware features for key symbols/venues
- calibrate queue/cancel hazards
Week 3:
- run L3 stress packs and classify failure signatures
- tune hysteresis and defensive-state transitions
Week 4:
- L4 hybrid shock rehearsals
- canary candidate selection + incident playbook lock-in
Bottom line
Execution simulation should evolve like safety engineering:
- start simple,
- add realism where errors concentrate,
- gate promotions by tail-risk behavior,
- keep everything reproducible.
If your simulator cannot predict every fill, that is fine. If it cannot expose brittle behavior before live capital does, it is not doing its job.
References (starting points)
- ABIDES: Agent-Based Interactive Discrete Event Simulation (market simulation framework)
  https://github.com/abides-sim/abides
- LOBSTER data overview (event-level order book research data)
  https://lobsterdata.com/info/home
- Cartea, Jaimungal, Penalva: Algorithmic and High-Frequency Trading (microstructure & execution foundations)
  https://www.cambridge.org/core/books/algorithmic-and-highfrequency-trading/37A2CBB8F3F8E5A4F6B4B53E5EAED7E5
- Kissell: The Science of Algorithmic Trading and Portfolio Management (execution/TCA practice)
  https://www.elsevier.com/books/the-science-of-algorithmic-trading-and-portfolio-management/kissell/978-0-12-401689-7