Execution Simulator Fidelity Ladder Validation Playbook

Date: 2026-03-15
Category: knowledge
Audience: small quant teams deploying live execution logic with limited incident budget

Why this playbook exists

Most teams either:

trust backtests too much (fills are unrealistically easy), or
overbuild simulation too early (high complexity, low decision value).

The practical answer is a fidelity ladder: use the simplest simulator that can falsify the decision you are currently making, then climb only when error signals justify it.

Core principle

Use simulation as a risk filter, not a PnL oracle.

For each strategy/tactic candidate, ask:

Does it survive plausible microstructure stress?
Does it stay within p95 slippage / completion / reject budgets?
Does it fail in ways we can detect, contain, and recover from operationally?

If the answer to any of these is no, do not promote.

Fidelity ladder (L0 -> L4)

L0 — Deterministic Cost Skeleton

What it is:

static spread + fee + simple impact curve
no queue dynamics
no event-time uncertainty

Good for:

rough ranking of high-level schedule families (TWAP/POV/IS variants)
sanity-checking parameter scales

Do NOT use for:

maker/taker policy tuning
timeout/cancel/retry decisions
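
A minimal sketch of what an L0 cost skeleton can look like. The square-root impact form, the coefficient, and the names `l0_cost_bps`, `twap_slow`, and `pov_fast` are illustrative assumptions, not calibrated values or a prescribed model.

```python
import math

def l0_cost_bps(half_spread_bps: float, fee_bps: float,
                participation: float, impact_coef_bps: float = 10.0) -> float:
    """Deterministic cost estimate: static half spread + fee + impact curve.

    ASSUMPTION: impact grows with the square root of the participation
    rate. Any simple monotone curve would serve the same purpose at L0.
    """
    if not 0.0 <= participation <= 1.0:
        raise ValueError("participation must be in [0, 1]")
    impact_bps = impact_coef_bps * math.sqrt(participation)
    return half_spread_bps + fee_bps + impact_bps

# Rough ranking of two hypothetical schedule families by participation rate.
candidates = {"twap_slow": 0.04, "pov_fast": 0.16}
ranked = sorted(candidates, key=lambda name: l0_cost_bps(1.0, 0.3, candidates[name]))
```

Because nothing here models queues or timing, the output is only useful for ordering schedule families and checking that parameters are on a sensible scale.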

L1 — Replay with Causal Latency Injection

What it is:

historical event replay
measured latency distributions injected into the decision -> send -> ack path
deterministic policy replay under realistic timing noise

Good for:

validating deadline/urgency logic
checking cancel/replace path stability under jitter

Promotion gate to L2:

repeated policy rank instability across jitter seeds
p95 degradation concentrated in queue/latency interactions
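
One way to sketch the two L1 mechanics above: causal latency injection and a rank-instability check across jitter seeds. All function names are hypothetical, and resampling from measured latencies is just one of several reasonable injection schemes.

```python
import bisect
import random
from collections import Counter

def venue_arrival_ns(decision_ts_ns: int, latency_samples_ns: list,
                     rng: random.Random) -> int:
    """Decision time plus one latency drawn from measured samples."""
    return decision_ts_ns + rng.choice(latency_samples_ns)

def first_visible_event(event_ts_ns: list, arrival_ns: int) -> int:
    """Causality guard: index of the first replay event at or after the
    order's arrival. Returns len(event_ts_ns) if the order arrives after
    the last event. event_ts_ns must be sorted ascending."""
    return bisect.bisect_left(event_ts_ns, arrival_ns)

def rank_instability(costs_by_seed: dict) -> float:
    """Share of seeds whose lowest-cost policy differs from the most
    common lowest-cost policy. 0.0 means the winner never changes."""
    winners = [min(costs, key=costs.get) for costs in costs_by_seed.values()]
    modal_count = Counter(winners).most_common(1)[0][1]
    return 1.0 - modal_count / len(winners)
```

A seeded `random.Random` per run keeps each jitter draw reproducible, so a rank flip can be traced back to a specific seed.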

L2 — Queue-Aware MBP Simulator

What it is:

market-by-price (top N levels) reconstruction
probabilistic queue-position inference
cancel/refill hazard models

Good for:

passive vs aggressive switching
queue-reset tax estimation (cancel/replace cost)
fill probability realism for child-order timing

Known limitation:

MBP data carries no per-order IDs, so queue rank is always an estimate, never an observation
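
A minimal sketch of queue-position inference at a single price level. The pro-rata cancel split is an assumption made for illustration; a calibrated cancel hazard model would replace it. `QueueEstimate` and its methods are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class QueueEstimate:
    ahead: float   # estimated quantity queued ahead of our order
    behind: float  # estimated quantity queued behind our order

    def on_add(self, qty: float) -> None:
        """New resting quantity joins at the back of the queue."""
        self.behind += qty

    def on_trade(self, qty: float) -> float:
        """Trades consume the front of the queue first.
        Returns the quantity that reaches our order."""
        consumed = min(self.ahead, qty)
        self.ahead -= consumed
        return qty - consumed

    def on_cancel(self, qty: float) -> None:
        """ASSUMPTION: the feed does not say where a cancel sat, so it is
        split between ahead and behind in proportion to their sizes."""
        total = self.ahead + self.behind
        if total <= 0:
            return
        from_ahead = min(self.ahead, qty * self.ahead / total)
        self.ahead -= from_ahead
        self.behind = max(0.0, self.behind - (qty - from_ahead))
```

Usage: start with `ahead` set to the displayed size at the level when the order is placed and `behind` at zero, then feed level updates through the three handlers.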

L3 — Agent-Based Regime Simulator

What it is:

simulated LP/noise/informed flow agents
endogenous spread/imbalance/cancel bursts
explicit regime packs (calm/fragile/panic)

Good for:

stress testing policy brittleness
evaluating control-state transitions and hysteresis
understanding worst-case branch behavior (not point forecasts)

Anti-footgun:

do not trust absolute bps output
trust relative robustness ordering and failure signatures

L4 — Hybrid Digital Twin (Replay + Synthetic Shocks)

What it is:

real replay backbone + injected counterfactual shocks:
- quote fade bursts
- reject storms
- ack-delay pockets
- venue outage windows

Good for:

go/no-go for production rollout
incident-runbook rehearsal
verifying kill-switch and SAFE-mode behavior
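
Shock injection on top of a replay backbone can be sketched as time windows that modify the order path. The `Shock` type, the two shock kinds shown, and `apply_shocks` are hypothetical; quote fades and venue outages would need their own handlers.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Shock:
    kind: str         # "ack_delay" or "reject_storm"
    start_ns: int     # window start (inclusive)
    end_ns: int       # window end (exclusive)
    magnitude: float  # extra ack delay in ns, or reject probability

def apply_shocks(send_ts_ns: int, base_ack_delay_ns: int,
                 shocks: list, rng: random.Random) -> tuple:
    """Return (ack_ts_ns, rejected) for an order sent at send_ts_ns.
    Only shocks whose window contains the send time take effect."""
    delay = base_ack_delay_ns
    rejected = False
    for shock in shocks:
        if not (shock.start_ns <= send_ts_ns < shock.end_ns):
            continue
        if shock.kind == "ack_delay":
            delay += int(shock.magnitude)
        elif shock.kind == "reject_storm":
            rejected = rejected or rng.random() < shock.magnitude
    return send_ts_ns + delay, rejected
```

Keeping shocks as data rather than code paths makes it easy to record them in the run manifest and replay the same rehearsal later.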

Minimum metrics per ladder run

Track at least:

IS_bps_q50/q90/q95
completion_ratio
deadline_miss_rate
reject_retry_rate
cancel_replace_churn
panic_cross_share (share of filled quantity completed by crossing aggressively late in the schedule)

For stress runs also track:

max_drawdown_of_edge (execution-only edge erosion)
time_in_defensive_state
recovery_time_after_shock
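
A sketch of how the core metrics could be computed from per-parent-order results. The record fields, the nearest-rank quantile convention, and the reading of panic_cross_share as late aggressive quantity over filled quantity are assumptions for illustration.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ParentOrderResult:
    is_bps: float               # implementation shortfall in bps
    target_qty: float
    filled_qty: float
    late_aggressive_qty: float  # filled by crossing late in the schedule
    missed_deadline: bool

def nearest_rank_quantile(values: list, q: float) -> float:
    """Nearest-rank quantile; the round() guards against float noise."""
    ordered = sorted(values)
    rank = max(1, math.ceil(round(q * len(ordered), 9)))
    return ordered[rank - 1]

def summarize(results: list) -> dict:
    is_values = [r.is_bps for r in results]
    filled = sum(r.filled_qty for r in results)
    target = sum(r.target_qty for r in results)
    late = sum(r.late_aggressive_qty for r in results)
    return {
        "IS_bps_q50": nearest_rank_quantile(is_values, 0.50),
        "IS_bps_q90": nearest_rank_quantile(is_values, 0.90),
        "IS_bps_q95": nearest_rank_quantile(is_values, 0.95),
        "completion_ratio": filled / target,
        "deadline_miss_rate": sum(r.missed_deadline for r in results) / len(results),
        "panic_cross_share": late / filled if filled else 0.0,
    }
```

Whatever quantile convention is chosen, it should be fixed and recorded, since small samples make q95 sensitive to the interpolation rule.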

Promotion policy (practical)

A candidate moves up a tier only if all of the following hold:

Tail control: p95 slippage within budget in current ladder tier
Completion safety: completion ratio above floor
Operational stability: no runaway retry/cancel loops
Robustness: ranking remains acceptable across seed/regime perturbations

A candidate moves to limited live canary only if:

L3/L4 stress behavior is explainable
rollback conditions are pre-defined
kill-switch path is dry-run validated
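
The tier-promotion conditions can be encoded as an explicit gate so that a failed promotion always names its reason. This is a sketch: the `Budgets` fields, the metric keys (including `rank_instability`), and the threshold values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Budgets:
    max_is_bps_q95: float
    min_completion_ratio: float
    max_cancel_replace_churn: float
    max_reject_retry_rate: float
    max_rank_instability: float

def promotion_failures(metrics: dict, budgets: Budgets) -> list:
    """Return the names of failed checks; an empty list means promote."""
    checks = {
        "tail_control": metrics["IS_bps_q95"] <= budgets.max_is_bps_q95,
        "completion_safety": metrics["completion_ratio"] >= budgets.min_completion_ratio,
        "operational_stability": (
            metrics["cancel_replace_churn"] <= budgets.max_cancel_replace_churn
            and metrics["reject_retry_rate"] <= budgets.max_reject_retry_rate
        ),
        "robustness": metrics["rank_instability"] <= budgets.max_rank_instability,
    }
    return [name for name, passed in checks.items() if not passed]

# Illustrative numbers only.
budgets = Budgets(8.0, 0.95, 3.0, 0.02, 0.2)
candidate = {
    "IS_bps_q95": 9.5, "completion_ratio": 0.90,
    "cancel_replace_churn": 2.1, "reject_retry_rate": 0.01,
    "rank_instability": 0.1,
}
failed = promotion_failures(candidate, budgets)
```

The canary conditions (explainable stress behavior, pre-defined rollback, dry-run kill switch) are human sign-offs and are deliberately left out of the automated gate.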

Data contract checklist

Without these, simulator fidelity claims are mostly theater:

event-time normalized logs (decision/send/ack/fill/cancel)
venue and tactic identifiers preserved end-to-end
point-in-time reference prices (no hindsight corrections)
explicit partial-fill and no-fill outcome records
deterministic run manifest (code hash, config hash, data snapshot hash)
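
A minimal sketch of a deterministic run manifest using only the standard library. The field names and the choice of SHA-256 over canonical JSON are assumptions; the point is that identical inputs must always yield an identical manifest ID.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_manifest(code_hash: str, config: dict,
                   data_snapshot_hash: str, seed: int) -> dict:
    """Hash the config in canonical form (sorted keys, fixed separators)
    so key order and whitespace cannot change the result."""
    config_bytes = json.dumps(
        config, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    manifest = {
        "code_hash": code_hash,
        "config_hash": sha256_hex(config_bytes),
        "data_snapshot_hash": data_snapshot_hash,
        "seed": seed,
    }
    # The manifest ID covers every field above, including the seed.
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["manifest_id"] = sha256_hex(manifest_bytes)
    return manifest
```

Storing the manifest next to each run's metrics lets a surprising result be rerun from exactly the same code, config, data, and seed.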

Common failure modes

Single-regime overfitting
- Fix: force a calm/fragile/panic scenario suite in every promotion cycle.
Mean-only validation
- Fix: gate on q95/CVaR-style tails, not just average bps.
Ignoring no-fill branches
- Fix: include opportunity cost and deadline penalties explicitly.
Unreproducible simulator runs
- Fix: seed control + immutable manifests.
Policy complexity outpacing observability
- Fix: keep action space small until diagnostics are reliable.
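
To illustrate why mean-only validation misleads, here is a small sketch of an empirical CVaR (expected shortfall) on slippage samples. The tail-count convention is one common choice among several, and the sample data is made up for the example.

```python
import math

def cvar(values: list, alpha: float = 0.95) -> float:
    """Mean of the worst (1 - alpha) share of outcomes, where larger
    values are worse (e.g. slippage in bps). Uses at least one sample."""
    ordered = sorted(values)
    n = len(ordered)
    tail_count = max(1, n - math.floor(round(alpha * n, 9)))
    tail = ordered[-tail_count:]
    return sum(tail) / len(tail)

# Two made-up slippage samples with the same mean but different tails.
steady = [2.0] * 20
spiky = [1.0] * 19 + [21.0]
same_mean = sum(steady) / len(steady) == sum(spiky) / len(spiky)
steady_tail = cvar(steady, 0.95)
spiky_tail = cvar(spiky, 0.95)
```

Both samples average 2 bps, yet the tail measure separates them clearly, which is the behavior a promotion gate needs.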

30-day rollout template

Week 1:

baseline L1 replay with latency injection
define hard KPI budgets and rollback thresholds

Week 2:

add L2 queue-aware features for key symbols/venues
calibrate queue/cancel hazards

Week 3:

run L3 stress packs and classify failure signatures
tune hysteresis and defensive-state transitions

Week 4:

L4 hybrid shock rehearsals
canary candidate selection + incident playbook lock-in

Bottom line

Execution simulation should evolve like safety engineering:

start simple,
add realism where errors concentrate,
gate promotions by tail-risk behavior,
keep everything reproducible.

If your simulator cannot predict every fill, that is fine. If it cannot expose brittle behavior before live capital does, it is not doing its job.

References (starting points)

ABIDES: Agent-Based Interactive Discrete Event Simulation (market simulation framework)
https://github.com/abides-sim/abides
LOBSTER data overview (event-level order book research data)
https://lobsterdata.com/info/home
Cartea, Jaimungal, Penalva — Algorithmic and High-Frequency Trading (microstructure & execution foundations)
https://www.cambridge.org/core/books/algorithmic-and-highfrequency-trading/37A2CBB8F3F8E5A4F6B4B53E5EAED7E5
Kissell — The Science of Algorithmic Trading and Portfolio Management (execution/TCA practice)
https://www.elsevier.com/books/the-science-of-algorithmic-trading-and-portfolio-management/kissell/978-0-12-401689-7