Counterfactual Execution Replay + Off-Policy Slippage Evaluation Playbook
Date: 2026-02-25
Category: research (quant execution)
TL;DR
Before promoting a new execution policy, run a counterfactual replay lab instead of trusting average backtest slippage.
Log decision-time context + action propensities, then estimate how a candidate policy would have performed using off-policy evaluation (IPS/SNIPS/DR/FQE). Gate rollout with tail metrics (q95/q99, CVaR), not just mean bps.
1) Why this matters in real trading
Execution policies usually fail in production for one of three reasons:
- They were tuned on price-only replay without realistic fill dynamics.
- They looked good on mean slippage but worsened tail events.
- They leaked alpha during stressed liquidity regimes.
A replay + OPE pipeline helps answer a practical question:
“If we had used the new policy last week under the same market states, what would slippage distribution and breach risk look like?”
2) Core setup: behavior policy vs candidate policy
Let:
- (\pi_b(a\mid x)): behavior policy (what live system actually did)
- (\pi_e(a\mid x)): candidate evaluation policy
- (x_t): market + inventory + schedule context at decision time
- (a_t): chosen action (passive/aggressive level, slice size, price offset)
- (r_t): reward (negative execution cost) or directly slippage cost
Goal: estimate candidate value
[ V(\pi_e)=\mathbb{E}_{\pi_e}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] ]
from logs generated under (\pi_b), without going live.
3) Data contract (what must be logged)
At every child-order decision event, log:
- timestamps (exchange + local receive + local send)
- top-of-book + depth slices + imbalance + spread state
- queue/latency features (estimated queue ahead, ack lag, cancel bursts)
- parent context (remaining qty, urgency, time-to-close, budget burn)
- action chosen (a_t)
- behavior propensity (\pi_b(a_t\mid x_t)) (or score convertible to calibrated probability)
- realized outcomes: fill ratio, fill delay, post-fill markout, final benchmark slippage
If behavior propensities are missing, or logged only as uncalibrated scores, OPE quality collapses.
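The data contract above can be sketched as a single log record. Field names here are illustrative, not a fixed schema; the only hard requirements from the contract are the action, its behavior propensity, and the realized outcomes.

```python
from dataclasses import dataclass

@dataclass
class DecisionEvent:
    """One child-order decision event; names are hypothetical examples."""
    ts_exchange_ns: int      # exchange timestamp (event-time priority)
    ts_recv_ns: int          # local receive timestamp
    ts_send_ns: int          # local send timestamp
    features: dict           # book state, queue/latency, parent context
    action: str              # e.g. "passive_l1", "cross_spread"
    propensity: float        # pi_b(a_t | x_t), a calibrated probability
    fill_ratio: float = 0.0
    fill_delay_ms: float = 0.0
    markout_bps: float = 0.0
    slippage_bps: float = 0.0
```

Keeping the record flat and append-only makes the later feature-reconstruction and checksum steps straightforward.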
4) Slippage objective definition first (non-negotiable)
Pick one primary benchmark and keep it stable:
- Arrival shortfall [ \text{IS}_t = \text{side}\cdot(\bar p_t - p^{arrival}_t) ], where (\bar p_t) is the average fill price
- Decision-to-fill shortfall
- Schedule-relative shortfall (VWAP/POV target)
Then decompose cost:
[ C_t = C_t^{spread} + C_t^{impact} + C_t^{delay} + C_t^{adverse\ markout} ]
The candidate should improve either total (C_t) or tail of (C_t) without violating risk limits.
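As a minimal sketch of the arrival-shortfall definition above (function name and bps scaling are my choices, not part of the playbook):

```python
def arrival_shortfall_bps(side: int, avg_fill_px: float, arrival_px: float) -> float:
    """Implementation shortfall vs the arrival price, in basis points.
    side = +1 for buys, -1 for sells; a positive result is a cost."""
    return side * (avg_fill_px - arrival_px) / arrival_px * 1e4
```

For example, buying at an average fill of 100.05 against an arrival price of 100.00 is a 5 bps cost, and selling at 99.95 against the same arrival is also a 5 bps cost.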
5) OPE estimators for execution
5.1 IPS (Inverse Propensity Scoring)
[ \hat V_{IPS} = \frac{1}{N}\sum_{i=1}^{N} w_i R_i, \quad w_i = \prod_t \frac{\pi_e(a_{i,t}\mid x_{i,t})}{\pi_b(a_{i,t}\mid x_{i,t})} ]
Pros: unbiased (if overlap + correct propensities).
Cons: huge variance for long horizons / tiny (\pi_b).
5.2 SNIPS (Self-Normalized IPS)
[ \hat V_{SNIPS} = \frac{\sum_i w_i R_i}{\sum_i w_i} ]
Lower variance, small bias. Usually better operationally.
5.3 Doubly Robust (DR)
[ \hat V_{DR}=\frac{1}{N}\sum_i\left[\hat Q(x_i,\pi_e) + w_i\big(R_i-\hat Q(x_i,a_i)\big)\right] ]
Blends direct model + importance correction; robust if one side is decent.
5.4 FQE (Fitted Q Evaluation)
Learn (Q^{\pi_e}(x,a)) via temporal-difference style regression on logged data, then evaluate policy value from initial states. Helpful when action space is richer than binary passive/aggressive.
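The three closed-form estimators above (IPS, SNIPS, DR) reduce to a few lines given per-episode importance weights and returns. This is a sketch under simplifying assumptions: weights are already computed per episode, and the DR inputs come from some fitted Q-model (not shown) evaluated under the candidate policy and at the logged actions.

```python
import numpy as np

def ips_snips_dr(w, r, q_pi_e, q_logged):
    """w: per-episode importance weights; r: logged returns;
    q_pi_e: fitted Q evaluated under pi_e; q_logged: fitted Q at
    logged actions. All inputs are illustrative arrays."""
    w = np.asarray(w, float)
    r = np.asarray(r, float)
    ips = float(np.mean(w * r))                      # unbiased, high variance
    snips = float(np.sum(w * r) / np.sum(w))         # self-normalized
    dr = float(np.mean(np.asarray(q_pi_e, float)
                       + w * (r - np.asarray(q_logged, float))))
    return ips, snips, dr
```

Note how SNIPS and IPS diverge as soon as weights are unbalanced, which is exactly what the overlap diagnostics in the next section are meant to catch.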
6) Overlap diagnostics (ship-stop criterion)
Before trusting any estimate, check:
- effective sample size (ESS)
- weight concentration (top-k weight share)
- action support coverage by regime (open/close, low/high vol, spread buckets)
Useful thresholds (example):
- ESS / N < 0.1 => “low-confidence” label
- max weight > 50x median => clipping or reject run
- uncovered regime share > 5% notional => no promotion
Weight clipping example:
[ \tilde w_i = \min(w_i, c) ]
Report sensitivity across clipping caps (c\in\{5,10,20,50\}).
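The ESS check and the clipping sensitivity sweep can be sketched as follows (helper names are mine; the ESS formula is the standard Kish effective sample size):

```python
import numpy as np

def ess(w):
    """Kish effective sample size of importance weights:
    (sum w)^2 / sum(w^2). Compare ess(w) / len(w) against the 0.1 gate."""
    w = np.asarray(w, float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def clipped_snips(w, r, cap):
    """SNIPS value after clipping weights at `cap`."""
    wc = np.minimum(np.asarray(w, float), cap)
    return float(np.sum(wc * np.asarray(r, float)) / np.sum(wc))

def clipping_sensitivity(w, r, caps=(5, 10, 20, 50)):
    """Estimate vs cap; a large spread means the result is weight-driven."""
    return {c: clipped_snips(w, r, c) for c in caps}
```

If the estimate moves materially across caps, the run should carry the "low-confidence" label regardless of the headline number.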
7) Tail-first evaluation metrics
Do not optimize only (\mathbb{E}[C]). Track:
- mean / median slippage (bps)
- q90 / q95 / q99 slippage
- (\mathrm{CVaR}_{95}) (expected slippage beyond the 95th percentile)
- breach rate vs symbol-level budget
- underfill ratio + forced catch-up aggression frequency
Candidate promotion should require:
- non-inferior mean slippage,
- improved q95 or (\mathrm{CVaR}_{95}), and
- no increase in hard-risk breaches.
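The quantile and CVaR metrics above are cheap to compute from a slippage sample; a minimal sketch (function name and dict keys are illustrative):

```python
import numpy as np

def tail_report(slippage_bps):
    """Mean, q95/q99, and CVaR_95 (mean of the sample at or beyond the
    95th percentile) of a slippage sample in bps; positive = cost."""
    s = np.asarray(slippage_bps, float)
    q95 = np.quantile(s, 0.95)
    tail = s[s >= q95]
    return {
        "mean": float(s.mean()),
        "q95": float(q95),
        "q99": float(np.quantile(s, 0.99)),
        "cvar95": float(tail.mean()),
    }
```

Reporting all four side by side makes the "wins on average, loses in the tail" failure mode visible immediately.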
8) Counterfactual replay architecture (Vellab-friendly)
8.1 Components
- event-log-writer: append-only decision/fill stream
- feature-reconstructor: deterministic feature rebuild from raw events
- policy-simulator: runs candidate (\pi_e) on historical states
- ope-engine: IPS/SNIPS/DR/FQE + uncertainty bands
- tail-risk-gate: rollout go/no-go rules
8.2 Determinism requirements
- fixed seeds for stochastic policies
- versioned feature schema + model artifact hash
- strict event-time ordering (exchange timestamp priority; local time secondary)
- replay checksums for reproducibility
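One way to implement the replay checksum requirement is a running hash over the ordered event stream, so any change in content or ordering changes the digest. This is a sketch under the assumption that events are JSON-serializable dicts:

```python
import hashlib
import json

def replay_checksum(events):
    """SHA-256 over an ordered event stream. sort_keys makes the
    serialization deterministic for dicts with identical content."""
    h = hashlib.sha256()
    for ev in events:
        h.update(json.dumps(ev, sort_keys=True).encode())
    return h.hexdigest()
```

Two replays that disagree on this digest have diverged somewhere, and the bisection can start from the raw events rather than the derived features.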
9) KIS/KRX practical notes
For KIS live integration, add strict guardrails:
- session-open and closing-auction windows evaluated separately
- VI (Volatility Interruption) and quote-halting intervals isolated from normal-regime OPE
- API throttle/latency stress periods tagged in logs to avoid mixing operational artifacts with policy quality
KRX-specific microstructure events can dominate slippage tails; treat them as first-class regimes, not outliers to drop.
10) Rollout ladder
- Offline replay + OPE pass (multi-week, multiple regimes)
- Paper/live-shadow mode (decision logging, no actual routing)
- 5% notional canary with hard kill-switch
- 20% notional with hourly drift checks
- Full rollout only after stable tail metrics
Kill-switch triggers (example):
- q95 slippage deterioration > +15% vs baseline for 2 consecutive windows
- (\mathrm{CVaR}_{95}) breach over budget
- fill-failure cluster above threshold in high-urgency buckets
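The first two example triggers above are simple enough to encode directly; a sketch with illustrative thresholds (the 15% deterioration and two-window rule come from the list, everything else is an assumption):

```python
def kill_switch(q95_candidate, q95_baseline, consecutive_bad_windows,
                cvar95, cvar_budget,
                max_deterioration=0.15, min_windows=2):
    """True if the canary should be killed. Thresholds are examples,
    not production values."""
    q95_breach = (q95_candidate > q95_baseline * (1 + max_deterioration)
                  and consecutive_bad_windows >= min_windows)
    cvar_breach = cvar95 > cvar_budget
    return q95_breach or cvar_breach
```

The fill-failure-cluster trigger needs urgency-bucketed counts and is omitted here.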
11) Common failure modes
- Propensity miscalibration: policy logs scores but not calibrated probabilities.
- State leakage: replay accidentally uses future info in feature reconstruction.
- Benchmark drift: comparing arrival shortfall this week vs VWAP shortfall last week.
- Mean-only reporting: candidate “wins” by average while blowing up tails.
- No regime slicing: gains in calm periods hide losses at open/close.
12) Minimal implementation checklist
- Decision log includes (x_t, a_t, \pi_b(a_t\mid x_t), r_t)
- Feature schema versioned and reproducible
- OPE engine supports IPS/SNIPS/DR with clipping analysis
- Tail metrics (q95/q99/CVaR) in default report
- Regime-sliced evaluation (vol, spread, session phase, VI)
- Promotion gate encoded as machine-checkable rule
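The promotion gate from section 7 can be encoded as a machine-checkable rule along these lines (dict keys and the tolerance parameter are illustrative):

```python
def promotion_gate(cand, base, mean_tol_bps=0.0):
    """Promote only if: non-inferior mean slippage, improved q95 or
    CVaR_95, and no increase in hard-risk breaches.
    cand/base: dicts with keys mean, q95, cvar95, breaches (all costs)."""
    non_inferior_mean = cand["mean"] <= base["mean"] + mean_tol_bps
    tail_improved = (cand["q95"] < base["q95"]
                     or cand["cvar95"] < base["cvar95"])
    no_new_breaches = cand["breaches"] <= base["breaches"]
    return non_inferior_mean and tail_improved and no_new_breaches
```

Running this rule per regime slice (not just on the pooled sample) closes the "calm-period gains hide open/close losses" failure mode listed above.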
References / Further Reading
- Moallemi & Yuan, A Model for Queue Position Valuation in a Limit Order Book (queue value / execution context)
- Huang, Lehalle & Rosenbaum, Simulating and Analyzing Order Book Data: The Queue-Reactive Model (LOB dynamics simulation)
- Jusselin et al., Propagators: Transient vs History Dependent Impact (impact modeling)
- Dudík et al., Doubly Robust Policy Evaluation (OPE foundations)
Use this playbook to move execution research from “promising chart” to “risk-aware deployability.”