Counterfactual Execution Replay + Off-Policy Slippage Evaluation Playbook
Date: 2026-02-25
Category: research (quant execution)
TL;DR
Before promoting a new execution policy, run a counterfactual replay lab instead of trusting average backtest slippage.
Log decision-time context + action propensities, then estimate how a candidate policy would have performed using off-policy evaluation (IPS/SNIPS/DR/FQE). Gate rollout with tail metrics (q95/q99, CVaR), not just mean bps.
1) Why this matters in real trading
Execution policies usually fail in production for one of three reasons:
- They were tuned on price-only replay without realistic fill dynamics.
- They looked good on mean slippage but worsened tail events.
- They leaked alpha during stressed liquidity regimes.
A replay + OPE pipeline helps answer a practical question:
“If we had used the new policy last week under the same market states, what would slippage distribution and breach risk look like?”
2) Core setup: behavior policy vs candidate policy
Let:
- (\pi_b(a\mid x)): behavior policy (what live system actually did)
- (\pi_e(a\mid x)): candidate evaluation policy
- (x_t): market + inventory + schedule context at decision time
- (a_t): chosen action (passive/aggressive level, slice size, price offset)
- (r_t): reward (negative execution cost) or directly slippage cost
Goal: estimate candidate value
[ V(\pi_e)=\mathbb{E}_{\pi_e}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] ]
from logs generated under (\pi_b), without going live.
3) Data contract (what must be logged)
At every child-order decision event, log:
- timestamps (exchange + local receive + local send)
- top-of-book + depth slices + imbalance + spread state
- queue/latency features (estimated queue ahead, ack lag, cancel bursts)
- parent context (remaining qty, urgency, time-to-close, budget burn)
- action chosen (a_t)
- behavior propensity (\pi_b(a_t\mid x_t)) (or score convertible to calibrated probability)
- realized outcomes: fill ratio, fill delay, post-fill markout, final benchmark slippage
If behavior propensities are missing, or logged only as uncalibrated scores, OPE quality collapses.
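The data contract above can be sketched as a single log record. Field names here are illustrative, not a fixed schema; the only hard requirements from the contract are the action, its behavior propensity, and the realized outcomes.

```python
from dataclasses import dataclass

@dataclass
class DecisionEvent:
    """One child-order decision event; names are hypothetical examples."""
    ts_exchange_ns: int      # exchange timestamp (event-time priority)
    ts_recv_ns: int          # local receive timestamp
    ts_send_ns: int          # local send timestamp
    features: dict           # book state, queue/latency, parent context
    action: str              # e.g. "passive_l1", "cross_spread"
    propensity: float        # pi_b(a_t | x_t), a calibrated probability
    fill_ratio: float = 0.0
    fill_delay_ms: float = 0.0
    markout_bps: float = 0.0
    slippage_bps: float = 0.0
```

Keeping the record flat and append-only makes the later feature-reconstruction and checksum steps straightforward.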
4) Slippage objective definition first (non-negotiable)
Pick one primary benchmark and keep it stable:
- Arrival shortfall [ \text{IS}_t = \text{side}\cdot(\bar p_t - p^{arrival}_t) ], where (\bar p_t) is the average fill price
- Decision-to-fill shortfall
- Schedule-relative shortfall (VWAP/POV target)
Then decompose cost:
[ C_t = C_t^{spread} + C_t^{impact} + C_t^{delay} + C_t^{adverse\ markout} ]
The candidate should improve either total (C_t) or tail of (C_t) without violating risk limits.
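As a minimal sketch of the arrival-shortfall definition above (function name and bps scaling are my choices, not part of the playbook):

```python
def arrival_shortfall_bps(side: int, avg_fill_px: float, arrival_px: float) -> float:
    """Implementation shortfall vs the arrival price, in basis points.
    side = +1 for buys, -1 for sells; a positive result is a cost."""
    return side * (avg_fill_px - arrival_px) / arrival_px * 1e4
```

For example, buying at an average fill of 100.05 against an arrival price of 100.00 is a 5 bps cost, and selling at 99.95 against the same arrival is also a 5 bps cost.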
5) OPE estimators for execution
5.1 IPS (Inverse Propensity Scoring)
[ \hat V_{IPS} = \frac{1}{N}\sum_{i=1}^{N} w_i R_i, \quad w_i = \prod_t \frac{\pi_e(a_{i,t}\mid x_{i,t})}{\pi_b(a_{i,t}\mid x_{i,t})} ]
Pros: unbiased (if overlap + correct propensities).
Cons: huge variance for long horizons / tiny (\pi_b).
5.2 SNIPS (Self-Normalized IPS)
[ \hat V_{SNIPS} = \frac{\sum_i w_i R_i}{\sum_i w_i} ]
Lower variance, small bias. Usually better operationally.
5.3 Doubly Robust (DR)
[ \hat V_{DR}=\frac{1}{N}\sum_i\left[\hat Q(x_i,\pi_e) + w_i\big(R_i-\hat Q(x_i,a_i)\big)\right] ]
Blends direct model + importance correction; robust if one side is decent.
5.4 FQE (Fitted Q Evaluation)
Learn (Q^{\pi_e}(x,a)) via temporal-difference style regression on logged data, then evaluate policy value from initial states. Helpful when action space is richer than binary passive/aggressive.
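The three closed-form estimators above (IPS, SNIPS, DR) reduce to a few lines given per-episode importance weights and returns. This is a sketch under simplifying assumptions: weights are already computed per episode, and the DR inputs come from some fitted Q-model (not shown) evaluated under the candidate policy and at the logged actions.

```python
import numpy as np

def ips_snips_dr(w, r, q_pi_e, q_logged):
    """w: per-episode importance weights; r: logged returns;
    q_pi_e: fitted Q evaluated under pi_e; q_logged: fitted Q at
    logged actions. All inputs are illustrative arrays."""
    w = np.asarray(w, float)
    r = np.asarray(r, float)
    ips = float(np.mean(w * r))                      # unbiased, high variance
    snips = float(np.sum(w * r) / np.sum(w))         # self-normalized
    dr = float(np.mean(np.asarray(q_pi_e, float)
                       + w * (r - np.asarray(q_logged, float))))
    return ips, snips, dr
```

Note how SNIPS and IPS diverge as soon as weights are unbalanced, which is exactly what the overlap diagnostics in the next section are meant to catch.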
6) Overlap diagnostics (ship-stop criterion)
Before trusting any estimate, check:
- effective sample size (ESS)
- weight concentration (top-k weight share)
- action support coverage by regime (open/close, low/high vol, spread buckets)
Useful thresholds (example):
- ESS / N < 0.1 => “low-confidence” label
- max weight > 50x median => clipping or reject run
- uncovered regime share > 5% notional => no promotion
Weight clipping example:
[ \tilde w_i = \min(w_i, c) ]
Report sensitivity across clipping caps (c\in\{5,10,20,50\}).
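The ESS check and the clipping sensitivity sweep can be sketched as follows (helper names are mine; the ESS formula is the standard Kish effective sample size):

```python
import numpy as np

def ess(w):
    """Kish effective sample size of importance weights:
    (sum w)^2 / sum(w^2). Compare ess(w) / len(w) against the 0.1 gate."""
    w = np.asarray(w, float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def clipped_snips(w, r, cap):
    """SNIPS value after clipping weights at `cap`."""
    wc = np.minimum(np.asarray(w, float), cap)
    return float(np.sum(wc * np.asarray(r, float)) / np.sum(wc))

def clipping_sensitivity(w, r, caps=(5, 10, 20, 50)):
    """Estimate vs cap; a large spread means the result is weight-driven."""
    return {c: clipped_snips(w, r, c) for c in caps}
```

If the estimate moves materially across caps, the run should carry the "low-confidence" label regardless of the headline number.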
7) Tail-first evaluation metrics
Do not optimize only (\mathbb{E}[C]). Track:
- mean / median slippage (bps)
- q90 / q95 / q99 slippage
- (\mathrm{CVaR}_{95}) (expected slippage beyond the 95th percentile)
- breach rate vs symbol-level budget
- underfill ratio + forced catch-up aggression frequency
Candidate promotion should require:
- non-inferior mean slippage,
- improved q95 or (\mathrm{CVaR}_{95}), and
- no increase in hard-risk breaches.
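The quantile and CVaR metrics above are cheap to compute from a slippage sample; a minimal sketch (function name and dict keys are illustrative):

```python
import numpy as np

def tail_report(slippage_bps):
    """Mean, q95/q99, and CVaR_95 (mean of the sample at or beyond the
    95th percentile) of a slippage sample in bps; positive = cost."""
    s = np.asarray(slippage_bps, float)
    q95 = np.quantile(s, 0.95)
    tail = s[s >= q95]
    return {
        "mean": float(s.mean()),
        "q95": float(q95),
        "q99": float(np.quantile(s, 0.99)),
        "cvar95": float(tail.mean()),
    }
```

Reporting all four side by side makes the "wins on average, loses in the tail" failure mode visible immediately.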
8) Counterfactual replay architecture (Vellab-friendly)
8.1 Components
- event-log-writer: append-only decision/fill stream
- feature-reconstructor: deterministic feature rebuild from raw events
- policy-simulator: runs candidate (\pi_e) on historical states
- ope-engine: IPS/SNIPS/DR/FQE + uncertainty bands
- tail-risk-gate: rollout go/no-go rules
8.2 Determinism requirements
- fixed seeds for stochastic policies
- versioned feature schema + model artifact hash
- strict event-time ordering (exchange timestamp priority; local time secondary)
- replay checksums for reproducibility
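One way to implement the replay checksum requirement is a running hash over the ordered event stream, so any change in content or ordering changes the digest. This is a sketch under the assumption that events are JSON-serializable dicts:

```python
import hashlib
import json

def replay_checksum(events):
    """SHA-256 over an ordered event stream. sort_keys makes the
    serialization deterministic for dicts with identical content."""
    h = hashlib.sha256()
    for ev in events:
        h.update(json.dumps(ev, sort_keys=True).encode())
    return h.hexdigest()
```

Two replays that disagree on this digest have diverged somewhere, and the bisection can start from the raw events rather than the derived features.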
9) KIS/KRX practical notes
For KIS live integration, add strict guardrails:
- session-open and closing-auction windows evaluated separately
- VI (Volatility Interruption) and quote-halting intervals isolated from normal-regime OPE
- API throttle/latency stress periods tagged in logs to avoid mixing operational artifacts with policy quality
KRX-specific microstructure events can dominate slippage tails; treat them as first-class regimes, not outliers to drop.
10) Rollout ladder
- Offline replay + OPE pass (multi-week, multiple regimes)
- Paper/live-shadow mode (decision logging, no actual routing)
- 5% notional canary with hard kill-switch
- 20% notional with hourly drift checks
- Full rollout only after stable tail metrics
Kill-switch triggers (example):
- q95 slippage deterioration > +15% vs baseline for 2 consecutive windows
- (\mathrm{CVaR}_{95}) breach over budget
- fill-failure cluster above threshold in high-urgency buckets
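The first two example triggers above are simple enough to encode directly; a sketch with illustrative thresholds (the 15% deterioration and two-window rule come from the list, everything else is an assumption):

```python
def kill_switch(q95_candidate, q95_baseline, consecutive_bad_windows,
                cvar95, cvar_budget,
                max_deterioration=0.15, min_windows=2):
    """True if the canary should be killed. Thresholds are examples,
    not production values."""
    q95_breach = (q95_candidate > q95_baseline * (1 + max_deterioration)
                  and consecutive_bad_windows >= min_windows)
    cvar_breach = cvar95 > cvar_budget
    return q95_breach or cvar_breach
```

The fill-failure-cluster trigger needs urgency-bucketed counts and is omitted here.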
11) Common failure modes
- Propensity miscalibration: policy logs scores but not calibrated probabilities.
- State leakage: replay accidentally uses future info in feature reconstruction.
- Benchmark drift: comparing arrival shortfall this week vs VWAP shortfall last week.
- Mean-only reporting: candidate “wins” by average while blowing up tails.
- No regime slicing: gains in calm periods hide losses at open/close.
12) Minimal implementation checklist
- Decision log includes (x_t, a_t, \pi_b(a_t\mid x_t), r_t)
- Feature schema versioned and reproducible
- OPE engine supports IPS/SNIPS/DR with clipping analysis
- Tail metrics (q95/q99/CVaR) in default report
- Regime-sliced evaluation (vol, spread, session phase, VI)
- Promotion gate encoded as machine-checkable rule
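The promotion gate from section 7 can be encoded as a machine-checkable rule along these lines (dict keys and the tolerance parameter are illustrative):

```python
def promotion_gate(cand, base, mean_tol_bps=0.0):
    """Promote only if: non-inferior mean slippage, improved q95 or
    CVaR_95, and no increase in hard-risk breaches.
    cand/base: dicts with keys mean, q95, cvar95, breaches (all costs)."""
    non_inferior_mean = cand["mean"] <= base["mean"] + mean_tol_bps
    tail_improved = (cand["q95"] < base["q95"]
                     or cand["cvar95"] < base["cvar95"])
    no_new_breaches = cand["breaches"] <= base["breaches"]
    return non_inferior_mean and tail_improved and no_new_breaches
```

Running this rule per regime slice (not just on the pooled sample) closes the "calm-period gains hide open/close losses" failure mode listed above.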
References / Further Reading
- Moallemi & Yuan, A Model for Queue Position Valuation in a Limit Order Book (queue value / execution context)
- Huang, Lehalle & Rosenbaum, Simulating and Analyzing Order Book Data: The Queue-Reactive Model (LOB dynamics simulation)
- Jusselin et al., Propagators: Transient vs History Dependent Impact (impact modeling)
- Dudík et al., Doubly Robust Policy Evaluation (OPE foundations)
Use this playbook to move execution research from “promising chart” to “risk-aware deployability.”