Counterfactual Execution Replay + Off-Policy Slippage Evaluation Playbook

2026-02-25 · finance

Category: research (quant execution)

TL;DR

Before promoting a new execution policy, run a counterfactual replay lab instead of trusting average backtest slippage.

Log decision-time context + action propensities, then estimate how a candidate policy would have performed using off-policy evaluation (IPS/SNIPS/DR/FQE). Gate rollout with tail metrics (q95/q99, CVaR), not just mean bps.


1) Why this matters in real trading

Execution policies usually fail in production for one of three reasons:

  1. They were tuned on price-only replay without realistic fill dynamics.
  2. They looked good on mean slippage but worsened tail events.
  3. They leaked alpha during stressed liquidity regimes.

A replay + OPE pipeline helps answer a practical question:

“If we had used the new policy last week under the same market states, what would slippage distribution and breach risk look like?”


2) Core setup: behavior policy vs candidate policy

Let:

- (\pi_b): the behavior (logging) policy that actually generated the historical decisions,
- (\pi_e): the candidate (evaluation) policy we want to assess,
- (x_t, a_t, r_t): decision-time context, chosen action, and per-step reward (negative realized cost) at step (t).

Goal: estimate the candidate's value

[ V(\pi_e)=\mathbb{E}_{\pi_e}\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] ]

from logs generated under (\pi_b), without going live.
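
To make the objective concrete, here is a minimal sketch (function name is illustrative) of the per-episode discounted return whose expectation the estimators below approximate:

```python
def discounted_return(rewards, gamma=1.0):
    """Compute sum over t of gamma^(t-1) * r_t for one logged episode."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total

# Per-step rewards are negative realized costs in bps; gamma = 1 is common
# for short execution horizons.
episode_value = discounted_return([-1.0, -0.5, 2.0], gamma=1.0)
```

(V(\pi_e)) is the average of this quantity over episodes that (\pi_e) would generate; OPE approximates that average from (\pi_b)'s logs.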


3) Data contract (what must be logged)

At every child-order decision event, log:

- decision timestamp and instrument/order identifiers,
- decision-time context (x_t): e.g., spread, depth/imbalance, volatility, queue position, remaining quantity,
- the chosen action (a_t) (e.g., passive vs. aggressive, venue, limit offset),
- the behavior propensity (\pi_b(a_t\mid x_t)) actually used at decision time,
- the realized per-step reward (r_t) after fills and markouts.

If the propensity is missing, OPE quality collapses: importance weights cannot be formed at all, and reconstructing propensities after the fact introduces bias.
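
A minimal sketch of the per-event record (field names are assumptions, not a fixed schema):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecisionLogRecord:
    """One child-order decision event; field names are illustrative."""
    ts_ns: int          # decision timestamp (nanoseconds)
    state: dict         # decision-time context: spread, imbalance, queue, ...
    action: str         # chosen action, e.g. "passive" / "aggressive"
    propensity_b: float # pi_b(action | state) -- required for OPE
    reward: float       # realized per-step reward (negative slippage, bps)

rec = DecisionLogRecord(ts_ns=1, state={"spread_bps": 2.1}, action="passive",
                        propensity_b=0.7, reward=-0.4)
```

Validating `0 < propensity_b <= 1` at write time is cheap insurance against the missing-propensity failure mode.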


4) Slippage objective definition first (non-negotiable)

Pick one primary benchmark and keep it stable, e.g.:

- arrival mid at parent-order decision time (implementation shortfall), or
- interval VWAP over the execution window.
Then decompose cost:

[ C_t = C_t^{spread} + C_t^{impact} + C_t^{delay} + C_t^{adverse\ markout} ]

The candidate should improve either the total (C_t) or the tail of (C_t) without violating risk limits.
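
For illustration, a signed arrival-slippage helper and the decomposition sum (helper names are hypothetical):

```python
def slippage_bps(side, fill_px, bench_px):
    """Signed slippage vs. a fixed benchmark, in basis points; positive = cost."""
    sgn = 1.0 if side == "buy" else -1.0
    return sgn * (fill_px - bench_px) / bench_px * 1e4

def total_cost_bps(spread, impact, delay, adverse_markout):
    """C_t as the sum of its decomposition terms, all in bps."""
    return spread + impact + delay + adverse_markout

# Buying at 100.05 against a 100.00 arrival benchmark costs 5 bps.
cost = slippage_bps("buy", 100.05, 100.00)
```

The sign convention (positive = cost for both sides) keeps buy and sell orders comparable in one distribution.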


5) OPE estimators for execution

5.1 IPS (Inverse Propensity Scoring)

[ \hat V_{IPS} = \frac{1}{N}\sum_{i=1}^{N} w_i R_i, \quad w_i = \prod_t \frac{\pi_e(a_{i,t}\mid x_{i,t})}{\pi_b(a_{i,t}\mid x_{i,t})} ]

Pros: unbiased (given overlap and correct propensities).
Cons: huge variance for long horizons or when behavior propensities (\pi_b(a\mid x)) are tiny.

5.2 SNIPS (Self-Normalized IPS)

[ \hat V_{SNIPS} = \frac{\sum_i w_i R_i}{\sum_i w_i} ]

Lower variance, small bias. Usually better operationally.
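
Given per-episode weights (w_i) and returns (R_i), both estimators reduce to a few lines (a sketch):

```python
import numpy as np

def ips_snips(w, R):
    """IPS and SNIPS value estimates from importance weights w and returns R."""
    w = np.asarray(w, dtype=float)
    R = np.asarray(R, dtype=float)
    v_ips = np.mean(w * R)              # unbiased, high variance
    v_snips = np.sum(w * R) / np.sum(w) # self-normalized: lower variance, small bias
    return v_ips, v_snips

# Illustrative logs: returns are negative costs (bps); one tiny-weight episode.
v_ips, v_snips = ips_snips([0.5, 2.0, 1.0, 0.1], [-1.0, -0.5, -2.0, -10.0])
```

Note how the rarely-matched, very bad episode (weight 0.1) is down-weighted by both estimators; with a *large* weight on a bad episode, IPS and SNIPS can diverge sharply, which is exactly what the overlap diagnostics below detect.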

5.3 Doubly Robust (DR)

[ \hat V_{DR}=\frac{1}{N}\sum_i\left[\hat Q(x_i,\pi_e) + w_i\big(R_i-\hat Q(x_i,a_i)\big)\right] ]

Blends direct model + importance correction; robust if one side is decent.
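
A sketch of the DR combination, assuming per-episode weights and a fitted (\hat Q) are already available (names illustrative):

```python
import numpy as np

def dr_estimate(w, R, q_pi_e, q_logged):
    """Doubly robust value: direct-model term plus importance-weighted residual.

    q_pi_e[i]   ~ E_{a ~ pi_e}[Q_hat(x_i, a)]  (model's value under the candidate)
    q_logged[i] = Q_hat(x_i, a_i)              (model at the logged action)
    """
    w, R = np.asarray(w, dtype=float), np.asarray(R, dtype=float)
    q_pi_e = np.asarray(q_pi_e, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    return float(np.mean(q_pi_e + w * (R - q_logged)))
```

If (\hat Q) is perfect, the residual term vanishes and variance collapses; if propensities are perfect, the residual corrects any model bias. That is the "robust if one side is decent" property.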

5.4 FQE (Fitted Q Evaluation)

Learn (Q^{\pi_e}(x,a)) via temporal-difference style regression on logged data, then evaluate policy value from initial states. Helpful when action space is richer than binary passive/aggressive.
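
As a sketch under strong simplifying assumptions (tabular states, deterministic candidate policy), FQE is repeated regression toward the one-step target; here the "regression" is per-cell averaging:

```python
import numpy as np

def fqe_tabular(transitions, pi_e, n_states, n_actions, gamma=0.99, iters=200):
    """Tabular Fitted Q Evaluation: learn Q^{pi_e} from logged transitions.

    transitions: list of (s, a, r, s_next, done) tuples
    pi_e[s]: action the (deterministic) candidate policy takes in state s
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        targets = np.zeros_like(Q)
        counts = np.zeros_like(Q)
        for s, a, r, s2, done in transitions:
            # One-step bootstrap target under the candidate policy.
            target = r if done else r + gamma * Q[s2, pi_e[s2]]
            targets[s, a] += target
            counts[s, a] += 1
        mask = counts > 0
        Q[mask] = targets[mask] / counts[mask]  # fitted step: per-cell mean
    return Q
```

Policy value is then `Q[s0, pi_e[s0]]` averaged over initial states. With continuous features, replace the per-cell averaging with any supervised regressor fit at each iteration.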


6) Overlap diagnostics (ship-stop criterion)

Before trusting any estimate, check:

- support/overlap: (\pi_b(a\mid x) > 0) wherever (\pi_e(a\mid x) > 0),
- effective sample size, (\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2), relative to (N),
- the shape of the weight distribution (max weight, weight concentration).

Useful thresholds (example):

- (\mathrm{ESS}/N) above roughly 0.1,
- no single episode carrying more than a few percent of total weight.

Weight clipping example:

[ \tilde w_i = \min(w_i, c) ]

Report sensitivity across clipping caps (c\in\{5,10,20,50\}).
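
The ESS and clipping-sensitivity checks can be sketched in a few lines (helper names illustrative):

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum w)^2 / sum w^2; a low ESS/N ratio signals poor overlap."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def clipped_snips(w, R, caps=(5, 10, 20, 50)):
    """SNIPS under each clipping cap c; a large spread across caps means
    the estimate is driven by a few extreme weights and should not be trusted."""
    w, R = np.asarray(w, dtype=float), np.asarray(R, dtype=float)
    out = {}
    for c in caps:
        wc = np.minimum(w, c)
        out[c] = float(np.sum(wc * R) / np.sum(wc))
    return out
```

A stable candidate should give nearly identical values across caps; treat a wide spread as a ship-stop, not a tuning knob.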


7) Tail-first evaluation metrics

Do not optimize only (\mathbb{E}[C]). Track:

- slippage quantiles q95 and q99 (bps),
- CVaR(_{95}) of per-order cost,
- hard-risk breach counts (limit violations, participation caps).

Candidate promotion should require:

  1. non-inferior mean slippage,
  2. improved q95 or CVaR(_{95}), and
  3. no increase in hard-risk breaches.
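
The quantile and CVaR gates can be computed directly (assuming per-order costs in bps, positive = cost):

```python
import numpy as np

def tail_metrics(costs, qs=(0.95, 0.99)):
    """Tail quantiles and CVaR_95 of per-order cost; higher cost = worse."""
    c = np.asarray(costs, dtype=float)
    out = {f"q{int(round(q * 100))}": float(np.quantile(c, q)) for q in qs}
    # CVaR_95: mean cost over the worst 5% of orders (at or beyond VaR_95).
    var95 = np.quantile(c, 0.95)
    out["cvar95"] = float(c[c >= var95].mean())
    return out
```

Compare these between behavior and candidate on the same replayed episodes; promotion requires the candidate's q95/CVaR(_{95}) to be no worse.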

8) Counterfactual replay architecture (Vellab-friendly)

8.1 Components

- Decision logger enforcing the Section 3 data contract
- Market-state store (replayable order-book / trade snapshots)
- Counterfactual replayer with an explicit fill model (queue position, partial fills)
- OPE estimator suite (IPS/SNIPS/DR/FQE) plus overlap diagnostics
- Promotion report: mean, q95/q99, CVaR, and breach counts per regime

8.2 Determinism requirements

- Same logged inputs and same seed must reproduce identical replay outputs
- Versioned, immutable data snapshots; pinned policy and config hashes
- No wall-clock or live-feed dependence anywhere in the replay path


9) KIS/KRX practical notes

For KIS live integration, add strict guardrails (examples):

- hard per-order and per-minute notional caps,
- maximum participation-rate limits,
- a kill-switch that runs independently of the policy code path,
- reconciliation of logged propensities against actually routed orders.

KRX-specific microstructure events can dominate slippage tails; treat them as first-class regimes, not outliers to drop.


10) Rollout ladder

  1. Offline replay + OPE pass (multi-week, multiple regimes)
  2. Paper/live-shadow mode (decision logging, no actual routing)
  3. 5% notional canary with hard kill-switch
  4. 20% notional with hourly drift checks
  5. Full rollout only after stable tail metrics

Kill-switch triggers (example):

- realized slippage q95 over a rolling window exceeds its replay-predicted bound,
- order reject/error rate spikes,
- live action distribution drifts far from the distribution assumed during OPE.
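
A tiny guard-function sketch for such triggers (metric keys and limits are illustrative):

```python
def should_kill(metrics, limits):
    """Return the first breached limit name, or None if all metrics are in bounds."""
    for key, cap in limits.items():
        if metrics.get(key, 0.0) > cap:
            return key
    return None

# Illustrative limits; in practice derive bounds from the replay-predicted tails.
limits = {"slippage_q95_bps": 8.0, "reject_rate": 0.02}
breach = should_kill({"slippage_q95_bps": 9.5}, limits)
```

Keep this check outside the policy process so a misbehaving policy cannot disable its own kill-switch.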


11) Common failure modes

- Missing or after-the-fact reconstructed propensities (silently biased estimates)
- Poor overlap: the candidate takes actions the behavior policy almost never took
- Reporting one clipping cap instead of sensitivity across caps
- Benchmark drift: changing the slippage definition mid-study
- Improving mean slippage while q95/q99 and CVaR worsen
- Dropping stressed-liquidity regimes as "outliers" when they dominate the tails

12) Minimal implementation checklist

- Data contract from Section 3 enforced at the decision logger
- One primary slippage benchmark fixed before any evaluation
- IPS, SNIPS, DR (and FQE where the action space warrants it) all reported
- Overlap diagnostics and a clipping-sensitivity table attached
- Tail metrics (q95/q99, CVaR(_{95})) gating promotion
- Rollout ladder with kill-switch wired before any live notional

Use this playbook to move execution research from “promising chart” to “risk-aware deployability.”