Support-Aware Safe Policy Improvement for Slippage Playbook
Date: 2026-03-03
Category: research (quant execution / slippage modeling)
Why this playbook exists
Most execution teams already run counterfactual replay and off-policy evaluation (OPE) before rollout. That is a good start, but still not enough.
In production, many policy failures happen when a candidate policy is:
- evaluated on weak action support (candidate asks for actions rarely seen in logs),
- promoted using mean slippage wins while tail risk worsens,
- deployed globally even though confidence exists only in a few regimes.
This playbook focuses on a stricter goal:
Only improve where data support is strong, and default to baseline elsewhere.
The result is slower but much safer slippage improvement.
1) Formal objective: improve over baseline with high confidence
Let:
- (\pi_b): baseline behavior policy currently in production,
- (\pi_c): candidate policy,
- (C): slippage cost in bps (lower is better),
- (\Delta = C_{\pi_b} - C_{\pi_c}): improvement (positive means candidate is better).
Require a lower confidence bound (LCB) above zero:
[ \mathrm{LCB}_{1-\alpha}(\Delta) > 0 ]
But apply this per regime (symbol × time bucket × volatility/spread/liquidity state), not only globally.
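The per-regime LCB gate above can be sketched with a simple bootstrap over paired per-order slippage costs. This is a minimal sketch, assuming paired baseline/candidate cost samples per regime; the function name `lcb_delta` and the percentile-bootstrap choice are illustrative, not prescribed by the playbook.

```python
import numpy as np

def lcb_delta(baseline_cost, candidate_cost, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap lower confidence bound on Delta = C_b - C_c (bps).

    Positive LCB => candidate improves with roughly (1 - alpha) confidence.
    Inputs are paired per-order slippage costs for ONE regime.
    """
    rng = np.random.default_rng(seed)
    delta = np.asarray(baseline_cost, dtype=float) - np.asarray(candidate_cost, dtype=float)
    n = len(delta)
    # Resample paired differences, take the alpha-quantile of bootstrap means.
    means = np.array([delta[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.quantile(means, alpha)

# Gate per regime, not only globally:
# promote = all(lcb_delta(b, c) > 0 for (b, c) in regime_pairs.values())
```

A block-bootstrap variant is preferable when orders within a regime are serially correlated; the i.i.d. resampling here is the simplest baseline.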
2) Build a support map before any “improvement” claim
For each decision context (x), compute candidate-to-baseline propensity ratio:
[ r(x,a)=\frac{\pi_c(a\mid x)}{\pi_b(a\mid x)} ]
Then track support diagnostics:
Effective sample size (ESS) [ \mathrm{ESS}=\frac{(\sum_i w_i)^2}{\sum_i w_i^2},\quad w_i=\prod_t r_{i,t} ]
Coverage gap [ \mathrm{Gap}=\Pr_{x\sim D}[\pi_c(\cdot\mid x)\ \text{places mass on actions with sparse/no baseline support}] ]
Weight concentration (top-1%, top-5% weight share).
Practical go/no-go thresholds (example):
- ESS/N < 0.15 ⇒ no promotion in that regime,
- top-1% weights > 25% total weight ⇒ unstable estimate,
- support gap > 3% notional ⇒ baseline lock.
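The ESS and weight-concentration diagnostics, together with the example go/no-go thresholds above, can be implemented directly. A minimal sketch; `support_diagnostics` and `promotable` are illustrative names, and the thresholds are the example values from this section, not universal constants.

```python
import numpy as np

def support_diagnostics(weights):
    """ESS ratio and top-1% weight share for one regime.

    weights: per-episode importance weights w_i = prod_t r_{i,t}.
    """
    w = np.asarray(weights, dtype=float)
    n = len(w)
    ess = w.sum() ** 2 / (w ** 2).sum()          # ESS = (sum w)^2 / sum w^2
    w_sorted = np.sort(w)[::-1]
    k1 = max(1, int(0.01 * n))                   # at least one weight in "top 1%"
    top1_share = w_sorted[:k1].sum() / w.sum()
    return {"ess_ratio": ess / n, "top1_share": top1_share}

def promotable(diag, gap_notional):
    """Apply the example go/no-go thresholds from the playbook."""
    return (diag["ess_ratio"] >= 0.15        # ESS/N floor
            and diag["top1_share"] <= 0.25   # weight concentration cap
            and gap_notional <= 0.03)        # support-gap cap (fraction of notional)
```

Uniform weights give `ess_ratio == 1.0`; a single dominant weight collapses it toward `1/N`, which is exactly the unstable-estimate case the thresholds are meant to catch.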
3) Conservative policy mixing (baseline bootstrapping)
Instead of directly deploying (\pi_c), deploy a mixed policy:
[ \pi_{mix}(a\mid x)=(1-\lambda(x))\pi_b(a\mid x)+\lambda(x)\pi_c(a\mid x) ]
where (\lambda(x)\in[0,1]) is confidence-driven:
- high support + stable LCB ⇒ (\lambda(x)\uparrow),
- weak support / drift / unstable tails ⇒ (\lambda(x)\downarrow) (fallback to baseline).
This is the practical execution interpretation of safe policy improvement with baseline bootstrapping: know what you know, copy baseline when you don’t.
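The confidence-driven mixing can be sketched as a small mapping from diagnostics to (\lambda(x)), plus the convex combination itself. The ramp shape, the `[0.15, 0.5]` ESS band, and the `lam_max` cap are illustrative assumptions; only the "zero lambda without support and positive LCB" behavior comes from the playbook.

```python
import numpy as np

def mixing_weight(ess_ratio, lcb, lam_max=0.3):
    """Map support/confidence diagnostics to lambda(x) in [0, lam_max].

    Pure baseline (lambda = 0) unless the ESS ratio clears the support
    gate AND the improvement LCB is positive; then ramp up with support.
    """
    if ess_ratio < 0.15 or lcb <= 0:
        return 0.0
    support = min(1.0, (ess_ratio - 0.15) / 0.35)  # linear ramp over [0.15, 0.5]
    return lam_max * support

def mixed_policy(pi_b, pi_c, lam):
    """pi_mix = (1 - lambda) * pi_b + lambda * pi_c over a discrete action set."""
    pi_b, pi_c = np.asarray(pi_b, dtype=float), np.asarray(pi_c, dtype=float)
    return (1 - lam) * pi_b + lam * pi_c
```

Because the mixture is convex, `mixed_policy` always returns a valid distribution when its inputs are valid distributions, so the fallback path is safe by construction.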
4) Slippage metric stack: optimize a vector, not one number
For each regime and globally, evaluate:
- mean implementation shortfall,
- median slippage,
- q90/q95/q99,
- CVaR(_{95}),
- completion miss rate,
- forced catch-up aggression rate,
- post-fill adverse markout (1s/5s/30s).
Promotion rule example:
1. (\mathrm{LCB}(\Delta_{mean})>0),
2. (\mathrm{LCB}(\Delta_{q95})>0) or (\mathrm{LCB}(\Delta_{CVaR95})>0),
3. no increase in completion miss rate beyond a hard budget.
If (1) passes but (2) fails, reject. Mean-only wins are a common production trap.
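The tail side of the promotion rule needs a CVaR estimate and a combined gate. A minimal sketch: `cvar` uses the empirical tail mean beyond the q-quantile, and `promote` encodes the three-part rule above; the function names and the interpolation choice in `np.quantile` are assumptions.

```python
import numpy as np

def cvar(costs, q=0.95):
    """Empirical CVaR_q of a cost sample: mean cost at or beyond the q-quantile.

    Lower is better, matching slippage in bps.
    """
    c = np.asarray(costs, dtype=float)
    var = np.quantile(c, q)
    return c[c >= var].mean()

def promote(lcb_mean, lcb_q95, lcb_cvar95, miss_rate, miss_budget):
    """Playbook rule: mean LCB > 0 AND (q95 OR CVaR95 LCB > 0) AND miss budget held."""
    tail_ok = (lcb_q95 > 0) or (lcb_cvar95 > 0)
    return (lcb_mean > 0) and tail_ok and (miss_rate <= miss_budget)
```

Note that `promote(1.0, -0.1, -0.2, ...)` is rejected even though the mean improves, which is exactly the mean-only trap the rule is designed to block.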
5) DR-first estimation with explicit uncertainty penalties
For each regime, estimate the candidate value using a doubly robust estimator, where (\hat Q(x_i,\pi_c)=\sum_a \pi_c(a\mid x_i)\,\hat Q(x_i,a)) is the model value of the candidate and (w_i=r(x_i,a_i)) is the importance weight at the logged action:
[ \hat V_{DR}=\frac{1}{N}\sum_i\left[\hat Q(x_i,\pi_c)+w_i\big(R_i-\hat Q(x_i,a_i)\big)\right] ]
Then penalize low-support regimes:
[ \hat V_{safe}=\hat V_{DR}-\kappa\cdot \mathrm{SE}(\hat V_{DR})-\eta\cdot \mathrm{SupportPenalty} ]
Deploy only if (\hat V_{safe}>V_{baseline}).
Operationally, this avoids “paper alpha” caused by under-covered action regions.
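The DR estimate and the penalized deployment score can be sketched in a few lines. This assumes per-context arrays are already computed (the `\hat Q` model values, logged rewards, and importance weights); `v_dr` and `v_safe` are illustrative names, and the defaults (\kappa=2), (\eta=1) are placeholders for tuned values.

```python
import numpy as np

def v_dr(q_pi_c, q_logged, rewards, weights):
    """Doubly robust value estimate for the candidate on one regime.

    q_pi_c  : sum_a pi_c(a|x_i) * Qhat(x_i, a), per context
    q_logged: Qhat(x_i, a_i) at the logged action
    rewards : realized outcomes R_i (e.g. negative slippage in bps)
    weights : importance weights w_i = pi_c(a_i|x_i) / pi_b(a_i|x_i)
    Returns (point estimate, standard error).
    """
    terms = np.asarray(q_pi_c) + np.asarray(weights) * (np.asarray(rewards) - np.asarray(q_logged))
    return terms.mean(), terms.std(ddof=1) / np.sqrt(len(terms))

def v_safe(v_hat, se, support_penalty, kappa=2.0, eta=1.0):
    """Penalized score; deploy only if v_safe > V_baseline."""
    return v_hat - kappa * se - eta * support_penalty
```

Sanity check: with a perfect Q-model and unit weights, the correction term hands back the empirical mean reward, so the estimator degrades gracefully toward the on-policy average.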
6) Regime-gated rollout blueprint
Phase 0: Shadow
- Log decisions for (\pi_c) and (\pi_{mix}), execute (\pi_b) only.
- Produce daily support map + LCB dashboard.
Phase 1: Support-positive buckets only
- Allow small (\lambda) (e.g., 0.1–0.2) only in high-confidence regimes.
- Keep baseline lock in unsupported buckets.
Phase 2: Notional expansion by confidence
- Expand traffic only where last M windows keep positive LCB and stable tails.
- Cap per-symbol per-regime notional.
Phase 3: Dynamic fallback
Auto-revert (\lambda\to 0) when any trigger fires:
- q95 degradation > threshold,
- ESS collapse,
- drift alarm on fill-hazard/markout residuals,
- completion SLA breach.
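The Phase 3 auto-revert logic is a plain any-trigger-fires check. A minimal sketch; the metric and config key names are illustrative and should map onto whatever the monitoring stack already emits.

```python
def should_fallback(metrics, cfg):
    """Any single trigger fires => lambda -> 0 (auto-revert to baseline).

    metrics: latest per-regime monitoring snapshot.
    cfg    : per-regime thresholds.
    """
    return (metrics["q95_degradation_bps"] > cfg["q95_threshold_bps"]  # tail got worse
            or metrics["ess_ratio"] < cfg["ess_floor"]                 # ESS collapse
            or metrics["drift_alarm"]                                  # fill-hazard/markout drift
            or metrics["completion_sla_breach"])                       # SLA breach
```

Keeping this as a pure function over a snapshot makes the trigger trivially testable and keeps the revert decision out of the hot execution path.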
7) Data contract (minimum required fields)
Per child-order decision event:
- context: symbol, side, remaining qty, urgency, time-to-close,
- microstructure: spread, depth ladder, imbalance, queue-ahead estimate,
- latency: quote age, ack lag, cancel/replace lag,
- policy info: chosen action, (\pi_b(a\mid x)), (\pi_c(a\mid x)), (\lambda(x)),
- outcomes: fill ratio, fill delay, realized IS/slippage, markouts,
- regime tags: open/close/news/volatility/auction/VI flags.
No propensity logging = no trustworthy safe-improvement claim.
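The data contract above can be pinned down as a typed record so schema drift is caught at log time. This is a sketch, not a full schema: field names are illustrative, several contract fields (depth ladder, latency fields, regime tags) are elided for brevity, and the key point is that `pi_b` is mandatory.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict

@dataclass
class DecisionEvent:
    """Minimum logging contract per child-order decision event (abridged)."""
    # context
    symbol: str
    side: str
    remaining_qty: float
    urgency: float
    time_to_close_s: float
    # microstructure (abridged)
    spread_bps: float
    imbalance: float
    # policy info -- without pi_b, no trustworthy safe-improvement claim
    action: str
    pi_b: float   # baseline propensity pi_b(a|x)
    pi_c: float   # candidate propensity pi_c(a|x)
    lam: float    # lambda(x) actually used by pi_mix
    # outcomes
    fill_ratio: float
    slippage_bps: float
    markouts_bps: Dict[str, float] = field(default_factory=dict)  # {"1s": .., "5s": .., "30s": ..}
```

Making `pi_b` a required field (no default) means an event simply cannot be constructed without the baseline propensity, which enforces the "no propensity logging, no claim" rule structurally.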
8) Failure modes and practical fixes
Global confidence hides local risk
Fix: enforce per-regime LCB gates.
Candidate over-extrapolates beyond support
Fix: baseline bootstrapping + action masking in low-count cells.
Estimator disagreement ignored (IPS/SNIPS/DR/FQE)
Fix: require directional agreement, or treat disagreement as an uncertainty spike.
Tail blindness under mean-improvement pressure
Fix: block promotion unless tail metrics also improve.
Regime drift after rollout
Fix: online drift monitor + automatic (\lambda) decay.
9) 30-day implementation plan
Week 1 — Support observability
- Add propensity + candidate-score logging to execution decisions.
- Build ESS/coverage/weight dashboards by regime.
Week 2 — Safe estimator layer
- Implement DR + bootstrap confidence intervals.
- Add support-penalty adjusted score (safe value).
Week 3 — Controlled mixed policy
- Implement (\pi_{mix}) with configurable (\lambda(x)).
- Activate shadow and paper decision tracking.
Week 4 — Canary + guardrails
- Enable low-(\lambda) live canary in support-positive regimes only.
- Wire automatic fallback triggers and publish postmortem-ready logs.
References
- Dudík, M., Langford, J., & Li, L. (2011). Doubly Robust Policy Evaluation and Learning. https://arxiv.org/abs/1103.4601
- Swaminathan, A., & Joachims, T. (2015). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. https://arxiv.org/abs/1503.02834
- Thomas, P. S., & Brunskill, E. (2016). Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning. https://arxiv.org/abs/1604.00923
- Laroche, R., Trichelair, P., & Tachet des Combes, R. (2019). Safe Policy Improvement with Baseline Bootstrapping. https://arxiv.org/abs/1712.06924
- Saito, Y., Aihara, S., Matsutani, M., & Narita, Y. (2021). Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. https://arxiv.org/abs/2008.07146
- Cont, R., & Kukanov, A. (2014). Optimal order placement in limit order markets. https://arxiv.org/abs/1210.1625
TL;DR
A candidate execution policy is not “better” because average backtest slippage is lower.
It is better only if it improves with confidence where support exists, keeps tails under control, and safely falls back to baseline where uncertainty is high.
That is how you ship slippage improvements without shipping hidden blow-up risk.