Conservative Contextual Bandit Slippage Model for Execution Tactic Selection
Date: 2026-02-27
Category: research (quant execution / slippage modeling)
Why this playbook
Most production execution stacks still choose tactics via static rules:
- "if spread <= X, stay passive"
- "if residual > Y near deadline, cross"
- "if volatility spikes, reduce clip"
Those rules are interpretable, but they leave money on the table because microstructure context changes faster than hand-tuned thresholds.
A contextual bandit can adapt tactic choice online, but naive exploration is dangerous in live trading. This playbook focuses on safe learning: improve slippage while remaining close to a trusted baseline policy.
Problem framing
At each decision step (e.g., every 250ms–1s), select one action from a tactic set:
- A1: passive join best
- A2: passive improve by one tick
- A3: midpoint/pegged passive
- A4: small taker slice
- A5: aggressive sweep chunk
Given context vector (x_t), choose (a_t) and observe delayed execution outcome.
Define cost-aware reward (higher is better):
[ r_t = -\text{IS}_{t,\Delta} - \lambda_{u}\,\text{UnderfillPenalty}_{t,\Delta} - \lambda_{r}\,\text{RejectPenalty}_{t,\Delta} ]
Where:
- (\text{IS}_{t,\Delta}): implementation shortfall over horizon (\Delta) (e.g., 5s/30s bucket),
- underfill penalty protects completion risk,
- reject penalty captures venue/API friction.
Bandit objective: maximize cumulative reward (minimize slippage + completion damage).
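The reward above can be sketched as a small scoring function. The penalty shapes and lambda weights here are illustrative assumptions, not calibrated production values:

```python
# Sketch of the cost-aware reward r_t defined above.
# lambda_u / lambda_r defaults and the convex underfill penalty are
# illustrative assumptions; calibrate them to your completion-risk tolerance.
def step_reward(is_bps: float,
                underfill_frac: float,
                n_rejects: int,
                lambda_u: float = 2.0,
                lambda_r: float = 0.1) -> float:
    """r_t = -IS - lambda_u * UnderfillPenalty - lambda_r * RejectPenalty."""
    underfill_penalty = underfill_frac ** 2   # convex: punish large residuals harder
    reject_penalty = float(n_rejects)
    return -is_bps - lambda_u * underfill_penalty - lambda_r * reject_penalty
```

With zero IS, no underfill, and no rejects the reward is exactly zero, which makes the decomposition easy to sanity-check in logs.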
Context design (production features)
Minimum context block per decision:
- Microstructure
- spread (ticks/bps), top-of-book depth, queue imbalance, microprice drift.
- Toxicity / flow
- short-horizon OFI, cancel burst, short-term markout proxy.
- Execution state
- residual qty, residual time, realized participation vs target POV.
- Regime flags
- open/close/auction-adjacent, news window, volatility state.
- Infra/venue
- recent reject rate, ack latency bucket, throttling pressure.
Keep features strictly available at decision time (no look-ahead leakage).
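A minimal context container following the feature groups above. The field names are assumptions for illustration; the key property is that every field is measurable at decision time:

```python
from dataclasses import dataclass, asdict

# Illustrative context block mirroring the feature groups listed above.
# Field names are assumptions; all values must be available at decision
# time (no look-ahead leakage).
@dataclass(frozen=True)
class DecisionContext:
    spread_bps: float            # microstructure
    top_depth_imbalance: float   # microstructure
    ofi_short: float             # toxicity / flow
    residual_frac: float         # execution state
    time_remaining_frac: float   # execution state
    regime_stress: int           # regime flag (0 = calm, 1 = stress)
    reject_rate_recent: float    # infra / venue

    def as_vector(self) -> list[float]:
        # asdict preserves declared field order, so the vector layout is stable
        return [float(v) for v in asdict(self).values()]
```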
Safety-first policy design
Let (\pi_b) be the current baseline (your existing deterministic/probabilistic tactic policy). Deploy (\pi) only if it respects conservative constraints relative to (\pi_b).
Practical safety contract:
[ \sum_{s=1}^{t} \mathbb{E}[r_s(\pi)] \ge (1-\alpha)\sum_{s=1}^{t} \mathbb{E}[r_s(\pi_b)] - B_t ]
- (\alpha): tolerated shortfall vs baseline (small in production),
- (B_t): finite exploration budget (drawdown allowance).
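The safety contract reduces to a one-line check that the controller can evaluate on every window. The default alpha and budget here are illustrative placeholders:

```python
# Direct check of the conservative constraint above:
#   sum E[r(pi)] >= (1 - alpha) * sum E[r(pi_b)] - B_t
# alpha and budget defaults are illustrative placeholders.
# Note: when cumulative rewards are negative (pure costs), the (1 - alpha)
# factor loosens rather than tightens the bound, so the budget B_t carries
# most of the protection there.
def safety_ok(cum_reward_pi: float,
              cum_reward_baseline: float,
              alpha: float = 0.02,
              budget: float = 50.0) -> bool:
    return cum_reward_pi >= (1.0 - alpha) * cum_reward_baseline - budget
```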
Implementation patterns that work
- Baseline fallback on low support (SPIBB-style idea)
- If (context, action) support is weak, copy the baseline action probability.
- Conservative exploration gating
- Explore only when confidence interval of candidate action clears baseline by margin.
- Traffic ladder rollout
- 1% → 5% → 10% → 25% with rollback triggers on p95/p99 cost and underfill.
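The first two patterns can be combined into one action-selection rule. This is a sketch under assumed inputs: `q_hat` holds per-action value estimates, `ci_width` their confidence half-widths, and `n_sa` support counts for the current context bucket; the support threshold and margin are placeholders:

```python
import math

# Sketch of SPIBB-style fallback + conservative exploration gating.
# min_support and margin are illustrative placeholders, not tuned values.
def gated_action(q_hat: dict, ci_width: dict, n_sa: dict,
                 baseline_action: str,
                 min_support: int = 50,
                 margin: float = 0.0) -> str:
    """Pick the best non-baseline action whose lower confidence bound clears
    the baseline's estimated value by `margin`; otherwise fall back to the
    baseline action (weak support never deviates)."""
    baseline_value = q_hat.get(baseline_action, 0.0)
    best_action, best_lcb = baseline_action, -math.inf
    for a, q in q_hat.items():
        if a == baseline_action or n_sa.get(a, 0) < min_support:
            continue  # low support: copy baseline instead of exploring
        lcb = q - ci_width[a]  # lower confidence bound on candidate value
        if lcb > baseline_value + margin and lcb > best_lcb:
            best_action, best_lcb = a, lcb
    return best_action
```

An action only displaces the baseline when its pessimistic value estimate still beats the baseline, which is what keeps exploration one-sided.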
Offline gate before online learning
Never start from pure online exploration. First run offline policy evaluation (OPE) on logged bandit data.
Use a robust estimator stack:
- IPS / SNIPS (propensity-weighted sanity check),
- Doubly Robust (DR) for lower variance + model correction,
- stratified OPE by regime (open, calm day, stress day, close).
A minimal DR form:
[ \hat{V}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{q}(x_i,\pi(x_i)) + \frac{\pi(a_i|x_i)}{\hat{\mu}(a_i|x_i)}\left(r_i-\hat{q}(x_i,a_i)\right)\right] ]
Where:
- (\hat{\mu}): behavior policy propensity model,
- (\hat{q}): reward model.
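The DR formula above, sketched for a deterministic candidate policy (so the importance weight is an indicator over the propensity). The fitted models \(\hat{q}\) and \(\hat{\mu}\) are assumed given as plain containers here:

```python
# Minimal DR estimator matching the formula above, for a deterministic
# candidate policy pi. q_all[i][a] plays the role of q_hat(x_i, a) and
# mu_probs[i] of mu_hat(a_i | x_i); both are assumed to come from fitted models.
def dr_value(rewards, actions, pi_actions, mu_probs, q_all):
    """rewards[i]: observed r_i; actions[i]: logged action a_i;
    pi_actions[i]: pi(x_i); mu_probs[i]: behavior propensity of a_i;
    q_all[i]: dict mapping action -> q_hat(x_i, action)."""
    n = len(rewards)
    total = 0.0
    for i in range(n):
        direct = q_all[i][pi_actions[i]]
        # pi(a_i|x_i) is 1{pi(x_i) = a_i} for a deterministic policy
        weight = (1.0 if pi_actions[i] == actions[i] else 0.0) / mu_probs[i]
        total += direct + weight * (rewards[i] - q_all[i][actions[i]])
    return total / n
```

When the candidate disagrees with the logged action, the correction term vanishes and the estimate falls back on the reward model, which is where the variance reduction comes from.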
Promotion rule example: require positive DR lift with one-sided confidence and no regime-level tail regressions.
Online controller architecture
Step loop
- Build context (x_t).
- Score candidate actions with uncertainty-aware reward model.
- Apply conservative gate (vs baseline action value).
- Sample/choose action under safety constraint.
- Execute child order + log full decision tuple.
- Update model incrementally (or mini-batch).
Decision tuple to log
- context snapshot hash + full feature vector,
- chosen action and full action probability vector,
- baseline recommended action,
- pre-trade predicted reward/uncertainty,
- realized reward decomposition (IS / underfill / rejects),
- venue and latency diagnostics.
Without propensity + baseline logs, OPE and incident forensics are crippled.
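A decision record along the lines of the list above can be sketched as follows. Field names are assumptions; the non-negotiable parts are the full propensity vector and the baseline action, logged at decision time:

```python
import hashlib
import json
import time

# Illustrative decision record following the logging list above.
# Field names are assumptions; action_probs (full propensity vector) and
# baseline_action are the fields OPE and forensics cannot live without.
def make_decision_record(features: dict, action: str,
                         action_probs: dict, baseline_action: str,
                         pred_reward: float, pred_std: float) -> dict:
    payload = json.dumps(features, sort_keys=True).encode()
    return {
        "ts": time.time(),
        "context_hash": hashlib.sha256(payload).hexdigest()[:16],
        "features": features,
        "action": action,
        "action_probs": action_probs,
        "baseline_action": baseline_action,
        "pred_reward": pred_reward,
        "pred_std": pred_std,
        # realized reward decomposition (IS / underfill / rejects) is
        # joined in later, once the horizon Delta has elapsed
        "realized": None,
    }
```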
Guardrails and kill switches
Hard guardrails (non-negotiable):
- max aggression by regime,
- max participation and clip size,
- minimum completion trajectory,
- reject-rate and latency breaker,
- venue quarantine on persistent failures.
Auto-rollback triggers (example):
- p95 IS degradation > threshold for N consecutive windows,
- underfill rate breach at deadline buckets,
- confidence collapse (uncertainty spike / support cliff),
- policy drift too far from baseline action distribution.
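The four triggers above reduce to a single predicate the canary controller can poll each window. All thresholds and window counts here are illustrative placeholders:

```python
# Sketch of the auto-rollback predicate over the triggers listed above.
# Every threshold and the window count are illustrative placeholders.
def should_rollback(p95_is_deltas: list[float],
                    underfill_rate: float,
                    max_pred_std: float,
                    baseline_tv_distance: float,
                    is_threshold: float = 2.0,    # bps of p95 IS degradation
                    n_consecutive: int = 3,
                    underfill_limit: float = 0.05,
                    std_limit: float = 1.5,
                    drift_limit: float = 0.3) -> bool:
    recent = p95_is_deltas[-n_consecutive:]
    sustained_is = (len(recent) == n_consecutive
                    and all(d > is_threshold for d in recent))
    return (sustained_is                              # p95 IS degradation, N windows
            or underfill_rate > underfill_limit       # deadline-bucket underfill breach
            or max_pred_std > std_limit               # confidence collapse
            or baseline_tv_distance > drift_limit)    # drift from baseline action dist.
```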
Validation scorecard
Run daily and weekly.
- Cost outcomes
- p50/p95/p99 IS vs baseline.
- Completion quality
- underfill %, catch-up aggression incidents.
- Safety metrics
- cumulative reward gap vs baseline budget,
- rollback count and trigger type.
- Learning quality
- calibration of predicted reward intervals,
- propensity overlap diagnostics,
- per-regime lift stability.
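For the propensity-overlap diagnostic, one cheap summary is the effective sample size (ESS) of the importance weights \(w_i = \pi(a_i|x_i)/\hat{\mu}(a_i|x_i)\); a low ESS relative to n means the candidate policy is effectively evaluated on a sliver of the logged data:

```python
# Standard effective-sample-size formula applied to importance weights.
# ESS = (sum w)^2 / sum w^2; equals n for uniform weights, collapses
# toward 1 as a few weights dominate.
def effective_sample_size(weights: list[float]) -> float:
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0
```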
Common failure modes
- Reward misspecification: optimizing short-horizon markout while silently increasing deadline underfills.
- Propensity bias: missing/incorrect logging policy probabilities ruins OPE.
- Unsafe cold start: exploration too broad before support is accumulated.
- Regime mixing: one model overfits calm states and explodes during stress windows.
- Metric gaming: better average bps but worse tails; always evaluate p95/p99 and completion jointly.
Rollout plan
Phase 1 – Shadow replay
- Train bandit offline from logged baseline data.
- Publish daily OPE lift + uncertainty by regime.
Phase 2 – Assisted mode
- Bandit proposes; baseline executes.
- Compare counterfactual recommendations, verify stability.
Phase 3 – Conservative canary
- Live at 1โ5% traffic with strict rollback.
- Baseline fallback on low support contexts.
Phase 4 – Scale with governance
- Increase traffic gradually once p95 + completion pass.
- Weekly champion/challenger review and monthly stress replay.
Practical takeaway
For live execution, contextual bandits are useful only when wrapped in conservative policy governance.
Treat baseline as a safety anchor, not as legacy baggage. Done right, you get adaptive tactic selection with bounded downside instead of "smart" exploration that pays tail-risk tuition in production.
Pointers for deeper reading
- Li et al. – A Contextual-Bandit Approach to Personalized News Article Recommendation (WWW 2010).
- Dudík, Langford, Li – Doubly Robust Policy Evaluation and Optimization (Statistical Science 2014).
- Laroche, Trichelair, Tachet des Combes – Safe Policy Improvement with Baseline Bootstrapping (SPIBB) (ICML 2019 / PMLR 97).
- Kazerouni et al. – Conservative Contextual Linear Bandits (NeurIPS 2017).
- Kiyohara et al. – Conservative Contextual Bandits: Beyond Linear Representations (arXiv 2024).
- Auer, Cesa-Bianchi, Fischer – Finite-time Analysis of the Multiarmed Bandit Problem (Machine Learning 2002).