Conservative Contextual Bandit Slippage Model for Execution Tactic Selection
Date: 2026-02-27
Category: research (quant execution / slippage modeling)
Why this playbook
Most production execution stacks still choose tactics via static rules:
- "if spread <= X, stay passive"
- "if residual > Y near deadline, cross"
- "if volatility spikes, reduce clip"
Those rules are interpretable, but they leave money on the table because microstructure context changes faster than hand-tuned thresholds.
A contextual bandit can adapt tactic choice online, but naive exploration is dangerous in live trading. This playbook focuses on safe learning: improve slippage while remaining close to a trusted baseline policy.
Problem framing
At each decision step (e.g., every 250ms–1s), select one action from a tactic set:
- A1: passive join best
- A2: passive improve by one tick
- A3: midpoint/pegged passive
- A4: small taker slice
- A5: aggressive sweep chunk
Given context vector (x_t), choose (a_t) and observe delayed execution outcome.
Define cost-aware reward (higher is better):
[ r_t = -\text{IS}_{t,\Delta} - \lambda_{u}\,\text{UnderfillPenalty}_{t,\Delta} - \lambda_{r}\,\text{RejectPenalty}_{t,\Delta} ]
Where:
- (\text{IS}_{t,\Delta}): implementation shortfall over horizon (\Delta) (e.g., 5s/30s bucket),
- underfill penalty protects completion risk,
- reject penalty captures venue/API friction.
Bandit objective: maximize cumulative reward (minimize slippage + completion damage).
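The reward above can be sketched as a small scoring function. The penalty shapes and lambda weights here are illustrative assumptions, not calibrated production values:

```python
# Sketch of the cost-aware reward r_t defined above.
# lambda_u / lambda_r defaults and the convex underfill penalty are
# illustrative assumptions; calibrate them to your completion-risk tolerance.
def step_reward(is_bps: float,
                underfill_frac: float,
                n_rejects: int,
                lambda_u: float = 2.0,
                lambda_r: float = 0.1) -> float:
    """r_t = -IS - lambda_u * UnderfillPenalty - lambda_r * RejectPenalty."""
    underfill_penalty = underfill_frac ** 2   # convex: punish large residuals harder
    reject_penalty = float(n_rejects)
    return -is_bps - lambda_u * underfill_penalty - lambda_r * reject_penalty
```

With zero IS, no underfill, and no rejects the reward is exactly zero, which makes the decomposition easy to sanity-check in logs.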
Context design (production features)
Minimum context block per decision:
- Microstructure
- spread (ticks/bps), top-of-book depth, queue imbalance, microprice drift.
- Toxicity / flow
- short-horizon OFI, cancel burst, short-term markout proxy.
- Execution state
- residual qty, residual time, realized participation vs target POV.
- Regime flags
- open/close/auction-adjacent, news window, volatility state.
- Infra/venue
- recent reject rate, ack latency bucket, throttling pressure.
Keep features strictly available at decision time (no look-ahead leakage).
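A minimal context container following the feature groups above. The field names are assumptions for illustration; the key property is that every field is measurable at decision time:

```python
from dataclasses import dataclass, asdict

# Illustrative context block mirroring the feature groups listed above.
# Field names are assumptions; all values must be available at decision
# time (no look-ahead leakage).
@dataclass(frozen=True)
class DecisionContext:
    spread_bps: float            # microstructure
    top_depth_imbalance: float   # microstructure
    ofi_short: float             # toxicity / flow
    residual_frac: float         # execution state
    time_remaining_frac: float   # execution state
    regime_stress: int           # regime flag (0 = calm, 1 = stress)
    reject_rate_recent: float    # infra / venue

    def as_vector(self) -> list[float]:
        # asdict preserves declared field order, so the vector layout is stable
        return [float(v) for v in asdict(self).values()]
```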
Safety-first policy design
Let (\pi_b) be the current baseline (your existing deterministic/probabilistic tactic policy). Deploy (\pi) only if it respects conservative constraints relative to (\pi_b).
Practical safety contract:
[ \sum_{s=1}^{t} \mathbb{E}[r_s(\pi)] \ge (1-\alpha)\sum_{s=1}^{t} \mathbb{E}[r_s(\pi_b)] - B_t ]
- (\alpha): tolerated shortfall vs baseline (small in production),
- (B_t): finite exploration budget (drawdown allowance).
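The safety contract reduces to a one-line check that the controller can evaluate on every window. The default alpha and budget here are illustrative placeholders:

```python
# Direct check of the conservative constraint above:
#   sum E[r(pi)] >= (1 - alpha) * sum E[r(pi_b)] - B_t
# alpha and budget defaults are illustrative placeholders.
# Note: when cumulative rewards are negative (pure costs), the (1 - alpha)
# factor loosens rather than tightens the bound, so the budget B_t carries
# most of the protection there.
def safety_ok(cum_reward_pi: float,
              cum_reward_baseline: float,
              alpha: float = 0.02,
              budget: float = 50.0) -> bool:
    return cum_reward_pi >= (1.0 - alpha) * cum_reward_baseline - budget
```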
Implementation patterns that work
- Baseline fallback on low support (SPIBB-style idea)
- If (context, action) support is weak, copy the baseline action probability.
- Conservative exploration gating
- Explore only when confidence interval of candidate action clears baseline by margin.
- Traffic ladder rollout
- 1% → 5% → 10% → 25% with rollback triggers on p95/p99 cost and underfill.
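The first two patterns can be combined into one action-selection rule. This is a sketch under assumed inputs: `q_hat` holds per-action value estimates, `ci_width` their confidence half-widths, and `n_sa` support counts for the current context bucket; the support threshold and margin are placeholders:

```python
import math

# Sketch of SPIBB-style fallback + conservative exploration gating.
# min_support and margin are illustrative placeholders, not tuned values.
def gated_action(q_hat: dict, ci_width: dict, n_sa: dict,
                 baseline_action: str,
                 min_support: int = 50,
                 margin: float = 0.0) -> str:
    """Pick the best non-baseline action whose lower confidence bound clears
    the baseline's estimated value by `margin`; otherwise fall back to the
    baseline action (weak support never deviates)."""
    baseline_value = q_hat.get(baseline_action, 0.0)
    best_action, best_lcb = baseline_action, -math.inf
    for a, q in q_hat.items():
        if a == baseline_action or n_sa.get(a, 0) < min_support:
            continue  # low support: copy baseline instead of exploring
        lcb = q - ci_width[a]  # lower confidence bound on candidate value
        if lcb > baseline_value + margin and lcb > best_lcb:
            best_action, best_lcb = a, lcb
    return best_action
```

An action only displaces the baseline when its pessimistic value estimate still beats the baseline, which is what keeps exploration one-sided.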
Offline gate before online learning
Never start from pure online exploration. First run offline policy evaluation (OPE) on logged bandit data.
Use a robust estimator stack:
- IPS / SNIPS (propensity-weighted sanity check),
- Doubly Robust (DR) for lower variance + model correction,
- stratified OPE by regime (open, calm day, stress day, close).
A minimal DR form:
[ \hat{V}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{q}(x_i,\pi(x_i)) + \frac{\pi(a_i|x_i)}{\hat{\mu}(a_i|x_i)}\left(r_i-\hat{q}(x_i,a_i)\right)\right] ]
Where:
- (\hat{\mu}): behavior policy propensity model,
- (\hat{q}): reward model.
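The DR formula above, sketched for a deterministic candidate policy (so the importance weight is an indicator over the propensity). The fitted models \(\hat{q}\) and \(\hat{\mu}\) are assumed given as plain containers here:

```python
# Minimal DR estimator matching the formula above, for a deterministic
# candidate policy pi. q_all[i][a] plays the role of q_hat(x_i, a) and
# mu_probs[i] of mu_hat(a_i | x_i); both are assumed to come from fitted models.
def dr_value(rewards, actions, pi_actions, mu_probs, q_all):
    """rewards[i]: observed r_i; actions[i]: logged action a_i;
    pi_actions[i]: pi(x_i); mu_probs[i]: behavior propensity of a_i;
    q_all[i]: dict mapping action -> q_hat(x_i, action)."""
    n = len(rewards)
    total = 0.0
    for i in range(n):
        direct = q_all[i][pi_actions[i]]
        # pi(a_i|x_i) is 1{pi(x_i) = a_i} for a deterministic policy
        weight = (1.0 if pi_actions[i] == actions[i] else 0.0) / mu_probs[i]
        total += direct + weight * (rewards[i] - q_all[i][actions[i]])
    return total / n
```

When the candidate disagrees with the logged action, the correction term vanishes and the estimate falls back on the reward model, which is where the variance reduction comes from.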
Promotion rule example: require positive DR lift with one-sided confidence and no regime-level tail regressions.
Online controller architecture
Step loop
- Build context (x_t).
- Score candidate actions with uncertainty-aware reward model.
- Apply conservative gate (vs baseline action value).
- Sample/choose action under safety constraint.
- Execute child order + log full decision tuple.
- Update model incrementally (or mini-batch).
Decision tuple to log
- context snapshot hash + full feature vector,
- chosen action and full action probability vector,
- baseline recommended action,
- pre-trade predicted reward/uncertainty,
- realized reward decomposition (IS / underfill / rejects),
- venue and latency diagnostics.
Without propensity + baseline logs, OPE and incident forensics are crippled.
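A decision record along the lines of the list above can be sketched as follows. Field names are assumptions; the non-negotiable parts are the full propensity vector and the baseline action, logged at decision time:

```python
import hashlib
import json
import time

# Illustrative decision record following the logging list above.
# Field names are assumptions; action_probs (full propensity vector) and
# baseline_action are the fields OPE and forensics cannot live without.
def make_decision_record(features: dict, action: str,
                         action_probs: dict, baseline_action: str,
                         pred_reward: float, pred_std: float) -> dict:
    payload = json.dumps(features, sort_keys=True).encode()
    return {
        "ts": time.time(),
        "context_hash": hashlib.sha256(payload).hexdigest()[:16],
        "features": features,
        "action": action,
        "action_probs": action_probs,
        "baseline_action": baseline_action,
        "pred_reward": pred_reward,
        "pred_std": pred_std,
        # realized reward decomposition (IS / underfill / rejects) is
        # joined in later, once the horizon Delta has elapsed
        "realized": None,
    }
```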
Guardrails and kill switches
Hard guardrails (non-negotiable):
- max aggression by regime,
- max participation and clip size,
- minimum completion trajectory,
- reject-rate and latency breaker,
- venue quarantine on persistent failures.
Auto-rollback triggers (example):
- p95 IS degradation > threshold for N consecutive windows,
- underfill rate breach at deadline buckets,
- confidence collapse (uncertainty spike / support cliff),
- policy drift too far from baseline action distribution.
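The four triggers above reduce to a single predicate the canary controller can poll each window. All thresholds and window counts here are illustrative placeholders:

```python
# Sketch of the auto-rollback predicate over the triggers listed above.
# Every threshold and the window count are illustrative placeholders.
def should_rollback(p95_is_deltas: list[float],
                    underfill_rate: float,
                    max_pred_std: float,
                    baseline_tv_distance: float,
                    is_threshold: float = 2.0,    # bps of p95 IS degradation
                    n_consecutive: int = 3,
                    underfill_limit: float = 0.05,
                    std_limit: float = 1.5,
                    drift_limit: float = 0.3) -> bool:
    recent = p95_is_deltas[-n_consecutive:]
    sustained_is = (len(recent) == n_consecutive
                    and all(d > is_threshold for d in recent))
    return (sustained_is                              # p95 IS degradation, N windows
            or underfill_rate > underfill_limit       # deadline-bucket underfill breach
            or max_pred_std > std_limit               # confidence collapse
            or baseline_tv_distance > drift_limit)    # drift from baseline action dist.
```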
Validation scorecard
Run daily and weekly.
- Cost outcomes
- p50/p95/p99 IS vs baseline.
- Completion quality
- underfill %, catch-up aggression incidents.
- Safety metrics
- cumulative reward gap vs baseline budget,
- rollback count and trigger type.
- Learning quality
- calibration of predicted reward intervals,
- propensity overlap diagnostics,
- per-regime lift stability.
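For the propensity-overlap diagnostic, one cheap summary is the effective sample size (ESS) of the importance weights \(w_i = \pi(a_i|x_i)/\hat{\mu}(a_i|x_i)\); a low ESS relative to n means the candidate policy is effectively evaluated on a sliver of the logged data:

```python
# Standard effective-sample-size formula applied to importance weights.
# ESS = (sum w)^2 / sum w^2; equals n for uniform weights, collapses
# toward 1 as a few weights dominate.
def effective_sample_size(weights: list[float]) -> float:
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0
```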
Common failure modes
- Reward misspecification: optimizing short-horizon markout while silently increasing deadline underfills.
- Propensity bias: missing/incorrect logging policy probabilities ruins OPE.
- Unsafe cold start: exploration too broad before support is accumulated.
- Regime mixing: one model overfits calm states and explodes during stress windows.
- Metric gaming: better average bps but worse tails; always evaluate p95/p99 and completion jointly.
Rollout plan
Phase 1 – Shadow replay
- Train bandit offline from logged baseline data.
- Publish daily OPE lift + uncertainty by regime.
Phase 2 – Assisted mode
- Bandit proposes; baseline executes.
- Compare counterfactual recommendations, verify stability.
Phase 3 – Conservative canary
- Live at 1โ5% traffic with strict rollback.
- Baseline fallback on low support contexts.
Phase 4 – Scale with governance
- Increase traffic gradually once p95 + completion pass.
- Weekly champion/challenger review and monthly stress replay.
Practical takeaway
For live execution, contextual bandits are useful only when wrapped in conservative policy governance.
Treat baseline as a safety anchor, not as legacy baggage. Done right, you get adaptive tactic selection with bounded downside instead of "smart" exploration that pays tail-risk tuition in production.
Pointers for deeper reading
- Li et al. – A Contextual-Bandit Approach to Personalized News Article Recommendation (WWW 2010).
- Dudík, Langford, Li – Doubly Robust Policy Evaluation and Optimization (Statistical Science 2014).
- Laroche, Trichelair, Tachet des Combes – Safe Policy Improvement with Baseline Bootstrapping (SPIBB) (ICML 2019 / PMLR 97).
- Kazerouni et al. – Conservative Contextual Linear Bandits (NeurIPS 2017).
- Kiyohara et al. – Conservative Contextual Bandits: Beyond Linear Representations (arXiv 2024).
- Auer, Cesa-Bianchi, Fischer – Finite-time Analysis of the Multiarmed Bandit Problem (Machine Learning 2002).