Conservative Contextual Bandit Slippage Model for Execution Tactic Selection

2026-02-27 · finance

Category: research (quant execution / slippage modeling)

Why this playbook

Most production execution stacks still choose tactics via static rules (e.g., fixed spread or participation thresholds keyed to time of day).

Those rules are interpretable, but they leave money on the table because microstructure context changes faster than hand-tuned thresholds.

A contextual bandit can adapt tactic choice online, but naive exploration is dangerous in live trading. This playbook focuses on safe learning: improve slippage while remaining close to a trusted baseline policy.


Problem framing

At each decision step (e.g., every 250 ms–1 s), select one action from a discrete tactic set (the concrete menu, e.g. passive post, cross spread, pause, is desk-specific).

Given context vector (x_t), choose (a_t) and observe the delayed execution outcome over horizon (\Delta).

Define cost-aware reward (higher is better):

[ r_t = -\text{IS}_{t,\Delta} - \lambda_{u}\,\text{UnderfillPenalty}_{t,\Delta} - \lambda_{r}\,\text{RejectPenalty}_{t,\Delta} ]

Where:

    • (\text{IS}_{t,\Delta}) is the implementation shortfall realized over the outcome horizon (\Delta),
    • (\lambda_{u}, \lambda_{r}) weight the underfill and reject penalties.

Bandit objective: maximize cumulative reward (minimize slippage + completion damage).
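
A minimal sketch of this reward in code; the penalty weights `lam_u`, `lam_r` and the `ExecOutcome` fields are illustrative, not a production schema:

```python
from dataclasses import dataclass

@dataclass
class ExecOutcome:
    is_bps: float          # implementation shortfall over horizon Delta, in bps (signed cost)
    underfill_frac: float  # unexecuted fraction of the child-order target
    rejected: bool         # venue reject on this decision's order

def reward(o: ExecOutcome, lam_u: float = 5.0, lam_r: float = 2.0) -> float:
    """Cost-aware reward, higher is better: -IS - lam_u*underfill - lam_r*reject."""
    return -o.is_bps - lam_u * o.underfill_frac - lam_r * float(o.rejected)
```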


Context design (production features)

Minimum context block per decision:

  1. Microstructure
    • spread (ticks/bps), top-of-book depth, queue imbalance, microprice drift.
  2. Toxicity / flow
    • short-horizon OFI, cancel burst, short-term markout proxy.
  3. Execution state
    • residual qty, residual time, realized participation vs target POV.
  4. Regime flags
    • open/close/auction-adjacent, news window, volatility state.
  5. Infra/venue
    • recent reject rate, ack latency bucket, throttling pressure.

Keep features strictly available at decision time (no look-ahead leakage).
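
The five feature groups above can be assembled into one decision-time vector; the dictionary keys here are hypothetical field names, not a fixed schema:

```python
import numpy as np

def build_context(md: dict, exec_state: dict, flags: dict) -> np.ndarray:
    """Assemble the decision-time feature vector. Every input must be
    timestamped at-or-before the decision to avoid look-ahead leakage."""
    feats = [
        # 1. microstructure
        md["spread_bps"], md["depth_top"], md["queue_imbalance"], md["micro_drift"],
        # 2. toxicity / flow
        md["ofi_short"], md["cancel_burst"],
        # 3. execution state
        exec_state["resid_qty_frac"], exec_state["resid_time_frac"], exec_state["pov_gap"],
        # 4. regime flags
        float(flags["auction_adjacent"]), float(flags["news_window"]),
        # 5. infra / venue
        md["reject_rate"], md["ack_latency_bucket"],
    ]
    return np.asarray(feats, dtype=float)
```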


Safety-first policy design

Let (\pi_b) be the current baseline (your existing deterministic/probabilistic tactic policy). Deploy (\pi) only if it respects conservative constraints relative to (\pi_b).

Practical safety contract:

[ \sum_{s=1}^{t} \mathbb{E}[r_s(\pi)] \ge (1-\alpha)\sum_{s=1}^{t} \mathbb{E}[r_s(\pi_b)] - B_t ]

where (\alpha) is the tolerated performance give-up relative to baseline and (B_t) a small, slowly growing exploration budget.

Implementation patterns that work

  1. Baseline fallback on low support (SPIBB-style idea)
    • If (context, action) support is weak, copy baseline action probability.
  2. Conservative exploration gating
    • Explore only when confidence interval of candidate action clears baseline by margin.
  3. Traffic ladder rollout
    • 1% → 5% → 10% → 25% with rollback triggers on p95/p99 cost and underfill.
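
Patterns 1 and 2 can be combined in one selection rule, sketched here under assumed per-action mean/CI estimates and support counts (`min_support` and `margin` are illustrative knobs):

```python
import numpy as np

def choose_action(means, widths, support, baseline_a,
                  min_support=200, margin=0.1):
    """Conservative gate: an action may replace the baseline only if its
    lower confidence bound beats the baseline's upper bound by `margin`
    AND its (context, action) support is adequate; otherwise fall back."""
    means, widths, support = map(np.asarray, (means, widths, support))
    lcb = means - widths
    ucb_base = means[baseline_a] + widths[baseline_a]
    ok = (support >= min_support) & (lcb > ucb_base + margin)
    if not ok.any():
        return baseline_a          # SPIBB-style fallback on weak evidence
    cand = np.flatnonzero(ok)
    return int(cand[np.argmax(lcb[cand])])
```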

Offline gate before online learning

Never start from pure online exploration. First run offline policy evaluation (OPE) on logged bandit data.

Use a robust estimator stack: inverse propensity scoring (IPS) and self-normalized IPS as cross-checks, with doubly robust (DR) as the headline estimate.

A minimal DR form:

[ \hat{V}_{DR} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{q}(x_i,\pi(x_i)) + \frac{\pi(a_i|x_i)}{\hat{\mu}(a_i|x_i)}\left(r_i-\hat{q}(x_i,a_i)\right)\right] ]

Where:

    • (\hat{q}) is the fitted reward (direct) model,
    • (\hat{\mu}(a_i|x_i)) is the logged propensity of the behavior policy,
    • (\pi) is the candidate policy being evaluated.

Promotion rule example: require positive DR lift with one-sided confidence and no regime-level tail regressions.
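
A minimal sketch of the DR estimate above for a deterministic candidate policy, so (\pi(a_i|x_i)) reduces to an indicator; function names are illustrative:

```python
import numpy as np

def dr_value(q_hat, pi, x, a, r, mu):
    """Doubly robust off-policy value estimate from logged
    (context, action, reward, propensity) tuples."""
    x, a, r, mu = map(np.asarray, (x, a, r, mu))
    pi_a = np.array([pi(xi) for xi in x])               # deterministic pi(x_i)
    direct = np.array([q_hat(xi, ai) for xi, ai in zip(x, pi_a)])
    match = (pi_a == a).astype(float)                   # pi(a_i|x_i) in {0, 1}
    resid = r - np.array([q_hat(xi, ai) for xi, ai in zip(x, a)])
    return float(np.mean(direct + match / mu * resid))  # DM term + IPS correction
```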


Online controller architecture

Step loop

  1. Build context (x_t).
  2. Score candidate actions with uncertainty-aware reward model.
  3. Apply conservative gate (vs baseline action value).
  4. Sample/choose action under safety constraint.
  5. Execute child order + log full decision tuple.
  6. Update model incrementally (or mini-batch).
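
The step loop above as a hedged sketch; the `model`/`baseline` interfaces (`score_with_uncertainty`, `gated_choice`, `action`, `version`) are hypothetical names, not an existing API:

```python
import time

def decision_step(model, baseline, build_context_fn, execute_fn, log_fn,
                  md, exec_state, flags):
    """One pass of the online controller step loop."""
    x = build_context_fn(md, exec_state, flags)         # 1. build context
    scores = model.score_with_uncertainty(x)            # 2. {action: (mean, ci_width)}
    a_base = baseline.action(x)                         # 3. baseline anchor for the gate
    a, propensity = model.gated_choice(scores, a_base)  # 4. safety-constrained choice
    handle = execute_fn(a)                              # 5a. child order out
    log_fn({"ts": time.time(), "x": x, "a": a,          # 5b. full decision tuple
            "propensity": propensity, "baseline_a": a_base,
            "model_version": model.version})
    return handle                                       # 6. update when the reward lands
```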

Decision tuple to log

    • context snapshot (x_t) plus feature and model versions,
    • chosen action (a_t) and its propensity (\pi(a_t|x_t)),
    • baseline action and propensity under (\pi_b),
    • realized reward components (IS, underfill, reject) once horizon (\Delta) closes.

Without propensity + baseline logs, OPE and incident forensics are crippled.


Guardrails and kill switches

Hard guardrails (non-negotiable):

    • participation and aggression caps enforced outside the learner,
    • fallback to (\pi_b) on model error, stale features, or missing propensities,
    • a manual kill switch that routes 100% of flow back to baseline.

Auto-rollback triggers (example):

    • p95/p99 IS regression vs baseline beyond tolerance,
    • underfill % breach on the canary slice,
    • reject-rate or throttling spike on a venue.

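An example rollback check consistent with the p95/p99 and underfill triggers; all thresholds are illustrative:

```python
import numpy as np

def should_rollback(cand_costs, base_costs, underfill_frac,
                    p95_tol=1.10, p99_tol=1.25, underfill_max=0.02):
    """Trip auto-rollback when candidate tail costs regress past tolerance
    multipliers of baseline, or underfill breaches its cap."""
    c95, c99 = np.percentile(cand_costs, [95, 99])
    b95, b99 = np.percentile(base_costs, [95, 99])
    return bool(c95 > p95_tol * b95 or c99 > p99_tol * b99
                or underfill_frac > underfill_max)
```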

Validation scorecard

Run daily and weekly.

  1. Cost outcomes
    • p50/p95/p99 IS vs baseline.
  2. Completion quality
    • underfill %, catch-up aggression incidents.
  3. Safety metrics
    • cumulative reward gap vs baseline budget,
    • rollback count and trigger type.
  4. Learning quality
    • calibration of predicted reward intervals,
    • propensity overlap diagnostics,
    • per-regime lift stability.
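
The propensity-overlap diagnostic can be summarized with importance-weight statistics such as Kish's effective sample size; a small sketch:

```python
import numpy as np

def overlap_diagnostics(pi_probs, mu_probs):
    """Importance weights w = pi/mu, their max, and Kish effective sample
    size; a low ESS fraction means OPE estimates are unreliable."""
    w = np.asarray(pi_probs) / np.asarray(mu_probs)
    ess = w.sum() ** 2 / (w ** 2).sum()
    return {"max_weight": float(w.max()), "ess": float(ess),
            "ess_frac": float(ess / w.size)}
```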

Common failure modes

    • look-ahead leakage in context features inflating offline lift,
    • weak propensity overlap making OPE estimates look better than they are,
    • regime shift (open/close, news) invalidating learned action values,
    • reward mis-attribution when fills land after the outcome horizon,
    • catch-up aggression late in the order masking earlier passive losses.

Rollout plan

Phase 1 – Shadow replay: score the learned policy on logged decisions only; no live flow is touched.

Phase 2 – Assisted mode: surface the model's recommendation next to the baseline choice; baseline still executes.

Phase 3 – Conservative canary: a small traffic slice (the 1% rung) with full guardrails and auto-rollback armed.

Phase 4 – Scale with governance: climb the traffic ladder only on sustained scorecard wins, with explicit sign-off per rung.


Practical takeaway

For live execution, contextual bandits are useful only when wrapped in conservative policy governance.

Treat baseline as a safety anchor, not as legacy baggage. Done right, you get adaptive tactic selection with bounded downside instead of "smart" exploration that pays tail-risk tuition in production.


Pointers for deeper reading