Distilled Deep-LOB → Low-Latency Slippage Model Playbook
Date: 2026-03-15
Category: research
Audience: small quant teams that want richer slippage forecasts without blowing latency budgets
Why this research matters
Many teams hit the same wall:
- Simple models (spread + volatility + participation) are fast but miss microstructure state.
- Deep LOB models capture structure better, but inference is often too slow/fragile for live routing decisions.
A practical compromise is a two-speed model stack:
- a high-capacity teacher model (offline / nearline),
- a compact student model (online, strict latency SLO),
- a continuous calibration loop linking both.
This gives better slippage surfaces than naive heuristics while keeping online serving operationally safe.
Core architecture (operator view)
1) Teacher (offline, high capacity)
Use richer microstructure context:
- LOB depth tensors (e.g., top 10 levels bid/ask)
- order-flow imbalance (OFI)
- queue depletion/refill rates
- latency path features (decision→send→ack)
- regime tags (auction proximity, news windows, spread regime)
Typical outputs (multi-task):
- E[IS_bps | state, action]
- Q90/Q95(IS_bps | state, action)
- fill probability and expected completion time
- short-horizon markout
Teacher can be a CNN/LSTM/Transformer family as long as it is causal and evaluated on strict point-in-time data.
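One of the teacher features above, order-flow imbalance, can be computed directly from consecutive best-quote snapshots. A minimal level-1 OFI sketch in the spirit of Cont–Kukanov–Stoikov (function and tuple layout are hypothetical):

```python
def ofi_event(prev, curr):
    """Level-1 order-flow imbalance contribution between two quote snapshots.
    prev/curr: (bid_px, bid_qty, ask_px, ask_qty)."""
    pb0, qb0, pa0, qa0 = prev
    pb1, qb1, pa1, qa1 = curr
    e = 0.0
    # bid side: price improvement adds new depth; price drop removes old depth
    e += qb1 if pb1 >= pb0 else 0.0
    e -= qb0 if pb1 <= pb0 else 0.0
    # ask side mirrors: ask improvement counts as selling pressure
    e -= qa1 if pa1 <= pa0 else 0.0
    e += qa0 if pa1 >= pa0 else 0.0
    return e

def ofi_window(snapshots):
    """Net OFI over a window of successive snapshots."""
    return sum(ofi_event(a, b) for a, b in zip(snapshots, snapshots[1:]))
```

The same event-level decomposition extends naturally to the top-10-level depth tensors the teacher consumes.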
2) Student (online, low latency)
Train a lightweight model to mimic teacher outputs + real realized outcomes:
- small GBDT / monotonic GAM / compact MLP
- feature count capped (e.g., 30–80 engineered features)
- inference target: typically sub-millisecond end-to-end, often <100–300µs model time
Student should optimize for stable serving, not leaderboard accuracy.
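To make the latency claim concrete, a GAM-style student reduces inference to a handful of binary searches and additions, which is comfortably sub-millisecond in pure Python and far cheaper when vectorized. A minimal sketch (names and shape-function encoding are hypothetical):

```python
from bisect import bisect_right

def interp1(x, knots, values):
    """Piecewise-linear shape function defined by parallel (knots, values) lists."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    i = bisect_right(knots, x)
    t = (x - knots[i - 1]) / (knots[i] - knots[i - 1])
    return values[i - 1] + t * (values[i] - values[i - 1])

def student_predict(x, shapes, bias=0.0):
    """Compact GAM-style student: bias plus a sum of per-feature shape
    functions, one (knots, values) pair per engineered feature."""
    return bias + sum(interp1(xi, k, v) for xi, (k, v) in zip(x, shapes))
```

Monotonicity constraints (e.g. slippage non-decreasing in participation rate) become simple checks that each `values` list is sorted, which is easy to enforce at export time.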
3) Distillation loop (bridge)
- Replay historical sessions and generate teacher labels on action/state grid.
- Train student on:
- teacher targets (smooth structure)
- realized execution labels (ground truth discipline)
- Calibrate tails (isotonic/quantile recalibration) per liquidity regime.
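The per-regime tail recalibration step can be done with isotonic regression on (predicted quantile, realized outcome) pairs sorted by prediction. A minimal pool-adjacent-violators sketch:

```python
def pava(y, w=None):
    """Pool-adjacent-violators: best non-decreasing fit to y (already
    ordered by the predictor, e.g. the student's raw q95 output)."""
    w = w or [1.0] * len(y)
    blocks = []  # each block: [mean, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    out = []
    for m, _, c in blocks:
        out.extend([m] * c)
    return out
```

Fitting one such map per liquidity regime keeps the student's raw scores untouched while correcting systematic tail bias.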
Label design that actually survives production
Use labels aligned to real decisions:
- Child-order IS vs arrival and decision benchmarks
- Parent-order cumulative IS
- Opportunity cost for unfilled slices
- Censoring flags (no-fill, timeout, cancel-replace)
Do not drop censored paths: deleting no-fill branches creates optimistic bias and pushes live behavior toward panic crossing.
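One way to keep censored paths in the label set is to charge unfilled slices their opportunity cost against the closing mid rather than deleting them. A minimal label sketch (field names and the opportunity-cost convention are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChildOrderLabel:
    side: int                 # +1 buy, -1 sell
    arrival_px: float         # decision-time reference price
    fill_px: Optional[float]  # None when the slice never filled
    outcome: str              # "fill" | "no-fill" | "timeout" | "cancel-replace"
    end_mid: float            # mid price when the slice was closed out

    def is_bps(self) -> float:
        """Implementation shortfall in bps vs arrival. Censored paths are
        charged opportunity cost against the closing mid, not dropped."""
        px = self.fill_px if self.outcome == "fill" else self.end_mid
        return self.side * (px - self.arrival_px) / self.arrival_px * 1e4
```

Keeping `outcome` alongside the numeric label also lets the completion-probability head train on the same records.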
Objective stack (recommended)
Instead of a single MSE objective, use a portfolio:
- mean loss for central tendency
- pinball loss for q90/q95 tails
- binary/log-loss for completion within deadline
- optional ranking loss for tactic selection (maker-first vs taker-first)
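The tail component of the portfolio is the standard pinball (quantile) loss, which is minimized in expectation by the target quantile:

```python
def pinball(y_true, y_pred, q):
    """Mean pinball loss at quantile q: under-predictions of the q-quantile
    cost q per unit, over-predictions cost (1 - q) per unit."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        diff = yt - yp
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)
```

At q = 0.95 an under-prediction is penalized 19x more than an equal over-prediction, which is exactly the asymmetry wanted for tail slippage.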
Then derive a control metric for routing:
Score(action) = E[IS] + λ_tail * Q95(IS) + λ_deadline * P(miss_deadline)
Choose action with lowest score under hard risk constraints.
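The routing rule above can be sketched directly; the λ values and the hard-constraint form here are illustrative assumptions, not recommended settings:

```python
def route_score(e_is, q95_is, p_miss, lam_tail=0.25, lam_deadline=50.0):
    """Score(action) = E[IS] + lam_tail * Q95(IS) + lam_deadline * P(miss_deadline)."""
    return e_is + lam_tail * q95_is + lam_deadline * p_miss

def pick_action(candidates, max_q95=None):
    """candidates: {action: (e_is_bps, q95_is_bps, p_miss_deadline)}.
    Apply the hard tail constraint first, then minimize the score."""
    feasible = {a: v for a, v in candidates.items()
                if max_q95 is None or v[1] <= max_q95}
    return min(feasible, key=lambda a: route_score(*feasible[a]))
```

Note the two-stage shape: hard risk constraints prune the candidate set before the soft score ranks what remains, so a low mean never buys back an unacceptable tail.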
Data contract checklist (non-negotiable)
- event-time sequencing preserved (decision, send, ack, fill, cancel)
- venue/tactic IDs immutable through pipeline
- point-in-time fees/rebates and lot rules
- explicit handling of corrections/busts and late prints
- deterministic train/eval manifests (data hash + feature hash + code hash)
If these are weak, better architecture will not save the model.
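The deterministic-manifest item reduces to hashing the sorted data digests, a canonical feature spec, and the code version together. A minimal sketch (function name and argument layout are hypothetical):

```python
import hashlib
import json

def manifest_hash(data_files, feature_spec, code_version):
    """Deterministic run manifest: digest over sorted per-file hashes, a
    canonically serialized feature spec, and the code version, so two runs
    with identical inputs produce the identical manifest."""
    h = hashlib.sha256()
    for path, digest in sorted(data_files.items()):
        h.update(f"{path}:{digest}\n".encode())
    h.update(json.dumps(feature_spec, sort_keys=True).encode())
    h.update(code_version.encode())
    return h.hexdigest()
```

Sorting the file map and serializing the spec with `sort_keys=True` is what makes the hash insensitive to dict ordering between runs.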
Validation ladder for rollout
Stage A — Offline replay
Gate on:
- q50/q90/q95 error for IS
- calibration error of tail quantiles
- completion and deadline prediction quality
- regime robustness (open, lunch lull, close, event windows)
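The tail-calibration gate in Stage A is simplest as an empirical coverage check: the fraction of realized outcomes at or below the predicted quantile should sit close to the nominal level.

```python
def quantile_coverage(y_true, y_q, q):
    """Empirical coverage of a predicted q-quantile: the fraction of
    realized values at or below the prediction. Well calibrated -> ~= q."""
    hits = sum(1 for yt, yp in zip(y_true, y_q) if yt <= yp)
    return hits / len(y_true)
```

A gate like `abs(quantile_coverage(y, q95_pred, 0.95) - 0.95) <= tol`, evaluated per regime bucket, catches global tail bias before shadow mode does.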
Stage B — Shadow mode (paper routing)
- run student live in parallel with current policy
- compare predicted vs realized slippage by tactic/venue
- monitor drift in feature population and residuals
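One common drift monitor for the feature-population check is the population stability index over fixed bin edges; a minimal sketch:

```python
import math

def psi(expected, actual, cuts):
    """Population stability index between a reference feature sample and
    the live population, over shared bin edges `cuts`."""
    def frac(xs, lo, hi):
        n = sum(1 for x in xs if lo <= x < hi)
        return max(n / len(xs), 1e-6)  # floor avoids log(0) on empty bins
    edges = [-math.inf] + list(cuts) + [math.inf]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        e, a = frac(expected, lo, hi), frac(actual, lo, hi)
        total += (a - e) * math.log(a / e)
    return total
```

Rule-of-thumb alert thresholds (e.g. PSI above ~0.2 meaning material shift) are conventions, not guarantees; the same statistic applied to residuals covers the second half of the monitoring bullet.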
Stage C — Canary deployment
- allocate small notional slice
- enforce hard kill-switch thresholds:
- tail slippage breach
- reject/retry storms
- completion floor breach
Promote only after stable behavior across multiple market regimes.
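The canary kill-switch reduces to a per-window check against hard thresholds; the numbers below are placeholders, not recommendations:

```python
def kill_switch(window):
    """Hard canary guards: trip on a tail slippage breach, a reject/retry
    storm, or a completion-rate floor breach. Thresholds are illustrative
    and should come from the frozen runbook."""
    trips = []
    if window["q95_is_bps"] > 25.0:
        trips.append("tail_slippage")
    if window["reject_rate"] > 0.05:
        trips.append("reject_storm")
    if window["completion_rate"] < 0.90:
        trips.append("completion_floor")
    return trips  # non-empty -> halt the canary and roll back
```

Returning the full list of tripped guards, rather than a boolean, makes the rollback alert self-explaining.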
Known failure modes
Teacher overfits a historical microstructure regime
Fix: rolling retrain windows + regime-balanced sampling.
Student too compressed, loses action ranking
Fix: distill pairwise action preferences in addition to scalar targets.
Tail underestimation after fee/latency drift
Fix: online residual monitors + periodic tail recalibration.
Feature leakage from non-causal joins
Fix: strict point-in-time feature store and time-travel tests.
Ops complexity explosion
Fix: start with one symbol cluster + one venue class, then expand.
2-week practical build plan
Days 1-3
- finalize data contract and label schema
- baseline student model from existing features
Days 4-7
- train teacher on depth+flow tensors
- produce teacher action-surface labels on replay set
Days 8-10
- distill into student
- add quantile calibration by regime
Days 11-12
- shadow deployment + drift dashboard
Days 13-14
- small canary with hard rollback rules
- freeze v1 runbook and retraining cadence
Bottom line
You do not need to serve a giant deep model directly to benefit from deep microstructure learning.
A teacher-student slippage stack gives a pragmatic path:
- rich structure discovery offline,
- robust low-latency decisions online,
- explicit tail-risk control for live capital.
For small teams, this is often the highest signal-per-operational-risk route.
References (starting points)
Cont, Kukanov, Stoikov — The Price Impact of Order Book Events (J. Financial Econometrics, 2014)
https://arxiv.org/abs/1011.6402
Taranto et al. — Linear models for the impact of order flow on prices I (2016)
https://arxiv.org/abs/1602.02735
Zhang, Zohren, Roberts — DeepLOB (IEEE TSP 2019)
https://arxiv.org/abs/1808.03668
Bodor et al. — Deep Learning Meets Queue-Reactive (2025)
https://arxiv.org/abs/2501.08822
Benzaquen, Eisler, Bouchaud — Trading Lightly: Cross-Impact and Optimal Portfolio Execution
https://ar5iv.labs.arxiv.org/html/1702.03838
Donier, Bonart et al. — A million metaorder analysis of market impact on Bitcoin
https://ar5iv.labs.arxiv.org/html/1412.4503