Regime-Switching Slippage Surface (Participation × Duration × Queue-Risk) Playbook
Date: 2026-02-26
Category: Research (Execution / Slippage Modeling)
Scope: Intraday execution for single-name + basket routing (KR-first, global portable)
Why this model exists
Most slippage models fail for practical reasons:
- Static coefficients break when market regime flips (open/close, news, VI-adjacent stress).
- One-factor assumptions (only participation or only duration) miss nonlinear interaction.
- Queue-risk blindness underestimates cost when cancel pressure spikes and passive fills vanish.
A practical execution stack needs a model that answers one question continuously:
"Given current participation, remaining duration, and queue toxicity, what is expected slippage in this regime, and what action keeps tail risk bounded?"
This playbook builds exactly that using a regime-switching slippage surface.
Core concept
Model expected slippage in basis points as:
[ \mathbb{E}[IS_t \mid x_t] = \sum_{k=1}^{K} p_t^{(k)} \cdot g_k(x_t) ]
- (x_t): feature vector (participation, duration, spread, volatility, queue signals, venue state)
- (g_k): regime-specific slippage surface for regime (k)
- (p_t^{(k)}): online probability that regime (k) is active
Recommended regimes ((K=3)):
- Normal liquidity
- Fragile liquidity (book looks thick, but replenishment weak)
- Stress / dislocation (high cancel bursts, spread jumps, queue churn)
This avoids the "single average model" trap and gives direct control levers.
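The mixture prediction can be sketched in a few lines. The regime surfaces here are toy constants standing in for fitted models, and `expected_slippage_bps` is a name of my choosing, not an established API:

```python
import numpy as np

def expected_slippage_bps(p, surfaces, x):
    """Mixture prediction: E[IS | x] = sum_k p_k * g_k(x).

    p        : regime posterior, shape (K,), sums to 1
    surfaces : list of K callables g_k(x) -> slippage in bps
    x        : feature vector, passed through to each g_k
    """
    g = np.array([g_k(x) for g_k in surfaces])
    return float(np.dot(p, g))

# Toy regimes: stress predicts much higher cost than normal.
surfaces = [
    lambda x: 2.0,    # normal
    lambda x: 6.0,    # fragile
    lambda x: 20.0,   # stress
]
p = np.array([0.7, 0.2, 0.1])
cost = expected_slippage_bps(p, surfaces, x=None)   # close to 4.6 bps
```

Note that a small stress probability already moves the expectation meaningfully, which is what gives the controller early leverage.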
Feature design (minimal but production-strong)
A) Execution geometry
- Participation rate (\pi_t = q_t / V_t)
- Remaining horizon (T_t) (seconds/minutes)
- Urgency ratio (u_t = q_{remain} / T_t)
B) Market state
- Instant spread (ticks/bps)
- Short-horizon realized vol (e.g., 30s, 2m)
- Depth slope around top-of-book
- Auction / session flags (open, lunch, close, auction window)
C) Queue-risk / toxicity
- Cancel-to-trade ratio (rolling)
- Best-level depletion velocity
- Queue age decay (our posted order survival)
- Microprice drift vs mid-price drift divergence
D) Venue/routing context
- Venue id (KRX/NXT/alt)
- Venue-local fill hazard
- Cross-venue top-of-book divergence
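A minimal sketch of flattening a microstructure snapshot into the model's feature vector. The snapshot field names (`child_qty`, `rvol_30s`, etc.) are illustrative placeholders, not a real feed schema:

```python
import numpy as np

def feature_vector(snap):
    """Flatten one synchronized snapshot into a model input row.

    Guards against zero denominators so a thin tape or an expiring
    horizon cannot produce inf/nan features.
    """
    pi = snap["child_qty"] / max(snap["market_vol"], 1)          # participation
    urgency = snap["qty_remaining"] / max(snap["seconds_left"], 1e-9)
    return np.array([
        pi,
        snap["seconds_left"],          # remaining horizon
        urgency,
        snap["spread_bps"],
        snap["rvol_30s"],              # short-horizon realized vol
        snap["cancel_to_trade"],       # queue toxicity proxies
        snap["depth_depletion"],
        snap["micro_minus_mid_drift"],
    ])

snap = {"child_qty": 100, "market_vol": 10_000, "qty_remaining": 5_000,
        "seconds_left": 300.0, "spread_bps": 2.5, "rvol_30s": 1.1,
        "cancel_to_trade": 4.0, "depth_depletion": 0.3,
        "micro_minus_mid_drift": 0.05}
fv = feature_vector(snap)
```

Session flags and venue ids would enter as one-hot or categorical columns alongside this numeric core.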
Slippage surface form
Use a flexible but inspectable form per regime:
[ g_k(x) = \beta_{0k} + \beta_{1k}\,\sigma\sqrt{\pi} + \beta_{2k}\,\sigma\sqrt{T} + \beta_{3k}\,\text{spread} + \beta_{4k}\,\text{queueTox} + \beta_{5k}\,\pi\cdot\text{queueTox} + \beta_{6k}\,\sqrt{\pi}\cdot\sqrt{T} ]
Notes:
- (\sqrt{\pi}) term aligns with concave impact stylized fact (square-root behavior).
- (\sqrt{T}) term captures duration/time-risk exposure.
- Interaction terms capture where real losses happen: "high participation + toxic queue".
If you need extra flexibility, replace the linear form with a monotonic gradient-boosted model (GBM) per regime, with monotonicity constraints on (\pi), spread, and toxicity so the surface stays economically sensible.
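The parametric surface is a dot product once the terms are stacked in the order of the equation above. Coefficient values here are illustrative, not calibrated:

```python
import numpy as np

def surface_bps(x, beta):
    """Regime-specific slippage surface in bps.

    x    : dict with keys pi (participation), T (remaining horizon),
           sigma (short-horizon vol), spread, queue_tox
    beta : array of 7 coefficients [b0 .. b6], ordered as in the equation
    """
    pi, T = x["pi"], x["T"]
    sigma, spread, tox = x["sigma"], x["spread"], x["queue_tox"]
    terms = np.array([
        1.0,                        # intercept
        sigma * np.sqrt(pi),        # concave impact term
        sigma * np.sqrt(T),         # duration / time-risk term
        spread,
        tox,
        pi * tox,                   # "high participation + toxic queue"
        np.sqrt(pi) * np.sqrt(T),
    ])
    return float(np.dot(beta, terms))

# Illustrative check: only intercept and spread loaded.
beta = np.zeros(7)
beta[0], beta[3] = 1.0, 2.0
x = {"pi": 0.05, "T": 120.0, "sigma": 1.0, "spread": 1.5, "queue_tox": 0.2}
cost = surface_bps(x, beta)   # 1.0 + 2.0 * 1.5 = 4.0 bps
```

Keeping the terms explicit like this makes each fitted coefficient directly inspectable per regime.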
Regime inference (online)
Use a lightweight Hidden Markov Model (or switching state-space filter):
- Compute residual: [ r_t = IS_t - g_{z_t}(x_t) ]
- Update regime posterior (p_t^{(k)}) using transition matrix (A) and emission likelihood.
- Apply persistence prior (avoid micro-flipping every second).
Practical transition prior:
- High persistence on diagonal (0.92–0.98)
- Faster transition into stress than out of stress
This lets the controller react early without overtrading on noise.
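One forward-filter step can be sketched as follows. The transition matrix values are illustrative (sticky diagonal, faster entry into stress than exit), and the Gaussian emission on per-regime residuals is a simplification of a full switching filter:

```python
import numpy as np

# Sticky transition prior: high diagonal; entering stress (state 2)
# is easier than leaving it. Numbers are illustrative, tune on data.
A = np.array([
    [0.96, 0.03, 0.01],   # from normal
    [0.04, 0.93, 0.03],   # from fragile
    [0.01, 0.04, 0.95],   # from stress
])

def update_posterior(p_prev, residuals, sigmas):
    """Predict with the transition prior, correct with emission likelihood.

    residuals[k] = IS_t - g_k(x_t), scored under a zero-mean Gaussian
    with regime-specific residual scale sigmas[k].
    """
    pred = A.T @ p_prev
    lik = np.exp(-0.5 * (residuals / sigmas) ** 2) / sigmas
    post = pred * lik
    return post / post.sum()

p = np.array([0.8, 0.15, 0.05])
# A slippage print far above the normal-regime surface shifts mass out
# of "normal" in a single step.
p_new = update_posterior(p, residuals=np.array([9.0, 4.0, 1.0]),
                         sigmas=np.array([2.0, 5.0, 12.0]))
```

The persistence prior in `A` is what prevents micro-flipping: a single noisy residual moves the posterior, but sustained evidence is needed to hold a new regime.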
Training pipeline
1) Label construction
For each child fill or micro-batch:
- benchmark (arrival/decision)
- realized slippage (bps)
- synchronized microstructure snapshot at send time
2) Data split
- Purged time split by day/session
- Symbol-group split to test cross-name generalization
3) Fit baseline surface
- Start with pooled model (all data)
- Initialize regime components via residual clustering (e.g., k-means on residual + toxicity + spread jump)
4) Fit switching model
- EM optimization for (g_k), transition matrix, emission variance
- Regularize to prevent degenerate regimes
5) Calibrate tails
- Add quantile layer (q90/q95) per regime for risk budget control
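The quantile layer can start as simple empirical residual quantiles per regime, added on top of (g_k(x)) to get a tail bound for risk budgeting. The helper name and dict layout are my own sketch:

```python
import numpy as np

def tail_layer(residuals_by_regime, qs=(0.90, 0.95)):
    """Empirical residual quantiles per regime.

    residuals_by_regime : {regime_id: array of IS residuals (bps)}
    Returns {regime_id: {q: quantile_bps}}; add these to g_k(x) for a
    q90/q95 slippage bound under regime k.
    """
    return {k: {q: float(np.quantile(r, q)) for q in qs}
            for k, r in residuals_by_regime.items()}

# Illustrative residual sample for one regime.
layer = tail_layer({0: np.arange(101.0)})
```

A conformal-style variant (quantiles of held-out residuals, refreshed on a rolling window) keeps the tail bound honest as the surface drifts.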
What to optimize (objective)
For production, optimize risk-adjusted execution cost, not mean IS only:
[ J = \mathbb{E}[IS] + \lambda_1\,\text{CVaR}_{95}(IS) + \lambda_2\,\text{UnderfillPenalty} ]
Why:
- Mean improvements can hide rare but expensive blowups.
- Underfill penalty keeps model from becoming too passive during stress.
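A minimal sketch of the objective on a sample of projected slippage outcomes; the λ weights are illustrative and CVaR is computed as the mean of the tail beyond the empirical VaR:

```python
import numpy as np

def exec_objective(is_bps, underfill_bps, lam1=0.5, lam2=1.0, alpha=0.95):
    """Risk-adjusted cost J = E[IS] + lam1 * CVaR_alpha(IS) + lam2 * underfill.

    is_bps        : array of simulated/projected slippage outcomes (bps)
    underfill_bps : expected cost transferred to unexecuted quantity
    """
    var = np.quantile(is_bps, alpha)            # empirical VaR
    cvar = is_bps[is_bps >= var].mean()         # mean of the tail beyond VaR
    return float(is_bps.mean() + lam1 * cvar + lam2 * underfill_bps)

# Illustrative: uniform outcome grid 0..99 bps, 2 bps underfill cost.
J = exec_objective(np.arange(100.0), underfill_bps=2.0)
```

Raising `lam1` makes the controller pay more mean cost to flatten the tail; raising `lam2` stops it from hiding cost in unexecuted quantity.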
Controller integration (how model actually saves money)
At each control step:
- Estimate expected slippage + tail for candidate actions:
- keep pace
- slow participation
- speed up to reduce time-risk
- passive-heavy vs aggressive-heavy mix
- venue rebalance
- Pick action minimizing projected objective under current budget.
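The action-selection step reduces to a minimization over a small candidate set. The candidate names and projected costs below are placeholders; in production `project_cost` would roll the surface and tail layer forward under each action:

```python
def choose_action(candidates, project_cost):
    """Pick the candidate with the lowest projected objective J.

    candidates   : iterable of action dicts (pace, mix, venue weights)
    project_cost : callable action -> projected J under the current
                   regime posterior and remaining budget
    """
    return min(candidates, key=project_cost)

actions = [
    {"name": "keep_pace"},
    {"name": "slow_down"},
    {"name": "speed_up"},
    {"name": "venue_rebalance"},
]
# Placeholder projections, e.g. from the mixture surface + tail layer.
projected = {"keep_pace": 6.1, "slow_down": 5.4,
             "speed_up": 7.9, "venue_rebalance": 6.0}
best = choose_action(actions, lambda a: projected[a["name"]])
```

Because the candidate set is small and discrete, the step is cheap enough to run at every control tick.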
Example policy guardrails
- If (P(\text{stress}) > 0.65) and queue toxicity rising:
- reduce passive dwell time
- tighten stale quote cancel threshold
- cap single child notional
- If (P(\text{normal}) > 0.8) and spread tight:
- restore passive participation
- lengthen quote life to save spread cost
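The guardrails above map cleanly to a small pure function over the regime posterior and queue signals. The config keys and threshold values are illustrative, mirroring the example policy:

```python
def guardrails(p_stress, p_normal, tox_rising, spread_tight, cfg):
    """Map regime posterior + queue signals to discrete lever settings.

    cfg holds baseline lever values; returns an adjusted copy so the
    baseline itself is never mutated.
    """
    out = dict(cfg)
    if p_stress > 0.65 and tox_rising:
        out["passive_dwell_ms"] = min(cfg["passive_dwell_ms"], 200)   # cut dwell
        out["stale_cancel_ticks"] = 1                                 # tighten cancels
        out["max_child_notional"] = cfg["max_child_notional"] * 0.5   # cap children
    elif p_normal > 0.8 and spread_tight:
        out["passive_dwell_ms"] = cfg["passive_dwell_ms"] * 2         # earn spread
    return out

cfg = {"passive_dwell_ms": 500, "stale_cancel_ticks": 3,
       "max_child_notional": 1e8}
stressed = guardrails(0.7, 0.1, tox_rising=True, spread_tight=False, cfg=cfg)
calm = guardrails(0.1, 0.9, tox_rising=False, spread_tight=True, cfg=cfg)
```

Keeping the policy as a pure function of (posterior, signals, config) makes it trivial to replay and unit-test against historical regimes.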
Evaluation scorecard
Modeling quality
- MAE / RMSE of IS (overall + by regime)
- Calibration error of regime probabilities
- Quantile coverage error (q90/q95)
Execution impact
- Delta mean IS vs baseline schedule
- Delta q95/q99 IS (must improve)
- Completion rate and underfill slippage transfer
Stability
- Regime flip rate per hour (too high = unstable)
- Action churn (cancel/replace inflation)
- Venue oscillation frequency
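Quantile coverage error, one of the scorecard items, has a one-line estimator worth pinning down; the helper name is mine:

```python
import numpy as np

def coverage_error(realized_is, predicted_q, q=0.95):
    """Empirical coverage of the predicted q-quantile minus q itself.

    realized_is : array of realized slippage (bps)
    predicted_q : the model's per-fill q-quantile prediction (scalar or array)
    Near zero means the tail layer is well calibrated; positive means
    the bound is too loose, negative means too tight.
    """
    return float(np.mean(realized_is <= predicted_q) - q)

# Illustrative: a constant q95 bound that covers exactly 95% of outcomes.
err = coverage_error(np.arange(100.0), predicted_q=94.5, q=0.95)
```

Reporting this per regime (not just pooled) is what catches a tail layer that is calibrated on average but broken in stress.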
KR market implementation notes
- Separate models for opening call/closing auction windows; don’t force one continuous model.
- Treat VI-adjacent states as a stress-prior bump even before the full trigger fires.
- Preserve conservative fallback schedule if feed latency or queue features go stale.
- Maintain venue-local toxicity estimate; stress can be asymmetric across venues.
Failure modes and protections
Regime overfitting
- Symptom: excellent backtest, unstable live posteriors
- Fix: stronger transition prior + fewer regimes
Hidden latency bias
- Symptom: model underestimates costs during burst periods
- Fix: add feature freshness lag and gateway RTT features
Overreaction controller
- Symptom: action churn increases fees without slippage gain
- Fix: hysteresis + minimum hold time per mode
Data leakage via post-trade features
- Symptom: impossible offline accuracy
- Fix: enforce strict event-time cutoff at order send
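The hysteresis + minimum-hold fix for the overreaction failure mode can be sketched as a small latch; thresholds and hold length here are illustrative:

```python
class ModeLatch:
    """Hysteresis band plus minimum hold time for the controller mode.

    Enters stress only when P(stress) clears the upper threshold, exits
    only below the lower one, and never flips before min_hold steps --
    suppressing action churn from a noisy posterior.
    """

    def __init__(self, enter_th=0.65, exit_th=0.45, min_hold=2):
        self.enter_th, self.exit_th, self.min_hold = enter_th, exit_th, min_hold
        self.mode, self.held = "normal", 0

    def step(self, p_stress):
        self.held += 1
        if self.held < self.min_hold:
            return self.mode                       # respect minimum hold
        if self.mode == "normal" and p_stress > self.enter_th:
            self.mode, self.held = "stress", 0
        elif self.mode == "stress" and p_stress < self.exit_th:
            self.mode, self.held = "normal", 0
        return self.mode
```

The gap between `enter_th` and `exit_th` is the band that prevents flip-flopping when the posterior hovers near a single threshold.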
Practical rollout plan (2 weeks)
Week 1
- Build unified feature table and IS label stream
- Train 1-regime baseline + 3-regime switching prototype
- Backtest with replay and tail metrics
Week 2
- Shadow mode in paper/live-sim routing
- Compare decisions vs current controller
- Enable only low-risk levers first (quote dwell, child cap)
- Promote to active once q95 improvement is stable for 5+ sessions
Reference pointers
- Square-root impact stylized facts and propagator framing (metaorder impact literature; practical summaries include Emergent Mind overview, 2025 update).
- Participation-vs-duration modeling debate and the operational need to separate physical impact vs time risk (institutional execution literature; Talos practitioner write-up, 2025).
- Regime-switching/state-space methods from time-series econometrics adapted to execution control.
TL;DR
Use a regime-switching slippage surface instead of one static model.
Predict slippage as a mixture of regime-specific surfaces over participation, duration, and queue toxicity; then route/control with tail-aware objective.
You’ll usually see a modest mean improvement; the real win is materially better q95/q99 tail control when market microstructure turns hostile.