Regime-Switching Slippage Surface (Participation × Duration × Queue-Risk) Playbook
Date: 2026-02-26
Category: Research (Execution / Slippage Modeling)
Scope: Intraday execution for single-name + basket routing (KR-first, global portable)
Why this model exists
Most slippage models fail for practical reasons:
- Static coefficients break when market regime flips (open/close, news, VI-adjacent stress).
- One-factor assumptions (only participation or only duration) miss nonlinear interaction.
- Queue-risk blindness underestimates cost when cancel pressure spikes and passive fills vanish.
A practical execution stack needs a model that answers one question continuously:
"Given current participation, remaining duration, and queue toxicity, what is expected slippage in this regime, and what action keeps tail risk bounded?"
This playbook builds exactly that using a regime-switching slippage surface.
Core concept
Model expected slippage in basis points as:
[ \mathbb{E}[IS_t \mid x_t] = \sum_{k=1}^{K} p_t^{(k)} \cdot g_k(x_t) ]
- (x_t): feature vector (participation, duration, spread, volatility, queue signals, venue state)
- (g_k): regime-specific slippage surface for regime (k)
- (p_t^{(k)}): online probability that regime (k) is active
Recommended regimes ((K=3)):
- Normal liquidity
- Fragile liquidity (book looks thick, but replenishment weak)
- Stress / dislocation (high cancel bursts, spread jumps, queue churn)
This avoids the "single average model" trap and gives direct control levers.
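The mixture prediction can be sketched in a few lines. The regime surfaces here are toy constants standing in for fitted models, and `expected_slippage_bps` is a name of my choosing, not an established API:

```python
import numpy as np

def expected_slippage_bps(p, surfaces, x):
    """Mixture prediction: E[IS | x] = sum_k p_k * g_k(x).

    p        : regime posterior, shape (K,), sums to 1
    surfaces : list of K callables g_k(x) -> slippage in bps
    x        : feature vector, passed through to each g_k
    """
    g = np.array([g_k(x) for g_k in surfaces])
    return float(np.dot(p, g))

# Toy regimes: stress predicts much higher cost than normal.
surfaces = [
    lambda x: 2.0,    # normal
    lambda x: 6.0,    # fragile
    lambda x: 20.0,   # stress
]
p = np.array([0.7, 0.2, 0.1])
cost = expected_slippage_bps(p, surfaces, x=None)   # close to 4.6 bps
```

Note that a small stress probability already moves the expectation meaningfully, which is what gives the controller early leverage.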
Feature design (minimal but production-strong)
A) Execution geometry
- Participation rate (\pi_t = q_t / V_t)
- Remaining horizon (T_t) (seconds/minutes)
- Urgency ratio (u_t = q_{remain} / T_t)
B) Market state
- Instant spread (ticks/bps)
- Short-horizon realized vol (e.g., 30s, 2m)
- Depth slope around top-of-book
- Auction / session flags (open, lunch, close, auction window)
C) Queue-risk / toxicity
- Cancel-to-trade ratio (rolling)
- Best-level depletion velocity
- Queue age decay (our posted order survival)
- Microprice drift vs mid-price drift divergence
D) Venue/routing context
- Venue id (KRX/NXT/alt)
- Venue-local fill hazard
- Cross-venue top-of-book divergence
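A minimal sketch of flattening a microstructure snapshot into the model's feature vector. The snapshot field names (`child_qty`, `rvol_30s`, etc.) are illustrative placeholders, not a real feed schema:

```python
import numpy as np

def feature_vector(snap):
    """Flatten one synchronized snapshot into a model input row.

    Guards against zero denominators so a thin tape or an expiring
    horizon cannot produce inf/nan features.
    """
    pi = snap["child_qty"] / max(snap["market_vol"], 1)          # participation
    urgency = snap["qty_remaining"] / max(snap["seconds_left"], 1e-9)
    return np.array([
        pi,
        snap["seconds_left"],          # remaining horizon
        urgency,
        snap["spread_bps"],
        snap["rvol_30s"],              # short-horizon realized vol
        snap["cancel_to_trade"],       # queue toxicity proxies
        snap["depth_depletion"],
        snap["micro_minus_mid_drift"],
    ])

snap = {"child_qty": 100, "market_vol": 10_000, "qty_remaining": 5_000,
        "seconds_left": 300.0, "spread_bps": 2.5, "rvol_30s": 1.1,
        "cancel_to_trade": 4.0, "depth_depletion": 0.3,
        "micro_minus_mid_drift": 0.05}
fv = feature_vector(snap)
```

Session flags and venue ids would enter as one-hot or categorical columns alongside this numeric core.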
Slippage surface form
Use a flexible but inspectable form per regime:
[ g_k(x) = \beta_{0k} + \beta_{1k}\,\sigma\sqrt{\pi} + \beta_{2k}\,\sigma\sqrt{T} + \beta_{3k}\,\text{spread} + \beta_{4k}\,\text{queueTox} + \beta_{5k}\,\pi\cdot\text{queueTox} + \beta_{6k}\,\sqrt{\pi}\cdot\sqrt{T} ]
Notes:
- (\sqrt{\pi}) term aligns with concave impact stylized fact (square-root behavior).
- (\sqrt{T}) term captures duration/time-risk exposure.
- Interaction terms capture where real losses happen: "high participation + toxic queue".
If you need extra flexibility, replace the linear form with a monotonic gradient-boosted model (GBM) per regime, with monotonicity constraints on (\pi), spread, and toxicity so the surface stays economically sensible.
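The parametric surface is a dot product once the terms are stacked in the order of the equation above. Coefficient values here are illustrative, not calibrated:

```python
import numpy as np

def surface_bps(x, beta):
    """Regime-specific slippage surface in bps.

    x    : dict with keys pi (participation), T (remaining horizon),
           sigma (short-horizon vol), spread, queue_tox
    beta : array of 7 coefficients [b0 .. b6], ordered as in the equation
    """
    pi, T = x["pi"], x["T"]
    sigma, spread, tox = x["sigma"], x["spread"], x["queue_tox"]
    terms = np.array([
        1.0,                        # intercept
        sigma * np.sqrt(pi),        # concave impact term
        sigma * np.sqrt(T),         # duration / time-risk term
        spread,
        tox,
        pi * tox,                   # "high participation + toxic queue"
        np.sqrt(pi) * np.sqrt(T),
    ])
    return float(np.dot(beta, terms))

# Illustrative check: only intercept and spread loaded.
beta = np.zeros(7)
beta[0], beta[3] = 1.0, 2.0
x = {"pi": 0.05, "T": 120.0, "sigma": 1.0, "spread": 1.5, "queue_tox": 0.2}
cost = surface_bps(x, beta)   # 1.0 + 2.0 * 1.5 = 4.0 bps
```

Keeping the terms explicit like this makes each fitted coefficient directly inspectable per regime.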
Regime inference (online)
Use a lightweight Hidden Markov Model (or switching state-space filter):
- Compute residual: [ r_t = IS_t - g_{z_t}(x_t) ]
- Update regime posterior (p_t^{(k)}) using transition matrix (A) and emission likelihood.
- Apply persistence prior (avoid micro-flipping every second).
Practical transition prior:
- High persistence on diagonal (0.92–0.98)
- Faster transition into stress than out of stress
This lets the controller react early without overtrading on noise.
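One forward-filter step can be sketched as follows. The transition matrix values are illustrative (sticky diagonal, faster entry into stress than exit), and the Gaussian emission on per-regime residuals is a simplification of a full switching filter:

```python
import numpy as np

# Sticky transition prior: high diagonal; entering stress (state 2)
# is easier than leaving it. Numbers are illustrative, tune on data.
A = np.array([
    [0.96, 0.03, 0.01],   # from normal
    [0.04, 0.93, 0.03],   # from fragile
    [0.01, 0.04, 0.95],   # from stress
])

def update_posterior(p_prev, residuals, sigmas):
    """Predict with the transition prior, correct with emission likelihood.

    residuals[k] = IS_t - g_k(x_t), scored under a zero-mean Gaussian
    with regime-specific residual scale sigmas[k].
    """
    pred = A.T @ p_prev
    lik = np.exp(-0.5 * (residuals / sigmas) ** 2) / sigmas
    post = pred * lik
    return post / post.sum()

p = np.array([0.8, 0.15, 0.05])
# A slippage print far above the normal-regime surface shifts mass out
# of "normal" in a single step.
p_new = update_posterior(p, residuals=np.array([9.0, 4.0, 1.0]),
                         sigmas=np.array([2.0, 5.0, 12.0]))
```

The persistence prior in `A` is what prevents micro-flipping: a single noisy residual moves the posterior, but sustained evidence is needed to hold a new regime.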
Training pipeline
1) Label construction
For each child fill or micro-batch:
- benchmark (arrival/decision)
- realized slippage (bps)
- synchronized microstructure snapshot at send time
2) Data split
- Purged time split by day/session
- Symbol-group split to test cross-name generalization
3) Fit baseline surface
- Start with pooled model (all data)
- Initialize regime components via residual clustering (e.g., k-means on residual + toxicity + spread jump)
4) Fit switching model
- EM optimization for (g_k), transition matrix, emission variance
- Regularize to prevent degenerate regimes
5) Calibrate tails
- Add quantile layer (q90/q95) per regime for risk budget control
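The quantile layer can start as simple empirical residual quantiles per regime, added on top of (g_k(x)) to get a tail bound for risk budgeting. The helper name and dict layout are my own sketch:

```python
import numpy as np

def tail_layer(residuals_by_regime, qs=(0.90, 0.95)):
    """Empirical residual quantiles per regime.

    residuals_by_regime : {regime_id: array of IS residuals (bps)}
    Returns {regime_id: {q: quantile_bps}}; add these to g_k(x) for a
    q90/q95 slippage bound under regime k.
    """
    return {k: {q: float(np.quantile(r, q)) for q in qs}
            for k, r in residuals_by_regime.items()}

# Illustrative residual sample for one regime.
layer = tail_layer({0: np.arange(101.0)})
```

A conformal-style variant (quantiles of held-out residuals, refreshed on a rolling window) keeps the tail bound honest as the surface drifts.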
What to optimize (objective)
For production, optimize risk-adjusted execution cost, not mean IS only:
[ J = \mathbb{E}[IS] + \lambda_1\,\text{CVaR}_{95}(IS) + \lambda_2\,\text{UnderfillPenalty} ]
Why:
- Mean improvements can hide rare but expensive blowups.
- Underfill penalty keeps model from becoming too passive during stress.
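A minimal sketch of the objective on a sample of projected slippage outcomes; the λ weights are illustrative and CVaR is computed as the mean of the tail beyond the empirical VaR:

```python
import numpy as np

def exec_objective(is_bps, underfill_bps, lam1=0.5, lam2=1.0, alpha=0.95):
    """Risk-adjusted cost J = E[IS] + lam1 * CVaR_alpha(IS) + lam2 * underfill.

    is_bps        : array of simulated/projected slippage outcomes (bps)
    underfill_bps : expected cost transferred to unexecuted quantity
    """
    var = np.quantile(is_bps, alpha)            # empirical VaR
    cvar = is_bps[is_bps >= var].mean()         # mean of the tail beyond VaR
    return float(is_bps.mean() + lam1 * cvar + lam2 * underfill_bps)

# Illustrative: uniform outcome grid 0..99 bps, 2 bps underfill cost.
J = exec_objective(np.arange(100.0), underfill_bps=2.0)
```

Raising `lam1` makes the controller pay more mean cost to flatten the tail; raising `lam2` stops it from hiding cost in unexecuted quantity.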
Controller integration (how model actually saves money)
At each control step:
- Estimate expected slippage + tail for candidate actions:
- keep pace
- slow participation
- speed up to reduce time-risk
- passive-heavy vs aggressive-heavy mix
- venue rebalance
- Pick action minimizing projected objective under current budget.
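The action-selection step reduces to a minimization over a small candidate set. The candidate names and projected costs below are placeholders; in production `project_cost` would roll the surface and tail layer forward under each action:

```python
def choose_action(candidates, project_cost):
    """Pick the candidate with the lowest projected objective J.

    candidates   : iterable of action dicts (pace, mix, venue weights)
    project_cost : callable action -> projected J under the current
                   regime posterior and remaining budget
    """
    return min(candidates, key=project_cost)

actions = [
    {"name": "keep_pace"},
    {"name": "slow_down"},
    {"name": "speed_up"},
    {"name": "venue_rebalance"},
]
# Placeholder projections, e.g. from the mixture surface + tail layer.
projected = {"keep_pace": 6.1, "slow_down": 5.4,
             "speed_up": 7.9, "venue_rebalance": 6.0}
best = choose_action(actions, lambda a: projected[a["name"]])
```

Because the candidate set is small and discrete, the step is cheap enough to run at every control tick.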
Example policy guardrails
- If (P(\text{stress}) > 0.65) and queue toxicity rising:
- reduce passive dwell time
- tighten stale quote cancel threshold
- cap single child notional
- If (P(\text{normal}) > 0.8) and spread tight:
- restore passive participation
- lengthen quote life to save spread cost
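The guardrails above map cleanly to a small pure function over the regime posterior and queue signals. The config keys and threshold values are illustrative, mirroring the example policy:

```python
def guardrails(p_stress, p_normal, tox_rising, spread_tight, cfg):
    """Map regime posterior + queue signals to discrete lever settings.

    cfg holds baseline lever values; returns an adjusted copy so the
    baseline itself is never mutated.
    """
    out = dict(cfg)
    if p_stress > 0.65 and tox_rising:
        out["passive_dwell_ms"] = min(cfg["passive_dwell_ms"], 200)   # cut dwell
        out["stale_cancel_ticks"] = 1                                 # tighten cancels
        out["max_child_notional"] = cfg["max_child_notional"] * 0.5   # cap children
    elif p_normal > 0.8 and spread_tight:
        out["passive_dwell_ms"] = cfg["passive_dwell_ms"] * 2         # earn spread
    return out

cfg = {"passive_dwell_ms": 500, "stale_cancel_ticks": 3,
       "max_child_notional": 1e8}
stressed = guardrails(0.7, 0.1, tox_rising=True, spread_tight=False, cfg=cfg)
calm = guardrails(0.1, 0.9, tox_rising=False, spread_tight=True, cfg=cfg)
```

Keeping the policy as a pure function of (posterior, signals, config) makes it trivial to replay and unit-test against historical regimes.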
Evaluation scorecard
Modeling quality
- MAE / RMSE of IS (overall + by regime)
- Calibration error of regime probabilities
- Quantile coverage error (q90/q95)
Execution impact
- Delta mean IS vs baseline schedule
- Delta q95/q99 IS (must improve)
- Completion rate and underfill slippage transfer
Stability
- Regime flip rate per hour (too high = unstable)
- Action churn (cancel/replace inflation)
- Venue oscillation frequency
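Quantile coverage error, one of the scorecard items, has a one-line estimator worth pinning down; the helper name is mine:

```python
import numpy as np

def coverage_error(realized_is, predicted_q, q=0.95):
    """Empirical coverage of the predicted q-quantile minus q itself.

    realized_is : array of realized slippage (bps)
    predicted_q : the model's per-fill q-quantile prediction (scalar or array)
    Near zero means the tail layer is well calibrated; positive means
    the bound is too loose, negative means too tight.
    """
    return float(np.mean(realized_is <= predicted_q) - q)

# Illustrative: a constant q95 bound that covers exactly 95% of outcomes.
err = coverage_error(np.arange(100.0), predicted_q=94.5, q=0.95)
```

Reporting this per regime (not just pooled) is what catches a tail layer that is calibrated on average but broken in stress.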
KR market implementation notes
- Separate models for opening call/closing auction windows; don’t force one continuous model.
- Treat VI-adjacent states as a stress-prior bump even before the full trigger fires.
- Preserve conservative fallback schedule if feed latency or queue features go stale.
- Maintain venue-local toxicity estimate; stress can be asymmetric across venues.
Failure modes and protections
Regime overfitting
- Symptom: excellent backtest, unstable live posteriors
- Fix: stronger transition prior + fewer regimes
Hidden latency bias
- Symptom: model underestimates costs during burst periods
- Fix: add feature freshness lag and gateway RTT features
Overreaction controller
- Symptom: action churn increases fees without slippage gain
- Fix: hysteresis + minimum hold time per mode
Data leakage via post-trade features
- Symptom: impossible offline accuracy
- Fix: enforce strict event-time cutoff at order send
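The hysteresis + minimum-hold fix for the overreaction failure mode can be sketched as a small latch; thresholds and hold length here are illustrative:

```python
class ModeLatch:
    """Hysteresis band plus minimum hold time for the controller mode.

    Enters stress only when P(stress) clears the upper threshold, exits
    only below the lower one, and never flips before min_hold steps --
    suppressing action churn from a noisy posterior.
    """

    def __init__(self, enter_th=0.65, exit_th=0.45, min_hold=2):
        self.enter_th, self.exit_th, self.min_hold = enter_th, exit_th, min_hold
        self.mode, self.held = "normal", 0

    def step(self, p_stress):
        self.held += 1
        if self.held < self.min_hold:
            return self.mode                       # respect minimum hold
        if self.mode == "normal" and p_stress > self.enter_th:
            self.mode, self.held = "stress", 0
        elif self.mode == "stress" and p_stress < self.exit_th:
            self.mode, self.held = "normal", 0
        return self.mode
```

The gap between `enter_th` and `exit_th` is the band that prevents flip-flopping when the posterior hovers near a single threshold.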
Practical rollout plan (2 weeks)
Week 1
- Build unified feature table and IS label stream
- Train 1-regime baseline + 3-regime switching prototype
- Backtest with replay and tail metrics
Week 2
- Shadow mode in paper/live-sim routing
- Compare decisions vs current controller
- Enable only low-risk levers first (quote dwell, child cap)
- Promote to active once q95 improvement is stable for 5+ sessions
Reference pointers
- Square-root impact stylized facts and propagator framing (metaorder impact literature; practical summaries include Emergent Mind overview, 2025 update).
- Participation-vs-duration modeling debate and the operational need to separate physical impact vs time risk (institutional execution literature; Talos practitioner write-up, 2025).
- Regime-switching/state-space methods from time-series econometrics adapted to execution control.
TL;DR
Use a regime-switching slippage surface instead of one static model.
Predict slippage as a mixture of regime-specific surfaces over participation, duration, and queue toxicity; then route/control with tail-aware objective.
You’ll usually see a modest mean improvement; the real win is materially better q95/q99 tail control when market microstructure turns hostile.