Slippage CUSUM Change-Point + Burn-Rate Controller Playbook

Date: 2026-02-26
Category: Research (Execution / Slippage Modeling)
Scope: Intraday single-name and basket execution (KR-focused, portable)

Why this model exists

Most execution stacks still monitor slippage with slow averages:

rolling 15m mean IS,
end-of-order postmortems,
dashboard alarms after damage is done.

That is operationally too late.

This playbook treats slippage control like SRE incident detection:

Build a real-time expected slippage baseline,
Track residual error,
Run change-point detection (CUSUM/Page-Hinkley),
Trigger a burn-rate controller on the remaining slippage budget.

Core idea: don’t just estimate cost level. Detect when the process has shifted regime right now and react before tail loss compounds.

Problem setup

Let parent order start at time (t_0), horizon (H), benchmark (B) (arrival or decision).

Observed incremental slippage (bps) at decision tick (t):

[ y_t = \text{slippage}_{t\rightarrow t+\Delta} ]

Predict expected slippage using context (x_t):

[ \hat{y}_t = f(x_t) ]

Residual:

[ r_t = y_t - \hat{y}_t ]

If (r_t) becomes persistently positive, execution is paying an abnormal tax not explained by current modeled context.

Stage 1 — Baseline model (residualization)

Use a fast model for (\hat{y}_t):

LightGBM/CatBoost (tabular microstructure + own-exec features), or
linear + interactions for lower latency and easier diagnostics.

Suggested features:

spread, microprice drift, queue imbalance,
short-horizon volatility,
participation rate and remaining urgency,
queue age / cancel-replace churn,
session bucket (open/close/lunch), venue state (KRX/NXT/VI).

Goal is not perfection. Goal is to produce a stable residual stream where shifts are detectable.

Stage 2 — Online change-point detection

A) One-sided CUSUM (upward slippage shift)

Define standardized residual:

[ z_t = \frac{r_t - \mu_0}{\sigma_0 + \epsilon} ]

with (\mu_0\approx 0), (\sigma_0) estimated by robust rolling window.

CUSUM statistic:

[ S_t = \max{0, S_{t-1} + z_t - k} ]

(k): slack/drift allowance (reference value),
alert when (S_t > h) (decision threshold).

Interpretation:

random noise resets near zero,
persistent positive residuals accumulate and breach quickly.

B) Page-Hinkley variant (drift-sensitive)

For non-stationary sessions, Page-Hinkley can reduce false positives:

[ PH_t = PH_{t-1} + (r_t - \bar{r}_t - \delta) ]

Track minimum (m_t = \min_{i\le t} PH_i), alert if:

[ PH_t - m_t > \lambda ]

Use CUSUM as primary, Page-Hinkley as confirmation in noisy symbols.

Stage 3 — Slippage budget burn-rate control

Let total allowed slippage budget for parent order be (B_{tot}) (bps-weighted notional). Remaining budget at (t):

[ B_t = B_{tot} - \sum_{i=t_0}^{t} y_i w_i ]

where (w_i) is fill/notional weight.

Define projected burn rate over short horizon (\tau):

[ \rho_t = \frac{\mathbb{E}[\sum_{j=t}^{t+\tau} y_j w_j \mid x_t, S_t]}{\max(B_t, \epsilon)} ]

Controller tiers:

GREEN: (\rho_t < 0.35)
YELLOW: (0.35 \le \rho_t < 0.7)
ORANGE: (0.7 \le \rho_t < 1.0)
RED: (\rho_t \ge 1.0) or CUSUM hard breach

This is analogous to SLO burn-rate alerts: not just “are we bad,” but “how fast are we consuming error budget.”

Controller state machine

[ \text{NORMAL} \rightarrow \text{GUARDED} \rightarrow \text{DEFENSIVE} \rightarrow \text{KILL/SAFE} ]

NORMAL

passive-first,
standard participation caps,
normal quote dwell.

GUARDED (soft CUSUM alert or YELLOW burn)

reduce passive dwell time,
tighten stale-quote cancellation thresholds,
lower max child order size,
increase reprice cadence.

DEFENSIVE (CUSUM confirmed or ORANGE burn)

cap passive exposure,
prefer certainty over queue vanity,
temporary venue quarantine if regime appears venue-local,
activate tail-risk cap (q95 slippage guard).

KILL/SAFE (RED burn or repeated hard breaches)

halt aggressive escalation,
force controlled unwind protocol,
notify operator with feature snapshot + detector trace.

Add hysteresis so controller does not flap:

enter DEFENSIVE on 2 consecutive confirms,
exit only after detector cool-down window.

Calibration process

1) Build residual reference

Purged time-split by day/session,
robustly estimate (\sigma_0) per symbol-time bucket,
winsorize extreme prints for detector stability.

2) Tune detector thresholds

Optimize ((k, h)) for:

low mean detection delay under injected shift,
bounded false alarm rate by session bucket.

Inject synthetic shifts (e.g., +2/+4/+8 bps residual drift) into replay to map sensitivity curves.

3) Tune action policy

Backtest policy under counterfactual replay:

objective: reduce q95/q99 IS,
constraints: underfill penalty, completion SLA, turnover caps.

Evaluation scorecard

Primary:

q95 implementation shortfall reduction vs baseline controller.

Secondary:

detection delay (seconds) after true shift onset,
false positive alerts per order/day,
mode-switch rate (stability),
completion rate and opportunity cost,
tail conditional expectation (CVaR) of slippage.

Operational:

percent alerts with actionable explanation,
feature freshness SLO compliance,
fallback invocation rate.

Production guardrails

Feature staleness kill-switch: if key features delayed, detector output invalidated and fallback policy engaged.
Session-aware thresholds: open/close windows need separate ((k,h)).
Venue segmentation: detect whether shift is global or venue-local before routing changes.
Minimum sample floor: suppress detector on too-sparse tick windows.
Explainability payload per alert:
- top residual drivers,
- current CUSUM value, burn-rate tier,
- recommended action and confidence.

Pseudocode

for t in decision_ticks:
    x = build_features(t)
    y_hat = baseline.predict(x)
    y_obs = realized_incremental_slippage(t)
    r = y_obs - y_hat

    z = (r - mu0_bucket) / (sigma0_bucket + 1e-6)
    S = max(0.0, S + z - k)

    cusum_alert = (S > h)
    burn_rate = forecast_burn_rate(state, x, remaining_budget)

    state = transition(state, cusum_alert, burn_rate, hysteresis=True)
    action = policy(state, market_state=x, remain_qty=remain_qty)

    if state == "KILL_SAFE":
        trigger_safe_unwind()

    send(action)

Common failure modes

Detector on raw slippage (no residualization)
→ reacts to normal context changes, not true regime shifts.
Single threshold for all sessions
→ open/close false alarms explode.
No hysteresis
→ mode flapping and unnecessary cancel churn.
Ignoring underfill cost
→ slippage improves on paper while execution quality degrades.
Alert with no prescribed action
→ operators receive noise, not control.

Next experiments

Bayesian online change-point posterior as a second detector.
Joint detector across slippage residual + fill-hazard residual.
Portfolio-aware correlated change-point detection (basket contagion).
Conformal wrapper for alert confidence guarantees.

One-line takeaway

Treat slippage like an error budget: CUSUM catches hidden regime shifts early, and burn-rate control converts that warning into concrete execution actions before tail loss compounds.