Uncertainty-Decomposed Slippage Control Playbook (Epistemic vs Aleatoric)
Date: 2026-02-28
Category: research (quant execution / slippage modeling)
Why this playbook
Many execution stacks predict expected slippage ((\mu)) and maybe one tail quantile. That is useful, but it blends two very different risks:
- Aleatoric uncertainty — market noise you cannot remove (microstructure randomness, queue race noise).
- Epistemic uncertainty — model ignorance (regime novelty, sparse context, covariate shift).
Operationally, these need different actions. If you treat both as one number, you either:
- over-throttle in noisy-but-familiar regimes, or
- under-react when the model is blind in new regimes.
This playbook decomposes uncertainty and maps it to live execution controls.
Core objective
For each child-order decision at time (t), estimate:
- (\hat{\mu}_t): expected slippage (bps),
- (\hat{\sigma}^{(a)}_t): aleatoric uncertainty,
- (\hat{\sigma}^{(e)}_t): epistemic uncertainty,
- (\hat{q}_{p,t}): calibrated tail quantiles (e.g., p90/p95).
Then drive a controller that distinguishes:
- "known noisy" (high aleatoric, low epistemic),
- "unknown risky" (high epistemic, often before incidents).
Practical modeling architecture
Use a two-layer setup.
Layer A — Base predictive ensemble
Train (M) heterogeneous models (e.g., LightGBM, CatBoost, linear baseline, shallow MLP) on the same target:
[ S_t = \text{realized child-order slippage in bps} ]
Each model outputs:
- mean (\hat{\mu}_{m,t}),
- quantiles (\hat{q}_{m,0.5}, \hat{q}_{m,0.9}, \hat{q}_{m,0.95}), or equivalent scale proxy.
Ensemble mean:
[ \hat{\mu}_t = \frac{1}{M}\sum_{m=1}^{M} \hat{\mu}_{m,t} ]
Epistemic proxy (between-model spread):
[ \hat{\sigma}^{(e)}_t = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\mu}_{m,t}-\hat{\mu}_t\right)^2} ]
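The Layer A combination can be sketched in a few lines, assuming per-model mean predictions are stacked into an (M \times T) array (function and variable names here are illustrative, not from a specific library):

```python
import numpy as np

def ensemble_decompose(mu_by_model):
    """Ensemble mean and between-model spread (epistemic proxy).
    mu_by_model: array of shape (M, T) -- M models, T decision times."""
    mu_hat = mu_by_model.mean(axis=0)            # ensemble mean per decision time
    sigma_e = mu_by_model.std(axis=0, ddof=1)    # between-model spread, (M-1) denominator
    return mu_hat, sigma_e

# toy example: 4 models, 3 decision times (bps)
preds = np.array([
    [2.0, 5.0, 1.0],
    [2.2, 4.0, 1.1],
    [1.8, 6.0, 0.9],
    [2.0, 5.0, 1.0],
])
mu_hat, sigma_e = ensemble_decompose(preds)
```

Note `ddof=1` matches the (M-1) denominator in the spread formula above.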
Layer B — Residual/noise model
Fit a second model on absolute residuals from recent production windows:
[ r_t = |S_t - \hat{\mu}_t| ]
Predict (\hat{r}_t) using the same context features plus queue/latency stress features. Treat (\hat{\sigma}^{(a)}_t \propto \hat{r}_t) as aleatoric scale.
This separates:
- disagreement among models (epistemic),
- irreducible local turbulence (aleatoric).
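A dependency-free sketch of Layer B, using linear least squares as a stand-in for whatever residual model the stack actually runs (in practice a gradient-boosted regressor on the full context features); the synthetic data below is only a check that the mechanics work:

```python
import numpy as np

def fit_aleatoric_scale(X, s_real, mu_hat):
    """Fit a model on absolute residuals r_t = |S_t - mu_hat_t| and
    return a non-negative predictor of the aleatoric scale."""
    r = np.abs(np.asarray(s_real) - np.asarray(mu_hat))   # absolute residuals
    A = np.column_stack([X, np.ones(len(X))])             # features + intercept
    w, *_ = np.linalg.lstsq(A, r, rcond=None)

    def predict(X_new):
        A_new = np.column_stack([X_new, np.ones(len(X_new))])
        return np.clip(A_new @ w, 0.0, None)              # scale is non-negative
    return predict

# synthetic check: residual magnitude grows linearly with a stress feature
x = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
s = 1.0 + 3.0 * x[:, 0]            # realized slippage (bps)
mu = np.zeros(50)                  # ensemble mean prediction
predict_scale = fit_aleatoric_scale(x, s, mu)
```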
Feature blocks that improve decomposition
Use feature families with explicit intent:
- Execution intent
- side, urgency, participation target, child size / ADV slice, remaining horizon.
- Book state
- spread, top-k depth imbalance, microprice drift, queue position percentile.
- Flow toxicity
- short-horizon OFI, cancel/trade divergence, markout pressure proxy.
- Infra timing
- decision→gateway→ACK latencies, jitter z-score, throttle utilization.
- Novelty / OOD indicators (epistemic helpers)
- distance-to-training manifold (kNN distance, Mahalanobis, leaf-frequency rarity),
- regime labels unseen in recent training windows,
- feature missingness pattern drift.
OOD features are often the highest leverage for epistemic alarms.
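The kNN distance-to-training-manifold score mentioned above is simple enough to sketch directly; this is a brute-force version (a production stack would use an approximate-nearest-neighbor index), and the random data is purely illustrative:

```python
import numpy as np

def knn_novelty(train_X, query_X, k=5):
    """Mean Euclidean distance to the k nearest training points:
    a simple distance-to-training-manifold score for epistemic/OOD alarms."""
    # pairwise distances, shape (n_queries, n_train)
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))            # familiar contexts
queries = np.vstack([np.zeros((1, 4)),       # in-distribution point
                     np.full((1, 4), 8.0)])  # novel regime, far from training mass
scores = knn_novelty(train, queries)
```

The novel query should score much higher than the in-distribution one, which is exactly the behavior an epistemic alarm needs.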
Calibration: make uncertainty actionable
Raw uncertainty numbers are rarely calibrated. Add explicit calibration.
1) Quantile calibration
On rolling validation windows, enforce empirical coverage:
- p90 target — ~90% of realized (S_t) below (\hat{q}_{0.9}),
- p95 target — ~95% below (\hat{q}_{0.95}).
Use isotonic or monotone spline recalibration per liquidity regime.
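One lightweight way to enforce the coverage targets above is a conformal-style additive shift fitted per liquidity regime; this is a deliberately simpler alternative to the isotonic fit named in the text, sketched on synthetic data:

```python
import numpy as np

def conformal_shift(q_pred, s_real, target=0.9):
    """Additive shift delta such that q_pred + delta achieves ~`target`
    empirical coverage of realized slippage on the calibration window."""
    scores = np.asarray(s_real) - np.asarray(q_pred)  # positive => quantile breached
    return np.quantile(scores, target)

q90 = np.zeros(100)                        # stand-in model p90 quantiles
realized = np.arange(100, dtype=float)     # realized slippage (bps)
delta = conformal_shift(q90, realized, target=0.9)
# adjusted quantile for live use: q90 + delta
```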
2) Epistemic reliability curve
Bucket (\hat{\sigma}^{(e)}) deciles, then measure future error inflation:
[ \rho_k = \frac{\mathbb{E}[|S-\hat{\mu}|\mid \hat{\sigma}^{(e)}\in B_k]}{\mathbb{E}[|S-\hat{\mu}|\mid \hat{\sigma}^{(e)}\in B_1]} ]
You want (\rho_k) increasing in (k). If it is not monotonic, your epistemic proxy is noisy and needs feature/OOD work.
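The reliability curve is a short computation; a sketch with illustrative names, checked on synthetic data where error really does grow with the epistemic proxy:

```python
import numpy as np

def epistemic_reliability(sigma_e, abs_err, n_buckets=10):
    """Bucket sigma_e into deciles and return rho_k: mean |S - mu_hat| in
    bucket k relative to the lowest bucket. Monotone output means the
    epistemic proxy usefully ranks future error."""
    edges = np.quantile(sigma_e, np.linspace(0, 1, n_buckets + 1))
    idx = np.clip(np.searchsorted(edges, sigma_e, side="right") - 1,
                  0, n_buckets - 1)
    means = np.array([abs_err[idx == k].mean() for k in range(n_buckets)])
    return means / means[0]

sig = np.arange(100, dtype=float)          # epistemic proxy values
err = 1.0 + 0.1 * sig                      # absolute error grows with proxy
rho = epistemic_reliability(sig, err)
```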
3) Drift-conditioned recalibration
Maintain separate calibrators for:
- normal,
- fragile,
- stress,
using microstructure stress labels. One global calibrator usually under-covers in stress.
Decision policy: uncertainty-aware controller
Define normalized pressure scores:
[ U^{(a)}_t = \frac{\hat{\sigma}^{(a)}_t}{\text{median}({\hat{\sigma}^{(a)}})+\epsilon}, \quad U^{(e)}_t = \frac{\hat{\sigma}^{(e)}_t}{\text{median}({\hat{\sigma}^{(e)}})+\epsilon} ]
Tail budget burn:
[ B_t = \frac{\sum_{i\le t} \max(0, S_i - q_{0.9,i})}{\text{daily tail budget (bps)}} ]
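Both controller inputs can be sketched directly from the formulas above (in live use the medians would come from a rolling window rather than the full history):

```python
import numpy as np

def pressure_scores(sigma_a, sigma_e, eps=1e-9):
    """Normalize each uncertainty stream by its median so U ~ 1 means
    'typical'; thresholds like 1.3 / 1.8 then port across symbols."""
    u_a = np.asarray(sigma_a) / (np.median(sigma_a) + eps)
    u_e = np.asarray(sigma_e) / (np.median(sigma_e) + eps)
    return u_a, u_e

def tail_budget_burn(s_real, q90, daily_budget_bps):
    """Cumulative exceedance of realized slippage over p90,
    as a fraction of the daily tail budget."""
    excess = np.maximum(0.0, np.asarray(s_real) - np.asarray(q90))
    return np.cumsum(excess) / daily_budget_bps

u_a, u_e = pressure_scores([1.0, 2.0, 4.0], [0.5, 1.0, 3.0])
burn = tail_budget_burn([1.0, 5.0, 2.0], [2.0, 2.0, 2.0], daily_budget_bps=10.0)
```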
State machine
NORMAL
- condition: (U^{(e)}<1.3), (B_t<0.5)
- action: baseline tactic mix.
NOISY-KNOWN (aleatoric-dominant)
- condition: (U^{(a)}) elevated, (U^{(e)}) moderate
- action: smaller clips, slightly lower POV, keep participation continuity.
UNKNOWN-RISK (epistemic-dominant)
- condition: (U^{(e)}\ge 1.8) or OOD flag high
- action: switch to conservative fallback policy (historically robust tactic set), tighten max aggression, increase passive bias unless deadline breach risk dominates.
SAFE
- condition: (B_t\ge 1.0) and (high (U^{(e)}) or severe stress)
- action: hard participation cap, venue whitelist, optional temporary pause for non-urgent flow.
Use hysteresis and minimum dwell times to avoid state flapping.
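The transition logic above can be sketched as a single function with minimum dwell enforced as the hysteresis mechanism. The epistemic and budget thresholds come from the state list; the 1.5 aleatoric threshold is an assumed placeholder, since the playbook leaves NOISY-KNOWN qualitative:

```python
def next_state(state, u_a, u_e, burn, ood_flag, dwell, min_dwell=5):
    """One step of the four-state controller. dwell = decisions spent in the
    current state; transitions are suppressed until min_dwell is reached."""
    if dwell < min_dwell:
        return state                              # hysteresis: honor minimum dwell
    if burn >= 1.0 and (u_e >= 1.8 or ood_flag):
        return "SAFE"                             # budget burned under epistemic stress
    if u_e >= 1.8 or ood_flag:
        return "UNKNOWN-RISK"                     # model is blind: fallback policy
    if u_a >= 1.5 and u_e < 1.8:                  # 1.5 is an assumed placeholder
        return "NOISY-KNOWN"                      # familiar turbulence: shrink clips
    if u_e < 1.3 and burn < 0.5:
        return "NORMAL"
    return state                                  # no clear signal: hold state
```

De-escalation (e.g., SAFE back to NORMAL) would go through the same dwell gate in the reverse direction.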
Counterfactual guardrail before policy promotion
Before promoting a new policy/model to production:
- Replay recent tape with off-policy estimators (DR/SNIPS).
- Evaluate by tail-first metrics:
- p95 slippage,
- CVaR95,
- budget-breach frequency,
- underfill at horizon.
- Require challenger to satisfy:
- no worse p95/CVaR95 in stress,
- no material increase in underfill (> threshold),
- uncertainty calibration not degraded (coverage error bound).
Mean improvement alone is not sufficient.
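Of the two estimators named above, SNIPS is the simpler to sketch. It assumes the logs carry action propensities under the incumbent policy and a per-decision reward (e.g., negative slippage or a tail-penalized cost); weight clipping is a common variance control:

```python
import numpy as np

def snips(rewards, p_new, p_old, clip=10.0):
    """Self-normalized importance sampling estimate of the challenger
    policy's value on logged decisions. p_new / p_old are the challenger's
    and incumbent's propensities for the logged action."""
    w = np.clip(np.asarray(p_new) / np.asarray(p_old), 0.0, clip)
    return float(np.sum(w * np.asarray(rewards)) / np.sum(w))
```

Sanity check: when challenger and incumbent propensities match, SNIPS reduces to the plain average of logged rewards.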
Monitoring dashboard (minimum viable)
Track these in real time and EOD:
- Cost metrics
- mean/p90/p95 slippage, CVaR95.
- Calibration metrics
- p90/p95 empirical coverage gap,
- PIT histogram sanity.
- Uncertainty decomposition
- median/90p (\hat{\sigma}^{(a)}), (\hat{\sigma}^{(e)}),
- share of UNKNOWN-RISK time.
- Controller behavior
- state occupancy, transitions/hour, dwell time.
- Execution business metrics
- completion ratio, schedule delay, reject/cancel anomalies.
Alert examples:
- p95 coverage miss > +4pp for 3 consecutive windows,
- UNKNOWN-RISK occupancy > 25% in liquid names,
- SAFE entry rate > historical p95 baseline.
Failure modes and fixes
- Epistemic over-triggering in sparse symbols
- fix: hierarchical pooling by liquidity bucket + symbol embeddings.
- Aleatoric model learning stale residual regimes
- fix: shorter residual retrain cadence + stress-window oversampling.
- Great calibration in backtest, poor live coverage
- fix: prequential (online) calibration checks, not static split only.
- Controller improves tail but kills fills
- fix: dual objective guardrail (tail + completion), enforce minimum completion SLO.
Implementation blueprint (4-week cut)
Week 1
- Build ensemble prediction pipeline + OOD feature set.
- Backfill (\hat{\sigma}^{(e)}) history and sanity-check monotonicity.
Week 2
- Fit aleatoric residual model.
- Add rolling quantile calibration by regime.
Week 3
- Integrate state machine in shadow mode (no real control impact).
- Measure hypothetical actions and counterfactual tail impact.
Week 4
- Canary live traffic (5% → 20% → 50%) with rollback gates.
- Promote only with tail+completion pass.
Compact operator checklist
Before market open:
- Calibrators fresh (<24h) for all active regimes.
- Epistemic/OOD sensors healthy (no null stream).
- Tail budget initialized per strategy sleeve.
During session:
- Watch UNKNOWN-RISK occupancy spikes.
- Verify SAFE transitions correspond to real stress, not telemetry errors.
- Keep completion-vs-tail tradeoff inside policy envelope.
After close:
- Coverage diagnostics (p90/p95) by regime and symbol bucket.
- Review top epistemic spikes and label true novelty vs data-quality issue.
- Update fallback policy set if repeated UNKNOWN-RISK contexts emerge.
Bottom line
A single slippage forecast is not enough for live execution control. The critical upgrade is to separate:
- noise you must trade through (aleatoric), from
- ignorance you must respect (epistemic).
That decomposition turns uncertainty from dashboard decoration into a concrete control surface for safer, higher-confidence execution.