Uncertainty-Decomposed Slippage Control Playbook (Epistemic vs Aleatoric)
Date: 2026-02-28
Category: research (quant execution / slippage modeling)
Why this playbook
Many execution stacks predict expected slippage ((\mu)) and maybe one tail quantile. That is useful, but it blends two very different risks:
- Aleatoric uncertainty — market noise you cannot remove (microstructure randomness, queue race noise).
- Epistemic uncertainty — model ignorance (regime novelty, sparse context, covariate shift).
Operationally, these need different actions. If you treat both as one number, you either:
- over-throttle in noisy-but-familiar regimes, or
- under-react when the model is blind in new regimes.
This playbook decomposes uncertainty and maps it to live execution controls.
Core objective
For each child-order decision at time (t), estimate:
- (\hat{\mu}_t): expected slippage (bps),
- (\hat{\sigma}^{(a)}_t): aleatoric uncertainty,
- (\hat{\sigma}^{(e)}_t): epistemic uncertainty,
- (\hat{q}_{p,t}): calibrated tail quantiles (e.g., p90/p95).
Then drive a controller that distinguishes:
- "known noisy" (high aleatoric, low epistemic),
- "unknown risky" (high epistemic, often before incidents).
Practical modeling architecture
Use a two-layer setup.
Layer A — Base predictive ensemble
Train (M) heterogeneous models (e.g., LightGBM, CatBoost, linear baseline, shallow MLP) on the same target:
[ S_t = \text{realized child-order slippage in bps} ]
Each model outputs:
- mean (\hat{\mu}_{m,t}),
- quantiles (\hat{q}_{m,0.5}, \hat{q}_{m,0.9}, \hat{q}_{m,0.95}), or equivalent scale proxy.
Ensemble mean:
[ \hat{\mu}_t = \frac{1}{M}\sum_{m=1}^{M} \hat{\mu}_{m,t} ]
Epistemic proxy (between-model spread):
[ \hat{\sigma}^{(e)}_t = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\mu}_{m,t}-\hat{\mu}_t\right)^2} ]
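The Layer A combination can be sketched in a few lines, assuming per-model mean predictions are stacked into an (M \times T) array (function and variable names here are illustrative, not from a specific library):

```python
import numpy as np

def ensemble_decompose(mu_by_model):
    """Ensemble mean and between-model spread (epistemic proxy).
    mu_by_model: array of shape (M, T) -- M models, T decision times."""
    mu_hat = mu_by_model.mean(axis=0)            # ensemble mean per decision time
    sigma_e = mu_by_model.std(axis=0, ddof=1)    # between-model spread, (M-1) denominator
    return mu_hat, sigma_e

# toy example: 4 models, 3 decision times (bps)
preds = np.array([
    [2.0, 5.0, 1.0],
    [2.2, 4.0, 1.1],
    [1.8, 6.0, 0.9],
    [2.0, 5.0, 1.0],
])
mu_hat, sigma_e = ensemble_decompose(preds)
```

Note `ddof=1` matches the (M-1) denominator in the spread formula above.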
Layer B — Residual/noise model
Fit a second model on absolute residuals from recent production windows:
[ r_t = |S_t - \hat{\mu}_t| ]
Predict (\hat{r}_t) using the same context features plus queue/latency stress features. Treat (\hat{\sigma}^{(a)}_t \propto \hat{r}_t) as aleatoric scale.
This separates:
- disagreement among models (epistemic),
- irreducible local turbulence (aleatoric).
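A dependency-free sketch of Layer B, using linear least squares as a stand-in for whatever residual model the stack actually runs (in practice a gradient-boosted regressor on the full context features); the synthetic data below is only a check that the mechanics work:

```python
import numpy as np

def fit_aleatoric_scale(X, s_real, mu_hat):
    """Fit a model on absolute residuals r_t = |S_t - mu_hat_t| and
    return a non-negative predictor of the aleatoric scale."""
    r = np.abs(np.asarray(s_real) - np.asarray(mu_hat))   # absolute residuals
    A = np.column_stack([X, np.ones(len(X))])             # features + intercept
    w, *_ = np.linalg.lstsq(A, r, rcond=None)

    def predict(X_new):
        A_new = np.column_stack([X_new, np.ones(len(X_new))])
        return np.clip(A_new @ w, 0.0, None)              # scale is non-negative
    return predict

# synthetic check: residual magnitude grows linearly with a stress feature
x = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
s = 1.0 + 3.0 * x[:, 0]            # realized slippage (bps)
mu = np.zeros(50)                  # ensemble mean prediction
predict_scale = fit_aleatoric_scale(x, s, mu)
```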
Feature blocks that improve decomposition
Use feature families with explicit intent:
- Execution intent
- side, urgency, participation target, child size / ADV slice, remaining horizon.
- Book state
- spread, top-k depth imbalance, microprice drift, queue position percentile.
- Flow toxicity
- short-horizon OFI, cancel/trade divergence, markout pressure proxy.
- Infra timing
- decision→gateway→ACK latencies, jitter z-score, throttle utilization.
- Novelty / OOD indicators (epistemic helpers)
- distance-to-training manifold (kNN distance, Mahalanobis, leaf-frequency rarity),
- regime labels unseen in recent training windows,
- feature missingness pattern drift.
OOD features are often the highest leverage for epistemic alarms.
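The kNN distance-to-training-manifold score mentioned above is simple enough to sketch directly; this is a brute-force version (a production stack would use an approximate-nearest-neighbor index), and the random data is purely illustrative:

```python
import numpy as np

def knn_novelty(train_X, query_X, k=5):
    """Mean Euclidean distance to the k nearest training points:
    a simple distance-to-training-manifold score for epistemic/OOD alarms."""
    # pairwise distances, shape (n_queries, n_train)
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))            # familiar contexts
queries = np.vstack([np.zeros((1, 4)),       # in-distribution point
                     np.full((1, 4), 8.0)])  # novel regime, far from training mass
scores = knn_novelty(train, queries)
```

The novel query should score much higher than the in-distribution one, which is exactly the behavior an epistemic alarm needs.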
Calibration: make uncertainty actionable
Raw uncertainty numbers are rarely calibrated. Add explicit calibration.
1) Quantile calibration
On rolling validation windows, enforce empirical coverage:
- p90 target — ~90% of realized (S_t) below (\hat{q}_{0.9}),
- p95 target — ~95% below (\hat{q}_{0.95}).
Use isotonic or monotone spline recalibration per liquidity regime.
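One lightweight way to enforce the coverage targets above is a conformal-style additive shift fitted per liquidity regime; this is a deliberately simpler alternative to the isotonic fit named in the text, sketched on synthetic data:

```python
import numpy as np

def conformal_shift(q_pred, s_real, target=0.9):
    """Additive shift delta such that q_pred + delta achieves ~`target`
    empirical coverage of realized slippage on the calibration window."""
    scores = np.asarray(s_real) - np.asarray(q_pred)  # positive => quantile breached
    return np.quantile(scores, target)

q90 = np.zeros(100)                        # stand-in model p90 quantiles
realized = np.arange(100, dtype=float)     # realized slippage (bps)
delta = conformal_shift(q90, realized, target=0.9)
# adjusted quantile for live use: q90 + delta
```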
2) Epistemic reliability curve
Bucket (\hat{\sigma}^{(e)}) deciles, then measure future error inflation:
[ \rho_k = \frac{\mathbb{E}[|S-\hat{\mu}|\mid \hat{\sigma}^{(e)}\in B_k]}{\mathbb{E}[|S-\hat{\mu}|\mid \hat{\sigma}^{(e)}\in B_1]} ]
You want (\rho_k) increasing in (k). If it is not monotonic, your epistemic proxy is noisy and needs feature/OOD work.
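The reliability curve is a short computation; a sketch with illustrative names, checked on synthetic data where error really does grow with the epistemic proxy:

```python
import numpy as np

def epistemic_reliability(sigma_e, abs_err, n_buckets=10):
    """Bucket sigma_e into deciles and return rho_k: mean |S - mu_hat| in
    bucket k relative to the lowest bucket. Monotone output means the
    epistemic proxy usefully ranks future error."""
    edges = np.quantile(sigma_e, np.linspace(0, 1, n_buckets + 1))
    idx = np.clip(np.searchsorted(edges, sigma_e, side="right") - 1,
                  0, n_buckets - 1)
    means = np.array([abs_err[idx == k].mean() for k in range(n_buckets)])
    return means / means[0]

sig = np.arange(100, dtype=float)          # epistemic proxy values
err = 1.0 + 0.1 * sig                      # absolute error grows with proxy
rho = epistemic_reliability(sig, err)
```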
3) Drift-conditioned recalibration
Maintain separate calibrators for:
- normal,
- fragile,
- stress,
using microstructure stress labels. One global calibrator usually under-covers in stress.
Decision policy: uncertainty-aware controller
Define normalized pressure scores:
[ U^{(a)}_t = \frac{\hat{\sigma}^{(a)}_t}{\text{median}({\hat{\sigma}^{(a)}})+\epsilon}, \quad U^{(e)}_t = \frac{\hat{\sigma}^{(e)}_t}{\text{median}({\hat{\sigma}^{(e)}})+\epsilon} ]
Tail budget burn:
[ B_t = \frac{\sum_{i\le t} \max(0, S_i - q_{0.9,i})}{\text{daily tail budget (bps)}} ]
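Both controller inputs can be sketched directly from the formulas above (in live use the medians would come from a rolling window rather than the full history):

```python
import numpy as np

def pressure_scores(sigma_a, sigma_e, eps=1e-9):
    """Normalize each uncertainty stream by its median so U ~ 1 means
    'typical'; thresholds like 1.3 / 1.8 then port across symbols."""
    u_a = np.asarray(sigma_a) / (np.median(sigma_a) + eps)
    u_e = np.asarray(sigma_e) / (np.median(sigma_e) + eps)
    return u_a, u_e

def tail_budget_burn(s_real, q90, daily_budget_bps):
    """Cumulative exceedance of realized slippage over p90,
    as a fraction of the daily tail budget."""
    excess = np.maximum(0.0, np.asarray(s_real) - np.asarray(q90))
    return np.cumsum(excess) / daily_budget_bps

u_a, u_e = pressure_scores([1.0, 2.0, 4.0], [0.5, 1.0, 3.0])
burn = tail_budget_burn([1.0, 5.0, 2.0], [2.0, 2.0, 2.0], daily_budget_bps=10.0)
```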
State machine
NORMAL
- condition: (U^{(e)}<1.3), (B_t<0.5)
- action: baseline tactic mix.
NOISY-KNOWN (aleatoric-dominant)
- condition: (U^{(a)}) elevated, (U^{(e)}) moderate
- action: smaller clips, slightly lower POV, keep participation continuity.
UNKNOWN-RISK (epistemic-dominant)
- condition: (U^{(e)}\ge 1.8) or OOD flag high
- action: switch to conservative fallback policy (historically robust tactic set), tighten max aggression, increase passive bias unless deadline breach risk dominates.
SAFE
- condition: (B_t\ge 1.0) and (high (U^{(e)}) or severe stress)
- action: hard participation cap, venue whitelist, optional temporary pause for non-urgent flow.
Use hysteresis and minimum dwell times to avoid state flapping.
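The transition logic above can be sketched as a single function with minimum dwell enforced as the hysteresis mechanism. The epistemic and budget thresholds come from the state list; the 1.5 aleatoric threshold is an assumed placeholder, since the playbook leaves NOISY-KNOWN qualitative:

```python
def next_state(state, u_a, u_e, burn, ood_flag, dwell, min_dwell=5):
    """One step of the four-state controller. dwell = decisions spent in the
    current state; transitions are suppressed until min_dwell is reached."""
    if dwell < min_dwell:
        return state                              # hysteresis: honor minimum dwell
    if burn >= 1.0 and (u_e >= 1.8 or ood_flag):
        return "SAFE"                             # budget burned under epistemic stress
    if u_e >= 1.8 or ood_flag:
        return "UNKNOWN-RISK"                     # model is blind: fallback policy
    if u_a >= 1.5 and u_e < 1.8:                  # 1.5 is an assumed placeholder
        return "NOISY-KNOWN"                      # familiar turbulence: shrink clips
    if u_e < 1.3 and burn < 0.5:
        return "NORMAL"
    return state                                  # no clear signal: hold state
```

De-escalation (e.g., SAFE back to NORMAL) would go through the same dwell gate in the reverse direction.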
Counterfactual guardrail before policy promotion
Before promoting a new policy/model to production:
- Replay recent tape with off-policy estimators (DR/SNIPS).
- Evaluate by tail-first metrics:
- p95 slippage,
- CVaR95,
- budget-breach frequency,
- underfill at horizon.
- Require challenger to satisfy:
- no worse p95/CVaR95 in stress,
- no material increase in underfill (> threshold),
- uncertainty calibration not degraded (coverage error bound).
Mean improvement alone is not sufficient.
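Of the two estimators named above, SNIPS is the simpler to sketch. It assumes the logs carry action propensities under the incumbent policy and a per-decision reward (e.g., negative slippage or a tail-penalized cost); weight clipping is a common variance control:

```python
import numpy as np

def snips(rewards, p_new, p_old, clip=10.0):
    """Self-normalized importance sampling estimate of the challenger
    policy's value on logged decisions. p_new / p_old are the challenger's
    and incumbent's propensities for the logged action."""
    w = np.clip(np.asarray(p_new) / np.asarray(p_old), 0.0, clip)
    return float(np.sum(w * np.asarray(rewards)) / np.sum(w))
```

Sanity check: when challenger and incumbent propensities match, SNIPS reduces to the plain average of logged rewards.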
Monitoring dashboard (minimum viable)
Track these in real time and EOD:
- Cost metrics
- mean/p90/p95 slippage, CVaR95.
- Calibration metrics
- p90/p95 empirical coverage gap,
- PIT histogram sanity.
- Uncertainty decomposition
- median/90p (\hat{\sigma}^{(a)}), (\hat{\sigma}^{(e)}),
- share of UNKNOWN-RISK time.
- Controller behavior
- state occupancy, transitions/hour, dwell time.
- Execution business metrics
- completion ratio, schedule delay, reject/cancel anomalies.
Alert examples:
- p95 coverage miss > +4pp for 3 consecutive windows,
- UNKNOWN-RISK occupancy > 25% in liquid names,
- SAFE entry rate > historical p95 baseline.
Failure modes and fixes
- Epistemic over-triggering in sparse symbols
- fix: hierarchical pooling by liquidity bucket + symbol embeddings.
- Aleatoric model learning stale residual regimes
- fix: shorter residual retrain cadence + stress-window oversampling.
- Great calibration in backtest, poor live coverage
- fix: prequential (online) calibration checks, not static split only.
- Controller improves tail but kills fills
- fix: dual objective guardrail (tail + completion), enforce minimum completion SLO.
Implementation blueprint (4-week cut)
Week 1
- Build ensemble prediction pipeline + OOD feature set.
- Backfill (\hat{\sigma}^{(e)}) history and sanity-check monotonicity.
Week 2
- Fit aleatoric residual model.
- Add rolling quantile calibration by regime.
Week 3
- Integrate state machine in shadow mode (no real control impact).
- Measure hypothetical actions and counterfactual tail impact.
Week 4
- Canary live traffic (5% → 20% → 50%) with rollback gates.
- Promote only with tail+completion pass.
Compact operator checklist
Before market open:
- Calibrators fresh (<24h) for all active regimes.
- Epistemic/OOD sensors healthy (no null stream).
- Tail budget initialized per strategy sleeve.
During session:
- Watch UNKNOWN-RISK occupancy spikes.
- Verify SAFE transitions correspond to real stress, not telemetry errors.
- Keep completion-vs-tail tradeoff inside policy envelope.
After close:
- Coverage diagnostics (p90/p95) by regime and symbol bucket.
- Review top epistemic spikes and label true novelty vs data-quality issue.
- Update fallback policy set if repeated UNKNOWN-RISK contexts emerge.
Bottom line
A single slippage forecast is not enough for live execution control. The critical upgrade is to separate:
- noise you must trade through (aleatoric), from
- ignorance you must respect (epistemic).
That decomposition turns uncertainty from dashboard decoration into a concrete control surface for safer, higher-confidence execution.