Execution Benchmark Integrity for Slippage Modeling: Arrival, VWAP, and Markout Triangulation Playbook
Date: 2026-03-26
Category: research
Audience: quant operators running production execution + TCA loops
Why this note
Many live slippage models are statistically sound but operationally misleading because the benchmark itself is unstable or gameable.
If your benchmark can be shifted by workflow latency, schedule choice, or reporting convention, model quality metrics will look better while real execution quality gets worse.
This note focuses on a practical fix: model and govern slippage against a triangulated benchmark stack instead of a single metric.
1) Benchmark stack: one number is not enough
Use at least four simultaneous views:
- Decision benchmark (portfolio manager decision time)
- Arrival benchmark (order reaches execution system)
- Schedule benchmark (e.g., interval VWAP over executable horizon)
- Post-trade markout (e.g., +1m/+5m/+30m)
Define signed bps costs (shown for a buy; flip the sign of each numerator for a sell):
[ IS_{decision} = \frac{P_{exec}-P_{decision}}{P_{decision}}\cdot 10^4 ]
[ IS_{arrival} = \frac{P_{exec}-P_{arrival}}{P_{arrival}}\cdot 10^4 ]
[ VWAPGap = \frac{P_{exec}-VWAP_{window}}{VWAP_{window}}\cdot 10^4 ]
[ Markout_{\Delta} = \frac{P_{\Delta}-P_{exec}}{P_{exec}}\cdot 10^4 ]
These should be tracked jointly, not substituted for each other.
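As a minimal sketch of tracking the four views jointly (the record layout, field names, and prices below are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

BPS = 1e4  # basis-point scaling

@dataclass
class ParentFill:
    side: int           # +1 buy, -1 sell
    p_exec: float       # size-weighted average fill price
    p_decision: float   # mid at PM decision time
    p_arrival: float    # mid when the order reached the execution system
    vwap_window: float  # interval VWAP over the executable horizon
    p_markout: float    # mid at the markout horizon (e.g. +5m)

def benchmark_stack(f: ParentFill) -> dict:
    """All four views for one parent, signed so IS/VWAP-gap positive = cost
    and markout positive = favorable post-fill drift."""
    s = f.side
    return {
        "is_decision": s * (f.p_exec - f.p_decision) / f.p_decision * BPS,
        "is_arrival":  s * (f.p_exec - f.p_arrival) / f.p_arrival * BPS,
        "vwap_gap":    s * (f.p_exec - f.vwap_window) / f.vwap_window * BPS,
        "markout":     s * (f.p_markout - f.p_exec) / f.p_exec * BPS,
    }

fill = ParentFill(side=+1, p_exec=100.10, p_decision=100.00,
                  p_arrival=100.05, vwap_window=100.08, p_markout=100.30)
print(benchmark_stack(fill))
```

The point of returning all four keys from one function is that no downstream report can silently drop a view.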
2) Practical decomposition (what moved, what was controllable)
Decompose total decision-to-fill shortfall (denominate both legs in the decision price so the terms sum exactly):
[ IS_{decision} = Drift_{decision\to arrival} + ExecCost_{arrival\to fill} ]
Then split execution cost:
[ ExecCost = Spread + Impact + Fees - Rebates + TimingResidual ]
And validate with markout horizons:
- negative short-horizon markout after aggressive fills → likely adverse selection
- positive markout after passive fills but high timeout rate → opportunity-cost leakage
This decomposition is far more diagnosable than a single IS label.
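A sketch of the first split, assuming illustrative price inputs. Both legs use the decision price as denominator; using the arrival price for the execution leg would make the identity only approximate:

```python
BPS = 1e4

def decompose_is(side, p_decision, p_arrival, p_exec):
    """Split decision-to-fill shortfall into drift and execution cost (bps).

    Denominating both legs in the decision price makes the terms sum
    exactly to decision-time implementation shortfall.
    """
    drift = side * (p_arrival - p_decision) / p_decision * BPS
    exec_cost = side * (p_exec - p_arrival) / p_decision * BPS
    is_decision = side * (p_exec - p_decision) / p_decision * BPS
    assert abs(is_decision - (drift + exec_cost)) < 1e-9  # identity check
    return {"drift": drift, "exec_cost": exec_cost, "is_decision": is_decision}
```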
3) Benchmark fragility map (common production traps)
A) Arrival timestamp drift
If arrival uses gateway ingest in one venue and strategy dispatch in another, cross-venue comparisons are corrupted.
Control: enforce canonical event-time hierarchy and keep source field in every record.
B) VWAP window leakage
If VWAP window is chosen after seeing liquidity/price path, the benchmark is selection-biased.
Control: pre-register executable window policy by parent order type.
C) Completion bias
Comparing only completed parents hides timeout/cancel losses.
Control: keep unfinished residual as explicit miss-cost label.
D) Markout cherry-picking
Using only one horizon (e.g., +5m) can overfit to microstructure noise.
Control: fixed markout ladder (+30s, +1m, +5m, +30m) with no per-symbol custom post hoc tuning.
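Control D can be sketched as a single, policy-versioned ladder applied uniformly; `mid_at` is a hypothetical point-in-time mid-price lookup, not a real API:

```python
# Fixed markout ladder: horizons are registered once in the benchmark
# policy and never tuned per symbol post hoc.
MARKOUT_LADDER_S = (30, 60, 300, 1800)  # +30s, +1m, +5m, +30m

def markout_ladder(side, p_exec, fill_ts, mid_at):
    """mid_at(ts) -> mid price at epoch seconds ts.
    Returns signed markouts in bps, one per registered horizon."""
    return {
        h: side * (mid_at(fill_ts + h) - p_exec) / p_exec * 1e4
        for h in MARKOUT_LADDER_S
    }
```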
4) Modeling architecture: multi-target, benchmark-aware
Train a coupled stack instead of one monolithic slippage regressor:
- Drift model for decision→arrival
- Execution-cost model for arrival→fill
- Residual completion model (probability unfinished + miss-cost)
- Markout model for post-trade toxicity check
Use quantile outputs (q50/q90/q95) for each target and score policies on tail-aware objective:
[ Score(a)=\mathbb{E}[C\mid a] + \lambda_{tail}\,Q_{95}(C\mid a) + \lambda_{miss}\,\mathbb{E}[MissCost\mid a] ]
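Given quantile outputs from the stack, the tail-aware objective reduces to a few lines. The lambda weights below are illustrative policy parameters, not values this note prescribes:

```python
def policy_score(exp_cost_bps, q95_cost_bps, exp_miss_cost_bps,
                 lam_tail=0.25, lam_miss=0.5):
    """Score(a) = E[C|a] + lam_tail * Q95(C|a) + lam_miss * E[MissCost|a]."""
    return exp_cost_bps + lam_tail * q95_cost_bps + lam_miss * exp_miss_cost_bps

def pick_policy(candidates):
    """candidates: {name: (E[C], Q95(C), E[MissCost])} -> lowest-score name."""
    return min(candidates, key=lambda name: policy_score(*candidates[name]))
```

Note how the tail and miss-cost terms can flip a ranking: a passive policy that looks cheapest in expectation can lose once its timeout leakage is priced in.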
5) Data contract (PIT + audit-safe)
Minimum schema for each child:
- order IDs (parent/child), side, qty, venue
- decision timestamp + source
- arrival timestamp + source
- ack/fill/cancel timestamps
- benchmark snapshots used at decision time
- benchmark policy version (window rules, markout horizons)
- fee/rebate table version
- residual inventory and deadline state
If benchmark policy version is missing, route record to quarantine dataset (do not train).
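The quarantine rule is a simple routing predicate over the contract. Field names here are illustrative placeholders for whatever the production schema calls them:

```python
# Required point-in-time fields per child record (illustrative names).
REQUIRED_FIELDS = (
    "parent_id", "child_id", "side", "qty", "venue",
    "decision_ts", "decision_ts_source",
    "arrival_ts", "arrival_ts_source",
    "benchmark_policy_version", "fee_table_version",
)

def route_record(rec: dict):
    """Route a child record: ('train', []) only if the contract is
    satisfied; anything missing required fields (notably the benchmark
    policy version) goes to quarantine and is never trained on."""
    missing = [f for f in REQUIRED_FIELDS if rec.get(f) is None]
    return ("quarantine", missing) if missing else ("train", [])
```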
6) Calibration and health checks
A) By-benchmark calibration
Track realized-minus-predicted by benchmark type:
- decision IS calibration
- arrival IS calibration
- VWAP-gap calibration
- markout calibration
B) Stability monitors
- benchmark disagreement index: (|IS_{decision}-IS_{arrival}|)
- window sensitivity index: VWAP gap under adjacent admissible windows
- completion censoring ratio by symbol/venue/time bucket
If disagreement index spikes while network/venue latency spikes, freeze model promotion.
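A minimal sketch of the joint freeze condition; the bps and latency thresholds are illustrative and would be set per desk:

```python
from statistics import median

def disagreement_index(is_decision_bps, is_arrival_bps):
    """Per-parent benchmark disagreement index in bps."""
    return abs(is_decision_bps - is_arrival_bps)

def should_freeze_promotion(disagreements_bps, latencies_ms,
                            disagree_limit_bps=15.0, latency_limit_ms=50.0):
    """Freeze model promotion when the disagreement index spikes in the
    same window as network/venue latency (thresholds are illustrative)."""
    return (median(disagreements_bps) > disagree_limit_bps
            and median(latencies_ms) > latency_limit_ms)
```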
7) Anti-gaming governance controls
- Benchmark policy registry (versioned, immutable once session opens)
- No retroactive benchmark edits without incident ticket + audit trail
- Dual reporting: headline metric + hard-to-game companion metric
- Promotion gate: challenger must improve at least two independent benchmarks
- Red-team tests: simulate timestamp delays and window-shift attacks
A model that only improves one benchmark while degrading markout should fail promotion.
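The promotion gate can be expressed as one predicate over challenger-minus-incumbent deltas. All deltas below are signed costs in bps (lower is better), with markout expressed as a toxicity cost; keys and tolerance are illustrative:

```python
def passes_promotion_gate(deltas_bps: dict, min_improved: int = 2,
                          tol: float = 0.0) -> bool:
    """deltas_bps: challenger-minus-incumbent cost delta per benchmark
    (negative = challenger cheaper). Requires improvement on at least
    `min_improved` independent benchmarks and no markout degradation."""
    improved = sum(1 for v in deltas_bps.values() if v < -tol)
    markout_ok = deltas_bps.get("markout", 0.0) <= tol
    return improved >= min_improved and markout_ok
```

A challenger that wins one benchmark while its markout cost rises fails both conditions at once, which is exactly the gaming pattern the gate is meant to block.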
8) Two-week rollout plan
Days 1–3
Define benchmark policy registry and canonical timestamp hierarchy.
Days 4–6
Build decomposition labels (decision drift / execution cost / miss cost / markout).
Days 7–9
Train multi-target quantile stack and create by-benchmark calibration dashboard.
Days 10–11
Add anti-gaming checks (window sensitivity + disagreement index alerts).
Days 12–13
Shadow-mode comparison vs incumbent model with fixed benchmark policy.
Day 14
Canary launch with promotion gate requiring multi-benchmark improvement.
Bottom line
Better slippage modeling is not just better prediction; it is benchmark integrity engineering.
If you only ship one improvement this cycle: move from single-metric IS to a triangulated benchmark stack (decision/arrival/VWAP/markout) with immutable policy versioning. That alone prevents a large class of false model wins.