Label-Maturity Delay and Fake-Negative Drift in Online Slippage Models
Date: 2026-04-10
Category: research (execution / slippage modeling)
Why this playbook exists
A live slippage stack does not observe its labels all at once.
Some outcomes arrive immediately:
- spread at send,
- ACK latency,
- first fill price,
- initial queue reaction.
Other outcomes arrive later:
- full completion vs timeout,
- cancel-confirmation,
- post-fill markout,
- parent-order cleanup cost,
- drop-copy / correction / busted-trade adjustments,
- end-of-parent implementation shortfall.
If you continuously retrain on the freshest data without respecting label maturity, you create a quiet but severe bias:
- unresolved passive orders look like no-fill negatives,
- incomplete parents look artificially cheap,
- tail losses have not landed yet,
- late corrections rewrite yesterday’s “ground truth”.
The result is a model that looks fresh, but systematically underprices urgency, underestimates tails, and over-recommends patience.
This note gives a production blueprint for dealing with that problem.
The core failure mode: immature labels masquerading as truth
Suppose a child order is posted at 09:30:00.100.
At 09:30:00.400 you may know:
- it has not filled yet,
- the book moved slightly away,
- the quote is still live.
But you do not yet know:
- whether it fills at 09:30:01.000,
- whether you cancel and cross at 09:30:01.300,
- whether the parent misses its schedule and pays cleanup cost,
- whether a later correction changes the true fill sequence,
- whether the 5s/30s markout is toxic.
If your online learner treats that 300ms-old order as a clean negative example, it is learning from a fake negative.
In production, this usually causes three distortions:
- Fill models underestimate eventual fill probability for passive quotes.
- Cost models understate cleanup / catch-up cost because unfinished parents are still in flight.
- Calibration drifts optimistic because bad outcomes arrive later than easy outcomes.
What “label maturity” should mean
For execution modeling, a label should be versioned by as-of time and maturity state.
For each child or parent order, store at least:
- event_time: when the action happened,
- label_asof_time: when the label snapshot was materialized,
- maturity_state: how complete the outcome is,
- finalization_reason: why the label is considered final,
- revision_count: how many times the label changed,
- pending_components[]: markout / correction / completion / reconciliation still open.
A practical maturity ladder:
1. OPEN_UNCERTAIN
The order/parent is still live or downstream state is unresolved.
2. PARTIAL_OBSERVED
Some components are known (e.g. first fill, current residual), but completion and/or markout are still pending.
3. SOFT_FINAL
Trading outcome is mostly known, but late corrections / drop-copy / venue reconciliation may still alter it.
4. HARD_FINAL
All required components for the label are closed:
- fills/cancels reconciled,
- parent completion resolved,
- chosen markout horizons observed,
- corrections window passed,
- fee/rebate mapping frozen.
Only HARD_FINAL should be treated as canonical truth for retrospective model evaluation.
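The ladder above is naturally an ordered enum. A minimal sketch (names are illustrative, not a prescribed schema):

```python
from enum import IntEnum

class MaturityState(IntEnum):
    """Maturity ladder; integer ordering encodes 'at least as mature as'."""
    OPEN_UNCERTAIN = 0
    PARTIAL_OBSERVED = 1
    SOFT_FINAL = 2
    HARD_FINAL = 3

def usable_for_eval(state: MaturityState) -> bool:
    # Only HARD_FINAL labels count as canonical truth for retrospective evaluation.
    return state == MaturityState.HARD_FINAL
```

Using an ordered enum makes gates like "at least SOFT_FINAL" a simple comparison.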
Decompose the problem: one label is usually the wrong abstraction
Do not train “slippage” as one monolithic target.
In production you usually need several targets with different maturity clocks:
Fill-hazard target
Time-to-first-fill / time-to-full-fill / timeout / cancel.
Immediate execution-cost target
Spread-crossing + short-horizon price move around send/fill.
Completion-cost target
Residual inventory + forced catch-up + missed-schedule penalty.
Post-trade TCA target
Final implementation shortfall with fees/rebates/corrections.
Toxicity / markout target
1s / 5s / 30s / 120s post-fill markout.
Each target matures on a different clock.
If you collapse them into one number too early, online training becomes a race between fast labels and correct labels.
Production pattern: a maturity-aware label table
A reliable schema usually looks like this:
execution_outcome_versions
- order_id / parent_id
- version_id
- event_time
- label_asof_time
- maturity_state
- fill_qty_observed
- residual_qty_observed
- realized_cost_bps_partial
- expected_remaining_cost_bps
- final_cost_bps_nullable
- markout_1s_bps_nullable
- markout_5s_bps_nullable
- markout_30s_bps_nullable
- correction_pending_bool
- reconciliation_pending_bool
- fees_final_bool
- revision_source
Important rule:
never overwrite immature labels in place without keeping history.
You want to be able to answer:
- what the model knew at training time,
- how often labels were revised,
- whether optimism came from immature supervision.
Modeling blueprint
Layer A: delay / maturity model
First, model when labels become reliable.
For each target, estimate:
[ P(M_h = 1 \mid x, a, t) ]
where:
- (M_h) = label matured by horizon (h),
- (x) = market + order state,
- (a) = action (join/improve/take/pause/reroute),
- (t) = session time / venue phase.
Useful features:
- venue / symbol / liquidity bucket,
- order type + TIF,
- passive depth level and queue proxy,
- session phase (open, midday, close, halt, reopen),
- message-rate congestion,
- correction incidence by venue,
- whether parent has hedge coupling or basket dependency.
This gives you a maturity propensity. Later you can use it to decide whether to train now, wait, or apply correction weights.
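A simple starting point, before any learned model, is an empirical maturity propensity per bucket: the fraction of labels that reached HARD_FINAL within a horizon. A sketch (bucket keys and field names are assumptions):

```python
from collections import defaultdict

def maturity_propensity(events, horizon_s):
    """Empirical P(label matured within horizon_s) per (venue, phase) bucket.

    events: dicts with 'venue', 'phase', and 'maturity_lag_s' = seconds from
    event_time to HARD_FINAL, or None if the label is not yet final.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [matured, total]
    for e in events:
        bucket = (e["venue"], e["phase"])
        counts[bucket][1] += 1
        lag = e["maturity_lag_s"]
        if lag is not None and lag <= horizon_s:
            counts[bucket][0] += 1
    return {b: matured / total for b, (matured, total) in counts.items()}
```

A learned model over the richer feature set above is the natural next step; the empirical table is the baseline it has to beat.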
Layer B: competing-risks outcome model
For passive and semi-passive routing, use a competing-risks view:
- fill,
- cancel,
- replace/queue-reset,
- timeout,
- forced-cross / cleanup.
This is often more faithful than a single binary fill label.
Conceptually:
[ \lambda_k(\tau \mid x), \quad k \in \{\text{fill, cancel, timeout, cleanup}\} ]
Then couple the hazards with expected downstream cost.
Why this matters:
- a quote that has not filled yet may still be attractive,
- a quote that survives too long may increase cleanup cost convexly,
- fake negatives mostly arise when the model ignores the time-to-event structure.
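A minimal nonparametric estimate of the cause-specific hazards: in each time bin, count outcomes of each type among orders still at risk. This sketch ignores right-censoring, which a production version would have to handle:

```python
from collections import Counter

def cause_specific_hazards(observations, bin_edges):
    """Discrete-time cause-specific hazard estimates.

    observations: list of (duration_s, outcome), outcome in
    {"fill", "cancel", "timeout", "cleanup"}.
    bin_edges: sorted right edges of the time bins.
    Returns, per bin, {outcome: events / at-risk count}.
    """
    hazards = []
    at_risk = list(observations)
    lo = 0.0
    for hi in bin_edges:
        n = len(at_risk)
        if n == 0:
            hazards.append({})
            continue
        in_bin = [o for d, o in at_risk if lo <= d < hi]
        hazards.append({k: v / n for k, v in Counter(in_bin).items()})
        at_risk = [(d, o) for d, o in at_risk if d >= hi]
        lo = hi
    return hazards
```

Coupling these hazards with the expected downstream cost of each outcome gives the competing-risks view described above.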
Layer C: partial-label nowcast
When labels are immature, do not force a hard 0/1 or final-bps target.
Use a decomposition like:
[ E[C_{final} \mid \mathcal{F}_{asof}] = C_{observed} + E[C_{remaining} \mid \mathcal{F}_{asof}] ]
where:
- (C_{observed}): cost already realized,
- (E[C_{remaining}\mid\mathcal{F}_{asof}]): model-based estimate of unresolved future cost.
This lets you build two separate objects:
- provisional real-time cost nowcast for live control,
- final retrospective label for truth and evaluation.
Do not confuse them.
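One way to realize the decomposition is a quantity-weighted blend of realized cost and a model estimate for the residual. A sketch; the remaining-cost estimate is a stand-in for whatever model produces E[C_remaining | F_asof]:

```python
def cost_nowcast(observed_cost_bps, residual_qty, total_qty,
                 expected_remaining_cost_bps):
    """Provisional nowcast: realized cost plus estimated unresolved cost.

    observed_cost_bps: cost realized on the filled portion so far.
    expected_remaining_cost_bps: model-based estimate for the residual,
    e.g. from a cleanup/catch-up cost model (an assumption here).
    """
    if total_qty <= 0:
        raise ValueError("total_qty must be positive")
    done_frac = 1.0 - residual_qty / total_qty
    # Quantity-weighted blend: realized cost on done qty, expected on residual.
    return done_frac * observed_cost_bps + (1.0 - done_frac) * expected_remaining_cost_bps
```

The nowcast drives live control; the hard-final label, when it lands, is what evaluation and promotion use.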
Layer D: maturity-aware training policy
For each target, choose one of three policies.
Policy 1: hard-gate training
Train only on labels with sufficient maturity.
Good for:
- final IS,
- long-horizon markouts,
- fee/rebate-sensitive labels.
Trade-off: less freshness, lower bias.
Policy 2: weighted training
Train on partially matured labels but down-weight by unresolved risk:
[ w_i = f\big(P(M_h=1 \mid x_i)\big) ]
Good for:
- near-term fill hazards,
- short-horizon provisional cost models.
Trade-off: fresher, but needs careful calibration.
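The weight function f is a design choice. A hedged sketch with two illustrative knobs, a floor and a power, neither of which is prescribed by the note:

```python
def maturity_weight(p_matured, floor=0.05, power=1.0):
    """Training weight w_i = f(P(M_h = 1 | x_i)).

    f here is a clipped power law: `floor` keeps immature examples from
    vanishing entirely, `power` controls how hard immaturity is penalized.
    Both are tuning parameters, not calibrated values.
    """
    p = min(max(p_matured, 0.0), 1.0)
    return max(floor, p ** power)
```

Whatever f you pick, recheck calibration by label age after introducing it; the weights change the effective training distribution.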
Policy 3: dual-stream training
Maintain:
- a fresh provisional stream on immature labels,
- a slow truth stream on hard-final labels.
Then reconcile via calibration or teacher-student updates.
This is often the best production compromise.
A practical state machine for live desks
Use a desk-friendly state machine rather than academic purity.
GREEN
- label maturity lag within expected range,
- revision rate normal,
- q95 coverage stable by label age.
YELLOW_DELAYED
- recent data too immature,
- online learner switches to reduced freshness / lower learning rate,
- controller widens uncertainty premium.
ORANGE_REVISIONS
- correction rate / reconciliation drift elevated,
- freeze promotion of new model artifacts,
- route more conservatively on affected venues.
RED_TRUTH_UNSTABLE
- late-label revisions materially change cost tails,
- stop automatic online updates,
- serve last trusted calibration map only.
This turns label maturity into an explicit operational object.
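The state machine can be a plain function over the monitoring metrics defined below. Thresholds here are illustrative placeholders, not calibrated values:

```python
def maturity_regime(median_lag_s, expected_lag_s,
                    revision_ratio, tail_premium_bps):
    """Map maturity diagnostics to the desk state machine.

    Checks run from most to least severe so the worst condition wins.
    """
    if tail_premium_bps > 5.0:
        return "RED_TRUTH_UNSTABLE"   # late revisions materially move cost tails
    if revision_ratio > 0.10:
        return "ORANGE_REVISIONS"     # correction/reconciliation drift elevated
    if median_lag_s > 2.0 * expected_lag_s:
        return "YELLOW_DELAYED"       # recent data too immature
    return "GREEN"
```

Evaluating severity top-down means a venue can be simultaneously delayed and revision-heavy and still land in the stricter state.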
Metrics that catch the problem early
If you only monitor MAE on final labels, you will miss the drift.
Track these instead:
1. Label-age calibration curves
For the same prediction bucket, compare calibration when labels are:
- 1m old,
- 15m old,
- end-of-session,
- T+1 finalized.
If the curve worsens monotonically with age, you are probably learning fake negatives.
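A coarse version of this check: compare mean predicted vs realized outcome restricted to labels of at least a given age. A sketch with a hypothetical record layout:

```python
def calibration_gap_by_label_age(records, ages_s):
    """Realized-minus-predicted gap for labels at least `age` seconds old.

    records: list of (prediction, outcome, label_age_s).
    A gap that grows with age is the fake-negative signature: fresh labels
    look fine, matured labels reveal the optimism.
    """
    gaps = {}
    for age in ages_s:
        sub = [(p, y) for p, y, a in records if a >= age]
        if not sub:
            gaps[age] = None
            continue
        mean_pred = sum(p for p, _ in sub) / len(sub)
        mean_real = sum(y for _, y in sub) / len(sub)
        gaps[age] = mean_real - mean_pred  # positive: outcomes worse than predicted
    return gaps
```

In practice you would bucket by prediction decile as well; this collapses to the overall gap for brevity.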
2. Revision ratio
[ \text{revision ratio} = \frac{\text{labels changed after first materialization}}{\text{all labels}} ]
Bucket by venue, session phase, and order type.
3. Tail maturation premium
[ \Delta q95 = q95_{\text{hard-final}} - q95_{\text{fresh-asof}} ]
This directly measures how much tail cost is arriving late.
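The tail premium is cheap to monitor. A sketch using a nearest-rank quantile, which is crude but adequate for a drift alarm:

```python
def tail_maturation_premium(fresh_costs_bps, final_costs_bps, q=0.95):
    """Delta-q95: q-quantile of hard-final costs minus q-quantile of fresh
    as-of costs, i.e. how much tail cost arrives after first materialization."""
    def quantile(xs, q):
        xs = sorted(xs)
        # nearest-rank quantile; fine for monitoring, not for inference
        idx = min(len(xs) - 1, int(q * len(xs)))
        return xs[idx]
    return quantile(final_costs_bps, q) - quantile(fresh_costs_bps, q)
```

A persistently positive premium on a venue is a direct reason to enter ORANGE_REVISIONS for that venue.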
4. Maturity-lag distribution
Median / p90 / p99 time until hard-final for each target.
5. Fake-negative rate
Among examples initially marked as no-fill / cheap / benign, what fraction later become:
- filled,
- costly cleanup,
- adverse markout,
- corrected/busted.
6. Fresh-vs-final regret gap
Evaluate the strategy using the freshest available labels and the hard-final labels. If the final regret is much worse, your online loop is too optimistic.
What to do in the data pipeline
1. Snapshot labels by as-of time
Materialize labels as they would have existed at:
- +1s,
- +10s,
- +1m,
- end-of-parent,
- end-of-session,
- T+1/T+n reconciliation.
This gives you a full maturity surface.
2. Separate “unknown” from “zero”
Examples:
- no_fill_yet is not will_not_fill,
- markout_pending is not markout = 0,
- fees_not_final is not fees = 0.
This one distinction prevents a lot of silent bias.
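In feature code, the distinction means encoding pending values as (value, known) pairs rather than silently coercing None to zero. A minimal sketch:

```python
from typing import Optional

def markout_feature(markout_bps: Optional[float]) -> tuple[float, float]:
    """Encode a possibly-pending markout without conflating unknown and zero.

    Returns (value, is_known): a pending markout becomes (0.0, 0.0), while a
    true zero-bps markout becomes (0.0, 1.0), so the model can tell them apart.
    """
    if markout_bps is None:
        return 0.0, 0.0
    return float(markout_bps), 1.0
```

The same pattern applies to fees, completion flags, and any other component that can still be open at training time.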
3. Backfill without leakage
When late labels arrive, update the truth table, but do not let future-only information leak into historical feature snapshots.
4. Record the label source
Was the update driven by:
- direct execution feed,
- drop copy,
- exchange correction,
- broker reconciliation,
- manual desk intervention?
Source-specific revision patterns are often the first clue.
What to do in the model layer
1. Serve uncertainty, not only point estimates
For immature labels, widen predictive bands or add a maturity premium.
2. Penalize immaturity-sensitive actions
If a tactic depends on labels known to mature slowly, do not let the freshest data overrule the stable prior too quickly.
3. Calibrate by label age
A model can be well calibrated on T+1 truth and badly calibrated on “fresh” labels used for online learning. Track both.
4. Prefer champion/challenger on finalized truth
Fresh provisional wins are interesting, but promotions should depend mainly on hard-final performance.
Failure patterns to avoid
- Treating unresolved passive orders as negatives.
- Using end-of-parent cost before the parent is actually done.
- Mixing provisional and final labels in one unversioned table.
- Measuring calibration only on fresh labels.
- Ignoring late corrections and fee-code revisions.
- Letting online training outrun truth stabilization.
- Using future reconciliation info in historical feature generation.
Minimal implementation checklist
- Every target has an explicit maturity definition.
- Labels are versioned by label_asof_time and maturity_state.
- Unknown / pending values are distinct from zero / negative outcomes.
- Fresh provisional models and hard-final evaluation are separated.
- Revision-rate and tail-maturation metrics are monitored by venue and phase.
- Online training policy changes when truth becomes unstable.
- Model promotion depends on finalized truth, not freshest apparent win.
References (useful mental models)
- Joulani, György, Szepesvári (2013), Online Learning under Delayed Feedback. https://proceedings.mlr.press/v28/joulani13.html
- Wang et al. (2020), Delayed Feedback Modeling for the Entire Space Conversion Rate Prediction. https://arxiv.org/abs/2011.11826
- Rosales et al. (2019), Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction. https://arxiv.org/abs/1907.06558
- Lo, MacKinlay, Zhang (1997), Econometric Models of Limit-Order Executions. https://www.nber.org/papers/w6257
- Maglaras, Moallemi, Wang (2021), A Deep Learning Approach to Estimating Fill Probabilities in a Limit Order Book. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3897438
- Arroyo, Cartea, Moreno-Pino, Zohren (2023), Deep Attentive Survival Analysis in Limit Order Books. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4432087
Bottom line
In execution research, “freshest data” is not the same thing as “most truthful data.”
A slippage model that ignores label maturity will usually learn the wrong lesson:
- passive quotes look too safe,
- cleanup risk looks too small,
- tails look tamer than they really are.
The cure is not to stop learning online.
It is to make label maturity a first-class object:
- version labels by as-of time,
- model delay explicitly,
- separate provisional nowcasts from final truth,
- promote models on stabilized outcomes.
That is how you stay adaptive without training on lies.