Label-Maturity Delay and Fake-Negative Drift in Online Slippage Models
Date: 2026-04-10
Category: research (execution / slippage modeling)
Why this playbook exists
A live slippage stack does not observe its labels all at once.
Some outcomes arrive immediately:
- spread at send,
- ACK latency,
- first fill price,
- initial queue reaction.
Other outcomes arrive later:
- full completion vs timeout,
- cancel-confirmation,
- post-fill markout,
- parent-order cleanup cost,
- drop-copy / correction / busted-trade adjustments,
- end-of-parent implementation shortfall.
If you continuously retrain on the freshest data without respecting label maturity, you create a quiet but severe bias:
- unresolved passive orders look like no-fill negatives,
- incomplete parents look artificially cheap,
- tail losses have not landed yet,
- late corrections rewrite yesterday’s “ground truth”.
The result is a model that looks fresh, but systematically underprices urgency, underestimates tails, and over-recommends patience.
This note gives a production blueprint for dealing with that problem.
The core failure mode: immature labels masquerading as truth
Suppose a child order is posted at 09:30:00.100.
At 09:30:00.400 you may know:
- it has not filled yet,
- the book moved slightly away,
- the quote is still live.
But you do not yet know:
- whether it fills at 09:30:01.000,
- whether you cancel and cross at 09:30:01.300,
- whether the parent misses its schedule and pays cleanup cost,
- whether a later correction changes the true fill sequence,
- whether the 5s/30s markout is toxic.
If your online learner treats that 300ms-old order as a clean negative example, it is learning from a fake negative.
In production, this usually causes three distortions:
- Fill models underestimate eventual fill probability for passive quotes.
- Cost models understate cleanup / catch-up cost because unfinished parents are still in flight.
- Calibration drifts optimistic because bad outcomes arrive later than easy outcomes.
What “label maturity” should mean
For execution modeling, a label should be versioned by as-of time and maturity state.
For each child or parent order, store at least:
- event_time: when the action happened,
- label_asof_time: when the label snapshot was materialized,
- maturity_state: how complete the outcome is,
- finalization_reason: why the label is considered final,
- revision_count: how many times the label changed,
- pending_components[]: markout / correction / completion / reconciliation still open.
A practical maturity ladder:
1. OPEN_UNCERTAIN
The order/parent is still live or downstream state is unresolved.
2. PARTIAL_OBSERVED
Some components are known (e.g. first fill, current residual), but completion and/or markout are still pending.
3. SOFT_FINAL
Trading outcome is mostly known, but late corrections / drop-copy / venue reconciliation may still alter it.
4. HARD_FINAL
All required components for the label are closed:
- fills/cancels reconciled,
- parent completion resolved,
- chosen markout horizons observed,
- corrections window passed,
- fee/rebate mapping frozen.
Only HARD_FINAL should be treated as canonical truth for retrospective model evaluation.
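The ladder above is naturally an ordered enum. A minimal sketch (names are illustrative, not a prescribed schema):

```python
from enum import IntEnum

class MaturityState(IntEnum):
    """Maturity ladder; integer ordering encodes 'at least as mature as'."""
    OPEN_UNCERTAIN = 0
    PARTIAL_OBSERVED = 1
    SOFT_FINAL = 2
    HARD_FINAL = 3

def usable_for_eval(state: MaturityState) -> bool:
    # Only HARD_FINAL labels count as canonical truth for retrospective evaluation.
    return state == MaturityState.HARD_FINAL
```

Using an ordered enum makes gates like "at least SOFT_FINAL" a simple comparison.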
Decompose the problem: one label is usually the wrong abstraction
Do not train “slippage” as one monolithic target.
In production you usually need several targets with different maturity clocks:
Fill-hazard target
Time-to-first-fill / time-to-full-fill / timeout / cancel.
Immediate execution-cost target
Spread-crossing + short-horizon price move around send/fill.
Completion-cost target
Residual inventory + forced catch-up + missed-schedule penalty.
Post-trade TCA target
Final implementation shortfall with fees/rebates/corrections.
Toxicity / markout target
1s / 5s / 30s / 120s post-fill markout.
Each target matures on a different clock.
If you collapse them into one number too early, online training becomes a race between fast labels and correct labels.
Production pattern: a maturity-aware label table
A reliable schema usually looks like this:
execution_outcome_versions
- order_id / parent_id
- version_id
- event_time
- label_asof_time
- maturity_state
- fill_qty_observed
- residual_qty_observed
- realized_cost_bps_partial
- expected_remaining_cost_bps
- final_cost_bps_nullable
- markout_1s_bps_nullable
- markout_5s_bps_nullable
- markout_30s_bps_nullable
- correction_pending_bool
- reconciliation_pending_bool
- fees_final_bool
- revision_source
Important rule:
never overwrite immature labels in place without keeping history.
You want to be able to answer:
- what the model knew at training time,
- how often labels were revised,
- whether optimism came from immature supervision.
Modeling blueprint
Layer A: delay / maturity model
First, model when labels become reliable.
For each target, estimate:
[ P(M_h = 1 \mid x, a, t) ]
where:
- (M_h) = label matured by horizon (h),
- (x) = market + order state,
- (a) = action (join/improve/take/pause/reroute),
- (t) = session time / venue phase.
Useful features:
- venue / symbol / liquidity bucket,
- order type + TIF,
- passive depth level and queue proxy,
- session phase (open, midday, close, halt, reopen),
- message-rate congestion,
- correction incidence by venue,
- whether parent has hedge coupling or basket dependency.
This gives you a maturity propensity. Later you can use it to decide whether to train now, wait, or apply correction weights.
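A simple starting point, before any learned model, is an empirical maturity propensity per bucket: the fraction of labels that reached HARD_FINAL within a horizon. A sketch (bucket keys and field names are assumptions):

```python
from collections import defaultdict

def maturity_propensity(events, horizon_s):
    """Empirical P(label matured within horizon_s) per (venue, phase) bucket.

    events: dicts with 'venue', 'phase', and 'maturity_lag_s' = seconds from
    event_time to HARD_FINAL, or None if the label is not yet final.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [matured, total]
    for e in events:
        bucket = (e["venue"], e["phase"])
        counts[bucket][1] += 1
        lag = e["maturity_lag_s"]
        if lag is not None and lag <= horizon_s:
            counts[bucket][0] += 1
    return {b: matured / total for b, (matured, total) in counts.items()}
```

A learned model over the richer feature set above is the natural next step; the empirical table is the baseline it has to beat.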
Layer B: competing-risks outcome model
For passive and semi-passive routing, use a competing-risks view:
- fill,
- cancel,
- replace/queue-reset,
- timeout,
- forced-cross / cleanup.
This is often more faithful than a single binary fill label.
Conceptually:
[ \lambda_k(\tau \mid x), \quad k \in \{\text{fill, cancel, timeout, cleanup}\} ]
Then couple the hazards with expected downstream cost.
Why this matters:
- a quote that has not filled yet may still be attractive,
- a quote that survives too long may increase cleanup cost convexly,
- fake negatives mostly arise when the model ignores the time-to-event structure.
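A minimal nonparametric estimate of the cause-specific hazards: in each time bin, count outcomes of each type among orders still at risk. This sketch ignores right-censoring, which a production version would have to handle:

```python
from collections import Counter

def cause_specific_hazards(observations, bin_edges):
    """Discrete-time cause-specific hazard estimates.

    observations: list of (duration_s, outcome), outcome in
    {"fill", "cancel", "timeout", "cleanup"}.
    bin_edges: sorted right edges of the time bins.
    Returns, per bin, {outcome: events / at-risk count}.
    """
    hazards = []
    at_risk = list(observations)
    lo = 0.0
    for hi in bin_edges:
        n = len(at_risk)
        if n == 0:
            hazards.append({})
            continue
        in_bin = [o for d, o in at_risk if lo <= d < hi]
        hazards.append({k: v / n for k, v in Counter(in_bin).items()})
        at_risk = [(d, o) for d, o in at_risk if d >= hi]
        lo = hi
    return hazards
```

Coupling these hazards with the expected downstream cost of each outcome gives the competing-risks view described above.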
Layer C: partial-label nowcast
When labels are immature, do not force a hard 0/1 or final-bps target.
Use a decomposition like:
[ E[C_{final} \mid \mathcal{F}_{asof}] = C_{observed} + E[C_{remaining} \mid \mathcal{F}_{asof}] ]
where:
- (C_{observed}): cost already realized,
- (E[C_{remaining}\mid\mathcal{F}_{asof}]): model-based estimate of unresolved future cost.
This lets you build two separate objects:
- provisional real-time cost nowcast for live control,
- final retrospective label for truth and evaluation.
Do not confuse them.
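One way to realize the decomposition is a quantity-weighted blend of realized cost and a model estimate for the residual. A sketch; the remaining-cost estimate is a stand-in for whatever model produces E[C_remaining | F_asof]:

```python
def cost_nowcast(observed_cost_bps, residual_qty, total_qty,
                 expected_remaining_cost_bps):
    """Provisional nowcast: realized cost plus estimated unresolved cost.

    observed_cost_bps: cost realized on the filled portion so far.
    expected_remaining_cost_bps: model-based estimate for the residual,
    e.g. from a cleanup/catch-up cost model (an assumption here).
    """
    if total_qty <= 0:
        raise ValueError("total_qty must be positive")
    done_frac = 1.0 - residual_qty / total_qty
    # Quantity-weighted blend: realized cost on done qty, expected on residual.
    return done_frac * observed_cost_bps + (1.0 - done_frac) * expected_remaining_cost_bps
```

The nowcast drives live control; the hard-final label, when it lands, is what evaluation and promotion use.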
Layer D: maturity-aware training policy
For each target, choose one of three policies.
Policy 1: hard-gate training
Train only on labels with sufficient maturity.
Good for:
- final IS,
- long-horizon markouts,
- fee/rebate-sensitive labels.
Trade-off: less freshness, lower bias.
Policy 2: weighted training
Train on partially matured labels but down-weight by unresolved risk:
[ w_i = f\big(P(M_h=1 \mid x_i)\big) ]
Good for:
- near-term fill hazards,
- short-horizon provisional cost models.
Trade-off: fresher, but needs careful calibration.
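The weight function f is a design choice. A hedged sketch with two illustrative knobs, a floor and a power, neither of which is prescribed by the note:

```python
def maturity_weight(p_matured, floor=0.05, power=1.0):
    """Training weight w_i = f(P(M_h = 1 | x_i)).

    f here is a clipped power law: `floor` keeps immature examples from
    vanishing entirely, `power` controls how hard immaturity is penalized.
    Both are tuning parameters, not calibrated values.
    """
    p = min(max(p_matured, 0.0), 1.0)
    return max(floor, p ** power)
```

Whatever f you pick, recheck calibration by label age after introducing it; the weights change the effective training distribution.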
Policy 3: dual-stream training
Maintain:
- a fresh provisional stream on immature labels,
- a slow truth stream on hard-final labels.
Then reconcile via calibration or teacher-student updates.
This is often the best production compromise.
A practical state machine for live desks
Use a desk-friendly state machine rather than academic purity.
GREEN
- label maturity lag within expected range,
- revision rate normal,
- q95 coverage stable by label age.
YELLOW_DELAYED
- recent data too immature,
- online learner switches to reduced freshness / lower learning rate,
- controller widens uncertainty premium.
ORANGE_REVISIONS
- correction rate / reconciliation drift elevated,
- freeze promotion of new model artifacts,
- route more conservatively on affected venues.
RED_TRUTH_UNSTABLE
- late-label revisions materially change cost tails,
- stop automatic online updates,
- serve last trusted calibration map only.
This turns label maturity into an explicit operational object.
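The state machine can be a plain function over the monitoring metrics defined below. Thresholds here are illustrative placeholders, not calibrated values:

```python
def maturity_regime(median_lag_s, expected_lag_s,
                    revision_ratio, tail_premium_bps):
    """Map maturity diagnostics to the desk state machine.

    Checks run from most to least severe so the worst condition wins.
    """
    if tail_premium_bps > 5.0:
        return "RED_TRUTH_UNSTABLE"   # late revisions materially move cost tails
    if revision_ratio > 0.10:
        return "ORANGE_REVISIONS"     # correction/reconciliation drift elevated
    if median_lag_s > 2.0 * expected_lag_s:
        return "YELLOW_DELAYED"       # recent data too immature
    return "GREEN"
```

Evaluating severity top-down means a venue can be simultaneously delayed and revision-heavy and still land in the stricter state.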
Metrics that catch the problem early
If you only monitor MAE on final labels, you will miss the drift.
Track these instead:
1. Label-age calibration curves
For the same prediction bucket, compare calibration when labels are:
- 1m old,
- 15m old,
- end-of-session,
- T+1 finalized.
If the curve worsens monotonically with age, you are probably learning fake negatives.
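A coarse version of this check: compare mean predicted vs realized outcome restricted to labels of at least a given age. A sketch with a hypothetical record layout:

```python
def calibration_gap_by_label_age(records, ages_s):
    """Realized-minus-predicted gap for labels at least `age` seconds old.

    records: list of (prediction, outcome, label_age_s).
    A gap that grows with age is the fake-negative signature: fresh labels
    look fine, matured labels reveal the optimism.
    """
    gaps = {}
    for age in ages_s:
        sub = [(p, y) for p, y, a in records if a >= age]
        if not sub:
            gaps[age] = None
            continue
        mean_pred = sum(p for p, _ in sub) / len(sub)
        mean_real = sum(y for _, y in sub) / len(sub)
        gaps[age] = mean_real - mean_pred  # positive: outcomes worse than predicted
    return gaps
```

In practice you would bucket by prediction decile as well; this collapses to the overall gap for brevity.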
2. Revision ratio
[ \text{revision ratio} = \frac{\text{labels changed after first materialization}}{\text{all labels}} ]
Bucket by venue, session phase, and order type.
3. Tail maturation premium
[ \Delta q95 = q95_{\text{hard-final}} - q95_{\text{fresh-asof}} ]
This directly measures how much tail cost is arriving late.
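The tail premium is cheap to monitor. A sketch using a nearest-rank quantile, which is crude but adequate for a drift alarm:

```python
def tail_maturation_premium(fresh_costs_bps, final_costs_bps, q=0.95):
    """Delta-q95: q-quantile of hard-final costs minus q-quantile of fresh
    as-of costs, i.e. how much tail cost arrives after first materialization."""
    def quantile(xs, q):
        xs = sorted(xs)
        # nearest-rank quantile; fine for monitoring, not for inference
        idx = min(len(xs) - 1, int(q * len(xs)))
        return xs[idx]
    return quantile(final_costs_bps, q) - quantile(fresh_costs_bps, q)
```

A persistently positive premium on a venue is a direct reason to enter ORANGE_REVISIONS for that venue.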
4. Maturity-lag distribution
Median / p90 / p99 time until hard-final for each target.
5. Fake-negative rate
Among examples initially marked as no-fill / cheap / benign, what fraction later become:
- filled,
- costly cleanup,
- adverse markout,
- corrected/busted.
6. Fresh-vs-final regret gap
Evaluate the strategy using the freshest available labels and the hard-final labels. If the final regret is much worse, your online loop is too optimistic.
What to do in the data pipeline
1. Snapshot labels by as-of time
Materialize labels as they would have existed at:
- +1s,
- +10s,
- +1m,
- end-of-parent,
- end-of-session,
- T+1/T+n reconciliation.
This gives you a full maturity surface.
2. Separate “unknown” from “zero”
Examples:
- no_fill_yet is not will_not_fill,
- markout_pending is not markout = 0,
- fees_not_final is not fees = 0.
This one distinction prevents a lot of silent bias.
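In feature code, the distinction means encoding pending values as (value, known) pairs rather than silently coercing None to zero. A minimal sketch:

```python
from typing import Optional

def markout_feature(markout_bps: Optional[float]) -> tuple[float, float]:
    """Encode a possibly-pending markout without conflating unknown and zero.

    Returns (value, is_known): a pending markout becomes (0.0, 0.0), while a
    true zero-bps markout becomes (0.0, 1.0), so the model can tell them apart.
    """
    if markout_bps is None:
        return 0.0, 0.0
    return float(markout_bps), 1.0
```

The same pattern applies to fees, completion flags, and any other component that can still be open at training time.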
3. Backfill without leakage
When late labels arrive, update the truth table, but do not let future-only information leak into historical feature snapshots.
4. Record the label source
Was the update driven by:
- direct execution feed,
- drop copy,
- exchange correction,
- broker reconciliation,
- manual desk intervention?
Source-specific revision patterns are often the first clue.
What to do in the model layer
1. Serve uncertainty, not only point estimates
For immature labels, widen predictive bands or add a maturity premium.
2. Penalize immaturity-sensitive actions
If a tactic depends on labels known to mature slowly, do not let the freshest data overrule the stable prior too quickly.
3. Calibrate by label age
A model can be well calibrated on T+1 truth and badly calibrated on “fresh” labels used for online learning. Track both.
4. Prefer champion/challenger on finalized truth
Fresh provisional wins are interesting, but promotions should depend mainly on hard-final performance.
Failure patterns to avoid
- Treating unresolved passive orders as negatives.
- Using end-of-parent cost before the parent is actually done.
- Mixing provisional and final labels in one unversioned table.
- Measuring calibration only on fresh labels.
- Ignoring late corrections and fee-code revisions.
- Letting online training outrun truth stabilization.
- Using future reconciliation info in historical feature generation.
Minimal implementation checklist
- Every target has an explicit maturity definition.
- Labels are versioned by label_asof_time and maturity_state.
- Unknown / pending values are distinct from zero / negative outcomes.
- Fresh provisional models and hard-final evaluation are separated.
- Revision-rate and tail-maturation metrics are monitored by venue and phase.
- Online training policy changes when truth becomes unstable.
- Model promotion depends on finalized truth, not freshest apparent win.
References (useful mental models)
- Joulani, György, Szepesvári (2013), Online Learning under Delayed Feedback. https://proceedings.mlr.press/v28/joulani13.html
- Wang et al. (2020), Delayed Feedback Modeling for the Entire Space Conversion Rate Prediction. https://arxiv.org/abs/2011.11826
- Rosales et al. (2019), Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR prediction. https://arxiv.org/abs/1907.06558
- Lo, MacKinlay, Zhang (1997), Econometric Models of Limit-Order Executions. https://www.nber.org/papers/w6257
- Maglaras, Moallemi, Wang (2021), A Deep Learning Approach to Estimating Fill Probabilities in a Limit Order Book. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3897438
- Arroyo, Cartea, Moreno-Pino, Zohren (2023), Deep Attentive Survival Analysis in Limit Order Books. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4432087
Bottom line
In execution research, “freshest data” is not the same thing as “most truthful data.”
A slippage model that ignores label maturity will usually learn the wrong lesson:
- passive quotes look too safe,
- cleanup risk looks too small,
- tails look tamer than they really are.
The cure is not to stop learning online.
It is to make label maturity a first-class object:
- version labels by as-of time,
- model delay explicitly,
- separate provisional nowcasts from final truth,
- promote models on stabilized outcomes.
That is how you stay adaptive without training on lies.