Label-Maturity Delay and Fake-Negative Drift in Online Slippage Models

Date: 2026-04-10
Category: research (execution / slippage modeling)

Why this playbook exists

A live slippage stack does not observe its labels all at once.

Some outcomes arrive immediately: the venue acknowledgement, an instant fill or reject, the first partial fill.

Other outcomes arrive later: full completion or cancel/timeout resolution, post-fill markouts, fees and rebates, and late venue corrections.

If you continuously retrain on the freshest data without respecting label maturity, you create a quiet but severe bias: slow, expensive outcomes are systematically missing from the training stream at the moment you learn from it.

The result is a model that looks fresh, but systematically underprices urgency, underestimates tails, and over-recommends patience.

This note gives a production blueprint for dealing with that problem.


The core failure mode: immature labels masquerading as truth

Suppose a child order is posted at 09:30:00.100.

At 09:30:00.400 you may know: the acknowledgement, the current quote, and the fact that no fill has arrived yet.

But you do not yet know: whether the order eventually fills, what the parent's completion cost will be, or how the post-fill markouts resolve.

If your online learner treats that 300ms-old order as a clean negative example, it is learning from a fake negative.

In production, this usually causes three distortions:

  1. Fill models underestimate eventual fill probability for passive quotes.
  2. Cost models understate cleanup / catch-up cost because unfinished parents are still in flight.
  3. Calibration drifts optimistic because bad outcomes arrive later than easy outcomes.

What “label maturity” should mean

For execution modeling, a label should be versioned by as-of time and maturity state.

For each child or parent order, store at least: the as-of time of the observation, the current maturity state, and which label components are observed versus still pending.

A practical maturity ladder:

1. OPEN_UNCERTAIN

The order/parent is still live or downstream state is unresolved.

2. PARTIAL_OBSERVED

Some components are known (e.g. first fill, current residual), but completion and/or markout are still pending.

3. SOFT_FINAL

Trading outcome is mostly known, but late corrections / drop-copy / venue reconciliation may still alter it.

4. HARD_FINAL

All required components for the label are closed: fills, cancels, fees and rebates, corrections, and every markout horizon you train on.

Only HARD_FINAL should be treated as canonical truth for retrospective model evaluation.


Decompose the problem: one label is usually the wrong abstraction

Do not train “slippage” as one monolithic target.

In production you usually need several targets with different maturity clocks:

  1. Fill-hazard target
    Time-to-first-fill / time-to-full-fill / timeout / cancel.

  2. Immediate execution-cost target
    Spread-crossing + short-horizon price move around send/fill.

  3. Completion-cost target
    Residual inventory + forced catch-up + missed-schedule penalty.

  4. Post-trade TCA target
    Final implementation shortfall with fees/rebates/corrections.

  5. Toxicity / markout target
    1s / 5s / 30s / 120s post-fill markout.

Each target matures on a different clock.

If you collapse them into one number too early, online training becomes a race between fast labels and correct labels.


Production pattern: a maturity-aware label table

A reliable schema usually looks like this:

execution_outcome_versions
- order_id / parent_id
- version_id
- event_time
- label_asof_time
- maturity_state
- fill_qty_observed
- residual_qty_observed
- realized_cost_bps_partial
- expected_remaining_cost_bps
- final_cost_bps_nullable
- markout_1s_bps_nullable
- markout_5s_bps_nullable
- markout_30s_bps_nullable
- correction_pending_bool
- reconciliation_pending_bool
- fees_final_bool
- revision_source

Important rule:

never overwrite immature labels in place without keeping history.
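A minimal sketch of the append-only rule, assuming an in-memory store keyed by order id (class and field names are illustrative, loosely following the schema above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LabelVersion:
    order_id: str
    version_id: int
    label_asof_time: float          # epoch seconds of the as-of snapshot
    maturity_state: str             # OPEN_UNCERTAIN .. HARD_FINAL
    realized_cost_bps_partial: float
    final_cost_bps: Optional[float] = None

class LabelStore:
    """Append-only store: revisions create new versions, never overwrite."""

    def __init__(self):
        self._versions: dict[str, list[LabelVersion]] = {}

    def append(self, v: LabelVersion) -> None:
        history = self._versions.setdefault(v.order_id, [])
        if history and v.version_id <= history[-1].version_id:
            raise ValueError("version_id must be strictly increasing")
        history.append(v)

    def asof(self, order_id: str, asof_time: float) -> Optional[LabelVersion]:
        """Latest version visible at asof_time (what a model would have seen)."""
        visible = [v for v in self._versions.get(order_id, [])
                   if v.label_asof_time <= asof_time]
        return visible[-1] if visible else None

    def latest(self, order_id: str) -> Optional[LabelVersion]:
        history = self._versions.get(order_id, [])
        return history[-1] if history else None
```

Because versions are never overwritten, the `asof` query reproduces exactly what the learner saw at any point in time.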

You want to be able to answer: what did this label look like at any given as-of time, and which revision source changed it?


Modeling blueprint

Layer A: delay / maturity model

First model when labels become reliable.

For each target, estimate:

[ P(M_h = 1 \mid x, a, t) ]

where M_h indicates that the target reaches hard-final maturity within horizon h, x are order and venue features, a is the routing action, and t is the time of day.

Useful features: venue, order type, session phase, how much of the parent remains, and the target's historical maturity-lag distribution.

This gives you a maturity propensity. Later you can use it to decide whether to train now, wait, or apply correction weights.
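One way to bootstrap this propensity is a plain empirical estimate by bucket before fitting anything fancier; a sketch, assuming records carry send and hard-final timestamps (bucket keys and field names are illustrative):

```python
from collections import defaultdict

def maturity_propensity(records, horizon_s):
    """Empirical P(label is hard-final within horizon_s of send),
    bucketed by (venue, order_type).

    records: iterable of dicts with 'venue', 'order_type', 'send_time',
             and 'hard_final_time' (None if not yet observed as final).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["venue"], r["order_type"])
        totals[key] += 1
        matured = (r["hard_final_time"] is not None
                   and r["hard_final_time"] - r["send_time"] <= horizon_s)
        if matured:
            hits[key] += 1
    return {k: hits[k] / totals[k] for k in totals}
```

The bucketed table doubles as a sanity check on any parametric maturity model you fit later.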


Layer B: competing-risks outcome model

For passive and semi-passive routing, use a competing-risks view: each live order races toward one of several terminal events (fill, cancel, timeout, cleanup), and a still-open order is censored exposure, not a negative.

This is often more faithful than a single binary fill label.

Conceptually:

[ \lambda_k(\tau \mid x), \quad k \in \{\text{fill, cancel, timeout, cleanup}\} ]

Then couple the hazards with expected downstream cost.

Why this matters: still-open orders contribute censored exposure to the hazards instead of being coerced into a fake negative class, and each terminal event can carry its own downstream cost.
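A dependency-free sketch of the standard Aalen-Johansen cumulative-incidence estimator for this setup, where censored (still-open) orders carry `None` as their cause (the input shape is an assumption):

```python
def cumulative_incidence(samples, horizon):
    """Cumulative incidence per terminal cause at `horizon` (Aalen-Johansen).

    samples: list of (duration, cause) pairs; cause is one of
             'fill' / 'cancel' / 'timeout' / 'cleanup', or None for a
             right-censored (still-open) order.
    """
    events = sorted(samples, key=lambda s: s[0])
    n_at_risk = len(events)
    surv = 1.0                      # overall event-free survival S(t-)
    cif = {}
    i = 0
    while i < len(events) and events[i][0] <= horizon:
        t = events[i][0]
        d = {}                      # cause-specific event counts at time t
        n_removed = 0
        while i < len(events) and events[i][0] == t:
            cause = events[i][1]
            if cause is not None:   # censored orders leave the risk set only
                d[cause] = d.get(cause, 0) + 1
            n_removed += 1
            i += 1
        for cause, dk in d.items():
            cif[cause] = cif.get(cause, 0.0) + surv * dk / n_at_risk
        surv *= 1.0 - sum(d.values()) / n_at_risk
        n_at_risk -= n_removed
    return cif
```

Note how the censored order at an intermediate time shrinks the risk set without ever counting as a non-fill, which is exactly the fake-negative fix.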


Layer C: partial-label nowcast

When labels are immature, do not force a hard 0/1 or final-bps target.

Use a decomposition like:

[ E[C_{final} \mid \mathcal{F}_{asof}] = C_{observed} + E[C_{remaining} \mid \mathcal{F}_{asof}] ]

where C_{observed} is the realized cost so far and E[C_{remaining} \mid \mathcal{F}_{asof}] is a model nowcast for the unfinished residual, conditioned on the as-of filtration.

This lets you build two separate objects:

  1. provisional real-time cost nowcast for live control,
  2. final retrospective label for truth and evaluation.

Do not confuse them.
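A minimal sketch of the decomposition, assuming costs are in bps and blending realized and modeled cost by quantity weights (the weighting convention is an assumption; pick one and keep it consistent):

```python
def cost_nowcast(realized_cost_bps, filled_qty, total_qty, remaining_cost_model_bps):
    """Nowcast E[C_final | F_asof] = C_observed + E[C_remaining | F_asof].

    realized_cost_bps: cost realized on the filled part so far.
    remaining_cost_model_bps: model estimate for the unfilled residual.
    Blended by quantity weights so the result is in bps of the full parent.
    """
    if total_qty <= 0:
        raise ValueError("total_qty must be positive")
    residual_qty = total_qty - filled_qty
    w_obs = filled_qty / total_qty
    w_rem = residual_qty / total_qty
    return w_obs * realized_cost_bps + w_rem * remaining_cost_model_bps
```

When the parent completes, the model term drops out and the nowcast collapses to the realized cost, which is what lets the same quantity serve live control without contaminating the truth table.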


Layer D: maturity-aware training policy

For each target, choose one of three policies.

Policy 1: hard-gate training

Train only on labels with sufficient maturity.

Good for: canonical truth tables, retrospective evaluation, and champion/challenger promotion decisions.

Trade-off: less freshness, lower bias.
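The gate itself is a one-liner over the maturity ladder; a sketch, assuming examples carry the `maturity_state` field from the schema above:

```python
# Rank the maturity ladder so states are comparable.
MATURITY_RANK = {"OPEN_UNCERTAIN": 0, "PARTIAL_OBSERVED": 1,
                 "SOFT_FINAL": 2, "HARD_FINAL": 3}

def hard_gate(examples, min_state="HARD_FINAL"):
    """Keep only examples whose label maturity meets the gate."""
    threshold = MATURITY_RANK[min_state]
    return [e for e in examples
            if MATURITY_RANK[e["maturity_state"]] >= threshold]
```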

Policy 2: weighted training

Train on partially matured labels but down-weight by unresolved risk:

[ w_i = f\big(P(M_h=1 \mid x_i)\big) ]

Good for: fast-moving regimes where freshness matters and the maturity propensities are well estimated.

Trade-off: fresher, but needs careful calibration.
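One simple family of weight functions f, assuming the Layer-A propensity is available per example (the floor and exponent are illustrative knobs that need calibration):

```python
def maturity_weight(p_mature, floor=0.05, power=1.0):
    """Down-weight an example by its unresolved-label risk.

    w = max(floor, p_mature ** power), where p_mature is the Layer-A
    estimate P(M_h = 1 | x). The floor keeps some gradient flowing
    from very fresh examples instead of silently discarding them.
    """
    if not 0.0 <= p_mature <= 1.0:
        raise ValueError("p_mature must be a probability")
    return max(floor, p_mature ** power)
```

Raising `power` above 1 pushes the policy toward the hard gate; lowering it toward 0 pushes it toward plain online training.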

Policy 3: dual-stream training

Maintain: a fast provisional stream trained on fresh, partially matured labels, and a slow canonical stream trained only on hard-final labels.

Then reconcile via calibration or teacher-student updates.

This is often the best production compromise.


A practical state machine for live desks

Use a desk-friendly state machine rather than academic purity.

GREEN

Labels are maturing on schedule; normal online training.

YELLOW_DELAYED

Maturity lags are elevated; down-weight or delay fresh examples.

ORANGE_REVISIONS

The revision ratio is elevated; gate training on the affected targets.

RED_TRUTH_UNSTABLE

Reconciliation or corrections are still moving the truth; freeze online updates and serve the stable prior.

This turns label maturity into an explicit operational object.
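A sketch of the state transition logic, assuming the maturity-lag and revision-ratio diagnostics defined later in this note (all thresholds are illustrative):

```python
def label_health_state(median_lag_s, expected_lag_s,
                       revision_ratio, reconciliation_open):
    """Map maturity diagnostics to the desk state machine."""
    if reconciliation_open:
        return "RED_TRUTH_UNSTABLE"   # truth still moving: freeze online updates
    if revision_ratio > 0.05:
        return "ORANGE_REVISIONS"     # labels flipping after first materialization
    if median_lag_s > 2.0 * expected_lag_s:
        return "YELLOW_DELAYED"       # labels slower than usual: down-weight fresh data
    return "GREEN"                    # maturing on schedule: normal online training
```

Checking the states in severity order makes the machine monotone: a worse condition always wins, which is what you want an operator to trust.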


Metrics that catch the problem early

If you only monitor MAE on final labels, you will miss the drift.

Track these instead:

1. Label-age calibration curves

For the same prediction bucket, compare calibration when labels are: fresh, partially matured, and hard-final.

If the curve worsens monotonically with age, you are probably learning fake negatives.

2. Revision ratio

[ \text{revision ratio} = \frac{\text{labels changed after first materialization}}{\text{all labels}} ]

Bucket by venue, session phase, and order type.
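Against the versioned label table, the revision ratio falls out of a version count; a sketch, assuming one row per label with its number of versions (the input shape and bucket function are assumptions):

```python
from collections import defaultdict

def revision_ratio(versions_by_label, bucket_of):
    """Fraction of labels revised after first materialization, per bucket.

    versions_by_label: {label_id: number_of_versions}; more than one
    version means the label changed after it first materialized.
    bucket_of: label_id -> bucket key, e.g. (venue, session_phase, order_type).
    """
    changed = defaultdict(int)
    total = defaultdict(int)
    for label_id, n_versions in versions_by_label.items():
        b = bucket_of(label_id)
        total[b] += 1
        if n_versions > 1:
            changed[b] += 1
    return {b: changed[b] / total[b] for b in total}
```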

3. Tail maturation premium

[ \Delta q_{95} = q_{95}^{\text{hard-final}} - q_{95}^{\text{fresh as-of}} ]

This directly measures how much tail cost is arriving late.
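A dependency-free sketch of the premium, using a simple nearest-rank quantile (the quantile convention is an assumption; use your stack's standard estimator in production):

```python
def tail_maturation_premium(final_costs_bps, fresh_costs_bps, q=0.95):
    """Delta q95 = q95(hard-final costs) - q95(fresh as-of costs).

    Positive values mean tail cost is arriving late, i.e. fresh labels
    are hiding the expensive outcomes.
    """
    def quantile(xs, q):
        s = sorted(xs)
        idx = min(len(s) - 1, int(q * len(s)))   # nearest-rank, zero-based
        return s[idx]
    return quantile(final_costs_bps, q) - quantile(fresh_costs_bps, q)
```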

4. Maturity-lag distribution

Median / p90 / p99 time until hard-final for each target.

5. Fake-negative rate

Among examples initially marked as no-fill / cheap / benign, what fraction later become: fills, expensive cleanups, or adverse markouts?

6. Fresh-vs-final regret gap

Evaluate the strategy using the freshest available labels and the hard-final labels. If the final regret is much worse, your online loop is too optimistic.
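A minimal sketch of the gap as an error comparison, using mean absolute error as the illustrative metric (the same predictions are scored against both label vintages):

```python
def regret_gap(pred_costs_bps, fresh_labels_bps, final_labels_bps):
    """Error against hard-final labels minus error against fresh labels.

    A strongly positive gap means the online loop is grading itself
    against an optimistic answer key.
    """
    def mae(preds, labels):
        return sum(abs(p - l) for p, l in zip(preds, labels)) / len(labels)
    return mae(pred_costs_bps, final_labels_bps) - mae(pred_costs_bps, fresh_labels_bps)
```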


What to do in the data pipeline

1. Snapshot labels by as-of time

Materialize labels as they would have existed at: a ladder of as-of horizons, from seconds after send through end of day to post-reconciliation T+1.

This gives you a full maturity surface.
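A sketch of building that surface for a single label from its chronologically sorted revisions (the input shape is an assumption; `None` deliberately marks "not yet materialized", not zero):

```python
def maturity_surface(revisions, send_time, horizons):
    """Label value as it would have existed at each as-of horizon.

    revisions: chronologically sorted (asof_time, value) pairs for one label.
    Returns {horizon: value_or_None}; None means no label had materialized
    yet at that horizon.
    """
    surface = {}
    for h in horizons:
        cutoff = send_time + h
        visible = [v for t, v in revisions if t <= cutoff]
        surface[h] = visible[-1] if visible else None
    return surface
```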

2. Separate “unknown” from “zero”

Examples: an unresolved passive order has an unknown fill outcome, not a zero fill; a parent still in flight has an unknown completion cost, not a zero cleanup cost.

This one distinction prevents a lot of silent bias.
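The distinction is cheap to encode; a sketch using `None` as the explicit unknown marker (function and field names are illustrative):

```python
from typing import Optional

def fill_label(filled_qty: Optional[float], order_open: bool) -> Optional[float]:
    """Return a trainable fill label, or None when the truth is not yet knowable.

    An unresolved passive order is unknown, not a zero-fill negative;
    only a closed order with no fills yields a genuine 0.0.
    """
    if order_open:
        return None                 # immature: exclude or route to the nowcast path
    return filled_qty if filled_qty is not None else 0.0
```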

3. Backfill without leakage

When late labels arrive, update the truth table, but do not let future-only information leak into historical feature snapshots.

4. Record the label source

Was the update driven by: a venue message, a drop-copy feed, fee reconciliation, or a manual correction?

Source-specific revision patterns are often the first clue.


What to do in the model layer

1. Serve uncertainty, not only point estimates

For immature labels, widen predictive bands or add a maturity premium.

2. Penalize immaturity-sensitive actions

If a tactic depends on labels known to mature slowly, do not let the freshest data overrule the stable prior too quickly.

3. Calibrate by label age

A model can be well calibrated on T+1 truth and badly calibrated on “fresh” labels used for online learning. Track both.

4. Prefer champion/challenger on finalized truth

Fresh provisional wins are interesting, but promotions should depend mainly on hard-final performance.


Failure patterns to avoid

  1. Treating unresolved passive orders as negatives.
  2. Using end-of-parent cost before the parent is actually done.
  3. Mixing provisional and final labels in one unversioned table.
  4. Measuring calibration only on fresh labels.
  5. Ignoring late corrections and fee-code revisions.
  6. Letting online training outrun truth stabilization.
  7. Using future reconciliation info in historical feature generation.

Minimal implementation checklist

- Versioned, append-only label table with explicit maturity states.
- As-of snapshots of every label; no in-place overwrites.
- Unknown outcomes kept distinct from zero outcomes.
- Maturity-gated or maturity-weighted training chosen per target.
- Calibration tracked by label age, not only on final truth.
- Revision-ratio, maturity-lag, and fake-negative-rate monitors.
- Champion/challenger promotion decided on hard-final performance.



Bottom line

In execution research, “freshest data” is not the same thing as “most truthful data.”

A slippage model that ignores label maturity will usually learn the wrong lesson: that passive patience is cheaper and safer than it really is, because the expensive outcomes have not arrived yet.

The cure is not to stop learning online.

It is to make label maturity a first-class object: versioned labels, explicit maturity states, age-aware training policies, and age-aware evaluation.

That is how you stay adaptive without training on lies.