Aggressor-Side Misclassification and Toxicity-Label Drift in Slippage Models
Date: 2026-04-11
Category: research (execution / slippage modeling)
Why this playbook exists
A lot of execution stacks quietly assume they know which trades were buyer-initiated and which were seller-initiated.
That assumption leaks everywhere:
- short-horizon toxicity scores,
- markout labels,
- venue ranking,
- passive-vs-aggressive switching,
- anti-gaming logic,
- and post-trade slippage attribution.
The problem is that many production pipelines do not observe the true aggressor side directly. They infer it from:
- trade price vs quote midpoint,
- tick tests,
- Lee-Ready-style quote matching,
- bulk signed-volume heuristics,
- or vendor-normalized flags with unknown provenance.
When that inference is wrong, the damage is not limited to a noisy dashboard. It becomes a control error.
The model starts learning from mislabeled toxicity:
- toxic flow looks benign,
- benign flow looks toxic,
- passive fills get overcredited,
- aggressive crossings get blamed or forgiven for the wrong reasons,
- venue selection drifts toward the wrong microstructure regime.
This note turns that problem into a production modeling and control framework.
A useful companion note is:
2026-04-08-trade-sign-classification-tick-rule-quote-rule-lee-ready-bvc-playbook.md
That file is about how trade signing methods work. This file is about what happens when those labels feed live slippage and routing models.
The core failure mode
Let:
- (s^*_i \in {-1,+1}): true aggressor side of trade (i),
- (\hat{s}_i): inferred aggressor side used by your model,
- (v_i): signed or unsigned trade size,
- (m_i = 1[\hat{s}_i \neq s^*_i]): misclassification indicator.
Your model often builds features like:
- signed flow imbalance,
- recent aggressive buy/sell pressure,
- toxicity-conditioned fill hazard,
- markout by aggressor side,
- adverse-selection cost after passive fills.
But the model actually sees:
[ \hat{F}_t = \sum_{i \in W_t} \hat{s}_i v_i ]
instead of the flow you really care about:
[ F^*_t = \sum_{i \in W_t} s^*_i v_i ]
If sign error is random, the signal gets attenuated. If sign error is state-dependent, the signal becomes biased.
State-dependent error is the dangerous case. That happens when misclassification rises specifically during:
- high quote-update rates,
- midpoint / hidden-liquidity activity,
- fragmented/off-exchange reporting,
- odd-lot-heavy regimes,
- auctions and special-print windows,
- or timestamp skew between trade and quote paths.
In other words:
your labels are worst exactly when microstructure is hardest and slippage matters most.
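The attenuation-vs-bias contrast can be made concrete with a minimal simulation. This is a sketch under stated assumptions: the window construction, sizes, and error rates are all illustrative, not calibrated to any market.

```python
import random

def signed_flow(signs, sizes):
    """Signed flow imbalance: sum of side * size over a trade window."""
    return sum(s * v for s, v in zip(signs, sizes))

def flip(signs, p, rng):
    """Independent sign flips with probability p (a simple label-error model)."""
    return [-s if rng.random() < p else s for s in signs]

rng = random.Random(7)
n = 50_000
sizes = [100] * (2 * n)

# One window = a calm half (balanced flow, easy to sign) + a fast half
# (one-sided buy sweep, hard to sign). True imbalance comes entirely
# from the sweep.
calm = [1, -1] * (n // 2)
sweep = [1] * n
true_flow = signed_flow(calm + sweep, sizes)          # = n * 100

# Random 20% error everywhere: flow attenuates by roughly (1 - 2p) = 0.6.
obs_random = signed_flow(flip(calm + sweep, 0.20, rng), sizes)

# State-dependent error: 40% flips concentrated in the sweep. The imbalance
# the controller most needs shrinks to roughly (1 - 2*0.4) = 0.2 of truth.
obs_state = signed_flow(flip(calm, 0.02, rng) + flip(sweep, 0.40, rng), sizes)

atten_random = obs_random / true_flow   # roughly 0.6
atten_state = obs_state / true_flow     # roughly 0.2
```

The second case is the dangerous one: the measured imbalance collapses precisely in the regime that generates the cost.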
Why this hurts slippage models more than people expect
Trade-sign noise does not only degrade one feature. It contaminates the whole feedback loop.
1. Toxicity models underreact to real informed flow
If real buyer-initiated sweeps are partially mislabeled as sells or unknowns, the model underestimates buy-side pressure. Then:
- passive sell orders stay live too long,
- queue positions are held beyond their edge,
- adverse markouts rise,
- later catch-up aggression pays the bill.
2. Passive fills look safer than they really are
A passive fill is often evaluated against what happened right after it. If post-fill aggressive flow is mislabeled, the system understates adverse selection. That flatters passive tactics in backtests and shadow evaluation.
3. Aggressive routes learn the wrong venue map
A venue may look safe because its toxic flow is harder to sign correctly:
- more midpoint prints,
- more delayed reporting,
- more odd-lot activity,
- more quote/trade timing ambiguity.
The router then mistakes measurement weakness for venue quality.
4. Attribution shifts from model error to market error
When labels are noisy, teams often conclude:
- “the market was just jumpy,”
- “the book was weird,”
- “today’s markout was random,”
- “venue X became toxic.”
Sometimes the real answer is simpler:
your aggressor labels degraded, so the model stopped seeing toxicity correctly.
Mechanism map
1) Trade/quote clock skew
Classic quote-rule and Lee-Ready-style logic depend on the quote being the one the trade actually interacted with. If the quote stream is too early or too late relative to the trade print, you sign against the wrong midpoint.
That flips labels especially when:
- spreads are tight,
- quote updates are frequent,
- price moves cluster,
- or the trade happened near a rapidly moving touch.
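A minimal sketch of the quote rule with a tick-rule fallback makes the skew failure concrete. All prices here are invented for illustration; the point is that the same print, signed against two different quote snapshots, yields opposite labels.

```python
def quote_rule_sign(trade_px, bid, ask, prev_px):
    """Lee-Ready-style signing: quote rule, with a tick-test fallback at the mid.
    Returns +1 (buyer-initiated), -1 (seller-initiated), or 0 (unresolved)."""
    mid = 0.5 * (bid + ask)
    if trade_px > mid:
        return 1
    if trade_px < mid:
        return -1
    # At the mid: fall back to the tick test against the previous trade price.
    if trade_px > prev_px:
        return 1
    if trade_px < prev_px:
        return -1
    return 0

# A buy at the ask of the quote that was actually hit (100.00 / 100.02)...
s_aligned = quote_rule_sign(100.02, bid=100.00, ask=100.02, prev_px=100.01)
# ...signed against a slightly later quote after the touch moved up
# (100.02 / 100.04): the same print now lands below the mid.
s_skewed = quote_rule_sign(100.02, bid=100.02, ask=100.04, prev_px=100.01)
```

With aligned quotes the trade signs +1; with a few hundred microseconds of skew on a moving touch, the identical print signs -1.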
2) Midpoint and hidden-liquidity executions
A midpoint print is weak evidence of direction from price alone. If your fallback logic forces a buy/sell guess anyway, sign quality collapses in midpoint-heavy venues.
3) Odd-lot and sub-round-lot regimes
Odd lots can carry real informational content while interacting awkwardly with displayed-touch logic. A classifier that still mentally lives in a 100-share-touch world will misread modern flow.
4) Off-exchange and delayed reporting
If trade reports arrive late or in bursts, the quote state visible at arrival time may be unrelated to the economic state at execution time. That makes post hoc signing look cleaner than live signing.
5) Bulk classification used as trade-level truth
Bulk Volume Classification (BVC) can be useful for interval-level signed volume. It is not a per-trade aggressor oracle. Using it as trade-level truth poisons event-level toxicity and fill models.
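For contrast, an interval-level BVC-style split looks like the sketch below. It uses the standard normal CDF (published variants use a Student-t CDF), and it deliberately returns a volume fraction for the interval, not per-trade signs:

```python
import math

def bvc_buy_fraction(price_change, sigma):
    """Bulk Volume Classification sketch: fraction of an interval's volume
    attributed to buyers, via the standard normal CDF of the standardized
    price change. Interval-level only -- never a per-trade aggressor label."""
    z = price_change / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# An interval with a +1-sigma price move attributes ~84% of volume to buys.
frac_up = bvc_buy_fraction(0.05, sigma=0.05)
# A flat interval splits volume 50/50 regardless of what each trade did.
frac_flat = bvc_buy_fraction(0.0, sigma=0.05)
```

The flat-interval case shows why this cannot stand in for trade-level truth: an interval of offsetting toxic sweeps and a quiet interval produce the same 50/50 split.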
6) Corrections, cancels, and sale-condition drift
If a print is corrected, canceled, or reclassified later, your signed-flow history is rewritten after the model may already have acted on it. That creates label inconsistency across:
- live control,
- replay backtests,
- post-trade TCA,
- champion/challenger comparisons.
A more useful abstraction: sign quality is a latent state
Instead of pretending every trade label is equally trustworthy, define:
- (q_i = P(\hat{s}_i = s^*_i \mid x_i)): confidence that the inferred sign is correct,
- (\ell_i = \hat{s}_i q_i): confidence-weighted signed contribution.
Then your model can use:
[ \tilde{F}_t = \sum_{i \in W_t} \ell_i v_i ]
instead of naïve signed flow.
This immediately separates two very different situations:
- strong, trustworthy flow imbalance,
- apparent flow imbalance generated by weak sign evidence.
That difference matters operationally. A routing model should react strongly to the first and cautiously to the second.
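The confidence-weighted flow above reduces to a one-line aggregation. The sketch below (tuple layout assumed for illustration) shows how two windows with identical naive imbalance separate once confidence enters; note that an unbiased correction would weight by (2q - 1) rather than q, but this follows the definition of (\ell_i) given above:

```python
def confidence_weighted_flow(trades):
    """Confidence-weighted signed flow: each trade contributes s_hat * q * v,
    so low-confidence labels shrink toward zero instead of voting at full
    weight. `trades` is a list of (inferred_sign, confidence, size) tuples."""
    return sum(s * q * v for s, q, v in trades)

strong = [(+1, 0.95, 100)] * 10   # well-signed buy pressure (e.g. at-touch lifts)
weak = [(+1, 0.55, 100)] * 10     # midpoint-heavy guesses, barely above coin-flip

naive = sum(s * v for s, _, v in strong)      # = 1000 for either window
f_strong = confidence_weighted_flow(strong)   # ~950: react strongly
f_weak = confidence_weighted_flow(weak)       # ~550: react cautiously
```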
Cost decomposition
A practical decomposition is:
[ C_{\text{total}} = C_{\text{base}} + C_{\text{signal loss}} + C_{\text{wrong reaction}} + C_{\text{venue misrank}} + C_{\text{attribution drift}} ]
Where:
- (C_{\text{base}}): unavoidable cost from spread, impact, and volatility,
- (C_{\text{signal loss}}): reduced predictive power from noisy trade signs,
- (C_{\text{wrong reaction}}): live policy errors caused by sign noise,
- (C_{\text{venue misrank}}): routing cost from learning the wrong venue toxicity map,
- (C_{\text{attribution drift}}): TCA distortion from inconsistent or revised labels.
A simple way to think about the live piece:
[ C_{\text{wrong reaction}} \approx \kappa_1 \cdot |F^*_t - \tilde{F}_t| + \kappa_2 \cdot \text{policy flip rate} + \kappa_3 \cdot \text{false passive dwell time} ]
That last term matters a lot. A toxic flow signal that arrives late or inverted often does not cause one bad fill. It causes a sequence:
- keep passive order live too long,
- get negatively selected,
- cancel late,
- cross later in a worse book,
- attribute the pain to “market move” instead of label contamination.
The exact model bug people miss
Many teams validate trade signing by aggregate statistics:
- daily imbalance correlation,
- average signed volume agreement,
- or bar-level predictive power.
That is not enough.
A classifier can look acceptable in aggregate while still being disastrous in the exact subset that matters for execution:
- near-touch trades,
- short-horizon markout windows,
- high-update symbols,
- midpoint-heavy venues,
- open/close transitions,
- high-volatility bursts.
This is the same pathology as a slippage model with fine RMSE and terrible q95.
Average correctness is not the relevant KPI. Correctness in decision-critical regimes is.
Public grounding
A few public references make this problem very real:
- Lee-Ready-style trade classification explicitly depends on comparing a trade to a contemporaneous or lagged quote midpoint, which means clock alignment is foundational rather than cosmetic.
- Public summaries of Chakrabarty, Moulton, and Shkilko report trade-level misclassification rates around 31% with contemporaneous quotes and around 21% with a one-second quote lag, showing that the label problem can be material even before you get to modern hidden/midpoint fragmentation.
- Public summaries of later electronic-era comparison work report that Lee-Ready underperforms during intervals of high trade and/or quote frequency, which is exactly when live slippage control is least tolerant of label error.
- FINRA public trade-reporting rules and FAQs say OTC equity trades must be reported as soon as practicable, generally no later than 10 seconds after execution, so delayed trade visibility is structurally present in some market segments.
- Nasdaq public notices on odd-lot dissemination state odd-lot trades were included in volume statistics with a dedicated modifier, which is another reminder that statistical trade visibility and decision-grade touch semantics are not identical.
The punchline is simple:
trade sign is not a timeless label living inside the print. It is a reconstruction whose quality depends on clocks, venue semantics, and market regime.
Features that belong in a slippage stack
A. Sign-quality features
- quote_trade_delta_us,
- best_quote_age_us,
- midpoint_tie_flag,
- trade_at_touch_flag,
- tick_fallback_used,
- classifier_method (native, quote, lee_ready, tick, bvc, unknown),
- sign_confidence,
- quote_alignment_bucket.
B. Venue / print-semantics features
- odd_lot_flag,
- off_exchange_flag,
- sale_condition_bucket,
- midpoint_or_hidden_proxy,
- auction_or_cross_flag,
- late_report_flag,
- corrected_or_canceled_flag.
C. Market-state features
- spread,
- microprice,
- touch imbalance,
- quote update rate,
- trade rate,
- short-horizon realized vol,
- queue refill speed,
- phase (open, continuous, close, halt, reopen).
D. Strategy-state features
- passive dwell time,
- residual inventory,
- time-to-deadline,
- current aggression level,
- recent markout losses,
- venue concentration,
- fallback-policy dwell time.
The important idea:
label quality is itself a first-class feature. Do not hide it in preprocessing and pretend the downstream model is working with truth.
Metrics worth monitoring
1. SIR — Sign Inversion Rate
Estimated rate at which the inferred sign disagrees with better ground truth on a benchmark subset.
Break it out by:
- venue,
- liquidity bucket,
- spread bucket,
- time of day,
- quote-age bucket,
- print type.
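Bucketed SIR is a small aggregation once a benchmark subset with trusted signs exists. The sketch below assumes a flat record layout and illustrative bucket names:

```python
from collections import defaultdict

def sign_inversion_rate(records):
    """SIR per bucket: share of trades whose inferred sign disagrees with a
    higher-confidence benchmark sign. `records` is a list of
    (bucket, inferred_sign, benchmark_sign) tuples."""
    counts = defaultdict(lambda: [0, 0])   # bucket -> [disagreements, total]
    for bucket, s_hat, s_true in records:
        counts[bucket][0] += int(s_hat != s_true)
        counts[bucket][1] += 1
    return {b: d / t for b, (d, t) in counts.items()}

records = [
    ("lit_wide_spread", +1, +1), ("lit_wide_spread", -1, -1),
    ("lit_wide_spread", +1, +1), ("lit_wide_spread", +1, -1),
    ("midpoint_heavy", +1, -1), ("midpoint_heavy", -1, +1),
    ("midpoint_heavy", +1, +1), ("midpoint_heavy", -1, +1),
]
sir = sign_inversion_rate(records)
# {"lit_wide_spread": 0.25, "midpoint_heavy": 0.75}
```

The same aggregate accuracy can hide exactly this kind of split, which is why the breakdown dimensions above are not optional.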
2. QAD — Quote Alignment Drift
Distribution of trade-to-quote timing mismatch used by the classifier.
If QAD shifts, sign quality may shift even when market behavior itself does not.
3. MDS — Method Dispersion Share
Share of labels produced by each method:
- quote rule,
- tick fallback,
- midpoint guess,
- vendor flag,
- unknown.
A rising fallback share is often an early-warning indicator.
4. MSD — Markout Sign Disagreement
Compare markout statistics when grouped by inferred sign vs higher-confidence sign on a labeled subset.
This measures how much toxicity inference is being bent by label noise.
5. VCR — Venue Contamination Ratio
Fraction of flow for a venue that lands in low-confidence sign buckets.
This catches the classic failure mode where a venue looks “safer” only because your labels are weaker there.
6. PFD — Passive Fill Distortion
Difference in expected passive-fill markout using naïve signs vs confidence-aware signs.
This is the business metric that often reveals the problem fastest.
State machine for live control
LABEL_CLEAN
- Sign-quality metrics stable.
- Toxicity models trusted normally.
- Standard passive/aggressive switching allowed.
QUOTE_SKEWED
Triggered when quote-age / trade-quote alignment deteriorates.
- Downweight trade-sign-driven features.
- Increase use of quote-age and spread signals directly.
- Reduce confidence in micro-horizon toxicity flips.
HIDDEN_FLOW_HEAVY
Triggered when midpoint / odd-lot / non-standard print share rises.
- Treat trade-sign features as partial evidence only.
- Prefer venue-aware and fill-aware signals.
- Avoid overreacting to inferred signed imbalance alone.
OFFEX_DELAYED
Triggered when delayed reporting or correction activity rises.
- Separate live routing features from replay/TCA labels.
- Do not let repaired historical signs drive immediate control changes.
- Use confidence floors before adapting venue scores.
SAFE_ABSTAIN
Triggered when sign quality is persistently poor.
- Freeze sign-sensitive venue ranking updates.
- Fall back to more robust features: spread, depth, queue age, realized markout, fill hazard, time-to-deadline.
- Prefer policies that are less label-sensitive.
A bad sign regime should degrade into safer, simpler control, not into random overfitting.
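One way to implement the state machine is a priority-ordered rule over the monitoring metrics, with the most defensive state checked first. The thresholds below are placeholders that would need calibration, not recommended values:

```python
def label_regime(metrics):
    """Map sign-quality monitoring metrics to a control state.
    Checked in priority order: the most defensive trigger wins."""
    if metrics["sir_estimate"] > 0.35:
        return "SAFE_ABSTAIN"          # persistent poor sign quality
    if metrics["late_report_share"] > 0.15:
        return "OFFEX_DELAYED"         # delayed reporting / corrections rising
    if metrics["midpoint_share"] > 0.30:
        return "HIDDEN_FLOW_HEAVY"     # midpoint / non-standard print share up
    if metrics["median_quote_age_us"] > 5_000:
        return "QUOTE_SKEWED"          # trade-quote alignment deteriorating
    return "LABEL_CLEAN"

state = label_regime({
    "sir_estimate": 0.05, "late_report_share": 0.02,
    "midpoint_share": 0.45, "median_quote_age_us": 800,
})   # -> "HIDDEN_FLOW_HEAVY"
```

The ordering encodes the degrade-to-simpler principle: when multiple triggers fire, the controller lands in the more conservative state.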
Modeling blueprint
Layer 1 — Preserve label provenance
For every signed trade, store at least:
- raw trade timestamp,
- raw quote timestamp used,
- venue/source,
- method used,
- lag used,
- sign,
- confidence,
- special-condition flags.
If you do not store provenance, you will never debug label drift cleanly.
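One way to persist this is a flat record per signed trade. The schema below is a hypothetical minimal layout to show the shape, not a standard; field names and types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SignedTradeRecord:
    """Minimal provenance record for one signed trade (illustrative schema)."""
    trade_ts_ns: int              # raw trade timestamp
    quote_ts_ns: Optional[int]    # raw timestamp of the quote used for signing
    venue: str                    # venue / source identifier
    method: str                   # native | quote | lee_ready | tick | bvc | unknown
    quote_lag_ns: int             # lag applied between trade and quote clocks
    sign: int                     # +1 / -1 / 0 (unknown)
    confidence: float             # estimated P(sign is correct)
    condition_flags: Tuple[str, ...]  # sale conditions, odd-lot, correction markers

rec = SignedTradeRecord(
    trade_ts_ns=1_700_000_000_000_000_000,
    quote_ts_ns=1_699_999_999_999_200_000,
    venue="XNAS", method="lee_ready", quote_lag_ns=800_000,
    sign=+1, confidence=0.82, condition_flags=("odd_lot",),
)
```

Freezing the record matters: live-as-of provenance should be immutable, with hindsight repairs stored as separate records rather than in-place rewrites.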
Layer 2 — Build a confidence model
Estimate:
[ q_i = P(\hat{s}_i = s^*_i \mid x_i) ]
using benchmark subsets such as:
- native aggressor flags where available,
- MBO/order-level reconstruction,
- trusted venue-native data,
- or carefully curated manual subsets.
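A deliberately simple confidence model is an empirical agreement table by feature bucket, fit on the benchmark subset. The sketch below assumes a bucketed record layout; the min-count fallback to an uninformative 0.5 prior is one choice among several:

```python
from collections import defaultdict

def fit_confidence_table(benchmark, min_count=50):
    """Empirical confidence model: P(inferred sign == benchmark sign) per
    feature bucket, estimated on a subset with trusted signs.
    `benchmark` is a list of (bucket, inferred_sign, true_sign) tuples."""
    agree = defaultdict(lambda: [0, 0])    # bucket -> [agreements, total]
    for bucket, s_hat, s_true in benchmark:
        agree[bucket][0] += int(s_hat == s_true)
        agree[bucket][1] += 1
    # Thin buckets fall back to a conservative coin-flip prior.
    return {b: (a / t if t >= min_count else 0.5) for b, (a, t) in agree.items()}

bench = ([("at_touch", +1, +1)] * 90 + [("at_touch", +1, -1)] * 10
         + [("midpoint", +1, +1)] * 30 + [("midpoint", +1, -1)] * 30)
q = fit_confidence_table(bench)
# at-touch trades sign reliably (q = 0.9); midpoint prints are coin flips (0.5)
```

In production this would be a richer model over the sign-quality features listed earlier, but even a calibration table like this already separates trustworthy and untrustworthy flow.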
Layer 3 — Replace hard labels with soft labels
Instead of training on (\hat{s}_i) alone, use:
- confidence-weighted signs,
- unknown / abstain states,
- or explicit multi-class labels (buy, sell, unknown).
Layer 4 — Train toxicity and markout models conditional on sign quality
Predict:
[ E[C \mid x_t, \tilde{F}_t, q_t] ]
not merely:
[ E[C \mid x_t, \hat{F}_t] ]
That lets the model learn different responses for:
- strong true imbalance,
- weakly supported imbalance,
- and ambiguity-heavy flow.
Layer 5 — Freeze or regularize online adaptation under label stress
If venue scores adapt online from mislabeled toxicity, the router will drift the wrong way. Use:
- slower learning rates,
- confidence-gated updates,
- shadow learning first,
- and abstention when label quality is degraded.
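A confidence-gated update can be as small as an EMA with a floor and a scaled learning rate. The sketch below uses illustrative parameter values; the floor and base rate are assumptions to be tuned:

```python
def update_venue_score(score, markout, confidence, base_lr=0.05, q_floor=0.7):
    """Confidence-gated EMA update of a venue toxicity score.
    Below the confidence floor the update is skipped entirely (abstention);
    above it, the learning rate is scaled down as confidence falls."""
    if confidence < q_floor:
        return score                  # abstain: do not learn from weak labels
    lr = base_lr * confidence         # slower learning under residual doubt
    return (1 - lr) * score + lr * markout

s0 = 1.0
s1 = update_venue_score(s0, markout=3.0, confidence=0.9)  # moves toward 3.0
s2 = update_venue_score(s1, markout=3.0, confidence=0.4)  # frozen: unchanged
```

The abstention branch is the important part: during label-stress windows the venue map stops moving instead of drifting toward whatever the broken labels suggest.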
Practical policy rules
Rule 1: unknown is a valid label
Forcing weak guesses is often worse than carrying uncertainty.
Rule 2: keep trade signing and slippage modeling loosely coupled
Signing logic will change. If your whole feature store assumes one eternal sign classifier, future fixes become painful and misleading.
Rule 3: benchmark subsets must match decision-critical regimes
A high-confidence benchmark only on calm large-cap lit flow is not enough. You need validation where the live model bleeds:
- fast symbols,
- fragmented venues,
- midpoint-heavy periods,
- open/close transitions.
Rule 4: venue quality and label quality are different things
If a venue looks safe, ask whether it is actually safe or just hard to sign.
Rule 5: TCA must distinguish live-as-of labels from hindsight-repaired labels
Otherwise replay studies flatter the model by grading it with cleaner labels than the live controller had.
30-day rollout plan
Week 1 — Instrument label provenance
- Persist sign method, lag, quote age, and confidence metadata.
- Baseline MDS, QAD, and confidence distribution by venue and time of day.
- Separate live labels from hindsight-repaired labels in storage.
Week 2 — Build benchmark subsets
- Identify venues / datasets with stronger aggressor truth.
- Estimate SIR by regime.
- Rank where misclassification is operationally worst, not merely most frequent.
Week 3 — Confidence-aware shadow models
- Replace hard signed-flow features with confidence-weighted versions in shadow.
- Add SAFE_ABSTAIN logic for label-stress windows.
- Compare passive-fill markout and venue ranking drift vs baseline.
Week 4 — Controlled activation
- Gate online learning updates by sign confidence.
- Clip policy changes when label quality deteriorates.
- Optimize PFD and venue-misrank cost, not just classifier accuracy.
Common anti-patterns
- Treating one trade-signing rule as permanent truth.
- Validating sign quality only on daily aggregates.
- Letting BVC-style interval inference leak into trade-level labels.
- Ignoring midpoint / hidden-flow ambiguity and forcing a guess.
- Learning venue toxicity maps without adjusting for label quality.
- Mixing live labels and hindsight-repaired labels in the same evaluation.
- Blaming “market regime change” before checking whether label quality changed first.
What good looks like
A production execution stack should be able to answer:
- Which trades were signed with strong evidence vs weak evidence?
- How does sign quality change by venue, symbol, and time regime?
- How much passive-fill markout worsens when sign confidence is low?
- Does a venue look safe because of real outcomes, or because its flow is hard to classify?
- Does online adaptation slow down or become reckless during label-stress windows?
If you cannot answer those, your toxicity model may be learning from mislabeled flow.
And mislabeled flow is one of the cleanest ways to pay real slippage for imaginary signal.
Selected public references
- Lee, C. M. C. and Ready, M. J. (1991), Inferring Trade Direction from Intraday Data — the classic quote-rule + tick-fallback framework that makes quote alignment central to trade signing.
- Chakrabarty, B., Moulton, P. C., and Shkilko, A., Short Sales, Long Sales, and the Lee-Ready Trade Classification Algorithm Revisited — public summaries report trade-level misclassification around 31% with contemporaneous quotes and around 21% with a one-second lag.
- Panayides, M., Shohfi, T., and Smith, J., Comparing Trade Flow Classification Algorithms in the Electronic Era — public summaries report Lee-Ready underperforming during high trade/quote frequency intervals and highlight the distinction between trade-level and bulk classification use cases.
- FINRA public trade-reporting rules / FAQ — public guidance that trades must be reported as soon as practicable, generally within 10 seconds, which makes delayed visibility structurally relevant for some prints.
- Nasdaq odd-lot dissemination notice — odd-lot trades included in volume statistical calculations with dedicated modifiers, underscoring the gap between visible print statistics and decision-grade touch semantics.
Bottom line
Trade-sign classification error is not just a data-cleaning nuisance.
It is a slippage-model contamination channel.
When aggressor-side labels degrade, toxicity features attenuate or invert, passive fills get misgraded, venue rankings drift, and online adaptation starts learning from the wrong market.
The right response is not “pick one better classifier and forget it.” It is:
- preserve label provenance,
- model sign confidence explicitly,
- separate live labels from hindsight labels,
- gate control changes by label quality,
- and degrade gracefully when trade signing becomes unreliable.
In short:
before you trust signed-flow alpha, make sure you trust the signs.