Aggressor-Side Misclassification and Toxicity-Label Drift in Slippage Models
Date: 2026-04-11
Category: research (execution / slippage modeling)
Why this playbook exists
A lot of execution stacks quietly assume they know which trades were buyer-initiated and which were seller-initiated.
That assumption leaks everywhere:
- short-horizon toxicity scores,
- markout labels,
- venue ranking,
- passive-vs-aggressive switching,
- anti-gaming logic,
- and post-trade slippage attribution.
The problem is that many production pipelines do not observe the true aggressor side directly. They infer it from:
- trade price vs quote midpoint,
- tick tests,
- Lee-Ready-style quote matching,
- bulk signed-volume heuristics,
- or vendor-normalized flags with unknown provenance.
When that inference is wrong, the damage is not limited to a noisy dashboard. It becomes a control error.
The model starts learning from mislabeled toxicity:
- toxic flow looks benign,
- benign flow looks toxic,
- passive fills get overcredited,
- aggressive crossings get blamed or forgiven for the wrong reasons,
- venue selection drifts toward the wrong microstructure regime.
This note turns that problem into a production modeling and control framework.
A useful companion note is:
2026-04-08-trade-sign-classification-tick-rule-quote-rule-lee-ready-bvc-playbook.md
That file is about how trade signing methods work. This file is about what happens when those labels feed live slippage and routing models.
The core failure mode
Let:
- (s^*_i \in {-1,+1}): true aggressor side of trade (i),
- (\hat{s}_i): inferred aggressor side used by your model,
- (v_i): signed or unsigned trade size,
- (m_i = 1[\hat{s}_i \neq s^*_i]): misclassification indicator.
Your model often builds features like:
- signed flow imbalance,
- recent aggressive buy/sell pressure,
- toxicity-conditioned fill hazard,
- markout by aggressor side,
- adverse-selection cost after passive fills.
But the model actually sees:
[ \hat{F}_t = \sum_{i \in W_t} \hat{s}_i v_i ]
instead of the flow you really care about:
[ F^*_t = \sum_{i \in W_t} s^*_i v_i ]
If sign error is random, the signal gets attenuated. If sign error is state-dependent, the signal becomes biased.
State-dependent error is the dangerous case. That happens when misclassification rises specifically during:
- high quote-update rates,
- midpoint / hidden-liquidity activity,
- fragmented/off-exchange reporting,
- odd-lot-heavy regimes,
- auctions and special-print windows,
- or timestamp skew between trade and quote paths.
In other words:
your labels are worst exactly when microstructure is hardest and slippage matters most.
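The attenuation-vs-bias contrast can be made concrete with a minimal simulation. This is a sketch under stated assumptions: the window construction, sizes, and error rates are all illustrative, not calibrated to any market.

```python
import random

def signed_flow(signs, sizes):
    """Signed flow imbalance: sum of side * size over a trade window."""
    return sum(s * v for s, v in zip(signs, sizes))

def flip(signs, p, rng):
    """Independent sign flips with probability p (a simple label-error model)."""
    return [-s if rng.random() < p else s for s in signs]

rng = random.Random(7)
n = 50_000
sizes = [100] * (2 * n)

# One window = a calm half (balanced flow, easy to sign) + a fast half
# (one-sided buy sweep, hard to sign). True imbalance comes entirely
# from the sweep.
calm = [1, -1] * (n // 2)
sweep = [1] * n
true_flow = signed_flow(calm + sweep, sizes)          # = n * 100

# Random 20% error everywhere: flow attenuates by roughly (1 - 2p) = 0.6.
obs_random = signed_flow(flip(calm + sweep, 0.20, rng), sizes)

# State-dependent error: 40% flips concentrated in the sweep. The imbalance
# the controller most needs shrinks to roughly (1 - 2*0.4) = 0.2 of truth.
obs_state = signed_flow(flip(calm, 0.02, rng) + flip(sweep, 0.40, rng), sizes)

atten_random = obs_random / true_flow   # roughly 0.6
atten_state = obs_state / true_flow     # roughly 0.2
```

The second case is the dangerous one: the measured imbalance collapses precisely in the regime that generates the cost.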
Why this hurts slippage models more than people expect
Trade-sign noise does not only degrade one feature. It contaminates the whole feedback loop.
1. Toxicity models underreact to real informed flow
If real buyer-initiated sweeps are partially mislabeled as sells or unknowns, the model underestimates buy-side pressure. Then:
- passive sell orders stay live too long,
- queue positions are held beyond their edge,
- adverse markouts rise,
- later catch-up aggression pays the bill.
2. Passive fills look safer than they really are
A passive fill is often evaluated against what happened right after it. If post-fill aggressive flow is mislabeled, the system understates adverse selection. That flatters passive tactics in backtests and shadow evaluation.
3. Aggressive routes learn the wrong venue map
A venue may look safe because its toxic flow is harder to sign correctly:
- more midpoint prints,
- more delayed reporting,
- more odd-lot activity,
- more quote/trade timing ambiguity.
The router then mistakes measurement weakness for venue quality.
4. Attribution shifts from model error to market error
When labels are noisy, teams often conclude:
- “the market was just jumpy,”
- “the book was weird,”
- “today’s markout was random,”
- “venue X became toxic.”
Sometimes the real answer is simpler:
your aggressor labels degraded, so the model stopped seeing toxicity correctly.
Mechanism map
1) Trade/quote clock skew
Classic quote-rule and Lee-Ready-style logic depend on the quote being the one the trade actually interacted with. If the quote stream is too early or too late relative to the trade print, you sign against the wrong midpoint.
That flips labels especially when:
- spreads are tight,
- quote updates are frequent,
- price moves cluster,
- or the trade happened near a rapidly moving touch.
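A minimal sketch of the quote rule with a tick-rule fallback makes the skew failure concrete. All prices here are invented for illustration; the point is that the same print, signed against two different quote snapshots, yields opposite labels.

```python
def quote_rule_sign(trade_px, bid, ask, prev_px):
    """Lee-Ready-style signing: quote rule, with a tick-test fallback at the mid.
    Returns +1 (buyer-initiated), -1 (seller-initiated), or 0 (unresolved)."""
    mid = 0.5 * (bid + ask)
    if trade_px > mid:
        return 1
    if trade_px < mid:
        return -1
    # At the mid: fall back to the tick test against the previous trade price.
    if trade_px > prev_px:
        return 1
    if trade_px < prev_px:
        return -1
    return 0

# A buy at the ask of the quote that was actually hit (100.00 / 100.02)...
s_aligned = quote_rule_sign(100.02, bid=100.00, ask=100.02, prev_px=100.01)
# ...signed against a slightly later quote after the touch moved up
# (100.02 / 100.04): the same print now lands below the mid.
s_skewed = quote_rule_sign(100.02, bid=100.02, ask=100.04, prev_px=100.01)
```

With aligned quotes the trade signs +1; with a few hundred microseconds of skew on a moving touch, the identical print signs -1.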
2) Midpoint and hidden-liquidity executions
A midpoint print is weak evidence of direction from price alone. If your fallback logic forces a buy/sell guess anyway, sign quality collapses in midpoint-heavy venues.
3) Odd-lot and sub-round-lot regimes
Odd lots can carry real informational content while interacting awkwardly with displayed-touch logic. A classifier that still mentally lives in a 100-share-touch world will misread modern flow.
4) Off-exchange and delayed reporting
If trade reports arrive late or in bursts, the quote state visible at arrival time may be unrelated to the economic state at execution time. That makes post hoc signing look cleaner than live signing.
5) Bulk classification used as trade-level truth
Bulk Volume Classification (BVC) can be useful for interval-level signed volume. It is not a per-trade aggressor oracle. Using it as trade-level truth poisons event-level toxicity and fill models.
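For contrast, an interval-level BVC-style split looks like the sketch below. It uses the standard normal CDF (published variants use a Student-t CDF), and it deliberately returns a volume fraction for the interval, not per-trade signs:

```python
import math

def bvc_buy_fraction(price_change, sigma):
    """Bulk Volume Classification sketch: fraction of an interval's volume
    attributed to buyers, via the standard normal CDF of the standardized
    price change. Interval-level only -- never a per-trade aggressor label."""
    z = price_change / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# An interval with a +1-sigma price move attributes ~84% of volume to buys.
frac_up = bvc_buy_fraction(0.05, sigma=0.05)
# A flat interval splits volume 50/50 regardless of what each trade did.
frac_flat = bvc_buy_fraction(0.0, sigma=0.05)
```

The flat-interval case shows why this cannot stand in for trade-level truth: an interval of offsetting toxic sweeps and a quiet interval produce the same 50/50 split.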
6) Corrections, cancels, and sale-condition drift
If a print is corrected, canceled, or reclassified later, your signed-flow history is rewritten after the model may already have acted on it. That creates label inconsistency across:
- live control,
- replay backtests,
- post-trade TCA,
- champion/challenger comparisons.
A more useful abstraction: sign quality is a latent state
Instead of pretending every trade label is equally trustworthy, define:
- (q_i = P(\hat{s}_i = s^*_i \mid x_i)): confidence that the inferred sign is correct,
- (\ell_i = \hat{s}_i q_i): confidence-weighted signed contribution.
Then your model can use:
[ \tilde{F}_t = \sum_{i \in W_t} \ell_i v_i ]
instead of naïve signed flow.
This immediately separates two very different situations:
- strong, trustworthy flow imbalance,
- apparent flow imbalance generated by weak sign evidence.
That difference matters operationally. A routing model should react strongly to the first and cautiously to the second.
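The confidence-weighted flow above reduces to a one-line aggregation. The sketch below (tuple layout assumed for illustration) shows how two windows with identical naive imbalance separate once confidence enters; note that an unbiased correction would weight by (2q - 1) rather than q, but this follows the definition of (\ell_i) given above:

```python
def confidence_weighted_flow(trades):
    """Confidence-weighted signed flow: each trade contributes s_hat * q * v,
    so low-confidence labels shrink toward zero instead of voting at full
    weight. `trades` is a list of (inferred_sign, confidence, size) tuples."""
    return sum(s * q * v for s, q, v in trades)

strong = [(+1, 0.95, 100)] * 10   # well-signed buy pressure (e.g. at-touch lifts)
weak = [(+1, 0.55, 100)] * 10     # midpoint-heavy guesses, barely above coin-flip

naive = sum(s * v for s, _, v in strong)      # = 1000 for either window
f_strong = confidence_weighted_flow(strong)   # ~950: react strongly
f_weak = confidence_weighted_flow(weak)       # ~550: react cautiously
```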
Cost decomposition
A practical decomposition is:
[ C_{\text{total}} = C_{\text{base}} + C_{\text{signal loss}} + C_{\text{wrong reaction}} + C_{\text{venue misrank}} + C_{\text{attribution drift}} ]
Where:
- (C_{\text{base}}): unavoidable cost from spread, impact, and volatility,
- (C_{\text{signal loss}}): reduced predictive power from noisy trade signs,
- (C_{\text{wrong reaction}}): live policy errors caused by sign noise,
- (C_{\text{venue misrank}}): routing cost from learning the wrong venue toxicity map,
- (C_{\text{attribution drift}}): TCA distortion from inconsistent or revised labels.
A simple way to think about the live piece:
[ C_{\text{wrong reaction}} \approx \kappa_1 \cdot |F^*_t - \tilde{F}_t| + \kappa_2 \cdot \text{policy flip rate} + \kappa_3 \cdot \text{false passive dwell time} ]
That last term matters a lot. A toxic flow signal that arrives late or inverted often does not cause one bad fill. It causes a sequence:
- keep passive order live too long,
- get negatively selected,
- cancel late,
- cross later in a worse book,
- attribute the pain to “market move” instead of label contamination.
The exact model bug people miss
Many teams validate trade signing by aggregate statistics:
- daily imbalance correlation,
- average signed volume agreement,
- or bar-level predictive power.
That is not enough.
A classifier can look acceptable in aggregate while still being disastrous in the exact subset that matters for execution:
- near-touch trades,
- short-horizon markout windows,
- high-update symbols,
- midpoint-heavy venues,
- open/close transitions,
- high-volatility bursts.
This is the same pathology as a slippage model with fine RMSE and terrible q95.
Average correctness is not the relevant KPI. Correctness in decision-critical regimes is.
Public grounding
A few public references make this problem very real:
- Lee-Ready-style trade classification explicitly depends on comparing a trade to a contemporaneous or lagged quote midpoint, which means clock alignment is foundational rather than cosmetic.
- Public summaries of Chakrabarty, Moulton, and Shkilko report trade-level misclassification rates around 31% with contemporaneous quotes and around 21% with a one-second quote lag, showing that the label problem can be material even before you get to modern hidden/midpoint fragmentation.
- Public summaries of later electronic-era comparison work report that Lee-Ready underperforms during intervals of high trade and/or quote frequency, which is exactly when live slippage control is least tolerant of label error.
- FINRA public trade-reporting rules and FAQs say OTC equity trades must be reported as soon as practicable, generally no later than 10 seconds after execution, so delayed trade visibility is structurally present in some market segments.
- Nasdaq public notices on odd-lot dissemination state odd-lot trades were included in volume statistics with a dedicated modifier, which is another reminder that statistical trade visibility and decision-grade touch semantics are not identical.
The punchline is simple:
trade sign is not a timeless label living inside the print. It is a reconstruction whose quality depends on clocks, venue semantics, and market regime.
Features that belong in a slippage stack
A. Sign-quality features
- quote_trade_delta_us,
- best_quote_age_us,
- midpoint_tie_flag,
- trade_at_touch_flag,
- tick_fallback_used,
- classifier_method (native, quote, lee_ready, tick, bvc, unknown),
- sign_confidence,
- quote_alignment_bucket.
B. Venue / print-semantics features
- odd_lot_flag,
- off_exchange_flag,
- sale_condition_bucket,
- midpoint_or_hidden_proxy,
- auction_or_cross_flag,
- late_report_flag,
- corrected_or_canceled_flag.
C. Market-state features
- spread,
- microprice,
- touch imbalance,
- quote update rate,
- trade rate,
- short-horizon realized vol,
- queue refill speed,
- phase (open, continuous, close, halt, reopen).
D. Strategy-state features
- passive dwell time,
- residual inventory,
- time-to-deadline,
- current aggression level,
- recent markout losses,
- venue concentration,
- fallback-policy dwell time.
The important idea:
label quality is itself a first-class feature. Do not hide it in preprocessing and pretend the downstream model is working with truth.
Metrics worth monitoring
1. SIR — Sign Inversion Rate
Estimated rate at which the inferred sign disagrees with better ground truth on a benchmark subset.
Break it out by:
- venue,
- liquidity bucket,
- spread bucket,
- time of day,
- quote-age bucket,
- print type.
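Bucketed SIR is a small aggregation once a benchmark subset with trusted signs exists. The sketch below assumes a flat record layout and illustrative bucket names:

```python
from collections import defaultdict

def sign_inversion_rate(records):
    """SIR per bucket: share of trades whose inferred sign disagrees with a
    higher-confidence benchmark sign. `records` is a list of
    (bucket, inferred_sign, benchmark_sign) tuples."""
    counts = defaultdict(lambda: [0, 0])   # bucket -> [disagreements, total]
    for bucket, s_hat, s_true in records:
        counts[bucket][0] += int(s_hat != s_true)
        counts[bucket][1] += 1
    return {b: d / t for b, (d, t) in counts.items()}

records = [
    ("lit_wide_spread", +1, +1), ("lit_wide_spread", -1, -1),
    ("lit_wide_spread", +1, +1), ("lit_wide_spread", +1, -1),
    ("midpoint_heavy", +1, -1), ("midpoint_heavy", -1, +1),
    ("midpoint_heavy", +1, +1), ("midpoint_heavy", -1, +1),
]
sir = sign_inversion_rate(records)
# {"lit_wide_spread": 0.25, "midpoint_heavy": 0.75}
```

The same aggregate accuracy can hide exactly this kind of split, which is why the breakdown dimensions above are not optional.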
2. QAD — Quote Alignment Drift
Distribution of trade-to-quote timing mismatch used by the classifier.
If QAD shifts, sign quality may shift even when market behavior itself does not.
3. MDS — Method Dispersion Share
Share of labels produced by each method:
- quote rule,
- tick fallback,
- midpoint guess,
- vendor flag,
- unknown.
A rising fallback share is often an early-warning indicator.
4. MSD — Markout Sign Disagreement
Compare markout statistics when grouped by inferred sign vs higher-confidence sign on a labeled subset.
This measures how much toxicity inference is being bent by label noise.
5. VCR — Venue Contamination Ratio
Fraction of flow for a venue that lands in low-confidence sign buckets.
This catches the classic failure mode where a venue looks “safer” only because your labels are weaker there.
6. PFD — Passive Fill Distortion
Difference in expected passive-fill markout using naïve signs vs confidence-aware signs.
This is the business metric that often reveals the problem fastest.
State machine for live control
LABEL_CLEAN
- Sign-quality metrics stable.
- Toxicity models trusted normally.
- Standard passive/aggressive switching allowed.
QUOTE_SKEWED
Triggered when quote-age / trade-quote alignment deteriorates.
- Downweight trade-sign-driven features.
- Increase use of quote-age and spread signals directly.
- Reduce confidence in micro-horizon toxicity flips.
HIDDEN_FLOW_HEAVY
Triggered when midpoint / odd-lot / non-standard print share rises.
- Treat trade-sign features as partial evidence only.
- Prefer venue-aware and fill-aware signals.
- Avoid overreacting to inferred signed imbalance alone.
OFFEX_DELAYED
Triggered when delayed reporting or correction activity rises.
- Separate live routing features from replay/TCA labels.
- Do not let repaired historical signs drive immediate control changes.
- Use confidence floors before adapting venue scores.
SAFE_ABSTAIN
Triggered when sign quality is persistently poor.
- Freeze sign-sensitive venue ranking updates.
- Fall back to more robust features: spread, depth, queue age, realized markout, fill hazard, time-to-deadline.
- Prefer policies that are less label-sensitive.
A bad sign regime should degrade into safer, simpler control, not into random overfitting.
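One way to implement the state machine is a priority-ordered rule over the monitoring metrics, with the most defensive state checked first. The thresholds below are placeholders that would need calibration, not recommended values:

```python
def label_regime(metrics):
    """Map sign-quality monitoring metrics to a control state.
    Checked in priority order: the most defensive trigger wins."""
    if metrics["sir_estimate"] > 0.35:
        return "SAFE_ABSTAIN"          # persistent poor sign quality
    if metrics["late_report_share"] > 0.15:
        return "OFFEX_DELAYED"         # delayed reporting / corrections rising
    if metrics["midpoint_share"] > 0.30:
        return "HIDDEN_FLOW_HEAVY"     # midpoint / non-standard print share up
    if metrics["median_quote_age_us"] > 5_000:
        return "QUOTE_SKEWED"          # trade-quote alignment deteriorating
    return "LABEL_CLEAN"

state = label_regime({
    "sir_estimate": 0.05, "late_report_share": 0.02,
    "midpoint_share": 0.45, "median_quote_age_us": 800,
})   # -> "HIDDEN_FLOW_HEAVY"
```

The ordering encodes the degrade-to-simpler principle: when multiple triggers fire, the controller lands in the more conservative state.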
Modeling blueprint
Layer 1 — Preserve label provenance
For every signed trade, store at least:
- raw trade timestamp,
- raw quote timestamp used,
- venue/source,
- method used,
- lag used,
- sign,
- confidence,
- special-condition flags.
If you do not store provenance, you will never debug label drift cleanly.
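One way to persist this is a flat record per signed trade. The schema below is a hypothetical minimal layout to show the shape, not a standard; field names and types are assumptions:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SignedTradeRecord:
    """Minimal provenance record for one signed trade (illustrative schema)."""
    trade_ts_ns: int              # raw trade timestamp
    quote_ts_ns: Optional[int]    # raw timestamp of the quote used for signing
    venue: str                    # venue / source identifier
    method: str                   # native | quote | lee_ready | tick | bvc | unknown
    quote_lag_ns: int             # lag applied between trade and quote clocks
    sign: int                     # +1 / -1 / 0 (unknown)
    confidence: float             # estimated P(sign is correct)
    condition_flags: Tuple[str, ...]  # sale conditions, odd-lot, correction markers

rec = SignedTradeRecord(
    trade_ts_ns=1_700_000_000_000_000_000,
    quote_ts_ns=1_699_999_999_999_200_000,
    venue="XNAS", method="lee_ready", quote_lag_ns=800_000,
    sign=+1, confidence=0.82, condition_flags=("odd_lot",),
)
```

Freezing the record matters: live-as-of provenance should be immutable, with hindsight repairs stored as separate records rather than in-place rewrites.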
Layer 2 — Build a confidence model
Estimate:
[ q_i = P(\hat{s}_i = s^*_i \mid x_i) ]
using benchmark subsets such as:
- native aggressor flags where available,
- MBO/order-level reconstruction,
- trusted venue-native data,
- or carefully curated manual subsets.
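A deliberately simple confidence model is an empirical agreement table by feature bucket, fit on the benchmark subset. The sketch below assumes a bucketed record layout; the min-count fallback to an uninformative 0.5 prior is one choice among several:

```python
from collections import defaultdict

def fit_confidence_table(benchmark, min_count=50):
    """Empirical confidence model: P(inferred sign == benchmark sign) per
    feature bucket, estimated on a subset with trusted signs.
    `benchmark` is a list of (bucket, inferred_sign, true_sign) tuples."""
    agree = defaultdict(lambda: [0, 0])    # bucket -> [agreements, total]
    for bucket, s_hat, s_true in benchmark:
        agree[bucket][0] += int(s_hat == s_true)
        agree[bucket][1] += 1
    # Thin buckets fall back to a conservative coin-flip prior.
    return {b: (a / t if t >= min_count else 0.5) for b, (a, t) in agree.items()}

bench = ([("at_touch", +1, +1)] * 90 + [("at_touch", +1, -1)] * 10
         + [("midpoint", +1, +1)] * 30 + [("midpoint", +1, -1)] * 30)
q = fit_confidence_table(bench)
# at-touch trades sign reliably (q = 0.9); midpoint prints are coin flips (0.5)
```

In production this would be a richer model over the sign-quality features listed earlier, but even a calibration table like this already separates trustworthy and untrustworthy flow.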
Layer 3 — Replace hard labels with soft labels
Instead of training on (\hat{s}_i) alone, use:
- confidence-weighted signs,
- unknown / abstain states,
- or explicit multi-class labels (buy, sell, unknown).
Layer 4 — Train toxicity and markout models conditional on sign quality
Predict:
[ E[C \mid x_t, \tilde{F}_t, q_t] ]
not merely:
[ E[C \mid x_t, \hat{F}_t] ]
That lets the model learn different responses for:
- strong true imbalance,
- weakly supported imbalance,
- and ambiguity-heavy flow.
Layer 5 — Freeze or regularize online adaptation under label stress
If venue scores adapt online from mislabeled toxicity, the router will drift the wrong way. Use:
- slower learning rates,
- confidence-gated updates,
- shadow learning first,
- and abstention when label quality is degraded.
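A confidence-gated update can be as small as an EMA with a floor and a scaled learning rate. The sketch below uses illustrative parameter values; the floor and base rate are assumptions to be tuned:

```python
def update_venue_score(score, markout, confidence, base_lr=0.05, q_floor=0.7):
    """Confidence-gated EMA update of a venue toxicity score.
    Below the confidence floor the update is skipped entirely (abstention);
    above it, the learning rate is scaled down as confidence falls."""
    if confidence < q_floor:
        return score                  # abstain: do not learn from weak labels
    lr = base_lr * confidence         # slower learning under residual doubt
    return (1 - lr) * score + lr * markout

s0 = 1.0
s1 = update_venue_score(s0, markout=3.0, confidence=0.9)  # moves toward 3.0
s2 = update_venue_score(s1, markout=3.0, confidence=0.4)  # frozen: unchanged
```

The abstention branch is the important part: during label-stress windows the venue map stops moving instead of drifting toward whatever the broken labels suggest.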
Practical policy rules
Rule 1: unknown is a valid label
Forcing weak guesses is often worse than carrying uncertainty.
Rule 2: keep trade signing and slippage modeling loosely coupled
Signing logic will change. If your whole feature store assumes one eternal sign classifier, future fixes become painful and misleading.
Rule 3: benchmark subsets must match decision-critical regimes
A high-confidence benchmark only on calm large-cap lit flow is not enough. You need validation where the live model bleeds:
- fast symbols,
- fragmented venues,
- midpoint-heavy periods,
- open/close transitions.
Rule 4: venue quality and label quality are different things
If a venue looks safe, ask whether it is actually safe or just hard to sign.
Rule 5: TCA must distinguish live-as-of labels from hindsight-repaired labels
Otherwise replay studies flatter the model by grading it with cleaner labels than the live controller had.
30-day rollout plan
Week 1 — Instrument label provenance
- Persist sign method, lag, quote age, and confidence metadata.
- Baseline MDS, QAD, and confidence distribution by venue and time of day.
- Separate live labels from hindsight-repaired labels in storage.
Week 2 — Build benchmark subsets
- Identify venues / datasets with stronger aggressor truth.
- Estimate SIR by regime.
- Rank where misclassification is operationally worst, not merely most frequent.
Week 3 — Confidence-aware shadow models
- Replace hard signed-flow features with confidence-weighted versions in shadow.
- Add SAFE_ABSTAIN logic for label-stress windows.
- Compare passive-fill markout and venue ranking drift vs baseline.
Week 4 — Controlled activation
- Gate online learning updates by sign confidence.
- Clip policy changes when label quality deteriorates.
- Optimize PFD and venue-misrank cost, not just classifier accuracy.
Common anti-patterns
- Treating one trade-signing rule as permanent truth.
- Validating sign quality only on daily aggregates.
- Letting BVC-style interval inference leak into trade-level labels.
- Ignoring midpoint / hidden-flow ambiguity and forcing a guess.
- Learning venue toxicity maps without adjusting for label quality.
- Mixing live labels and hindsight-repaired labels in the same evaluation.
- Blaming “market regime change” before checking whether label quality changed first.
What good looks like
A production execution stack should be able to answer:
- Which trades were signed with strong evidence vs weak evidence?
- How does sign quality change by venue, symbol, and time regime?
- How much passive-fill markout worsens when sign confidence is low?
- Does a venue look safe because of real outcomes, or because its flow is hard to classify?
- Does online adaptation slow down or become reckless during label-stress windows?
If you cannot answer those, your toxicity model may be learning from mislabeled flow.
And mislabeled flow is one of the cleanest ways to pay real slippage for imaginary signal.
Selected public references
- Lee, C. M. C. and Ready, M. J. (1991), Inferring Trade Direction from Intraday Data — the classic quote-rule + tick-fallback framework that makes quote alignment central to trade signing.
- Chakrabarty, B., Moulton, P. C., and Shkilko, A., Short Sales, Long Sales, and the Lee-Ready Trade Classification Algorithm Revisited — public summaries report trade-level misclassification around 31% with contemporaneous quotes and around 21% with a one-second lag.
- Panayides, M., Shohfi, T., and Smith, J., Comparing Trade Flow Classification Algorithms in the Electronic Era — public summaries report Lee-Ready underperforming during high trade/quote frequency intervals and highlight the distinction between trade-level and bulk classification use cases.
- FINRA public trade-reporting rules / FAQ — public guidance that trades must be reported as soon as practicable, generally within 10 seconds, which makes delayed visibility structurally relevant for some prints.
- Nasdaq odd-lot dissemination notice — odd-lot trades included in volume statistical calculations with dedicated modifiers, underscoring the gap between visible print statistics and decision-grade touch semantics.
Bottom line
Trade-sign classification error is not just a data-cleaning nuisance.
It is a slippage-model contamination channel.
When aggressor-side labels degrade, toxicity features attenuate or invert, passive fills get misgraded, venue rankings drift, and online adaptation starts learning from the wrong market.
The right response is not “pick one better classifier and forget it.” It is:
- preserve label provenance,
- model sign confidence explicitly,
- separate live labels from hindsight labels,
- gate control changes by label quality,
- and degrade gracefully when trade signing becomes unreliable.
In short:
before you trust signed-flow alpha, make sure you trust the signs.