Sale-Condition Filtering & Benchmark-Contamination Slippage Playbook

Date: 2026-04-12
Category: finance · research (execution / slippage modeling)

Why this playbook exists

A lot of execution analytics quietly assume that every reported trade is equally useful as a market reference.

That is false.

In real market data, some prints are good anchors for intraday execution modeling, and some are not. A reported trade can be a regular continuous-market execution, a late or out-of-sequence report, an auction or other special-condition print, a derivatively priced or contingent trade, or a subsequent correction or cancellation of an earlier report.

If a slippage stack treats those prints as one homogeneous stream, three bad things happen fast:

  1. benchmarks drift away from the actually tradeable market,
  2. toxicity / markout labels become contaminated,
  3. TCA starts praising or blaming the router for moves it never truly caused or could have reacted to.

This is one of those bugs that hides in plain sight because the tape still looks like trading happened.

But in execution modeling, “a trade happened” is not enough.

The real question is:

what kind of trade was it, and was it eligible to represent the current continuous market for the label I am building?

This note turns that question into a practical framework for feature engineering, benchmark construction, labeling, and controls.


Public market-structure facts that make this real

This is not just a data-cleaning preference. Public rulebooks and feed documentation explicitly distinguish among trade types.

A few examples:

  1. Consolidated-feed specifications (e.g., the CTA and UTP SIP feeds) publish sale-condition code tables stating whether a given print updates last sale, high/low, and consolidated volume.
  2. FINRA trade-reporting rules define modifiers for late, out-of-sequence, and otherwise non-regular-way reports.
  3. Exchange rulebooks define auction and special-settlement prints separately from continuous-session executions.

The modeling implication is simple:

the tape already contains a taxonomy of “how much this print should mean.”

If your model ignores that taxonomy, it is manufacturing a cleaner and more continuous market than the one you actually observed.


The core failure mode

Suppose you are grading a child order against a last-trade benchmark, an arrival mid, or a short-horizon markout.

Now imagine the most recent tape print before or after the child is one of these:

  1. a late or out-of-sequence report,
  2. an auction or other special-condition print,
  3. a derivatively priced or contingent trade,
  4. a correction or cancellation of an earlier report.

If the model uses that print as though it were an ordinary continuous execution, it can:

  1. assign fake market movement to the live book,
  2. misstate short-term impact and reversion,
  3. sign trades against stale or economically irrelevant price moves,
  4. inflate or crush realized volume denominators,
  5. create spurious “alpha decay” or “routing regression.”

This is not a small hygiene issue.

For some labels, a single misclassified print can be more damaging than a few milliseconds of timestamp noise.


A useful abstraction: trade eligibility depends on the downstream use

Let each trade report at time (t) have a price (p_t), a size (q_t), sale-condition modifiers (m_t), a correction status (c_t), an execution timestamp (\tau_t^{exec}), and a publication timestamp (\tau_t^{pub}).

Define a task-specific eligibility function:

[ E^{(k)}_t = f_k(m_t, c_t, \tau_t^{exec}, \tau_t^{pub}) \in \{0,1\} ]

where (k) is the downstream use case.

Examples:

  1. (E^{(bench)}): may this print anchor a continuous-price benchmark?
  2. (E^{(volume)}): does this print count toward participation / pacing volume?
  3. (E^{(markout)}): may this print serve as the reference for a short-horizon markout?

The important point:

there is no single universal “good trade print” flag.

A trade can be valid for volume accounting but invalid for short-horizon continuous-price labeling.

That is exactly why condition codes exist.
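The eligibility function above can be sketched in a few lines. The sale-condition codes here are purely illustrative stand-ins (real code tables come from your feed handler's documentation), and the eligibility sets are assumptions chosen to show the shape of the idea, not a recommended policy:

```python
# Illustrative sale-condition codes (assumptions, not a real feed's table):
# "@" regular, "T" extended hours, "Z" out of sequence,
# "4" derivatively priced, "6" closing auction.

# Which conditions may anchor a continuous-price benchmark (assumption).
BENCH_ELIGIBLE = {"@"}
# Which conditions still count toward reported volume (assumption).
VOLUME_ELIGIBLE = {"@", "T", "Z", "4", "6"}

def eligible(condition: str, use_case: str) -> bool:
    """E^(k)_t: the same print gets a different answer per downstream use."""
    if use_case == "bench":
        return condition in BENCH_ELIGIBLE
    if use_case == "volume":
        return condition in VOLUME_ELIGIBLE
    raise ValueError(f"unknown use case: {use_case}")

# A derivatively priced print counts for volume but not for the benchmark.
assert eligible("4", "volume") and not eligible("4", "bench")
assert eligible("@", "bench")
```

The point of routing everything through one `eligible(condition, use_case)` gate is that the taxonomy stays explicit: adding a new downstream use case forces a deliberate decision rather than inheriting someone else's flag.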


Mechanism map

1. Continuous-book benchmark contamination

A child order arrives when the lit market is around $100.00 / $100.01.

Seconds later, a derivatively priced or contingent print appears at $99.94 for a large reported size.

If your TCA stack marks the child against “latest trade” without sale-condition filtering, the child suddenly looks like it overpaid by 7 cents, even though the displayed continuous market never really moved there.

That is not slippage.

That is benchmark contamination.

2. Markout distortion

Suppose a router buys at the offer, then the next tape print is a late report or auction-related special print far away from the touch.

A naive 1-second or next-print markout says the fill immediately went against you.

But the print was not a reliable proxy for the live marginal market.

Now the model learns that a healthy fill was toxic.

Repeat this enough times and the strategy starts avoiding good flow.

3. Trade-sign / toxicity label damage

Many toxicity pipelines infer aggressor side or price impact from sequences of trade-to-trade moves.

When non-standard prints enter the same return path as continuous prints, you get:

  1. flipped or spurious aggressor-side inferences,
  2. phantom price impact that no live order actually caused,
  3. distorted reversion estimates around those prints.

In other words, sale-condition filtering is not just for TCA.

It is a label-integrity problem.

4. Reported-volume pacing illusion

Some special-condition trades still update volume even when they should not drive the same price benchmark logic.

That means a POV / participation engine can be simultaneously:

  1. behind its participation schedule as measured against reported volume, and
  2. correctly paced against the volume that was actually tradeable in the continuous market.

If the system cannot separate those two facts, it confuses reporting semantics with executable price discovery.

5. Late-report time-travel

A late or out-of-sequence report can enter the stream after the router decision but represent an earlier execution time.

If a replay joins labels to report arrival time instead of market-observable time and condition eligibility, the model can accidentally “discover” price moves that the live controller could not have known about.

That is classic point-in-time leakage wearing a tape costume.


A better decomposition: observed tape vs continuous benchmark stream

Construct two trade streams instead of one:

Stream A — full reported trade stream

Contains all reported trades. It matters for:

  1. total reported volume and pacing denominators,
  2. end-of-day reconciliation against official consolidated volume,
  3. surveillance, audit, and reporting-semantics diagnostics.

Stream B — continuous-benchmark eligible stream

Contains only trades eligible for the specific benchmark or label you are building.

For a benchmark use case, define:

[ P^{bench}(t) = g\big(\{p_i : i \le t,\ E^{(bench)}_i = 1\}\big) ]

For volume use cases:

[ V^{pacing}(t) = \sum_{i \le t} q_i \cdot E^{(volume)}_i ]

The key operational rule is this:

the same trade may be included in (V^{pacing}) and excluded from (P^{bench}).

That is not inconsistency.

That is correct modeling.
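A minimal sketch of the two-stream split, assuming trades arrive as dicts with `price`, `size`, and `cond` fields (field names and condition codes are illustrative assumptions): every print updates the raw last sale, but only condition-eligible prints update (P^{bench}) and (V^{pacing}).

```python
def build_streams(trades, bench_ok, volume_ok):
    """trades: list of dicts with 'price', 'size', 'cond', in arrival order.

    Returns one row per print carrying the raw last sale, the
    condition-filtered benchmark price, and the pacing volume so far.
    """
    p_bench = None   # last benchmark-eligible price (P^bench)
    v_pacing = 0     # cumulative volume-eligible size (V^pacing)
    out = []
    for t in trades:
        if t["cond"] in bench_ok:
            p_bench = t["price"]
        if t["cond"] in volume_ok:
            v_pacing += t["size"]
        out.append({"raw_last": t["price"], "p_bench": p_bench,
                    "v_pacing": v_pacing, **t})
    return out

tape = [
    {"price": 100.01, "size": 100,  "cond": "@"},  # regular continuous print
    {"price": 99.94,  "size": 5000, "cond": "4"},  # derivatively priced print
]
rows = build_streams(tape, bench_ok={"@"}, volume_ok={"@", "4"})
# The large off-market print moves raw_last and v_pacing, not p_bench.
assert rows[-1]["raw_last"] == 99.94
assert rows[-1]["p_bench"] == 100.01
assert rows[-1]["v_pacing"] == 5100
```

Note that the same `"4"` print is counted in the pacing volume and excluded from the benchmark price, which is exactly the "not inconsistency, correct modeling" rule stated above.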


Metrics worth instrumenting

1. BCG — Benchmark Contamination Gap

[ BCG(t) = P^{raw\_last}(t) - P^{bench}(t) ]

How far your naive last-trade benchmark deviates from the condition-filtered continuous benchmark.

Track by symbol, venue mix, time of day, and sale-condition bucket.
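The gap itself is a one-liner; the work is in carrying both prices through your pipeline. A sketch, reusing the running $100.01 / $99.94 example from the mechanism map:

```python
def bcg(raw_last, p_bench):
    """BCG(t) = P^raw_last(t) - P^bench(t); undefined until both exist."""
    if raw_last is None or p_bench is None:
        return None
    return raw_last - p_bench

# Off-market print at 99.94 after a benchmark-eligible print at 100.01:
# the naive last-trade benchmark sits 7 cents below the filtered one.
assert abs(bcg(99.94, 100.01) + 0.07) < 1e-9
assert bcg(None, 100.01) is None
```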

2. NSPR — Non-Standard Print Ratio

Share of reported trades or volume with sale conditions outside your continuous-benchmark-eligible set.

Compute it separately by trade count and by traded volume, then cut by symbol, venue, and time of day.
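A sketch of the ratio by count and by volume, with illustrative condition codes and field names (assumptions, not a standard schema):

```python
def nspr(trades, bench_ok):
    """Non-Standard Print Ratio: share of prints (and of volume) whose
    sale condition falls outside the benchmark-eligible set."""
    n_bad = sum(1 for t in trades if t["cond"] not in bench_ok)
    v_bad = sum(t["size"] for t in trades if t["cond"] not in bench_ok)
    v_all = sum(t["size"] for t in trades)
    return n_bad / len(trades), v_bad / v_all

tape = [{"cond": "@", "size": 100},   # regular print
        {"cond": "4", "size": 900}]   # derivatively priced print
by_count, by_volume = nspr(tape, bench_ok={"@"})
assert by_count == 0.5      # half the prints are non-standard...
assert by_volume == 0.9     # ...but 90% of the volume is
```

The count/volume split matters: a tape can look clean by count while most of its volume is benchmark-ineligible.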

3. LRTL — Late-Report Time-Leak

Fraction of labels that would differ if you use report-publication time instead of strict as-of observable time.

4. MDI — Markout Distortion Index

Difference between markouts computed from:

  1. the raw reported trade stream, and
  2. the benchmark-eligible continuous stream.

When MDI spikes, your “toxicity” is often just tape taxonomy bleeding into the label.
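One way to sketch the index for a single buy fill: compute the same next-print markout twice, once against the raw tape and once against the condition-filtered stream, and take the difference. Timestamps, field names, and condition codes are illustrative assumptions:

```python
def next_price(trades, after_ts, bench_ok=None):
    """First print after after_ts; optionally restricted to eligible conditions.
    trades is assumed sorted by timestamp."""
    for t in trades:
        if t["ts"] > after_ts and (bench_ok is None or t["cond"] in bench_ok):
            return t["price"]
    return None

def markout_distortion(fill_px, fill_ts, trades, bench_ok):
    """MDI for one fill: raw next-print markout minus filtered markout."""
    raw = next_price(trades, fill_ts)
    flt = next_price(trades, fill_ts, bench_ok)
    if raw is None or flt is None:
        return None
    return (raw - fill_px) - (flt - fill_px)   # fill price cancels: raw - flt

tape = [{"ts": 1.0, "price": 99.80,  "cond": "Z"},   # late report
        {"ts": 1.2, "price": 100.02, "cond": "@"}]   # regular print
mdi = markout_distortion(fill_px=100.01, fill_ts=0.0, tape=tape,
                         bench_ok={"@"}) if False else \
      markout_distortion(100.01, 0.0, tape, bench_ok={"@"})
# The raw markout looks 22 cents worse purely because of the late report.
assert abs(mdi + 0.22) < 1e-9
```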

5. VPI — Volume/Price Inconsistency

Fraction of intervals where a trade contributes to volume pacing but is excluded from price benchmarking.

This should not be forced to zero.

You want to measure it because it tells you how often your system is operating in a mixed reporting regime.

6. CCR — Correction Contamination Rate

How often corrected or canceled prints change historical labels relative to the live-as-of version the controller actually saw.

7. SCS — Sale-Condition Shock

Short-window burst intensity of benchmark-ineligible trade reports.

Useful during auctions, reopenings, or venue/reporting anomalies.


Feature set for slippage models

A. Trade-report semantics

  1. sale-condition / modifier codes, raw and bucketed,
  2. per-use-case eligibility flags (benchmark, volume, markout),
  3. correction / cancellation status.

B. Time-quality features

  1. publication-minus-execution lag,
  2. out-of-sequence indicator,
  3. whether the print was observable as-of the decision timestamp.

C. Stream-divergence features

  1. rolling BCG level and volatility,
  2. rolling NSPR by count and by volume.

D. Quote context

  1. distance of the print from the prevailing bid, offer, and mid,
  2. whether the print sits inside or outside the touch.

E. Regime features

  1. open / close / halt / reopening windows,
  2. recent SCS (sale-condition shock) intensity.

Important rule:

sale conditions should enter the model as first-class features and filtering gates, not just as downstream BI metadata.


Labeling blueprint

For every child order or decision timestamp, store both raw and filtered references.

At minimum capture:

  1. latest raw reported trade,
  2. latest benchmark-eligible trade,
  3. latest quote-based mid,
  4. latest volume-eligible trade and cumulative volume,
  5. modifier / sale-condition details for any intervening trades,
  6. as-of observable timestamps,
  7. correction status.

Then build separate labels.

Label 1 — raw-tape slippage

[ S_{raw} = p_{fill} - P^{raw\_last}(t) ]

Useful mostly as a diagnostic, not as your truth.

Label 2 — filtered continuous-benchmark slippage

[ S_{flt} = p_{fill} - P^{bench}(t) ]

This is usually the honest execution benchmark.

Label 3 — quote-anchored slippage

[ S_{mid} = p_{fill} - mid(t) ]

Useful when trade prints are sparse or noisy.

Label 4 — contamination gap

[ C_{gap} = S_{raw} - S_{flt} ]

This is the cleanest way to quantify how much performance measurement was altered by trade-condition handling rather than routing quality.

Label 5 — live-vs-repaired gap

[ L_{gap} = S_{flt}^{live\_asof} - S_{flt}^{hindsight\_repaired} ]

This isolates how much hindsight correction or cancellation would rewrite the label.

That matters if you retrain models on cleaned history while live routing had to trade on dirtier truth.
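The five labels are arithmetic once the snapshot references exist. A sketch for a single buy fill, reusing the running example where the raw last trade (99.94) disagrees with the filtered benchmark (100.01); the function name and argument names are assumptions:

```python
def slippage_labels(p_fill, raw_last, p_bench, mid, p_bench_repaired=None):
    """Labels 1-5; positive values mean the fill paid above the reference."""
    s_raw = p_fill - raw_last              # Label 1: raw-tape slippage
    s_flt = p_fill - p_bench               # Label 2: filtered-benchmark slippage
    s_mid = p_fill - mid                   # Label 3: quote-anchored slippage
    c_gap = s_raw - s_flt                  # Label 4: contamination gap
    l_gap = None                           # Label 5 needs a repaired benchmark
    if p_bench_repaired is not None:
        l_gap = s_flt - (p_fill - p_bench_repaired)
    return {"s_raw": s_raw, "s_flt": s_flt, "s_mid": s_mid,
            "c_gap": c_gap, "l_gap": l_gap}

lab = slippage_labels(p_fill=100.01, raw_last=99.94,
                      p_bench=100.01, mid=100.005)
# The raw tape says the fill overpaid by 7 cents; the filtered
# benchmark says it did not. The entire 7 cents is contamination.
assert abs(lab["s_raw"] - 0.07) < 1e-9
assert abs(lab["s_flt"]) < 1e-9
assert abs(lab["c_gap"] - 0.07) < 1e-9
```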


Policy rules for execution stacks

Rule 1: maintain multiple benchmark streams on purpose

You probably need at least three:

  1. the raw reported trade stream,
  2. the continuous-benchmark eligible stream,
  3. the volume / pacing eligible stream.

One stream cannot serve all jobs honestly.

Rule 2: filtering must be use-case specific

Do not build one global valid_trade_flag and call it done.

A print can be:

  1. volume-eligible but benchmark-ineligible,
  2. benchmark-eligible but excluded from a short-horizon markout,
  3. relevant only for reconciliation and audit.

Rule 3: use publication-time as-of logic for live realism

If the controller could not observe a report yet, the label should not pretend it could.

Late-report handling and sale-condition handling belong together.
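The as-of rule reduces to a lookup keyed on publication time, not execution time. A sketch assuming reports arrive sorted by publication timestamp (tuple layout is an illustrative assumption):

```python
import bisect

def last_observable(reports, decision_ts):
    """Latest report the controller could have seen at decision_ts.

    reports: list of (pub_ts, exec_ts, price), sorted by pub_ts.
    """
    pubs = [r[0] for r in reports]
    i = bisect.bisect_right(pubs, decision_ts)
    return reports[i - 1] if i else None

reports = [
    (1.0, 1.0, 100.00),
    (3.0, 1.5, 99.90),   # late report: executed at 1.5, published at 3.0
]
# At decision time 2.0, only the first report was observable, even though
# a hindsight join on execution time would surface the 99.90 print.
assert last_observable(reports, 2.0) == (1.0, 1.0, 100.00)
assert last_observable(reports, 3.5) == (3.0, 1.5, 99.90)
```

Joining on `exec_ts` here is exactly the late-report time-travel failure from the mechanism map: the 99.90 print would leak into a label computed for decision time 2.0.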

Rule 4: corrections should version labels, not silently replace them

Keep both:

  1. the live-as-of label the controller could actually have computed, and
  2. the hindsight-repaired label after corrections and cancels.

Otherwise your research will quietly train on a better tape than production ever had.

Rule 5: auction and special-print windows deserve their own regime tags

Open, close, halts, and special reporting windows should not be shoved into “normal microstructure” buckets.

Rule 6: do not let benchmark contamination masquerade as impact

If raw last and filtered benchmark diverge materially, attribution must split:

  1. genuine market movement and impact, measured on the filtered stream, from
  2. benchmark contamination, measured as the BCG.

Otherwise operations teams will chase the wrong problem.


Common anti-patterns

  1. Marking every child against the latest raw print with no condition filtering.
  2. One global valid_trade_flag reused for every downstream job.
  3. Joining labels on report arrival time without as-of observability checks.
  4. Silently overwriting history with corrected prints before retraining.
  5. Treating auction, halt, and reopening windows as ordinary continuous microstructure.

30-day rollout plan

Week 1 — make the trade taxonomy observable

  1. Log sale conditions, correction status, and publication timestamps end to end.
  2. Start computing NSPR and SCS per symbol and session.

Week 2 — split benchmark streams

  1. Build the raw and benchmark-eligible streams side by side.
  2. Instrument BCG and VPI on live flow.

Week 3 — retrain labels with explicit condition handling

  1. Recompute (S_{raw}), (S_{flt}), (S_{mid}), and (C_{gap}) over history.
  2. Measure LRTL and MDI on the retrained labels.

Week 4 — harden production controls

  1. Alert on BCG and SCS spikes.
  2. Version labels across corrections instead of overwriting them.


What good looks like

A production-grade slippage stack should be able to answer:

  1. What was the last reported trade?
  2. What was the last benchmark-eligible continuous trade?
  3. Which intervening trades were excluded and why?
  4. Did a special-condition print affect volume, price benchmark, both, or neither?
  5. Would the label differ under live-as-of vs hindsight-repaired data?
  6. How much of measured “slippage” is actually benchmark contamination?

If you cannot answer those questions, your model is probably measuring execution against a tape that is too literal and not nearly semantic enough.


Bottom line

Not every print should have equal voting rights in your slippage model.

Some trades are excellent references for the live continuous market. Others are real and important, but meaningful in a different way: they carry volume and reporting semantics, or describe an earlier execution, without representing the currently tradeable price.

The expensive mistake is collapsing all of them into one “last sale” reality.

Execution analytics gets cleaner when you stop asking only:

what was the last print?

and start asking:

what was the last print that was eligible to mean what I am about to claim it means?

That question sounds pedantic.

It saves real money.