Trade-Sign Classification Playbook

Date: 2026-04-08
Category: knowledge
Domain: finance / market microstructure / data engineering / execution analytics

Why this matters

A surprising amount of execution research quietly depends on one deceptively simple question:

Was this trade buyer-initiated or seller-initiated?

If you get that wrong, a lot of downstream analytics becomes noisy or outright misleading:

order-flow imbalance,
toxicity metrics,
VPIN-style signals,
short-horizon impact estimates,
markout attribution,
queue-reactive models,
and even basic “were buyers or sellers in control?” dashboards.

The trap is that many teams treat trade signing as solved by one canned rule. It is not.

What works tolerably well on older quote/trade feeds can break in modern electronic markets because of:

timestamp skew between trades and quotes,
fragmented venues and feed paths,
odd lots and hidden liquidity,
midpoint and dark executions,
auctions and off-book prints,
and high-frequency bursts where simple tick tests lose context.

The practical goal is not academic purity. The goal is:

pick the simplest trade-signing method that matches your data quality, use case, and market structure, then validate it against something closer to ground truth before trusting any downstream model.

1) Fast mental model

There are really four different problems people blur together:

True aggressor-side recovery
Who crossed the spread in reality?
Trade-level direction approximation
Given only prints and quotes, what side is the best guess for this specific trade?
Bar-level signed-volume estimation
Over a time/volume bar, roughly how much volume was buyer- vs seller-initiated?
Execution-analytics robustness
How much sign error can your downstream signal tolerate?

That distinction matters because different tools solve different problems:

Venue aggressor flags / MBO data are best for true aggressor recovery.
Quote rule / tick rule / Lee-Ready are trade-level approximations.
Bulk Volume Classification (BVC) is a bar-level estimation tool, not a precise trade labeler.

If you only remember one thing, remember this:

Do not use a bar-level signing method when you actually need per-trade labels, and do not pretend a trade-level heuristic is ground truth when feed timing is messy.

2) The hierarchy of evidence

From strongest to weakest practical evidence:

Exchange-native aggressor flag
Market-by-order / order-add-delete-match reconstruction with venue semantics
Trade matched to prevailing bid/ask or midpoint using high-quality quote timing
Tick-test fallback
Bar-level inference like BVC

That hierarchy is brutal but useful. A lot of trouble comes from pretending level 4 or 5 is equivalent to level 1.

3) The quote rule

In one sentence

Compare the trade price to the prevailing bid-ask midpoint: above midpoint implies buyer-initiated, below midpoint implies seller-initiated.

Why it works

If a market order lifts the ask, prints tend to happen near the ask side. If a market order hits the bid, prints tend to happen near the bid side. The midpoint is the simplest separator.

Basic logic

Let:

bid = best bid,
ask = best ask,
midpoint = (bid + ask) / 2,
trade price = p.

Then:

if p > midpoint → classify as buy,
if p < midpoint → classify as sell,
if p == midpoint → unresolved; use fallback logic.
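
A minimal Python sketch of the logic above. Several details are assumptions for illustration and not part of the rule itself: the function name quote_rule, the +1 / -1 / None encoding, the small tolerance eps used to compare floating-point prices with the midpoint, and the choice to treat locked, crossed, or invalid quotes as unresolved.

```python
from typing import Optional


def quote_rule(price: float, bid: float, ask: float, eps: float = 1e-9) -> Optional[int]:
    """Return +1 (buy), -1 (sell), or None when the quote rule cannot decide."""
    # Assumption: a locked, crossed, or non-positive quote is treated as unusable.
    if bid <= 0 or ask <= 0 or bid >= ask:
        return None
    midpoint = (bid + ask) / 2
    if price > midpoint + eps:
        return 1
    if price < midpoint - eps:
        return -1
    # Trade at the midpoint: unresolved, hand off to fallback logic.
    return None
```

For example, with bid 10.00 and ask 10.50, a print at 10.50 is signed +1, a print at 10.00 is signed -1, and a print at 10.25 comes back as None. If prices are stored as integer ticks, the tolerance is unnecessary.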

Where it shines

liquid lit markets,
reasonably synchronized trade/quote data,
analyses that care about trade-level signed flow,
environments where most prints still occur near the displayed spread.

Where it breaks

midpoint executions,
hidden-liquidity matches,
locked/crossed or stale-quote states,
poor quote alignment,
auctions and special trade-condition prints,
fragmented feeds where your “prevailing quote” was not actually what the trade saw.

Practical warning

The quote rule is often the best simple default, but only if the quote you compare against is actually contemporaneous. That timing assumption is the whole game.

4) The tick test

In one sentence

If the trade price is above the previous distinct trade price, call it a buy; if below, call it a sell.

Basic logic

Compare the trade price to the most recent previous different trade price:

if p_t > p_prev_diff → buy,
if p_t < p_prev_diff → sell,
if equal, keep walking backward until you find a different price; if there is none, leave the trade unsigned.
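
The tick test can be sketched in a few lines of Python. The function name tick_test and the +1 / -1 / None encoding are illustrative assumptions. Carrying the last sign forward across equal prices gives the same answer as walking backward to the last different price, because the sign at the start of a same-price run was already set against that price.

```python
from typing import List, Optional


def tick_test(prices: List[float]) -> List[Optional[int]]:
    """Sign each trade against the most recent different trade price."""
    signs: List[Optional[int]] = []
    last_sign: Optional[int] = None  # None until a price change is observed
    prev_price: Optional[float] = None
    for price in prices:
        if prev_price is not None and price > prev_price:
            last_sign = 1
        elif prev_price is not None and price < prev_price:
            last_sign = -1
        # Equal price: keep last_sign (a zero-tick inherits the prior direction).
        signs.append(last_sign)
        prev_price = price
    return signs
```

On the tape 10.0, 10.1, 10.1, 10.0, 10.0 this yields None, +1, +1, -1, -1. The first trade stays unsigned because there is no earlier price to compare against.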

Why people still use it

Because it needs only trade prints. That makes it convenient for:

old or incomplete datasets,
coarse historical tapes,
fast fallback logic,
or environments where quote timestamps are clearly unreliable.

Where it helps

as a tie-breaker when the quote rule lands exactly at midpoint,
when quote data is missing or unusable,
for rough signed-flow proxies on less demanding tasks.

Where it breaks badly

repeated same-price prints,
bouncing around the same price level,
fast two-sided activity at one price,
hidden/midpoint executions,
markets where price change is too coarse a proxy for aggressor side.

In modern electronic trading, the tick test is often less a primary method and more an emergency fallback.

5) Lee-Ready

In one sentence

Lee-Ready combines the quote rule with a tick-test fallback, historically using a lagged quote to compensate for trade/quote timestamp mismatch.

This is the classic workhorse. It became famous because it was simple, practical, and far better than naively signing every trade at the ask as a buy and every trade at the bid as a sell, with no handling of ties or timing issues.

The classic intuition

Older data often had trade reports recorded with a delay relative to quote updates. So the quote that looked “current” at a trade’s timestamp in the database could actually be newer than the quote traders saw when the trade happened. Lee-Ready’s famous fix was to look at a slightly earlier quote rather than the naively contemporaneous one.

Then:

compare trade price to the bid-ask midpoint from the selected quote,
classify above/below midpoint as buy/sell,
if the trade occurs at the midpoint, use the tick test.
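
A hedged sketch of that structure in Python, not a reference implementation. The function name lee_ready_style, the tuple layouts, and the convention of using the last quote stamped at or before (trade time minus lag) are assumptions made here for illustration. The lag is a parameter precisely because it should be calibrated, not fixed.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def lee_ready_style(
    trades: List[Tuple[float, float]],         # (time, price), sorted by time
    quotes: List[Tuple[float, float, float]],  # (time, bid, ask), sorted by time
    lag: float = 0.0,
) -> List[Optional[int]]:
    """Quote rule against a lagged quote, with a tick-test fallback."""
    quote_times = [q[0] for q in quotes]
    signs: List[Optional[int]] = []
    tick_sign: Optional[int] = None
    prev_price: Optional[float] = None
    for t, price in trades:
        # Maintain tick-test state on every trade so the fallback is always ready.
        if prev_price is not None and price != prev_price:
            tick_sign = 1 if price > prev_price else -1
        prev_price = price

        sign: Optional[int] = None
        i = bisect_right(quote_times, t - lag) - 1  # last quote at or before t - lag
        if i >= 0:
            _, bid, ask = quotes[i]
            if bid < ask:  # skip locked or crossed quotes
                mid = (bid + ask) / 2
                if price > mid:
                    sign = 1
                elif price < mid:
                    sign = -1
        # Midpoint tie or no usable quote: fall back to the tick test.
        signs.append(sign if sign is not None else tick_sign)
    return signs
```

Calling it with lag=0.0 gives a plain quote rule with tick fallback. A positive lag looks back to an earlier quote, which is the historical Lee-Ready adjustment.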

Why it still matters

Because the structure is still useful:

quote-based primary classification,
trade-only fallback for unresolved cases,
explicit recognition that timestamp alignment matters.

What people get wrong

They copy the old lag mechanically.

The dangerous anti-pattern is:

“Lee-Ready means use a 5-second lag.”

No. That lag was a historical fix for specific data conditions. In modern feeds:

the needed lag may be much smaller,
it may be symbol-specific,
it may vary by venue/feed,
or the correct answer may be no lag at all.

In some environments, blindly applying the historical lag can be worse than using a calibrated contemporaneous quote.

Practical rule

Use Lee-Ready as a framework, not as a frozen parameter choice.

6) Bulk Volume Classification (BVC)

In one sentence

BVC estimates the buy/sell split of volume over short intervals from aggregate price movement, instead of assigning an exact sign to each individual trade.

This is a very different tool. It is not trying to recover the precise aggressor side for trade number 184,723. It is trying to infer whether a short interval’s volume was mostly buyer- or seller-driven.
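
To make the idea concrete, here is a rough sketch of one common formulation: the buy fraction of a bar is the standard normal CDF of the bar’s price change divided by the standard deviation of price changes. The function names, the use of a normal (rather than Student-t) distribution, and the in-sample standard deviation are simplifying assumptions. A production version would more likely use a rolling volatility estimate to avoid look-ahead.

```python
import math
from statistics import pstdev
from typing import List, Tuple


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bvc_split(closes: List[float], volumes: List[float]) -> List[Tuple[float, float]]:
    """Estimate (buy_volume, sell_volume) for each bar after the first."""
    changes = [b - a for a, b in zip(closes, closes[1:])]
    sigma = pstdev(changes)  # assumption: in-sample sigma, for illustration only
    result = []
    for change, volume in zip(changes, volumes[1:]):
        # Flat prices (or zero variance) split volume 50/50.
        buy_fraction = normal_cdf(change / sigma) if sigma > 0 else 0.5
        result.append((volume * buy_fraction, volume * (1.0 - buy_fraction)))
    return result
```

Note that the output is a volume split per bar. Nothing in it assigns a side to any individual trade.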

Why it exists

Sometimes you do not have the data quality needed for reliable trade-level classification. But you still want:

signed-volume bars,
rough toxicity measures,
bar-level order-flow summaries,
or features for slower-horizon models.

BVC is attractive because it can be data-efficient and useful when trade-level signing is messy or unnecessary.

Where it shines

bar-level analytics,
coarser data environments,
exploratory flow summaries,
signals where interval-level signed volume matters more than exact trade labels.

Where it is the wrong tool

trade-by-trade impact estimation,
per-fill aggressor labeling,
venue microstructure diagnostics,
queue or fill-hazard models,
or anything that needs precise event sequencing.

The key mental model

BVC is an interval estimator, not an aggressor-side oracle.

If you use it as if it were trade-level truth, you will quietly poison your downstream research.

7) Modern market structure breaks the naive versions

Trade classification got harder because markets got faster and less literal: the displayed quote is no longer a reliable picture of what a trade actually interacted with.

A) Timestamp alignment is not a footnote

This is the biggest operational issue. Your trades and quotes may be stamped by:

different clocks,
different gateways,
different feed handlers,
different vendors,
or different normalization pipelines.

Even if both timestamps look precise, they may not be causally aligned.

What this breaks:

quote-rule accuracy,
Lee-Ready lag choice,
bid/ask touch matching,
and all “prevailing quote” logic.

B) Midpoint and hidden executions

A midpoint print is not clean evidence of either side from price alone. That means midpoint-heavy venues or hidden-liquidity interactions create many unresolved or weakly resolved cases.

C) Odd lots distort displayed touch intuition

Displayed NBBO logic may not fully reflect the real tradeable state when odd lots or venue-specific display rules matter. A trade can look “inside” or “off touch” without being economically bizarre.

D) Fragmented markets mean multiple plausible quotes

Which quote was “the quote”: the SIP quote, the direct-feed quote, the venue-local quote, or a vendor-normalized synthetic quote?

These are not equivalent under latency and fragmentation.

E) Auctions and special prints are different animals

Opening/closing crosses, halts, reopening auctions, off-book prints, derivatively priced prints, and corrected or cancelled prints should not be shoved through the same classifier as ordinary continuous lit trading.

A robust pipeline often starts by excluding or separately labeling special trade conditions.

8) Decision matrix: what to use when

Use native aggressor-side flags when

the venue/feed provides them,
you trust the data source,
and you need trade-level truth as much as possible.

This is the first choice whenever available. Do not “simplify” away better information.

Use quote rule + fallback when

you have decent trade and quote data,
you need trade-level signed flow,
but you do not have native aggressor flags.

For many practical research stacks, this is the default baseline.

Use Lee-Ready-style logic when

trade/quote alignment is imperfect,
midpoint ties need resolution,
and you are willing to calibrate the quote lag rather than cargo-culting it.

This is often the best pragmatic trade-level heuristic family.

Use tick test alone only when

quote data is missing or unusable,
or you explicitly accept lower fidelity.

It is a fallback, not a badge of toughness.

Use BVC when

you only need interval-level signed volume,
your data is coarse,
or you are building bar-level features rather than micro-event labels.

9) A sane production pipeline

A robust trade-signing pipeline usually looks more like this than a single rule:

Step 1: Filter or isolate non-standard prints

Separate or drop:

auctions,
corrections/cancels,
derivatively priced prints,
condition codes you do not want in continuous-flow analysis,
and suspicious off-book records.
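
One way to sketch this step in Python is to partition prints without deleting them. The condition labels below are hypothetical placeholders: real feeds use venue- or vendor-specific condition codes, and the mapping has to come from the feed specification. The function name split_prints and the dict-based trade records are likewise assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical labels for illustration; substitute your feed's actual condition codes.
SPECIAL_CONDITIONS = {"AUCTION", "CORRECTION", "CANCEL", "DERIVATIVELY_PRICED", "OFF_BOOK"}


def split_prints(trades: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition trades into (regular, special) based on their condition labels."""
    regular, special = [], []
    for trade in trades:
        conditions = set(trade.get("conditions", ()))
        if conditions & SPECIAL_CONDITIONS:
            special.append(trade)  # keep for separate analysis, do not sign as continuous flow
        else:
            regular.append(trade)
    return regular, special
```

Keeping the special prints in their own list, instead of dropping them, makes it easy to audit later how much volume was excluded and why.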

Step 2: Choose the quote source deliberately

Define whether you use:

venue-local book,
consolidated quote,
vendor-normalized top-of-book,
or direct-feed-derived quote state.

Do not leave this implicit.

Step 3: Calibrate quote alignment

Test a range of quote offsets/lags and measure classification quality on a sample with better ground truth.
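
A small sketch of such a lag sweep, assuming you have a benchmark sample where each trade carries a trusted sign. The function names, the tuple layouts, and the choice to count unresolved trades as misses are assumptions for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple


def sign_with_lag(t: float, price: float, quotes, quote_times, lag: float) -> Optional[int]:
    """Quote-rule sign using the last quote at or before t - lag."""
    i = bisect_right(quote_times, t - lag) - 1
    if i < 0:
        return None
    _, bid, ask = quotes[i]
    if bid >= ask:
        return None
    mid = (bid + ask) / 2
    if price > mid:
        return 1
    if price < mid:
        return -1
    return None


def accuracy_by_lag(
    trades: List[Tuple[float, float, int]],    # (time, price, true_sign)
    quotes: List[Tuple[float, float, float]],  # (time, bid, ask), sorted by time
    lags: List[float],
) -> Dict[float, float]:
    """Share of benchmark trades signed correctly for each candidate lag."""
    quote_times = [q[0] for q in quotes]
    out = {}
    for lag in lags:
        hits = sum(
            1 for t, price, truth in trades
            if sign_with_lag(t, price, quotes, quote_times, lag) == truth
        )
        out[lag] = hits / len(trades)  # unresolved trades count as misses
    return out
```

On toy data where trade timestamps run slightly late relative to quotes, a zero lag can sign every trade wrong while a modest lag signs them all correctly. The useful output is the whole accuracy curve across lags, not just its maximum.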

Step 4: Apply midpoint/quote rule first

Use quote information when it is credible.

Step 5: Use tick test only for unresolved ties or quote-missing cases

Do not let the fallback quietly become the dominant method unless you intended that.

Step 6: Emit confidence / method metadata

For every signed trade, store at least:

sign,
method_used (native, quote, midpoint, tick, bulk, unknown),
quote_lag_used,
at_midpoint flag,
special_condition flag,
and ideally a confidence tier.

This is incredibly useful later when a model behaves strangely.
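
As a sketch, the fields above map naturally onto a small record type. The class name SignedTrade, the trade_id field, the field types, and the validation rules are assumptions; only the field list and the method_used values come from the text above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

METHODS = {"native", "quote", "midpoint", "tick", "bulk", "unknown"}


@dataclass(frozen=True)
class SignedTrade:
    trade_id: str                    # hypothetical identifier, not in the field list above
    sign: Optional[int]              # +1 buy, -1 sell, None for unknown
    method_used: str
    quote_lag_used: Optional[float]  # None when no quote was consulted
    at_midpoint: bool
    special_condition: bool
    confidence_tier: str = "unrated"

    def __post_init__(self) -> None:
        # Reject records whose provenance would be uninterpretable later.
        if self.method_used not in METHODS:
            raise ValueError(f"unknown method_used: {self.method_used!r}")
        if self.sign not in (1, -1, None):
            raise ValueError(f"sign must be +1, -1, or None, got {self.sign!r}")
```

A record such as SignedTrade("t1", 1, "quote", 0.0, False, False, "high") can be turned into a plain dict with asdict for storage alongside the trade.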

10) The most useful validation questions

Before trusting signed-flow analytics, ask:

A) What is the ground truth subset?

Examples:

exchange-native aggressor flags for some venues,
proprietary order-level logs,
or MBO reconstruction where aggressor is inferable.

Without some benchmark subset, you are grading your classifier with vibes.

B) How sensitive is performance to quote lag?

If accuracy changes a lot when the lag moves slightly, your alignment problem is not solved.

C) Where does the classifier fail?

Break results out by:

symbol liquidity bucket,
spread bucket,
volatility regime,
venue,
time of day,
midpoint-tie frequency,
and special trade-condition rates.
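
A breakdown like this needs very little machinery. The sketch below assumes each benchmark row is a dict with "predicted" and "truth" keys plus whatever bucket columns you care about. Those key names and the function name are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def accuracy_by_bucket(rows: List[Dict], key: str) -> Dict[Hashable, float]:
    """Classification accuracy grouped by the value of rows[i][key]."""
    hits: Dict[Hashable, int] = defaultdict(int)
    counts: Dict[Hashable, int] = defaultdict(int)
    for row in rows:
        bucket = row[key]
        counts[bucket] += 1
        hits[bucket] += int(row["predicted"] == row["truth"])
    return {bucket: hits[bucket] / counts[bucket] for bucket in counts}
```

Running it once per dimension (venue, spread bucket, time of day, and so on) usually shows whether errors are spread evenly or concentrated in a few regimes.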

D) How much fallback is happening?

If most trades are being signed by the tick test, your quote path is weaker than you think.
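
If each signed trade carries a method label, the fallback share is a one-liner to monitor. The function name and the label strings below are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def method_shares(methods: Iterable[str]) -> Dict[str, float]:
    """Fraction of trades signed by each method label."""
    counts = Counter(methods)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {method: count / total for method, count in counts.items()}
```

For example, method_shares(["quote", "quote", "tick", "quote"]) reports 75% quote and 25% tick. Tracking this per venue and per day makes a silently degrading quote path visible.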

E) What downstream metrics are fragile to sign noise?

Some signals survive moderate misclassification. Others collapse. You need to know which kind of signal you are building.
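
One cheap way to probe this is to inject synthetic sign noise and watch the metric move. The sketch below uses volume-weighted order-flow imbalance as the example metric and flips each sign independently with probability flip_prob. The function names, the metric choice, and the independent-flip noise model are all assumptions. Under that model the expected imbalance shrinks by a factor of (1 - 2 * flip_prob), so a 10% error rate costs roughly 20% of the signal before any nonlinear downstream effects.

```python
import random
from typing import List


def imbalance(signs: List[int], volumes: List[float]) -> float:
    """Volume-weighted order-flow imbalance in [-1, 1]."""
    total = sum(volumes)
    return sum(s * v for s, v in zip(signs, volumes)) / total


def flip_signs(signs: List[int], flip_prob: float, seed: int = 0) -> List[int]:
    """Flip each sign independently with probability flip_prob (seeded for repeatability)."""
    rng = random.Random(seed)
    return [-s if rng.random() < flip_prob else s for s in signs]
```

Sweeping flip_prob over a plausible range and recomputing the downstream signal each time gives a rough tolerance curve for that signal.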

11) Common failure modes

Failure mode 1: Hardcoding the historical Lee-Ready lag

This is probably the most common unforced error. What fixed one old dataset can badly misalign another.

Failure mode 2: Mixing venues without venue-aware quote logic

If you pool prints across venues but use one synthetic quote state blindly, you can manufacture sign noise.

Failure mode 3: Treating auction prints like normal continuous prints

This contaminates order-flow metrics around the open, close, halts, and rebalance windows.

Failure mode 4: Using BVC for trade-level models

BVC is powerful in the right place and quietly destructive in the wrong one.

Failure mode 5: Ignoring unresolved/unknown cases

Unknown is a valid label. Forcing a weak guess can be worse than carrying uncertainty explicitly.

Failure mode 6: Not storing the classifier provenance

Months later, someone asks why the toxicity feature shifted. If you did not store method, lag, and filters, the answer becomes archaeology.

12) A practical default for most research teams

If you have decent top-of-book quote data but no native aggressor flag, a good default is:

filter out non-standard trade conditions,
choose a deliberate quote source,
calibrate a small set of candidate quote lags,
apply midpoint/quote classification,
use tick test only for midpoint ties or quote-missing cases,
keep an unknown state when confidence is weak,
validate by venue/liquidity/time-of-day,
and only use BVC for bar-level features where trade-level precision is not required.

That is not the fanciest possible pipeline. It is just the one least likely to fool you early.

13) When to invest in something more advanced

Move beyond basic heuristics when:

signed-flow features materially drive PnL or risk,
your market is midpoint/hidden-liquidity heavy,
your execution models are venue-sensitive,
your feed stack includes both SIP and direct-feed paths,
or you are estimating very short-horizon impact / toxicity where sign error is expensive.

At that point, the right answer is often:

venue-aware logic,
direct-feed or MBO reconstruction,
richer trade-condition filtering,
explicit clock-alignment work,
and per-venue validation rather than one global classifier.

14) Bottom line

Trade signing is one of those market-microstructure chores that looks boring right up until it ruins a model.

The practical summary is:

Best available truth beats clever heuristics. Use native aggressor information when you have it.
Quote-based methods are still the right baseline for many trade-level tasks, but only if timing is handled carefully.
Tick test is a fallback, not a modern gold standard.
Lee-Ready is a framework, not a sacred 5-second constant.
BVC is for interval-level signed flow, not exact per-trade labels.
Always validate by venue, liquidity, and time regime before trusting downstream analytics.

If your signed-flow feature feels mysteriously unstable, the first suspect should often be the classifier, not the alpha.

Pointers for deeper reading

Classic and commonly cited references to revisit:

Lee & Ready (1991) — canonical quote-rule + tick-fallback framework for trade classification.
Odders-White / Finucane / Ellis-O'Hara-Michaely era work — practical accuracy issues and quote/trade timing concerns.
Easley, López de Prado, O'Hara — bulk volume classification and interval-level flow inference.
Later electronic-era comparison papers — useful reminders that method quality depends heavily on market structure, data resolution, and timestamp alignment.

Those papers are worth reading not because they give one eternal answer, but because they teach the right habit:

treat trade classification as a data-quality problem first, and an algorithm choice second.