Market-Data Source-Parity Drift and Benchmark-Inconsistency Slippage Playbook
Date: 2026-04-12
Category: research (execution / slippage modeling)
Why this playbook exists
A lot of slippage models are secretly built on a fragile assumption:
that the market-data representation used in research, backtests, TCA, and live routing is effectively the same object.
In practice, that assumption breaks all the time.
Examples:
- research labels were built on direct feeds, but live features come from SIP-plus-normalization,
- historical data used round-lot-only top of book, while live inference now ingests odd-lot-inside quotes,
- one vendor upgrades to MDI-era odd-lot dissemination while another still normalizes to legacy NBBO fields,
- training used market-by-price depth, but production scoring uses market-by-order-derived queue features,
- TCA benchmarks against visible best quote while routing logic acts on a protected/actionable best quote,
- a replay dataset silently changes timestamp semantics, depth truncation, or quote-eligibility treatment.
When that happens, the model may look like it drifted even if the market did not.
What actually changed was the representation of the market state.
That creates a specific slippage failure mode:
- benchmark continuity breaks,
- feature meaning changes without feature names changing,
- backtest/live comparability collapses,
- TCA blames strategy urgency for what is really data-source mismatch,
- retraining pipelines learn source transitions instead of market structure.
This note treats that as a first-class production problem: source-parity drift.
Public facts that make this real
This is not a hypothetical edge case. U.S. equity market data semantics have already changed in public, documented ways.
A few examples are enough:
- The SEC’s Market Data Infrastructure rule expanded core data and updated round-lot definitions, explicitly recognizing that better-priced liquidity can exist outside the legacy 100-share lens and that certain odd-lot quotation information should be disseminated.
- CTA / UTP odd-lot materials publicly state that odd-lot quotes can be disseminated while still not being protected quotes and not changing the protected round-lot NBBO.
- The SEC’s order-execution disclosure modernization explicitly acknowledges that odd-lot quotes in higher-priced stocks often offer prices better than the round-lot NBBO, and it expands execution-quality reporting to include odd-lot and fractional-share contexts.
- Public SIP / plan documentation and vendor rollout notices make clear that message content, parsing expectations, and quote fields have changed over time as odd-lot dissemination and related market-data enhancements rolled out.
The implication is simple:
"best quote" is no longer a stable primitive unless you also specify the source semantics.
If your model logs only best_bid, best_ask, and spread, you are often logging an incomplete object.
The core failure mode
A slippage model really sees a market state through a source tuple.
For decision time t, define:
- Qsrc(t): quote source (SIP / direct / blended / vendor-normalized),
- Ssrc(t): size semantics (round-lot-only / odd-lot-aware / protected-size-only / aggregated depth),
- Dsrc(t): depth semantics (L1 / L5 / MBP / MBO / synthetic queue estimate),
- Psrc(t): protection/accessibility semantics (visible / protected / router-actionable),
- Tsrc(t): timestamp semantics (exchange event / SIP event / receive time / normalized hybrid),
- Nsrc(t): normalization version (vendor parser rules, symbol mapping, round-lot table version, corporate-action adjustment logic).
Call the full tuple:
[ \Sigma(t) = (Qsrc, Ssrc, Dsrc, Psrc, Tsrc, Nsrc) ]
Two observations with the same field names but different \Sigma(t) are not the same state.
That is the heart of source-parity drift.
A model trained under \Sigma_train and served under \Sigma_live is effectively doing domain transfer, whether the team realizes it or not.
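The tuple can be made concrete in code. A minimal sketch, assuming Python and illustrative field vocabularies (these value names are not a standard taxonomy):

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class SourceTuple:
    """The source tuple Sigma(t) attached to every observation."""
    quote_src: str              # Qsrc: e.g. "sip", "direct", "blended"
    size_semantics: str         # Ssrc: e.g. "round-lot-only", "odd-lot-aware"
    depth_semantics: str        # Dsrc: e.g. "l1", "mbp", "mbo"
    protection: str             # Psrc: e.g. "visible", "protected", "actionable"
    timestamp_domain: str       # Tsrc: e.g. "exchange-event", "sip-event"
    normalization_version: str  # Nsrc: parser / round-lot-table version

def same_state_space(a: SourceTuple, b: SourceTuple) -> bool:
    # Two observations with identical field names are only directly
    # comparable if their full source tuples match.
    return astuple(a) == astuple(b)
```

Making the tuple a frozen dataclass lets it double as a dictionary key for cohort segmentation later.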
Why this is worse than ordinary feature drift
Ordinary feature drift says:
- the feature distribution changed because the market changed.
Source-parity drift says:
- the feature distribution changed because the measuring instrument changed,
- while field names and dashboards still pretend continuity.
That makes it more dangerous because:
- it often looks like genuine regime change,
- standard drift detectors can fire without identifying the root cause,
- performance regressions may be “fixed” with retraining that bakes in the mismatch,
- TCA comparisons across periods become apples-to-oranges,
- rollback decisions become confused because online metrics and offline replays disagree for structural reasons.
Mechanism map
1. Benchmark object drift
Suppose research used a protected round-lot arrival benchmark, but live TCA now benchmarks against visible best quote including odd-lot-inside improvement.
Now fills can look worse even if router behavior is unchanged, because the benchmark got tighter while actionable capacity did not improve much.
This is not strategy degradation.
It is benchmark object drift.
2. Feature meaning drift under the same column name
A column called spread_bps can mean very different things depending on whether the touch was:
- round-lot only,
- odd-lot-aware,
- protected-only,
- blended across feeds,
- recomputed from vendor-normalized synthetic best quote.
Likewise for:
- imbalance,
- queue size,
- top-of-book depth,
- quote age,
- visible price improvement,
- markout reference mid.
Same name, different object.
3. Label contamination
Suppose labels are built from execution price relative to arrival mid from a source richer than what the live model sees.
Then the model is trained to predict outcomes against a reference it never gets online.
That creates false optimism in offline training and fake underperformance online.
4. Counterfactual replay mismatch
A postmortem replay often reconstructs “what the model would have seen.”
But if replay data came from another vendor, another normalization version, or another source mix, the reconstructed state can differ materially from the actual online state at the same decision timestamp.
Now the replay is not a counterfactual.
It is a different universe with shared timestamps.
5. Production rollout illusion
A data vendor or parser upgrade can make a strategy look dramatically smarter or dumber overnight.
Not because routing changed, but because:
- odd-lot inside quotes are suddenly included,
- round-lot thresholds changed by price tier,
- timestamp precedence changed,
- depth aggregation logic changed,
- stale or inaccessible quotes are filtered differently.
If this is not tagged and segmented, model governance starts chasing ghosts.
The right abstraction: source parity as a measurable state variable
Define a source distance between two environments:
[ \Delta_{src}(A,B) = w_q d(Q_A,Q_B) + w_s d(S_A,S_B) + w_d d(D_A,D_B) + w_p d(P_A,P_B) + w_t d(T_A,T_B) + w_n d(N_A,N_B) ]
Where each component is a structured mismatch measure:
- quote-source mismatch,
- size-semantics mismatch,
- depth mismatch,
- protection/accessibility mismatch,
- timestamp mismatch,
- normalization-version mismatch.
You do not need perfect formal distance metrics for this to be useful. A practical implementation can encode each component as categorical parity / partial parity / broken parity.
The key idea:
before comparing model performance across environments, compare Δsrc first.
If source distance is large, performance comparison is not clean evidence of model quality.
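The parity / partial parity / broken parity encoding can be sketched directly. The weights and the same-family heuristic below are placeholders a team would calibrate, not a recommended scoring:

```python
# Parity scale: 0 = parity, 0.5 = partial parity, 1 = broken parity.
PARITY, PARTIAL, BROKEN = 0.0, 0.5, 1.0

# Placeholder weights (w_q, w_s, w_d, w_p, w_t, w_n); calibrate per desk.
WEIGHTS = {
    "quote_src": 1.0,
    "size_semantics": 1.0,
    "depth_semantics": 1.0,
    "protection": 1.5,
    "timestamp_domain": 0.5,
    "normalization_version": 0.5,
}

def component_mismatch(a: str, b: str) -> float:
    """Categorical mismatch for one tuple component."""
    if a == b:
        return PARITY
    # Same family, different variant (e.g. "sip" vs "sip-blended")
    # counts as partial parity; anything else is broken parity.
    if a.split("-")[0] == b.split("-")[0]:
        return PARTIAL
    return BROKEN

def source_distance(env_a: dict, env_b: dict, weights: dict = WEIGHTS) -> float:
    """Delta_src(A, B): weighted sum of component mismatches."""
    return sum(w * component_mismatch(env_a[k], env_b[k]) for k, w in weights.items())
```

The useful property is not the absolute number but the ordering: environments with large distance should never be compared on raw performance.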
A practical source-tuple contract
Every scored decision and every backtest sample should store a compact source descriptor like:
{
"quoteSource": "sip",
"sizeSemantics": "odd-lot-aware-visible-best",
"depthSemantics": "l1-plus-aggregated-l5",
"protectionSemantics": "visible-not-actionable",
"timestampSemantics": "sip_event_ts",
"normalizationVersion": "mdi-v3.2",
"vendor": "vendor-x",
"roundLotTableVersion": "2026-04-27-prep"
}
Without this, you cannot reliably answer:
- why backtest and live disagree,
- whether TCA is comparable across months,
- whether a rollout changed model performance or merely changed measurement.
Metrics worth instrumenting
1. SPD — Source-Parity Distance
A structured score between training, replay, TCA, and live environments.
Examples:
- SPD(train, live)
- SPD(backtest, prod)
- SPD(tca, router)
- SPD(replay, prod)
If SPD is high, segment metrics before drawing conclusions.
2. BSG — Benchmark Source Gap
For a buy order:
[ BSG(t) = P_{bench}(t) - P_{router}(t) ]
Where:
- P_bench(t) is the benchmark price source used by TCA/labels,
- P_router(t) is the economically relevant price source used by live routing.
This catches cases where TCA uses visible best but the router acts on protected or actionable best.
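The sign convention can be pinned down in a few lines. The text defines the buy case only; the sell-side flip below is an assumption so that a negative gap always means the benchmark is tighter than the router-actionable price:

```python
def benchmark_source_gap(p_bench: float, p_router: float, side: str = "buy") -> float:
    """BSG(t) = P_bench(t) - P_router(t) for a buy order.

    For a sell the sign is flipped (assumption), so that a negative gap
    uniformly means TCA grades against a harder-to-beat reference than
    the price object routing actually acted on."""
    gap = p_bench - p_router
    return gap if side == "buy" else -gap
```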
3. FSC — Feature Semantic Change Rate
Share of decisions where the same named feature changes interpretation because the source tuple changed.
Practical example:
- spread_bps computed under round-lot-only yesterday,
- spread_bps computed under odd-lot-aware visible best today.
4. LCP — Label/Context Parity Rate
Fraction of training labels whose reference source matches the live feature source contract.
Low LCP means the model is learning against a target defined in another data universe.
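LCP reduces to a simple rate over the training set, assuming each sample carries a tag for its label's reference source (the `label_source` key below is a hypothetical field name):

```python
def label_context_parity(samples: list[dict], live_contract: str) -> float:
    """Fraction of training labels whose reference source matches the
    live feature-source contract (LCP). Each sample is assumed to carry
    a 'label_source' tag, e.g. 'protected-round-lot-mid'."""
    if not samples:
        return 0.0
    matches = sum(s["label_source"] == live_contract for s in samples)
    return matches / len(samples)
```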
5. RRG — Reachable-Reference Gap
Difference between:
- reference quote in the benchmark source,
- actually reachable quote for the router’s actionable venue set.
This is especially useful when visible, protected, and actionable price sets diverge.
6. SSS — Source-Stable Sharpe / Source-Stable Slippage
Measure performance only on cohorts where source tuple parity is stable.
This helps separate:
- genuine model signal,
- source-change artifacts.
7. VNC — Version-Normalization Churn
Rate at which parser/vendor/normalization versions change in the production pipeline.
Many teams track model version obsessively and data-normalization version barely at all. That is backwards.
Concrete failure examples
Example A: odd-lot rollout makes TCA harsher overnight
Before rollout:
- arrival benchmark = round-lot NBBO,
- live routing = protected round-lot best,
- TCA and router mostly aligned.
After rollout:
- TCA benchmark = visible best with odd-lot-inside price improvement,
- router still executes mostly against protected liquidity,
- same fills now look systematically worse.
What changed?
Not necessarily routing quality.
The benchmark source changed.
Example B: model trained on direct-feed queue features, served on SIP-plus-derived queue proxies
Offline:
- queue-position proxies look predictive,
- fill-hazard model is sharp.
Online:
- quote updates are less granular,
- event ordering differs,
- queue proxy becomes noisy,
- model overcommits to passive posting.
What failed?
Not “markets got harder.”
Σ_train != Σ_live.
Example C: vendor parser change appears as regime drift
A vendor upgrades odd-lot support and round-lot-tier handling.
Suddenly:
- inside spread compresses more often,
- quote-improvement frequency jumps,
- effective spread histograms shift,
- model residuals move.
A naive dashboard says “regime change.”
A source-aware dashboard says “normalization version changed on Tuesday.”
Modeling blueprint
A robust slippage model should condition on both market state and source state.
Think of expected cost as:
[ E[C \mid x_t, \Sigma_t] = f(x_t, \Sigma_t) ]
where:
- x_t = market/execution state,
- Σ_t = source tuple.
There are three practical ways to implement this.
Path 1: hard parity enforcement
Only train / backtest / compare when source tuples match within strict tolerance.
Best when:
- governance is strong,
- sources are controlled,
- sample size is large enough.
Path 2: source-conditioned model
Add source-state indicators directly into the model:
- quote-source class,
- benchmark-source class,
- depth representation,
- odd-lot visibility flag,
- actionable-vs-visible reference mode,
- normalization version family.
Best when:
- exact parity is impossible,
- multiple source regimes must coexist.
Path 3: source-specific experts with gated blending
Train specialized experts for source regimes:
- legacy round-lot-only regime,
- odd-lot-aware visible-best regime,
- direct-feed queue-rich regime,
- low-fidelity SIP-only regime.
Then gate by detected source tuple.
Best when:
- source semantics differ materially,
- one unified model is too brittle.
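Path 3 is essentially a dispatch on the detected source regime. A sketch with toy expert models (the regime names and linear-in-spread experts are purely illustrative, not fitted models):

```python
def predict_cost(x: dict, regime: str, experts: dict, fallback: str) -> float:
    """Gate to a source-specific expert; fall back to a conservative
    (pessimistic) expert if the detected regime has no trained model."""
    model = experts.get(regime, experts[fallback])
    return model(x)

# Hypothetical experts keyed by source regime; in practice each would be
# a slippage model fitted on parity-stable data from that regime.
experts = {
    "legacy-round-lot": lambda x: 1.2 * x["spread_bps"],
    "odd-lot-aware": lambda x: 0.9 * x["spread_bps"],
    "sip-only-low-fidelity": lambda x: 1.5 * x["spread_bps"],
}
```

Choosing the most pessimistic expert as the fallback is a deliberate design choice: an unrecognized source regime should degrade toward conservative cost estimates, not optimistic ones.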
A useful decomposition for slippage attribution
When source parity is imperfect, decompose realized cost as:
[ C_{realized} = C_{market} + C_{execution} + C_{source} ]
Where:
- C_market: cost from actual market conditions,
- C_execution: cost from decisions / routing / urgency,
- C_source: cost or measurement distortion introduced by source mismatch.
C_source is not always direct economic loss. Sometimes it is a diagnostic distortion that causes:
- wrong TCA conclusions,
- bad retraining labels,
- incorrect promotion / rollback choices,
- overreaction to fake performance change.
In practice, teams should estimate both:
- economic source cost — real fill degradation from source mismatch,
- measurement source cost — benchmark / label distortion caused by source mismatch.
State machine for production controls
1. PARITY_STABLE
Conditions:
- source tuple consistent across training / scoring / TCA,
- no unreviewed parser or vendor changes,
- benchmark source aligned with routing source.
Action:
- normal operation.
2. PARITY_WARNING
Conditions:
- source tuple changed in one dimension,
- benchmark-source gap widening,
- odd-lot / depth / timestamp semantics differ but comparability still partially salvageable.
Action:
- segment metrics,
- freeze cross-period comparisons unless source-adjusted,
- start dual-source shadow logging.
3. PARITY_BROKEN
Conditions:
- benchmark and router sources materially diverge,
- training/live source mismatch exceeds tolerance,
- replay environment no longer reproduces production state.
Action:
- stop using naive TCA comparisons for model governance,
- downweight affected features,
- route conservatively for source-sensitive tactics,
- require source-aware retraining or replay rebuild.
4. SAFE_SHADOW
Conditions:
- vendor rollout, parser upgrade, odd-lot enablement, round-lot-table change, timestamp-domain migration, or benchmark-definition change.
Action:
- dual-run old/new source paths,
- score both without immediate promotion,
- compare only on parity-stable cohorts,
- promote after source-adjusted acceptance tests pass.
Use hysteresis. It should be easier to enter warning modes than to leave them.
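The hysteresis asymmetry can be encoded as a small transition function. Thresholds are placeholders, and the event-triggered SAFE_SHADOW state is omitted since it is entered on rollouts, not on distance:

```python
# Hysteresis: entering a worse state is easy (low threshold); returning
# to a better state requires a clearly lower source distance.
ENTER_WARNING, EXIT_WARNING = 0.5, 0.25
ENTER_BROKEN, EXIT_BROKEN = 1.5, 1.0

def next_state(state: str, delta_src: float) -> str:
    """One transition step driven by the current source distance."""
    if state == "PARITY_STABLE":
        if delta_src >= ENTER_BROKEN:
            return "PARITY_BROKEN"
        if delta_src >= ENTER_WARNING:
            return "PARITY_WARNING"
        return state
    if state == "PARITY_WARNING":
        if delta_src >= ENTER_BROKEN:
            return "PARITY_BROKEN"
        if delta_src < EXIT_WARNING:
            return "PARITY_STABLE"
        return state
    # PARITY_BROKEN: recover one level at a time, never jump to stable.
    if delta_src < EXIT_BROKEN:
        return "PARITY_WARNING"
    return state
```

Note the broken state recovers only to warning, never directly to stable: one more deliberate asymmetry that keeps governance cautious after a parity break.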
Controls that actually help
Control 1: log the source tuple on every decision
If the source is not in the event log, postmortems will guess.
Guesses are where fake regime shifts are born.
Control 2: separate benchmark source from router source
Never assume the TCA benchmark is the same as the price object routing used.
Persist both explicitly.
Control 3: pin normalization versions in backtests
Do not let historical replays silently reparse old periods with today’s semantics unless that is the explicit experiment.
“Reproducible backtest” means reproducible source semantics too.
Control 4: build parity-stable cohorts
For governance dashboards, always keep a source-stable cohort where:
- quote source,
- size semantics,
- benchmark source,
- timestamp domain,
- normalization version,
are held constant.
This becomes the clean baseline for true model drift.
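Building that cohort is a filter over the decision log. A sketch assuming each record stores a source descriptor like the contract earlier in this note (the key subset held constant here is an assumption):

```python
from collections import Counter

# Assumed subset of descriptor keys to hold constant; extend with a
# benchmark-source field if your contract records one.
PARITY_KEYS = ("quoteSource", "sizeSemantics", "timestampSemantics",
               "normalizationVersion")

def parity_stable_cohort(decisions: list[dict]) -> list[dict]:
    """Keep only decisions whose source descriptor matches the modal
    (most common) descriptor; this becomes the clean baseline cohort
    for measuring true model drift."""
    keys = [tuple(d["source"][k] for k in PARITY_KEYS) for d in decisions]
    modal, _ = Counter(keys).most_common(1)[0]
    return [d for d, k in zip(decisions, keys) if k == modal]
```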
Control 5: require source-diff review for data changes
A vendor/parser/benchmark change should get the same review seriousness as a model change.
At minimum, require:
- impacted fields,
- changed semantics,
- expected effect on TCA,
- expected effect on live features,
- replay compatibility status,
- rollback plan.
Control 6: shadow old and new sources during transitions
Before fully switching a benchmark or data vendor:
- capture both source views,
- score both,
- compute benchmark-source gap,
- compare fills and markouts on matched decisions.
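The matched-decision comparison reduces to a join on decision id across the two source views. A sketch under the assumption that each view maps decision id to its benchmark price under that source:

```python
def shadow_source_gaps(old_view: dict, new_view: dict) -> dict:
    """Per-decision benchmark gap between old and new source paths.

    Each view maps decision_id -> benchmark price under that source;
    only decisions observed by both paths are compared."""
    shared = old_view.keys() & new_view.keys()
    return {d: new_view[d] - old_view[d] for d in sorted(shared)}
```

The distribution of these gaps, not any single value, is what the promotion decision should look at.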
Control 7: maintain source-aware feature dictionaries
Every important feature should declare:
- economic meaning,
- data dependencies,
- source assumptions,
- known failure modes under representation changes.
What to retrain, and what not to retrain
When source parity breaks, the knee-jerk response is often:
retrain immediately on the new data.
Sometimes that is correct.
Sometimes it is the fastest way to bury the evidence.
Retrain immediately only if:
- the new source semantics are intentional,
- they are the new production truth,
- the benchmark contract is updated,
- parity-aware evaluation says the shift is not just a temporary migration artifact.
Do not retrain blindly if:
- the source change is accidental or incomplete,
- only TCA changed but router features did not,
- replay cannot faithfully reproduce production,
- the new semantics are still under rollout.
In those cases, first restore comparability or explicitly fork the regime.
Promotion gates for source-changing rollouts
Before promoting a source or benchmark change, require something like:
- SPD(old, new) documented and approved,
- parity-stable cohort metrics non-regressive,
- benchmark-source gap explained and bounded,
- replay / postmortem environment rebuilt or explicitly marked incompatible,
- live-vs-shadow disagreement rate within threshold,
- no unexplained jump in effective spread / markout / fill-hazard residuals after source adjustment.
Rollback triggers:
- abrupt unexplained TCA deterioration concentrated in source-sensitive names,
- postmortem replays no longer matching online decisions,
- fill-hazard residual explosion after parser/vendor change,
- benchmark object changing without governance signoff.
Common mistakes
1. “Field names match, so the data matches.”
No. best_ask without source semantics is an underspecified object.
2. “If the model regressed after the rollout, the model is worse.”
Maybe. Or maybe the measuring stick changed.
3. “We can fix it by backfilling everything with the new parser.”
Sometimes that helps. Sometimes it destroys historical comparability unless the old periods are reinterpreted very carefully.
4. “TCA is independent ground truth.”
Only if benchmark semantics are stable and relevant to the router’s feasible action set.
5. “Vendor changes are ops details, not model risk.”
Wrong. In modern execution stacks, vendor/parser/normalization changes are often model-input contract changes.
A compact operating checklist
Before trusting a slippage comparison, ask:
- Did quote source change?
- Did odd-lot visibility change?
- Did protected vs visible benchmark semantics change?
- Did round-lot tier logic change?
- Did depth representation change?
- Did timestamp precedence change?
- Did parser / vendor / normalization version change?
- Did replay use the same source tuple as production?
- Did TCA benchmark the same economic object routing saw?
- Are we looking at a parity-stable cohort or a mixed cohort?
If you cannot answer those, you do not yet know whether the strategy changed.
Bottom line
A slippage model does not observe “the market” directly. It observes a representation of the market.
When that representation changes, the model’s world changes.
That means source-parity drift should be treated like any other serious production risk:
- measured,
- logged,
- gated,
- segmented,
- governed.
Otherwise teams end up retraining on phantom drift, grading execution with moving benchmarks, and congratulating or punishing the model for changes that actually came from the data pipe.
The practical rule is simple:
Before asking whether the model is wrong, ask whether the market-data object stayed the same.
References and public anchors
- U.S. SEC, Market Data Infrastructure final rule (2021): expanded core data concepts and updated round-lot definitions, acknowledging better-priced quotation information beyond the legacy round-lot lens.
- CTA / UTP odd-lot plan materials and FAQs: odd-lot quotations may be disseminated but are not protected quotes and do not change the protected round-lot NBBO.
- U.S. SEC, Disclosure of Order Execution Information modernization: recognizes that odd-lot quotes in higher-priced stocks can offer prices better than the round-lot NBBO and expands execution-quality reporting to odd-lot / fractional-share contexts.