Market-Data Source-Parity Drift and Benchmark-Inconsistency Slippage Playbook

2026-04-12 · finance

Category: research (execution / slippage modeling)

Why this playbook exists

A lot of slippage models are secretly built on a fragile assumption:

the market-data representation used in research, backtests, TCA, and live routing is "the same enough" to be treated as one object.

In practice, that assumption breaks all the time.

Examples:

  - research uses direct-feed books while live routing consumes the SIP,
  - the touch switches from round-lot-only to odd-lot-aware visibility,
  - a vendor parser upgrade changes normalization semantics mid-history,
  - TCA moves its benchmark from protected best to visible best.

When that happens, the model may look like it drifted even if the market did not.

What actually changed was the representation of the market state.

That creates a specific slippage failure mode:

  1. benchmark continuity breaks,
  2. feature meaning changes without feature names changing,
  3. backtest/live comparability collapses,
  4. TCA blames strategy urgency for what is really data-source mismatch,
  5. retraining pipelines learn source transitions instead of market structure.

This note treats that as a first-class production problem: source-parity drift.


Public facts that make this real

This is not a hypothetical edge case. U.S. equity market data semantics have already changed in public, documented ways.

A few examples are enough:

  - Odd-lot quotations priced inside the NBBO have historically been excluded from the SIP best quote, so "best price" already depends on whether odd lots are visible in your feed.
  - The SEC's Market Data Infrastructure rules redefine round lots by share-price tier, changing which quotations count toward the best quote for higher-priced names.
  - Consolidated (SIP) feeds and proprietary direct feeds differ in timestamps, depth coverage, and odd-lot visibility, so the "same" top of book can differ by source.

The implication is simple:

"best quote" is no longer a stable primitive unless you also specify the source semantics.

If your model logs only best_bid, best_ask, and spread, you are often logging an incomplete object.


The core failure mode

A slippage model really sees a market state through a source tuple.

For decision time t, define:

  - Qsrc — quote source (SIP, direct feed, or vendor blend),
  - Ssrc — size semantics (round-lot-only vs odd-lot-aware),
  - Dsrc — depth semantics (L1 only vs aggregated depth),
  - Psrc — protection semantics (protected, visible, or actionable best),
  - Tsrc — timestamp semantics (exchange, SIP, or capture time),
  - Nsrc — normalization / parser version.

Call the full tuple:

[ \Sigma(t) = (Qsrc, Ssrc, Dsrc, Psrc, Tsrc, Nsrc) ]

Two observations with the same field names but different \Sigma(t) are not the same state.

That is the heart of source-parity drift.

A model trained under \Sigma_train and served under \Sigma_live is effectively doing domain transfer, whether the team realizes it or not.
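To make the tuple concrete, it helps to carry Σ(t) as an explicit, comparable object rather than an implicit property of the pipeline. A minimal Python sketch, with field names and example values assumed for illustration:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch: the source tuple Σ(t) as an explicit, diffable object.
# Fields mirror the components (Q, S, D, P, T, N) defined above.
@dataclass(frozen=True)
class SourceTuple:
    quote_source: str           # Q: e.g. "sip" vs "direct"
    size_semantics: str         # S: e.g. "round-lot-only" vs "odd-lot-aware"
    depth_semantics: str        # D: e.g. "l1" vs "l1-plus-aggregated-l5"
    protection_semantics: str   # P: e.g. "protected" vs "visible-not-actionable"
    timestamp_semantics: str    # T: e.g. "sip_event_ts" vs "exchange_ts"
    normalization_version: str  # N: parser / vendor normalization tag

def mismatched_components(a: SourceTuple, b: SourceTuple) -> list[str]:
    """Names of the components where two environments disagree."""
    da, db = asdict(a), asdict(b)
    return [k for k in da if da[k] != db[k]]

sigma_train = SourceTuple("direct", "odd-lot-aware", "l1-plus-aggregated-l5",
                          "protected", "exchange_ts", "mdi-v3.1")
sigma_live = SourceTuple("sip", "round-lot-only", "l1",
                         "protected", "sip_event_ts", "mdi-v3.2")

# Same column names in the feature store, but five of six components differ:
# the model is doing domain transfer whether the team knows it or not.
print(mismatched_components(sigma_train, sigma_live))
```

Any non-empty diff between Σ_train and Σ_live is a flag that performance comparisons across the two environments need segmentation first.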


Why this is worse than ordinary feature drift

Ordinary feature drift says:

  the distribution of a feature changed, while its definition stayed fixed.

Source-parity drift says:

  the definition of the feature changed, while its name stayed fixed.

That makes it more dangerous because:

  1. it often looks like genuine regime change,
  2. standard drift detectors can fire without identifying the root cause,
  3. performance regressions may be “fixed” with retraining that bakes in the mismatch,
  4. TCA comparisons across periods become apples-to-oranges,
  5. rollback decisions become confused because online metrics and offline replays disagree for structural reasons.

Mechanism map

1. Benchmark object drift

Suppose research used a protected round-lot arrival benchmark, but live TCA now benchmarks against visible best quote including odd-lot-inside improvement.

Now fills can look worse even if router behavior is unchanged, because the benchmark got tighter while actionable capacity did not improve much.

This is not strategy degradation.

It is benchmark object drift.

2. Feature meaning drift under the same column name

A column called spread_bps can mean very different things depending on whether the touch was:

  - the protected round-lot best,
  - the visible best including odd lots priced inside, or
  - a vendor-normalized blend of the two.

Likewise for:

  - mid-price and arrival-mid features,
  - depth, imbalance, and touch-size features,
  - queue-position estimates.

Same name, different object.

3. Label contamination

Suppose labels are built from execution price relative to arrival mid from a source richer than what the live model sees.

Then the model is trained to predict outcomes against a reference it never gets online.

That creates false optimism in offline training and fake underperformance online.
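A toy numeric example of the mechanism, with invented prices: the same buy fill graded against a richer (odd-lot-aware) arrival mid used at training time, and against the coarser round-lot mid the live system observes.

```python
# Invented numbers illustrating label contamination: one fill, two mids.
fill_px = 100.03
mid_rich = 100.02   # odd-lot-aware arrival mid used to build training labels
mid_live = 100.00   # round-lot arrival mid the live model actually observes

slip_train_bps = (fill_px - mid_rich) / mid_rich * 1e4  # ~1.0 bp offline
slip_live_bps = (fill_px - mid_live) / mid_live * 1e4   # ~3.0 bp online

# The model learned to predict ~1 bp outcomes and is then graded at ~3 bp:
# offline optimism and online "underperformance" from the same fills.
print(round(slip_train_bps, 1), round(slip_live_bps, 1))
```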

4. Counterfactual replay mismatch

A postmortem replay often reconstructs “what the model would have seen.”

But if replay data came from another vendor, another normalization version, or another source mix, the reconstructed state can differ materially from the actual online state at the same decision timestamp.

Now the replay is not a counterfactual.

It is a different universe with shared timestamps.

5. Production rollout illusion

A data vendor or parser upgrade can make a strategy look dramatically smarter or dumber overnight.

Not because routing changed, but because:

  - the benchmark object tightened or loosened,
  - feature semantics shifted under stable names,
  - labels were re-anchored to a different reference price.

If this is not tagged and segmented, model governance starts chasing ghosts.


The right abstraction: source parity as a measurable state variable

Define a source distance between two environments:

[ \Delta_{src}(A,B) = w_q d(Q_A,Q_B) + w_s d(S_A,S_B) + w_d d(D_A,D_B) + w_p d(P_A,P_B) + w_t d(T_A,T_B) + w_n d(N_A,N_B) ]

Where each component is a structured mismatch measure:

  - d(Q_A, Q_B) — quote-source mismatch,
  - d(S_A, S_B) — size-semantics mismatch,
  - d(D_A, D_B) — depth-semantics mismatch,
  - d(P_A, P_B) — protection-semantics mismatch,
  - d(T_A, T_B) — timestamp-semantics mismatch,
  - d(N_A, N_B) — normalization-version mismatch.

You do not need perfect formal distance metrics for this to be useful. A practical implementation can encode each component as categorical parity / partial parity / broken parity.

The key idea:

before comparing model performance across environments, compare Δsrc first.

If source distance is large, performance comparison is not clean evidence of model quality.
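The categorical version above is easy to implement. A minimal sketch, with weights and level scores that are purely illustrative, not calibrated:

```python
# Hypothetical Δsrc with three categorical mismatch levels per component.
PARITY = 0.0    # identical semantics
PARTIAL = 0.5   # same family, different detail (e.g. parser minor version)
BROKEN = 1.0    # different object (e.g. protected vs visible best)

# Illustrative weights for (Q, S, D, P, T, N); they sum to 1.0.
WEIGHTS = {"Q": 0.25, "S": 0.2, "D": 0.15, "P": 0.2, "T": 0.1, "N": 0.1}

def source_distance(levels: dict[str, float]) -> float:
    """Weighted Δsrc(A, B) from per-component parity levels."""
    return sum(WEIGHTS[c] * levels[c] for c in WEIGHTS)

# Example: quote source and size semantics broken, normalization partially changed.
delta = source_distance({"Q": BROKEN, "S": BROKEN, "D": PARITY,
                         "P": PARITY, "T": PARITY, "N": PARTIAL})
print(round(delta, 3))  # 0.25 + 0.2 + 0.05 = 0.5
```

The exact weights matter less than having a single number that can gate comparisons and dashboards.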


A practical source-tuple contract

Every scored decision and every backtest sample should store a compact source descriptor like:

{
  "quoteSource": "sip",
  "sizeSemantics": "odd-lot-aware-visible-best",
  "depthSemantics": "l1-plus-aggregated-l5",
  "protectionSemantics": "visible-not-actionable",
  "timestampSemantics": "sip_event_ts",
  "normalizationVersion": "mdi-v3.2",
  "vendor": "vendor-x",
  "roundLotTableVersion": "2026-04-27-prep"
}
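A descriptor like this is only useful if it is validated and easy to group by. One way, sketched under the assumption that the key names above are the contract, is to hash the canonicalized descriptor into a cohort key:

```python
import json
import hashlib

# Assumed minimal contract, mirroring the descriptor example above.
REQUIRED_KEYS = {
    "quoteSource", "sizeSemantics", "depthSemantics",
    "protectionSemantics", "timestampSemantics", "normalizationVersion",
}

def source_cohort_key(descriptor: dict) -> str:
    """Validate the descriptor and derive a stable cohort key from it."""
    missing = REQUIRED_KEYS - descriptor.keys()
    if missing:
        raise ValueError(f"incomplete source descriptor, missing: {sorted(missing)}")
    # Canonicalize (sorted keys) so equal tuples always hash identically.
    canonical = json.dumps({k: descriptor[k] for k in sorted(REQUIRED_KEYS)})
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Decisions sharing a cohort key observed the same source tuple; decisions with different keys should not be compared without segmentation.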

Without this, you cannot reliably answer:

  - which price object a fill was graded against,
  - whether training and live decisions saw the same touch,
  - whether a replay reconstructed the state the model actually observed,
  - when a vendor or parser change entered the pipeline.


Metrics worth instrumenting

1. SPD — Source-Parity Distance

A structured score between training, replay, TCA, and live environments.

Examples:

  - SPD(train, live) — is the model serving in the world it learned in?
  - SPD(replay, live) — is the postmortem a true counterfactual?
  - SPD(TCA, router) — is the benchmark grading the object routing saw?

If SPD is high, segment metrics before drawing conclusions.

2. BSG — Benchmark Source Gap

For a buy order:

[ BSG(t) = P_{bench}(t) - P_{router}(t) ]

Where:

  - P_bench(t) is the TCA benchmark price (e.g. visible best including odd lots),
  - P_router(t) is the best price the router could actually act on (e.g. protected or actionable best).

This catches cases where TCA uses visible best but the router acts on protected or actionable best.
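The buy-side formula above mirrors directly into code; the sell-side sign convention below is an assumption (mirrored so that a benchmark tighter than the router's object is negative on both sides):

```python
# Sketch of the BSG metric. Buy side follows the formula above;
# the sell side is mirrored by assumption.
def benchmark_source_gap(p_bench: float, p_router: float, side: str) -> float:
    if side == "buy":
        return p_bench - p_router
    if side == "sell":
        return p_router - p_bench
    raise ValueError(f"unknown side: {side}")

# Buy: TCA benchmarks a visible odd-lot-inside ask of 100.01 while the router
# acted on a protected ask of 100.03 -> BSG = -0.02 (benchmark tighter).
print(round(benchmark_source_gap(100.01, 100.03, "buy"), 2))
```

A persistently negative BSG is exactly the "measuring stick tightened" signature described above.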

3. FSC — Feature Semantic Change Rate

Share of decisions where the same named feature changes interpretation because the source tuple changed.

Practical example:

  spread_bps computed from the protected round-lot touch on Monday and from the odd-lot-aware visible touch on Tuesday counts as a semantic change, even though the column name and pipeline code are unchanged.

4. LCP — Label/Context Parity Rate

Fraction of training labels whose reference source matches the live feature source contract.

Low LCP means the model is learning against a target defined in another data universe.
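A minimal sketch of the LCP computation, assuming each training sample records the descriptor its label was built against (field names follow the contract example above; records are invented):

```python
# Components that define the reference object a label was graded against.
LABEL_KEYS = ("quoteSource", "sizeSemantics", "protectionSemantics")

def label_context_parity(samples: list[dict], live_contract: dict) -> float:
    """Fraction of samples whose label-reference source matches the live contract."""
    if not samples:
        return 0.0
    ok = sum(
        all(s["label_source"][k] == live_contract[k] for k in LABEL_KEYS)
        for s in samples
    )
    return ok / len(samples)
```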

5. RRG — Reachable-Reference Gap

Difference between:

  - the reference price used for measurement (displayed or visible best), and
  - the best price the router could actually reach (protected / actionable best).

This is especially useful when visible, protected, and actionable price sets diverge.

6. SSS — Source-Stable Sharpe / Source-Stable Slippage

Measure performance only on cohorts where source tuple parity is stable.

This helps separate:

  - true model or strategy drift, from
  - apparent drift created by source transitions.

7. VNC — Version-Normalization Churn

Rate at which parser/vendor/normalization versions change in the production pipeline.

Many teams track model version obsessively and data-normalization version barely at all. That is backwards.
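VNC is cheap to compute from a per-decision log of normalization tags. A sketch, with version strings invented:

```python
# VNC sketch: fraction of adjacent decisions whose normalization version changed.
def version_churn(versions: list[str]) -> float:
    if len(versions) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(versions, versions[1:]))
    return changes / (len(versions) - 1)

# Two transitions across four adjacent pairs -> churn of 0.5.
print(version_churn(["mdi-v3.1", "mdi-v3.1", "mdi-v3.2", "mdi-v3.2", "mdi-v3.3"]))
```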


Concrete failure examples

Example A: odd-lot rollout makes TCA harsher overnight

Before rollout:

  - TCA benchmarks arrival against the protected round-lot best,
  - measured slippage sits near its historical baseline.

After rollout:

  - TCA benchmarks against the visible best, including odd lots priced inside,
  - measured slippage deteriorates across the board overnight.

What changed?

Not necessarily routing quality.

The benchmark source changed.

Example B: model trained on direct-feed queue features, served on SIP-plus-derived queue proxies

Offline:

  - queue-position features come from direct-feed order-by-order data,
  - fill-probability and cost estimates validate well.

Online:

  - the same feature names are populated from SIP-plus-derived proxies,
  - fill probabilities are miscalibrated and realized costs exceed estimates.

What failed?

Not “markets got harder.”

Σ_train != Σ_live.

Example C: vendor parser change appears as regime drift

A vendor upgrades odd-lot support and round-lot-tier handling.

Suddenly:

  - spreads look tighter and touch sizes look smaller,
  - depth and imbalance features shift level without any market event.

A naive dashboard says “regime change.”

A source-aware dashboard says “normalization version changed on Tuesday.”


Modeling blueprint

A robust slippage model should condition on both market state and source state.

Think of expected cost as:

[ E[C \mid x_t, \Sigma_t] = f(x_t, \Sigma_t) ]

where:

  - x_t is the market-state feature vector, and
  - \Sigma_t is the source tuple at decision time.

There are three practical ways to implement this.

Path 1: hard parity enforcement

Only train / backtest / compare when source tuples match within strict tolerance.

Best when:

  - you control the full data stack and source transitions are rare,
  - clean comparability matters more than sample size.

Path 2: source-conditioned model

Add source-state indicators directly into the model:

  - categorical features for quote, size, and protection semantics,
  - the normalization version as a regime indicator.

Best when:

  - source transitions are frequent enough to learn from,
  - labeled history spans multiple source regimes.

Path 3: source-specific experts with gated blending

Train specialized experts for source regimes:

  - one expert per stable source tuple (e.g. a direct-feed regime and a SIP regime),
  - or a shared backbone with source-specific heads where data is thin.

Then gate by detected source tuple.
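The gating can be sketched in a few lines. The experts below are trivial stand-ins for real slippage models, and the multipliers are invented; the point is the hard gate on the detected source regime, failing closed on an unknown tuple:

```python
# Hypothetical per-regime experts; coefficients are illustrative only.
def expert_direct(features: dict) -> float:
    # e.g. a model trained on direct-feed queue features
    return 1.2 * features["spread_bps"]

def expert_sip(features: dict) -> float:
    # e.g. a model trained on SIP-derived proxies (wider uncertainty)
    return 1.6 * features["spread_bps"]

EXPERTS = {"direct": expert_direct, "sip": expert_sip}

def predict_cost(features: dict, quote_source: str) -> float:
    """Gate by detected source regime; fail closed on an unknown tuple."""
    expert = EXPERTS.get(quote_source)
    if expert is None:
        raise ValueError(f"no expert for source regime: {quote_source}")
    return expert(features)
```

Failing closed matters: an unrecognized source tuple should trigger review, not a silent fallback to whichever expert happens to be default.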

Best when:

  - regimes are structurally different enough that one model underfits both,
  - the live source tuple can be detected reliably at decision time.


A useful decomposition for slippage attribution

When source parity is imperfect, decompose realized cost as:

[ C_{realized} = C_{market} + C_{execution} + C_{source} ]

Where:

  - C_market — cost from market conditions (volatility, liquidity, impact),
  - C_execution — cost from the strategy's own routing and urgency choices,
  - C_source — cost and measurement distortion from source mismatch.

C_source is not always direct economic loss. Sometimes it is a diagnostic distortion that causes:

  - fills to be graded against unreachable references,
  - models to be retrained on mislabeled outcomes,
  - rollbacks to be triggered by phantom regressions.

In practice, teams should estimate both:

  1. economic source cost — real fill degradation from source mismatch,
  2. measurement source cost — benchmark / label distortion caused by source mismatch.

State machine for production controls

1. PARITY_STABLE

Conditions:

  - SPD below threshold across train / replay / TCA / live,
  - no recent benchmark or normalization version change.

Action:

  - operate normally; cross-environment comparisons can be trusted.

2. PARITY_WARNING

Conditions:

  - SPD elevated, or a version change detected without a reviewed source diff.

Action:

  - segment all metrics by source cohort before drawing conclusions,
  - freeze automatic retraining triggers.

3. PARITY_BROKEN

Conditions:

  - benchmark, size, or protection semantics differ between environments.

Action:

  - suspend cross-period performance attribution,
  - fall back to conservative execution defaults,
  - require a source-diff review before resuming.

4. SAFE_SHADOW

Conditions:

  - a new source, vendor, or benchmark is under evaluation.

Action:

  - run old and new sources in parallel and compare parity metrics,
  - promote only after SPD and BSG stabilize.

Use hysteresis. It should be easier to enter warning modes than to leave them.
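A minimal hysteresis sketch: one bad SPD reading enters the warning state, but leaving it requires several consecutive clean readings. Thresholds and streak length are illustrative, and broken-contract recovery is simplified to the same path:

```python
ENTER_WARNING, EXIT_WARNING = 0.2, 0.1  # illustrative SPD thresholds
CLEAN_READINGS_TO_RECOVER = 3

class ParityMonitor:
    def __init__(self):
        self.state = "PARITY_STABLE"
        self.clean_streak = 0

    def update(self, spd: float, contract_broken: bool) -> str:
        if contract_broken:
            # Semantic breaks (benchmark/size/protection) trump the score.
            self.state, self.clean_streak = "PARITY_BROKEN", 0
        elif spd >= ENTER_WARNING and self.state == "PARITY_STABLE":
            # Entering warning is easy: one bad reading.
            self.state, self.clean_streak = "PARITY_WARNING", 0
        elif spd < EXIT_WARNING and self.state in ("PARITY_WARNING", "PARITY_BROKEN"):
            # Leaving is hard: require a streak of clean readings.
            self.clean_streak += 1
            if self.clean_streak >= CLEAN_READINGS_TO_RECOVER:
                self.state, self.clean_streak = "PARITY_STABLE", 0
        else:
            self.clean_streak = 0
        return self.state
```

Note the middle band between the two thresholds: it resets the recovery streak, so a noisy SPD hovering near the line keeps the system in warning rather than flapping.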


Controls that actually help

Control 1: log the source tuple on every decision

If the source is not in the event log, postmortems will guess.

Guesses are where fake regime shifts are born.

Control 2: separate benchmark source from router source

Never assume the TCA benchmark is the same as the price object routing used.

Persist both explicitly.

Control 3: pin normalization versions in backtests

Do not let historical replays silently reparse old periods with today’s semantics unless that is the explicit experiment.

“Reproducible backtest” means reproducible source semantics too.

Control 4: build parity-stable cohorts

For governance dashboards, always keep a source-stable cohort where:

  - quote source, size semantics, protection semantics,
  - benchmark definition, and normalization version

are held constant.

This becomes the clean baseline for true model drift.
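Building the cohort is a filter over logged decisions, assuming each decision carries the source descriptor from the contract above (field names and records are illustrative):

```python
# Components pinned for the parity-stable cohort; an assumed subset of the
# source-descriptor contract.
PINNED = ("quoteSource", "sizeSemantics", "protectionSemantics",
          "normalizationVersion")

def parity_stable_cohort(decisions: list[dict], reference: dict) -> list[dict]:
    """Keep only decisions whose descriptor matches the reference on PINNED keys."""
    return [d for d in decisions
            if all(d["source"][k] == reference[k] for k in PINNED)]
```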

Control 5: require source-diff review for data changes

A vendor/parser/benchmark change should get the same review seriousness as a model change.

At minimum, require:

  - a written source diff (which tuple components change, and when),
  - expected impact on benchmarks, features, and labels,
  - a shadow or parallel-run comparison plan before full cutover.

Control 6: shadow old and new sources during transitions

Before fully switching a benchmark or data vendor:

  - run both sources in parallel for a meaningful window,
  - quantify BSG and feature deltas between them,
  - archive the diff, then cut over.

Control 7: maintain source-aware feature dictionaries

Every important feature should declare:

  - the source-tuple component(s) it depends on,
  - its meaning under each supported source tuple,
  - the normalization version it was defined against.

What to retrain, and what not to retrain

When source parity breaks, the knee-jerk response is often:

retrain immediately on the new data.

Sometimes that is correct.

Sometimes it is the fastest way to bury the evidence.

Retrain immediately only if:

  - the new source tuple is stable and is the permanent go-forward state,
  - labels and benchmarks have been rebuilt under the new semantics,
  - the transition is tagged so old and new regimes stay separable.

Do not retrain blindly if:

  - the source change is temporary or still being debugged,
  - labels still reference the old source semantics,
  - the regression might be measurement distortion rather than real cost.

In those cases, first restore comparability or explicitly fork the regime.


Promotion gates for source-changing rollouts

Before promoting a source or benchmark change, require something like:

  - SPD between old and new environments quantified and reviewed,
  - BSG measured over a parallel-run window,
  - LCP re-verified for any model trained across the transition.

Rollback triggers:

  - BSG or SPD materially worse than parallel-run estimates,
  - online metrics and offline replays diverging beyond tolerance,
  - FSC spiking without an explained source diff.


Common mistakes

1. “Field names match, so the data matches.”

No. best_ask without source semantics is an underspecified object.

2. “If the model regressed after the rollout, the model is worse.”

Maybe. Or maybe the measuring stick changed.

3. “We can fix it by backfilling everything with the new parser.”

Sometimes that helps. Sometimes it destroys historical comparability unless the old periods are reinterpreted very carefully.

4. “TCA is independent ground truth.”

Only if benchmark semantics are stable and relevant to the router’s feasible action set.

5. “Vendor changes are ops details, not model risk.”

Wrong. In modern execution stacks, vendor/parser/normalization changes are often model-input contract changes.


A compact operating checklist

Before trusting a slippage comparison, ask:

  1. Did quote source change?
  2. Did odd-lot visibility change?
  3. Did protected vs visible benchmark semantics change?
  4. Did round-lot tier logic change?
  5. Did depth representation change?
  6. Did timestamp precedence change?
  7. Did parser / vendor / normalization version change?
  8. Did replay use the same source tuple as production?
  9. Did TCA benchmark the same economic object routing saw?
  10. Are we looking at a parity-stable cohort or a mixed cohort?

If you cannot answer those, you do not yet know whether the strategy changed.


Bottom line

A slippage model does not observe “the market” directly. It observes a representation of the market.

When that representation changes, the model’s world changes.

That means source-parity drift should be treated like any other serious production risk:

  - logged on every decision,
  - measured with explicit metrics,
  - gated in rollouts,
  - owned by someone accountable for the data contract.

Otherwise teams end up retraining on phantom drift, grading execution with moving benchmarks, and congratulating or punishing the model for changes that actually came from the data pipe.

The practical rule is simple:

Before asking whether the model is wrong, ask whether the market-data object stayed the same.

