Manual-Intervention Contamination & Counterfactual-Label Leakage Slippage Playbook

2026-04-07 · finance

Why this matters

A production execution stack is almost never fully autonomous.

When things get weird, desks intervene: they pause orders, override urgency, relax protection limits, change venue sets, or cross the residual by hand.

Operationally, that is often the right move.

Modeling-wise, it creates a nasty trap:

  1. the base policy gets you into trouble,
  2. a human or supervisory layer rescues the order,
  3. the final slippage gets stored as if it were the base policy’s own outcome,
  4. and the training / TCA stack learns from a reality the policy never actually produced.

That contaminates both research and operations: research trains on outcomes the base policy never produced, and operations misjudges how much desk support the system actually needs.

This is not a minor reporting annoyance. It is a direct path to hidden slippage because policy quality, intervention quality, and crisis-case selection all get mixed into one mislabeled outcome.


Failure mode in one line

Observed slippage is treated as if it came from the autonomous policy, even though a human or supervisory intervention changed the action path mid-flight, so labels become counterfactually invalid and the stack learns the wrong cost surface.


What counts as an intervention

For this playbook, an intervention is any action that materially changes the execution path after the base policy has already started operating.

Typical examples: a desk pause, an urgency override, a limit or protection relaxation, a venue-set change, a manual residual cross, a hedge-first rescue, or a schedule restart.

The key idea is not “human vs machine.”

A fully automated supervisory controller can contaminate labels too if it activates only on stressed cases and its actions are not explicitly modeled.


Why the data becomes toxic

When an intervention occurs, the final outcome is no longer a clean sample from the base policy.

Three things happen at once:

1) Action-path substitution

The realized execution after intervention is generated by a different policy than the one you thought you were measuring.

2) Selection on hard cases

Interventions rarely happen at random. They cluster exactly when queue progress stalls, a venue degrades, adverse selection rises, or the parent falls behind schedule.

So intervened orders are systematically harder than average.

3) Counterfactual unobservability

Once the desk intervenes, you no longer observe what the base policy would have done from that point onward.

That means the recorded final slippage is not a valid label for either the base policy alone or the intervention policy alone.

It is a mixed-path outcome.


Observable signatures

1) Great realized outcomes on orders that were clearly in trouble

What really happened: the rescue policy handled it acceptably.

2) Policy regressions that only appear after reducing desk babysitting

That usually means the training set contained intervention-assisted labels.

3) TCA explains away rescue behavior as model robustness

4) Tail slippage seems artificially compressed

5) Model features “predict” success by proxying intervention likelihood

Examples: features that proxy staffed hours, desk attention, or order visibility, which correlate with rescue likelihood rather than with policy skill.

6) Evaluation changes when after-hours or unattended sessions are isolated


Mechanical path to hidden slippage

Step 1) The base policy enters a degraded state

Maybe queue progress stalls, a venue goes weird, adverse selection rises, or the parent falls behind schedule.

Step 2) An operator or supervisory layer intervenes

Examples: pausing the order, flipping to aggressive catch-up, relaxing limits, changing the venue set, or crossing the residual manually.

Step 3) The execution path changes regime

The order is no longer following the base policy’s native control law.

Step 4) Final outcome is recorded as a single realized number

Implementation shortfall, markout, completion time, and residual path are stored without intervention-aware decomposition.

Step 5) Research consumes the mixed-path label

The training / evaluation stack implicitly assumes that the logged actions came from the base policy and that the recorded outcome is a fair sample of its cost distribution.

Neither assumption holds on intervened paths.

Step 6) The system learns the wrong lesson

Typical wrong lessons: distressed states are cheaper than they really are, passive behavior under stress is safe, and the tail of the cost distribution is thin.

Step 7) Live deployment pays the tax

When the same regime appears in unattended or reduced-oversight conditions, no rescue arrives and the tail cost the labels never showed finally prints.

It did not suddenly regress. It was never as good as the contaminated labels implied.


Core model

Let:

  I_t = 1 if an intervention is active at time t, else 0
  A_base,t = the action the base policy would take at time t
  A_int,t = the action the intervention path takes at time t
  A_exec,t = the action actually executed at time t

Then:

A_exec,t = (1 - I_t) * A_base,t + I_t * A_int,t

and the final realized outcome is:

Y_obs = Y(path(A_exec, market))

But the base-policy counterfactual you usually care about is:

Y_base = Y(path(A_base, market))

Once intervention occurs, Y_base is no longer observed.
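The mixing identity above can be made concrete in a few lines. A minimal sketch in Python, with all names hypothetical:

```python
def exec_actions(a_base, a_int, i_flag):
    """Apply A_exec,t = (1 - I_t) * A_base,t + I_t * A_int,t per step.

    a_base: base-policy child quantities, a_int: intervention-path
    quantities, i_flag: 0/1 intervention indicator per step.
    """
    return [(1 - i) * b + i * r for b, r, i in zip(a_base, a_int, i_flag)]

# Once the flag flips, the realized path is the rescue path:
mixed = exec_actions([10, 10, 10], [0, 25, 25], [0, 1, 1])  # [10, 25, 25]
```

From the intervention step onward, `mixed` carries the rescue actions, which is exactly why the realized outcome stops being a sample from the base policy.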

A practical decomposition is:

IS_obs = IS_pre + I_parent * (IS_rescue + IS_cleanup) + (1 - I_parent) * IS_base_post

where:

The modeling trap is using IS_obs as a supervised label for A_base.

That creates two biases:

Label leakage bias

The label includes consequences of actions not chosen by the base policy.

Selection bias

Interventions occur on non-random, harder cases:

P(I = 1 | X_t, H_t), the probability of intervention given the current feature state X_t and order history H_t, is strongly state-dependent.

So even if you tag intervened orders, naïve comparison between intervened and clean samples is still biased.


Slippage decomposition you actually want

A more useful parent-level decomposition is:

IS_total = IS_base_clean + IS_pre_distress + IS_intervention + IS_post_intervention_cleanup

with four questions:

  1. How much did the base policy cost on non-intervened paths?
  2. How much distress accumulated before intervention fired?
  3. How much did the rescue itself cost or save?
  4. How much extra cleanup happened because the path was broken mid-flight?

If you cannot answer all four, then “policy performance” is still mixed with desk support.
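One way to operationalize the four-way split, assuming each child fill is tagged with a phase label and a signed cost contribution (the tagging schema is hypothetical):

```python
from collections import defaultdict

PHASES = ("base_clean", "pre_distress", "intervention", "cleanup")

def decompose_is(tagged_fills):
    """Sum per-phase cost contributions into
    IS_total = IS_base_clean + IS_pre_distress + IS_intervention + IS_cleanup."""
    parts = defaultdict(float)
    for phase, cost_bps in tagged_fills:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        parts[phase] += cost_bps
    return dict(parts), sum(parts.values())

fills = [("base_clean", 1.2), ("pre_distress", 3.0),
         ("intervention", 4.5), ("cleanup", 0.8)]
parts, total = decompose_is(fills)
```

Answering the four questions then means reading `parts`, not reverse-engineering one blended IS number.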


Intervention taxonomy

A) Pause / hold intervention

The order is temporarily stopped while the market keeps moving.

Risk: opportunity cost, schedule compression, re-entry toxicity.

B) Urgency override

The desk flips from passive logic to aggressive catch-up.

Risk: the final fill quality gets falsely credited to the original policy, while the spread-cross tax is hidden in aggregate averages.

C) Limit / protection override

A trader widens the executable band or relaxes protection logic.

Risk: completion looks better, but only because a human accepted a larger cost envelope than the policy itself would have allowed.

D) Venue-set intervention

The supervisor disables or enables venues mid-order.

Risk: venue-performance labels become contaminated because routing outcomes now reflect manual venue curation.

E) Manual residual cleanup

The tail of the order is crossed manually or through a rescue workflow.

Risk: the model learns that severe underfill states are less expensive than they really are under autonomous operation.

F) Hedge-first rescue

Inventory hedge or risk flattening happens before execution is complete.

Risk: base execution slippage and overlay risk management are conflated.

G) Restart / reseed intervention

The existing child schedule is canceled and a fresh schedule begins from a new state.

Risk: pre- and post-restart actions get treated as one continuous policy trajectory.


Feature set worth logging

If this data is not logged, you cannot repair it later.

Intervention identity

Pre-intervention distress state

Path-change features

Label-integrity features
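If nothing else, log one structured record per intervention event. A hypothetical minimal schema as a dataclass (every field name here is an assumption, not an established convention):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterventionRecord:
    parent_id: str
    ts: float                   # intervention timestamp (epoch seconds)
    actor: str                  # e.g. "trader" or "supervisor_algo"
    kind: str                   # taxonomy code: "pause", "urgency_override", ...
    reason_code: str            # why the intervention fired
    pre_fill_ratio: float       # fraction of parent filled before intervention
    pre_is_bps: float           # shortfall accrued before intervention
    post_notional: float = 0.0  # notional executed after intervention
    notes: Optional[str] = None

rec = InterventionRecord("P123", 1712500000.0, "trader", "urgency_override",
                         "behind_schedule", 0.4, 12.5)
```

With records like this in hand, the label-integrity metrics in the next section become simple aggregations instead of forensic reconstructions.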


Metrics that surface the problem

1) Intervention Frequency Rate (IFR)

IFR = intervened_parents / total_parents

Track by:

2) Intervened Notional Share (INS)

INS = intervened_notional / total_notional

A low count rate with high notional share means the desk is rescuing the biggest or hardest orders.

3) Post-Intervention Notional Share (PINS)

PINS = notional_executed_after_first_intervention / total_parent_notional

High PINS means the final outcome is mostly generated by the rescue path, not the base path.

4) Label Purity Score (LPS)

A practical version:

LPS = 1 - PINS

You can make it fancier, but the intuition is simple: once most of the order is filled after intervention, the label is not clean.

5) Rescue Contribution Ratio (RCR)

RCR = completion_after_intervention / total_completion

If RCR is high on “good outcomes,” you are probably over-crediting the autonomous policy.

6) Distress-to-Intervention Lag (DIL)

Time from entering a degraded state to first intervention.

Why it matters: a short lag means the desk rescues orders before the base policy's degraded behavior is ever observed; a long lag means more of the pre-intervention cost is genuinely the policy's own.

7) Intervention-Free Tail Cost (IFTC)

Compute p95 / p99 cost only on clean, non-intervened paths.

This is one of the closest things you have to base-policy tail truth.

8) Staffed-vs-Unstaffed Slippage Gap (SUSG)

Compare similar orders across staffed sessions with full desk coverage and unattended sessions (after-hours, reduced coverage).

If staffed performance is much better without any model change, desk support is part of the real policy.
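The ratio metrics above fall out of a small per-parent summary table. A sketch assuming each parent row carries its notional, an intervened flag, and post-intervention notional (field names hypothetical):

```python
def label_metrics(parents):
    """Compute IFR, INS, PINS, and LPS = 1 - PINS from per-parent rows.

    parents: list of dicts with keys notional, intervened (bool),
    post_int_notional (notional executed after first intervention).
    """
    n = len(parents)
    tot_notional = sum(p["notional"] for p in parents)
    intervened = [p for p in parents if p["intervened"]]
    ifr = len(intervened) / n
    ins = sum(p["notional"] for p in intervened) / tot_notional
    pins = sum(p["post_int_notional"] for p in parents) / tot_notional
    return {"IFR": ifr, "INS": ins, "PINS": pins, "LPS": 1 - pins}

parents = [
    {"notional": 100.0, "intervened": False, "post_int_notional": 0.0},
    {"notional": 300.0, "intervened": True,  "post_int_notional": 240.0},
]
m = label_metrics(parents)  # IFR 0.5, INS 0.75, PINS 0.6, LPS 0.4
```

Note how this toy example already shows the low-count / high-notional pattern: half the parents but three quarters of the notional are intervened.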


Modeling approaches that do not lie as much

Approach 1) Hard exclusion for primary label set

For the base slippage model, exclude parents once intervention materially changes the path.

Use when intervention volume is low enough that discarding intervened parents still leaves a usable sample.

Good: high label purity.

Bad: selection bias remains if you then forget that excluded cases were exactly the hard ones.
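A sketch of the hard-exclusion rule, using post-intervention notional share as the materiality test (the 10% threshold is an arbitrary illustration, not a recommendation):

```python
def split_for_training(parents, pins_threshold=0.10):
    """Keep a parent in the primary label set only if little of its
    notional was executed after the first intervention."""
    clean, excluded = [], []
    for p in parents:
        pins = p["post_int_notional"] / p["notional"]
        (clean if pins <= pins_threshold else excluded).append(p)
    return clean, excluded

parents = [
    {"id": "A", "notional": 100.0, "post_int_notional": 0.0},
    {"id": "B", "notional": 100.0, "post_int_notional": 60.0},
]
clean, excluded = split_for_training(parents)  # A is clean, B is excluded
```

Keep `excluded` around with its tags; silently dropping it and calling the remainder unbiased is exactly anti-pattern C below.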

Approach 2) Two-stage model: intervention hazard + clean-path slippage

Model:

  1. P(I = 1 | X) — hazard of intervention
  2. E[IS | X, I = 0] — clean-path cost given no intervention

Use this to answer what the policy costs when left alone, and how likely each state is to trigger a rescue.
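Stage 1 can start embarrassingly simple. The sketch below estimates P(I = 1 | X) by binning a single distress score; a real hazard model would use a proper classifier, and everything here is illustrative:

```python
from collections import defaultdict

def binned_hazard(samples, n_bins=4, lo=0.0, hi=1.0):
    """Empirical intervention probability per distress-score bin.

    samples: iterable of (distress_score, intervened) pairs,
    with intervened in {0, 1}.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [intervened, total]
    width = (hi - lo) / n_bins
    for x, i in samples:
        b = min(int((x - lo) / width), n_bins - 1)
        counts[b][0] += int(i)
        counts[b][1] += 1
    return {b: k / n for b, (k, n) in counts.items()}

samples = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1), (0.85, 0)]
haz = binned_hazard(samples)  # hazard near zero in calm bins, high in stressed ones
```

Even this crude version makes the selection effect visible: the hazard concentrates in exactly the bins where clean-path labels are scarcest.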

Approach 3) Mixture model with intervention branch

Estimate separate branches: a clean-path cost branch, an intervened-path cost branch, and the probability of landing in each.

Then route decisions can optimize not just expected slippage, but expected dependence on human rescue.

Approach 4) Shadow-policy reconstruction after intervention boundary

After the intervention time, continue simulating what the base policy would have done in shadow mode.

This is imperfect but valuable.

It gives a rough proxy for the unobserved Y_base and, by difference, for the cost or saving attributable to the intervention itself.

Approach 5) Staffed-support as part of the true production policy

Sometimes the honest answer is:

the production policy is not autonomous; it is a human-in-the-loop system.

If so, treat operator availability and intervention logic as part of the policy contract. Do not report autonomous performance as if that were the real operating mode.


Practical state machine

Use a simple state model:

GREEN: base policy operating autonomously; labels are clean.

WATCH: distress indicators elevated; monitoring tightens but the path is unchanged.

HUMAN_ASSIST: an operator adjusts parameters within the policy's envelope; labels are flagged as touched.

RESCUE_LOCK: the intervention owns the order; all subsequent fills are rescue-path fills.

RECOVERY: control is handed back to the base policy from a reseeded state.

SAFE_REVIEW: the order is complete; post-mortem and label curation happen before the data enters training.
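One plausible encoding of the state machine as a transition table; the allowed edges below are an assumption about how the states connect, not a spec from this playbook:

```python
TRANSITIONS = {
    "GREEN": {"WATCH"},
    "WATCH": {"GREEN", "HUMAN_ASSIST"},
    "HUMAN_ASSIST": {"WATCH", "RESCUE_LOCK"},
    "RESCUE_LOCK": {"RECOVERY"},
    "RECOVERY": {"GREEN", "SAFE_REVIEW"},
    "SAFE_REVIEW": set(),  # terminal: review before data enters training
}

def step(state, target):
    """Advance only along an allowed edge; otherwise stay put and flag it."""
    ok = target in TRANSITIONS[state]
    return (target if ok else state), ok

state, ok = step("GREEN", "WATCH")  # ("WATCH", True)
```

The point of making transitions explicit is that every entry into RESCUE_LOCK is also a label-integrity event you can log and later partition on.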


TCA questions to ask on every intervened order

  1. Would the base policy have completed on time without intervention?
  2. How much cost had already accrued before intervention fired?
  3. How much notional was executed after intervention?
  4. Did the intervention genuinely reduce cost, or just preserve completion?
  5. Did it create cleanup cost later?
  6. Should this order be excluded, downweighted, or branched in training?
  7. If this happened unattended, what would the tail have looked like?

Anti-patterns

A) Counting all final fills as policy-generated fills

This is the canonical contamination bug.

B) Using intervention reason codes only for dashboards, not for labels

If the reason code is not part of dataset curation, it is just decorative telemetry.

C) Excluding intervened orders and calling the remainder “unbiased”

That only hides the hard cases. It does not make them random.

D) Letting fallback-policy actions masquerade as base-policy resilience

Automated rescue logic can contaminate labels just as much as a human can.

E) Comparing staffed and unstaffed performance without intervention coverage metrics

That comparison is meaningless unless support intensity is measured.

F) Rewarding a policy for states that mostly trigger rescue

If severe deficit states end well only because the desk steps in, the policy should not learn that those states are cheap.


A practical rollout plan

Phase 1) Instrumentation first

Before fancy modeling, capture the intervention identity, the pre-intervention distress state, the path-change features, and the label-integrity features listed above, all timestamped.

Phase 2) Label partitioning

Create three buckets: clean (no intervention), lightly touched (intervention did not materially change the path), and intervention-dominated (the rescue path drives the outcome).

Do not train them together by default.

Phase 3) Hazard modeling

Build an intervention-hazard model.

Useful outcome: a per-state estimate of how often the base policy would need rescue if left alone, which is itself a risk metric.

Phase 4) Branch-aware TCA

Report clean-path cost, pre-intervention distress cost, rescue cost, and cleanup cost as separate lines, never as one blended number.

Phase 5) Policy redesign

If certain states repeatedly require rescue, either harden the base policy so it can handle them autonomously, or formalize the rescue logic as an explicit, modeled part of the production policy.


What good looks like

A mature stack can say all of the following without handwaving: what the base policy costs on clean paths, how much distress accumulates before interventions fire, what the rescues themselves cost or save, and what the tail looks like unattended.

Until you can say that, your labels are blending policy quality with operator heroics.


Bottom line

Intervened execution outcomes are not clean labels for the base policy.

If you do not separate base-policy behavior, intervention behavior, and the selection process that decides when interventions fire,

then the slippage model learns a fantasy in which hard cases magically end better than the autonomous policy could have achieved alone.

That fantasy is expensive.

It shows up later as unattended-session regressions, tails that decompress the moment oversight thins, and live costs the contaminated backtests never predicted.

Human rescue is operationally useful.

But if you want truthful slippage modeling, you must log it, branch it, and stop calling it autonomous performance.