Manual-Intervention Contamination & Counterfactual-Label Leakage Slippage Playbook
Why this matters
A production execution stack is almost never fully autonomous.
When things get weird, desks intervene:
- pause the algo,
- widen a limit,
- flip urgency,
- disable a venue,
- manually cross residual,
- force a hedge,
- or restart a child-order schedule with fresh constraints.
Operationally, that is often the right move.
Modeling-wise, it creates a nasty trap:
- the base policy gets you into trouble,
- a human or supervisory layer rescues the order,
- the final slippage gets stored as if it were the base policy’s own outcome,
- and the training / TCA stack learns from a reality the policy never actually produced.
That contaminates both research and operations:
- offline evaluation becomes overoptimistic or perversely pessimistic,
- tactic comparisons become selection-biased,
- live guardrails get calibrated on rescue-adjusted data,
- and the desk stops knowing whether the model is good or merely recoverable by human babysitting.
This is not a minor reporting annoyance. It is a direct path to hidden slippage because policy quality, intervention quality, and crisis-case selection all get mixed into one mislabeled outcome.
Failure mode in one line
Observed slippage is treated as if it came from the autonomous policy, even though a human or supervisory intervention changed the action path mid-flight, so labels become counterfactually invalid and the stack learns the wrong cost surface.
What counts as an intervention
For this playbook, an intervention is any action that materially changes the execution path after the base policy has already started operating.
Typical examples:
- Pause / resume of the parent or venue subset
- Urgency override (e.g. passive VWAP -> aggressive catch-up)
- Limit override (wider cap, looser protection bands)
- Venue mask changes (disable darks, force lit-only, remove a flaky broker)
- Manual child-order entry or phone-assisted execution
- Residual liquidation by a separate rescue tactic
- Temporary participation-cap increase
- Manual hedge-first action before completion is finalized
- Supervisor-imposed kill / thaw / restart
- Fallback-policy activation that is not part of the base policy’s normal decision rule
The key idea is not “human vs machine.”
A fully automated supervisory controller can contaminate labels too if it activates only on stressed cases and its actions are not explicitly modeled.
Why the data becomes toxic
When an intervention occurs, the final outcome is no longer a clean sample from the base policy.
Three things happen at once:
1) Action-path substitution
The realized execution after intervention is generated by a different policy than the one you thought you were measuring.
2) Selection on hard cases
Interventions rarely happen at random. They cluster exactly when:
- fills are lagging,
- toxicity rises,
- venues misbehave,
- deadlines get close,
- or state integrity looks suspicious.
So intervened orders are systematically harder than average.
3) Counterfactual unobservability
Once the desk intervenes, you no longer observe what the base policy would have done from that point onward.
That means the recorded final slippage is not a valid label for either:
- the base policy alone, or
- the intervention policy alone.
It is a mixed-path outcome.
Observable signatures
1) Great realized outcomes on orders that were clearly in trouble
- Underfill deficit spikes
- Deadline slack collapses
- Manual or supervisory override fires
- Final implementation shortfall looks “not too bad”
- Model review concludes the base policy handled the case acceptably
What really happened: the rescue policy handled it acceptably.
2) Policy regressions that only appear after reducing desk babysitting
- Offline backtests look stable
- Shadow/live rollout looks worse
- Same code, fewer interventions, worse tail outcomes
That usually means the training set contained intervention-assisted labels.
3) TCA explains away rescue behavior as model robustness
- Orders with mid-flight urgency overrides still count toward passive-policy statistics
- Venue blacklists entered by ops are credited to the router
- Manual block cleanup is scored as if the schedule naturally completed
4) Tail slippage seems artificially compressed
- p50 may look normal
- p95/p99 look suspiciously better than raw event-path quality suggests
- The gap coincides with active desk hours or known supervisory windows
5) Model features “predict” success by proxying intervention likelihood
Examples:
- a severe deficit signal becomes positively associated with eventual completion,
- because severe deficits trigger human rescue more often.
6) Evaluation changes when after-hours or unattended sessions are isolated
- During staffed sessions, slippage appears better
- During unattended periods, same policy appears materially weaker
- Intervention coverage, not policy quality, explains the gap
Mechanical path to hidden slippage
Step 1) The base policy enters a degraded state
Maybe queue progress stalls, a venue goes weird, adverse selection rises, or the parent falls behind schedule.
Step 2) An operator or supervisory layer intervenes
Examples:
- increases aggression,
- disables a venue,
- widens limits,
- or manually crosses the residual.
Step 3) The execution path changes regime
The order is no longer following the base policy’s native control law.
Step 4) Final outcome is recorded as a single realized number
Implementation shortfall, markout, completion time, and residual path are stored without intervention-aware decomposition.
Step 5) Research consumes the mixed-path label
The training / evaluation stack implicitly assumes:
- the pre-intervention decisions and post-intervention outcomes belong to one stationary policy.
They do not.
Step 6) The system learns the wrong lesson
Typical wrong lessons:
- “late deficits are recoverable without much extra cost,”
- “this venue set is fine under stress,”
- “passive patience works near the deadline,”
- “this feature regime has lower tail risk than it really does.”
Step 7) Live deployment pays the tax
When the same regime appears in unattended or reduced-oversight conditions:
- rescue does not arrive,
- tails open up,
- residual cleanup gets more expensive,
- and the policy looks like it suddenly regressed.
It did not suddenly regress. It was never as good as the contaminated labels implied.
Core model
Let:
- X_t: state available to the base policy at time t
- A_base,t: action the base policy would choose
- I_t in {0,1}: intervention indicator at time t
- A_int,t: intervention action when intervention fires
- A_exec,t: actually executed action
- Y: realized final slippage / outcome
- T_int: intervention time, if any
- H_t: latent stress not perfectly captured in features
Then:
A_exec,t = (1 - I_t) * A_base,t + I_t * A_int,t
and the final realized outcome is:
Y_obs = Y(path(A_exec, market))
But the base-policy counterfactual you usually care about is:
Y_base = Y(path(A_base, market))
Once intervention occurs, Y_base is no longer observed.
A practical decomposition is:
IS_obs = IS_pre + I_parent * (IS_rescue + IS_cleanup) + (1 - I_parent) * IS_base_post
where:
- IS_pre: slippage accumulated before intervention,
- IS_rescue: slippage generated by the rescue path,
- IS_cleanup: unwind / catch-up / over-hedge cost caused by the intervention regime,
- IS_base_post: post-decision cost if no intervention occurred.
The modeling trap is using IS_obs as a supervised label for A_base.
That creates two biases:
Label leakage bias
The label includes consequences of actions not chosen by the base policy.
Selection bias
Interventions occur on non-random, harder cases:
P(I = 1 | X_t, H_t) is strongly state-dependent.
So even if you tag intervened orders, naïve comparison between intervened and clean samples is still biased.
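The selection effect is easy to demonstrate directly. Below is a minimal numpy sketch, assuming a toy one-dimensional latent stress H and a logistic intervention hazard (both invented for illustration): because intervention probability rises with stress, the intervened subset is harder than the clean subset by construction, so any naive intervened-vs-clean comparison is biased before modeling even begins.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent stress H drives both intervention hazard and true base-policy cost.
H = rng.normal(0.0, 1.0, n)

# State-dependent intervention: P(I=1 | H) rises with stress (toy logistic hazard).
p_int = 1.0 / (1.0 + np.exp(-(2.0 * H - 1.0)))
I = rng.random(n) < p_int

# Intervened orders are systematically more stressed than clean ones,
# so the two groups are not exchangeable.
mean_stress_intervened = H[I].mean()
mean_stress_clean = H[~I].mean()
assert mean_stress_intervened > mean_stress_clean
```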
Slippage decomposition you actually want
A more useful parent-level decomposition is:
IS_total = IS_base_clean + IS_pre_distress + IS_intervention + IS_post_intervention_cleanup
with four questions:
- How much did the base policy cost on non-intervened paths?
- How much distress accumulated before intervention fired?
- How much did the rescue itself cost or save?
- How much extra cleanup happened because the path was broken mid-flight?
If you cannot answer all four, then “policy performance” is still mixed with desk support.
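As a starting point, the decomposition can be approximated from fill-level data once intervention windows are logged. The sketch below is a simplifying assumption, not a definitive attribution rule: it splits per-fill cost contributions by time relative to the first intervention window, with all names (`decompose_parent_cost`, the tuple layout) hypothetical.

```python
def decompose_parent_cost(fills, t_int_start, t_int_end):
    """Split per-fill slippage contributions into pre / rescue / cleanup buckets.

    fills: list of (timestamp, cost_bps_contribution) tuples.
    t_int_start / t_int_end: first-intervention window; None start means
    no intervention occurred. Time-based attribution is an assumption.
    """
    buckets = {"pre": 0.0, "rescue": 0.0, "cleanup": 0.0}
    for ts, cost in fills:
        if t_int_start is None or ts < t_int_start:
            buckets["pre"] += cost
        elif t_int_end is not None and ts > t_int_end:
            buckets["cleanup"] += cost
        else:
            buckets["rescue"] += cost
    buckets["total"] = buckets["pre"] + buckets["rescue"] + buckets["cleanup"]
    return buckets
```

For example, fills at t=1, 5, 9 with an intervention window of [4, 8] attribute the first fill to pre, the second to rescue, and the third to cleanup.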
Intervention taxonomy
A) Pause / hold intervention
The order is temporarily stopped while the market keeps moving.
Risk: opportunity cost, schedule compression, re-entry toxicity.
B) Urgency override
The desk flips from passive logic to aggressive catch-up.
Risk: the final fill quality gets falsely credited to the original policy, while the spread-cross tax is hidden in aggregate averages.
C) Limit / protection override
A trader widens the executable band or relaxes protection logic.
Risk: completion looks better, but only because a human accepted a larger cost envelope than the policy itself would have allowed.
D) Venue-set intervention
The supervisor disables or enables venues mid-order.
Risk: venue-performance labels become contaminated because routing outcomes now reflect manual venue curation.
E) Manual residual cleanup
The tail of the order is crossed manually or through a rescue workflow.
Risk: the model learns that severe underfill states are less expensive than they really are under autonomous operation.
F) Hedge-first rescue
Inventory hedge or risk flattening happens before execution is complete.
Risk: base execution slippage and overlay risk management are conflated.
G) Restart / reseed intervention
The existing child schedule is canceled and a fresh schedule begins from a new state.
Risk: pre- and post-restart actions get treated as one continuous policy trajectory.
Feature set worth logging
If this data is not logged, you cannot repair it later.
Intervention identity
- intervention_flag
- intervention_type
- intervention_actor (human / supervisor / fallback-policy / ops)
- intervention_reason_code
- intervention_scope (parent / venue / broker / symbol / portfolio)
- intervention_start_ts
- intervention_end_ts
Pre-intervention distress state
- time_to_deadline_sec
- completion_deficit_pct
- residual_qty_pct
- queue_progress_gap
- markout_state
- toxicity_state
- venue_health_state
- risk_gate_state
- book_fragility_state
Path-change features
- urgency_before_after
- participation_cap_before_after
- venue_mask_before_after
- limit_band_before_after
- manual_cross_qty
- manual_hedge_qty
- child_schedule_reset_flag
Label-integrity features
- counterfactual_label_valid_flag
- label_contamination_window_ms
- post_intervention_notional_share
- pre_intervention_cost_bps
- post_intervention_cost_bps
- cleanup_cost_bps
- intervention_assisted_completion_flag
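One low-effort way to make the intervention-identity fields non-optional in the pipeline is a typed record. This is a sketch using the field names listed above; the types and defaults are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterventionRecord:
    """One intervention event, using the field names from the playbook.

    Types are illustrative assumptions; adapt to the local schema.
    """
    intervention_flag: bool
    intervention_type: str       # e.g. "urgency_override", "venue_mask"
    intervention_actor: str      # human / supervisor / fallback-policy / ops
    intervention_reason_code: str
    intervention_scope: str      # parent / venue / broker / symbol / portfolio
    intervention_start_ts: float
    intervention_end_ts: Optional[float] = None  # None while still active
```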
Metrics that surface the problem
1) Intervention Frequency Rate (IFR)
IFR = intervened_parents / total_parents
Track by:
- symbol,
- tactic,
- venue set,
- time of day,
- operator coverage window.
2) Intervened Notional Share (INS)
INS = intervened_notional / total_notional
A low count rate with high notional share means the desk is rescuing the biggest or hardest orders.
3) Post-Intervention Notional Share (PINS)
PINS = notional_executed_after_first_intervention / total_parent_notional
High PINS means the final outcome is mostly generated by the rescue path, not the base path.
4) Label Purity Score (LPS)
A practical version:
LPS = 1 - PINS
You can make it fancier, but the intuition is simple: once most of the order is filled after intervention, the label is not clean.
5) Rescue Contribution Ratio (RCR)
RCR = completion_after_intervention / total_completion
If RCR is high on “good outcomes,” you are probably over-crediting the autonomous policy.
6) Distress-to-Intervention Lag (DIL)
Time from entering a degraded state to first intervention.
Why it matters:
- short lag may indicate the base policy is brittle,
- long lag may mean operators arrive too late and pay larger rescue cost.
7) Intervention-Free Tail Cost (IFTC)
Compute p95 / p99 cost only on clean, non-intervened paths.
This is one of the closest things you have to base-policy tail truth.
8) Staffed-vs-Unstaffed Slippage Gap (SUSG)
Compare similar orders across:
- staffed periods,
- lightly monitored periods,
- unattended periods.
If staffed performance is much better without any model change, desk support is part of the real policy.
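The count-based and notional-based metrics above are cheap to compute once per-parent records exist. A minimal sketch, assuming hypothetical per-parent dict keys (`notional`, `intervened`, `notional_after_intervention`, `completion`, `completion_after_intervention`):

```python
def intervention_metrics(parents):
    """Compute IFR, INS, PINS, LPS, and RCR from per-parent records.

    Each parent is a dict; key names here are assumptions, not a
    standard schema. Completion is measured in filled quantity.
    """
    n = len(parents)
    total_notional = sum(p["notional"] for p in parents)
    intervened = [p for p in parents if p["intervened"]]

    ifr = len(intervened) / n
    ins = sum(p["notional"] for p in intervened) / total_notional
    pins = sum(p["notional_after_intervention"] for p in parents) / total_notional
    lps = 1.0 - pins
    total_completion = sum(p["completion"] for p in parents)
    rcr = sum(p["completion_after_intervention"] for p in parents) / total_completion
    return {"IFR": ifr, "INS": ins, "PINS": pins, "LPS": lps, "RCR": rcr}
```

Note the asymmetry this makes visible: a desk rescuing only its few largest orders yields a low IFR but a high INS, which is exactly the "rescuing the biggest or hardest orders" pattern described above.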
Modeling approaches that do not lie as much
Approach 1) Hard exclusion for primary label set
For the base slippage model, exclude parents once intervention materially changes the path.
Use when:
- you care about clean autonomous-policy estimation,
- intervention rate is not overwhelming,
- and you can afford smaller sample size.
Good: high label purity.
Bad: selection bias remains if you then forget that excluded cases were exactly the hard ones.
Approach 2) Two-stage model: intervention hazard + clean-path slippage
Model:
- P(I = 1 | X): hazard of intervention
- E[IS | X, I = 0]: clean-path cost given no intervention
Use this to answer:
- where is the policy likely to require rescue,
- and how expensive is the clean path where it survives on its own.
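To show the structure without committing to a learner, here is a toy two-stage estimator on a single binned feature: stage 1 is the empirical intervention hazard per bin, stage 2 is the mean clean-path cost per bin fitted only on non-intervened samples. This is a sketch of the idea, not a production model; in practice each stage would be a proper classifier / regressor.

```python
import numpy as np

def two_stage_fit(X, I, IS, bins):
    """Toy two-stage estimator on a 1-D feature.

    Stage 1: empirical hazard P(I=1 | X in bin).
    Stage 2: mean clean-path slippage E[IS | X in bin, I=0],
             estimated only on non-intervened samples.
    Returns (hazard_per_bin, clean_cost_per_bin); NaN for empty cells.
    """
    idx = np.digitize(X, bins)
    n_bins = len(bins) + 1
    hazard = np.full(n_bins, np.nan)
    clean_cost = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            hazard[b] = I[in_bin].mean()
        clean = in_bin & (I == 0)
        if clean.any():
            clean_cost[b] = IS[clean].mean()
    return hazard, clean_cost
```

A NaN in `clean_cost` where `hazard` is high is itself informative: in that state region the policy essentially never survives on its own, so no clean-path estimate exists there.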
Approach 3) Mixture model with intervention branch
Estimate separate branches:
- clean autonomous branch,
- intervention-assisted branch.
Then route decisions can optimize not just expected slippage, but expected dependence on human rescue.
Approach 4) Shadow-policy reconstruction after intervention boundary
After the intervention time, continue simulating what the base policy would have done in shadow mode.
This is imperfect but valuable.
It gives a rough proxy for:
- Y_base after contamination begins,
- and rescue lift vs rescue cost.
Approach 5) Staffed-support as part of the true production policy
Sometimes the honest answer is:
the production policy is not autonomous; it is a human-in-the-loop system.
If so, treat operator availability and intervention logic as part of the policy contract. Do not report autonomous performance as if that were the real operating mode.
Practical state machine
Use a simple state model:
GREEN
- no intervention,
- label remains clean.
WATCH
- distress signals rising,
- no path change yet,
- intervention hazard elevated.
HUMAN_ASSIST
- intervention occurred,
- label contamination begins,
- clean autonomous evaluation should stop or branch.
RESCUE_LOCK
- majority of remaining execution is now driven by rescue logic,
- final outcome must not be attributed to base policy.
RECOVERY
- intervention ended,
- but cleanup / hedge / residual effects remain.
SAFE_REVIEW
- parent closed,
- decompose cost into pre-distress, rescue, and cleanup components before feeding research.
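The states above can be encoded directly so that labeling logic and evaluation logic share one source of truth. The transition rule below is one illustrative sketch; in particular, the PINS > 0.5 cutoff for RESCUE_LOCK is an assumed threshold, not part of the playbook.

```python
from enum import Enum, auto

class LabelState(Enum):
    GREEN = auto()
    WATCH = auto()
    HUMAN_ASSIST = auto()
    RESCUE_LOCK = auto()
    RECOVERY = auto()
    SAFE_REVIEW = auto()

def next_state(state, distress_rising, intervened, pins,
               intervention_ended, parent_closed):
    """One illustrative transition rule; thresholds are assumptions."""
    if parent_closed:
        return LabelState.SAFE_REVIEW
    if intervened:
        # Majority of remaining execution driven by rescue logic.
        if pins > 0.5:
            return LabelState.RESCUE_LOCK
        return LabelState.HUMAN_ASSIST
    if intervention_ended:
        return LabelState.RECOVERY
    if distress_rising:
        return LabelState.WATCH
    return state
```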
TCA questions to ask on every intervened order
- Would the base policy have completed on time without intervention?
- How much cost had already accrued before intervention fired?
- How much notional was executed after intervention?
- Did the intervention genuinely reduce cost, or just preserve completion?
- Did it create cleanup cost later?
- Should this order be excluded, downweighted, or branched in training?
- If this happened unattended, what would the tail have looked like?
Anti-patterns
A) Counting all final fills as policy-generated fills
This is the canonical contamination bug.
B) Using intervention reason codes only for dashboards, not for labels
If the reason code is not part of dataset curation, it is just decorative telemetry.
C) Excluding intervened orders and calling the remainder “unbiased”
That only hides the hard cases. It does not make them random.
D) Letting fallback-policy actions masquerade as base-policy resilience
Automated rescue logic can contaminate labels just as much as a human can.
E) Comparing staffed and unstaffed performance without intervention coverage metrics
That comparison is meaningless unless support intensity is measured.
F) Rewarding a policy for states that mostly trigger rescue
If severe deficit states end well only because the desk steps in, the policy should not learn that those states are cheap.
A practical rollout plan
Phase 1) Instrumentation first
Before fancy modeling, capture:
- intervention start/end,
- actor,
- reason,
- scope,
- changed constraints,
- and post-intervention notional share.
Phase 2) Label partitioning
Create three buckets:
- clean_autonomous
- intervened_partial
- rescue_dominated
Do not train them together by default.
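A minimal bucketing rule, assuming PINS is available per parent; the 0.5 cutoff separating intervened_partial from rescue_dominated is an assumed default that a desk should calibrate, not a prescribed value.

```python
def label_bucket(intervened, pins, rescue_threshold=0.5):
    """Assign the three curation buckets from the intervention flag
    and post-intervention notional share (PINS).

    rescue_threshold is an assumed default, not a standard cutoff.
    """
    if not intervened:
        return "clean_autonomous"
    if pins >= rescue_threshold:
        return "rescue_dominated"
    return "intervened_partial"
```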
Phase 3) Hazard modeling
Build an intervention-hazard model.
Useful outcome:
- you learn where the policy tends to need babysitting.
Phase 4) Branch-aware TCA
Report:
- clean-path IS,
- intervention frequency,
- rescue lift,
- cleanup drag,
- staffed vs unstaffed gap.
Phase 5) Policy redesign
If certain states repeatedly require rescue, either:
- improve the base policy,
- codify the rescue into the official policy,
- or narrow the autonomy envelope.
What good looks like
A mature stack can say all of the following without handwaving:
- “This is the clean autonomous slippage distribution.”
- “This is how often the policy enters states that require intervention.”
- “This is how much the rescue policy saves or costs.”
- “This is the staffed-support premium embedded in our historical data.”
- “This is the unattended tail if rescue does not arrive.”
Until you can say that, your labels are blending policy quality with operator heroics.
Bottom line
Intervened execution outcomes are not clean labels for the base policy.
If you do not separate:
- pre-distress cost,
- intervention hazard,
- rescue-path cost,
- and cleanup drag,
then the slippage model learns a fantasy in which hard cases magically end better than the autonomous policy could have achieved alone.
That fantasy is expensive.
It shows up later as:
- overconfident offline validation,
- brittle unattended execution,
- mispriced tail risk,
- and a desk that thinks the algo is smarter than it really is.
Human rescue is operationally useful.
But if you want truthful slippage modeling, you must log it, branch it, and stop calling it autonomous performance.