Manual-Intervention Contamination & Counterfactual-Label Leakage Slippage Playbook
Why this matters
A production execution stack is almost never fully autonomous.
When things get weird, desks intervene:
- pause the algo,
- widen a limit,
- flip urgency,
- disable a venue,
- manually cross residual,
- force a hedge,
- or restart a child-order schedule with fresh constraints.
Operationally, that is often the right move.
Modeling-wise, it creates a nasty trap:
- the base policy gets you into trouble,
- a human or supervisory layer rescues the order,
- the final slippage gets stored as if it were the base policy’s own outcome,
- and the training / TCA stack learns from a reality the policy never actually produced.
That contaminates both research and operations:
- offline evaluation becomes overoptimistic or perversely pessimistic,
- tactic comparisons become selection-biased,
- live guardrails get calibrated on rescue-adjusted data,
- and the desk stops knowing whether the model is good or merely recoverable by human babysitting.
This is not a minor reporting annoyance. It is a direct path to hidden slippage because policy quality, intervention quality, and crisis-case selection all get mixed into one mislabeled outcome.
Failure mode in one line
Observed slippage is treated as if it came from the autonomous policy, even though a human or supervisory intervention changed the action path mid-flight, so labels become counterfactually invalid and the stack learns the wrong cost surface.
What counts as an intervention
For this playbook, an intervention is any action that materially changes the execution path after the base policy has already started operating.
Typical examples:
- Pause / resume of the parent or venue subset
- Urgency override (e.g. passive VWAP -> aggressive catch-up)
- Limit override (wider cap, looser protection bands)
- Venue mask changes (disable darks, force lit-only, remove a flaky broker)
- Manual child-order entry or phone-assisted execution
- Residual liquidation by a separate rescue tactic
- Temporary participation-cap increase
- Manual hedge-first action before completion is finalized
- Supervisor-imposed kill / thaw / restart
- Fallback-policy activation that is not part of the base policy’s normal decision rule
The key idea is not “human vs machine.”
A fully automated supervisory controller can contaminate labels too if it activates only on stressed cases and its actions are not explicitly modeled.
Why the data becomes toxic
When an intervention occurs, the final outcome is no longer a clean sample from the base policy.
Three things happen at once:
1) Action-path substitution
The realized execution after intervention is generated by a different policy than the one you thought you were measuring.
2) Selection on hard cases
Interventions rarely happen at random. They cluster exactly when:
- fills are lagging,
- toxicity rises,
- venues misbehave,
- deadlines get close,
- or state integrity looks suspicious.
So intervened orders are systematically harder than average.
3) Counterfactual unobservability
Once the desk intervenes, you no longer observe what the base policy would have done from that point onward.
That means the recorded final slippage is not a valid label for either:
- the base policy alone, or
- the intervention policy alone.
It is a mixed-path outcome.
Observable signatures
1) Great realized outcomes on orders that were clearly in trouble
- Underfill deficit spikes
- Deadline slack collapses
- Manual or supervisory override fires
- Final implementation shortfall looks “not too bad”
- Model review concludes the base policy handled the case acceptably
What really happened: the rescue policy handled it acceptably.
2) Policy regressions that only appear after reducing desk babysitting
- Offline backtests look stable
- Shadow/live rollout looks worse
- Same code, fewer interventions, worse tail outcomes
That usually means the training set contained intervention-assisted labels.
3) TCA explains away rescue behavior as model robustness
- Orders with mid-flight urgency overrides still count toward passive-policy statistics
- Venue blacklists entered by ops are credited to the router
- Manual block cleanup is scored as if the schedule naturally completed
4) Tail slippage seems artificially compressed
- p50 may look normal
- p95/p99 look suspiciously better than raw event-path quality suggests
- The gap coincides with active desk hours or known supervisory windows
5) Model features “predict” success by proxying intervention likelihood
Examples:
- a severe deficit signal becomes positively associated with eventual completion,
- because severe deficits trigger human rescue more often.
6) Evaluation changes when after-hours or unattended sessions are isolated
- During staffed sessions, slippage appears better
- During unattended periods, same policy appears materially weaker
- Intervention coverage, not policy quality, explains the gap
Mechanical path to hidden slippage
Step 1) The base policy enters a degraded state
Maybe queue progress stalls, a venue goes weird, adverse selection rises, or the parent falls behind schedule.
Step 2) An operator or supervisory layer intervenes
Examples:
- increases aggression,
- disables a venue,
- widens limits,
- or manually crosses the residual.
Step 3) The execution path changes regime
The order is no longer following the base policy’s native control law.
Step 4) Final outcome is recorded as a single realized number
Implementation shortfall, markout, completion time, and residual path are stored without intervention-aware decomposition.
Step 5) Research consumes the mixed-path label
The training / evaluation stack implicitly assumes:
- the pre-intervention decisions and post-intervention outcomes belong to one stationary policy.
They do not.
Step 6) The system learns the wrong lesson
Typical wrong lessons:
- “late deficits are recoverable without much extra cost,”
- “this venue set is fine under stress,”
- “passive patience works near the deadline,”
- “this feature regime has lower tail risk than it really does.”
Step 7) Live deployment pays the tax
When the same regime appears in unattended or reduced-oversight conditions:
- rescue does not arrive,
- tails open up,
- residual cleanup gets more expensive,
- and the policy looks like it suddenly regressed.
It did not suddenly regress. It was never as good as the contaminated labels implied.
Core model
Let:
- X_t: state available to the base policy at time t
- A_base,t: action the base policy would choose
- I_t in {0,1}: intervention indicator at time t
- A_int,t: intervention action when intervention fires
- A_exec,t: actually executed action
- Y: realized final slippage / outcome
- T_int: intervention time, if any
- H_t: latent stress not perfectly captured in features
Then:
A_exec,t = (1 - I_t) * A_base,t + I_t * A_int,t
and the final realized outcome is:
Y_obs = Y(path(A_exec, market))
But the base-policy counterfactual you usually care about is:
Y_base = Y(path(A_base, market))
Once intervention occurs, Y_base is no longer observed.
A practical decomposition is:
IS_obs = IS_pre + I_parent * (IS_rescue + IS_cleanup) + (1 - I_parent) * IS_base_post
where:
- IS_pre: slippage accumulated before intervention,
- IS_rescue: slippage generated by the rescue path,
- IS_cleanup: unwind / catch-up / over-hedge cost caused by the intervention regime,
- IS_base_post: post-decision cost if no intervention occurred.
The modeling trap is using IS_obs as a supervised label for A_base.
That creates two biases:
Label leakage bias
The label includes consequences of actions not chosen by the base policy.
Selection bias
Interventions occur on non-random, harder cases:
P(I = 1 | X_t, H_t) is strongly state-dependent.
So even if you tag intervened orders, naïve comparison between intervened and clean samples is still biased.
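The selection effect is easy to demonstrate directly. Below is a minimal numpy sketch, assuming a toy one-dimensional latent stress H and a logistic intervention hazard (both invented for illustration): because intervention probability rises with stress, the intervened subset is harder than the clean subset by construction, so any naive intervened-vs-clean comparison is biased before modeling even begins.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Latent stress H drives both intervention hazard and true base-policy cost.
H = rng.normal(0.0, 1.0, n)

# State-dependent intervention: P(I=1 | H) rises with stress (toy logistic hazard).
p_int = 1.0 / (1.0 + np.exp(-(2.0 * H - 1.0)))
I = rng.random(n) < p_int

# Intervened orders are systematically more stressed than clean ones,
# so the two groups are not exchangeable.
mean_stress_intervened = H[I].mean()
mean_stress_clean = H[~I].mean()
assert mean_stress_intervened > mean_stress_clean
```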
Slippage decomposition you actually want
A more useful parent-level decomposition is:
IS_total = IS_base_clean + IS_pre_distress + IS_intervention + IS_post_intervention_cleanup
with four questions:
- How much did the base policy cost on non-intervened paths?
- How much distress accumulated before intervention fired?
- How much did the rescue itself cost or save?
- How much extra cleanup happened because the path was broken mid-flight?
If you cannot answer all four, then “policy performance” is still mixed with desk support.
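As a starting point, the decomposition can be approximated from fill-level data once intervention windows are logged. The sketch below is a simplifying assumption, not a definitive attribution rule: it splits per-fill cost contributions by time relative to the first intervention window, with all names (`decompose_parent_cost`, the tuple layout) hypothetical.

```python
def decompose_parent_cost(fills, t_int_start, t_int_end):
    """Split per-fill slippage contributions into pre / rescue / cleanup buckets.

    fills: list of (timestamp, cost_bps_contribution) tuples.
    t_int_start / t_int_end: first-intervention window; None start means
    no intervention occurred. Time-based attribution is an assumption.
    """
    buckets = {"pre": 0.0, "rescue": 0.0, "cleanup": 0.0}
    for ts, cost in fills:
        if t_int_start is None or ts < t_int_start:
            buckets["pre"] += cost
        elif t_int_end is not None and ts > t_int_end:
            buckets["cleanup"] += cost
        else:
            buckets["rescue"] += cost
    buckets["total"] = buckets["pre"] + buckets["rescue"] + buckets["cleanup"]
    return buckets
```

For example, fills at t=1, 5, 9 with an intervention window of [4, 8] attribute the first fill to pre, the second to rescue, and the third to cleanup.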
Intervention taxonomy
A) Pause / hold intervention
The order is temporarily stopped while the market keeps moving.
Risk: opportunity cost, schedule compression, re-entry toxicity.
B) Urgency override
The desk flips from passive logic to aggressive catch-up.
Risk: the final fill quality gets falsely credited to the original policy, while the spread-cross tax is hidden in aggregate averages.
C) Limit / protection override
A trader widens the executable band or relaxes protection logic.
Risk: completion looks better, but only because a human accepted a larger cost envelope than the policy itself would have allowed.
D) Venue-set intervention
The supervisor disables or enables venues mid-order.
Risk: venue-performance labels become contaminated because routing outcomes now reflect manual venue curation.
E) Manual residual cleanup
The tail of the order is crossed manually or through a rescue workflow.
Risk: the model learns that severe underfill states are less expensive than they really are under autonomous operation.
F) Hedge-first rescue
Inventory hedge or risk flattening happens before execution is complete.
Risk: base execution slippage and overlay risk management are conflated.
G) Restart / reseed intervention
The existing child schedule is canceled and a fresh schedule begins from a new state.
Risk: pre- and post-restart actions get treated as one continuous policy trajectory.
Feature set worth logging
If this data is not logged, you cannot repair it later.
Intervention identity
- intervention_flag
- intervention_type
- intervention_actor (human / supervisor / fallback-policy / ops)
- intervention_reason_code
- intervention_scope (parent / venue / broker / symbol / portfolio)
- intervention_start_ts
- intervention_end_ts
Pre-intervention distress state
- time_to_deadline_sec
- completion_deficit_pct
- residual_qty_pct
- queue_progress_gap
- markout_state
- toxicity_state
- venue_health_state
- risk_gate_state
- book_fragility_state
Path-change features
- urgency_before_after
- participation_cap_before_after
- venue_mask_before_after
- limit_band_before_after
- manual_cross_qty
- manual_hedge_qty
- child_schedule_reset_flag
Label-integrity features
- counterfactual_label_valid_flag
- label_contamination_window_ms
- post_intervention_notional_share
- pre_intervention_cost_bps
- post_intervention_cost_bps
- cleanup_cost_bps
- intervention_assisted_completion_flag
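One low-effort way to make the intervention-identity fields non-optional in the pipeline is a typed record. This is a sketch using the field names listed above; the types and defaults are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterventionRecord:
    """One intervention event, using the field names from the playbook.

    Types are illustrative assumptions; adapt to the local schema.
    """
    intervention_flag: bool
    intervention_type: str       # e.g. "urgency_override", "venue_mask"
    intervention_actor: str      # human / supervisor / fallback-policy / ops
    intervention_reason_code: str
    intervention_scope: str      # parent / venue / broker / symbol / portfolio
    intervention_start_ts: float
    intervention_end_ts: Optional[float] = None  # None while still active
```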
Metrics that surface the problem
1) Intervention Frequency Rate (IFR)
IFR = intervened_parents / total_parents
Track by:
- symbol,
- tactic,
- venue set,
- time of day,
- operator coverage window.
2) Intervened Notional Share (INS)
INS = intervened_notional / total_notional
A low count rate with high notional share means the desk is rescuing the biggest or hardest orders.
3) Post-Intervention Notional Share (PINS)
PINS = notional_executed_after_first_intervention / total_parent_notional
High PINS means the final outcome is mostly generated by the rescue path, not the base path.
4) Label Purity Score (LPS)
A practical version:
LPS = 1 - PINS
You can make it fancier, but the intuition is simple: once most of the order is filled after intervention, the label is not clean.
5) Rescue Contribution Ratio (RCR)
RCR = completion_after_intervention / total_completion
If RCR is high on “good outcomes,” you are probably over-crediting the autonomous policy.
6) Distress-to-Intervention Lag (DIL)
Time from entering a degraded state to first intervention.
Why it matters:
- short lag may indicate the base policy is brittle,
- long lag may mean operators arrive too late and pay larger rescue cost.
7) Intervention-Free Tail Cost (IFTC)
Compute p95 / p99 cost only on clean, non-intervened paths.
This is one of the closest things you have to base-policy tail truth.
8) Staffed-vs-Unstaffed Slippage Gap (SUSG)
Compare similar orders across:
- staffed periods,
- lightly monitored periods,
- unattended periods.
If staffed performance is much better without any model change, desk support is part of the real policy.
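The count-based and notional-based metrics above are cheap to compute once per-parent records exist. A minimal sketch, assuming hypothetical per-parent dict keys (`notional`, `intervened`, `notional_after_intervention`, `completion`, `completion_after_intervention`):

```python
def intervention_metrics(parents):
    """Compute IFR, INS, PINS, LPS, and RCR from per-parent records.

    Each parent is a dict; key names here are assumptions, not a
    standard schema. Completion is measured in filled quantity.
    """
    n = len(parents)
    total_notional = sum(p["notional"] for p in parents)
    intervened = [p for p in parents if p["intervened"]]

    ifr = len(intervened) / n
    ins = sum(p["notional"] for p in intervened) / total_notional
    pins = sum(p["notional_after_intervention"] for p in parents) / total_notional
    lps = 1.0 - pins
    total_completion = sum(p["completion"] for p in parents)
    rcr = sum(p["completion_after_intervention"] for p in parents) / total_completion
    return {"IFR": ifr, "INS": ins, "PINS": pins, "LPS": lps, "RCR": rcr}
```

Note the asymmetry this makes visible: a desk rescuing only its few largest orders yields a low IFR but a high INS, which is exactly the "rescuing the biggest or hardest orders" pattern described above.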
Modeling approaches that do not lie as much
Approach 1) Hard exclusion for primary label set
For the base slippage model, exclude parents once intervention materially changes the path.
Use when:
- you care about clean autonomous-policy estimation,
- intervention rate is not overwhelming,
- and you can afford smaller sample size.
Good: high label purity.
Bad: selection bias remains if you then forget that excluded cases were exactly the hard ones.
Approach 2) Two-stage model: intervention hazard + clean-path slippage
Model:
- P(I = 1 | X): hazard of intervention
- E[IS | X, I = 0]: clean-path cost given no intervention
Use this to answer:
- where is the policy likely to require rescue,
- and how expensive is the clean path where it survives on its own.
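To show the structure without committing to a learner, here is a toy two-stage estimator on a single binned feature: stage 1 is the empirical intervention hazard per bin, stage 2 is the mean clean-path cost per bin fitted only on non-intervened samples. This is a sketch of the idea, not a production model; in practice each stage would be a proper classifier / regressor.

```python
import numpy as np

def two_stage_fit(X, I, IS, bins):
    """Toy two-stage estimator on a 1-D feature.

    Stage 1: empirical hazard P(I=1 | X in bin).
    Stage 2: mean clean-path slippage E[IS | X in bin, I=0],
             estimated only on non-intervened samples.
    Returns (hazard_per_bin, clean_cost_per_bin); NaN for empty cells.
    """
    idx = np.digitize(X, bins)
    n_bins = len(bins) + 1
    hazard = np.full(n_bins, np.nan)
    clean_cost = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            hazard[b] = I[in_bin].mean()
        clean = in_bin & (I == 0)
        if clean.any():
            clean_cost[b] = IS[clean].mean()
    return hazard, clean_cost
```

A NaN in `clean_cost` where `hazard` is high is itself informative: in that state region the policy essentially never survives on its own, so no clean-path estimate exists there.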
Approach 3) Mixture model with intervention branch
Estimate separate branches:
- clean autonomous branch,
- intervention-assisted branch.
Then route decisions can optimize not just expected slippage, but expected dependence on human rescue.
Approach 4) Shadow-policy reconstruction after intervention boundary
After the intervention time, continue simulating what the base policy would have done in shadow mode.
This is imperfect but valuable.
It gives a rough proxy for:
- Y_base after contamination begins,
- and rescue lift vs rescue cost.
Approach 5) Staffed-support as part of the true production policy
Sometimes the honest answer is:
the production policy is not autonomous; it is a human-in-the-loop system.
If so, treat operator availability and intervention logic as part of the policy contract. Do not report autonomous performance as if that were the real operating mode.
Practical state machine
Use a simple state model:
GREEN
- no intervention,
- label remains clean.
WATCH
- distress signals rising,
- no path change yet,
- intervention hazard elevated.
HUMAN_ASSIST
- intervention occurred,
- label contamination begins,
- clean autonomous evaluation should stop or branch.
RESCUE_LOCK
- majority of remaining execution is now driven by rescue logic,
- final outcome must not be attributed to base policy.
RECOVERY
- intervention ended,
- but cleanup / hedge / residual effects remain.
SAFE_REVIEW
- parent closed,
- decompose cost into pre-distress, rescue, and cleanup components before feeding research.
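The states above can be encoded directly so that labeling logic and evaluation logic share one source of truth. The transition rule below is one illustrative sketch; in particular, the PINS > 0.5 cutoff for RESCUE_LOCK is an assumed threshold, not part of the playbook.

```python
from enum import Enum, auto

class LabelState(Enum):
    GREEN = auto()
    WATCH = auto()
    HUMAN_ASSIST = auto()
    RESCUE_LOCK = auto()
    RECOVERY = auto()
    SAFE_REVIEW = auto()

def next_state(state, distress_rising, intervened, pins,
               intervention_ended, parent_closed):
    """One illustrative transition rule; thresholds are assumptions."""
    if parent_closed:
        return LabelState.SAFE_REVIEW
    if intervened:
        # Majority of remaining execution driven by rescue logic.
        if pins > 0.5:
            return LabelState.RESCUE_LOCK
        return LabelState.HUMAN_ASSIST
    if intervention_ended:
        return LabelState.RECOVERY
    if distress_rising:
        return LabelState.WATCH
    return state
```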
TCA questions to ask on every intervened order
- Would the base policy have completed on time without intervention?
- How much cost had already accrued before intervention fired?
- How much notional was executed after intervention?
- Did the intervention genuinely reduce cost, or just preserve completion?
- Did it create cleanup cost later?
- Should this order be excluded, downweighted, or branched in training?
- If this happened unattended, what would the tail have looked like?
Anti-patterns
A) Counting all final fills as policy-generated fills
This is the canonical contamination bug.
B) Using intervention reason codes only for dashboards, not for labels
If the reason code is not part of dataset curation, it is just decorative telemetry.
C) Excluding intervened orders and calling the remainder “unbiased”
That only hides the hard cases. It does not make them random.
D) Letting fallback-policy actions masquerade as base-policy resilience
Automated rescue logic can contaminate labels just as much as a human can.
E) Comparing staffed and unstaffed performance without intervention coverage metrics
That comparison is meaningless unless support intensity is measured.
F) Rewarding a policy for states that mostly trigger rescue
If severe deficit states end well only because the desk steps in, the policy should not learn that those states are cheap.
A practical rollout plan
Phase 1) Instrumentation first
Before fancy modeling, capture:
- intervention start/end,
- actor,
- reason,
- scope,
- changed constraints,
- and post-intervention notional share.
Phase 2) Label partitioning
Create three buckets:
- clean_autonomous
- intervened_partial
- rescue_dominated
Do not train them together by default.
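A minimal bucketing rule, assuming PINS is available per parent; the 0.5 cutoff separating intervened_partial from rescue_dominated is an assumed default that a desk should calibrate, not a prescribed value.

```python
def label_bucket(intervened, pins, rescue_threshold=0.5):
    """Assign the three curation buckets from the intervention flag
    and post-intervention notional share (PINS).

    rescue_threshold is an assumed default, not a standard cutoff.
    """
    if not intervened:
        return "clean_autonomous"
    if pins >= rescue_threshold:
        return "rescue_dominated"
    return "intervened_partial"
```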
Phase 3) Hazard modeling
Build an intervention-hazard model.
Useful outcome:
- you learn where the policy tends to need babysitting.
Phase 4) Branch-aware TCA
Report:
- clean-path IS,
- intervention frequency,
- rescue lift,
- cleanup drag,
- staffed vs unstaffed gap.
Phase 5) Policy redesign
If certain states repeatedly require rescue, either:
- improve the base policy,
- codify the rescue into the official policy,
- or narrow the autonomy envelope.
What good looks like
A mature stack can say all of the following without handwaving:
- “This is the clean autonomous slippage distribution.”
- “This is how often the policy enters states that require intervention.”
- “This is how much the rescue policy saves or costs.”
- “This is the staffed-support premium embedded in our historical data.”
- “This is the unattended tail if rescue does not arrive.”
Until you can say that, your labels are blending policy quality with operator heroics.
Bottom line
Intervened execution outcomes are not clean labels for the base policy.
If you do not separate:
- pre-distress cost,
- intervention hazard,
- rescue-path cost,
- and cleanup drag,
then the slippage model learns a fantasy in which hard cases magically end better than the autonomous policy could have achieved alone.
That fantasy is expensive.
It shows up later as:
- overconfident offline validation,
- brittle unattended execution,
- mispriced tail risk,
- and a desk that thinks the algo is smarter than it really is.
Human rescue is operationally useful.
But if you want truthful slippage modeling, you must log it, branch it, and stop calling it autonomous performance.