Kill-Switch False-Positive & Rearm-Cooldown Slippage Playbook
Model Safety Halts as a Branching Execution Regime (Not a Binary On/Off Event)
Why this note: Most desks treat kill-switch activation as pure risk protection and ignore the execution drag after restart. In practice, the largest hidden cost is often a false-positive halt followed by a slow rearm path, which leaves you under-participating through a liquidity regime shift.
1) Failure Mode in One Sentence
If your slippage stack models kill-switch events as “risk avoided” without pricing rearm latency and queue reset, it will systematically understate p95/p99 implementation shortfall during stressed-but-tradeable windows.
2) Branch-Aware Cost Decomposition
Let a parent order have residual inventory (Q_t) when a kill-switch trigger fires.
[ \mathbb{E}[IS_t] = p_{TP}\cdot C_{\mathrm{true\_halt}} + (1-p_{TP})\cdot C_{\mathrm{false\_halt}} ]
Where:
- (p_{TP}): probability the trigger was a true risk event (halt was justified)
- (C_{\mathrm{true\_halt}}): cost when halt genuinely prevented severe adverse outcomes
- (C_{\mathrm{false\_halt}}): cost when halt was unnecessary and re-entry caused drag
A practical expansion:
[ C_{\mathrm{false\_halt}} = C_{\mathrm{cancel\_flush}} + C_{\mathrm{cooldown\_idle}} + C_{\mathrm{rearm\_ramp}} + C_{\mathrm{queue\_reset}} + C_{\mathrm{deadline\_miss}} ]
[ C_{\mathrm{true\_halt}} = C_{\mathrm{protective\_delay}} - B_{\mathrm{tail\_loss\_avoided}} ]
This framing avoids the common accounting mistake: counting only losses avoided while ignoring restart friction.
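A minimal sketch of the decomposition above, to make the arithmetic concrete. The class and function names, the basis-point units, and every number are illustrative assumptions, not calibrated values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FalseHaltCosts:
    """Components of the false-halt cost, in bps (units are an assumption)."""
    cancel_flush: float
    cooldown_idle: float
    rearm_ramp: float
    queue_reset: float
    deadline_miss: float

    def total(self) -> float:
        # Sum of the five restart-friction terms in the practical expansion.
        return (self.cancel_flush + self.cooldown_idle + self.rearm_ramp
                + self.queue_reset + self.deadline_miss)


def true_halt_cost(protective_delay: float, tail_loss_avoided: float) -> float:
    """Protective delay cost minus tail loss avoided; negative means net benefit."""
    return protective_delay - tail_loss_avoided


def expected_is(p_tp: float, c_true: float, c_false: float) -> float:
    """Probability-weighted blend of the true-halt and false-halt branches."""
    if not 0.0 <= p_tp <= 1.0:
        raise ValueError("p_tp must be a probability in [0, 1]")
    return p_tp * c_true + (1.0 - p_tp) * c_false


# Made-up inputs: a trigger that is right only 25% of the time.
false_costs = FalseHaltCosts(1.0, 3.0, 2.0, 1.5, 0.5)                  # 8.0 bps
c_true = true_halt_cost(protective_delay=4.0, tail_loss_avoided=24.0)  # -20.0 bps
print(expected_is(0.25, c_true, false_costs.total()))                  # prints 1.0
```

With these inputs the halt is a net cost in expectation even though the true-halt branch saves 20 bps, which is exactly the case a "risk avoided" ledger hides.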
3) Why Rearm Is Its Own Slippage Regime
After a kill event, you usually face all of the following simultaneously:
- Book-position reset: all passive queue priority is gone.
- Participation clamps: temporary risk caps throttle child-order aggression.
- Warm-up guards: staggered re-enable logic delays normal routing.
- Signal staleness: pre-halt microstructure features are partially invalid.
- Operator friction: manual confirmations or staged unlocks introduce latency variance.
Treating this as “just back to normal after unpause” is a recurring source of tail misses.
4) Production State Machine (Kill/Recover Path)
- K0 NORMAL — baseline policy
- K1 PRE-TRIGGER STRESS — hazard signals rising
- K2 KILL ACTIVATED — new flow blocked, safe cancels in-flight
- K3 FLUSH PENDING — cancel-ack convergence / residual uncertainty
- K4 COOLDOWN IDLE — hard minimum dwell before re-enable
- K5 REARM RAMP — staged participation/routing reactivation
- K6 POST-REARM FRAGILE — active again but queue depth/markout still unstable
- K7 STABLE — recovered to baseline envelope
Use hysteresis and minimum dwell time for K5→K7 to prevent oscillatory re-kills.
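The K0..K7 path can be sketched as a small state machine. The transition table and the dwell times below are assumptions chosen for illustration (the playbook does not specify them), and the sketch covers only the minimum-dwell half of the advice; hysteresis on the health signals themselves would live in whatever code decides to call `try_transition`.

```python
from enum import Enum


class K(Enum):
    NORMAL = 0
    PRE_TRIGGER_STRESS = 1
    KILL_ACTIVATED = 2
    FLUSH_PENDING = 3
    COOLDOWN_IDLE = 4
    REARM_RAMP = 5
    POST_REARM_FRAGILE = 6
    STABLE = 7


# Assumed transition graph: recovery is strictly staged, and any state
# that is sending flow may fall back into KILL_ACTIVATED.
ALLOWED = {
    K.NORMAL: {K.PRE_TRIGGER_STRESS, K.KILL_ACTIVATED},
    K.PRE_TRIGGER_STRESS: {K.NORMAL, K.KILL_ACTIVATED},
    K.KILL_ACTIVATED: {K.FLUSH_PENDING},
    K.FLUSH_PENDING: {K.COOLDOWN_IDLE},
    K.COOLDOWN_IDLE: {K.REARM_RAMP},
    K.REARM_RAMP: {K.POST_REARM_FRAGILE, K.KILL_ACTIVATED},
    K.POST_REARM_FRAGILE: {K.STABLE, K.KILL_ACTIVATED},
    K.STABLE: {K.NORMAL, K.PRE_TRIGGER_STRESS, K.KILL_ACTIVATED},
}

# Hypothetical minimum dwell per state, in seconds.
MIN_DWELL_S = {
    K.COOLDOWN_IDLE: 30.0,
    K.REARM_RAMP: 60.0,
    K.POST_REARM_FRAGILE: 120.0,
}


class KillRecoverFSM:
    def __init__(self, now: float):
        self.state = K.NORMAL
        self.entered_at = now

    def try_transition(self, target: K, now: float) -> bool:
        """Move to `target` if the edge exists and the dwell has elapsed."""
        if target not in ALLOWED[self.state]:
            return False
        dwell = MIN_DWELL_S.get(self.state, 0.0)
        # A kill is never delayed; dwell only slows recovery steps.
        if target is not K.KILL_ACTIVATED and now - self.entered_at < dwell:
            return False
        self.state, self.entered_at = target, now
        return True
```

Exempting the kill edge from dwell keeps the protective path fast while still preventing K5→K7 from completing on the first green signal.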
5) Features That Matter Most
A) Trigger validity (true vs false positive)
- trigger_context_consistency_score
- multi-signal_agreement_ratio
- venue_reject_spike_confirmed
- latency_spike_persistence_ms
- markout_adversity_confirmation
B) Flush/rearm readiness
- cancel_ack_completion_ratio
- orphan_child_order_probability
- dropcopy_lag_ms
- risk_limit_sync_lag_ms
- cooldown_remaining_ms
C) Post-rearm fragility
- rearm_first_60s_spread_multiple
- rebuild_depth_half_life_ms
- first_minute_fill_shortfall
- post_rearm_markout_skew
D) Urgency coupling
- residual_notional_to_expected_volume
- deadline_slack_sec
- alpha_decay_half_life_sec
Without explicit K4–K6 features, models often look fine in averages and fail in recovery tails.
6) Three-Stage Modeling Stack
Stage A — Trigger-validity model
Estimate (p_{TP}) at trigger time:
- calibrated classifier (GBDT/logit with monotonic constraints where needed)
- explicit confidence buckets to support policy gating
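One way to turn a calibrated (p_{TP}) into policy gating is a fixed bucket lookup plus a per-bucket reliability check. The bucket edges, the action names, and both function names are hypothetical; the classifier itself is out of scope here and is assumed to produce the probabilities.

```python
from bisect import bisect_right

# Hypothetical confidence buckets: [0, 0.3), [0.3, 0.7), [0.7, 1].
BUCKET_EDGES = [0.3, 0.7]
BUCKET_ACTIONS = ["monitor_only", "graduated_throttle", "hard_kill"]


def gate_action(p_tp: float) -> str:
    """Map a calibrated true-positive probability to a gated action."""
    if not 0.0 <= p_tp <= 1.0:
        raise ValueError("p_tp must be a probability in [0, 1]")
    return BUCKET_ACTIONS[bisect_right(BUCKET_EDGES, p_tp)]


def bucket_reliability(probs, labels):
    """Per bucket: (count, mean predicted p_TP, observed true-halt rate).

    A well-calibrated model has the last two numbers close in every
    bucket; empty buckets are reported as (0, None, None).
    """
    acc = [[0, 0.0, 0.0] for _ in BUCKET_ACTIONS]
    for p, y in zip(probs, labels):
        b = bisect_right(BUCKET_EDGES, p)
        acc[b][0] += 1
        acc[b][1] += p
        acc[b][2] += y
    return [(n, sp / n, sy / n) if n else (0, None, None)
            for n, sp, sy in acc]


print(gate_action(0.1), gate_action(0.5), gate_action(0.9))
# prints: monitor_only graduated_throttle hard_kill
```

Keeping the buckets explicit makes it easy to audit whether the "hard_kill" bucket really contains mostly justified halts before the gate is trusted in production.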
Stage B — Recovery-time model
Estimate conditional durations:
- (T_{flush}), (T_{cooldown}), (T_{rearm})
- survival/hazard models are practical because right-censoring is common
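To show why right-censoring matters for the duration estimates, here is a minimal Kaplan-Meier sketch in plain Python. It stands in for whatever survival or hazard model a desk actually uses; the function names and the sample durations are made up for illustration.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve as [(t, S(t))] at each event time.

    durations: seconds spent in a recovery state (e.g. rearm).
    observed:  True if the state actually ended at that time, False if
               the episode was right-censored (e.g. session closed first).
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = censored = 0
        # Group all episodes sharing this duration.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            else:
                censored += 1
            i += 1
        if events:
            surv *= 1.0 - events / at_risk
            curve.append((t, surv))
        # Censored episodes leave the risk set without counting as events.
        at_risk -= events + censored
    return curve


def km_quantile(curve, q):
    """Smallest event time by which a fraction q of episodes has finished.

    Returns None when censoring leaves that quantile unidentified.
    """
    for t, s in curve:
        if s <= 1.0 - q:
            return t
    return None


# Five made-up rearm episodes; two were cut off before rearm completed.
curve = kaplan_meier([5, 8, 8, 12, 20], [True, True, False, True, False])
print(km_quantile(curve, 0.5))   # prints 12
print(km_quantile(curve, 0.95))  # prints None
```

The `None` for the 95th percentile is the useful signal: dropping censored episodes or treating them as completed would produce a number there, and it would be biased low.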
Stage C — Conditional slippage model (post-rearm)
Quantile heads (q50/q90/q97.5) for:
- IS | K5,K6 under ramp schedules
- missed-completion risk under residual urgency
Unified action score:
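[ \mathrm{Score}(a_t) = \mathbb{E}[IS_t(a_t)] + \lambda\,\mathrm{CVaR}_{\alpha}(IS_t(a_t)) + \gamma\,P(\text{deadline miss}\mid a_t) ]
Here (\lambda) and (\gamma) are desk-chosen weights on tail risk and deadline misses, and (\alpha) is the CVaR confidence level.
Scoring the tail and deadline terms alongside the mean prevents false confidence from mean-only recovery estimates.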
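A sketch of the action score using an empirical CVaR over simulated slippage samples. The sample values, the weights, and the deadline-miss probabilities are invented for illustration; in practice they would come from the Stage B and Stage C models.

```python
import math


def cvar(samples, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) share of samples.

    Higher implementation shortfall is treated as worse.
    """
    if not samples or not 0.0 < alpha < 1.0:
        raise ValueError("need samples and 0 < alpha < 1")
    ordered = sorted(samples, reverse=True)
    # round() guards against float noise such as 1.0000000000000009.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


def action_score(is_samples, p_deadline_miss, lam, gamma, alpha):
    """Mean IS + lam * CVaR_alpha(IS) + gamma * P(deadline miss)."""
    mean_is = sum(is_samples) / len(is_samples)
    return mean_is + lam * cvar(is_samples, alpha) + gamma * p_deadline_miss


# Made-up IS samples (bps) for two candidate rearm actions.
full_rearm = [1, 2, 2, 3, 22]    # lower mean, one ugly tail outcome
staged_ramp = [5, 6, 6, 7, 8]    # higher mean, tight tail

full_score = action_score(full_rearm, 0.02, lam=0.5, gamma=100.0, alpha=0.8)
staged_score = action_score(staged_ramp, 0.03, lam=0.5, gamma=100.0, alpha=0.8)
best = "staged_ramp" if staged_score < full_score else "full_rearm"
print(best)  # prints staged_ramp
```

On mean IS alone the immediate full rearm looks better (6.0 bps against 6.4 bps); the tail term is what flips the decision.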
7) Control Policy by State
K1 PRE-TRIGGER STRESS
- Require multi-signal confirmation before hard kill when possible
- Prefer graduated throttles first if confidence is low
K2–K3 KILL/FLUSH
- Enforce deterministic cancel-completion checks
- Freeze strategy-level alpha assumptions until state is reconciled
K4 COOLDOWN
- Run branch simulation: immediate full rearm vs staged ramp
- Precompute participation ladders by residual urgency tier
K5 REARM RAMP
- Step participation caps (e.g., 20% → 40% → baseline) with guard checks
- Tighten stale-signal TTL and passive timeout during first ramps
K6 POST-REARM FRAGILE
- Cap bursty catch-up behavior
- Prefer tactics robust to refill uncertainty over nominal spread capture
K7 STABLE
- Return to baseline gradually; avoid one-tick full-policy snapback
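The K5 step schedule (20% → 40% → baseline) can be sketched as a small controller driven by periodic guard checks. The number of consecutive healthy checks required per step and the step-back-on-failure rule are assumptions; the playbook only says the steps are guarded.

```python
from dataclasses import dataclass

# Participation cap as a fraction of the baseline cap.
RAMP_STEPS = (0.20, 0.40, 1.00)


@dataclass
class RampController:
    passes_required: int = 3  # hypothetical: healthy checks needed per step
    step: int = 0
    streak: int = 0

    @property
    def cap(self) -> float:
        return RAMP_STEPS[self.step]

    def on_guard_check(self, healthy: bool) -> float:
        """Update the ramp from one guard check and return the active cap."""
        if not healthy:
            # Assumed policy: drop one step and restart the streak.
            self.step = max(0, self.step - 1)
            self.streak = 0
        else:
            self.streak += 1
            at_top = self.step == len(RAMP_STEPS) - 1
            if self.streak >= self.passes_required and not at_top:
                self.step += 1
                self.streak = 0
        return self.cap


rc = RampController()
checks = [True, True, True, True, False, True, True, True, True, True, True]
caps = [rc.on_guard_check(h) for h in checks]
print(caps)
# prints [0.2, 0.2, 0.4, 0.4, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 1.0]
```

A single failed guard costs one step and a full streak, so the controller reaches baseline later after a wobble instead of snapping back on the next green check.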
8) Diagnostics & KPIs
- FTR — False Trigger Rate
- RLT95 — Rearm Latency Tail p95 (trigger→active)
- FCD — Flush Completion Drift (predicted vs realized)
- RFS95 — Rearm Fragility Slippage p95 (first 1–5 min)
- QRT — Queue Rebuild Time to baseline depth
- DMP — Deadline Miss Probability after rearm
If total incident count drops but RFS95 worsens, you’re probably over-killing and under-recovering.
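Three of these KPIs can be computed directly from per-episode records, as sketched below. The record keys and the sample episodes are hypothetical, and the nearest-rank percentile is one convention among several.

```python
def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values or not 1 <= pct <= 100:
        raise ValueError("need values and 1 <= pct <= 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def kill_switch_kpis(episodes):
    """Compute FTR, RLT95 and RFS95 from labelled kill episodes."""
    n = len(episodes)
    false_n = sum(1 for e in episodes if not e["true_positive"])
    return {
        "FTR": false_n / n,
        "RLT95": percentile_nearest_rank(
            [e["rearm_latency_s"] for e in episodes], 95),
        "RFS95": percentile_nearest_rank(
            [e["post_rearm_slippage_bps"] for e in episodes], 95),
    }


# Made-up episodes: half the triggers were false positives.
episodes = [
    {"true_positive": True, "rearm_latency_s": 30, "post_rearm_slippage_bps": 2.0},
    {"true_positive": False, "rearm_latency_s": 45, "post_rearm_slippage_bps": 3.5},
    {"true_positive": False, "rearm_latency_s": 300, "post_rearm_slippage_bps": 11.0},
    {"true_positive": True, "rearm_latency_s": 60, "post_rearm_slippage_bps": 4.0},
]
print(kill_switch_kpis(episodes))
# prints {'FTR': 0.5, 'RLT95': 300, 'RFS95': 11.0}
```

Tracking these per period makes the over-kill/under-recover pattern visible as a falling incident count alongside a rising RFS95.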
9) Rollout Blueprint
- Shadow (2–4 weeks): log trigger-validity probabilities + recovery-time forecasts
- Replay: evaluate historical kill episodes with branch-aware counterfactuals
- Canary: enable staged rearm policy on limited symbols/notional
- Promotion gates:
  - lower p95/p99 recovery slippage
  - no increase in severe-loss incidents
  - stable orphan/residual reconciliation metrics
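The three promotion gates can be expressed as one explicit check between baseline and canary summaries. The metric keys and the tolerance used to interpret "stable" are assumptions for illustration.

```python
def promotion_ok(baseline: dict, canary: dict, orphan_tol: float = 0.0) -> bool:
    """Return True only if the canary clears all three promotion gates."""
    lower_tail_slippage = (
        canary["recovery_slip_p95"] < baseline["recovery_slip_p95"]
        and canary["recovery_slip_p99"] < baseline["recovery_slip_p99"]
    )
    no_more_severe_losses = (
        canary["severe_loss_incidents"] <= baseline["severe_loss_incidents"]
    )
    # "Stable" is read here as: no worse than baseline plus a tolerance.
    reconciliation_stable = (
        canary["orphan_rate"] <= baseline["orphan_rate"] + orphan_tol
    )
    return lower_tail_slippage and no_more_severe_losses and reconciliation_stable


# Made-up summaries for one canary window.
baseline = {"recovery_slip_p95": 9.0, "recovery_slip_p99": 21.0,
            "severe_loss_incidents": 1, "orphan_rate": 0.004}
canary = {"recovery_slip_p95": 7.5, "recovery_slip_p99": 18.0,
          "severe_loss_incidents": 1, "orphan_rate": 0.004}
print(promotion_ok(baseline, canary))  # prints True
```

All three gates are conjunctive by design: a tail-slippage win does not buy back an increase in severe-loss incidents.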
10) Common Anti-Patterns
- Treating kill-switch as a binary risk metric, not an execution regime
- Assuming cooldown cost is negligible versus protective benefit
- Re-enabling full aggression immediately after first green signal
- Ignoring queue reset and depth rebuild in post-rearm cost
- Evaluating policy with mean IS only (no tail/deadline lenses)
11) Fast Implementation Checklist
[ ] Label K0..K7 state transitions in execution telemetry
[ ] Train trigger-validity model for true/false halt probability
[ ] Model flush/cooldown/rearm durations with censored-time methods
[ ] Add post-rearm quantile slippage + deadline-miss heads
[ ] Implement staged rearm controller with hysteresis and caps
[ ] Gate promotion on RLT95/RFS95/DMP tail improvements
TL;DR
Kill-switch events should be modeled as a branching lifecycle (trigger validity → flush/cooldown → rearm fragility), not a single protective switch. Pricing false-positive and recovery tails explicitly usually improves p95/p99 slippage and completion reliability more than tuning baseline impact curves alone.