Kill-Switch False-Positive & Rearm-Cooldown Slippage Playbook

Model Safety Halts as a Branching Execution Regime (Not a Binary On/Off Event)

Why this note: Most desks treat kill-switch activation as pure risk protection and ignore the execution drag that follows a restart. In practice, the largest hidden cost is often a false-positive halt followed by a slow rearm path, which leaves the order under-participating through a liquidity regime shift.

1) Failure Mode in One Sentence

If your slippage stack models kill-switch events as “risk avoided” without pricing rearm latency and queue reset, it will systematically understate p95/p99 implementation shortfall during stressed-but-tradeable windows.

2) Branch-Aware Cost Decomposition

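Let a parent order have residual inventory (Q_t) when a kill-switch trigger fires. Its expected implementation shortfall (IS) on that residual then splits by whether the trigger was justified: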

[ \mathbb{E}[IS_t] = p_{TP}\cdot C_{\text{true\_halt}} + (1-p_{TP})\cdot C_{\text{false\_halt}} ]

Where:

(p_{TP}): probability the trigger was a true risk event (halt was justified)
(C_{\text{true\_halt}}): cost when halt genuinely prevented severe adverse outcomes
(C_{\text{false\_halt}}): cost when halt was unnecessary and re-entry caused drag

A practical expansion:

[ C_{\text{false\_halt}} = C_{\text{cancel\_flush}} + C_{\text{cooldown\_idle}} + C_{\text{rearm\_ramp}} + C_{\text{queue\_reset}} + C_{\text{deadline\_miss}} ]

[ C_{\text{true\_halt}} = C_{\text{protective\_delay}} - B_{\text{tail\_loss\_avoided}} ]

This framing avoids the common accounting mistake: counting only losses avoided while ignoring restart friction.
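
As a worked illustration of the decomposition, here is a minimal Python sketch. The class and function names, the choice of basis points as the unit, and every number are assumptions made up for the example, not figures from a real desk. It contrasts an accounting that sets the false-halt cost to zero with one that prices restart friction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FalseHaltCosts:
    """Components of C_false_halt, assumed here to be in bps of residual notional."""
    cancel_flush: float
    cooldown_idle: float
    rearm_ramp: float
    queue_reset: float
    deadline_miss: float

    def total(self) -> float:
        return (self.cancel_flush + self.cooldown_idle + self.rearm_ramp
                + self.queue_reset + self.deadline_miss)


def true_halt_cost(protective_delay: float, tail_loss_avoided: float) -> float:
    """C_true_halt = C_protective_delay - B_tail_loss_avoided (can be negative)."""
    return protective_delay - tail_loss_avoided


def expected_is(p_tp: float, c_true: float, c_false: float) -> float:
    """E[IS_t] = p_TP * C_true_halt + (1 - p_TP) * C_false_halt."""
    if not 0.0 <= p_tp <= 1.0:
        raise ValueError("p_tp must be a probability in [0, 1]")
    return p_tp * c_true + (1.0 - p_tp) * c_false


# Illustrative inputs only.
false_costs = FalseHaltCosts(cancel_flush=1.0, cooldown_idle=3.0, rearm_ramp=2.0,
                             queue_reset=1.5, deadline_miss=0.5)
c_true = true_halt_cost(protective_delay=4.0, tail_loss_avoided=40.0)

# "Risk avoided" accounting: pretends false halts are free.
naive = expected_is(p_tp=0.25, c_true=c_true, c_false=0.0)
# Branch-aware accounting: prices the false-positive branch.
branch_aware = expected_is(p_tp=0.25, c_true=c_true, c_false=false_costs.total())
print(naive, branch_aware)
```

With a low (p_{TP}), most of the probability mass sits on the false-halt branch, so the gap between the two figures is the restart friction that the naive accounting hides.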

3) Why Rearm Is Its Own Slippage Regime

After a kill event, you usually face all of the following simultaneously:

Book-position reset: all passive queue priority is gone.
Participation clamps: temporary risk caps throttle child-order aggression.
Warm-up guards: staggered re-enable logic delays normal routing.
Signal staleness: pre-halt microstructure features are partially invalid.
Operator friction: manual confirmations or staged unlocks introduce latency variance.

Treating this as “just back to normal after unpause” is a recurring source of tail misses.

4) Production State Machine (Kill/Recover Path)

K0 NORMAL — baseline policy
K1 PRE-TRIGGER STRESS — hazard signals rising
K2 KILL ACTIVATED — new flow blocked, safe cancels in-flight
K3 FLUSH PENDING — cancel-ack convergence / residual uncertainty
K4 COOLDOWN IDLE — hard minimum dwell before re-enable
K5 REARM RAMP — staged participation/routing reactivation
K6 POST-REARM FRAGILE — active again but queue depth/markout still unstable
K7 STABLE — recovered to baseline envelope

Use hysteresis and minimum dwell times on the K5→K6→K7 transitions to prevent oscillatory re-kills.
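
One way to picture the dwell and hysteresis rule is the sketch below. The dwell times, the scalar `stress` input, and both thresholds are hypothetical placeholders; a real controller would derive them from its own hazard signals. Applying the stress check to K4 as well is a choice made for this example.

```python
from enum import Enum


class K(Enum):
    NORMAL = 0
    PRE_TRIGGER_STRESS = 1
    KILL_ACTIVATED = 2
    FLUSH_PENDING = 3
    COOLDOWN_IDLE = 4
    REARM_RAMP = 5
    POST_REARM_FRAGILE = 6
    STABLE = 7


# Assumed minimum dwell (seconds) before each recovery state may advance.
MIN_DWELL_SEC = {K.COOLDOWN_IDLE: 30.0, K.REARM_RAMP: 60.0, K.POST_REARM_FRAGILE: 120.0}

# Assumed hysteresis band on a stress score in [0, 1]:
# re-kill only above the upper edge, advance only below the lower edge.
REKILL_ABOVE = 0.8
ADVANCE_BELOW = 0.4


def next_state(state: K, dwell_sec: float, stress: float) -> K:
    """Advance at most one step along K4 -> K5 -> K6 -> K7."""
    if state in (K.REARM_RAMP, K.POST_REARM_FRAGILE) and stress >= REKILL_ABOVE:
        return K.KILL_ACTIVATED          # genuine re-trigger
    if state not in MIN_DWELL_SEC:
        return state                     # transitions outside recovery are not modeled here
    if dwell_sec < MIN_DWELL_SEC[state] or stress > ADVANCE_BELOW:
        return state                     # hold: dwell not met, or stress inside the band
    return K(state.value + 1)
```

A stress reading between the two thresholds neither advances nor re-kills, which is what stops the controller from flapping between K5 and K2 on a noisy signal.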

5) Features That Matter Most

A) Trigger validity (true vs false positive)

trigger_context_consistency_score
multi-signal_agreement_ratio
venue_reject_spike_confirmed
latency_spike_persistence_ms
markout_adversity_confirmation

B) Flush/rearm readiness

cancel_ack_completion_ratio
orphan_child_order_probability
dropcopy_lag_ms
risk_limit_sync_lag_ms
cooldown_remaining_ms

C) Post-rearm fragility

rearm_first_60s_spread_multiple
rebuild_depth_half_life_ms
first_minute_fill_shortfall
post_rearm_markout_skew

D) Urgency coupling

residual_notional_to_expected_volume
deadline_slack_sec
alpha_decay_half_life_sec

Without explicit K4–K6 features, models often look fine in averages and fail in recovery tails.

6) Three-Stage Modeling Stack

Stage A — Trigger-validity model

Estimate (p_{TP}) at trigger time:

calibrated classifier (GBDT/logit with monotonic constraints where needed)
explicit confidence buckets to support policy gating
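
The confidence buckets could feed a gate along these lines. This is only a sketch: the bucket edges and action labels are invented for illustration, and it assumes the classifier output has already been calibrated.

```python
def gate_action(p_tp: float, hard_kill_at: float = 0.7, throttle_at: float = 0.3) -> str:
    """Map a calibrated true-positive probability to a policy bucket.

    Thresholds are hypothetical and would be tuned per desk.
    """
    if not 0.0 <= p_tp <= 1.0:
        raise ValueError("p_tp must be a probability in [0, 1]")
    if p_tp >= hard_kill_at:
        return "HARD_KILL"
    if p_tp >= throttle_at:
        return "GRADUATED_THROTTLE"
    return "MONITOR"
```

Keeping the buckets explicit makes the policy auditable: each halt can be logged with the bucket it fell into and reviewed against the post-hoc true/false label.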

Stage B — Recovery-time model

Estimate conditional durations:

(T_{flush}), (T_{cooldown}), (T_{rearm})
survival/hazard models are practical because right-censoring is common
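
To show why censoring matters, here is a small Kaplan-Meier estimator in plain Python. It is a simple stand-in for whatever survival model a desk actually uses, and the sample durations are made up. An episode is treated as right-censored when it ended (for example, the parent order was cancelled) before the rearm completed.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct completion time.

    durations: seconds spent in the recovery path per episode
    observed:  True if rearm completed at that time, False if right-censored
    """
    events = Counter(t for t, done in zip(durations, observed) if done)
    exits = Counter(durations)            # completions plus censorings at each time
    at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk     # product-limit update
            curve.append((t, surv))
        at_risk -= exits[t]               # censored episodes leave the risk set too
    return curve


# Hypothetical episodes: two of the five never reached "active".
durations = [10, 20, 20, 30, 40]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Dropping the censored episodes, or treating them as completions, would bias the estimated rearm time downward, which is exactly the tail the recovery-time model is meant to capture.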

Stage C — Conditional slippage model (post-rearm)

Quantile heads (q50/q90/q97.5) for:

IS | K5,K6 under ramp schedules
missed-completion risk under residual urgency

Unified action score:

[ \mathrm{Score}(a_t)=\mathbb{E}[IS_t(a_t)] + \lambda\,\mathrm{CVaR}_{\alpha}(IS_t(a_t)) + \gamma\,P(\text{deadline miss}\mid a_t) ]

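Here (\lambda) and (\gamma) weight tail risk and deadline-miss risk, and (\alpha) is the CVaR confidence level; lower scores are better. Penalizing both terms prevents false confidence from mean-only recovery estimates.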
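
The score can be sketched with an empirical CVaR over simulated or replayed slippage samples. Everything numeric below is an assumption for illustration: the sample sets, the weights, and the convention that (\gamma) converts a miss probability into basis points.

```python
import math


def cvar(samples, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) share of samples (higher = worse)."""
    ordered = sorted(samples)
    # round() guards against float noise such as (1 - 0.95) * 20 == 1.0000000000000009
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


def action_score(is_samples, p_deadline_miss, lam, gamma, alpha=0.95):
    """Score(a) = E[IS] + lam * CVaR_alpha(IS) + gamma * P(deadline miss); lower is better."""
    mean_is = sum(is_samples) / len(is_samples)
    return mean_is + lam * cvar(is_samples, alpha) + gamma * p_deadline_miss


# Hypothetical slippage samples (bps) for two rearm actions.
full_rearm = [1, 1, 2, 2, 2, 3, 3, 4, 5, 40]      # cheap on average, one ugly tail
staged_ramp = [4, 5, 5, 6, 6, 7, 7, 8, 9, 10]     # pricier on average, bounded tail

mean_full = sum(full_rearm) / len(full_rearm)
mean_staged = sum(staged_ramp) / len(staged_ramp)
score_full = action_score(full_rearm, p_deadline_miss=0.05, lam=0.5, gamma=20.0, alpha=0.9)
score_staged = action_score(staged_ramp, p_deadline_miss=0.10, lam=0.5, gamma=20.0, alpha=0.9)
```

In this toy data the mean alone favors the immediate full rearm, while the tail-aware score favors the staged ramp even after charging it for a higher deadline-miss probability.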

7) Control Policy by State

K1 PRE-TRIGGER STRESS

Require multi-signal confirmation before hard kill when possible
Prefer graduated throttles first if confidence is low

K2–K3 KILL/FLUSH

Enforce deterministic cancel-completion checks
Freeze strategy-level alpha assumptions until state is reconciled

K4 COOLDOWN

Run branch simulation: immediate full rearm vs staged ramp
Precompute participation ladders by residual urgency tier

K5 REARM RAMP

Step participation caps (e.g., 20% → 40% → baseline) with guard checks
Tighten stale-signal TTL and passive timeout during first ramps

K6 POST-REARM FRAGILE

Cap bursty catch-up behavior
Prefer tactics robust to refill uncertainty over nominal spread capture

K7 STABLE

Return to baseline gradually; avoid one-tick full-policy snapback
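
The K5 ramp ladder can be sketched as a one-rung-at-a-time controller. It assumes the 20% and 40% steps are fractions of the baseline participation cap, and the step-back-on-guard-failure rule is an addition for this example rather than something the playbook prescribes.

```python
# Ladder as fractions of the baseline participation cap (assumed interpretation).
LADDER = (0.20, 0.40, 1.00)


def next_participation_cap(current_step: int, guards_ok: bool, ladder=LADDER):
    """Return (new_step, cap_fraction).

    Advance one rung when guard checks pass; drop one rung when they fail.
    Never jumps more than one rung per evaluation.
    """
    if guards_ok:
        new_step = min(current_step + 1, len(ladder) - 1)
    else:
        new_step = max(current_step - 1, 0)
    return new_step, ladder[new_step]


# Walk a hypothetical sequence of guard outcomes starting from the first rung.
step = 0
history = [LADDER[step]]
for ok in (True, False, True, True):
    step, cap = next_participation_cap(step, ok)
    history.append(cap)
```

Limiting movement to one rung per evaluation is what prevents both the immediate full-aggression restart and an abrupt snapback once the state reaches K7.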

8) Diagnostics & KPIs

FTR — False Trigger Rate
RLT95 — Rearm Latency Tail p95 (trigger→active)
FCD — Flush Completion Drift (predicted vs realized)
RFS95 — Rearm Fragility Slippage p95 (first 1–5 min)
QRT — Queue Rebuild Time to baseline depth
DMP — Deadline Miss Probability after rearm

If total incident count drops but RFS95 worsens, you’re probably over-killing and under-recovering.
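
Two of these KPIs and the warning rule above can be computed from a labeled episode log roughly as follows. The record layout, field names, and sample values are hypothetical, and the nearest-rank percentile is just one of several reasonable definitions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def false_trigger_rate(was_true_positive):
    """FTR: share of halts that post-hoc review judged unnecessary."""
    return sum(1 for tp in was_true_positive if not tp) / len(was_true_positive)


def over_kill_warning(prev, curr):
    """Flag the pattern: fewer incidents, but worse post-rearm tail slippage."""
    return curr["incidents"] < prev["incidents"] and curr["rfs95_bps"] > prev["rfs95_bps"]


# Hypothetical episode log: (was_true_positive, trigger_to_active_sec)
episodes = [(True, 45.0), (False, 90.0), (False, 120.0), (True, 60.0), (False, 300.0)]
ftr = false_trigger_rate([tp for tp, _ in episodes])
rlt95 = percentile([latency for _, latency in episodes], 95)

warn = over_kill_warning(prev={"incidents": 12, "rfs95_bps": 9.0},
                         curr={"incidents": 7, "rfs95_bps": 14.0})
```

With so few episodes per period, tail percentiles are dominated by single events, so it helps to report the episode count alongside RLT95 and RFS95.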

9) Rollout Blueprint

Shadow (2–4 weeks): log trigger-validity probabilities + recovery-time forecasts
Replay: evaluate historical kill episodes with branch-aware counterfactuals
Canary: enable staged rearm policy on limited symbols/notional
Promotion gates:
- lower p95/p99 recovery slippage
- no increase in severe-loss incidents
- stable orphan/residual reconciliation metrics
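
The three promotion gates could be checked mechanically, as in this sketch. The metric keys and the reconciliation tolerance are assumed names and values, not an established schema.

```python
def promotion_gates(baseline: dict, canary: dict, recon_tolerance: float = 0.001) -> dict:
    """Evaluate each promotion gate separately so failures are easy to attribute."""
    return {
        "tail_slippage_lower": (canary["rfs95_bps"] < baseline["rfs95_bps"]
                                and canary["rfs99_bps"] < baseline["rfs99_bps"]),
        "no_more_severe_losses": (canary["severe_loss_incidents"]
                                  <= baseline["severe_loss_incidents"]),
        "reconciliation_stable": (canary["orphan_rate"] - baseline["orphan_rate"]
                                  <= recon_tolerance),
    }


# Hypothetical canary results.
baseline = {"rfs95_bps": 12.0, "rfs99_bps": 25.0, "severe_loss_incidents": 1, "orphan_rate": 0.002}
canary = {"rfs95_bps": 9.5, "rfs99_bps": 21.0, "severe_loss_incidents": 1, "orphan_rate": 0.002}

gates = promotion_gates(baseline, canary)
promote = all(gates.values())
```

Returning the per-gate results rather than a single boolean keeps the promotion decision reviewable when one gate fails.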

10) Common Anti-Patterns

Treating kill-switch as a binary risk metric, not an execution regime
Assuming cooldown cost is negligible versus protective benefit
Re-enabling full aggression immediately after first green signal
Ignoring queue reset and depth rebuild in post-rearm cost
Evaluating policy with mean IS only (no tail/deadline lenses)

11) Fast Implementation Checklist

[ ] Label K0..K7 state transitions in execution telemetry
[ ] Train trigger-validity model for true/false halt probability
[ ] Model flush/cooldown/rearm durations with censored-time methods
[ ] Add post-rearm quantile slippage + deadline-miss heads
[ ] Implement staged rearm controller with hysteresis and caps
[ ] Gate promotion on RLT95/RFS95/DMP tail improvements

TL;DR

Kill-switch events should be modeled as a branching lifecycle (trigger validity → flush/cooldown → rearm fragility), not a single protective switch. Pricing false-positive and recovery tails explicitly usually improves p95/p99 slippage and completion reliability more than tuning baseline impact curves alone.