Kill-Switch False-Positive & Rearm-Cooldown Slippage Playbook

2026-03-31 · finance

Model Safety Halts as a Branching Execution Regime (Not a Binary On/Off Event)

Why this note: Most desks treat kill-switch activation as pure risk protection and ignore the execution drag after restart. In practice, the largest hidden cost is often a false-positive halt followed by a slow rearm path, which leaves the order under-participating through a liquidity regime shift.


1) Failure Mode in One Sentence

If your slippage stack models kill-switch events as “risk avoided” without pricing rearm latency and queue reset, it will systematically understate p95/p99 implementation shortfall during stressed-but-tradeable windows.


2) Branch-Aware Cost Decomposition

Let a parent order have residual inventory \(Q_t\) when a kill-switch trigger fires.

\[ \mathbb{E}[IS_t] = p_{TP}\cdot C_{\text{true\_halt}} + (1-p_{TP})\cdot C_{\text{false\_halt}} \]

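Where \(p_{TP}\) is the probability that the trigger is a true positive, \(C_{\text{true\_halt}}\) is the expected cost when the halt was warranted, and \(C_{\text{false\_halt}}\) is the expected cost when it was not.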

A practical expansion:

\[ C_{\text{false\_halt}} = C_{\text{cancel\_flush}} + C_{\text{cooldown\_idle}} + C_{\text{rearm\_ramp}} + C_{\text{queue\_reset}} + C_{\text{deadline\_miss}} \]

\[ C_{\text{true\_halt}} = C_{\text{protective\_delay}} - B_{\text{tail\_loss\_avoided}} \]

This framing avoids the common accounting mistake: counting only losses avoided while ignoring restart friction.
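As a worked illustration of the decomposition above, here is a minimal Python sketch. It assumes every cost is expressed in basis points of parent notional; the class and function names are hypothetical and the numbers are made up for illustration, not taken from any dataset.

```python
from dataclasses import dataclass


@dataclass
class FalseHaltCosts:
    """Components of C_false_halt (assumed unit: bps of parent notional)."""
    cancel_flush: float
    cooldown_idle: float
    rearm_ramp: float
    queue_reset: float
    deadline_miss: float

    def total(self) -> float:
        return (self.cancel_flush + self.cooldown_idle + self.rearm_ramp
                + self.queue_reset + self.deadline_miss)


def true_halt_cost(protective_delay: float, tail_loss_avoided: float) -> float:
    """C_true_halt = C_protective_delay - B_tail_loss_avoided (may be negative)."""
    return protective_delay - tail_loss_avoided


def expected_is(p_tp: float, c_true: float, c_false: float) -> float:
    """E[IS_t] = p_TP * C_true_halt + (1 - p_TP) * C_false_halt."""
    if not 0.0 <= p_tp <= 1.0:
        raise ValueError("p_tp must be a probability")
    return p_tp * c_true + (1.0 - p_tp) * c_false


# Illustrative numbers only.
false_costs = FalseHaltCosts(1.0, 2.0, 1.5, 2.5, 3.0)                  # 10.0 bps
c_true = true_halt_cost(protective_delay=4.0, tail_loss_avoided=24.0)  # -20.0 bps
print(expected_is(0.25, c_true, false_costs.total()))                  # 2.5
```

With a 25% true-positive probability the halt is a net cost in expectation, even though the true-halt branch alone shows a 20 bps benefit. This is the restart friction that a "losses avoided" view leaves out.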


3) Why Rearm Is Its Own Slippage Regime

After a kill event, you usually face all of the following simultaneously:

  1. Book-position reset: all passive queue priority is gone.
  2. Participation clamps: temporary risk caps throttle child-order aggression.
  3. Warm-up guards: staggered re-enable logic delays normal routing.
  4. Signal staleness: pre-halt microstructure features are partially invalid.
  5. Operator friction: manual confirmations or staged unlocks introduce latency variance.

Treating this as “just back to normal after unpause” is a recurring source of tail misses.


4) Production State Machine (Kill/Recover Path)

Use hysteresis and a minimum dwell time on the K5→K7 progression (rearm ramp, post-rearm fragile, stable) to prevent oscillatory re-kills.
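One way to sketch the hysteresis and dwell-time rule in Python is shown below. It assumes a single scalar stress score in [0, 1]. The two thresholds, the 30-second dwell, and the controller class are hypothetical choices for illustration. Only the K5–K7 recovery path and the re-kill transition are modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    K2_KILL = 2
    K5_REARM_RAMP = 5
    K6_POST_REARM_FRAGILE = 6
    K7_STABLE = 7


# Forward progression along the recovery path.
_NEXT = {
    State.K5_REARM_RAMP: State.K6_POST_REARM_FRAGILE,
    State.K6_POST_REARM_FRAGILE: State.K7_STABLE,
}


@dataclass
class RearmController:
    rekill_above: float = 0.8    # assumed: stress at or above this forces a re-kill
    promote_below: float = 0.5   # assumed: stress must stay under this to advance
    min_dwell_s: float = 30.0    # assumed: minimum time in state and in calm
    state: State = State.K5_REARM_RAMP
    entered_at: float = 0.0
    calm_since: Optional[float] = None

    def _enter(self, state: State, now: float) -> None:
        self.state, self.entered_at, self.calm_since = state, now, None

    def step(self, now: float, stress: float) -> State:
        if self.state is State.K2_KILL:
            return self.state            # flush/cooldown is handled elsewhere
        if stress >= self.rekill_above:
            self._enter(State.K2_KILL, now)
            return self.state
        if stress >= self.promote_below:
            self.calm_since = None       # inside the band: hold, restart calm clock
            return self.state
        if self.calm_since is None:
            self.calm_since = now
        dwell_ok = now - self.entered_at >= self.min_dwell_s
        calm_ok = now - self.calm_since >= self.min_dwell_s
        if dwell_ok and calm_ok and self.state in _NEXT:
            self._enter(_NEXT[self.state], now)
        return self.state


ctl = RearmController()
for now, stress in [(0, 0.3), (20, 0.6), (40, 0.3), (70, 0.3), (80, 0.3), (110, 0.3)]:
    print(now, ctl.step(now, stress).name)
```

The gap between `promote_below` and `rekill_above` is the hysteresis band: a reading inside it neither advances nor kills, it only resets the calm clock. That keeps a noisy score near one threshold from flipping the state on every tick.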


5) Features That Matter Most

A) Trigger validity (true vs false positive)

B) Flush/rearm readiness

C) Post-rearm fragility

D) Urgency coupling

Without explicit features for K4–K6 (cooldown through post-rearm fragile), models often look fine on average and fail in the recovery tails.


6) Three-Stage Modeling Stack

Stage A — Trigger-validity model

Estimate \(p_{TP}\) at trigger time:

Stage B — Recovery-time model

Estimate conditional durations:

Stage C — Conditional slippage model (post-rearm)

Quantile heads (q50/q90/q97.5) for:

Unified action score:

\[ \mathrm{Score}(a_t)=\mathbb{E}[IS_t(a_t)] + \lambda\,\mathrm{CVaR}_{\alpha}(IS_t(a_t)) + \gamma\,P(\text{deadline miss}\mid a_t) \]

The CVaR and deadline-miss terms keep a mean-only recovery estimate from giving false confidence.
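The action score can be approximated from simulated or replayed shortfall samples. The sketch below assumes an empirical CVaR, taken as the mean of the worst (1 - α) fraction of samples, with higher shortfall being worse. It also assumes that γ converts a miss probability into the same units as the shortfall. The function names and sample values are hypothetical.

```python
import math
from statistics import fmean
from typing import Sequence


def cvar(samples: Sequence[float], alpha: float) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) share of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(samples)
    # round() guards against float noise such as (1 - 0.95) * 20 == 1.0000000000000009
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return fmean(ordered[-k:])


def action_score(samples: Sequence[float], p_deadline_miss: float,
                 lam: float, gamma: float, alpha: float) -> float:
    """Score(a) = E[IS] + lam * CVaR_alpha(IS) + gamma * P(deadline miss | a)."""
    return fmean(samples) + lam * cvar(samples, alpha) + gamma * p_deadline_miss


# Illustrative shortfall samples (bps) for two hypothetical rearm actions.
fast_rearm = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
slow_rearm = [3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 6.0, 7.0, 7.0, 8.0]

scores = {
    "fast_rearm": action_score(fast_rearm, 0.02, lam=0.5, gamma=10.0, alpha=0.8),
    "slow_rearm": action_score(slow_rearm, 0.10, lam=0.5, gamma=10.0, alpha=0.8),
}
print(min(scores, key=scores.get), scores)
```

In this toy data the fast action has a fat tail and the slow action has a higher miss probability. The score trades the two off explicitly instead of ranking on the mean alone.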


7) Control Policy by State

K1 PRE-TRIGGER STRESS

K2–K3 KILL/FLUSH

K4 COOLDOWN

K5 REARM RAMP

K6 POST-REARM FRAGILE

K7 STABLE


8) Diagnostics & KPIs

  1. FTR — False Trigger Rate
  2. RLT95 — Rearm Latency Tail p95 (trigger→active)
  3. FCD — Flush Completion Drift (predicted vs realized)
  4. RFS95 — Rearm Fragility Slippage p95 (first 1–5 min)
  5. QRT — Queue Rebuild Time to baseline depth
  6. DMP — Deadline Miss Probability after rearm

If total incident count drops but RFS95 worsens, you’re probably over-killing and under-recovering.
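Three of the KPIs above can be computed directly from labeled kill episodes. The sketch below assumes each episode already carries a post-hoc true/false-positive label, a trigger→active latency in seconds, and a first-window post-rearm slippage in bps. The record layout is hypothetical, and the percentile uses the nearest-rank convention, which is one choice among several.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class KillEpisode:
    true_positive: bool              # post-hoc label (assumed available)
    rearm_latency_s: float           # trigger -> active
    post_rearm_slippage_bps: float   # first 1-5 min after rearm


def nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def kill_kpis(episodes: Sequence[KillEpisode]) -> Dict[str, float]:
    false_count = sum(1 for e in episodes if not e.true_positive)
    return {
        "FTR": false_count / len(episodes),
        "RLT95": nearest_rank([e.rearm_latency_s for e in episodes], 95),
        "RFS95": nearest_rank([e.post_rearm_slippage_bps for e in episodes], 95),
    }


# Illustrative episodes only.
episodes = [
    KillEpisode(True, 45.0, 3.0),
    KillEpisode(False, 120.0, 7.5),
    KillEpisode(False, 60.0, 4.0),
    KillEpisode(True, 30.0, 2.0),
]
print(kill_kpis(episodes))
```

Tracking FTR alongside RFS95 makes the over-kill pattern visible: a policy can cut the number of incidents while the slippage tail after each rearm gets worse.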


9) Rollout Blueprint

  1. Shadow (2–4 weeks): log trigger-validity probabilities + recovery-time forecasts
  2. Replay: evaluate historical kill episodes with branch-aware counterfactuals
  3. Canary: enable staged rearm policy on limited symbols/notional
  4. Promotion gates:
    • lower p95/p99 recovery slippage
    • no increase in severe-loss incidents
    • stable orphan/residual reconciliation metrics
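The promotion gates can be encoded as an explicit check so that a canary result is either promotable or not. This is a minimal sketch: the metric fields, the requirement that both p95 and p99 improve, and the zero tolerance on reconciliation breaks are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class GateMetrics:
    recovery_slippage_p95_bps: float
    recovery_slippage_p99_bps: float
    severe_loss_incidents: int
    reconciliation_breaks: int   # orphan/residual mismatches (assumed count metric)


def promotion_gates(baseline: GateMetrics, canary: GateMetrics,
                    recon_tolerance: int = 0) -> Dict[str, bool]:
    """Evaluate each gate separately so a failed promotion is explainable."""
    return {
        "tail_slippage_lower": (
            canary.recovery_slippage_p95_bps < baseline.recovery_slippage_p95_bps
            and canary.recovery_slippage_p99_bps < baseline.recovery_slippage_p99_bps
        ),
        "no_more_severe_losses": (
            canary.severe_loss_incidents <= baseline.severe_loss_incidents
        ),
        "reconciliation_stable": (
            canary.reconciliation_breaks
            <= baseline.reconciliation_breaks + recon_tolerance
        ),
    }


def should_promote(baseline: GateMetrics, canary: GateMetrics) -> bool:
    return all(promotion_gates(baseline, canary).values())


# Illustrative numbers only.
baseline = GateMetrics(12.0, 25.0, 2, 1)
canary = GateMetrics(9.5, 19.0, 2, 1)
print(promotion_gates(baseline, canary), should_promote(baseline, canary))
```

Returning the per-gate results rather than a single boolean makes it easier to see which condition blocked a promotion.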

10) Common Anti-Patterns


11) Fast Implementation Checklist

[ ] Label K0..K7 state transitions in execution telemetry
[ ] Train trigger-validity model for true/false halt probability
[ ] Model flush/cooldown/rearm durations with censored-time methods
[ ] Add post-rearm quantile slippage + deadline-miss heads
[ ] Implement staged rearm controller with hysteresis and caps
[ ] Gate promotion on RLT95/RFS95/DMP tail improvements
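For the censored-time item in the checklist, one standard option is a Kaplan-Meier estimate of the rearm-duration survival curve. The sketch below assumes an episode is censored when the session or parent order ended before rearm completed, so only a lower bound on its duration is known. The function name and data are hypothetical, and this is only one of several censored-time methods that could fit here.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float],
                 observed: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (t, S(t)) at each distinct time where a rearm was observed.

    observed[i] is False when episode i was censored at durations[i].
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have equal length")
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        leaving = 0
        # Group all episodes that end (rearm or censoring) at the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            events += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Illustrative rearm durations in seconds; False marks a censored episode.
durations = [10.0, 20.0, 20.0, 30.0, 40.0]
observed = [True, True, False, True, False]
for t, s in kaplan_meier(durations, observed):
    print(f"P(still not rearmed after {t:.0f}s) ~ {s:.2f}")
```

Dropping the censored episodes, or treating them as completed, would bias the duration estimate downward. That bias falls on the tail that RLT95 is meant to capture.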


TL;DR

Kill-switch events should be modeled as a branching lifecycle (trigger validity → flush/cooldown → rearm fragility), not a single protective switch. Pricing false-positive and recovery tails explicitly usually improves p95/p99 slippage and completion reliability more than tuning baseline impact curves alone.