Venue Health-Probe Flapping & Quarantine-Churn Slippage Playbook
Model Router Health as a Noisy State Process, Not a Binary Up/Down Switch
Why this note: Multi-venue routers often use heartbeat/probe logic to mark venues healthy or unhealthy. When probe thresholds are too twitchy, routers repeatedly quarantine and re-enable venues. That churn quietly destroys queue position and inflates tail slippage even when no venue is truly down.
1) Failure Mode in One Sentence
If venue health is modeled as a binary flag without false-positive/false-negative costs, probe flapping can become a hidden slippage engine via repeated reroute, queue-reset, and re-entry toxicity.
2) Branch-Aware Cost Decomposition
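For a child-order decision at time \(t\), define the router's health belief over a venue, with the three probabilities summing to one:
- \(P(U_t)\): truly healthy (Up)
- \(P(D_t)\): degraded but executable
- \(P(O_t)\): effectively unavailable
Expected execution cost (implementation shortfall, IS) for action \(a_t\), where \(C_U\), \(C_D\), \(C_O\) are the costs conditional on each state:
\[ \mathbb{E}[IS_t(a_t)] = P(U_t)\,C_U(a_t) + P(D_t)\,C_D(a_t) + P(O_t)\,C_O(a_t) \]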
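As a minimal sketch of the belief-weighted cost above, the snippet below weights three state-conditional costs by the health belief. The class name `HealthBelief`, the function `expected_is_bps`, and the basis-point figures are illustrative assumptions, not values from a real router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthBelief:
    """Router belief over one venue: P(U_t), P(D_t), P(O_t). Hypothetical container."""
    p_up: float
    p_degraded: float
    p_out: float

    def __post_init__(self) -> None:
        # The three states are treated as exhaustive, so the belief must sum to 1.
        total = self.p_up + self.p_degraded + self.p_out
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"belief must sum to 1, got {total}")


def expected_is_bps(belief: HealthBelief, cost_up: float,
                    cost_degraded: float, cost_out: float) -> float:
    """Belief-weighted expected implementation shortfall for one action, in bps."""
    return (belief.p_up * cost_up
            + belief.p_degraded * cost_degraded
            + belief.p_out * cost_out)


# Illustrative numbers only: a mostly healthy venue with some degradation risk.
belief = HealthBelief(p_up=0.7, p_degraded=0.2, p_out=0.1)
cost = expected_is_bps(belief, cost_up=2.0, cost_degraded=6.0, cost_out=25.0)
```

With these assumed inputs the small probability of the unavailable state contributes as much expected cost as the healthy state, which is the reason a binary flag loses information.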
When the health classifier is noisy, realized cost includes misclassification branches:
\[ \mathbb{E}[IS_t] = C_{base} + C_{FP\_quarantine} + C_{FN\_stale\_route} + C_{flip\_churn} \]
Where:
- \(C_{FP\_quarantine}\): the venue was actually tradable, but flow was withdrawn
- \(C_{FN\_stale\_route}\): the venue was degraded, but the router kept sending into the bad path
- \(C_{flip\_churn}\): repeated quarantine/re-enable loops and the queue-reset tax
A practical expansion:
\[ C_{flip\_churn} = C_{cancel\_flush} + C_{reroute\_delay} + C_{queue\_reset} + C_{reentry\_markout} + C_{deadline\_catchup} \]
3) Why Probe Flapping Is Expensive Even Without Outages
Repeated health flips create a specific execution pathology:
- Passive queue attrition: every false quarantine discards earned queue rank.
- Fallback crowding: rerouted flow saturates secondary venues.
- Re-entry toxicity: venue looks healthy again, but first re-entry window is often adverse.
- Policy oscillation: aggression and venue weights ping-pong faster than market microstructure can stabilize.
- Deadline convexity: each flip consumes slack, forcing costly late urgency.
This is why desks can report "no major outages" yet still see rising p95/p99 slippage.
4) Health-State Machine (Router-Side)
- H0 STABLE_UP: normal scoring and routing
- H1 PROBE_WARNING: probe/ack anomalies rising, confidence falling
- H2 SOFT_QUARANTINE: reduced weight + strict caps (not full exclusion)
- H3 HARD_QUARANTINE: venue excluded except emergency lanes
- H4 REENTRY_CANDIDATE: recovery evidence accumulating
- H5 REENTRY_WARMUP: staged reactivation with tight risk rails
- H6 RECOVERED_MONITORED: normal-ish, but with elevated monitoring
Use dwell-time + hysteresis gates for H2->H3 and H4->H5->H6 to avoid flip storms.
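One way to combine a dwell-time gate with hysteresis is sketched below for the escalation from soft to hard quarantine. The class `HysteresisGate`, the "degradation score" input, and the threshold values are assumptions for illustration; a real controller would wire in its own health-belief output.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    """Two-threshold gate with a minimum dwell time between state changes.

    escalated=False stands for the softer state (e.g. H2), True for the
    harder one (e.g. H3). All names and thresholds here are hypothetical.
    """
    enter_threshold: float   # escalate when score >= this
    exit_threshold: float    # de-escalate only when score <= this (set lower than enter)
    min_dwell_s: float       # minimum seconds between two state changes
    escalated: bool = False
    last_change_ts: float = float("-inf")

    def update(self, ts: float, score: float) -> bool:
        # Dwell gate: ignore evidence until the current state has aged enough.
        if ts - self.last_change_ts < self.min_dwell_s:
            return self.escalated
        # Hysteresis: the band between exit and enter thresholds changes nothing.
        if not self.escalated and score >= self.enter_threshold:
            self.escalated = True
            self.last_change_ts = ts
        elif self.escalated and score <= self.exit_threshold:
            self.escalated = False
            self.last_change_ts = ts
        return self.escalated


gate = HysteresisGate(enter_threshold=0.8, exit_threshold=0.4, min_dwell_s=30.0)
observations = [(0, 0.9), (5, 0.3), (20, 0.5), (40, 0.5), (45, 0.3), (50, 0.95), (80, 0.95)]
states = [gate.update(ts, score) for ts, score in observations]
```

In this toy sequence the dip at t=5 and the spike at t=50 are both ignored because they arrive inside the dwell window, so they cause no state change.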
5) Features That Matter Most
A) Probe-path integrity
- probe_timeout_rate_1m
- probe_jitter_p95_ms
- probe_vs_order_ack_divergence
- heartbeat_loss_burst_count
B) Execution-path confirmation
- order_ack_tail_p95_ms
- reject_rate_spike_zscore
- cancel_ack_completion_ratio
- dropcopy_alignment_delay_ms
C) Flap/churn risk
- health_flip_count_5m
- quarantine_dwell_instability
- reentry_first_60s_markout
- queue_rebuild_half_life_ms
D) Urgency coupling
- deadline_slack_sec
- residual_notional_over_reliable_depth
- participation_gap_to_schedule
Without explicit churn features, models often misattribute losses to "market volatility."
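As an example of an explicit churn feature, health_flip_count_5m can be derived from a sorted list of health-transition timestamps. The function name, the input format (epoch seconds), and the half-open window convention are assumptions made for this sketch.

```python
from bisect import bisect_right
from typing import Sequence


def health_flip_count(flip_ts_sorted: Sequence[float], now_s: float,
                      window_s: float = 300.0) -> int:
    """Count health transitions with now_s - window_s < ts <= now_s.

    flip_ts_sorted must be in ascending order (seconds). The default
    300-second window corresponds to the 5m suffix in the feature name.
    """
    lo = bisect_right(flip_ts_sorted, now_s - window_s)  # first index strictly inside window
    hi = bisect_right(flip_ts_sorted, now_s)             # one past the last ts <= now_s
    return hi - lo


# Illustrative transition log for one venue, in seconds since session start.
flips = [10.0, 100.0, 250.0, 400.0, 410.0]
count_at_420 = health_flip_count(flips, now_s=420.0)
count_at_310 = health_flip_count(flips, now_s=310.0)
```

Binary search keeps the lookup logarithmic in the log length, which matters if the feature is recomputed on every child-order decision.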
6) Modeling Stack
Stage A: Health-belief model
Estimate \(P(U_t), P(D_t), P(O_t)\) from probe + execution features:
- calibrated multiclass model (or ordinal model)
- reliability layer to downweight probe-only evidence when execution-path contradicts it
Stage B: Flip-hazard model
Estimate short-horizon probability of another health transition:
\[ P(\text{flip in }\Delta t \mid x_t) \]
Useful for deciding whether to quarantine now or hold with tighter caps.
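A rough sketch of how a flip probability could feed that decision follows. The cost model is an assumption made for illustration: a hard quarantine that is likely to be reversed soon is charged an extra churn tax, weighted by the flip probability, and compared with holding the venue under tighter caps. The function name and all numbers are hypothetical.

```python
def choose_quarantine_action(p_flip: float,
                             cost_hold_capped_bps: float,
                             cost_quarantine_bps: float,
                             churn_tax_bps: float) -> str:
    """Pick the cheaper of two actions under an assumed, simplified cost model.

    Assumption: quarantining now costs cost_quarantine_bps, plus churn_tax_bps
    if the venue state flips again within the horizon (probability p_flip).
    """
    if not 0.0 <= p_flip <= 1.0:
        raise ValueError("p_flip must be a probability")
    expected_quarantine = cost_quarantine_bps + p_flip * churn_tax_bps
    if expected_quarantine < cost_hold_capped_bps:
        return "HARD_QUARANTINE"
    return "HOLD_WITH_CAPS"


# High flip hazard: the quarantine would probably be undone, so holding wins.
noisy = choose_quarantine_action(p_flip=0.6, cost_hold_capped_bps=5.0,
                                 cost_quarantine_bps=3.0, churn_tax_bps=4.0)
# Low flip hazard: the degradation looks persistent, so quarantining wins.
persistent = choose_quarantine_action(p_flip=0.1, cost_hold_capped_bps=5.0,
                                      cost_quarantine_bps=3.0, churn_tax_bps=4.0)
```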
Stage C: Cost heads by state
Quantile forecasts (q50/q90/q97.5) for each action under current health state:
- cost if staying,
- cost if soft/hard quarantining,
- cost if staged re-entry.
Unified action score:
\[ Score(a_t) = \mathbb{E}[IS_t(a_t)] + \lambda\,CVaR_\alpha(IS_t(a_t)) + \gamma\,P(\text{deadline miss} \mid a_t) \]
Pricing tail cost and deadline risk alongside the mean keeps the controller from overreacting to noisy probes, which mean-only logic tends to do.
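The score can be sketched with an empirical CVaR over sampled costs. Lower scores are better here because every term is a cost. The sample-based CVaR estimator, the function names, and the values of lambda, gamma, and alpha are assumptions for illustration; a production system would more likely read the tail from its quantile heads.

```python
import math
from typing import Sequence


def empirical_cvar(cost_samples: Sequence[float], alpha: float) -> float:
    """Mean of the worst (1 - alpha) fraction of cost samples (higher cost = worse)."""
    if not cost_samples:
        raise ValueError("need at least one sample")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(cost_samples, reverse=True)
    # round() guards against float noise such as (1 - 0.95) * 20 = 1.0000000000000009
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


def action_score(cost_samples_bps: Sequence[float], p_deadline_miss: float,
                 lam: float, gamma: float, alpha: float) -> float:
    """Mean cost + lam * CVaR_alpha + gamma * P(deadline miss)."""
    mean_cost = sum(cost_samples_bps) / len(cost_samples_bps)
    return mean_cost + lam * empirical_cvar(cost_samples_bps, alpha) + gamma * p_deadline_miss


# Illustrative cost samples (bps) for two candidate actions.
candidates = {
    "stay": ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], 0.10),
    "soft_quarantine": ([3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 6.0, 6.0], 0.15),
}
scores = {
    name: action_score(samples, p_miss, lam=0.5, gamma=10.0, alpha=0.8)
    for name, (samples, p_miss) in candidates.items()
}
best_action = min(scores, key=scores.get)
```

In this toy comparison the soft quarantine has a higher deadline-miss probability but a much thinner cost tail, so the tail term decides the outcome.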
7) Controller Policy by State
H1 PROBE_WARNING
- require cross-path confirmation (probe + ACK/reject evidence)
- reduce passive TTL and cap max child size before hard quarantine
H2 SOFT_QUARANTINE
- downweight venue rather than full exclusion
- enforce bounded participation and tighter retry budget
H3 HARD_QUARANTINE
- deterministic cancel/flush checks
- preserve emergency completion lane with strict toxicity guard
H4 REENTRY_CANDIDATE
- require minimum healthy dwell window
- check first-pass reliability KPIs before enabling normal flow
H5 REENTRY_WARMUP
- staged weights (e.g., 10% -> 25% -> 50% -> baseline)
- short passive timeout, strict markout stop rules
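The staged reactivation in H5 could look like the sketch below, which uses the 10% / 25% / 50% / baseline ladder from the text. The reset-to-first-stage behavior on a markout breach, the sign convention (positive markout means adverse), and the function names are assumptions; a desk might instead demote the venue back to H4.

```python
# Fractions of the venue's baseline routing weight, taken from the staged ladder above.
WARMUP_STAGES = (0.10, 0.25, 0.50, 1.00)


def next_warmup_stage(stage_idx: int, adverse_markout_bps: float,
                      markout_stop_bps: float) -> int:
    """Advance one stage if markout is within the stop limit, else restart the ladder.

    Assumed convention: adverse_markout_bps > 0 means fills moved against us.
    """
    if not 0 <= stage_idx < len(WARMUP_STAGES):
        raise ValueError("stage_idx out of range")
    if adverse_markout_bps > markout_stop_bps:
        return 0  # strict markout stop: fall back to the smallest weight
    return min(stage_idx + 1, len(WARMUP_STAGES) - 1)


def warmup_weight(baseline_weight: float, stage_idx: int) -> float:
    """Routing weight for the venue at the given warmup stage."""
    return baseline_weight * WARMUP_STAGES[stage_idx]


stage = 0
stage = next_warmup_stage(stage, adverse_markout_bps=0.5, markout_stop_bps=2.0)  # advances
weight_now = warmup_weight(baseline_weight=0.2, stage_idx=stage)
```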
H6 RECOVERED_MONITORED
- keep temporary flip guardrails for one monitoring horizon
- remove protections gradually, not instantly
8) Diagnostics & KPIs
- HFR: Health Flip Rate (transitions/hour)
- QCT: Quarantine Churn Tax (bps cost vs matched stable periods)
- FPR-Q: False-Positive Quarantine Rate
- RMT95: Re-entry Markout Tail p95 (first 60s/180s)
- QRT: Queue Rebuild Time after re-entry
- SDL: Slack Depletion Loss (cost increase per second of lost deadline slack)
If outages are flat but HFR and QCT rise, probe policy is probably too twitchy.
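Three of these KPIs reduce to simple ratios and differences, sketched below. The input shapes are assumptions: a post-hoc boolean label per quarantine event saying whether the venue was in fact tradable, and slippage samples already matched between churn and stable periods. The matching itself is out of scope here.

```python
from statistics import mean
from typing import Sequence


def health_flip_rate(transition_count: int, observed_hours: float) -> float:
    """HFR: health-state transitions per hour."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    return transition_count / observed_hours


def false_positive_quarantine_rate(was_tradable: Sequence[bool]) -> float:
    """FPR-Q: share of quarantine events where the venue was actually tradable.

    was_tradable is a hypothetical post-hoc label, one entry per quarantine event.
    """
    if not was_tradable:
        raise ValueError("need at least one quarantine event")
    return sum(was_tradable) / len(was_tradable)


def quarantine_churn_tax_bps(churn_slippage_bps: Sequence[float],
                             stable_slippage_bps: Sequence[float]) -> float:
    """QCT: mean slippage in churn periods minus mean slippage in matched stable periods."""
    return mean(churn_slippage_bps) - mean(stable_slippage_bps)


# Illustrative session: 18 transitions in 6 hours, 4 labeled quarantine events.
hfr = health_flip_rate(18, 6.0)
fpr_q = false_positive_quarantine_rate([True, False, True, True])
qct = quarantine_churn_tax_bps([6.0, 8.0], [3.0, 5.0])
```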
9) Rollout Blueprint
- Shadow (2-3 weeks): log health-belief probabilities and hypothetical state decisions
- Backtest + replay: compare binary health policy vs probabilistic controller
- Canary: enable only soft-quarantine logic on limited symbols/notional
- Promotion gates:
  - lower QCT and RMT95,
  - no increase in severe completion misses,
  - stable FPR-Q under stressed sessions
10) Common Anti-Patterns
- Binary up/down health flags with no uncertainty score
- Probe-only decisions that ignore execution-path reality
- Instant hard-quarantine on one timeout burst
- Instant full re-entry after first green probe
- No dwell-time/hysteresis controls
- Evaluating policy on mean IS only (ignoring tail/deadline risk)
11) Fast Implementation Checklist
[ ] Label H0..H6 transitions in router telemetry
[ ] Build calibrated health-belief model (U/D/O probabilities)
[ ] Add short-horizon flip-hazard prediction
[ ] Train quantile slippage heads for quarantine/re-entry actions
[ ] Implement dwell-time + hysteresis guards in controller
[ ] Gate rollout on QCT/RMT95/FPR-Q improvements
References
- Cartea, Á., Jaimungal, S., Penalva, J. (2015), Algorithmic and High-Frequency Trading.
- Kissell, R. (2014), The Science of Algorithmic Trading and Portfolio Management.
- SEC Rule 15c3-5 (Market Access Rule), automated pre-trade risk control context.
- ESMA MiFID II RTS 6, algorithmic controls and resilience expectations.
TL;DR
Venue-health flapping is not just an infrastructure nuisance; it is a measurable slippage regime. Model health as probabilities, price false quarantine and re-entry tails explicitly, and control flips with hysteresis + staged reactivation to protect p95/p99 execution cost and completion reliability.