Venue Health-Probe Flapping & Quarantine-Churn Slippage Playbook

Model Router Health as a Noisy State Process, Not a Binary Up/Down Switch

Why this note: Multi-venue routers often use heartbeat/probe logic to mark venues healthy or unhealthy. When probe thresholds are too twitchy, routers repeatedly quarantine and re-enable venues. That churn quietly destroys queue position and inflates tail slippage even when no venue is truly down.

1) Failure Mode in One Sentence

If venue health is modeled as a binary flag without false-positive/false-negative costs, probe flapping can become a hidden slippage engine via repeated reroute, queue-reset, and re-entry toxicity.

2) Branch-Aware Cost Decomposition

For a child-order decision at time \(t\), define router health belief over a venue:

\(P(U_t)\): truly healthy (Up)
\(P(D_t)\): degraded but executable
\(P(O_t)\): effectively unavailable

Expected execution cost for action \(a_t\):

\[ \mathbb{E}[IS_t(a_t)] = P(U_t)\,C_U(a_t) + P(D_t)\,C_D(a_t) + P(O_t)\,C_O(a_t) \]
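
As a minimal worked example of the belief-weighted cost above, the sketch below evaluates two candidate actions under one belief vector. The names `HealthBelief` and `expected_is_bps` and every numeric cost are illustrative assumptions, not values from a real router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthBelief:
    p_up: float        # P(U_t)
    p_degraded: float  # P(D_t)
    p_out: float       # P(O_t)

    def __post_init__(self):
        total = self.p_up + self.p_degraded + self.p_out
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"belief must sum to 1, got {total}")


def expected_is_bps(belief, cost_up, cost_degraded, cost_out):
    """Belief-weighted expected implementation shortfall for one action, in bps."""
    return (belief.p_up * cost_up
            + belief.p_degraded * cost_degraded
            + belief.p_out * cost_out)


belief = HealthBelief(p_up=0.80, p_degraded=0.15, p_out=0.05)

# Hypothetical per-state costs (bps) for two candidate actions.
stay = expected_is_bps(belief, cost_up=1.0, cost_degraded=4.0, cost_out=20.0)
quarantine = expected_is_bps(belief, cost_up=3.0, cost_degraded=3.0, cost_out=3.0)
```

With these made-up numbers, staying costs about 2.4 bps in expectation against about 3.0 bps for quarantining, so a small outage probability alone does not justify withdrawing flow.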

When the health classifier is noisy, realized cost includes misclassification branches:

\[ \mathbb{E}[IS_t] = C_{\text{base}} + C_{\text{FP\_quarantine}} + C_{\text{FN\_stale\_route}} + C_{\text{flip\_churn}} \]

Where:

\(C_{\text{FP\_quarantine}}\): venue was actually tradable, but flow was withdrawn
\(C_{\text{FN\_stale\_route}}\): venue was degraded, but router kept sending into bad path
\(C_{\text{flip\_churn}}\): repeated quarantine/re-enable loops and queue reset tax

A practical expansion:

\[ C_{\text{flip\_churn}} = C_{\text{cancel\_flush}} + C_{\text{reroute\_delay}} + C_{\text{queue\_reset}} + C_{\text{reentry\_markout}} + C_{\text{deadline\_catchup}} \]
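
The additive decomposition can be kept as a small attribution record so that each component is reported separately instead of being folded into one slippage number. This is only a sketch: `FlipChurnCost`, `total_is_bps`, and the bps figures are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FlipChurnCost:
    """Churn components in bps; field names mirror the expansion above."""
    cancel_flush: float
    reroute_delay: float
    queue_reset: float
    reentry_markout: float
    deadline_catchup: float

    def total(self):
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant(self):
        """Name of the largest component, useful for attribution reports."""
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


def total_is_bps(base, fp_quarantine, fn_stale_route, flip_churn):
    """Sum of the base cost and the three misclassification branches."""
    return base + fp_quarantine + fn_stale_route + flip_churn


# Illustrative numbers only.
churn = FlipChurnCost(cancel_flush=0.2, reroute_delay=0.3, queue_reset=0.8,
                      reentry_markout=0.5, deadline_catchup=0.4)
total = total_is_bps(base=2.0, fp_quarantine=0.6, fn_stale_route=0.3,
                     flip_churn=churn.total())
```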

3) Why Probe Flapping Is Expensive Even Without Outages

Repeated health flips create a specific execution pathology:

Passive queue attrition: every false quarantine discards earned queue rank.
Fallback crowding: rerouted flow saturates secondary venues.
Re-entry toxicity: venue looks healthy again, but first re-entry window is often adverse.
Policy oscillation: aggression and venue weights ping-pong faster than market microstructure can stabilize.
Deadline convexity: each flip consumes slack, forcing costly late urgency.

This is why desks can report "no major outages" yet still see rising p95/p99 slippage.

4) Health-State Machine (Router-Side)

H0 STABLE_UP — normal scoring and routing
H1 PROBE_WARNING — probe/ack anomalies rising, confidence falling
H2 SOFT_QUARANTINE — reduced weight + strict caps (not full exclusion)
H3 HARD_QUARANTINE — venue excluded except emergency lanes
H4 REENTRY_CANDIDATE — recovery evidence accumulating
H5 REENTRY_WARMUP — staged reactivation with tight risk rails
H6 RECOVERED_MONITORED — normal-ish, but with elevated monitoring

Use dwell-time + hysteresis gates for H2↔H3 and H4↔H5↔H6 to avoid flip storms.
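
A dwell-time plus hysteresis gate for the H2↔H3 pair could be sketched as follows. The thresholds (0.6 to escalate, 0.3 to relax), the 30-second dwell, and the class name `QuarantineGate` are assumptions chosen for illustration; the input is the router's current estimate of P(O_t).

```python
from enum import Enum


class Health(Enum):
    STABLE_UP = 0
    PROBE_WARNING = 1
    SOFT_QUARANTINE = 2
    HARD_QUARANTINE = 3
    REENTRY_CANDIDATE = 4
    REENTRY_WARMUP = 5
    RECOVERED_MONITORED = 6


class QuarantineGate:
    """Two-threshold gate with a minimum dwell time between H2 and H3."""

    def __init__(self, escalate_above=0.6, relax_below=0.3, min_dwell_s=30.0):
        self.escalate_above = escalate_above
        self.relax_below = relax_below
        self.min_dwell_s = min_dwell_s
        self.state = Health.SOFT_QUARANTINE
        self.entered_at_s = 0.0

    def update(self, now_s, p_unavailable):
        # Dwell gate: no transition until the current state has aged enough.
        if now_s - self.entered_at_s < self.min_dwell_s:
            return self.state
        # Hysteresis: escalating and relaxing use different thresholds.
        if self.state is Health.SOFT_QUARANTINE and p_unavailable > self.escalate_above:
            self.state, self.entered_at_s = Health.HARD_QUARANTINE, now_s
        elif self.state is Health.HARD_QUARANTINE and p_unavailable < self.relax_below:
            self.state, self.entered_at_s = Health.SOFT_QUARANTINE, now_s
        return self.state


def count_flips(states):
    return sum(1 for a, b in zip(states, states[1:]) if a is not b)


# One P(O_t) sample per second, hovering around a naive 0.5 cutoff.
signal = [0.45, 0.55] * 30
naive = [Health.HARD_QUARANTINE if p > 0.5 else Health.SOFT_QUARANTINE
         for p in signal]
gate = QuarantineGate()
gated = [gate.update(float(t), p) for t, p in enumerate(signal)]
```

On this synthetic signal the single-threshold rule changes state on every sample, while the gated version never leaves H2 because the estimate never clears the escalation threshold.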

5) Features That Matter Most

A) Probe-path integrity

probe_timeout_rate_1m
probe_jitter_p95_ms
probe_vs_order_ack_divergence
heartbeat_loss_burst_count

B) Execution-path confirmation

order_ack_tail_p95_ms
reject_rate_spike_zscore
cancel_ack_completion_ratio
dropcopy_alignment_delay_ms

C) Flap/churn risk

health_flip_count_5m
quarantine_dwell_instability
reentry_first_60s_markout
queue_rebuild_half_life_ms

D) Urgency coupling

deadline_slack_sec
residual_notional_over_reliable_depth
participation_gap_to_schedule

Without explicit churn features, models often misattribute losses to "market volatility."
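
As one example of a churn feature, `health_flip_count_5m` can be maintained as a rolling count over transition timestamps. The sketch below assumes timestamps in seconds arriving in non-decreasing order; the class name `FlipCounter` is hypothetical.

```python
from collections import deque


class FlipCounter:
    """Rolling count of health-state transitions over a fixed window."""

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self._flips = deque()

    def record_flip(self, ts_s):
        self._flips.append(ts_s)

    def count(self, now_s):
        # Evict transitions that have aged out of the window (now - window, now].
        cutoff = now_s - self.window_s
        while self._flips and self._flips[0] <= cutoff:
            self._flips.popleft()
        return len(self._flips)


counter = FlipCounter()
for ts in (0.0, 100.0, 250.0):
    counter.record_flip(ts)
count_at_299 = counter.count(299.0)   # all three flips still inside the window
counter.record_flip(400.0)
count_at_400 = counter.count(400.0)   # flips at 0 s and 100 s have aged out
```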

6) Modeling Stack

Stage A — Health-belief model

Estimate \(P(U_t), P(D_t), P(O_t)\) from probe + execution features:

calibrated multiclass model (or ordinal model)
reliability layer to downweight probe-only evidence when execution-path evidence contradicts it
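
One simple way to realize the reliability layer is to blend a probe-based belief with an execution-based belief and shrink the probe weight as the two disagree. This is an assumed heuristic for illustration, not a prescribed method; `blend_beliefs`, the total-variation divergence, and the base weight of 0.5 are all hypothetical choices.

```python
def total_variation(p, q):
    """Total-variation distance between two discrete distributions, in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def blend_beliefs(probe_belief, exec_belief, base_probe_weight=0.5):
    """Blend two (P(U), P(D), P(O)) tuples.

    The probe weight shrinks linearly as the probe belief diverges from the
    execution-path belief, so contradicted probe evidence counts for less.
    """
    divergence = total_variation(probe_belief, exec_belief)
    w_probe = base_probe_weight * (1.0 - divergence)
    return tuple(w_probe * p + (1.0 - w_probe) * e
                 for p, e in zip(probe_belief, exec_belief))


# Probes suggest the venue is out; order ACKs suggest it is healthy.
probe = (0.20, 0.20, 0.60)
execution = (0.90, 0.08, 0.02)
blended = blend_beliefs(probe, execution)
```

Because the result is a convex combination of two probability vectors, it still sums to one; here the blended P(U_t) stays close to the execution-path view.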

Stage B — Flip-hazard model

Estimate short-horizon probability of another health transition:

\[ P(\text{flip in } \Delta t \mid x_t) \]

Useful for deciding whether to quarantine now or hold with tighter caps.
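
Before a fitted model exists, a constant-hazard baseline gives a rough stand-in for the flip probability. The sketch assumes flips arrive as a Poisson process at the recently observed rate, which ignores the features \(x_t\) entirely; `flip_probability` is a hypothetical helper.

```python
import math


def flip_probability(recent_flips, window_s, horizon_s):
    """P(at least one flip within horizon_s) under a constant hazard rate.

    The rate is estimated as recent_flips / window_s, so the result is
    1 - exp(-rate * horizon_s).
    """
    if window_s <= 0 or horizon_s < 0:
        raise ValueError("window_s must be positive and horizon_s non-negative")
    rate_per_s = recent_flips / window_s
    return 1.0 - math.exp(-rate_per_s * horizon_s)


# Six flips in the last five minutes, looking 30 seconds ahead.
p_flip_30s = flip_probability(recent_flips=6, window_s=300.0, horizon_s=30.0)
```

A learned hazard model conditioned on probe and execution features would replace this baseline; the baseline is mainly useful as a sanity check on its calibration.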

Stage C — Cost heads by state

Quantile forecasts (q50/q90/q97.5) for each action under current health state:

cost if staying,
cost if soft/hard quarantining,
cost if staged re-entry.

Unified action score:

\[ \mathrm{Score}(a_t) = \mathbb{E}[IS_t(a_t)] + \lambda\,\mathrm{CVaR}_\alpha(IS_t(a_t)) + \gamma\,P(\text{deadline miss} \mid a_t) \]

Because the score penalizes tail cost and deadline risk as well as mean cost, the controller does not overreact to noisy probes the way mean-only logic would.
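
A sample-based version of the action score might look like the sketch below. It assumes cost samples in bps where higher is worse, estimates CVaR as the mean of the worst (1 - α) share of samples, and uses made-up values for λ, γ, α, and the deadline-miss probabilities.

```python
import math
from statistics import fmean


def cvar(samples, alpha):
    """Mean of the worst (1 - alpha) share of cost samples (higher = worse)."""
    n = len(samples)
    # round() guards against float noise such as (1 - 0.8) * 10 = 1.9999999999999996.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    return fmean(sorted(samples)[-k:])


def action_score(samples, p_deadline_miss, lam=0.5, gamma=10.0, alpha=0.8):
    """Mean cost + lam * CVaR_alpha + gamma * P(deadline miss); lower is better."""
    return fmean(samples) + lam * cvar(samples, alpha) + gamma * p_deadline_miss


# Hypothetical cost samples (bps) for two actions on the same venue.
stay_samples = [3, 3, 4, 4, 4, 4, 4, 4, 5, 5]
quarantine_samples = [1, 1, 1, 2, 2, 2, 2, 3, 6, 15]

stay_score = action_score(stay_samples, p_deadline_miss=0.01)
quarantine_score = action_score(quarantine_samples, p_deadline_miss=0.05)
```

In this toy case quarantining has the lower mean (3.5 vs 4.0 bps) but the higher score, because its tail and deadline-miss terms dominate.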

7) Controller Policy by State

H1 PROBE_WARNING

require cross-path confirmation (probe + ACK/reject evidence)
reduce passive TTL and cap max child size before hard quarantine

H2 SOFT_QUARANTINE

downweight venue rather than full exclusion
enforce bounded participation and tighter retry budget

H3 HARD_QUARANTINE

deterministic cancel/flush checks
preserve emergency completion lane with strict toxicity guard

H4 REENTRY_CANDIDATE

require minimum healthy dwell window
check first-pass reliability KPIs before enabling normal flow

H5 REENTRY_WARMUP

staged weights (e.g., 10% -> 25% -> 50% -> baseline)
short passive timeout, strict markout stop rules

H6 RECOVERED_MONITORED

keep temporary flip guardrails for one monitoring horizon
remove protections gradually, not instantly
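
The staged weights in H5 can be driven by a small step function. The sketch below uses the 10% -> 25% -> 50% -> baseline ladder from the text; the markout sign convention (negative is adverse), the -1.5 bps stop level, and the choice to fall back to the first stage on a stop are assumptions.

```python
# Fractions of the venue's baseline routing weight, per warm-up stage.
WARMUP_WEIGHTS = (0.10, 0.25, 0.50, 1.00)


def next_warmup_step(step, markout_bps, stop_bps=-1.5):
    """Advance one stage on acceptable markout; drop to stage 0 on a stop."""
    if not 0 <= step < len(WARMUP_WEIGHTS):
        raise ValueError(f"unknown warm-up step {step}")
    if markout_bps <= stop_bps:
        return 0
    return min(step + 1, len(WARMUP_WEIGHTS) - 1)


def venue_weight(step, baseline_weight):
    """Routing weight for the venue at the given warm-up stage."""
    return WARMUP_WEIGHTS[step] * baseline_weight


step = 0
step = next_warmup_step(step, markout_bps=-0.5)   # acceptable markout: advance
weight_after_first_check = venue_weight(step, baseline_weight=0.20)
```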

8) Diagnostics & KPIs

HFR — Health Flip Rate (transitions/hour)
QCT — Quarantine Churn Tax (bps cost vs matched stable periods)
FPR-Q — False-Positive Quarantine Rate
RMT95 — Re-entry Markout Tail p95 (first 60s/180s)
QRT — Queue Rebuild Time after re-entry
SDL — Slack Depletion Loss (cost increase per second of lost deadline slack)

If outages are flat but HFR and QCT rise, probe policy is probably too twitchy.
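
HFR and QCT are straightforward to compute once transitions and per-period costs are logged. The sketch below assumes the matching of churn periods to stable periods has already been done upstream; function names and sample values are hypothetical.

```python
from statistics import fmean


def health_flip_rate(transition_count, observed_hours):
    """HFR: health-state transitions per hour."""
    if observed_hours <= 0:
        raise ValueError("observed_hours must be positive")
    return transition_count / observed_hours


def quarantine_churn_tax_bps(churn_costs_bps, stable_costs_bps):
    """QCT: mean cost in churn periods minus mean cost in matched stable periods."""
    return fmean(churn_costs_bps) - fmean(stable_costs_bps)


hfr = health_flip_rate(transition_count=18, observed_hours=6.0)
qct = quarantine_churn_tax_bps(churn_costs_bps=[4.0, 5.0, 6.0],
                               stable_costs_bps=[2.0, 3.0, 4.0])
```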

9) Rollout Blueprint

Shadow (2-3 weeks): log health-belief probabilities and hypothetical state decisions
Backtest + replay: compare binary health policy vs probabilistic controller
Canary: enable only soft-quarantine logic on limited symbols/notional
Promotion gates:
- lower QCT and RMT95,
- no increase in severe completion misses,
- stable FPR-Q under stressed sessions
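
The promotion gates can be encoded as a single check over baseline and canary KPI snapshots. In this sketch, RMT95 is treated as an adverse-markout magnitude where lower is better, and "stable" FPR-Q is read as "not higher than baseline by more than a tolerance"; both readings, the tolerance value, and the type `KpiSnapshot` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiSnapshot:
    qct_bps: float                 # Quarantine Churn Tax
    rmt95_bps: float               # Re-entry Markout Tail p95 (adverse magnitude)
    severe_completion_misses: int
    fpr_q_stressed: float          # False-Positive Quarantine Rate, stressed sessions


def passes_promotion_gates(baseline, canary, fpr_q_tolerance=0.01):
    """True only if every promotion gate holds for the canary."""
    return (canary.qct_bps < baseline.qct_bps
            and canary.rmt95_bps < baseline.rmt95_bps
            and canary.severe_completion_misses <= baseline.severe_completion_misses
            and canary.fpr_q_stressed <= baseline.fpr_q_stressed + fpr_q_tolerance)


baseline = KpiSnapshot(qct_bps=1.8, rmt95_bps=4.0,
                       severe_completion_misses=3, fpr_q_stressed=0.12)
canary = KpiSnapshot(qct_bps=1.1, rmt95_bps=3.2,
                     severe_completion_misses=3, fpr_q_stressed=0.125)
promote = passes_promotion_gates(baseline, canary)
```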

10) Common Anti-Patterns

Binary up/down health flags with no uncertainty score
Probe-only decisions that ignore execution-path reality
Instant hard-quarantine on one timeout burst
Instant full re-entry after first green probe
No dwell-time/hysteresis controls
Evaluating policy on mean IS only (ignoring tail/deadline risk)

11) Fast Implementation Checklist

[ ] Label H0..H6 transitions in router telemetry
[ ] Build calibrated health-belief model (U/D/O probabilities)
[ ] Add short-horizon flip-hazard prediction
[ ] Train quantile slippage heads for quarantine/re-entry actions
[ ] Implement dwell-time + hysteresis guards in controller
[ ] Gate rollout on QCT/RMT95/FPR-Q improvements

TL;DR

Venue-health flapping is not just an infrastructure nuisance; it is a measurable slippage regime. Model health as probabilities, price false quarantine and re-entry tails explicitly, and control flips with hysteresis + staged reactivation to protect p95/p99 execution cost and completion reliability.