Venue Health-Probe Flapping & Quarantine-Churn Slippage Playbook

2026-03-31 · finance

Model Router Health as a Noisy State Process, Not a Binary Up/Down Switch

Why this note: Multi-venue routers often use heartbeat/probe logic to mark venues healthy or unhealthy. When probe thresholds are too twitchy, routers repeatedly quarantine and re-enable venues. That churn quietly destroys queue position and inflates tail slippage even when no venue is truly down.


1) Failure Mode in One Sentence

If venue health is modeled as a binary flag without false-positive/false-negative costs, probe flapping can become a hidden slippage engine via repeated reroute, queue-reset, and re-entry toxicity.


2) Branch-Aware Cost Decomposition

For a child-order decision at time \(t\), define the router's health belief over a venue as the state probabilities \(P(U_t)\), \(P(D_t)\), and \(P(O_t)\).

The expected execution cost of action \(a_t\) is:

\[ \mathbb{E}[IS_t(a_t)] = P(U_t)\,C_U(a_t) + P(D_t)\,C_D(a_t) + P(O_t)\,C_O(a_t) \]
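
The branch-weighted expectation above can be sketched in a few lines. This is a minimal illustration, not a production cost model: the state keys, the two actions, and all bps figures are hypothetical.

```python
def expected_cost(belief, cost_by_state):
    """Sum over health states of P(state) * C_state(action)."""
    total_p = sum(belief.values())
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("belief probabilities must sum to 1")
    return sum(belief[s] * cost_by_state[s] for s in belief)


# Hypothetical belief and per-state costs in bps, for illustration only.
belief = {"U": 0.90, "D": 0.08, "O": 0.02}
stay_cost = {"U": 1.0, "D": 4.0, "O": 20.0}    # keep routing to the venue
reroute_cost = {"U": 2.5, "D": 2.5, "O": 2.5}  # move flow to a fallback venue

stay = expected_cost(belief, stay_cost)
reroute = expected_cost(belief, reroute_cost)
```

With these made-up inputs, staying costs about 1.62 bps in expectation against 2.5 bps for rerouting, so a twitchy probe that forces the reroute would pay for protection the belief does not justify.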

When the health classifier is noisy, realized cost includes misclassification branches:

\[ \mathbb{E}[IS_t] = C_{base} + C_{FP\_quarantine} + C_{FN\_stale\_route} + C_{flip\_churn} \]

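Where:

    • \(C_{base}\): cost when the health classification is correct,
    • \(C_{FP\_quarantine}\): cost of quarantining a venue that is actually healthy (false positive),
    • \(C_{FN\_stale\_route}\): cost of continuing to route to a venue that is actually unhealthy (false negative),
    • \(C_{flip\_churn}\): cost of the health transitions themselves.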

A practical expansion:

\[ C_{flip\_churn} = C_{cancel\_flush} + C_{reroute\_delay} + C_{queue\_reset} + C_{reentry\_markout} + C_{deadline\_catchup} \]
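
To make the expansion concrete, the sketch below sums the five components for a single flip and reports how much of the total comes from the queue reset. The class name and every number are hypothetical; real values would come from matched execution data.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FlipChurnCost:
    """Per-flip cost components in bps (illustrative values only)."""
    cancel_flush: float
    reroute_delay: float
    queue_reset: float
    reentry_markout: float
    deadline_catchup: float

    def total(self) -> float:
        # C_flip_churn is the plain sum of the five components.
        return sum(getattr(self, f.name) for f in fields(self))


per_flip = FlipChurnCost(
    cancel_flush=0.1,
    reroute_delay=0.2,
    queue_reset=0.6,
    reentry_markout=0.4,
    deadline_catchup=0.3,
)
flip_churn = per_flip.total()
queue_share = per_flip.queue_reset / flip_churn
```

Keeping the components separate, rather than logging only the total, makes it possible to see which term dominates before tuning probe thresholds.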


3) Why Probe Flapping Is Expensive Even Without Outages

Repeated health flips create a specific execution pathology:

  1. Passive queue attrition: every false quarantine discards earned queue rank.
  2. Fallback crowding: rerouted flow saturates secondary venues.
  3. Re-entry toxicity: the venue looks healthy again, but the first re-entry window is often adverse.
  4. Policy oscillation: aggression and venue weights ping-pong faster than market microstructure can stabilize.
  5. Deadline convexity: each flip consumes slack, forcing costly late urgency.

This is why desks can report "no major outages" yet still see rising p95/p99 slippage.


4) Health-State Machine (Router-Side)

Use dwell-time + hysteresis gates for H2↔H3 and H4↔H5↔H6 to avoid flip storms.
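
A dwell-time + hysteresis gate can be sketched as below for a single pair of states such as H2 SOFT_QUARANTINE and H3 HARD_QUARANTINE. The class name, the thresholds (0.7 to escalate, 0.3 to de-escalate), the 5-second minimum dwell, and the probe trace are all assumptions made for illustration.

```python
class HysteresisGate:
    """Two-level gate with separate enter/exit thresholds and a minimum dwell."""

    def __init__(self, enter_threshold, exit_threshold, min_dwell_s):
        if exit_threshold >= enter_threshold:
            raise ValueError("exit_threshold must be below enter_threshold")
        self.enter_threshold = enter_threshold
        self.exit_threshold = exit_threshold
        self.min_dwell_s = min_dwell_s
        self.escalated = False
        self.last_change_ts = None  # no transition observed yet

    def update(self, ts, p_unhealthy):
        """Feed one belief reading; return True while in the escalated state."""
        dwell_ok = (
            self.last_change_ts is None
            or ts - self.last_change_ts >= self.min_dwell_s
        )
        if dwell_ok:
            if not self.escalated and p_unhealthy >= self.enter_threshold:
                self.escalated, self.last_change_ts = True, ts
            elif self.escalated and p_unhealthy <= self.exit_threshold:
                self.escalated, self.last_change_ts = False, ts
        return self.escalated


def count_flips(states):
    """Number of state changes between consecutive readings."""
    return sum(a != b for a, b in zip(states, states[1:]))


# Hypothetical (timestamp_s, p_unhealthy) readings from a noisy probe.
readings = [(0, 0.72), (1, 0.45), (2, 0.71), (3, 0.40),
            (4, 0.75), (5, 0.25), (6, 0.20), (7, 0.15)]

naive_states = [p >= 0.5 for _, p in readings]  # single-threshold rule
gate = HysteresisGate(enter_threshold=0.7, exit_threshold=0.3, min_dwell_s=5.0)
gated_states = [gate.update(ts, p) for ts, p in readings]
```

On this made-up trace the single-threshold rule changes state five times, while the gated version changes once. The same pattern would need one gate per guarded transition pair.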


5) Features That Matter Most

A) Probe-path integrity

B) Execution-path confirmation

C) Flap/churn risk

D) Urgency coupling

Without explicit churn features, models often misattribute losses to "market volatility."


6) Modeling Stack

Stage A: Health-belief model

Estimate \(P(U_t), P(D_t), P(O_t)\) from probe + execution features:

Stage B: Flip-hazard model

Estimate short-horizon probability of another health transition:

\[ P(\text{flip in } \Delta t \mid x_t) \]

Useful for deciding whether to quarantine now or hold with tighter caps.
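
One way to use the flip hazard in that decision is to charge the quarantine branch for the churn it is likely to trigger. The rule below is a simplified sketch under stated assumptions: the function name, the cost inputs, and the linear comparison are hypothetical, not a calibrated policy.

```python
def choose_response(p_unhealthy, p_flip, cost_stale_route,
                    cost_quarantine, cost_flip_churn):
    """Compare two expected costs (bps) and return the cheaper response.

    hold:       keep routing with tighter caps; pays the stale-route cost
                only if the venue really is unhealthy.
    quarantine: pays a fixed quarantine cost, plus churn cost weighted by
                the probability that the state flips back within the horizon.
    """
    hold = p_unhealthy * cost_stale_route
    quarantine = cost_quarantine + p_flip * cost_flip_churn
    return "HOLD_TIGHTER_CAPS" if hold <= quarantine else "QUARANTINE"


# Weak evidence and a high chance of flipping back: holding is cheaper.
noisy_case = choose_response(0.3, 0.8, cost_stale_route=5.0,
                             cost_quarantine=1.0, cost_flip_churn=2.0)

# Strong evidence and a low flip hazard: quarantine is cheaper.
clear_case = choose_response(0.9, 0.1, cost_stale_route=5.0,
                             cost_quarantine=1.0, cost_flip_churn=2.0)
```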

Stage C: Cost heads by state

Quantile forecasts (q50/q90/q97.5) for each action under current health state:

Unified action score:

\[ \mathrm{Score}(a_t) = \mathbb{E}[IS_t(a_t)] + \lambda\,\mathrm{CVaR}_\alpha(IS_t(a_t)) + \gamma\,P(\text{deadline miss} \mid a_t) \]

Penalizing the tail and the deadline-miss probability keeps the controller from overreacting to noisy probes the way mean-only logic does.
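
The score can be sketched with an empirical CVaR computed from cost samples. Everything below is illustrative: the sample costs, deadline-miss probabilities, and the weights `lam`, `gamma`, and `alpha` are hypothetical, and a real system would draw the tail estimate from the quantile heads rather than raw samples.

```python
import math
from statistics import mean


def cvar(samples, alpha):
    """Empirical CVaR: mean of the worst (1 - alpha) share of cost samples."""
    ordered = sorted(samples)
    # round() guards against float noise such as 10 * (1 - 0.7) = 3.0000000000000004
    k = max(1, math.ceil(round(len(ordered) * (1 - alpha), 9)))
    return mean(ordered[-k:])


def action_score(cost_samples, p_deadline_miss, lam, gamma, alpha):
    """Mean cost + lam * CVaR_alpha + gamma * P(deadline miss); lower is better."""
    return (mean(cost_samples)
            + lam * cvar(cost_samples, alpha)
            + gamma * p_deadline_miss)


# Hypothetical slippage samples in bps for two candidate actions.
quarantine_costs = [1.0] * 9 + [9.0]  # lower mean, fat re-entry tail
hold_costs = [2.0] * 10               # slightly higher mean, thin tail

quarantine_score = action_score(quarantine_costs, 0.05, lam=0.5, gamma=10.0, alpha=0.9)
hold_score = action_score(hold_costs, 0.01, lam=0.5, gamma=10.0, alpha=0.9)
```

In this made-up case the mean alone favors quarantine (1.8 vs 2.0 bps), but the combined score favors holding (about 3.1 vs 6.8) once the tail and deadline terms are counted.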


7) Controller Policy by State

H1 PROBE_WARNING

H2 SOFT_QUARANTINE

H3 HARD_QUARANTINE

H4 REENTRY_CANDIDATE

H5 REENTRY_WARMUP

H6 RECOVERED_MONITORED


8) Diagnostics & KPIs

  1. HFR: Health Flip Rate (transitions/hour)
  2. QCT: Quarantine Churn Tax (bps cost vs matched stable periods)
  3. FPR-Q: False-Positive Quarantine Rate
  4. RMT95: Re-entry Markout Tail p95 (first 60s/180s)
  5. QRT: Queue Rebuild Time after re-entry
  6. SDL: Slack Depletion Loss (cost increase per second of lost deadline slack)

If outages are flat but HFR and QCT rise, probe policy is probably too twitchy.
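
The first two KPIs are simple to compute once transitions are labeled. The sketch below assumes transition timestamps in seconds and assumes the matching of churn periods to stable periods has already been done upstream; function names and sample values are hypothetical.

```python
from statistics import mean


def health_flip_rate(transition_ts, window_start, window_end):
    """HFR: health-state transitions per hour inside [window_start, window_end)."""
    hours = (window_end - window_start) / 3600.0
    n = sum(window_start <= t < window_end for t in transition_ts)
    return n / hours


def quarantine_churn_tax(churn_costs_bps, stable_costs_bps):
    """QCT: mean cost in churn periods minus mean cost in matched stable periods."""
    return mean(churn_costs_bps) - mean(stable_costs_bps)


# Hypothetical telemetry: transition times (s) and per-period slippage (bps).
transitions = [10, 600, 1200, 1900, 3000, 4000]
hfr = health_flip_rate(transitions, window_start=0, window_end=3600)
qct = quarantine_churn_tax([3.0, 4.0, 5.0], [2.0, 2.5, 3.0])
```

Tracking both series side by side against the outage count is what surfaces the "flat outages, rising HFR and QCT" pattern.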


9) Rollout Blueprint

  1. Shadow (2-3 weeks): log health-belief probabilities and hypothetical state decisions
  2. Backtest + replay: compare binary health policy vs probabilistic controller
  3. Canary: enable only soft-quarantine logic on limited symbols/notional
  4. Promotion gates:
    • lower QCT and RMT95,
    • no increase in severe completion misses,
    • stable FPR-Q under stressed sessions
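
The promotion gates can be encoded as an explicit check so a canary cannot be promoted on a partial win. This is a sketch under assumptions: the metric keys are hypothetical, and "stable FPR-Q" is interpreted here as not rising by more than a chosen tolerance.

```python
def passes_promotion_gates(baseline, candidate, fpr_q_tolerance=0.01):
    """Return (ok, checks) comparing canary metrics against the baseline policy."""
    checks = {
        "qct_lower": candidate["qct_bps"] < baseline["qct_bps"],
        "rmt95_lower": candidate["rmt95_bps"] < baseline["rmt95_bps"],
        "no_more_severe_misses":
            candidate["severe_misses"] <= baseline["severe_misses"],
        # Assumption: "stable" means FPR-Q under stress rises by at most the tolerance.
        "fpr_q_stable":
            candidate["fpr_q_stressed"] - baseline["fpr_q_stressed"] <= fpr_q_tolerance,
    }
    return all(checks.values()), checks


# Hypothetical metrics for the binary policy and the probabilistic controller.
baseline = {"qct_bps": 1.8, "rmt95_bps": 6.0, "severe_misses": 3, "fpr_q_stressed": 0.120}
candidate = {"qct_bps": 1.2, "rmt95_bps": 4.5, "severe_misses": 3, "fpr_q_stressed": 0.125}

ok, checks = passes_promotion_gates(baseline, candidate)

# A candidate that improves cost but adds a severe completion miss is rejected.
worse = dict(candidate, severe_misses=4)
ok_worse, checks_worse = passes_promotion_gates(baseline, worse)
```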

10) Common Anti-Patterns


11) Fast Implementation Checklist

[ ] Label H0..H6 transitions in router telemetry
[ ] Build calibrated health-belief model (U/D/O probabilities)
[ ] Add short-horizon flip-hazard prediction
[ ] Train quantile slippage heads for quarantine/re-entry actions
[ ] Implement dwell-time + hysteresis guards in controller
[ ] Gate rollout on QCT/RMT95/FPR-Q improvements


TL;DR

Venue-health flapping is not just an infrastructure nuisance; it is a measurable slippage regime. Model health as probabilities, price false quarantine and re-entry tails explicitly, and control flips with hysteresis + staged reactivation to protect p95/p99 execution cost and completion reliability.