Venue-Outage Failover Spillover Slippage Playbook
Date: 2026-03-12
Category: research
Audience: small quant execution teams running multi-venue routers in production
Why this playbook exists
Most slippage models assume venue availability is stable and execution cost is mostly driven by spread, impact, and urgency.
In live routing, one degraded or unavailable venue can suddenly re-route flow to the rest of the market, creating a failover spillover shock:
- queue age resets from cancel/re-submit paths,
- residual venues saturate,
- adverse selection rises because everyone crowds the same fallback pool.
If your model treats this as ordinary volatility, you underprice tail cost exactly when it matters.
Core idea
Model outage episodes as a branch process, not a single-point latency event.
For child order e:
E[Cost_e] = p_up * C_normal + p_deg * C_degraded + p_out * C_failover
Where:
- p_up: probability the venue is fully available
- p_deg: probability the venue is degraded (slow ACKs, reject bursts, partial executability)
- p_out: probability the venue is unavailable or hard-rerouted
- C_failover: includes reroute delay + queue reset + fallback toxicity
This makes outage risk an explicit slippage term before the router is forced to panic.
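A minimal sketch of the branch-process cost term above; the probability and cost values are illustrative placeholders, not calibrated numbers:

```python
def expected_child_cost(p_up, p_deg, p_out,
                        c_normal, c_degraded, c_failover):
    """Three-branch expected slippage for one child order.

    Probabilities must form a distribution over venue states;
    costs share one unit (e.g. bps of arrival mid).
    """
    assert abs((p_up + p_deg + p_out) - 1.0) < 1e-9, "branch probs must sum to 1"
    return p_up * c_normal + p_deg * c_degraded + p_out * c_failover

# Failover branch priced explicitly:
# reroute delay + queue reset + fallback toxicity (illustrative bps).
c_failover = 2.0 + 1.5 + 3.0
cost = expected_child_cost(0.90, 0.07, 0.03,
                           c_normal=1.0, c_degraded=2.5, c_failover=c_failover)
```

Pricing C_failover as its own branch, rather than folding it into a volatility term, is what keeps the router honest about tail cost before an outage occurs.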
Cost decomposition (failover-aware)
Cost = IS + DelayTax + QueueResetTax + SaturationTax + MissPenalty
1) Delay Tax
Extra time from detect -> withdraw -> reroute -> ack
2) Queue Reset Tax
Lost queue priority when working liquidity must be re-entered elsewhere
3) Saturation Tax
Fallback venues become crowded; same urgency now crosses wider effective spread
4) Miss Penalty
Completion risk when fallback depth cannot absorb residual before deadline
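The decomposition above can be kept as an explicit episode artifact so each tax is attributed separately in TCA; field names here are illustrative:

```python
def failover_cost_decomposition(is_bps, delay_tax_bps, queue_reset_tax_bps,
                                saturation_tax_bps, miss_penalty_bps):
    """Failover-aware cost split; all terms in bps of arrival mid."""
    parts = {
        "IS": is_bps,
        "DelayTax": delay_tax_bps,
        "QueueResetTax": queue_reset_tax_bps,
        "SaturationTax": saturation_tax_bps,
        "MissPenalty": miss_penalty_bps,
    }
    parts["Total"] = sum(parts.values())
    return parts

episode = failover_cost_decomposition(1.0, 0.5, 0.8, 1.2, 0.3)
```

Storing the split per episode (rather than only the total) is what lets the weekly review attribute cost to detection lag vs crowding vs queue churn.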
Key episode metrics
Outage Spillover Index (OSI)
OSI = (FallbackFlow_t - BaselineFallbackFlow_t) / BaselineFallbackFlow_t
Measures how much routing pressure suddenly moved to fallback venues.
Residual Venue Saturation (RVS)
RVS = AggressiveVolume_t / ReliableDepth_t
Reliable depth should be cancel-hazard adjusted, not raw top-of-book size.
Failover Congestion Tax (FCT)
FCT = Slippage_failover - Slippage_normal_matched
Matched by volatility/spread/imbalance/participation to isolate outage-driven cost.
Failover Detection Lag (FDL)
FDL = t_router_failover - t_first_degradation_signal
If FDL is large, tail slippage is usually nonlinear.
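The four episode metrics translate directly into code; a sketch follows, including the cancel-hazard depth adjustment mentioned under RVS (the linear discount is an assumption, not the only valid adjustment):

```python
def outage_spillover_index(fallback_flow, baseline_fallback_flow):
    """OSI: relative surge of routing pressure onto fallback venues."""
    return (fallback_flow - baseline_fallback_flow) / baseline_fallback_flow

def cancel_adjusted_depth(raw_top_depth, cancel_hazard):
    """Reliable depth: discount displayed size by its short-horizon cancel hazard."""
    return raw_top_depth * (1.0 - cancel_hazard)

def residual_venue_saturation(aggressive_volume, reliable_depth):
    """RVS: aggressive demand vs cancel-hazard-adjusted depth."""
    return aggressive_volume / reliable_depth

def failover_congestion_tax(slippage_failover_bps, slippage_normal_matched_bps):
    """FCT: excess slippage vs a regime-matched normal baseline (bps)."""
    return slippage_failover_bps - slippage_normal_matched_bps

def failover_detection_lag(t_router_failover, t_first_degradation_signal):
    """FDL: seconds between first degradation signal and router reaction."""
    return t_router_failover - t_first_degradation_signal
```

Matching for FCT (volatility, spread, imbalance, participation) happens upstream; these helpers only compute the metrics once the matched baseline exists.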
State machine
NORMAL
- Standard venue scoring and participation
- Track OSI/RVS baseline
DEGRADED
Triggered by reject burst, ACK tail expansion, or stale-book reliability collapse
- Pre-emptively reduce exposure to suspect venue
- Increase passive TTL discipline to avoid stale queue traps
FAILOVER_SURGE
Triggered when venue outage probability crosses threshold or hard outage confirmed
- Enforce fallback venue caps
- Reduce child size and shorten horizon slices
- Prefer completion reliability over maker-fee optimization
STABILIZING
Outage is resolving, but crowding remains elevated
- Slowly restore venue weights with hysteresis
- Keep saturation guardrails active
SAFE
Emergency protection mode
- Cap urgency and max child notional
- Freeze non-essential tactics (complex conditional logic, aggressive retries)
- Alert operator with episode artifact package
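The state machine above can be sketched as a transition function; the trigger signals are simplified to a few booleans and the thresholds (0.10, 0.50, 0.25, 300 s) are placeholder assumptions to be calibrated per venue:

```python
STATES = ("NORMAL", "DEGRADED", "FAILOVER_SURGE", "STABILIZING", "SAFE")

def next_state(state, p_out, reject_burst, hard_outage, recovered_for_s,
               safe_trip=False, cooldown_s=300):
    """One controller step; safe_trip is an emergency override into SAFE."""
    if safe_trip:
        return "SAFE"
    if state == "NORMAL":
        return "DEGRADED" if (reject_burst or p_out > 0.10) else "NORMAL"
    if state == "DEGRADED":
        if hard_outage or p_out > 0.50:
            return "FAILOVER_SURGE"
        return "NORMAL" if (not reject_burst and p_out < 0.05) else "DEGRADED"
    if state == "FAILOVER_SURGE":
        return "STABILIZING" if (not hard_outage and p_out < 0.25) else "FAILOVER_SURGE"
    if state == "STABILIZING":
        # Hysteresis: require sustained recovery before returning to NORMAL.
        return "NORMAL" if recovered_for_s >= cooldown_s else "STABILIZING"
    return "SAFE"  # SAFE requires operator action to exit
```

Note the asymmetry: entering DEGRADED is cheap (p_out > 0.10) but leaving it requires a stricter condition (p_out < 0.05), which is the hysteresis that prevents flip-flop churn.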
Control policy by state
DEGRADED controls
- Penalize affected venue in SOR score using reliability multiplier
- Increase stale-quote and stale-ack penalties
- Route more conservatively but avoid full evacuation too early
FAILOVER_SURGE controls
- Child size down, spacing up (anti-crowding)
- Temporary per-venue fallback caps to avoid self-inflicted clustering
- Raise toxicity penalties in objective function
STABILIZING controls
- Require sustained recovery before reweighting (time + metric hysteresis)
- Keep queue-reset penalties elevated for a cooldown window
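The STABILIZING reweighting rule (time hysteresis plus slow restore) might look like this; the 300 s window and 0.1 step are illustrative assumptions:

```python
def restore_venue_weight(current_w, target_w, recovered_for_s,
                         min_recovery_s=300, step=0.1):
    """Hysteresis-gated venue reweighting: no restore until the recovery
    window has elapsed, then move at most `step` per control interval."""
    if recovered_for_s < min_recovery_s:
        return current_w
    return min(target_w, current_w + step)
```

Ramping in small steps, rather than snapping back to the pre-outage weight, avoids re-concentrating flow on a venue whose recovery may still be partial.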
Routing objective (outage-aware)
For venue v:
Score_v = Edge_v - Fee_v - Impact_v - λ1*ReliabilityRisk_v - λ2*SaturationRisk_v - λ3*QueueResetRisk_v
Where:
- ReliabilityRisk_v increases with reject/ACK anomalies
- SaturationRisk_v increases with OSI/RVS stress
- QueueResetRisk_v captures expected loss from reroute churn
Policy chooses venue mix minimizing expected mean + tail loss, not just expected fill speed.
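The scoring formula above, plus a venue selection over it, as a sketch; the lambda defaults are placeholders to be tuned against tail-loss targets:

```python
def venue_score(edge, fee, impact,
                reliability_risk, saturation_risk, queue_reset_risk,
                lam1=1.0, lam2=1.0, lam3=1.0):
    """Outage-aware SOR score; lambdas trade fill speed vs tail protection."""
    return (edge - fee - impact
            - lam1 * reliability_risk
            - lam2 * saturation_risk
            - lam3 * queue_reset_risk)

def pick_venue(candidates):
    """candidates: {venue_name: kwargs for venue_score}; returns the best venue."""
    return max(candidates, key=lambda v: venue_score(**candidates[v]))

candidates = {
    "A": dict(edge=2.0, fee=0.2, impact=0.3,
              reliability_risk=0.1, saturation_risk=0.1, queue_reset_risk=0.1),
    "B": dict(edge=2.5, fee=0.2, impact=0.3,   # better edge, but unreliable
              reliability_risk=1.5, saturation_risk=0.1, queue_reset_risk=0.1),
}
```

With nontrivial lambdas, the higher-edge but less reliable venue B loses to A; that is the intended behavior, since the objective targets mean plus tail loss rather than fill speed alone.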
Data contract (must-have)
Per child-order and venue-attempt:
- intent_ts, send_ts, ack_ts, fill_ts, cancel_ack_ts
- venue status snapshots (up/degraded/out) with reason code
- reject reason, retry count, reroute count
- pre/post-reroute queue-age proxy
- reliable depth estimate (cancel-adjusted)
- spread, imbalance, microprice drift, realized markout horizon
- completion deadline and residual trajectory
Without per-attempt lineage, failover tax is invisible in TCA.
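One way to pin the data contract down is a per-attempt record type; every field name here is illustrative, mirroring the bullets above rather than any specific schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VenueAttempt:
    """Per child-order, per venue-attempt lineage record."""
    child_order_id: str
    venue: str
    intent_ts: float
    send_ts: float
    ack_ts: Optional[float] = None
    fill_ts: Optional[float] = None
    cancel_ack_ts: Optional[float] = None
    venue_status: str = "up"            # up / degraded / out
    status_reason: Optional[str] = None
    reject_reason: Optional[str] = None
    retry_count: int = 0
    reroute_count: int = 0
    queue_age_pre_s: float = 0.0        # pre-reroute queue-age proxy
    queue_age_post_s: float = 0.0       # post-reroute queue-age proxy
    reliable_depth: float = 0.0         # cancel-adjusted depth estimate
    spread_bps: float = 0.0
    imbalance: float = 0.0
    microprice_drift_bps: float = 0.0
    markout_bps: float = 0.0
    deadline_ts: float = 0.0
    residual_qty: float = 0.0

attempt = VenueAttempt(child_order_id="c1", venue="V1",
                       intent_ts=0.0, send_ts=0.002)
```

Keying records by (child_order_id, venue, attempt) rather than by fill is the point: the failover tax lives in the attempts that did not fill.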
Calibration loop
Intraday (5–10 min)
- Refresh outage probability and state labels
- Recompute OSI/RVS and apply control transitions
Daily
- Refit matched FCT estimates
- Validate tail metrics (q95, CVaR95) by state
- Audit false FAILOVER_SURGE triggers
Weekly
- Rebaseline reliability priors per venue/session
- Review SAFE incidents and rollback decisions
- Promote/demote fallback policy variants
Rollout plan
Shadow mode (1-2 weeks)
- Compute outage states and score penalties, no live action.
Guardrail mode
- Enable DEGRADED controls only (no hard failover throttles yet).
Full mode
- Enable FAILOVER_SURGE/STABILIZING/SAFE with rollback switch.
Governance mode
- Weekly review of outage episodes, FCT trend, and state quality.
Rollback triggers:
- completion rate drops below target for N intervals,
- q95 slippage exceeds outage-adjusted budget,
- SAFE dwell time exceeds expected envelope.
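The three rollback triggers can be checked mechanically each interval; a sketch, where any single trigger is sufficient and all thresholds are operator-set assumptions:

```python
def should_roll_back(completion_rates, target_rate, n_intervals,
                     q95_slippage_bps, outage_adjusted_budget_bps,
                     safe_dwell_s, safe_dwell_envelope_s):
    """True if any rollback trigger fires:
    1) completion below target for the last n_intervals,
    2) q95 slippage over the outage-adjusted budget,
    3) SAFE dwell time beyond its expected envelope."""
    recent = completion_rates[-n_intervals:]
    completion_breach = (len(recent) == n_intervals
                         and all(r < target_rate for r in recent))
    return (completion_breach
            or q95_slippage_bps > outage_adjusted_budget_bps
            or safe_dwell_s > safe_dwell_envelope_s)
```

Requiring N consecutive breached intervals for the completion trigger, rather than a single bad interval, keeps one noisy print from tripping a full rollback.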
Common failure modes
Binary venue health logic
- Treating health as up/down misses the degraded regime where most of the tax accumulates.
No saturation term in router objective
- Failover sends everyone to the same venue and amplifies markout.
Fast reroute without queue-reset accounting
- "Quick reaction" can still be expensive if every reroute restarts queue age.
No hysteresis on recovery
- Premature re-entry into a recovering venue causes flip-flop churn.
Postmortem without per-attempt lineage
- You cannot separate market moves from failover-control mistakes.
Dashboard minimum
- State timeline (NORMAL/DEGRADED/FAILOVER_SURGE/STABILIZING/SAFE)
- OSI, RVS, FDL by venue and session
- Failover branch cost decomposition (delay/reset/saturation/miss)
- q50/q95/CVaR slippage by state
- completion reliability + residual backlog trajectory
- Top incidents with reason codes and action decisions
Practical takeaway
Venue outages are not just connectivity incidents. They are microstructure regime shifts that change executable depth, queue economics, and tail risk.
Model failover as a branch process, price saturation and queue-reset tax explicitly, and let a state controller protect completion and p95 cost before crowding chaos takes over.
One-line implementation mantra
Treat venue health as a probabilistic state, not a binary flag, and pay failover tax in the model before paying it in live capital.