Venue-Outage Failover Spillover Slippage Playbook

2026-03-12 · finance

Date: 2026-03-12
Category: research
Audience: small quant execution teams running multi-venue routers in production


Why this playbook exists

Most slippage models assume venue availability is stable and execution cost is mostly driven by spread, impact, and urgency.

In live routing, a single degraded or unavailable venue can abruptly push flow onto the remaining venues, creating a failover spillover shock.

If your model treats this as ordinary volatility, you underprice tail cost exactly when it matters.


Core idea

Model outage episodes as a branch process, not a single-point latency event.

For child order e:

E[Cost_e] = p_up * C_normal + p_deg * C_degraded + p_out * C_failover

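Where p_up, p_deg, and p_out are the probabilities of the up, degraded, and outage branches (summing to 1), and C_normal, C_degraded, and C_failover are the expected costs conditional on each branch.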

This makes outage risk an explicit slippage term before the router is forced to panic.
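
A minimal sketch of the branch-weighted expectation, assuming costs are expressed in basis points and that the three branch probabilities come from some upstream venue-health estimator. The names `BranchCosts` and `expected_cost_bps` and all numbers are illustrative, not part of the playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchCosts:
    """Expected cost in basis points, conditional on each venue branch."""
    normal_bps: float
    degraded_bps: float
    failover_bps: float


def expected_cost_bps(p_up: float, p_deg: float, p_out: float,
                      costs: BranchCosts) -> float:
    """E[Cost_e] = p_up*C_normal + p_deg*C_degraded + p_out*C_failover."""
    probs = (p_up, p_deg, p_out)
    if any(p < 0.0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("branch probabilities must be non-negative and sum to 1")
    return (p_up * costs.normal_bps
            + p_deg * costs.degraded_bps
            + p_out * costs.failover_bps)


# Illustrative numbers only, not calibrated values.
costs = BranchCosts(normal_bps=2.0, degraded_bps=6.0, failover_bps=25.0)
calm = expected_cost_bps(0.95, 0.04, 0.01, costs)      # about 2.39 bps
stressed = expected_cost_bps(0.60, 0.25, 0.15, costs)  # about 6.45 bps
```

With these made-up inputs, a modest shift of probability mass toward the degraded and outage branches nearly triples the expected cost, even though the per-branch costs are unchanged.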


Cost decomposition (failover-aware)

Cost = IS + DelayTax + QueueResetTax + SaturationTax + MissPenalty

1) Delay Tax

Extra time from detect -> withdraw -> reroute -> ack

2) Queue Reset Tax

Lost queue priority when working liquidity must be re-entered elsewhere

3) Saturation Tax

Fallback venues become crowded; same urgency now crosses wider effective spread

4) Miss Penalty

Completion risk when fallback depth cannot absorb residual before deadline
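
The decomposition can be carried as a small record so that TCA reports the failover-specific share separately. This is a sketch under assumptions: IS is read here as implementation shortfall, every component is in basis points, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverAwareCost:
    """Cost = IS + DelayTax + QueueResetTax + SaturationTax + MissPenalty (bps)."""
    is_bps: float               # assumed to mean implementation shortfall
    delay_tax_bps: float
    queue_reset_tax_bps: float
    saturation_tax_bps: float
    miss_penalty_bps: float

    @property
    def failover_tax_bps(self) -> float:
        """Sum of the four outage-driven components."""
        return (self.delay_tax_bps + self.queue_reset_tax_bps
                + self.saturation_tax_bps + self.miss_penalty_bps)

    @property
    def total_bps(self) -> float:
        return self.is_bps + self.failover_tax_bps


# Illustrative episode, not measured data.
episode = FailoverAwareCost(3.0, 0.8, 1.2, 2.5, 0.5)
failover_share = episode.failover_tax_bps / episode.total_bps
```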


Key episode metrics

Outage Spillover Index (OSI)

OSI = (FallbackFlow_t - BaselineFallbackFlow_t) / BaselineFallbackFlow_t

Measures how much routing pressure suddenly moved to fallback venues.

Residual Venue Saturation (RVS)

RVS = AggressiveVolume_t / ReliableDepth_t

Reliable depth should be cancel-hazard adjusted, not raw top-of-book size.
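
OSI and RVS can be sketched as follows. The cancel-hazard adjustment shown here (scaling displayed depth by an estimated survival probability) is one simple assumption, not the playbook's prescribed method, and all function names are hypothetical.

```python
def outage_spillover_index(fallback_flow: float,
                           baseline_fallback_flow: float) -> float:
    """OSI = (FallbackFlow_t - BaselineFallbackFlow_t) / BaselineFallbackFlow_t."""
    if baseline_fallback_flow <= 0:
        raise ValueError("baseline fallback flow must be positive")
    return (fallback_flow - baseline_fallback_flow) / baseline_fallback_flow


def reliable_depth(displayed_depth: float, cancel_hazard: float) -> float:
    """Assumed adjustment: keep only the depth expected to survive cancels."""
    if not 0.0 <= cancel_hazard <= 1.0:
        raise ValueError("cancel_hazard must be in [0, 1]")
    return displayed_depth * (1.0 - cancel_hazard)


def residual_venue_saturation(aggressive_volume: float, depth: float) -> float:
    """RVS = AggressiveVolume_t / ReliableDepth_t."""
    if depth <= 0:
        raise ValueError("reliable depth must be positive")
    return aggressive_volume / depth


# Illustrative numbers only.
osi = outage_spillover_index(fallback_flow=1800.0, baseline_fallback_flow=600.0)
depth = reliable_depth(displayed_depth=5000.0, cancel_hazard=0.4)
rvs = residual_venue_saturation(aggressive_volume=4500.0, depth=depth)
```

In this example the raw book would suggest a saturation of 0.9 (4500 / 5000), while the hazard-adjusted figure is about 1.5, which is the gap the adjustment is meant to expose.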

Failover Congestion Tax (FCT)

FCT = Slippage_failover - Slippage_normal_matched

Matched by volatility/spread/imbalance/participation to isolate outage-driven cost.
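
One simple way to build the matched baseline is coarse bucketing: tag each fill with a tuple of binned conditions and compare failover fills only against normal fills in the same bucket. The sketch below assumes that bucketing scheme; the playbook does not specify a matching method, and the bucket labels and function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def failover_congestion_tax(failover_fills, normal_fills):
    """Mean of (failover slippage - mean matched normal slippage), in bps.

    Each fill is (bucket_key, slippage_bps), where bucket_key is a tuple of
    coarse bins for volatility, spread, imbalance, and participation.
    Failover fills with no matching normal bucket are skipped.
    Returns None when nothing can be matched.
    """
    normal_by_bucket = defaultdict(list)
    for key, slippage in normal_fills:
        normal_by_bucket[key].append(slippage)

    diffs = []
    for key, slippage in failover_fills:
        matched = normal_by_bucket.get(key)
        if matched:
            diffs.append(slippage - mean(matched))
    return mean(diffs) if diffs else None


# Illustrative fills with two-dimensional buckets for brevity.
normal = [(("hi_vol", "wide"), 4.0), (("hi_vol", "wide"), 6.0),
          (("lo_vol", "tight"), 1.0)]
failover = [(("hi_vol", "wide"), 12.0), (("lo_vol", "tight"), 3.0),
            (("mid_vol", "wide"), 9.0)]  # last fill has no match
fct = failover_congestion_tax(failover, normal)
```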

Failover Detection Lag (FDL)

FDL = t_router_failover - t_first_degradation_signal

When FDL is large, tail slippage usually grows nonlinearly rather than in proportion to the lag.
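
FDL is a plain timestamp difference, so the main work is logging both timestamps per episode. A minimal sketch, assuming nanosecond epoch timestamps and made-up episode data:

```python
from statistics import median


def failover_detection_lag_ms(t_first_degradation_signal_ns: int,
                              t_router_failover_ns: int) -> float:
    """FDL = t_router_failover - t_first_degradation_signal, in milliseconds."""
    if t_router_failover_ns < t_first_degradation_signal_ns:
        raise ValueError("failover cannot precede the first degradation signal")
    return (t_router_failover_ns - t_first_degradation_signal_ns) / 1_000_000


# (first degradation signal, router failover) per episode; illustrative only.
episodes = [
    (1_000_000_000, 1_120_000_000),
    (5_000_000_000, 5_450_000_000),
    (9_000_000_000, 9_080_000_000),
]
lags_ms = [failover_detection_lag_ms(a, b) for a, b in episodes]
summary = {"median_ms": median(lags_ms), "max_ms": max(lags_ms)}
```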


State machine

NORMAL

DEGRADED

Triggered by reject burst, ACK tail expansion, or stale-book reliability collapse

FAILOVER_SURGE

Triggered when venue outage probability crosses threshold or hard outage confirmed

STABILIZING

Outage is resolving, but crowding remains elevated

SAFE

Emergency protection mode
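
The five states can be sketched as a small transition function. Several details below are assumptions rather than playbook content: the two outage-probability thresholds, the `emergency` flag used to enter SAFE, and the rule that SAFE and FAILOVER_SURGE both exit through STABILIZING.

```python
from dataclasses import dataclass
from enum import Enum


class VenueState(Enum):
    NORMAL = "NORMAL"
    DEGRADED = "DEGRADED"
    FAILOVER_SURGE = "FAILOVER_SURGE"
    STABILIZING = "STABILIZING"
    SAFE = "SAFE"


@dataclass(frozen=True)
class Signals:
    degradation_signal: bool   # reject burst, ACK tail expansion, or stale book
    p_out: float               # estimated venue outage probability
    hard_outage: bool          # outage confirmed
    crowding_elevated: bool    # fallback venues still crowded
    emergency: bool = False    # hypothetical trigger for SAFE


P_OUT_ENTER = 0.30  # assumed threshold to enter FAILOVER_SURGE
P_OUT_EXIT = 0.10   # assumed lower threshold to leave it


def next_state(state: VenueState, s: Signals) -> VenueState:
    if s.emergency:
        return VenueState.SAFE
    if s.hard_outage or s.p_out >= P_OUT_ENTER:
        return VenueState.FAILOVER_SURGE
    if state in (VenueState.FAILOVER_SURGE, VenueState.SAFE):
        # Leave the surge only once outage probability is below the exit level.
        return (VenueState.STABILIZING if s.p_out < P_OUT_EXIT
                else VenueState.FAILOVER_SURGE)
    if state is VenueState.STABILIZING and s.crowding_elevated:
        return VenueState.STABILIZING
    return VenueState.DEGRADED if s.degradation_signal else VenueState.NORMAL


feed = [
    Signals(True, 0.05, False, False),
    Signals(True, 0.40, False, True),
    Signals(True, 0.20, False, True),
    Signals(False, 0.05, False, True),
    Signals(False, 0.05, False, True),
    Signals(False, 0.02, False, False),
]
state = VenueState.NORMAL
path = []
for signals in feed:
    state = next_state(state, signals)
    path.append(state.name)
```

The gap between the enter and exit thresholds keeps the controller in FAILOVER_SURGE while the outage probability is merely drifting down, which is why the third observation (p_out = 0.20) does not yet trigger STABILIZING.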


Control policy by state

DEGRADED controls

FAILOVER_SURGE controls

STABILIZING controls


Routing objective (outage-aware)

For venue v:

Score_v = Edge_v - Fee_v - Impact_v - λ1*ReliabilityRisk_v - λ2*SaturationRisk_v - λ3*QueueResetRisk_v

Where λ1, λ2, and λ3 are the penalty weights on reliability, saturation, and queue-reset risk.

The policy chooses the venue mix that minimizes mean plus tail loss, not just expected time to fill.
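
A minimal sketch of the per-venue score, assuming all terms share one unit (for example bps) and that the risk inputs are already estimated upstream. It only ranks venues by Score_v; it does not implement the full mean-plus-tail optimization over a venue mix. All names and numbers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VenueInputs:
    edge: float
    fee: float
    impact: float
    reliability_risk: float
    saturation_risk: float
    queue_reset_risk: float


def venue_score(v: VenueInputs, lam1: float, lam2: float, lam3: float) -> float:
    """Score_v = Edge - Fee - Impact - lam1*Reliability - lam2*Saturation - lam3*QueueReset."""
    return (v.edge - v.fee - v.impact
            - lam1 * v.reliability_risk
            - lam2 * v.saturation_risk
            - lam3 * v.queue_reset_risk)


venues = {
    "A": VenueInputs(1.5, 0.3, 0.4, 0.1, 0.2, 0.1),
    "B": VenueInputs(1.8, 0.2, 0.5, 0.6, 0.9, 0.3),
}
# Outage-aware ranking with assumed weights.
scores = {name: venue_score(v, 1.0, 1.0, 0.5) for name, v in venues.items()}
best = max(scores, key=scores.get)
# Same venues with the three penalties switched off.
naive = {name: venue_score(v, 0.0, 0.0, 0.0) for name, v in venues.items()}
naive_best = max(naive, key=naive.get)
```

With the penalties off, venue B wins on raw edge; with them on, its reliability and saturation risk push it below A.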


Data contract (must-have)

Per child-order and venue-attempt:

Without per-attempt lineage, failover tax is invisible in TCA.
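
To make "per-attempt lineage" concrete, the sketch below shows one possible record shape and a reroute count derived from it. Every field name is hypothetical and stands in for whatever the team's own contract specifies; the point is only that each venue attempt is a separate row linked to its child order.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VenueAttempt:
    parent_order_id: str
    child_order_id: str
    attempt_seq: int                # 1 for the first attempt, 2+ for reroutes
    venue: str
    venue_state_at_send: str        # e.g. "NORMAL", "DEGRADED", "FAILOVER_SURGE"
    t_send_ns: int
    t_ack_ns: Optional[int]         # None if never acknowledged
    filled_qty: int
    slippage_bps: Optional[float]   # None if nothing filled
    reroute_reason: Optional[str]   # None on the first attempt


def reroutes_per_child(attempts):
    """Number of reroutes (attempts beyond the first) for each child order."""
    counts = Counter(a.child_order_id for a in attempts)
    return {child: n - 1 for child, n in counts.items()}


attempts = [
    VenueAttempt("p1", "c1", 1, "X", "DEGRADED", 100, None, 0, None, None),
    VenueAttempt("p1", "c1", 2, "Y", "FAILOVER_SURGE", 900, 950, 200, 7.5,
                 "ack_timeout"),
    VenueAttempt("p1", "c2", 1, "Y", "NORMAL", 120, 130, 300, 1.2, None),
]
reroutes = reroutes_per_child(attempts)
```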


Calibration loop

Intraday (5–10 min)

Daily

Weekly


Rollout plan

  1. Shadow mode (1–2 weeks)
    Compute outage states and score penalties, no live action.

  2. Guardrail mode
    Enable DEGRADED controls only (no hard failover throttles yet).

  3. Full mode
    Enable FAILOVER_SURGE/STABILIZING/SAFE with rollback switch.

  4. Governance mode
    Weekly review of outage episodes, FCT trend, and state quality.

Rollback triggers:


Common failure modes

  1. Binary venue health logic
    Treating health as up/down misses the degraded regime, where most of the tax accumulates.

  2. No saturation term in router objective
    Failover sends everyone to the same venue and amplifies adverse markout.

  3. Fast reroute without queue-reset accounting
    “Quick reaction” can still be expensive if every reroute restarts queue age.

  4. No hysteresis on recovery
    Premature re-entry into a recovering venue causes flip-flop churn.

  5. Postmortem without per-attempt lineage
    You cannot separate market moves from failover-control mistakes.
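
Failure mode 4 can be guarded against with a simple hysteresis gate: a venue is dropped as soon as its health score falls below one threshold, but is re-admitted only after staying above a higher threshold for several consecutive observations. This is a sketch under assumptions; the health score, both thresholds, and the streak length are hypothetical.

```python
class RecoveryGate:
    """Hysteresis on venue re-entry to avoid flip-flop churn."""

    def __init__(self, exit_below: float = 0.5, reenter_above: float = 0.8,
                 required_streak: int = 3) -> None:
        if reenter_above <= exit_below:
            raise ValueError("reenter_above must exceed exit_below")
        self.exit_below = exit_below
        self.reenter_above = reenter_above
        self.required_streak = required_streak
        self.eligible = True
        self._streak = 0

    def update(self, health: float) -> bool:
        """Feed one health observation; return whether the venue is routable."""
        if self.eligible:
            if health < self.exit_below:
                self.eligible = False
                self._streak = 0
        else:
            # Count consecutive healthy observations; any dip resets the count.
            self._streak = self._streak + 1 if health > self.reenter_above else 0
            if self._streak >= self.required_streak:
                self.eligible = True
        return self.eligible


gate = RecoveryGate()
health_feed = [0.9, 0.4, 0.85, 0.6, 0.85, 0.9, 0.95, 0.7]
decisions = [gate.update(h) for h in health_feed]
```

In this feed the single healthy reading at 0.85 right after the drop does not re-admit the venue; it returns only after three healthy readings in a row.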


Dashboard minimum


Practical takeaway

Venue outages are not just connectivity incidents. They are microstructure regime shifts that change executable depth, queue economics, and tail risk.

Model failover as a branch process, price saturation and queue-reset tax explicitly, and let a state controller protect completion and p95 cost before crowding chaos takes over.


One-line implementation mantra

Treat venue health as a probabilistic state, not a binary flag, and pay failover tax in the model before paying it in live capital.