Venue-Outage Failover Spillover Slippage Playbook
Date: 2026-03-12
Category: research
Audience: small quant execution teams running multi-venue routers in production
Why this playbook exists
Most slippage models assume venue availability is stable and execution cost is mostly driven by spread, impact, and urgency.
In live routing, one degraded or unavailable venue can suddenly re-route flow to the rest of the market, creating a failover spillover shock:
- queue age resets from cancel/re-submit paths,
- residual venues saturate,
- adverse selection rises because everyone crowds the same fallback pool.
If your model treats this as ordinary volatility, you underprice tail cost exactly when it matters.
Core idea
Model outage episodes as a branch process, not a single-point latency event.
For child order e:
E[Cost_e] = p_up * C_normal + p_deg * C_degraded + p_out * C_failover
Where:
- p_up: probability the venue is fully available
- p_deg: probability the venue is degraded (slow ACKs, reject bursts, partial executability)
- p_out: probability the venue is unavailable or hard-rerouted
- C_failover: includes reroute delay + queue reset + fallback toxicity
This makes outage risk an explicit slippage term before the router is forced to panic.
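A minimal sketch of the branch-process cost term above; the probability and cost values are illustrative placeholders, not calibrated numbers:

```python
def expected_child_cost(p_up, p_deg, p_out,
                        c_normal, c_degraded, c_failover):
    """Three-branch expected slippage for one child order.

    Probabilities must form a distribution over venue states;
    costs share one unit (e.g. bps of arrival mid).
    """
    assert abs((p_up + p_deg + p_out) - 1.0) < 1e-9, "branch probs must sum to 1"
    return p_up * c_normal + p_deg * c_degraded + p_out * c_failover

# Failover branch priced explicitly:
# reroute delay + queue reset + fallback toxicity (illustrative bps).
c_failover = 2.0 + 1.5 + 3.0
cost = expected_child_cost(0.90, 0.07, 0.03,
                           c_normal=1.0, c_degraded=2.5, c_failover=c_failover)
```

Pricing C_failover as its own branch, rather than folding it into a volatility term, is what keeps the router honest about tail cost before an outage occurs.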
Cost decomposition (failover-aware)
Cost = IS + DelayTax + QueueResetTax + SaturationTax + MissPenalty
1) Delay Tax
Extra time from detect -> withdraw -> reroute -> ack
2) Queue Reset Tax
Lost queue priority when working liquidity must be re-entered elsewhere
3) Saturation Tax
Fallback venues become crowded; same urgency now crosses wider effective spread
4) Miss Penalty
Completion risk when fallback depth cannot absorb residual before deadline
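The decomposition above can be kept as an explicit episode artifact so each tax is attributed separately in TCA; field names here are illustrative:

```python
def failover_cost_decomposition(is_bps, delay_tax_bps, queue_reset_tax_bps,
                                saturation_tax_bps, miss_penalty_bps):
    """Failover-aware cost split; all terms in bps of arrival mid."""
    parts = {
        "IS": is_bps,
        "DelayTax": delay_tax_bps,
        "QueueResetTax": queue_reset_tax_bps,
        "SaturationTax": saturation_tax_bps,
        "MissPenalty": miss_penalty_bps,
    }
    parts["Total"] = sum(parts.values())
    return parts

episode = failover_cost_decomposition(1.0, 0.5, 0.8, 1.2, 0.3)
```

Storing the split per episode (rather than only the total) is what lets the weekly review attribute cost to detection lag vs crowding vs queue churn.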
Key episode metrics
Outage Spillover Index (OSI)
OSI = (FallbackFlow_t - BaselineFallbackFlow_t) / BaselineFallbackFlow_t
Measures how much routing pressure suddenly moved to fallback venues.
Residual Venue Saturation (RVS)
RVS = AggressiveVolume_t / ReliableDepth_t
Reliable depth should be cancel-hazard adjusted, not raw top-of-book size.
Failover Congestion Tax (FCT)
FCT = Slippage_failover - Slippage_normal_matched
Matched by volatility/spread/imbalance/participation to isolate outage-driven cost.
Failover Detection Lag (FDL)
FDL = t_router_failover - t_first_degradation_signal
If FDL is large, tail slippage is usually nonlinear.
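The four episode metrics translate directly into code; a sketch follows, including the cancel-hazard depth adjustment mentioned under RVS (the linear discount is an assumption, not the only valid adjustment):

```python
def outage_spillover_index(fallback_flow, baseline_fallback_flow):
    """OSI: relative surge of routing pressure onto fallback venues."""
    return (fallback_flow - baseline_fallback_flow) / baseline_fallback_flow

def cancel_adjusted_depth(raw_top_depth, cancel_hazard):
    """Reliable depth: discount displayed size by its short-horizon cancel hazard."""
    return raw_top_depth * (1.0 - cancel_hazard)

def residual_venue_saturation(aggressive_volume, reliable_depth):
    """RVS: aggressive demand vs cancel-hazard-adjusted depth."""
    return aggressive_volume / reliable_depth

def failover_congestion_tax(slippage_failover_bps, slippage_normal_matched_bps):
    """FCT: excess slippage vs a regime-matched normal baseline (bps)."""
    return slippage_failover_bps - slippage_normal_matched_bps

def failover_detection_lag(t_router_failover, t_first_degradation_signal):
    """FDL: seconds between first degradation signal and router reaction."""
    return t_router_failover - t_first_degradation_signal
```

Matching for FCT (volatility, spread, imbalance, participation) happens upstream; these helpers only compute the metrics once the matched baseline exists.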
State machine
NORMAL
- Standard venue scoring and participation
- Track OSI/RVS baseline
DEGRADED
Triggered by reject burst, ACK tail expansion, or stale-book reliability collapse
- Pre-emptively reduce exposure to suspect venue
- Increase passive TTL discipline to avoid stale queue traps
FAILOVER_SURGE
Triggered when venue outage probability crosses threshold or hard outage confirmed
- Enforce fallback venue caps
- Reduce child size and shorten horizon slices
- Prefer completion reliability over maker-fee optimization
STABILIZING
Outage is resolving, but crowding remains elevated
- Slowly restore venue weights with hysteresis
- Keep saturation guardrails active
SAFE
Emergency protection mode
- Cap urgency and max child notional
- Freeze non-essential tactics (complex conditional logic, aggressive retries)
- Alert operator with episode artifact package
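The state machine above can be sketched as a transition function; the trigger signals are simplified to a few booleans and the thresholds (0.10, 0.50, 0.25, 300 s) are placeholder assumptions to be calibrated per venue:

```python
STATES = ("NORMAL", "DEGRADED", "FAILOVER_SURGE", "STABILIZING", "SAFE")

def next_state(state, p_out, reject_burst, hard_outage, recovered_for_s,
               safe_trip=False, cooldown_s=300):
    """One controller step; safe_trip is an emergency override into SAFE."""
    if safe_trip:
        return "SAFE"
    if state == "NORMAL":
        return "DEGRADED" if (reject_burst or p_out > 0.10) else "NORMAL"
    if state == "DEGRADED":
        if hard_outage or p_out > 0.50:
            return "FAILOVER_SURGE"
        return "NORMAL" if (not reject_burst and p_out < 0.05) else "DEGRADED"
    if state == "FAILOVER_SURGE":
        return "STABILIZING" if (not hard_outage and p_out < 0.25) else "FAILOVER_SURGE"
    if state == "STABILIZING":
        # Hysteresis: require sustained recovery before returning to NORMAL.
        return "NORMAL" if recovered_for_s >= cooldown_s else "STABILIZING"
    return "SAFE"  # SAFE requires operator action to exit
```

Note the asymmetry: entering DEGRADED is cheap (p_out > 0.10) but leaving it requires a stricter condition (p_out < 0.05), which is the hysteresis that prevents flip-flop churn.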
Control policy by state
DEGRADED controls
- Penalize affected venue in SOR score using reliability multiplier
- Increase stale-quote and stale-ack penalties
- Route more conservatively but avoid full evacuation too early
FAILOVER_SURGE controls
- Child size down, spacing up (anti-crowding)
- Temporary per-venue fallback caps to avoid self-inflicted clustering
- Raise toxicity penalties in objective function
STABILIZING controls
- Require sustained recovery before reweighting (time + metric hysteresis)
- Keep queue-reset penalties elevated for a cooldown window
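The STABILIZING reweighting rule (time hysteresis plus slow restore) might look like this; the 300 s window and 0.1 step are illustrative assumptions:

```python
def restore_venue_weight(current_w, target_w, recovered_for_s,
                         min_recovery_s=300, step=0.1):
    """Hysteresis-gated venue reweighting: no restore until the recovery
    window has elapsed, then move at most `step` per control interval."""
    if recovered_for_s < min_recovery_s:
        return current_w
    return min(target_w, current_w + step)
```

Ramping in small steps, rather than snapping back to the pre-outage weight, avoids re-concentrating flow on a venue whose recovery may still be partial.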
Routing objective (outage-aware)
For venue v:
Score_v = Edge_v - Fee_v - Impact_v - λ1*ReliabilityRisk_v - λ2*SaturationRisk_v - λ3*QueueResetRisk_v
Where:
- ReliabilityRisk_v increases with reject/ACK anomalies
- SaturationRisk_v increases with OSI/RVS stress
- QueueResetRisk_v captures expected loss from reroute churn
Policy chooses venue mix minimizing expected mean + tail loss, not just expected fill speed.
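The scoring formula above, plus a venue selection over it, as a sketch; the lambda defaults are placeholders to be tuned against tail-loss targets:

```python
def venue_score(edge, fee, impact,
                reliability_risk, saturation_risk, queue_reset_risk,
                lam1=1.0, lam2=1.0, lam3=1.0):
    """Outage-aware SOR score; lambdas trade fill speed vs tail protection."""
    return (edge - fee - impact
            - lam1 * reliability_risk
            - lam2 * saturation_risk
            - lam3 * queue_reset_risk)

def pick_venue(candidates):
    """candidates: {venue_name: kwargs for venue_score}; returns the best venue."""
    return max(candidates, key=lambda v: venue_score(**candidates[v]))

candidates = {
    "A": dict(edge=2.0, fee=0.2, impact=0.3,
              reliability_risk=0.1, saturation_risk=0.1, queue_reset_risk=0.1),
    "B": dict(edge=2.5, fee=0.2, impact=0.3,   # better edge, but unreliable
              reliability_risk=1.5, saturation_risk=0.1, queue_reset_risk=0.1),
}
```

With nontrivial lambdas, the higher-edge but less reliable venue B loses to A; that is the intended behavior, since the objective targets mean plus tail loss rather than fill speed alone.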
Data contract (must-have)
Per child-order and venue-attempt:
- intent_ts, send_ts, ack_ts, fill_ts, cancel_ack_ts
- venue status snapshots (up/degraded/out) with reason code
- reject reason, retry count, reroute count
- pre/post-reroute queue-age proxy
- reliable depth estimate (cancel-adjusted)
- spread, imbalance, microprice drift, realized markout horizon
- completion deadline and residual trajectory
Without per-attempt lineage, failover tax is invisible in TCA.
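One way to pin the data contract down is a per-attempt record type; every field name here is illustrative, mirroring the bullets above rather than any specific schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VenueAttempt:
    """Per child-order, per venue-attempt lineage record."""
    child_order_id: str
    venue: str
    intent_ts: float
    send_ts: float
    ack_ts: Optional[float] = None
    fill_ts: Optional[float] = None
    cancel_ack_ts: Optional[float] = None
    venue_status: str = "up"            # up / degraded / out
    status_reason: Optional[str] = None
    reject_reason: Optional[str] = None
    retry_count: int = 0
    reroute_count: int = 0
    queue_age_pre_s: float = 0.0        # pre-reroute queue-age proxy
    queue_age_post_s: float = 0.0       # post-reroute queue-age proxy
    reliable_depth: float = 0.0         # cancel-adjusted depth estimate
    spread_bps: float = 0.0
    imbalance: float = 0.0
    microprice_drift_bps: float = 0.0
    markout_bps: float = 0.0
    deadline_ts: float = 0.0
    residual_qty: float = 0.0

attempt = VenueAttempt(child_order_id="c1", venue="V1",
                       intent_ts=0.0, send_ts=0.002)
```

Keying records by (child_order_id, venue, attempt) rather than by fill is the point: the failover tax lives in the attempts that did not fill.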
Calibration loop
Intraday (5–10 min)
- Refresh outage probability and state labels
- Recompute OSI/RVS and apply control transitions
Daily
- Refit matched FCT estimates
- Validate tail metrics (q95, CVaR95) by state
- Audit false FAILOVER_SURGE triggers
Weekly
- Rebaseline reliability priors per venue/session
- Review SAFE incidents and rollback decisions
- Promote/demote fallback policy variants
Rollout plan
Shadow mode (1-2 weeks)
- Compute outage states and score penalties, no live action.
Guardrail mode
- Enable DEGRADED controls only (no hard failover throttles yet).
Full mode
- Enable FAILOVER_SURGE/STABILIZING/SAFE with rollback switch.
Governance mode
- Weekly review of outage episodes, FCT trend, and state quality.
Rollback triggers:
- completion rate drops below target for N intervals,
- q95 slippage exceeds outage-adjusted budget,
- SAFE dwell time exceeds expected envelope.
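The three rollback triggers can be checked mechanically each interval; a sketch, where any single trigger is sufficient and all thresholds are operator-set assumptions:

```python
def should_roll_back(completion_rates, target_rate, n_intervals,
                     q95_slippage_bps, outage_adjusted_budget_bps,
                     safe_dwell_s, safe_dwell_envelope_s):
    """True if any rollback trigger fires:
    1) completion below target for the last n_intervals,
    2) q95 slippage over the outage-adjusted budget,
    3) SAFE dwell time beyond its expected envelope."""
    recent = completion_rates[-n_intervals:]
    completion_breach = (len(recent) == n_intervals
                         and all(r < target_rate for r in recent))
    return (completion_breach
            or q95_slippage_bps > outage_adjusted_budget_bps
            or safe_dwell_s > safe_dwell_envelope_s)
```

Requiring N consecutive breached intervals for the completion trigger, rather than a single bad interval, keeps one noisy print from tripping a full rollback.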
Common failure modes
Binary venue health logic
- Treating health as up/down misses the degraded regime where most of the tax accumulates.
No saturation term in router objective
- Failover sends everyone to the same venue and amplifies markout.
Fast reroute without queue-reset accounting
- "Quick reaction" can still be expensive if every reroute restarts queue age.
No hysteresis on recovery
- Premature re-entry into a recovering venue causes flip-flop churn.
Postmortem without per-attempt lineage
- You cannot separate market moves from failover-control mistakes.
Dashboard minimum
- State timeline (NORMAL/DEGRADED/FAILOVER_SURGE/STABILIZING/SAFE)
- OSI, RVS, FDL by venue and session
- Failover branch cost decomposition (delay/reset/saturation/miss)
- q50/q95/CVaR slippage by state
- completion reliability + residual backlog trajectory
- Top incidents with reason codes and action decisions
Practical takeaway
Venue outages are not just connectivity incidents. They are microstructure regime shifts that change executable depth, queue economics, and tail risk.
Model failover as a branch process, price saturation and queue-reset tax explicitly, and let a state controller protect completion and p95 cost before crowding chaos takes over.
One-line implementation mantra
Treat venue health as a probabilistic state, not a binary flag, and pay failover tax in the model before paying it in live capital.