Direct-Feed Failover Flap Slippage Playbook
When Primary/Backup Feed Arbitration Becomes a Hidden Execution Tax
Why this note: Many execution stacks run dual market-data paths (primary and backup direct feeds) and switch between them on packet-loss or latency alarms. The failure mode is not one clean failover but flapping (rapid switching back and forth), which injects state discontinuities, quote-age spikes, and decision jitter that standard slippage models miss.
1) Failure Mode in One Sentence
If feed arbitration can flap faster than your execution controller can stabilize, you will trade on stale state, forfeit queue position, and pay convex catch-up slippage.
2) Cost Decomposition with Feed-Switch Risk
For child action \(a_t\):
\[ \mathbb{E}[IS_t(a_t)] = C_{spread} + C_{impact} + C_{queue} + C_{stale} + C_{switch} + C_{opportunity} \]
Where new hidden terms are:
- \(C_{stale}\): executable-truth drift from quote-age spikes during switch windows
- \(C_{switch}\): policy error introduced by arbitration transitions (book discontinuity, reset, delayed confidence)
A practical branch form:
\[ \mathbb{E}[IS_t] = (1-p_{sw})\,\mu_{stable}(x_t,a_t) + p_{sw}\,\mu_{switch}(x_t,a_t) \]
with \(p_{sw}=\Pr(\text{feed switch in next }\Delta\text{ ms})\).
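To make the branch form concrete, here is a minimal sketch that evaluates the mixture for two candidate actions. The action names and the per-branch costs (in bps) are hypothetical numbers chosen for illustration, not estimates from any real desk.

```python
def expected_is(p_sw: float, mu_stable: float, mu_switch: float) -> float:
    """Branch-form expected IS: (1 - p_sw) * mu_stable + p_sw * mu_switch."""
    if not 0.0 <= p_sw <= 1.0:
        raise ValueError("p_sw must be a probability in [0, 1]")
    return (1.0 - p_sw) * mu_stable + p_sw * mu_switch


def best_action(p_sw: float, costs: dict) -> str:
    """Pick the action with the lowest mixture cost.

    costs maps action name -> (mu_stable, mu_switch), both in bps.
    """
    return min(costs, key=lambda a: expected_is(p_sw, *costs[a]))


# Hypothetical per-branch costs: passive is cheap when the feed is stable
# but expensive if a switch lands while the order is resting.
COSTS = {"passive": (1.0, 9.0), "neutral": (2.0, 4.0)}

print(best_action(0.05, COSTS))  # low hazard
print(best_action(0.20, COSTS))  # elevated hazard
```

With these illustrative numbers the preferred action changes once p_sw passes roughly 1/6, which is the point of the branch form: the ranking of tactics depends on switch hazard, not only on the stable-state cost.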
3) Model the Hazard, Not Just the Outcome
Let \(z_t\) include feed health state:
- sequence-gap pressure
- packet-loss burst score
- feed-latency spread (primary vs backup)
- recent switch count
- book divergence signal
Hazard model example:
\[ \text{logit}(p_{sw}) = \beta_0 + \beta_1\,\mathrm{gapRate}_t + \beta_2\,\mathrm{lossBurst}_t + \beta_3\,\mathrm{latencySkew}_t + \beta_4\,\mathrm{switchCount}_{1s} \]
Then use the predicted \(p_{sw}\) as the mixing weight in the stable/switch slippage estimator when scoring routing decisions.
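The logit form can be sketched in a few lines. The coefficient values below are hypothetical placeholders (a real model would fit them on labeled switch outcomes), and the features are assumed to be standardized so that zero means "typical".

```python
import math

# Hypothetical coefficients for illustration only; not fitted values.
BETA = {
    "intercept": -4.0,
    "gap_rate": 0.8,
    "loss_burst": 1.2,
    "latency_skew": 0.5,
    "switch_count_1s": 1.0,
}


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def switch_hazard(features: dict, beta: dict = BETA) -> float:
    """p_sw = sigmoid(beta_0 + sum_k beta_k * feature_k).

    Raises KeyError if any model feature is missing, so a decision is
    never scored on a partial feed-health snapshot.
    """
    z = beta["intercept"] + sum(
        w * features[name] for name, w in beta.items() if name != "intercept"
    )
    return sigmoid(z)


calm = switch_hazard(
    {"gap_rate": 0.0, "loss_burst": 0.0, "latency_skew": 0.0, "switch_count_1s": 0.0}
)
stressed = switch_hazard(
    {"gap_rate": 1.0, "loss_burst": 1.0, "latency_skew": 1.0, "switch_count_1s": 1.0}
)
print(round(calm, 4), round(stressed, 4))
```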
4) Regime States (Encode Explicitly)
- F0 STABLE_PRIMARY: low hazard, consistent primary feed
- F1 DEGRADED_PRIMARY: rising loss/gap, no switch yet
- F2 FAILOVER_TRANSITION: active switch in progress
- F3 FLAP_STORM: repeated switch-back events in short window
- F4 SAFE_CONTAIN: confidence floor breached; conservative execution only
Use hysteresis + minimum dwell times to avoid control thrash.
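One way to combine hysteresis with a minimum dwell time is sketched below. It covers only the hazard-driven regimes F0 to F3; the F4 confidence-floor override is left out for brevity. The thresholds, the hysteresis margin, and the dwell time are hypothetical values, and the rule "escalate immediately, de-escalate one level at a time" is a design assumption rather than something the regime list prescribes.

```python
from dataclasses import dataclass

REGIMES = (
    "F0_STABLE_PRIMARY",
    "F1_DEGRADED_PRIMARY",
    "F2_FAILOVER_TRANSITION",
    "F3_FLAP_STORM",
)


@dataclass
class RegimeTracker:
    enter: tuple = (0.10, 0.30, 0.60)  # hypothetical entry thresholds on p_sw
    margin: float = 0.05               # hysteresis band below each entry threshold
    min_dwell_ms: int = 500            # minimum time in a regime before stepping down
    level: int = 0
    entered_at_ms: int = 0

    def update(self, now_ms: int, p_sw: float) -> str:
        # Raw level implied by the current hazard alone.
        target = sum(p_sw >= t for t in self.enter)
        if target > self.level:
            # Escalate immediately: reacting late to rising hazard is the costly error.
            self.level, self.entered_at_ms = target, now_ms
        elif target < self.level:
            dwell_ok = now_ms - self.entered_at_ms >= self.min_dwell_ms
            exit_threshold = self.enter[self.level - 1] - self.margin
            if dwell_ok and p_sw < exit_threshold:
                # Step down one level at a time to avoid control thrash.
                self.level, self.entered_at_ms = self.level - 1, now_ms
        return REGIMES[self.level]


tracker = RegimeTracker()
for now_ms, p_sw in [(0, 0.05), (10, 0.35), (20, 0.28), (600, 0.28), (700, 0.20)]:
    print(now_ms, p_sw, tracker.update(now_ms, p_sw))
```

In the trace, the hazard dips below the F2 entry threshold at 20 ms and again at 600 ms, but the regime stays in F2: first because the dwell time has not elapsed, then because 0.28 is still inside the hysteresis band. It steps down only once p_sw falls below 0.25.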
5) Features Worth Adding to the Slippage Stack
Feed integrity
- primary_gap_rate_100ms
- backup_gap_rate_100ms
- primary_backup_latency_skew_us
- feed_switch_count_1s
- time_since_last_switch_ms
Book continuity
- top_of_book_discontinuity_score
- microprice_jump_at_switch
- depth_rebuild_time_ms
- quote_age_p95_ms (post-switch window)
Execution sensitivity
- queue_age_ms
- remaining_deadline_slack_ms
- child_retry_burst_1s
- venue_ack_jitter_p95
A model in this family should fail promotion if feed_switch_count_1s or time_since_last_switch_ms is missing at decision time.
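A promotion check for the required switch features could look like the sketch below. It assumes decision-time snapshots are available as plain dictionaries keyed by the feed-integrity feature names, and that a missing value shows up as an absent key, None, or NaN; the function names are hypothetical.

```python
import math

REQUIRED_FEATURES = ("feed_switch_count_1s", "time_since_last_switch_ms")


def missing_features(snapshot: dict) -> list:
    """Return the required features that are absent, None, or NaN."""
    missing = []
    for name in REQUIRED_FEATURES:
        value = snapshot.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            missing.append(name)
    return missing


def passes_feature_gate(snapshots: list) -> bool:
    """Promotion gate: every decision-time snapshot must carry both features."""
    return all(not missing_features(s) for s in snapshots)


good = {"feed_switch_count_1s": 2, "time_since_last_switch_ms": 140.0}
bad = {"feed_switch_count_1s": 2, "time_since_last_switch_ms": float("nan")}
print(passes_feature_gate([good]), passes_feature_gate([good, bad]))
```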
6) New Diagnostics / KPIs
- FSR (Feed Switch Rate): switches per second per symbol bucket
- FFR (Feed Flap Rate): rate of switch-backs occurring within \(T\) ms of a failover
- SAS (Switch-Age Spike): quote-age increase around switch events
- SMD (Switch Markout Delta): markout(high-switch-hazard) - markout(low-switch-hazard)
- QLRS (Queue-Loss-at-Resync Share): % IS explained by queue deterioration during F2/F3
If average IS is flat but SMD and QLRS are rising, the desk is silently paying control-plane slippage.
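Three of these KPIs can be computed from simple logs, as in the sketch below. It makes several assumptions that the definitions above leave open: there are exactly two feeds (so any switch shortly after another is a switch-back), FFR is expressed per second, markouts are signed so that positive is favorable, and the high/low hazard split uses a single hypothetical cutoff.

```python
from statistics import mean


def feed_switch_rate(switch_ts_ms: list, window_ms: int) -> float:
    """FSR: switches per second over the observation window."""
    return len(switch_ts_ms) / (window_ms / 1000.0)


def feed_flap_rate(switch_ts_ms: list, window_ms: int, t_ms: int) -> float:
    """FFR: switches per second that follow the previous switch by less than t_ms."""
    ts = sorted(switch_ts_ms)
    flaps = sum(1 for prev, cur in zip(ts, ts[1:]) if cur - prev < t_ms)
    return flaps / (window_ms / 1000.0)


def switch_markout_delta(records: list, hazard_cut: float):
    """SMD: mean markout at high hazard minus mean markout at low hazard.

    records is a list of (p_sw, markout_bps) pairs; returns None when
    either bucket is empty.
    """
    high = [m for p, m in records if p >= hazard_cut]
    low = [m for p, m in records if p < hazard_cut]
    if not high or not low:
        return None
    return mean(high) - mean(low)


# Hypothetical switch timestamps (ms) for one symbol bucket over 5 seconds.
switches = [100, 150, 900, 940, 2600]
records = [(0.8, -3.0), (0.7, -1.0), (0.1, -0.5), (0.05, -0.5)]

fsr = feed_switch_rate(switches, window_ms=5000)
ffr = feed_flap_rate(switches, window_ms=5000, t_ms=100)
smd = switch_markout_delta(records, hazard_cut=0.5)
print(fsr, ffr, smd)
```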
7) Control Policy (Hazard-Aware)
STABLE_PRIMARY (\(p_{sw}<\tau_1\))
- normal passive/neutral routing
DEGRADED_PRIMARY (\(\tau_1\le p_{sw}<\tau_2\))
- reduce passive timeout
- downweight fragile queues
- tighten cancel/replace budget
FAILOVER_TRANSITION (\(\tau_2\le p_{sw}<\tau_3\))
- avoid tactics requiring ultra-fresh L1 certainty
- prefer smaller child slices with bounded urgency
- require stronger confirmation for aggressive takes
FLAP_STORM (\(p_{sw}\ge\tau_3\) or FFR breach)
- freeze high-churn repricing logic
- cap participation + widen safety margins
- switch to completion-preserving SAFE profile when confidence floor is breached
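The threshold policy above maps naturally onto a profile lookup. The sketch below is illustrative only: the tau values and every profile parameter are hypothetical placeholders, and checking the confidence floor before anything else is an assumption about precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    passive_timeout_ms: int   # how long a passive child may rest
    max_participation: float  # cap on share of traded volume
    allow_repricing: bool     # whether high-churn cancel/replace logic may run


# All parameter values are hypothetical placeholders.
PROFILES = {
    "STABLE_PRIMARY": Profile("STABLE_PRIMARY", 2000, 0.15, True),
    "DEGRADED_PRIMARY": Profile("DEGRADED_PRIMARY", 800, 0.12, True),
    "FAILOVER_TRANSITION": Profile("FAILOVER_TRANSITION", 400, 0.08, True),
    "FLAP_STORM": Profile("FLAP_STORM", 200, 0.04, False),
    "SAFE": Profile("SAFE", 200, 0.04, False),
}


def select_profile(p_sw: float, ffr_breach: bool, confidence_ok: bool,
                   tau: tuple = (0.10, 0.30, 0.60)) -> Profile:
    """Map switch hazard and flap/confidence flags to an execution profile."""
    if not confidence_ok:
        return PROFILES["SAFE"]  # confidence floor breached
    if ffr_breach or p_sw >= tau[2]:
        return PROFILES["FLAP_STORM"]
    if p_sw >= tau[1]:
        return PROFILES["FAILOVER_TRANSITION"]
    if p_sw >= tau[0]:
        return PROFILES["DEGRADED_PRIMARY"]
    return PROFILES["STABLE_PRIMARY"]


print(select_profile(0.45, ffr_breach=False, confidence_ok=True).name)
```

In practice the raw hazard would pass through a hysteresis and dwell-time layer first, so the profile does not oscillate with every tick of p_sw.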
8) Labeling & Training Contract
For each child decision, store:
- Regime label (F0..F4)
- Switch outcome in next 100/250/1000ms
- Realized IS + multi-horizon markouts
- Fill path (fill/cancel/replace/reject)
- Feed-health snapshot (gaps/loss/skew/age)
Train three heads:
- median cost (P50)
- tail cost (P90/P97.5)
- switch hazard (short horizon)
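The note does not specify training losses. A common choice, assumed here, is the quantile (pinball) loss for the P50 and tail heads and the log loss for the switch-hazard head. A minimal per-sample sketch:

```python
import math


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Quantile loss for target quantile q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q),
    so a P90 head is penalized far more for predicting too low a cost.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def log_loss(switched: bool, p_sw: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for the hazard head (one sample)."""
    p = min(max(p_sw, eps), 1.0 - eps)
    return -(math.log(p) if switched else math.log(1.0 - p))


# Realized IS of 10 bps against predictions of 8 and 12 bps.
print(pinball_loss(10.0, 8.0, q=0.9), pinball_loss(10.0, 12.0, q=0.9))
print(pinball_loss(10.0, 8.0, q=0.5), pinball_loss(10.0, 12.0, q=0.5))
```

At q = 0.9 the under-prediction costs about nine times as much as the same-sized over-prediction, while at q = 0.5 the two are penalized equally.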
9) Rollout Blueprint
- Shadow switch-hazard + mixture predictions for 10–14 days
- Compare by symbol liquidity bucket + time-of-day
- Canary only for F2/F3 logic first
- Promote when all hold:
  - better tail coverage in switch windows
  - lower QLRS and SMD
  - no completion collapse under flap episodes
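The promotion criteria can be encoded as an explicit all-must-hold gate. In the sketch below the metric keys, the completion floor, and the reading of "lower SMD" as smaller absolute value are all assumptions made for illustration.

```python
def promotion_decision(baseline: dict, candidate: dict,
                       min_flap_completion: float = 0.95):
    """Return (promote, checks) where promote is True only if every check holds.

    Hypothetical metric keys, measured on shadow/canary data:
      tail_coverage_switch  - tail-quantile coverage inside switch windows
      qlrs                  - Queue-Loss-at-Resync Share
      smd                   - Switch Markout Delta (compared by magnitude)
      flap_completion_rate  - completion rate during flap episodes
    """
    checks = {
        "tail_coverage": candidate["tail_coverage_switch"] > baseline["tail_coverage_switch"],
        "qlrs": candidate["qlrs"] < baseline["qlrs"],
        "smd": abs(candidate["smd"]) < abs(baseline["smd"]),
        "completion": candidate["flap_completion_rate"] >= min_flap_completion,
    }
    return all(checks.values()), checks


baseline = {"tail_coverage_switch": 0.82, "qlrs": 0.20, "smd": -1.5,
            "flap_completion_rate": 0.97}
candidate = {"tail_coverage_switch": 0.90, "qlrs": 0.14, "smd": -0.9,
             "flap_completion_rate": 0.96}

promote, checks = promotion_decision(baseline, candidate)
print(promote, checks)
```

Returning the individual checks alongside the verdict makes it easy to log which criterion blocked a promotion.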
Rollback triggers:
- repeated false-positive SAFE transitions
- underfill spike above threshold
- unresolved feed-clock integrity alerts
10) Common Anti-Patterns
- Treating failover as a one-off event instead of a recurrent flap process
- Optimizing only p50 while F3 tails deteriorate
- Switching feeds without switching execution policy profile
- Ignoring quote-age spikes right after feed resync
- No hysteresis, causing policy and feed-control oscillation together
11) Fast Implementation Checklist
[ ] Add feed-switch telemetry to decision-time feature store
[ ] Train short-horizon switch hazard model
[ ] Upgrade cost model to stable/switch mixture form
[ ] Add F0..F4 state machine with hysteresis + dwell timers
[ ] Gate deployment on switch-window tail coverage + QLRS/SMD
[ ] Canary with strict notional caps in F2/F3 only
TL;DR
Dual-feed redundancy is not automatically safer for slippage. Without flap-aware hazard modeling and state-dependent controls, feed arbitration itself becomes a repeatable hidden cost center.