Multicast Gap-Fill Storms as a Hidden Slippage Engine

A Practical Playbook for Replay-Queue Congestion, Stale-Book Risk, and Safe Execution Controls

Why this note: In fast markets, packet loss is rarely catastrophic by itself. The real damage comes from the recovery path: sequence gaps, bursty gap-fill replay, and temporary book incoherence that silently leaks basis points.

1) Core Failure Pattern

A typical failure chain:

Incremental multicast packet loss (or short microburst overrun).
Sequence gap detected (seq_received > seq_expected).
Gap request / replay path engaged (unicast recovery or snapshot stitch).
Recovery queue grows while new incremental updates continue arriving.
Strategy acts on stale/incomplete local book or over-throttles and misses liquidity.
Post-recovery catch-up causes bursty execution (queue priority reset + adverse selection).

This is a branching slippage process, not a single latency spike.
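The detection step in the chain above can be sketched as follows. This is a minimal illustration, not a venue-specific implementation: the class and field names (GapDetector, Gap, open_ts_ns) are hypothetical, it assumes one message per sequence number, and it simply advances past a gap instead of buffering out-of-order messages the way a production handler would.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gap:
    first_missing: int  # first sequence number that was not received
    gap_size: int       # number of missing messages
    open_ts_ns: int     # local receive time of the message that revealed the gap


class GapDetector:
    """Tracks the next expected sequence number for one feed channel."""

    def __init__(self, first_seq: int = 1) -> None:
        self.seq_expected = first_seq

    def on_message(self, seq_received: int, recv_ts_ns: int) -> Optional[Gap]:
        # Duplicates and late replays are below the expected number: no gap.
        if seq_received < self.seq_expected:
            return None
        gap = None
        if seq_received > self.seq_expected:
            gap = Gap(
                first_missing=self.seq_expected,
                gap_size=seq_received - self.seq_expected,
                open_ts_ns=recv_ts_ns,
            )
        # Simplification: move on immediately; recovery is handled elsewhere.
        self.seq_expected = seq_received + 1
        return gap


det = GapDetector(first_seq=1)
det.on_message(1, recv_ts_ns=100)          # in order, returns None
gap = det.on_message(4, recv_ts_ns=200)    # messages 2 and 3 are missing
```

The returned Gap is what would trigger the replay request and start the gap-duration clock.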

2) Cost Decomposition That Matches Reality

For a parent order over horizon (H):

[ \mathbb{E}[IS] = IS_{base} + p_{stale}\,\Delta_{stale} + p_{delay}\,\Delta_{delay} + p_{burst}\,\Delta_{burst} ]

Where:

(\Delta_{stale}): cost uplift from quoting/routing on stale book state.
(\Delta_{delay}): missed-opportunity cost from cautious throttling/freeze.
(\Delta_{burst}): catch-up aggression tax after replay backlog clears.

If you only track average decision latency, you miss all three branch costs.
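A small worked example makes the decomposition concrete. All numbers below are hypothetical placeholders chosen for easy arithmetic, not measurements; the function name expected_is_bps is likewise illustrative.

```python
def expected_is_bps(is_base, branches):
    """Expected implementation shortfall in bps.

    branches maps a branch name to (probability, uplift in bps).
    """
    return is_base + sum(p * delta for p, delta in branches.values())


# Hypothetical inputs: base cost of 4 bps plus three recovery branches.
branches = {
    "stale": (0.02, 15.0),  # p_stale, Delta_stale
    "delay": (0.03, 8.0),   # p_delay, Delta_delay
    "burst": (0.01, 20.0),  # p_burst, Delta_burst
}
total = expected_is_bps(4.0, branches)
# 4.0 + 0.30 + 0.24 + 0.20 = 4.74 bps (up to floating-point rounding)
```

Under these assumed inputs the recovery branches add about 0.74 bps on top of the base cost, none of which appears in an average-latency metric.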

3) Data Contract You Need (Per Venue, Per Symbol)

Capture these fields as first-class telemetry:

seq_expected, seq_received, gap_size
gap_open_ts, gap_closed_ts, gap_duration_ms
replay_queue_depth, replay_throughput_msgs
book_age_ms (local event-time staleness)
snapshot_stitch_count, resync_count
execution linkage: child order decision timestamp, book version at decision, fill/markout ladder

Without explicit sequence/replay telemetry, slippage attribution is guesswork.
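One way to pin the contract down is as typed records. The sketch below uses the field names listed above; the record names (GapEpisode, DecisionRecord), the nanosecond timestamp unit, and the choice to derive gap_size and gap_duration_ms are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GapEpisode:
    venue: str
    symbol: str
    seq_expected: int
    seq_received: int
    gap_open_ts: int                      # ns since epoch (assumed unit)
    gap_closed_ts: Optional[int] = None   # None while the gap is still open
    replay_queue_depth: int = 0
    snapshot_stitch_count: int = 0
    resync_count: int = 0

    @property
    def gap_size(self) -> int:
        return self.seq_received - self.seq_expected

    @property
    def gap_duration_ms(self) -> Optional[float]:
        if self.gap_closed_ts is None:
            return None
        return (self.gap_closed_ts - self.gap_open_ts) / 1e6


@dataclass
class DecisionRecord:
    """Execution linkage: ties a child order to the book state it saw."""
    child_order_id: str
    decision_ts: int      # ns since epoch (assumed unit)
    book_version: int     # local book version at decision time
    book_age_ms: float    # event-time staleness at decision time


episode = GapEpisode("VENUE_A", "ABC", seq_expected=101, seq_received=105,
                     gap_open_ts=1_000_000_000)
episode.gap_closed_ts = 1_003_500_000   # gap closed 3.5 ms later
```

Deriving gap_size and gap_duration_ms from the raw fields avoids storing values that can drift out of sync with the timestamps.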

4) Minimal KPI Stack (Operator-Friendly)

Integrity KPIs

GIR (Gap Incidence Rate): gaps per 10k market-data messages.
GDS p95: 95th-percentile gap duration (ms).
RQD p95: 95th-percentile replay queue depth.
BAS p95 (Book Age at Signal): 95th-percentile book staleness (ms) at decision time.

Execution KPIs

SGM (Stale-Quote Markout): markout uplift when book_age_ms > threshold.
DMC (Delay Miss Cost): opportunity-cost uplift during degrade/freeze windows.
CBI (Catch-up Burst Index): concentration of child orders in post-recovery window.

Track all KPIs by regime (open/close/event windows). Aggregate-only dashboards hide the tails.
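As a sketch of how the integrity KPIs might be computed per regime: the helper names below are hypothetical, and the p95 uses the nearest-rank definition, which is one of several reasonable percentile conventions.

```python
from collections import defaultdict


def gir(gap_count, message_count):
    """Gap Incidence Rate: gaps per 10k market-data messages."""
    if message_count == 0:
        return 0.0
    return 10_000 * gap_count / message_count


def p95(values):
    """Nearest-rank 95th percentile of a non-empty collection."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("p95 of empty sample")
    # Integer form of ceil(0.95 * n), avoiding floating-point edge cases.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_regime(samples):
    """samples: iterable of (regime, value) pairs, e.g. ("open", 3.2)."""
    buckets = defaultdict(list)
    for regime, value in samples:
        buckets[regime].append(value)
    return {regime: p95(vals) for regime, vals in buckets.items()}


gap_durations_ms = [("open", 1.0), ("open", 5.0), ("close", 2.0)]
tails = p95_by_regime(gap_durations_ms)
```

Bucketing before taking the percentile is what keeps a quiet midday session from diluting the open and close tails.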

5) Modeling Stack

Layer A — Gap/Recovery Hazard Model

Model probability and duration of replay episodes:

Inputs: message rate, symbol volatility, NIC drops, CPU softirq pressure, replay server RTT.
Outputs: (\hat p_{gap}), (\hat p_{long-gap}), expected recovery time.

Simple start: logistic + quantile regression. Better: regime-conditioned survival model.

Layer B — Integrity-Conditioned Slippage Uplift

Predict incremental cost given feed health state:

[ \Delta IS = f(\text{book\_age}, \text{gap\_duration}, \text{replay\_depth}, \text{vol regime}, \text{urgency}) ]

Use quantile heads (p50/p90/p99) to keep tail risk explicit.
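Quantile heads are usually trained and evaluated with the pinball (quantile) loss. The note does not prescribe a loss, so treat this as an assumed but standard choice; the function name is illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1).

    Under-prediction is penalised by q, over-prediction by (1 - q),
    so a p99 head is punished heavily for underestimating tail cost.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and equal length")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Realised uplift of 10 bps vs. a p90 prediction of 8 bps (under-prediction).
under = pinball_loss([10.0], [8.0], q=0.9)
# Realised uplift of 8 bps vs. a p90 prediction of 10 bps (over-prediction).
over = pinball_loss([8.0], [10.0], q=0.9)
```

With q = 0.9, the same 2 bps miss costs roughly nine times more when the model under-predicts, which is the asymmetry that keeps the tail explicit.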

Layer C — Action Policy Controller

Pick tactic state from feed-health posterior:

GREEN: normal routing.
AMBER: reduce passive exposure, shorten stale-signal TTL.
RED: restrict to robust venues/tactics; cap participation.
SAFE_RECOVER: temporary conservative mode until integrity recovers.
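The state selection could look like the sketch below. The thresholds, the input set, and the interpretation of SAFE_RECOVER as "gap closed but replay backlog still draining" are all assumptions for illustration; real budgets would be calibrated per venue and symbol.

```python
from enum import Enum


class FeedState(Enum):
    GREEN = "GREEN"
    AMBER = "AMBER"
    RED = "RED"
    SAFE_RECOVER = "SAFE_RECOVER"


def select_state(p_long_gap, book_age_ms, gap_open, recovering,
                 age_budget_ms=5.0, amber_p=0.05, red_p=0.25):
    """Map feed-health inputs to a tactic state (hypothetical thresholds)."""
    stale = book_age_ms > age_budget_ms
    # An open gap with a stale book or a high long-gap probability is worst.
    if gap_open and (stale or p_long_gap >= red_p):
        return FeedState.RED
    # Gap closed but backlog still draining: stay conservative.
    if recovering:
        return FeedState.SAFE_RECOVER
    if gap_open or stale or p_long_gap >= amber_p:
        return FeedState.AMBER
    return FeedState.GREEN


state = select_state(p_long_gap=0.30, book_age_ms=1.0,
                     gap_open=True, recovering=False)
```

Checking RED before SAFE_RECOVER means a fresh gap during recovery escalates instead of being masked by the recovery flag.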

6) Execution Controls That Actually Work

Staleness TTL guard: invalidate microstructure features when book_age_ms exceeds venue/symbol budget.
Integrity-weighted aggressiveness: lower passive size when replay queue depth is elevated.
Burst limiter on recovery: avoid immediate over-catch-up after gap closure.
Venue confidence weighting: down-rank venues with ongoing replay backlog.
Hysteresis on state transitions: prevent GREEN↔RED flapping during noisy recovery.
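Three of these controls are simple enough to sketch directly. The names, the linear ramp, and all default parameters below are hypothetical; the point is the shape of each guard, not the specific numbers.

```python
def features_valid(book_age_ms, ttl_budget_ms):
    """Staleness TTL guard: microstructure features expire past the budget."""
    return book_age_ms <= ttl_budget_ms


def recovery_rate_cap(ms_since_gap_close, normal_rate,
                      ramp_ms=500.0, floor_frac=0.25):
    """Burst limiter: ramp the child-order rate linearly back to normal."""
    if ms_since_gap_close >= ramp_ms:
        return normal_rate
    elapsed = max(ms_since_gap_close, 0.0)
    frac = floor_frac + (1.0 - floor_frac) * elapsed / ramp_ms
    return normal_rate * frac


class HysteresisGate:
    """Escalate immediately; step down one level only after `dwell`
    consecutive calmer observations. Levels: 0=GREEN, 1=AMBER, 2=RED."""

    def __init__(self, dwell=3):
        self.level = 0
        self.dwell = dwell
        self._calm = 0

    def update(self, observed):
        if observed >= self.level:
            self.level = observed
            self._calm = 0
        else:
            self._calm += 1
            if self._calm >= self.dwell:
                self.level -= 1
                self._calm = 0
        return self.level


gate = HysteresisGate(dwell=2)
# One calm tick after RED is not enough to de-escalate.
levels = [gate.update(obs) for obs in (2, 0, 2, 0, 0)]
```

In this run the gate holds RED through the noisy middle of the sequence and only steps down to AMBER after two consecutive calm observations.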

7) Infrastructure Controls (Joint Desk + Infra Responsibility)

Dedicated queueing and CPU isolation for market-data path.
Replay/gap-fill traffic shaping so recovery does not starve live incremental stream.
Capacity tests with synthetic packet-loss bursts and controlled replay storms.
Session-level runbooks: when to snapshot-resync vs continue incremental recovery.

Execution quality degrades when infra treats replay as a purely network concern.

8) Validation & Rollout Plan

Backfill attribution: label historical gap/recovery episodes and compute branch costs.
Shadow scoring: run integrity-conditioned cost model without changing routing.
Canary by symbol bucket: high-liquidity names first, then stressed names.
Guardrails: rollback if completion drops or p95 IS worsens beyond budget.
Weekly recalibration: replay-path behavior drifts with infra/config changes.
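The rollback guardrail can be expressed as a single predicate. The budget values and the function name are placeholders, not recommendations.

```python
def should_rollback(baseline_completion, canary_completion,
                    baseline_p95_is_bps, canary_p95_is_bps,
                    max_completion_drop=0.01, max_p95_worsening_bps=1.0):
    """Return True if the canary breaches either guardrail.

    Completion values are fractions in [0, 1]; IS values are in bps.
    """
    completion_drop = baseline_completion - canary_completion
    p95_worsening = canary_p95_is_bps - baseline_p95_is_bps
    return (completion_drop > max_completion_drop
            or p95_worsening > max_p95_worsening_bps)


# Completion fell by 3 points: breach, even though p95 IS barely moved.
breach = should_rollback(0.98, 0.95, 10.0, 10.2)
```

Either condition alone triggers rollback, so a canary cannot buy a better tail by quietly completing less.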

9) Common Mistakes

Treating sequence-gap events as “rare infra noise” and excluding them from TCA.
Using wall-clock latency only, ignoring event-time book freshness.
Over-freezing during recovery and paying large delay opportunity cost.
Over-catching up right after recovery and paying burst adverse selection.
Not linking child decisions to exact book/version integrity state.

10) Reference Pointers

Kissell, R. — The Science of Algorithmic Trading and Portfolio Management (execution/TCA foundations).
Bouchaud, J.-P. et al. — market microstructure and impact literature (queue/impact realism).
Nasdaq TotalView-ITCH / MoldUDP64 technical specs (sequence-gap + replay mechanics).
LOBSTER papers (book dynamics and event-time modeling context).

TL;DR

Packet loss itself is not the full slippage story. The expensive part is the recovery branch: stale-book decisions, delay misses, and post-recovery burst costs. Model feed-integrity state explicitly, route with state-aware controls, and treat replay telemetry as a first-class slippage feature, not just an infra metric.