Multicast Gap-Fill Storms as a Hidden Slippage Engine

2026-03-27 · finance

Multicast Gap-Fill Storms as a Hidden Slippage Engine

A Practical Playbook for Replay-Queue Congestion, Stale-Book Risk, and Safe Execution Controls

Why this note: In fast markets, packet loss is rarely catastrophic by itself. The real damage comes from the recovery path: sequence gaps, bursty gap-fill replay, and temporary book incoherence that silently leaks basis points.


1) Core Failure Pattern

A typical failure chain:

  1. Incremental multicast packet loss (or short microburst overrun).
  2. Sequence gap detected (expected_seq != received_seq).
  3. Gap request / replay path engaged (unicast recovery or snapshot stitch).
  4. Recovery queue grows while new incremental updates continue arriving.
  5. Strategy acts on stale/incomplete local book or over-throttles and misses liquidity.
  6. Post-recovery catch-up causes bursty execution (queue priority reset + adverse selection).

This is a branching slippage process, not a single latency spike.


2) Cost Decomposition That Matches Reality

For a parent order over horizon (H):

[ \mathbb{E}[IS] = IS_{base}

Where:

If you only track average decision latency, you miss all three branch costs.


3) Data Contract You Need (Per Venue, Per Symbol)

Capture these fields as first-class telemetry:

Without explicit sequence/replay telemetry, slippage attribution is guesswork.


4) Minimal KPI Stack (Operator-Friendly)

Integrity KPIs

Execution KPIs

Track all KPIs by regime (open/close/event windows). Aggregate-only dashboards hide the tails.


5) Modeling Stack

Layer A — Gap/Recovery Hazard Model

Model probability and duration of replay episodes:

Simple start: logistic + quantile regression. Better: regime-conditioned survival model.

Layer B — Integrity-Conditioned Slippage Uplift

Predict incremental cost given feed health state:

[ \Delta IS = f(\text{book_age}, \text{gap_duration}, \text{replay_depth}, \text{vol regime}, \text{urgency}) ]

Use quantile heads (p50/p90/p99) to keep tail risk explicit.

Layer C — Action Policy Controller

Pick tactic state from feed-health posterior:


6) Execution Controls That Actually Work

  1. Staleness TTL guard: invalidate microstructure features when book_age_ms exceeds venue/symbol budget.
  2. Integrity-weighted aggressiveness: lower passive size when replay queue depth is elevated.
  3. Burst limiter on recovery: avoid immediate over-catch-up after gap closure.
  4. Venue confidence weighting: down-rank venues with ongoing replay backlog.
  5. Hysteresis on state transitions: prevent GREEN↔RED flapping during noisy recovery.

7) Infrastructure Controls (Joint Desk + Infra Responsibility)

Execution quality degrades when infra treats replay as a purely network concern.


8) Validation & Rollout Plan

  1. Backfill attribution: label historical gap/recovery episodes and compute branch costs.
  2. Shadow scoring: run integrity-conditioned cost model without changing routing.
  3. Canary by symbol bucket: high-liquidity names first, then stressed names.
  4. Guardrails: rollback if completion drops or p95 IS worsens beyond budget.
  5. Weekly recalibration: replay-path behavior drifts with infra/config changes.

9) Common Mistakes


10) Reference Pointers


TL;DR

Packet loss itself is not the full slippage story. The expensive part is the recovery branch: stale-book decisions, delay misses, and post-recovery burst costs. Model feed-integrity state explicitly, route with state-aware controls, and treat replay telemetry as a first-class slippage feature—not just an infra metric.