BGP Convergence & Anycast Path-Flap RTT-Bimodality Slippage Playbook

2026-03-25 · finance

Date: 2026-03-25
Category: research
Audience: execution operators running multi-region / anycast-connected trading infrastructure


Why this note

Many execution stacks assume network latency is a single noisy distribution around a stable mean.

During inter-domain routing churn (BGP withdrawal/re-announcement, maintenance misstep, upstream instability, anycast path shift), latency often becomes bimodal rather than merely “higher.”

That regime change can create hidden slippage through the three branches costed in section 1: stale arrivals, retries, and deadline-forced execution.

The key mistake: treating route instability as generic latency noise instead of a separate execution state.


1) Cost decomposition with route-instability term

For a child-order decision at time \(t\):

\[ \mathbb{E}[C_t] = \mathbb{E}[C_{\text{base}} \mid a_t, s_t] + \lambda_r \cdot \mathbb{E}[C_{\text{route}} \mid z_t] \]

Where:

Practical branch model:

\[ C_{\text{route}} \approx p_{\text{stale}} L_{\text{stale}} + p_{\text{retry}} L_{\text{retry}} + p_{\text{deadline}} L_{\text{deadline}} \]
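
A minimal worked sketch of the branch model, assuming each p is a branch probability under the current route state, each L is the expected loss of that branch in basis points, and lambda_r is a scalar weight on the route term. Those readings, the helper names, and all numbers are illustrative assumptions, not calibrated values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteBranch:
    prob: float      # assumed: probability the branch occurs in this route state
    loss_bps: float  # assumed: expected loss if it occurs, in basis points


def route_cost_bps(stale: RouteBranch, retry: RouteBranch, deadline: RouteBranch) -> float:
    """C_route ~= p_stale*L_stale + p_retry*L_retry + p_deadline*L_deadline."""
    return sum(b.prob * b.loss_bps for b in (stale, retry, deadline))


def expected_cost_bps(base_bps: float, lambda_r: float, c_route_bps: float) -> float:
    """E[C_t] = E[C_base | a_t, s_t] + lambda_r * E[C_route | z_t]."""
    return base_bps + lambda_r * c_route_bps


# Illustrative numbers only.
c_route = route_cost_bps(
    stale=RouteBranch(prob=0.20, loss_bps=2.0),
    retry=RouteBranch(prob=0.10, loss_bps=1.5),
    deadline=RouteBranch(prob=0.05, loss_bps=8.0),
)
total = expected_cost_bps(base_bps=1.2, lambda_r=1.0, c_route_bps=c_route)
```

With these inputs the route term is about 0.95 bps and the total about 2.15 bps, so a rare but expensive deadline branch contributes as much as the far more frequent stale branch.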


2) Observability contract (point-in-time)

A) Network path telemetry

B) Execution timeline telemetry

C) Session-level context

Without path-event markers, the model confuses structural regime shifts with random latency noise.


3) Modeling stack

A) Route state classifier

Define discrete states:

  1. PATH_STABLE — unimodal latency, low update churn
  2. PATH_SHIFTING — emerging second mode, rising jitter/retransmits
  3. PATH_UNSTABLE — persistent bimodality or rapid path flips
  4. PATH_RECOVERING — variance decays but tail still elevated

Implementation options:
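
One possible option is a small rule-based state machine with hysteresis. The sketch below assumes two hypothetical boolean inputs per tick (`bimodal`, `tail_elevated`) produced upstream, and a single hypothetical `dwell` parameter; the transition rules are an illustration, not a prescribed design.

```python
from enum import Enum


class PathState(Enum):
    PATH_STABLE = 1
    PATH_SHIFTING = 2
    PATH_UNSTABLE = 3
    PATH_RECOVERING = 4


class RouteStateClassifier:
    """Rule-based classifier; every transition except STABLE->SHIFTING needs `dwell` ticks."""

    def __init__(self, dwell: int = 3):
        self.state = PathState.PATH_STABLE
        self.dwell = dwell
        self._bimodal_run = 0  # consecutive ticks with a second mode present
        self._clear_run = 0    # consecutive ticks without a second mode
        self._calm_run = 0     # consecutive ticks with no second mode and a normal tail

    def update(self, bimodal: bool, tail_elevated: bool) -> PathState:
        self._bimodal_run = self._bimodal_run + 1 if bimodal else 0
        self._clear_run = self._clear_run + 1 if not bimodal else 0
        calm = not bimodal and not tail_elevated
        self._calm_run = self._calm_run + 1 if calm else 0

        s = self.state
        if s is PathState.PATH_STABLE:
            if bimodal:
                self.state = PathState.PATH_SHIFTING
        elif s is PathState.PATH_SHIFTING:
            if self._bimodal_run >= self.dwell:
                self.state = PathState.PATH_UNSTABLE
            elif self._calm_run >= self.dwell:
                self.state = PathState.PATH_STABLE
        elif s is PathState.PATH_UNSTABLE:
            if self._clear_run >= self.dwell:
                self.state = PathState.PATH_RECOVERING
        else:  # PATH_RECOVERING
            if bimodal:
                self.state = PathState.PATH_UNSTABLE
            elif self._calm_run >= self.dwell:
                self.state = PathState.PATH_STABLE
        return self.state


clf = RouteStateClassifier(dwell=2)
ticks = [(True, True), (True, True), (False, True),
         (False, True), (False, False), (False, False)]
trace = [clf.update(b, t).name for b, t in ticks]
```

Escalation into PATH_SHIFTING is immediate while every later transition waits for a dwell, which biases the classifier toward caution.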

B) Bimodality detector

Use simple online indicators:

Trigger only if bimodality persists beyond a minimum dwell time, to avoid false positives from microbursts.
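
As one illustrative indicator (swapped in here, not necessarily the one this note has in mind), Sarle's bimodality coefficient can be computed over a rolling RTT window and gated by a dwell counter. The sketch uses population moments without the finite-sample correction; the window size, dwell length, and the 5/9 threshold (the coefficient of a uniform distribution) are assumptions to tune.

```python
from collections import deque


def bimodality_coefficient(samples) -> float:
    """(skewness^2 + 1) / kurtosis, using population moments; 0.0 for a constant window."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # non-excess kurtosis
    return (skew ** 2 + 1) / kurt


class BimodalityDetector:
    """Fires only after the coefficient stays above threshold for `dwell` consecutive samples."""

    def __init__(self, window: int = 200, threshold: float = 5 / 9, dwell: int = 3):
        self.rtts = deque(maxlen=window)
        self.threshold = threshold
        self.dwell = dwell
        self._run = 0

    def update(self, rtt_ms: float) -> bool:
        self.rtts.append(rtt_ms)
        if len(self.rtts) < self.rtts.maxlen:
            return False  # not enough data yet
        above = bimodality_coefficient(self.rtts) > self.threshold
        self._run = self._run + 1 if above else 0
        return self._run >= self.dwell


# Usage: RTT alternating between two paths (10 ms and 30 ms).
det = BimodalityDetector(window=8, dwell=3)
flags = [det.update(rtt) for rtt in [10.0, 30.0] * 5]
```

A heavily skewed unimodal distribution can also push this coefficient above the threshold, so in practice it would likely be paired with a second check such as the gap between cluster centers.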

C) Stale-arrival model

Estimate:

\[ P(\text{arrival age} > \tau_{\text{alpha}} \mid z_t, \text{venue}, \text{urgency}) \]

This probability should directly influence passive-vs-aggressive decisioning.
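
The simplest estimator of this probability is an empirical frequency per conditioning cell. The sketch below assumes arrival age is measured in milliseconds, treats the staleness threshold as a fixed input, and returns no estimate for thin cells; the class name, keying scheme, and `min_samples` value are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class StaleArrivalEstimator:
    """Empirical P(arrival age > tau | route state, venue, urgency)."""

    def __init__(self, tau_alpha_ms: float, min_samples: int = 30):
        self.tau_alpha_ms = tau_alpha_ms
        self.min_samples = min_samples
        self._counts = defaultdict(lambda: [0, 0])  # key -> [stale count, total count]

    def observe(self, route_state: str, venue: str, urgency: str,
                arrival_age_ms: float) -> None:
        cell = self._counts[(route_state, venue, urgency)]
        if arrival_age_ms > self.tau_alpha_ms:
            cell[0] += 1
        cell[1] += 1

    def p_stale(self, route_state: str, venue: str, urgency: str) -> Optional[float]:
        stale, total = self._counts.get((route_state, venue, urgency), (0, 0))
        if total < self.min_samples:
            return None  # too few observations to trust the frequency
        return stale / total


# Illustrative usage with a 5 ms threshold and a tiny sample requirement.
est = StaleArrivalEstimator(tau_alpha_ms=5.0, min_samples=4)
for age in [1.0, 6.0, 7.0, 2.0]:
    est.observe("PATH_UNSTABLE", "VENUE_A", "high", age)
p = est.p_stale("PATH_UNSTABLE", "VENUE_A", "high")
```

Returning `None` for sparse cells forces the caller to choose an explicit fallback rather than silently acting on a noisy estimate.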

D) Deadline convexity model

Model expected incremental cost of waiting near schedule end:

\[ \Delta C_{\text{deadline}}(u) = C_{\text{forced}}(u) - C_{\text{smooth}}(u) \]

where \(u\) is the residual urgency budget. Under PATH_UNSTABLE, \(\Delta C_{\text{deadline}}\) usually steepens.
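
A rough way to estimate this difference empirically is to bucket historical fills by residual urgency budget and compare mean cost of forced versus smooth completions in each bucket. The record layout `(u_bucket, forced, cost_bps)` and the sample numbers are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def deadline_convexity(fills):
    """Return {u_bucket: mean forced cost - mean smooth cost} for buckets with both kinds."""
    forced, smooth = defaultdict(list), defaultdict(list)
    for u_bucket, is_forced, cost_bps in fills:
        (forced if is_forced else smooth)[u_bucket].append(cost_bps)
    shared = sorted(forced.keys() & smooth.keys())
    return {u: mean(forced[u]) - mean(smooth[u]) for u in shared}


# Illustrative fills: (u bucket, forced?, cost in bps).
fills = [
    (0.1, True, 6.0), (0.1, False, 2.0),
    (0.5, True, 3.0), (0.5, False, 2.0),
    (0.9, False, 1.0),  # no forced fills in this bucket, so it is skipped
]
delta_c = deadline_convexity(fills)
```

Running the same computation separately per route state would show whether the curve steepens under PATH_UNSTABLE.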


4) Live policy controller (state-aware)

GREEN — PATH_STABLE

AMBER — PATH_SHIFTING

RED — PATH_UNSTABLE

RECOVERY — PATH_RECOVERING

No direct RED→GREEN jumps; require recovery dwell and KPI normalization.
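
The transition rule can be sketched as a guard around the tier mapping. The code below assumes a hypothetical boolean `kpis_normal` supplied by monitoring and a recovery dwell counted in controller ticks. It also routes every exit from RED through RECOVERY, which is a stricter choice than the rule above requires.

```python
from enum import Enum


class Tier(Enum):
    GREEN = "GREEN"
    AMBER = "AMBER"
    RED = "RED"
    RECOVERY = "RECOVERY"


TARGET_TIER = {
    "PATH_STABLE": Tier.GREEN,
    "PATH_SHIFTING": Tier.AMBER,
    "PATH_UNSTABLE": Tier.RED,
    "PATH_RECOVERING": Tier.RECOVERY,
}


class PolicyController:
    def __init__(self, recovery_dwell_ticks: int = 5):
        self.tier = Tier.GREEN
        self.recovery_dwell_ticks = recovery_dwell_ticks
        self._recovery_ticks = 0  # consecutive ticks spent in RECOVERY

    def step(self, path_state: str, kpis_normal: bool) -> Tier:
        target = TARGET_TIER[path_state]
        # Assumption: any de-escalation from RED goes through RECOVERY first.
        if self.tier is Tier.RED and target in (Tier.GREEN, Tier.AMBER):
            target = Tier.RECOVERY
        # RECOVERY -> GREEN needs both the dwell and normalized KPIs.
        if self.tier is Tier.RECOVERY and target is Tier.GREEN:
            dwell_ok = self._recovery_ticks >= self.recovery_dwell_ticks
            if not (dwell_ok and kpis_normal):
                target = Tier.RECOVERY
        self._recovery_ticks = self._recovery_ticks + 1 if target is Tier.RECOVERY else 0
        self.tier = target
        return target


ctl = PolicyController(recovery_dwell_ticks=2)
steps = [("PATH_UNSTABLE", False), ("PATH_STABLE", True), ("PATH_STABLE", True),
         ("PATH_STABLE", False), ("PATH_STABLE", True)]
tiers = [ctl.step(state, ok).name for state, ok in steps]
```

In the usage trace the classifier reports PATH_STABLE immediately after the incident, yet the controller holds RECOVERY until the dwell has elapsed and KPIs read normal.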


5) Production KPIs

Useful alerts:


6) Validation ladder

  1. Historical relabeling
    Join execution logs with path/route event markers; compare state-conditional slippage.

  2. Counterfactual replay
    Re-run decisions with AMBER/RED controls and estimate avoided tail cost.

  3. Fault injection drills
    Inject synthetic latency bimodality/retransmit bursts in staging.

  4. Shadow then canary
    Shadow the state classifier first, then enable controls on a small flow slice.

Primary success criterion: reduced q95/q99 slippage with bounded completion-risk degradation.
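
Step 1 of the ladder can be sketched as a point-in-time (as-of) join: each fill takes the most recent route-state marker at or before its timestamp, so no future information leaks into the label. The tuple layouts, the `UNLABELED` default, and the nearest-rank quantile are assumptions chosen to keep the example dependency-free.

```python
import bisect
import math
from collections import defaultdict


def label_fills(state_changes, fills, default="UNLABELED"):
    """state_changes: sorted [(ts, state)]; fills: [(ts, slippage_bps)].

    Returns {state: [slippage_bps, ...]} using only markers at or before each fill.
    """
    times = [ts for ts, _ in state_changes]
    by_state = defaultdict(list)
    for ts, slippage_bps in fills:
        i = bisect.bisect_right(times, ts) - 1  # last marker with marker_ts <= ts
        state = state_changes[i][1] if i >= 0 else default
        by_state[state].append(slippage_bps)
    return by_state


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile: smallest value with at least q of the sample at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Illustrative data: timestamps in arbitrary units, slippage in bps.
state_changes = [(100, "PATH_STABLE"), (200, "PATH_UNSTABLE"), (300, "PATH_RECOVERING")]
fills = [(50, 0.5), (150, 1.0), (200, 4.0), (250, 6.0), (310, 2.0)]
labeled = label_fills(state_changes, fills)
q95_unstable = nearest_rank_quantile(labeled["PATH_UNSTABLE"], 0.95)
```

Comparing `nearest_rank_quantile` at 0.95 and 0.99 across states, before and after the controls, maps directly onto the success criterion above.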


7) 14-day implementation plan

Days 1-2
Instrument path-event + execution-age fields end-to-end.

Days 3-4
Implement RBS/PSR/SAF calculators and dashboards.

Days 5-7
Build initial state classifier with hysteresis.

Days 8-9
Train stale-arrival and deadline-convexity components.

Days 10-11
Integrate AMBER/RED policy guards in shadow mode.

Days 12-13
Canary enablement with strict rollback gates.

Day 14
Finalize incident runbook and postmortem template.


Common mistakes

  1. Using mean RTT as primary health signal
    Mean hides the second mode that actually drives tail slippage.

  2. Immediate aggression ramp on partial recovery
    Recovery without dwell causes repeated oscillation and cost spikes.

  3. No separation between market toxicity and path toxicity
    Different causes require different controls.

  4. Retry storms without pacing
    Naive retries convert transient path noise into self-inflicted burst impact.
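
For mistake 4, a common pacing pattern (named here as an illustration, not as this note's prescribed fix) combines full-jitter exponential backoff with a hard retry budget per time window. All parameter values and names below are hypothetical.

```python
import random
from collections import deque


def backoff_delay_ms(attempt, base_ms=5.0, cap_ms=200.0, rng=random):
    """Full-jitter backoff: uniform delay in [0, min(cap_ms, base_ms * 2**attempt)]."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


class RetryBudget:
    """Allows at most `max_retries` retries within any sliding `window_ms`."""

    def __init__(self, max_retries: int, window_ms: float):
        self.max_retries = max_retries
        self.window_ms = window_ms
        self._sent = deque()  # timestamps of retries still inside the window

    def allow(self, now_ms: float) -> bool:
        # Drop retries that have aged out of the window.
        while self._sent and self._sent[0] <= now_ms - self.window_ms:
            self._sent.popleft()
        if len(self._sent) >= self.max_retries:
            return False  # budget exhausted: hold or escalate instead of retrying
        self._sent.append(now_ms)
        return True


budget = RetryBudget(max_retries=2, window_ms=100.0)
decisions = [budget.allow(t) for t in (0.0, 10.0, 20.0, 101.0)]
```

The jitter spreads retries from many child orders so they do not re-synchronize on the same flapping path, while the budget caps the worst-case burst regardless of how many orders are retrying.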


Bottom line

BGP/anycast path churn is not just “infra noise.”

In execution systems it behaves like a distinct slippage regime with:

Model it explicitly, gate policy by route state, and enforce conservative recovery transitions. That is how you stop network path chaos from quietly becoming basis-point leakage.
