Neighbor-Cache GC & ARP/ND Resolution Stall Slippage Playbook

2026-03-22 · finance

Neighbor-Cache GC & ARP/ND Resolution Stall Slippage Playbook

Date: 2026-03-22
Category: research
Scope: How L2 address-resolution pressure (ARP/ND misses + neighbor-table garbage collection) creates hidden dispatch latency and execution slippage

Why this matters

Execution teams usually watch venue latency, queue position, and risk-gateway timing.

But a recurring blind spot is host/network edge state: when neighbor tables fill, entries churn, and ARP/ND lookups become bursty, the first packets of critical order flows can stall in kernel resolution paths.

This is rarely a hard outage. It is a micro-latency tax that appears as:

In short: your model says “send now,” but L2 resolution says “wait a bit.”


Failure mechanism in one timeline

  1. Burst of destination churn (gateway failover, path change, reconnect wave, or broad symbol fan-out).
  2. ARP/ND lookups spike; unresolved neighbors increase.
  3. Neighbor table approaches GC thresholds; collector starts aggressive cleanup.
  4. Recently useful entries are evicted; misses rise further.
  5. First packets wait on neighbor resolution/retransmit windows.
  6. Child orders launch in clumps instead of intended cadence.
  7. Fill quality deteriorates (more spread crossing, larger adverse markouts).

Pathology: cadence distortion from resolution stalls, not average-link throughput loss.


Extend IS decomposition with neighbor-resolution tax

[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{nbr}}_{\text{L2 resolution tax}} ]

Operational approximation:

[ IS_{nbr,t} \approx a\cdot NUR_t + b\cdot ARL95_t + c\cdot GCB_t + d\cdot UFR_t ]

Where:

Objective: detect and contain tail slippage inflation before rejects/timeouts become visible.


Online signals to collect

1) Neighbor Utilization Ratio (NUR)

[ NUR = \frac{#neighbor\ entries}{\text{configured effective capacity}} ]

Track by host, namespace, and egress interface.

2) Unresolved-Flow Ratio (UFR)

[ UFR = \frac{#flows\ waiting\ on\ neighbor\ resolution}{#active\ execution\ flows} ]

A direct measure of first-packet gating risk.

3) Resolution Tail Latency (ARL95/99)

p95/p99 delay from packet enqueue (or send intent) to resolved L2 next-hop mapping.

4) GC Burst Index (GCB)

Rate/cluster score of neighbor-GC events per short bucket (e.g., 100ms/1s).

High GCB indicates eviction/relearning churn loops.

5) ARP/ND Retransmit Ratio (ARR)

[ ARR = \frac{#ARP/ND\ retries}{#ARP/ND\ requests} ]

Helps separate simple cache misses from unstable L2 reachability.

6) First-Packet Stall p95 (FPS95)

Tail delay for first packet of order-critical flow after routing decision.

7) Decision-to-Wire Delay (DWD95)

Tail delay from child-order decision timestamp to actual NIC wire timestamp.

8) Neighbor-Stress Markout Gap (NSMG)

Matched-cohort post-fill markout difference between neighbor_stress=1 and baseline windows.


Minimal production model

Two-stage setup:

  1. Neighbor-stress classifier

    • Inputs: NUR, UFR, ARL95, GCB, ARR, recent failover/reconnect counters
    • Output: (P(\text{NBR_STRESS}))
  2. Conditional cost model

    • Predict (E[IS]), (q95(IS)), completion risk under stress state

Key interaction term:

[ \Delta IS \sim \beta_1,urgency + \beta_2,NBR_STRESS + \beta_3,(urgency \times NBR_STRESS) ]

Urgent flow pays disproportionately when dispatch cadence is distorted.


Controller states (GREEN → RED)

GREEN — NEIGHBOR_STABLE

YELLOW — RESOLUTION_PRESSURE

ORANGE — GC_CHURN_ACTIVE

RED — RESOLUTION_STALL_TAX

Use hysteresis and minimum dwell time to avoid oscillatory state flips.


Engineering mitigations (highest ROI first)

  1. Reduce destination churn on critical paths

    • keep stable gateway affinity where possible
    • avoid unnecessary fan-out bursts
  2. Neighbor-table capacity & GC policy tuning

    • set thresholds with burst headroom
    • validate under synthetic failover/reconnect stress
  3. Pre-warm known next hops

    • proactively refresh critical neighbor entries before open/auction windows
  4. Separate hot and cold egress domains

    • isolate execution-critical traffic from high-churn auxiliary traffic
  5. Failover choreography

    • stagger reroute waves to avoid synchronized cache misses
  6. Unified telemetry

    • dashboard NUR/UFR/ARL/GCB beside slippage and completion KPIs

Validation workflow

  1. Label neighbor_stress windows from thresholded NUR/UFR/ARL/GCB.
  2. Build matched cohorts by symbol, spread, volatility, urgency, and time bucket.
  3. Estimate incremental (\Delta E[IS]), (\Delta q95(IS)), completion miss risk.
  4. Shadow controller decisions before live activation.
  5. Promote only if out-of-sample windows show tail-cost reduction without fill-rate regression.

KPIs to track in production

Success criterion: improved tail execution quality via cadence stability, not merely fewer ARP/ND warnings.


Pseudocode sketch

features = collect_neighbor_features()  # NUR, UFR, ARL95, GCB, ARR, DWD95
p_stress = nbr_stress_model.predict_proba(features)
state = decode_neighbor_state(p_stress, features)

if state == "GREEN":
    params = normal_execution()
elif state == "YELLOW":
    params = mild_clip_trim_and_jitter()
elif state == "ORANGE":
    params = cap_catchup_reduce_fanout()
else:
    params = containment_mode_with_urgent_guards()

submit_orders(params)
log_neighbor_stress_telemetry(state, features, params)

Bottom line

If your stack already models market microstructure but ignores neighbor-resolution dynamics, you likely have an unpriced infra slippage component in the tails.

Treat ARP/ND neighbor pressure as a first-class execution state variable, and convert invisible micro-latency into explicit control policy.