Neighbor-Cache GC & ARP/ND Resolution Stall Slippage Playbook
Date: 2026-03-22
Category: research
Scope: How L2 address-resolution pressure (ARP/ND misses + neighbor-table garbage collection) creates hidden dispatch latency and execution slippage
Why this matters
Execution teams usually watch venue latency, queue position, and risk-gateway timing.
But a recurring blind spot is host/network edge state: when neighbor tables fill, entries churn, and ARP/ND lookups become bursty, the first packets of critical order flows can stall in kernel resolution paths.
This is rarely a hard outage. It is a micro-latency tax that appears as:
- delayed first-send on otherwise healthy sockets,
- clustered order release,
- queue-priority decay during urgency,
- worse tail implementation shortfall.
In short: your model says “send now,” but L2 resolution says “wait a bit.”
Failure mechanism in one timeline
- Burst of destination churn (gateway failover, path change, reconnect wave, or broad symbol fan-out).
- ARP/ND lookups spike; unresolved neighbors increase.
- Neighbor table approaches GC thresholds; collector starts aggressive cleanup.
- Recently useful entries are evicted; misses rise further.
- First packets wait on neighbor resolution/retransmit windows.
- Child orders launch in clumps instead of intended cadence.
- Fill quality deteriorates (more spread crossing, larger adverse markouts).
Pathology: cadence distortion from resolution stalls, not average-link throughput loss.
Extend IS decomposition with neighbor-resolution tax
[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{nbr}}_{\text{L2 resolution tax}} ]
Operational approximation:
[ IS_{nbr,t} \approx a\cdot NUR_t + b\cdot ARL95_t + c\cdot GCB_t + d\cdot UFR_t ]
Where:
- (NUR): Neighbor Utilization Ratio
- (ARL95): ARP/ND Resolution Latency p95
- (GCB): GC Burst intensity
- (UFR): Unresolved-Flow Ratio
Objective: detect and contain tail slippage inflation before rejects/timeouts become visible.
Online signals to collect
1) Neighbor Utilization Ratio (NUR)
[ NUR = \frac{#neighbor\ entries}{\text{configured effective capacity}} ]
Track by host, namespace, and egress interface.
2) Unresolved-Flow Ratio (UFR)
[ UFR = \frac{#flows\ waiting\ on\ neighbor\ resolution}{#active\ execution\ flows} ]
A direct measure of first-packet gating risk.
3) Resolution Tail Latency (ARL95/99)
p95/p99 delay from packet enqueue (or send intent) to resolved L2 next-hop mapping.
4) GC Burst Index (GCB)
Rate/cluster score of neighbor-GC events per short bucket (e.g., 100ms/1s).
High GCB indicates eviction/relearning churn loops.
5) ARP/ND Retransmit Ratio (ARR)
[ ARR = \frac{#ARP/ND\ retries}{#ARP/ND\ requests} ]
Helps separate simple cache misses from unstable L2 reachability.
6) First-Packet Stall p95 (FPS95)
Tail delay for first packet of order-critical flow after routing decision.
7) Decision-to-Wire Delay (DWD95)
Tail delay from child-order decision timestamp to actual NIC wire timestamp.
8) Neighbor-Stress Markout Gap (NSMG)
Matched-cohort post-fill markout difference between neighbor_stress=1 and baseline windows.
Minimal production model
Two-stage setup:
Neighbor-stress classifier
- Inputs: NUR, UFR, ARL95, GCB, ARR, recent failover/reconnect counters
- Output: (P(\text{NBR_STRESS}))
Conditional cost model
- Predict (E[IS]), (q95(IS)), completion risk under stress state
Key interaction term:
[ \Delta IS \sim \beta_1,urgency + \beta_2,NBR_STRESS + \beta_3,(urgency \times NBR_STRESS) ]
Urgent flow pays disproportionately when dispatch cadence is distorted.
Controller states (GREEN → RED)
GREEN — NEIGHBOR_STABLE
- Low NUR/UFR, normal ARL tails
- Standard execution policies
YELLOW — RESOLUTION_PRESSURE
- NUR rising, intermittent ARL95 expansion
- Actions:
- trim child clip sizes (e.g., -10%)
- enable mild launch jitter to prevent synchronized bursts
- prefer routes/paths with warm neighbor state
ORANGE — GC_CHURN_ACTIVE
- GCB spikes + UFR increase + DWD95 drift
- Actions:
- cap aggressive catch-up behavior
- reduce concurrent new-destination fan-out
- elevate routing preference for stable egress domains
- temporarily downweight tactics sensitive to first-packet delay
RED — RESOLUTION_STALL_TAX
- persistent ARL99 blowout, repeated unresolved spikes
- Actions:
- containment mode for non-urgent flow
- hard guardrails on urgent override sizes
- incident lane: table-capacity/GC tuning + L2 topology remediation
Use hysteresis and minimum dwell time to avoid oscillatory state flips.
Engineering mitigations (highest ROI first)
Reduce destination churn on critical paths
- keep stable gateway affinity where possible
- avoid unnecessary fan-out bursts
Neighbor-table capacity & GC policy tuning
- set thresholds with burst headroom
- validate under synthetic failover/reconnect stress
Pre-warm known next hops
- proactively refresh critical neighbor entries before open/auction windows
Separate hot and cold egress domains
- isolate execution-critical traffic from high-churn auxiliary traffic
Failover choreography
- stagger reroute waves to avoid synchronized cache misses
Unified telemetry
- dashboard NUR/UFR/ARL/GCB beside slippage and completion KPIs
Validation workflow
- Label
neighbor_stresswindows from thresholded NUR/UFR/ARL/GCB. - Build matched cohorts by symbol, spread, volatility, urgency, and time bucket.
- Estimate incremental (\Delta E[IS]), (\Delta q95(IS)), completion miss risk.
- Shadow controller decisions before live activation.
- Promote only if out-of-sample windows show tail-cost reduction without fill-rate regression.
KPIs to track in production
- NUR, UFR
- ARL95 / ARL99
- GCB, ARR
- FPS95, DWD95
- q95 implementation shortfall (stress vs baseline)
- completion rate under stress
- NSMG (markout gap)
Success criterion: improved tail execution quality via cadence stability, not merely fewer ARP/ND warnings.
Pseudocode sketch
features = collect_neighbor_features() # NUR, UFR, ARL95, GCB, ARR, DWD95
p_stress = nbr_stress_model.predict_proba(features)
state = decode_neighbor_state(p_stress, features)
if state == "GREEN":
params = normal_execution()
elif state == "YELLOW":
params = mild_clip_trim_and_jitter()
elif state == "ORANGE":
params = cap_catchup_reduce_fanout()
else:
params = containment_mode_with_urgent_guards()
submit_orders(params)
log_neighbor_stress_telemetry(state, features, params)
Bottom line
If your stack already models market microstructure but ignores neighbor-resolution dynamics, you likely have an unpriced infra slippage component in the tails.
Treat ARP/ND neighbor pressure as a first-class execution state variable, and convert invisible micro-latency into explicit control policy.