Ephemeral Port Exhaustion & Connect-Retry Burst Slippage Playbook
Date: 2026-03-22
Category: research
Scope: How outbound socket churn (TIME_WAIT buildup, connect failures, retry bursts) leaks execution quality
Why this matters
Most slippage stacks model market state and strategy urgency well.
But many desks still leak bps when execution-path networking silently shifts from steady pooled sessions to high-churn connect/reconnect behavior.
A common failure mode:
- ephemeral port pool saturates,
- connect() latencies tail out,
- retries bunch,
- child orders launch late and clumped,
- queue priority decays exactly when urgency rises.
This is not a full outage. It is a transport-state tax that shows up first in tail slippage.
Mechanism in one timeline
- Order gateway loses keepalive stability (or rotates connections too aggressively).
- New outbound connections spike to venues/risk/market-data dependencies.
- Local ephemeral ports and TIME_WAIT slots accumulate.
- connect() starts failing (EADDRNOTAVAIL/timeouts) or stalling.
- Retry logic amplifies burstiness (often synchronized backoff schedules).
- Dispatch timing drifts from intended cadence.
- Fill quality degrades: more chasing, more spread crossing, worse markouts.
The key pathology is cadence distortion, not just average latency increase.
Add a port-pressure term to IS decomposition
[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{port}}_{\text{infra transport tax}} ]
Operational approximation:
[ IS_{port,t} \approx a\cdot PPR_t + b\cdot TWR_t + c\cdot CFR_t + d\cdot RBI_t ]
Where:
- (PPR): Port Pressure Ratio
- (TWR): TIME_WAIT Ratio
- (CFR): Connect Failure Rate
- (RBI): Retry Burst Index
Goal: forecast and contain tail cost inflation before it appears as visible reject storms.
Online signals to collect
1) Port Pressure Ratio (PPR)
[ PPR = \frac{\text{active outbound sockets in ephemeral range}}{\text{ephemeral range capacity}} ]
2) TIME_WAIT Ratio (TWR)
[ TWR = \frac{\#\,\text{TIME\_WAIT sockets}}{\text{ephemeral range capacity}} ]
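A minimal sketch of how PPR and TWR could be computed from the definitions above. It assumes the socket counts have already been gathered elsewhere (for example from `ss` output) and that the range string is the contents of `/proc/sys/net/ipv4/ip_local_port_range`. The function names and the sample counts are hypothetical.

```python
def parse_port_range(text: str) -> tuple[int, int]:
    """Parse the two whitespace-separated integers of ip_local_port_range."""
    low, high = (int(tok) for tok in text.split())
    if low > high:
        raise ValueError("port range low bound exceeds high bound")
    return low, high


def range_capacity(low: int, high: int) -> int:
    """Number of ports in the inclusive range [low, high]."""
    return high - low + 1


def port_ratios(active_in_range: int, time_wait: int, capacity: int) -> tuple[float, float]:
    """Return (PPR, TWR): both counts are divided by the ephemeral range capacity."""
    return active_in_range / capacity, time_wait / capacity


low, high = parse_port_range("32768\t60999")  # a common Linux default range
cap = range_capacity(low, high)
# Illustrative counts only, not measurements.
ppr, twr = port_ratios(active_in_range=7058, time_wait=14116, capacity=cap)
print(cap, ppr, twr)
```

Host-wide counts are the simplest starting point. Because local ports are generally consumed per destination tuple, a per-destination breakdown of the same ratios may be more informative where one venue or dependency dominates the churn.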
3) Connect Failure Rate (CFR)
[ CFR = \frac{\#(\text{connect timeouts} + \text{EADDRNOTAVAIL} + \text{ECONNREFUSED} + \text{resets})}{\#\,\text{connect attempts}} ]
4) Connect Tail Latency (CTL95/99)
p95/p99 of socket connect establishment on execution-critical paths.
5) Retry Burst Index (RBI)
Concentration of retries within short windows (e.g., 100-500ms buckets).
High RBI means retry policy is creating synthetic microbursts.
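The text does not fix a formula for RBI, so the sketch below picks one possible definition as an assumption: the peak-to-mean ratio of retry counts per bucket within a window. The function name, the 250 ms default bucket, and the sample timestamps are all hypothetical.

```python
from collections import Counter


def retry_burst_index(retry_ts_ms: list[int], window_start_ms: int,
                      window_ms: int, bucket_ms: int = 250) -> float:
    """Peak-to-mean ratio of retry counts per bucket over one window.

    1.0 means retries are spread evenly across buckets; larger values mean
    they cluster. Returns 0.0 when the window contains no retries.
    """
    n_buckets = window_ms // bucket_ms
    if n_buckets <= 0:
        raise ValueError("window_ms must be at least bucket_ms")
    counts = Counter(
        (ts - window_start_ms) // bucket_ms
        for ts in retry_ts_ms
        if 0 <= ts - window_start_ms < n_buckets * bucket_ms
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mean_per_bucket = total / n_buckets  # empty buckets count toward the mean
    return max(counts.values()) / mean_per_bucket


# Eight retries in a 2 s window: evenly spaced vs. bunched into one bucket.
even = retry_burst_index([0, 250, 500, 750, 1000, 1250, 1500, 1750], 0, 2000)
bunched = retry_burst_index([1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070], 0, 2000)
print(even, bunched)
```

Under this definition the index ranges from 1 (perfectly even) up to the number of buckets (all retries in a single bucket), which makes thresholds easy to reason about. Other concentration measures would serve equally well.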
6) Decision-to-Send Delay (DSD95)
Tail delay from child-order decision timestamp to actual send attempt.
7) Port-Stress Markout Gap (PSMG)
Matched-cohort post-fill return gap between port_stress=1 and baseline windows.
Minimal production model
Two-stage design:
1) Port-stress classifier
- Inputs: PPR, TWR, CFR, CTL95, retry queue depth, reconnect rate
- Output: (P(\text{PORT\_STRESS}))
2) Cost model conditioned on stress
- Predict (E[IS]), (q95(IS)), completion risk
Useful interaction to keep:
[ \Delta IS \sim \beta_1\,\text{urgency} + \beta_2\,\text{PORT\_STRESS} + \beta_3\,(\text{urgency} \times \text{PORT\_STRESS}) ]
Reason: urgency amplifies cadence distortion cost.
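To make the interaction term concrete, here is a small sketch under a simplifying assumption: urgency is binarized (urgent vs. calm) and an intercept is included. In that saturated two-by-two case the regression coefficients reduce to differences of cell means, so no fitting library is needed. The function name and the rows are synthetic illustrations, not measured data.

```python
from statistics import mean


def interaction_effect(rows: list[tuple[int, int, float]]) -> dict[str, float]:
    """rows: (urgent, port_stress, is_bps) with binary flags.

    With two binary regressors plus their product, OLS coefficients equal
    differences of the four cell means.
    """
    cell = {
        (u, s): mean(y for uu, ss, y in rows if (uu, ss) == (u, s))
        for u in (0, 1) for s in (0, 1)
    }
    return {
        "beta_1_urgency": cell[1, 0] - cell[0, 0],
        "beta_2_stress": cell[0, 1] - cell[0, 0],
        # Extra cost of stress when urgent, beyond its cost when calm.
        "beta_3_interaction": (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0]),
    }


# Synthetic illustration only, in bps.
rows = [
    (0, 0, 1.0), (0, 0, 3.0),    # calm, no stress
    (1, 0, 4.0), (1, 0, 6.0),    # urgent, no stress
    (0, 1, 3.0), (0, 1, 5.0),    # calm, stress
    (1, 1, 10.0), (1, 1, 12.0),  # urgent, stress
]
betas = interaction_effect(rows)
print(betas)
```

A positive interaction coefficient is what the "urgency amplifies cadence distortion" claim predicts. With continuous urgency or additional controls, a proper regression would replace the cell-mean shortcut.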
Controller states
GREEN — SESSION_STABLE
- Low PPR/TWR, low CFR, normal CTL tails
- Standard tactic set
YELLOW — PORT_PRESSURE_RISING
- PPR/TWR drift upward, mild CTL95 expansion
- Actions:
  - reduce connection churn (enforce keepalive/pool reuse)
  - trim child clip size 10-15%
  - add bounded launch jitter to avoid synchronized retries
ORANGE — RETRY_BURST_ACTIVE
- CFR up + RBI spike + DSD95 deterioration
- Actions:
  - hard cap aggressive catch-up
  - token-bucket retry budget per destination
  - prefer already-warm gateway/venue channels
  - temporarily lower concurrent new-session attempts
RED — PORT_EXHAUSTION_TAX
- repeated connect failures, persistent tail blowout
- Actions:
  - safe containment for non-urgent flow
  - freeze optional reconnection churn sources
  - incident lane: port-range/connection-policy remediation
  - require manual approval for urgency overrides
Use hysteresis + minimum dwell time to avoid control flapping.
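One way hysteresis and a minimum dwell time could be combined is sketched below. It is a simplification that drives the state from the stress probability alone; the thresholds, the 30-second dwell, and the class name are hypothetical. Escalation is immediate, while de-escalation moves one level at a time and only after the dwell time has elapsed.

```python
from dataclasses import dataclass

STATES = ["GREEN", "YELLOW", "ORANGE", "RED"]
# Hypothetical thresholds on P(PORT_STRESS): a state is entered at ENTER[i]
# and left downward only once the score drops below EXIT[i] (< ENTER[i]).
ENTER = [0.0, 0.30, 0.60, 0.85]
EXIT = [0.0, 0.20, 0.50, 0.75]


@dataclass
class PortStateController:
    min_dwell_s: float = 30.0
    state: int = 0
    entered_at: float = 0.0

    def update(self, p_stress: float, now: float) -> str:
        # Highest state whose entry threshold is currently met.
        target_up = max(i for i, th in enumerate(ENTER) if p_stress >= th)
        if target_up > self.state:
            # Escalate immediately: protecting execution comes first.
            self._move(target_up, now)
        elif (p_stress < EXIT[self.state]
              and now - self.entered_at >= self.min_dwell_s):
            # De-escalate one level, and only after the minimum dwell time.
            self._move(self.state - 1, now)
        return STATES[self.state]

    def _move(self, new_state: int, now: float) -> None:
        self.state = new_state
        self.entered_at = now


ctl = PortStateController()
samples = [(0, 0.10), (5, 0.65), (10, 0.55), (20, 0.40),
           (40, 0.40), (50, 0.10), (75, 0.10)]  # (time in s, p_stress)
trace = [ctl.update(p, t) for t, p in samples]
print(trace)
```

In the sample trace, the score dipping to 0.55 does not leave ORANGE (it sits inside the hysteresis band), and the drop to 0.40 at t=20 is ignored until the dwell time has passed. The gap between entry and exit thresholds and the dwell length are tuning choices to validate in shadow mode.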
Engineering mitigations (high ROI first)
Connection lifecycle discipline first
- Prefer long-lived pooled channels over per-order short sessions.
Retry policy hardening
- jittered exponential backoff + retry token budgets.
- never allow synchronized immediate retries across workers.
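The two retry-hardening ideas above can be sketched together: a full-jitter exponential backoff and a token-bucket retry budget. This is an illustration under assumed parameters (base delay, cap, refill rate, burst size); the names `backoff_delay` and `RetryBudget` are hypothetical, and the caller is assumed to pass timestamps from a monotonic clock.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 0.05, cap_s: float = 2.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


class RetryBudget:
    """Token bucket: each retry spends one token; tokens refill at a fixed rate."""

    def __init__(self, rate_per_s: float, burst: float) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def try_acquire(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: drop or defer this retry


budget = RetryBudget(rate_per_s=2.0, burst=3.0)
granted = [budget.try_acquire(now=0.0) for _ in range(5)]  # 5 retries at t=0
later = budget.try_acquire(now=0.5)                        # one token refilled
delays = [backoff_delay(a, rng=random.Random(7)) for a in range(8)]
print(granted, later)
```

Keeping one `RetryBudget` per destination (for example in a dict keyed by venue) bounds the retry rate each dependency can see, while the jitter keeps workers that failed at the same moment from retrying at the same moment.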
Ephemeral range capacity planning
- set ip_local_port_range with measured headroom for bursts.
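A back-of-envelope sizing sketch for the port range follows. It assumes the local side closes connections, each closed connection holds its port in TIME_WAIT for 60 seconds (the usual Linux value), no TIME_WAIT reuse is in effect, and ports are consumed per source IP and destination (IP, port). The function name and the 50% headroom factor are hypothetical.

```python
def sustainable_connect_rate(low: int, high: int, time_wait_s: float = 60.0,
                             headroom: float = 0.5) -> float:
    """Rough ceiling on new connections per second to ONE destination
    (ip, port) from ONE source IP, keeping a fraction of the range free.
    """
    capacity = high - low + 1           # ports in the inclusive range
    return headroom * capacity / time_wait_s


# Example with a common default range of 32768-60999.
rate = sustainable_connect_rate(32768, 60999)
print(round(rate, 1))
```

If measured connect rates to a single venue approach this figure during bursts, that is a signal to fix churn first, then consider widening the range or sharding egress.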
TIME_WAIT-aware design
- reduce unnecessary close/reopen loops.
- evaluate kernel/socket tuning only after protocol-level fixes.
Source-IP / egress sharding where appropriate
- distribute outbound tuple pressure when architecture supports it.
Execution SLO + transport SLO unification
- dashboard PPR/TWR/CFR next to slippage tails and completion metrics.
Validation workflow
- Label port_stress windows from thresholded PPR/TWR/CFR/CTL95.
- Build matched cohorts by symbol, spread, vol, urgency, and session bucket.
- Estimate incremental (\Delta E[IS]), (\Delta q95(IS)), completion miss rate.
- Run controller in shadow mode first.
- Promote only if out-of-sample weeks show tail-cost reduction without completion regression.
KPIs
- PPR / TWR
- CFR
- CTL95 / CTL99
- RBI
- DSD95
- q95 implementation shortfall (stress vs baseline)
- completion rate under stress
- PSMG (markout gap)
Success = lower tail slippage and better cadence stability, not just fewer connect errors.
Pseudocode sketch
features = collect_transport_features()  # PPR, TWR, CFR, CTL95, RBI, DSD95
p_stress = port_stress_model.predict_proba(features)
state = decode_port_state(p_stress, features)

if state == "GREEN":
    params = normal_execution()
elif state == "YELLOW":
    params = reduce_churn_and_trim_clip()
elif state == "ORANGE":
    params = cap_retry_and_prefer_warm_paths()
else:  # RED
    params = safe_containment_mode()

submit_orders(params)
log_port_stress_telemetry(state, features, params)
Anti-footgun rules
- Don’t hide connect failures inside one average latency metric.
- Don’t rely on fast retry loops without jitter and budgets.
- Don’t scale worker count without revisiting outbound tuple/port capacity.
- Don’t treat TIME_WAIT explosions as harmless housekeeping.
- Don’t promote mitigation policies without matched-cohort impact validation.
References (starting points)
- Linux kernel networking docs (ip_local_port_range, socket/TCP behavior)
- RFC 9293 (TCP) for connection lifecycle context
- Dean & Barroso (2013), The Tail at Scale
- Linux observability stack docs (ss, netstat replacements, eBPF socket telemetry)
Bottom line
Ephemeral-port and reconnect churn is an under-modeled execution tax.
Treat port pressure as a first-class state variable, stabilize connection lifecycle, and gate execution aggressiveness by transport health before retry bursts quietly convert alpha into tail slippage.