Ephemeral Port Exhaustion & Connect-Retry Burst Slippage Playbook

Date: 2026-03-22
Category: research
Scope: How outbound socket churn (TIME_WAIT buildup, connect failures, retry bursts) leaks execution quality

Why this matters

Most slippage stacks model market state and strategy urgency well.

But many desks still leak bps when execution-path networking silently shifts from steady pooled sessions to high-churn connect/reconnect behavior.

A common failure mode:

ephemeral port pool saturates,
connect() latencies tail out,
retries bunch,
child orders launch late and clumped,
queue priority decays exactly when urgency rises.

This is not a full outage. It is a transport-state tax that shows up first in tail slippage.

Mechanism in one timeline

Order gateway loses keepalive stability (or rotates connections too aggressively).
New outbound connections spike to venues/risk/market-data dependencies.
Local ephemeral ports are consumed faster than they are released, and TIME_WAIT sockets accumulate.
connect() starts failing (EADDRNOTAVAIL/timeouts) or stalling.
Retry logic amplifies burstiness (often synchronized backoff schedules).
Dispatch timing drifts from intended cadence.
Fill quality degrades: more chasing, more spread crossing, worse markouts.

The key pathology is cadence distortion, not just average latency increase.
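A back-of-the-envelope sketch of why the third step bites sooner than most desks expect. It assumes the common Linux default ephemeral range (32768-60999), a 60-second TIME_WAIT hold, and that the gateway is the side that closes first; all three should be checked on the actual host, and options such as tcp_tw_reuse change the picture. The function name is hypothetical.

```python
def max_sustained_connect_rate(port_low: int, port_high: int, time_wait_s: float) -> float:
    """Rough upper bound on new connections/sec from ONE source IP to ONE
    (dst_ip, dst_port), assuming every closed connection parks its local
    port in TIME_WAIT for time_wait_s seconds."""
    if time_wait_s <= 0 or port_high < port_low:
        raise ValueError("need a positive hold time and a non-empty range")
    capacity = port_high - port_low + 1  # the range is inclusive on both ends
    return capacity / time_wait_s


# Assumed defaults, not measured values.
rate = max_sustained_connect_rate(32768, 60999, 60.0)
print(f"{rate:.0f} connects/sec per destination before the pool runs dry")
```

Under these assumptions the ceiling is roughly 470 short-lived connections per second per destination, which a reconnect storm across a few dozen workers can reach without any single worker looking abnormal.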

Add a port-pressure term to IS decomposition

[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{port}}_{\text{infra transport tax}} ]

Operational approximation:

[ IS_{port,t} \approx a\cdot PPR_t + b\cdot TWR_t + c\cdot CFR_t + d\cdot RBI_t ]

Where:

(PPR): Port Pressure Ratio
(TWR): TIME_WAIT Ratio
(CFR): Connect Failure Rate
(RBI): Retry Burst Index

Goal: forecast and contain tail cost inflation before it appears as visible reject storms.
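The operational approximation above is a plain weighted sum, so it can be sketched in a few lines. The coefficient values below are illustrative placeholders, not fitted numbers; in practice a, b, c, d would come from regressing realized shortfall on the four signals. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortTaxCoefficients:
    a: float  # bps per unit of PPR
    b: float  # bps per unit of TWR
    c: float  # bps per unit of CFR
    d: float  # bps per unit of RBI


def is_port_bps(coef: PortTaxCoefficients, ppr: float, twr: float,
                cfr: float, rbi: float) -> float:
    """IS_port ~ a*PPR + b*TWR + c*CFR + d*RBI, in basis points."""
    return coef.a * ppr + coef.b * twr + coef.c * cfr + coef.d * rbi


# Placeholder coefficients chosen only to make the arithmetic easy to follow.
coef = PortTaxCoefficients(a=1.0, b=0.5, c=20.0, d=0.25)
estimate = is_port_bps(coef, ppr=0.8, twr=0.6, cfr=0.05, rbi=4.0)
print(f"estimated transport tax: {estimate:.2f} bps")
```

With these made-up inputs the estimate is about 3.1 bps (0.8 + 0.3 + 1.0 + 1.0), which shows how a small connect failure rate can dominate once its coefficient is large.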

Online signals to collect

1) Port Pressure Ratio (PPR)

[ PPR = \frac{\text{active outbound sockets in ephemeral range}}{\text{ephemeral range capacity}} ]

2) TIME_WAIT Ratio (TWR)

[ TWR = \frac{\#\,\text{TIME\_WAIT sockets}}{\text{ephemeral range capacity}} ]
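One possible way to compute PPR and TWR on Linux is to parse /proc/net/tcp (IPv4 only; /proc/net/tcp6 has the same layout) together with /proc/sys/net/ipv4/ip_local_port_range. The sketch below works on text so it can be tested off-host. It assumes that "active" means any socket with a local port in the ephemeral range that is neither LISTEN nor TIME_WAIT, which is an interpretation rather than something the definitions pin down. The sample rows are synthetic.

```python
from typing import Iterable, Tuple

TIME_WAIT = "06"  # TCP state codes as printed in /proc/net/tcp
LISTEN = "0A"


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the contents of ip_local_port_range, e.g. '32768\\t60999'."""
    low, high = text.split()
    return int(low), int(high)


def port_ratios(lines: Iterable[str], low: int, high: int) -> Tuple[float, float]:
    """Return (PPR, TWR) for sockets whose local port lies in [low, high]."""
    capacity = high - low + 1
    active = time_wait = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or ":" not in fields[1]:
            continue  # header row or malformed line
        local_port = int(fields[1].rsplit(":", 1)[1], 16)  # port is hex
        if not low <= local_port <= high:
            continue
        state = fields[3]
        if state == TIME_WAIT:
            time_wait += 1
        elif state != LISTEN:
            active += 1
    return active / capacity, time_wait / capacity


# Synthetic rows in /proc/net/tcp layout: local ports 8080, 40000, 40001, 50000.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0A00000A:9C40 0B00000A:01BB 01 00000000:00000000 00:00000000 00000000  1000        0 22222
   2: 0A00000A:9C41 0B00000A:01BB 06 00000000:00000000 03:00000A00 00000000     0        0 0
   3: 0A00000A:C350 0B00000A:01BB 06 00000000:00000000 03:00000A00 00000000     0        0 0
"""

ppr, twr = port_ratios(SAMPLE.splitlines(), 40000, 40009)  # tiny range for clarity
print(ppr, twr)
```

On a live host the same functions would be fed the contents of the two proc files; at high socket counts a netlink-based source such as `ss` is usually cheaper than reading /proc/net/tcp on every sample.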

3) Connect Failure Rate (CFR)

[ CFR = \frac{\#(\text{connect timeouts} + \text{EADDRNOTAVAIL} + \text{ECONNREFUSED} + \text{resets})}{\#\,\text{connect attempts}} ]
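A minimal sketch of the CFR numerator, assuming each connect attempt is recorded as either None (success) or the exception it raised. Mapping "resets" to ECONNRESET and "timeouts" to ETIMEDOUT or a Python timeout exception is an assumption; function names are hypothetical.

```python
import errno
import socket
from typing import Iterable, Optional

# errno values counted as failures, following the CFR definition above.
FAILURE_ERRNOS = {
    errno.EADDRNOTAVAIL,  # no free local port/address for this tuple
    errno.ECONNREFUSED,
    errno.ECONNRESET,
    errno.ETIMEDOUT,
}


def is_connect_failure(exc: Optional[BaseException]) -> bool:
    """True if the recorded outcome counts toward the CFR numerator."""
    if exc is None:
        return False
    if isinstance(exc, (TimeoutError, socket.timeout)):
        return True
    return isinstance(exc, OSError) and exc.errno in FAILURE_ERRNOS


def connect_failure_rate(outcomes: Iterable[Optional[BaseException]]) -> float:
    """Failures divided by attempts; 0.0 when there were no attempts."""
    total = failures = 0
    for exc in outcomes:
        total += 1
        failures += is_connect_failure(exc)
    return failures / total if total else 0.0


window = [
    None,
    None,
    OSError(errno.EADDRNOTAVAIL, "cannot assign requested address"),
    ConnectionRefusedError(errno.ECONNREFUSED, "refused"),
    TimeoutError(),
]
print(connect_failure_rate(window))
```

Keeping a separate count for EADDRNOTAVAIL is worth considering, since it is the one error in this set that points directly at local port exhaustion rather than at the remote side.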

4) Connect Tail Latency (CTL95/99)

p95/p99 of socket connect establishment on execution-critical paths.
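A small sketch of the tail statistic, using the nearest-rank percentile definition. Production systems often use streaming sketches (histograms, t-digest) instead; this exact version is only meant to pin down what is being measured. The same helper applies to any latency sample, and the numbers are made up.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up connect latencies in ms: mostly fast, with a stalled tail.
connect_ms = [0.4] * 18 + [35.0, 210.0]
mean_ms = sum(connect_ms) / len(connect_ms)
ctl95 = percentile_nearest_rank(connect_ms, 95)
ctl99 = percentile_nearest_rank(connect_ms, 99)
print(f"mean={mean_ms:.1f} ms  CTL95={ctl95} ms  CTL99={ctl99} ms")
```

In this toy sample the mean sits near 12.6 ms while CTL99 is 210 ms, which is the gap an averaged latency metric hides.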

5) Retry Burst Index (RBI)

Concentration of retries within short windows (e.g., 100-500ms buckets).

High RBI means retry policy is creating synthetic microbursts.
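The text does not fix a formula for RBI, so the following is one hypothetical definition: bucket retry timestamps into fixed windows and divide the busiest bucket's count by the mean count per bucket over the observation span. Evenly spread retries give a value near 1; N retries landing in a single bucket out of N buckets give N.

```python
import math
from collections import Counter
from typing import Sequence


def retry_burst_index(retry_ts_ms: Sequence[float], span_start_ms: float,
                      span_end_ms: float, bucket_ms: float = 250.0) -> float:
    """Peak-to-mean ratio of retry counts per bucket (assumed definition)."""
    if span_end_ms <= span_start_ms or bucket_ms <= 0:
        raise ValueError("need a positive span and bucket width")
    n_buckets = max(1, math.ceil((span_end_ms - span_start_ms) / bucket_ms))
    counts = Counter()
    for ts in retry_ts_ms:
        if span_start_ms <= ts < span_end_ms:
            counts[int((ts - span_start_ms) // bucket_ms)] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no retries in the span
    mean_per_bucket = total / n_buckets  # empty buckets count toward the mean
    return max(counts.values()) / mean_per_bucket


spread = retry_burst_index([10, 260, 510, 760], 0, 1000)   # one retry per bucket
clumped = retry_burst_index([10, 20, 30, 40], 0, 1000)     # all in the first bucket
print(spread, clumped)
```

The bucket width matters: it should be on the order of the scheduler's child-order cadence, otherwise bursts are either smeared out or every retry looks like its own burst.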

6) Decision-to-Send Delay (DSD95)

Tail delay from child-order decision timestamp to actual send attempt.

7) Port-Stress Markout Gap (PSMG)

Matched-cohort post-fill return gap between port_stress=1 and baseline windows.
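A sketch of one way to compute the gap, assuming fills arrive as (cohort_key, port_stress, markout_bps) tuples, that markouts are signed from the trader's perspective (negative is adverse), and that cohorts are weighted by their number of stressed fills. The weighting scheme and names are assumptions, and the data is invented.

```python
from collections import defaultdict
from statistics import mean
from typing import Hashable, Iterable, Tuple

Fill = Tuple[Hashable, bool, float]  # (cohort_key, port_stress, markout_bps)


def port_stress_markout_gap(fills: Iterable[Fill]) -> float:
    """Weighted mean of (stressed mean - baseline mean) over cohorts that
    contain both kinds of fills. Unmatched cohorts are dropped."""
    by_cohort = defaultdict(lambda: {True: [], False: []})
    for key, stressed, markout in fills:
        by_cohort[key][bool(stressed)].append(markout)

    weighted_gap = 0.0
    weight = 0
    for groups in by_cohort.values():
        stressed, baseline = groups[True], groups[False]
        if stressed and baseline:
            weighted_gap += len(stressed) * (mean(stressed) - mean(baseline))
            weight += len(stressed)
    if weight == 0:
        raise ValueError("no cohort has both stressed and baseline fills")
    return weighted_gap / weight


fills = [
    ("A", True, -2.0), ("A", False, 0.0), ("A", False, 1.0),
    ("B", True, -1.0), ("B", True, -3.0), ("B", False, 0.0),
    ("C", True, -10.0),  # no baseline fills in cohort C, so it is ignored
]
psmg = port_stress_markout_gap(fills)
print(f"PSMG = {psmg:.2f} bps")
```

Dropping unmatched cohorts (C here) is deliberate: an extreme stressed fill with no comparable baseline says nothing about the transport effect.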

Minimal production model

Two-stage design:

1) Port-stress classifier
- Inputs: PPR, TWR, CFR, CTL95, retry queue depth, reconnect rate
- Output: (P(\text{PORT\_STRESS}))
2) Cost model conditioned on stress
- Predict (E[IS]), (q95(IS)), completion risk

Useful interaction to keep:

[ \Delta IS \sim \beta_1\,\text{urgency} + \beta_2\,\text{PORT\_STRESS} + \beta_3\,(\text{urgency} \times \text{PORT\_STRESS}) ]

Reason: urgency amplifies cadence distortion cost.
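To see what the interaction term buys, compare the fitted cost with and without stress at a fixed urgency, treating PORT_STRESS as a 0/1 indicator (an assumption; a probability works the same way):

```latex
\[
\Delta IS\big|_{\text{PORT\_STRESS}=1} - \Delta IS\big|_{\text{PORT\_STRESS}=0}
  = \beta_2 + \beta_3 \cdot \text{urgency}
\]
```

Without the interaction the stress uplift would be the constant \beta_2 for every order. With \beta_3 > 0 the uplift grows linearly in urgency, which is the amplification the model is meant to capture.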

Controller states

GREEN — SESSION_STABLE

Low PPR/TWR, low CFR, normal CTL tails
Standard tactic set

YELLOW — PORT_PRESSURE_RISING

PPR/TWR drift upward, mild CTL95 expansion
Actions:
- reduce connection churn (enforce keepalive/pool reuse)
- trim child clip size 10-15%
- add bounded launch jitter to avoid synchronized retries

ORANGE — RETRY_BURST_ACTIVE

CFR up + RBI spike + DSD95 deterioration
Actions:
- hard cap aggressive catch-up
- token-bucket retry budget per destination
- prefer already-warm gateway/venue channels
- temporarily lower concurrent new-session attempts
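The per-destination retry budget in the ORANGE actions can be sketched as a token bucket keyed by destination. The rate and burst values, the class name, and the injectable clock are illustrative choices; the sketch is single-threaded and would need a lock if shared across worker threads.

```python
import time
from typing import Callable, Dict


class RetryBudget:
    """Token bucket per destination: a retry proceeds only if a token is free."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_s      # tokens refilled per second
        self._burst = burst          # bucket capacity
        self._clock = clock
        self._tokens: Dict[str, float] = {}
        self._last: Dict[str, float] = {}

    def try_acquire(self, destination: str) -> bool:
        now = self._clock()
        last = self._last.get(destination, now)
        tokens = self._tokens.get(destination, self._burst)  # new buckets start full
        # Refill for the elapsed time, never beyond the burst capacity.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        self._last[destination] = now
        allowed = tokens >= 1.0
        self._tokens[destination] = tokens - 1.0 if allowed else tokens
        return allowed


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
budget = RetryBudget(rate_per_s=1.0, burst=2.0, clock=lambda: fake_now[0])
first_three = [budget.try_acquire("venue_a") for _ in range(3)]
print(first_three)  # two retries pass, the third is denied
```

A denied retry should be dropped or deferred rather than spun on, otherwise the budget only moves the burst from the network into a local busy loop.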

RED — PORT_EXHAUSTION_TAX

repeated connect failures, persistent tail blowout
Actions:
- safe containment for non-urgent flow
- freeze optional reconnection churn sources
- incident lane: port-range/connection-policy remediation
- require manual approval for urgency overrides

Use hysteresis + minimum dwell time to avoid control flapping.
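A minimal sketch of hysteresis plus minimum dwell time, driven by the classifier probability alone. The threshold values and the 30-second dwell are placeholders. Escalating immediately while gating only de-escalation on dwell is a design assumption here, not something the playbook prescribes.

```python
from typing import Optional

STATES = ["GREEN", "YELLOW", "ORANGE", "RED"]
# Hypothetical thresholds on p_stress. Exit is below enter for each state,
# which creates the hysteresis band.
ENTER = {"YELLOW": 0.30, "ORANGE": 0.55, "RED": 0.80}
EXIT = {"YELLOW": 0.20, "ORANGE": 0.45, "RED": 0.70}


class PortStateController:
    def __init__(self, min_dwell_s: float = 30.0) -> None:
        self.state = "GREEN"
        self.min_dwell_s = min_dwell_s
        self._entered_at: Optional[float] = None

    def update(self, p_stress: float, now_s: float) -> str:
        if self._entered_at is None:
            self._entered_at = now_s
        idx = STATES.index(self.state)

        # Escalate at once, possibly by several levels.
        target = idx
        while target + 1 < len(STATES) and p_stress >= ENTER[STATES[target + 1]]:
            target += 1

        # De-escalate one level at a time, only after the dwell time has
        # elapsed and p_stress has fallen below the current state's exit level.
        if target == idx and idx > 0:
            dwelled = now_s - self._entered_at >= self.min_dwell_s
            if dwelled and p_stress < EXIT[self.state]:
                target = idx - 1

        if target != idx:
            self.state = STATES[target]
            self._entered_at = now_s
        return self.state


ctl = PortStateController(min_dwell_s=30.0)
trace = [ctl.update(p, t) for p, t in
         [(0.10, 0), (0.60, 1), (0.40, 10), (0.50, 40), (0.40, 41), (0.10, 42), (0.10, 80)]]
print(trace)
```

In the trace, the dip to 0.40 at t=10 does not leave ORANGE because the dwell has not elapsed, and 0.50 at t=40 does not either because it sits inside the hysteresis band.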

Engineering mitigations (high ROI first)

Connection lifecycle discipline first
- Prefer long-lived pooled channels over per-order short sessions.
Retry policy hardening
- jittered exponential backoff + retry token budgets.
- never allow synchronized immediate retries across workers.
Ephemeral range capacity planning
- set net.ipv4.ip_local_port_range (Linux sysctl) with measured headroom for bursts.
TIME_WAIT-aware design
- reduce unnecessary close/reopen loops.
- evaluate kernel/socket tuning only after protocol-level fixes.
Source-IP / egress sharding where appropriate
- distribute outbound tuple pressure when architecture supports it.
Execution SLO + transport SLO unification
- dashboard PPR/TWR/CFR next to slippage tails and completion metrics.
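For the retry-policy item above, one widely used form of jittered exponential backoff is "full jitter": draw the delay uniformly between zero and an exponentially growing, capped ceiling. The base, cap, and function name below are illustrative values, not recommendations.

```python
import random
from typing import Optional

_DEFAULT_RNG = random.Random()


def backoff_delay_s(attempt: int, base_s: float = 0.05, cap_s: float = 2.0,
                    rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    rng = rng if rng is not None else _DEFAULT_RNG
    exponent = min(attempt, 32)  # keep 2**exponent bounded for large attempts
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


# Two workers that fail at the same instant get different delays because
# each draws from its own random stream (seeds fixed here for repeatability).
worker_a = random.Random(1)
worker_b = random.Random(2)
delays_a = [backoff_delay_s(k, rng=worker_a) for k in range(5)]
delays_b = [backoff_delay_s(k, rng=worker_b) for k in range(5)]
print(delays_a)
print(delays_b)
```

The point of the randomization is decorrelation: a deterministic schedule (50 ms, 100 ms, 200 ms, ...) shared by all workers reproduces the original burst at every retry step.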

Validation workflow

Label port_stress windows from thresholded PPR/TWR/CFR/CTL95.
Build matched cohorts by symbol, spread, vol, urgency, and session bucket.
Estimate incremental (\Delta E[IS]), (\Delta q95(IS)), completion miss rate.
Run controller in shadow mode first.
Promote only if out-of-sample weeks show tail-cost reduction without completion regression.
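The labeling step in this workflow can be sketched as a threshold rule over per-window signals. The threshold values, the "at least two breaches" rule, and the field names are all placeholders to be calibrated against the desk's own history.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StressThresholds:
    ppr: float = 0.70       # hypothetical values, not recommendations
    twr: float = 0.50
    cfr: float = 0.01
    ctl95_ms: float = 50.0


def label_port_stress(window: Mapping[str, float],
                      th: StressThresholds = StressThresholds(),
                      min_breaches: int = 2) -> int:
    """Return 1 if at least min_breaches signals exceed their thresholds."""
    breaches = sum([
        window["ppr"] >= th.ppr,
        window["twr"] >= th.twr,
        window["cfr"] >= th.cfr,
        window["ctl95_ms"] >= th.ctl95_ms,
    ])
    return int(breaches >= min_breaches)


calm = {"ppr": 0.20, "twr": 0.10, "cfr": 0.000, "ctl95_ms": 3.0}
stressed = {"ppr": 0.85, "twr": 0.60, "cfr": 0.004, "ctl95_ms": 20.0}
print(label_port_stress(calm), label_port_stress(stressed))
```

Requiring more than one breached signal is meant to keep a single noisy metric from labeling a window; the labels should be frozen before cohort matching so the thresholds are not tuned on the outcome.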

KPIs

PPR / TWR
CFR
CTL95 / CTL99
RBI
DSD95
q95 implementation shortfall (stress vs baseline)
completion rate under stress
PSMG (markout gap)

Success = lower tail slippage and better cadence stability, not just fewer connect errors.

Pseudocode sketch

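# Gather the transport signals for the current window.
features = collect_transport_features()  # PPR, TWR, CFR, CTL95, RBI, DSD95
# Stage 1: probability that the host is in a port-stress regime.
p_stress = port_stress_model.predict_proba(features)
# Map probability and raw signals to a controller state (hysteresis and dwell live here).
state = decode_port_state(p_stress, features)

# Stage 2: pick execution parameters for the state.
if state == "GREEN":
    params = normal_execution()
elif state == "YELLOW":
    params = reduce_churn_and_trim_clip()
elif state == "ORANGE":
    params = cap_retry_and_prefer_warm_paths()
else:  # RED, or any unrecognized state: fail toward containment
    params = safe_containment_mode()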

submit_orders(params)
log_port_stress_telemetry(state, features, params)

Anti-footgun rules

Don’t hide connect failures inside one average latency metric.
Don’t rely on fast retry loops without jitter and budgets.
Don’t scale worker count without revisiting outbound tuple/port capacity.
Don’t treat TIME_WAIT explosions as harmless housekeeping.
Don’t promote mitigation policies without matched-cohort impact validation.

References (starting points)

Linux kernel networking docs (ip_local_port_range, socket/TCP behavior)
RFC 9293 (TCP) for connection lifecycle context
Dean & Barroso (2013), The Tail at Scale
Linux observability stack docs (ss, netstat replacements, eBPF socket telemetry)

Bottom line

Ephemeral-port and reconnect churn is an under-modeled execution tax.

Treat port pressure as a first-class state variable, stabilize connection lifecycle, and gate execution aggressiveness by transport health before retry bursts quietly convert alpha into tail slippage.