Ephemeral Port Exhaustion & Connect-Retry Burst Slippage Playbook

2026-03-22 · finance

Date: 2026-03-22
Category: research
Scope: How outbound socket churn (TIME_WAIT buildup, connect failures, retry bursts) leaks execution quality

Why this matters

Most slippage stacks model market state and strategy urgency well.

But many desks still leak bps when execution-path networking silently shifts from steady pooled sessions to high-churn connect/reconnect behavior.

A common failure mode:

This is not a full outage. It is a transport-state tax that shows up first in tail slippage.


Mechanism in one timeline

  1. Order gateway loses keepalive stability (or rotates connections too aggressively).
  2. New outbound connections spike to venues/risk/market-data dependencies.
  3. Local ephemeral ports and TIME_WAIT slots accumulate.
  4. connect() starts failing (EADDRNOTAVAIL/timeouts) or stalling.
  5. Retry logic amplifies burstiness (often synchronized backoff schedules).
  6. Dispatch timing drifts from intended cadence.
  7. Fill quality degrades: more chasing, more spread crossing, worse markouts.

The key pathology is cadence distortion, not just average latency increase.


Add a port-pressure term to IS decomposition

\[ IS = IS_{\text{market}} + IS_{\text{impact}} + IS_{\text{timing}} + IS_{\text{fees}} + \underbrace{IS_{\text{port}}}_{\text{infra transport tax}} \]

Operational approximation:

\[ IS_{\text{port},t} \approx a \cdot PPR_t + b \cdot TWR_t + c \cdot CFR_t + d \cdot RBI_t \]

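Where:

  • PPR_t: Port Pressure Ratio in interval t
  • TWR_t: TIME_WAIT Ratio in interval t
  • CFR_t: Connect Failure Rate in interval t
  • RBI_t: Retry Burst Index in interval t
  • a, b, c, d: coefficients to be estimated from the desk's own fills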

Goal: forecast and contain tail cost inflation before it appears as visible reject storms.
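
A minimal sketch of the operational approximation, assuming the four signals are already measured per interval. The class name, function name, and the idea of expressing coefficients in bps per unit of signal are illustrative assumptions, not values from this playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortTaxCoefficients:
    """Hypothetical coefficients (bps per unit of signal); fit them on your own data."""
    a: float  # weight on Port Pressure Ratio (PPR)
    b: float  # weight on TIME_WAIT Ratio (TWR)
    c: float  # weight on Connect Failure Rate (CFR)
    d: float  # weight on Retry Burst Index (RBI)


def is_port_bps(coef: PortTaxCoefficients, ppr: float, twr: float,
                cfr: float, rbi: float) -> float:
    """Linear approximation: IS_port ~ a*PPR + b*TWR + c*CFR + d*RBI."""
    return coef.a * ppr + coef.b * twr + coef.c * cfr + coef.d * rbi
```

Keeping the term linear makes each coefficient directly readable as "bps of tax per unit of signal", which is convenient when the estimate sits next to the other IS components.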


Online signals to collect

1) Port Pressure Ratio (PPR)

\[ PPR = \frac{\text{active outbound sockets in ephemeral range}}{\text{ephemeral range capacity}} \]

2) TIME_WAIT Ratio (TWR)

\[ TWR = \frac{\#\,\text{TIME\_WAIT sockets}}{\text{ephemeral range capacity}} \]
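
One way to approximate PPR and TWR on Linux is to parse the text of /proc/net/tcp against the range in /proc/sys/net/ipv4/ip_local_port_range. The sketch below assumes IPv4 only and makes two simplifications that should be checked against your setup: any non-LISTEN, non-TIME_WAIT socket whose local port falls in the ephemeral range is treated as an active outbound socket, and only TIME_WAIT sockets holding an ephemeral local port are counted. Function names are hypothetical.

```python
TCP_TIME_WAIT = 0x06  # state code used in the "st" column of /proc/net/tcp
TCP_LISTEN = 0x0A


def parse_port_range(text):
    """Parse the contents of ip_local_port_range, e.g. '32768\t60999'."""
    low, high = (int(x) for x in text.split())
    return low, high


def port_ratios(proc_net_tcp_text, low, high):
    """Return (PPR, TWR) from the text of /proc/net/tcp and an inclusive port range."""
    capacity = high - low + 1
    active = time_wait = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like '0100007F:9C40' (hex address, hex port)
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        state = int(fields[3], 16)
        if not (low <= local_port <= high):
            continue
        if state == TCP_TIME_WAIT:
            time_wait += 1
        elif state != TCP_LISTEN:
            active += 1
    return active / capacity, time_wait / capacity
```

On a live host the inputs would come from reading those two files; /proc/net/tcp6 uses the same column layout for IPv6 sockets and could be summed in the same way.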

3) Connect Failure Rate (CFR)

\[ CFR = \frac{\#\,(\text{connect timeouts} + \text{EADDRNOTAVAIL} + \text{ECONNREFUSED} + \text{resets})}{\#\,\text{connect attempts}} \]
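
A small counter along these lines could track CFR in process. It is a sketch: the class name is hypothetical, and treating other connect errors as tracked but excluded from the numerator is one reading of the formula above, not a rule from this playbook.

```python
import errno
import socket
from collections import Counter

# errno values counted in the CFR numerator; timeouts are matched by exception type.
COUNTED_ERRNOS = {errno.EADDRNOTAVAIL, errno.ECONNREFUSED,
                  errno.ECONNRESET, errno.ETIMEDOUT}


class ConnectStats:
    """Counts connect attempts and the failure causes named in the CFR formula."""

    def __init__(self):
        self.attempts = 0
        self.failures = 0
        self.by_cause = Counter()

    def record(self, exc=None):
        """Record one connect attempt; pass the raised exception, or None on success."""
        self.attempts += 1
        if exc is None:
            return
        if isinstance(exc, (TimeoutError, socket.timeout)):
            cause = "timeout"
        elif isinstance(exc, OSError) and exc.errno in COUNTED_ERRNOS:
            cause = errno.errorcode[exc.errno]
        else:
            cause = "other"  # tracked for visibility, not counted in CFR
        self.by_cause[cause] += 1
        if cause != "other":
            self.failures += 1

    @property
    def cfr(self):
        return self.failures / self.attempts if self.attempts else 0.0
```

Keeping the per-cause breakdown matters in practice: a rise driven by EADDRNOTAVAIL points at local port pressure, while ECONNREFUSED usually points at the remote side.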

4) Connect Tail Latency (CTL95/99)

p95/p99 of socket connect establishment on execution-critical paths.
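
Tail percentiles over a window of connect durations can be computed with a nearest-rank rule, sketched below. The function name is hypothetical, and the samples are assumed to come from timing the gateway's real connect calls rather than from synthetic probes, which would themselves add churn.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Nearest-rank always returns an observed value, which keeps a p99 honest on small windows where interpolation would invent a latency nobody experienced. The same helper would apply to the DSD95 signal.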

5) Retry Burst Index (RBI)

Concentration of retries within short windows (e.g., 100-500ms buckets).

High RBI means retry policy is creating synthetic microbursts.
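
The playbook does not pin down a formula for RBI, so the sketch below assumes one possible definition: the peak-to-mean ratio of retry counts across fixed buckets in an observation window. Evenly spread retries give 1.0, and all retries landing in a single bucket give the number of buckets. The function name and parameters are hypothetical.

```python
from collections import Counter


def retry_burst_index(retry_times_ms, window_start_ms, window_ms, bucket_ms=100):
    """Peak-to-mean ratio of retries per bucket over a fixed observation window."""
    n_buckets = window_ms // bucket_ms
    if n_buckets <= 0:
        raise ValueError("window_ms must cover at least one bucket")
    counts = Counter()
    for t in retry_times_ms:
        offset = t - window_start_ms
        if 0 <= offset < n_buckets * bucket_ms:  # ignore retries outside the window
            counts[offset // bucket_ms] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mean_per_bucket = total / n_buckets  # empty buckets count toward the mean
    return max(counts.values()) / mean_per_bucket
```

Using a fixed window, rather than the span between the first and last retry, is deliberate: otherwise a single tight burst would look perfectly uniform.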

6) Decision-to-Send Delay (DSD95)

Tail delay from child-order decision timestamp to actual send attempt.

7) Port-Stress Markout Gap (PSMG)

Matched-cohort post-fill return gap between port_stress=1 and baseline windows.


Minimal production model

Two-stage design:

  1. Port-stress classifier

    • Inputs: PPR, TWR, CFR, CTL95, retry queue depth, reconnect rate
    • Output: \(P(\text{PORT\_STRESS})\)
  2. Cost model conditioned on stress

    • Predict \(E[IS]\), \(q95(IS)\), completion risk

Useful interaction to keep:

\[ \Delta IS \sim \beta_1\,\text{urgency} + \beta_2\,\text{PORT\_STRESS} + \beta_3\,(\text{urgency} \times \text{PORT\_STRESS}) \]

Reason: urgency amplifies cadence distortion cost.
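
To make the interaction concrete: under this specification the incremental cost of port stress is not a constant but \(\beta_2 + \beta_3 \cdot \text{urgency}\). The sketch below evaluates that with placeholder coefficients; the function names and any numeric values passed in are hypothetical, not fitted results.

```python
def delta_is(urgency, port_stress, b1, b2, b3):
    """Evaluate b1*urgency + b2*PORT_STRESS + b3*(urgency * PORT_STRESS)."""
    return b1 * urgency + b2 * port_stress + b3 * urgency * port_stress


def stress_penalty(urgency, b2, b3):
    """Extra cost when PORT_STRESS flips from 0 to 1 at a given urgency: b2 + b3*urgency."""
    return b2 + b3 * urgency
```

With a positive \(\beta_3\), the same stress episode costs more for an urgent parent order than for a passive one, which is why dropping the interaction term tends to understate the tail.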


Controller states

GREEN — SESSION_STABLE

YELLOW — PORT_PRESSURE_RISING

ORANGE — RETRY_BURST_ACTIVE

RED — PORT_EXHAUSTION_TAX

Use hysteresis + minimum dwell time to avoid control flapping.
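
A minimal sketch of hysteresis plus minimum dwell, driven by the classifier's stress probability. The thresholds, exit margin, and dwell time are placeholder assumptions, as is the policy of escalating immediately while de-escalating one level at a time only after the dwell has elapsed.

```python
STATES = ("GREEN", "YELLOW", "ORANGE", "RED")


class PortStateController:
    """Maps P(PORT_STRESS) to a state with hysteresis and a minimum dwell time."""

    def __init__(self, enter=(0.30, 0.60, 0.85), exit_margin=0.10, min_dwell_s=5.0):
        self.enter = enter              # ascending thresholds to enter YELLOW/ORANGE/RED
        self.exit_margin = exit_margin  # must drop this far below the threshold to leave
        self.min_dwell_s = min_dwell_s  # minimum time in a state before stepping down
        self.level = 0
        self.entered_at = None

    def update(self, p_stress, now_s):
        if self.entered_at is None:
            self.entered_at = now_s
        # Highest level whose entry threshold is met.
        target = sum(p_stress >= th for th in self.enter)
        if target > self.level:
            # Escalate immediately: reacting late to stress is the costly error.
            self.level, self.entered_at = target, now_s
        elif target < self.level:
            dwell_ok = now_s - self.entered_at >= self.min_dwell_s
            below_exit = p_stress < self.enter[self.level - 1] - self.exit_margin
            if dwell_ok and below_exit:
                # Step down a single level so recovery is gradual.
                self.level, self.entered_at = self.level - 1, now_s
        return STATES[self.level]
```

The asymmetry is a design choice rather than a requirement: a symmetric dwell on escalation would be smoother but slower to protect order flow.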


Engineering mitigations (high ROI first)

  1. Connection lifecycle discipline first

    • Prefer long-lived pooled channels over per-order short sessions.
  2. Retry policy hardening

    • jittered exponential backoff + retry token budgets.
    • never allow synchronized immediate retries across workers.
  3. Ephemeral range capacity planning

    • set ip_local_port_range with measured headroom for bursts.
  4. TIME_WAIT-aware design

    • reduce unnecessary close/reopen loops.
    • evaluate kernel/socket tuning only after protocol-level fixes.
  5. Source-IP / egress sharding where appropriate

    • distribute outbound tuple pressure when architecture supports it.
  6. Execution SLO + transport SLO unification

    • dashboard PPR/TWR/CFR next to slippage tails and completion metrics.
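
The retry-hardening item above can be sketched as two small pieces: a jittered exponential backoff (here the "full jitter" variant, which draws uniformly between zero and the exponential ceiling) and a token-bucket retry budget shared by workers. Names, defaults, and the choice of full jitter are assumptions for illustration.

```python
import random


def full_jitter_delay(attempt, base_s=0.05, cap_s=2.0, rng=random):
    """Delay drawn uniformly from [0, min(cap_s, base_s * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


class RetryBudget:
    """Token bucket: each retry spends one token; tokens refill at a fixed rate."""

    def __init__(self, capacity, refill_per_s):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)
        self.last_s = None

    def try_spend(self, now_s):
        """Return True if a retry is allowed at time now_s, False if the budget is empty."""
        if self.last_s is not None:
            refill = (now_s - self.last_s) * self.refill_per_s
            self.tokens = min(self.capacity, self.tokens + refill)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Jitter decorrelates workers that failed at the same instant, while the budget caps the total retry rate regardless of how many workers are failing. A caller would check `try_spend(time.monotonic())` before sleeping for `full_jitter_delay(attempt)` and reconnecting.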

Validation workflow

  1. Label port_stress windows from thresholded PPR/TWR/CFR/CTL95.
  2. Build matched cohorts by symbol, spread, vol, urgency, and session bucket.
  3. Estimate incremental \(\Delta E[IS]\), \(\Delta q95(IS)\), completion miss rate.
  4. Run controller in shadow mode first.
  5. Promote only if out-of-sample weeks show tail-cost reduction without completion regression.
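
Steps 2 and 3 could be sketched as follows for the mean gap, assuming each fill has already been tagged with a cohort key (for example a tuple of symbol, spread bucket, vol bucket, urgency bucket, and session bucket) and a port_stress label. The function name and the weighting by stressed-fill count are assumptions.

```python
from collections import defaultdict
from statistics import mean


def matched_stress_gap(fills):
    """Weighted mean of per-cohort (stressed mean IS - baseline mean IS).

    fills: iterable of (cohort_key, port_stress, is_bps) tuples.
    Cohorts missing either a stressed or a baseline fill are skipped as unmatched.
    Returns None when no cohort is matched.
    """
    by_cohort = defaultdict(lambda: {0: [], 1: []})
    for key, port_stress, is_bps in fills:
        by_cohort[key][int(bool(port_stress))].append(is_bps)

    weighted_sum = 0.0
    weight = 0
    for groups in by_cohort.values():
        baseline, stressed = groups[0], groups[1]
        if not baseline or not stressed:
            continue
        gap = mean(stressed) - mean(baseline)
        weighted_sum += gap * len(stressed)  # weight by number of stressed fills
        weight += len(stressed)
    return weighted_sum / weight if weight else None
```

The tail gap \(\Delta q95(IS)\) can be estimated the same way by swapping the mean for a percentile, though it needs far more fills per cohort to be stable.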

KPIs

Success = lower tail slippage and better cadence stability, not just fewer connect errors.


Pseudocode sketch

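# Sketch only: every helper below is a placeholder for desk-specific logic.
features = collect_transport_features()  # PPR, TWR, CFR, CTL95, RBI, DSD95
p_stress = port_stress_model.predict_proba(features)  # stage 1: P(PORT_STRESS)
state = decode_port_state(p_stress, features)  # applies hysteresis + minimum dwell

# Map each controller state to an execution-parameter policy.
if state == "GREEN":
    params = normal_execution()
elif state == "YELLOW":
    params = reduce_churn_and_trim_clip()
elif state == "ORANGE":
    params = cap_retry_and_prefer_warm_paths()
else:  # RED
    params = safe_containment_mode()

submit_orders(params)
log_port_stress_telemetry(state, features, params)  # keep for shadow-mode validation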

Anti-footgun rules


Bottom line

Ephemeral-port and reconnect churn is an under-modeled execution tax.

Treat port pressure as a first-class state variable, stabilize connection lifecycle, and gate execution aggressiveness by transport health before retry bursts quietly convert alpha into tail slippage.