SO_REUSEPORT + eBPF Socket Steering Playbook (Hot-Restart Safe)

2026-03-16 · software


Why this matters

If you run multi-worker network services (gateways, market-data ingest, collectors, APIs), SO_REUSEPORT is a standard way to spread traffic across workers.

But default behavior is not always enough:

  • a few heavy senders can pin most load onto one or two workers,
  • closing a listener during restart/reload can drop in-flight handshakes,
  • the kernel's hash gives you no programmable policy and no migration control.

This playbook turns SO_REUSEPORT from a socket option into an operated control surface.


1) Baseline model: what plain SO_REUSEPORT actually does

With SO_REUSEPORT, multiple sockets can bind/listen on the same IP:port (Linux 3.9+).

Default selection is kernel hash-based (4-tuple driven: src IP/port + dst IP/port), which is fast and usually good enough.
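
A minimal sketch of a reuseport group, assuming a Linux (3.9+) Python build where socket.SO_REUSEPORT is exposed; UDP is used only to keep the example self-contained (no listen/accept needed):

```python
import socket

def bind_reuseport_group(n_workers: int, host: str = "127.0.0.1"):
    """Bind n_workers UDP sockets to the same host:port via SO_REUSEPORT.

    Every socket must set SO_REUSEPORT *before* bind(), and all binds must
    come from the same effective UID, or the kernel rejects the group.
    """
    socks, port = [], 0
    for _ in range(n_workers):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]  # first bind picks an ephemeral port; reuse it
        socks.append(s)
    return socks

group = bind_reuseport_group(4)
assert len({s.getsockname() for s in group}) == 1  # all four share one IP:port
for s in group:
    s.close()
```

From here the kernel hash decides which socket receives each datagram/connection; nothing in user space has to coordinate.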

Important version landmarks:

  • 3.9: SO_REUSEPORT itself.
  • 4.5: SO_ATTACH_REUSEPORT_CBPF / SO_ATTACH_REUSEPORT_EBPF (programmable selection).
  • 4.19: BPF_PROG_TYPE_SK_REUSEPORT and BPF_MAP_TYPE_REUSEPORT_SOCKARRAY (map-driven steering).
  • 5.14: socket migration via BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and the net.ipv4.tcp_migrate_req sysctl.

Operational implication: if you need programmable steering + safer restarts, kernel version is a hard prerequisite, not a tuning detail.


2) When default hashing is enough vs when eBPF is justified

Use plain SO_REUSEPORT when:

  • traffic is already reasonably balanced across workers,
  • drops, queue buildup, and p99 skew are not recurring problems,
  • restarts are rare or scheduled off-peak,
  • a minimal operational surface matters more than steering control.

Consider SO_ATTACH_REUSEPORT_EBPF when:

  1. Skewed source distributions
    • few heavy senders dominate one/few workers.
  2. Topology-aware steering needs
    • you want custom policy (e.g., weighted, random, migration-aware).
  3. Hot-restart safety
    • you need better control over listener transitions.
  4. Measurable imbalance pain
    • drops, queue buildup, p99 divergence are recurring.

Rule: don’t add eBPF “because fancy.” Add it when you can state the failure mode in one sentence.
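
That rule can be made mechanical. A hypothetical helper (all names and thresholds are illustrative assumptions, not from the source) that only says "yes" when one of the four failure modes above is measurable:

```python
def ebpf_steering_justified(per_worker_rps, drops_per_min, p99_ms,
                            skew_limit=2.0, drop_limit=0, p99_spread_limit=1.5):
    """Return (justified, reasons); thresholds are illustrative defaults."""
    reasons = []
    # Skewed source distribution: busiest worker dwarfs the quietest.
    if max(per_worker_rps) > skew_limit * max(1e-9, min(per_worker_rps)):
        reasons.append("skewed source distribution")
    # Measurable loss pressure.
    if drops_per_min > drop_limit:
        reasons.append("recurring drops")
    # p99 divergence across workers.
    if max(p99_ms) > p99_spread_limit * max(1e-9, min(p99_ms)):
        reasons.append("p99 divergence across workers")
    return (len(reasons) > 0, reasons)
```

If the reasons list is empty, you could not state the failure mode in one sentence, so stay on plain reuseport.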


3) Selection semantics you must not forget

For reuseport BPF programs, socket selection is group-index based.

Key details:

  • the BPF program returns an index into the reuseport group, not a pid or fd,
  • indices reflect current group membership and shift as sockets join or leave,
  • an out-of-range index falls back to the kernel's default hash selection.

Practical consequence: avoid hard-coding assumptions like “worker 3 is always index 3 forever.”

If your rollout process depends on stable identity, use explicit map-driven steering (BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) and treat indexing as mutable runtime state.
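
A user-space model of that selection rule (function and variable names are hypothetical; this mirrors the kernel behavior described above, it is not kernel code): a policy picks a group index, anything out of range falls back to a 4-tuple hash, and membership changes silently reorder effective indices.

```python
import zlib

def select_worker(group, four_tuple, policy_index=None):
    """Model of reuseport selection: a policy picks a group index;
    an out-of-range index falls back to a hash over the 4-tuple."""
    if policy_index is not None and 0 <= policy_index < len(group):
        return group[policy_index]
    key = "|".join(map(str, four_tuple)).encode()
    return group[zlib.crc32(key) % len(group)]

group = ["w0", "w1", "w2", "w3"]
t = ("10.0.0.5", 40000, "10.0.0.1", 443)
assert select_worker(group, t, policy_index=2) == "w2"
# Out-of-range index: hash fallback still lands on some group member.
assert select_worker(group, t, policy_index=99) in group
# Membership change: every later index shifts, so "index 2" now means w3.
group.remove("w1")
assert select_worker(group, t, policy_index=2) == "w3"
```

The last two lines are the whole argument for map-driven steering: identity must live in a map you update, not in a position you assume.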


4) Restart safety: the subtle failure mode

Historically, reuseport groups could lose in-flight TCP handshake-related state when a listener closed during restart/reload windows.

Modern migration support (Linux 5.14+) improves this via migration-aware selection: eBPF programs can use BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, while the non-eBPF path is governed by the net.ipv4.tcp_migrate_req sysctl.

Operational guidance:

  • gate any reliance on migration behavior on kernel 5.14+,
  • drain accept queues before closing listeners during reloads,
  • measure handshake failures and resets specifically inside restart windows, under load.

“Graceful restart” claims are meaningless unless measured during contention.
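
A hedged probe for the sysctl mentioned above; it returns None where the knob is absent (pre-5.14 kernels, non-Linux hosts, restricted containers), which is exactly the signal to gate a rollout on:

```python
from pathlib import Path

def tcp_migrate_req_setting(procfs: str = "/proc/sys/net/ipv4/tcp_migrate_req"):
    """Return net.ipv4.tcp_migrate_req as an int, or None if unavailable here."""
    try:
        return int(Path(procfs).read_text().strip())
    except (OSError, ValueError):  # missing file, no permission, unparsable
        return None

setting = tcp_migrate_req_setting()
# None -> no kernel migration support to rely on; plan restarts accordingly.
```

Treat None the same as 0: assume handshake state can be lost across listener close and design the reload sequence for that.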


5) Metrics that actually reveal reuseport health

Track at least:

Distribution / balance
  • per-worker accepted connections/sec and active connections,
  • imbalance ratio: busiest worker / least-busy worker.

Loss / pressure
  • ListenOverflows / ListenDrops deltas,
  • SYN and accept queue depth per listener.

User-visible outcomes
  • per-worker p50/p99 latency,
  • error and timeout rates by worker.

Restart correctness
  • resets and failed handshakes during restart windows,
  • restart-window error budget consumption.

Averages hide this problem. Keep per-worker visibility.


6) Control states (recommended)

Use a small state machine instead of ad-hoc tweaks:

  • DEFAULT: kernel hash selection, no custom policy.
  • SHADOW: custom policy computed but not enforced.
  • ENFORCED: eBPF steering active.
  • FALLBACK: immediate revert to default behavior after a regression.

Example trigger ideas:

  • imbalance ratio above threshold for N consecutive windows: escalate one state,
  • drop or p99 regression while enforced: go straight to fallback,
  • sustained calm: step back down, with hysteresis to avoid flapping.

Design principle: always keep a one-step path to known-good kernel default behavior.
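
A minimal sketch of such a controller. State names, thresholds, and the hysteresis scheme are illustrative assumptions; the one invariant taken from the text is that any regression is a single step back to known-good default behavior:

```python
class SteeringController:
    """DEFAULT -> SHADOW -> ENFORCED escalation, FALLBACK on any regression."""

    def __init__(self, skew_limit=2.0, calm_windows=3):
        self.state = "DEFAULT"
        self.skew_limit = skew_limit
        self.calm_windows = calm_windows  # hysteresis before stepping back down
        self._calm = 0

    def observe(self, imbalance_ratio: float, regression: bool) -> str:
        if regression:                    # drops/p99 regress under custom policy
            self.state = "FALLBACK"       # one step to known-good behavior
            self._calm = 0
            return self.state
        if imbalance_ratio > self.skew_limit:
            self._calm = 0                # escalate one state per hot window
            self.state = {"DEFAULT": "SHADOW",
                          "SHADOW": "ENFORCED"}.get(self.state, self.state)
        else:
            self._calm += 1               # require sustained calm to de-escalate
            if self._calm >= self.calm_windows and self.state == "ENFORCED":
                self.state = "SHADOW"
                self._calm = 0
        return self.state
```

Leaving FALLBACK is deliberately not automatic here: after an incident, re-escalation should be a human decision.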


7) Rollout pattern that avoids self-inflicted outages

  1. Observe-only phase

    • Keep plain reuseport.
    • Build per-worker baseline for balance, drops, p99.
  2. Shadow policy validation

    • Compute what custom steering would choose (without enforcing).
    • Compare expected vs actual skew reduction potential.
  3. Canary apply

    • Small subset of hosts/ports.
    • Hard rollback if p99, drops, or setup failures regress.
  4. Progressive expansion

    • Scale by host group, not all at once.
    • Freeze rollout during known burst windows.
  5. Restart stress validation

    • Intentionally deploy under load.
    • Require restart-window SLO pass before full adoption.
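
The canary step's "hard rollback" rule can be encoded as a gate. A hypothetical helper (metric keys and tolerance values are assumptions): any setup failure, any drop regression, or more than 10% p99 regression fails the canary.

```python
def canary_passes(baseline: dict, canary: dict,
                  p99_tolerance=1.10, drop_tolerance=1.00) -> bool:
    """Rollback gate: canary must not regress drops at all, nor p99 by >10%.

    Both dicts carry 'p99_ms', 'drops_per_min', 'setup_failures'.
    """
    if canary["setup_failures"] > 0:
        return False
    if canary["p99_ms"] > baseline["p99_ms"] * p99_tolerance:
        return False
    if canary["drops_per_min"] > baseline["drops_per_min"] * drop_tolerance:
        return False
    return True
```

Wire the False branch directly to the fallback state; a gate that only pages a human is not a hard rollback.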

8) Common mistakes

  1. No kernel capability gate

    • Team assumes feature exists everywhere; mixed kernels break assumptions.
  2. Treating per-worker skew as noise

    • System looks healthy until one worker saturates and tails explode.
  3. No fallback contract

    • Custom policy fails with no immediate downgrade path.
  4. Testing only steady-state traffic

    • Restart and burst paths are where hidden defects appear.
  5. Assuming socket index stability

    • Group membership changes reorder effective indexing.
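
Mistake #1 has a cheap fix: a capability gate derived from the kernel release string, using the version landmarks from section 1. A sketch (helper names are hypothetical):

```python
import platform
import re

def kernel_version(release=None):
    """Parse 'major.minor' from a release string like '5.15.0-86-generic'."""
    release = release if release is not None else platform.release()
    m = re.match(r"(\d+)\.(\d+)", release)
    return (int(m.group(1)), int(m.group(2))) if m else (0, 0)

def reuseport_capabilities(release=None):
    """Map kernel version to the reuseport features this host can support."""
    v = kernel_version(release)
    return {
        "so_reuseport": v >= (3, 9),
        "attach_reuseport_ebpf": v >= (4, 5),   # SO_ATTACH_REUSEPORT_EBPF
        "reuseport_sockarray": v >= (4, 19),    # BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
        "migration": v >= (5, 14),              # SELECT_OR_MIGRATE / tcp_migrate_req
    }
```

On a mixed fleet, ship this check with the deployer and refuse to enable any feature the host's kernel cannot back.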

9) 30-minute incident runbook (reuseport imbalance spike)

  1. Confirm symptom:
    • imbalance ratio, per-worker drops, p99 skew.
  2. Check whether issue coincides with deploy/restart window.
  3. If custom steering enabled:
    • switch to fallback/plain hash policy.
  4. Verify immediate impact:
    • drop rate down? queue down? p99 improving?
  5. If yes, hold fallback and capture forensic bundle:
    • kernel version, policy version, map/program update logs.
  6. If no, investigate non-steering bottlenecks:
    • NIC queue pinning, CPU saturation, app-level lock contention.
  7. Post-incident:
    • tune thresholds/hysteresis,
    • add replay or load-test case for reproduced failure.
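
Steps 4-6 reduce to one before/after comparison. A hypothetical triage helper (metric keys and the 20% improvement bar are assumptions):

```python
def triage_after_fallback(before: dict, after: dict, improvement=0.8) -> str:
    """After switching to fallback: if drops and p99 both improved by >=20%,
    hold fallback and capture forensics; otherwise the bottleneck is likely
    outside steering (NIC queues, CPU, app-level locks)."""
    improved = (after["drops_per_min"] <= before["drops_per_min"] * improvement
                and after["p99_ms"] <= before["p99_ms"] * improvement)
    return ("hold_fallback_and_capture_forensics" if improved
            else "investigate_non_steering_bottlenecks")
```

Keeping this as code forces the 30-minute runbook to stay a decision, not a debate.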

10) Minimal decision matrix

  • Balanced traffic, rare restarts: plain SO_REUSEPORT, default kernel hash.
  • Skewed senders, recurring tail pain: eBPF steering with map-driven policy and an explicit fallback.
  • Frequent restarts under load, kernel 5.14+: migration-aware selection (SELECT_OR_MIGRATE / tcp_migrate_req).
  • Mixed kernel fleet: capability-gate per host; default hash wherever features are missing.

The winning setup is not “most programmable.” It is “easiest to keep correct at 3 a.m.”

