SO_REUSEPORT + eBPF Socket Steering Playbook (Hot-Restart Safe)
Date: 2026-03-16
Category: knowledge
Why this matters
If you run multi-worker network services (gateways, market-data ingest, collectors, APIs), SO_REUSEPORT is a standard way to spread traffic across workers.
But default behavior is not always enough:
- Hash-based steering can create skew for certain traffic shapes (e.g., a few heavy UDP exporters).
- Rolling restarts can drop in-flight handshake state if listener migration is not handled.
- “Looks balanced on average” can still hide per-worker drops and tail latency blowups.
This playbook turns SO_REUSEPORT from a socket option into an operated control surface.
1) Baseline model: what plain SO_REUSEPORT actually does
With SO_REUSEPORT, multiple sockets can bind/listen on the same IP:port (Linux 3.9+).
Default selection is kernel hash-based (4-tuple driven: src IP/port + dst IP/port), which is fast and usually good enough.
Important version landmarks:
- SO_REUSEPORT: Linux 3.9
- SO_ATTACH_REUSEPORT_EBPF for UDP: Linux 4.5
- SO_ATTACH_REUSEPORT_EBPF for TCP: Linux 4.6
- BPF_PROG_TYPE_SK_REUSEPORT: Linux 4.19
- Reuseport migration improvements (BPF_SK_REUSEPORT_SELECT_OR_MIGRATE context): Linux 5.14
Operational implication: if you need programmable steering + safer restarts, kernel version is a hard prerequisite, not a tuning detail.
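The baseline mechanism is easy to see in a few lines. A minimal sketch in Python (Linux only; host, port, and worker count are illustrative): several UDP sockets set SO_REUSEPORT and bind the same IP:port, forming one reuseport group the kernel hashes across.

```python
import socket

def make_reuseport_sockets(n, host="127.0.0.1", port=0):
    """Create n UDP sockets sharing one IP:port via SO_REUSEPORT (Linux 3.9+)."""
    socks = []
    for _ in range(n):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        # Lock in the kernel-chosen port after the first bind so the
        # remaining sockets join the same reuseport group.
        port = s.getsockname()[1]
        socks.append(s)
    return socks

workers = make_reuseport_sockets(4)
assert len({s.getsockname() for s in workers}) == 1  # all four share one IP:port
```

With no program attached, incoming datagrams are spread across these four sockets by the kernel's 4-tuple hash; everything in the rest of this playbook layers policy on top of exactly this group.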
2) When default hashing is enough vs when eBPF is justified
Use plain SO_REUSEPORT when:
- traffic sources are naturally diverse,
- per-worker load variance is low,
- restart behavior is already acceptable,
- and tail SLOs are stable.
Consider SO_ATTACH_REUSEPORT_EBPF when:
- Skewed source distributions: a few heavy senders dominate one or a few workers.
- Topology-aware steering needs: you want a custom policy (e.g., weighted, random, migration-aware).
- Hot-restart safety: you need better control over listener transitions.
- Measurable imbalance pain: drops, queue buildup, and p99 divergence are recurring.
Rule: don’t add eBPF “because fancy.” Add it when you can state the failure mode in one sentence.
3) Selection semantics you must not forget
For reuseport BPF programs, socket selection is group-index based.
Key details:
- The program must ultimately select a valid socket index / map target.
- An invalid selection falls back to the plain reuseport hash.
- Socket ordering can change when sockets are closed (group compaction behavior).
- New sockets joining the group inherit the currently attached reuseport program.
Practical consequence: avoid hard-coding assumptions like “worker 3 is always index 3 forever.”
If your rollout process depends on stable identity, use explicit map-driven steering (BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) and treat indexing as mutable runtime state.
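The compaction behavior is worth internalizing, because it silently invalidates index assumptions. A toy model of the swap-with-last compaction the kernel applies when a group member detaches (worker names are illustrative; this mirrors the kernel's behavior as I understand it, not kernel code):

```python
def detach(group, idx):
    """Model reuseport group compaction: the last member is swapped
    into the vacated slot, then the array shrinks by one."""
    group[idx] = group[-1]
    group.pop()
    return group

group = ["w0", "w1", "w2", "w3", "w4"]
detach(group, 1)  # w1 leaves the group
# w4 now occupies index 1 -- any policy that steered "index 1" by name
# is suddenly steering a different worker.
assert group == ["w0", "w4", "w2", "w3"]
```

This is precisely why map-driven steering (BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) is the safer contract: the map key, not the group position, carries identity.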
4) Restart safety: the subtle failure mode
Historically, reuseport groups could lose in-flight TCP handshake-related state when a listener closed during restart/reload windows.
Modern migration support (5.14+) improves this via migration-aware program paths (BPF_SK_REUSEPORT_SELECT_OR_MIGRATE); the non-eBPF path is governed by the net.ipv4.tcp_migrate_req sysctl.
Operational guidance:
- If hot-restart correctness is critical, test restart under SYN flood / handshake churn, not only idle canaries.
- Validate no spike in:
- reset responses,
- accept failures,
- connection setup latency,
- handshake drop counters.
“Graceful restart” claims are meaningless unless measured during contention.
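Because both the migration path and the eBPF features are version-gated, the capability check should be explicit code, not a deployment assumption. A minimal sketch (version thresholds are the landmarks from section 1; the capability names are illustrative):

```python
import os

def kernel_at_least(major, minor):
    """Compare the running kernel release against a required version."""
    release = os.uname().release            # e.g. "5.15.0-91-generic"
    parts = release.split("-")[0].split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)

# Gates derived from the version landmarks in section 1.
CAPS = {
    "reuseport":            (3, 9),
    "reuseport_ebpf_udp":   (4, 5),
    "reuseport_ebpf_tcp":   (4, 6),
    "sk_reuseport_prog":    (4, 19),
    "reuseport_migration":  (5, 14),
}

def supported(cap):
    return kernel_at_least(*CAPS[cap])
```

On a mixed-kernel fleet, run this gate at startup and refuse to attach custom steering on hosts that fail it; that single check removes an entire class of "works on the canary, breaks on the old rack" incidents.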
5) Metrics that actually reveal reuseport health
Track at least:
Distribution / balance
- per-worker packets received (UDP) / connections accepted (TCP)
- per-worker bytes
- imbalance ratio:
max(worker_rate) / median(worker_rate)
Loss / pressure
- per-worker socket drops / rcvbuf errors
- per-worker queue depth (accept queue / app queue)
- backlog overflow indicators
User-visible outcomes
- p95/p99 request latency by worker or shard
- connection setup latency distribution
- timeout/retry rate
Restart correctness
- rst spikes during deploy windows
- handshake success ratio pre/during/post deploy
- migration-triggered error counters (if exposed)
Averages hide this problem. Keep per-worker visibility.
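The imbalance ratio above is deliberately robust to one hot worker. A minimal sketch, with synthetic per-worker rates chosen to show how a group that looks fine on average still flags:

```python
from statistics import median

def imbalance_ratio(worker_rates):
    """max/median per-worker rate; 1.0 means perfectly balanced."""
    rates = list(worker_rates)
    return max(rates) / median(rates)

# Mean rate here is ~129 -- "healthy" -- but one worker runs at ~2.4x
# the median, which is exactly the skew the ratio is built to expose.
rates = [100, 105, 98, 102, 240]
assert round(imbalance_ratio(rates), 2) == 2.35
```

Median (not mean) in the denominator matters: the hot worker cannot drag its own baseline up and mask itself.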
6) Control states (recommended)
Use a small state machine instead of ad-hoc tweaks:
- BALANCED: default steering policy
- SKEW_WARN: imbalance exceeds threshold, monitor tighter
- SKEW_ACTIVE: switch to corrective steering profile
- RESTART_PROTECT: restart window with migration-aware policy and stricter alerting
- SAFE_FALLBACK: disable custom policy, revert to plain hash if errors spike
Example trigger ideas:
- imbalance_ratio > 1.8 for 60s -> SKEW_WARN
- imbalance_ratio > 2.2 and drop_rate rising -> SKEW_ACTIVE
- deploy start -> RESTART_PROTECT
- eBPF attach/map update failures or verifier/runtime anomalies -> SAFE_FALLBACK
Design principle: always keep a one-step path to known-good kernel default behavior.
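The states and example triggers above fit in a single pure function, which keeps the policy auditable. A sketch, assuming the trigger thresholds listed above (all values illustrative; `state` is reserved for later hysteresis rules):

```python
def next_state(state, m):
    """Decide the control state from observed metrics.
    m: dict with imbalance_ratio, imbalance_age_s, drop_rate_rising,
    deploy_active, policy_errors. Checks are ordered by severity so the
    one-step path back to kernel defaults always wins."""
    if m["policy_errors"]:
        return "SAFE_FALLBACK"      # revert to plain hash, argue later
    if m["deploy_active"]:
        return "RESTART_PROTECT"
    if m["imbalance_ratio"] > 2.2 and m["drop_rate_rising"]:
        return "SKEW_ACTIVE"
    if m["imbalance_ratio"] > 1.8 and m["imbalance_age_s"] >= 60:
        return "SKEW_WARN"
    return "BALANCED"
```

Keeping this a pure metrics-in, state-out function means the transition logic can be unit-tested offline against recorded incident metrics before it ever touches production.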
7) Rollout pattern that avoids self-inflicted outages
Observe-only phase
- Keep plain reuseport.
- Build per-worker baseline for balance, drops, p99.
Shadow policy validation
- Compute what custom steering would choose (without enforcing).
- Compare expected vs actual skew reduction potential.
Canary apply
- Small subset of hosts/ports.
- Hard rollback if p99, drops, or setup failures regress.
Progressive expansion
- Scale by host group, not all at once.
- Freeze rollout during known burst windows.
Restart stress validation
- Intentionally deploy under load.
- Require restart-window SLO pass before full adoption.
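The shadow-validation step can be made concrete with a small simulation: replay observed flows, compute what the kernel hash actually did versus what the candidate policy would have chosen, and compare skew without enforcing anything. A sketch, with a least-loaded stand-in policy and synthetic flow data (both are illustrative assumptions, not the real steering program):

```python
from statistics import median

def imbalance(counts):
    return max(counts) / median(counts)

def shadow_compare(flows, n_workers):
    """flows: iterable of (src_hash, packets).
    Returns (actual_skew, shadow_skew): the imbalance the kernel hash
    produced vs what a least-loaded shadow policy *would* have produced."""
    actual = [0] * n_workers
    shadow = [0] * n_workers
    for src_hash, pkts in flows:
        actual[src_hash % n_workers] += pkts       # kernel hash choice
        shadow[shadow.index(min(shadow))] += pkts  # hypothetical policy choice
    return imbalance(actual), imbalance(shadow)

# Two heavy exporters that happen to hash onto the same worker:
flows = [(0, 1000), (4, 1000), (1, 10), (2, 10), (3, 10)]
actual_skew, shadow_skew = shadow_compare(flows, 4)
assert shadow_skew < actual_skew  # the candidate policy would reduce skew
```

If the shadow comparison shows little skew reduction on real traffic, stop the rollout there: you get the answer without ever risking a canary.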
8) Common mistakes
No kernel capability gate
- Team assumes feature exists everywhere; mixed kernels break assumptions.
Treating per-worker skew as noise
- System looks healthy until one worker saturates and tails explode.
No fallback contract
- Custom policy fails with no immediate downgrade path.
Testing only steady-state traffic
- Restart and burst paths are where hidden defects appear.
Assuming socket index stability
- Group membership changes reorder effective indexing.
9) 30-minute incident runbook (reuseport imbalance spike)
- Confirm symptom:
- imbalance ratio, per-worker drops, p99 skew.
- Check whether issue coincides with deploy/restart window.
- If custom steering enabled:
- switch to fallback/plain hash policy.
- Verify immediate impact:
- drop rate down? queue down? p99 improving?
- If yes, hold fallback and capture forensic bundle:
- kernel version, policy version, map/program update logs.
- If no, investigate non-steering bottlenecks:
- NIC queue pinning, CPU saturation, app-level lock contention.
- Post-incident:
- tune thresholds/hysteresis,
- add replay or load-test case for reproduced failure.
10) Minimal decision matrix
- Need only basic multi-worker listen scaling? -> SO_REUSEPORT default.
- Seeing repeatable skew with concentrated flows? -> add eBPF steering as a candidate.
- Need robust hot restarts under load? -> migration-aware design and restart stress gates.
- Not ready to operate per-worker metrics + rollback? -> stay with default for now.
The winning setup is not “most programmable.” It is “easiest to keep correct at 3 a.m.”
References
- Linux socket(7) man page (SO_REUSEPORT / SO_ATTACH_REUSEPORT_EBPF semantics): https://man7.org/linux/man-pages/man7/socket.7.html
- eBPF program type BPF_PROG_TYPE_SK_REUSEPORT (context + migration notes): https://docs.ebpf.io/linux/program-type/BPF_PROG_TYPE_SK_REUSEPORT/
- eBPF helper bpf_sk_select_reuseport: https://docs.ebpf.io/linux/helper-function/bpf_sk_select_reuseport/
- Practical UDP skew example with reuseport eBPF in Go (case study): https://vincent.bernat.ch/en/blog/2026-reuseport-ebpf-go