Linux TCP Pacing + TSQ + BQL Playbook (Send-Side Latency Control)
Date: 2026-03-17
Category: knowledge
Why this matters
Many Linux latency incidents are not RX problems. They are send-side queueing problems:
- too much data sits in qdisc/device queues,
- one hot flow bloats TX rings,
- bursts leave in coarse chunks instead of smooth pacing,
- p99 latency spikes while average throughput still looks “fine”.
If RSS/IRQ tuning fixed RX-side pressure, this is the next high-leverage layer.
1) Mental model: where TX delay hides
Simplified path:
app -> socket send buffer -> TCP output -> qdisc -> driver TX ring -> NIC wire
Three controls matter most:
- TCP pacing (fq qdisc): spaces packet departures over time.
- TSQ guardrail (tcp_limit_output_bytes): caps bytes queued below TCP.
- BQL (Byte Queue Limits): dynamically limits driver TX queue backlog.
Think of them as a stack:
- fq shapes when packets leave,
- TSQ limits how far ahead TCP can dump,
- BQL prevents device queue from becoming a giant hidden buffer.
2) Fast decision matrix
A) p99 spikes + bursty egress + low drops
- Enable/check sch_fq pacing first.
- Verify the TSQ limit is sane.
- Confirm BQL is active for the NIC driver.
B) Throughput fine, but tiny RPCs suffer under bulk transfers
- Reduce excessive send-side backlog (TSQ + BQL validation).
- Prefer fq over simple fifo defaults.
- Re-check offload-induced burstiness.
C) One/few elephant flows dominate
- Use fq pacing and a per-socket pacing cap (SO_MAX_PACING_RATE) where possible.
- Keep the TSQ guardrail enabled.
- Watch if device queue inflight bytes stay high for long intervals.
D) Driver does not expose BQL
- Be conservative with txqueuelen.
- Rely more on qdisc pacing/queue discipline.
- Consider NIC/driver upgrade path for robust low-latency operation.
3) 10-minute baseline capture
# Active qdisc and counters
sudo tc -s qdisc show dev eth0
# Key TCP output guardrail
sysctl net.ipv4.tcp_limit_output_bytes
# Socket-level view (send queues, pacing-rate visibility depends on kernel/tools)
ss -tin
# NIC stats (driver-specific names)
ethtool -S eth0
# Offload context
ethtool -k eth0
# TX queue state
ip -s link show dev eth0
If available, inspect BQL per TX queue:
ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
# often includes: limit limit_max limit_min hold_time inflight
Capture before/after snapshots to avoid blind tuning.
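The snapshot step can be scripted so before/after diffs are mechanical rather than eyeballed. A minimal Python sketch (device name, paths, and the choice of tx-0 are illustrative; adjust for your NIC and queue count):

```python
#!/usr/bin/env python3
"""Capture a send-side queue-state snapshot for before/after comparison."""
import json
import subprocess
import time
from pathlib import Path

def read_if_exists(path):
    """Return stripped file contents, or None when the knob is not exposed."""
    p = Path(path)
    return p.read_text().strip() if p.is_file() else None

def run(cmd):
    """Run a command, returning stdout or None if it fails/is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None

def snapshot(dev="eth0"):
    return {
        "ts": time.time(),
        "qdisc": run(["tc", "-s", "qdisc", "show", "dev", dev]),
        "tsq": read_if_exists("/proc/sys/net/ipv4/tcp_limit_output_bytes"),
        "bql_inflight": read_if_exists(
            f"/sys/class/net/{dev}/queues/tx-0/byte_queue_limits/inflight"),
    }

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```

Run it once before and once after each change, and diff the JSON files alongside your latency metrics.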
4) Step-by-step tuning sequence
4.1 Use fq as root qdisc for paced TCP
sudo tc qdisc replace dev eth0 root fq
sudo tc -s qdisc show dev eth0
Why: fq cooperates with TCP pacing and smooths burst emission. Modern kernels use EDT-style scheduling with fq for better pacing behavior.
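In automation it helps to assert that fq actually took effect at root rather than trusting the apply step. The sketch below parses the usual `qdisc fq ... root ...` line shape from `tc` output (an assumption about the output format; verify against your iproute2 version):

```python
import subprocess

def root_qdisc_is_fq(tc_output: str) -> bool:
    """True when any qdisc line reports fq installed at root."""
    for line in tc_output.splitlines():
        parts = line.split()
        if parts[:2] == ["qdisc", "fq"] and "root" in parts:
            return True
    return False

def check_dev(dev: str = "eth0") -> bool:
    """Query the live qdisc config for dev and check for fq at root."""
    out = subprocess.run(["tc", "qdisc", "show", "dev", dev],
                         capture_output=True, text=True).stdout
    return root_qdisc_is_fq(out)
```

A check like this belongs in the canary gate: fail the rollout if the expected qdisc is not in place before load testing starts.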
4.2 Validate TSQ guardrail (tcp_limit_output_bytes)
sysctl net.ipv4.tcp_limit_output_bytes
# temporary change example:
sudo sysctl -w net.ipv4.tcp_limit_output_bytes=262144
Interpretation:
- Too high: more hidden queueing below TCP, worse tail latency under contention.
- Too low: potential throughput penalty in some high-BDP paths.
Tune for your SLO (latency-first vs bulk-throughput-first), not folklore.
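That interpretation can be encoded as a simple audit check. The 128 KiB to 1 MiB band below is an illustrative assumption for this sketch, not a kernel recommendation; derive your own band from SLO experiments:

```python
from pathlib import Path

TSQ_PATH = Path("/proc/sys/net/ipv4/tcp_limit_output_bytes")

def classify_tsq(limit: int, lo: int = 128 * 1024, hi: int = 1 << 20) -> str:
    """Bucket the TSQ limit against an illustrative latency-first band.

    lo/hi are assumptions for this sketch; pick values from your own
    canary measurements rather than copying these.
    """
    if limit < lo:
        return "low: watch for throughput loss on high-BDP paths"
    if limit > hi:
        return "high: watch for hidden sub-TCP queueing and tail latency"
    return "within band"

if __name__ == "__main__":
    print(classify_tsq(int(TSQ_PATH.read_text())))
```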
4.3 Confirm BQL behavior on TX queues
For each TX queue (if supported):
for q in /sys/class/net/eth0/queues/tx-*; do
echo "== $q =="
ls "$q/byte_queue_limits" 2>/dev/null || echo "(no BQL exposed)"
for f in limit limit_max limit_min hold_time inflight; do
[ -f "$q/byte_queue_limits/$f" ] && printf "%s: " "$f" && cat "$q/byte_queue_limits/$f"
done
echo
done
You want inflight bytes to be bounded/adaptive, not persistently huge.
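The same walk is easy to script for periodic sampling. This sketch reads the byte_queue_limits files listed above; the sysfs base path is parameterized so it can be pointed at test fixtures:

```python
from pathlib import Path

def bql_report(dev: str = "eth0", base: str = "/sys/class/net"):
    """Yield (queue_name, inflight_bytes, limit_bytes) per BQL-capable TX queue."""
    for q in sorted(Path(base, dev, "queues").glob("tx-*")):
        bql = q / "byte_queue_limits"
        if not bql.is_dir():
            continue  # driver does not expose BQL for this queue
        inflight = int((bql / "inflight").read_text())
        limit = int((bql / "limit").read_text())
        yield q.name, inflight, limit
```

Sampling this at a fixed interval and alerting when inflight stays near limit for long stretches gives you the "persistently huge" signal directly.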
4.4 Keep txqueuelen sane (don’t paper over queueing)
ip link show dev eth0 | grep -o "qlen [0-9]*"
# example adjustment (validate in canary first)
sudo ip link set dev eth0 txqueuelen 1000
Huge TX queue lengths can hide congestion and inflate tail latency. Treat qlen as a control knob, not a “bigger is safer” setting.
4.5 Optional: per-socket pacing caps in app tier
Where app supports it, SO_MAX_PACING_RATE can protect shared hosts from one flow monopolizing egress pacing budget.
Use per-service defaults + overrides for exceptional transfers.
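As a Linux-only sketch, the option can be set from Python. SO_MAX_PACING_RATE is not exported by the standard socket module, so its Linux value (47) is hard-coded here; verify it against your kernel headers:

```python
import socket

# Linux value from the asm-generic socket headers; not exported by
# Python's socket module, so defined manually (verify on your platform).
SO_MAX_PACING_RATE = 47

def cap_pacing(sock: socket.socket, bytes_per_sec: int) -> int:
    """Cap fq/TCP pacing for this socket; return the kernel's view of the cap."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, bytes_per_sec)
    return sock.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)

if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        cap_pacing(s, 10 * 1024 * 1024)  # cap this flow at ~10 MiB/s
```

Note the rate is in bytes per second, and pacing enforcement depends on the qdisc in use (fq paces per-socket rates directly; with other qdiscs, TCP's internal pacing applies).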
5) Observability checklist (what success looks like)
Track before/after:
- p95/p99 request latency (and timeout/retry rates)
- qdisc backlog/overlimits from tc -s
- device TX drops/requeues from ethtool -S
- socket send queue growth patterns (ss -tin)
- CPU softirq skew (if combined with RX/TX steering work)
Success pattern:
- smoother egress behavior,
- lower tail latency during mixed load,
- no surprise rise in drops/retries.
6) Common mistakes
- Tuning only app buffers, ignoring kernel/device queues: latency debt is often below the app.
- Disabling pacing and expecting TSQ/BQL to do everything: these controls are complementary.
- Changing many knobs simultaneously: you lose causality and rollback clarity.
- Using throughput-only benchmarks: tail SLO regressions can hide under good average Mbps.
- No persistence plan: tc/sysctl changes vanish after reboot unless codified.
7) Practical rollout template
- Baseline: latency + tc -s + ethtool -S + sysctl snapshot.
- Enable fq root qdisc.
- Re-measure.
- Adjust TSQ guardrail conservatively.
- Verify BQL activity/support per queue.
- Canary under mixed tiny+bulk traffic.
- Persist with systemd-networkd / NetworkManager / provisioning scripts.
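As a persistence sketch for the systemd-networkd route (file names are illustrative, and the [FairQueueing] qdisc section requires a reasonably recent systemd; verify support on your version):

```ini
# /etc/sysctl.d/90-tsq.conf — applied at boot by systemd-sysctl
net.ipv4.tcp_limit_output_bytes = 262144

# /etc/systemd/network/10-eth0.network — installs fq as root qdisc
[Match]
Name=eth0

[FairQueueing]
Parent=root
```

On NetworkManager or script-provisioned hosts, the equivalent is a dispatcher hook or boot-time unit that re-runs the tc commands from section 4.1.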
Closing
Send-side latency control on Linux is a queue-budget discipline problem. If you treat fq pacing, TSQ, and BQL as one control stack, you usually get better p99 behavior than ad-hoc buffer tuning or blind qlen changes.