Linux TCP Pacing + TSQ + BQL Playbook (Send-Side Latency Control)

2026-03-17 · software


Why this matters

Many Linux latency incidents are not RX problems. They are send-side queueing problems: packets stall in socket buffers, the qdisc, and the driver TX ring before they ever reach the wire.

If RSS/IRQ tuning fixed RX-side pressure, this is the next high-leverage layer.


1) Mental model: where TX delay hides

Simplified path:

app -> socket send buffer -> TCP output -> qdisc -> driver TX ring -> NIC wire

Three controls matter most:

  1. TCP pacing (fq qdisc): spaces packet departures over time.
  2. TSQ guardrail (tcp_limit_output_bytes): caps bytes queued below TCP.
  3. BQL (Byte Queue Limits): dynamically limits driver TX queue backlog.

Think of them as a stack: pacing decides when packets leave at the TCP layer, TSQ bounds how much each socket may queue below TCP, and BQL bounds how much may sit in the driver ring.


2) Fast decision matrix

A) p99 spikes + bursty egress + low drops -> start with fq pacing (4.1).

B) Throughput fine, but tiny RPCs suffer under bulk transfers -> tighten the TSQ guardrail (4.2).

C) One/few elephant flows dominate -> add per-socket or per-flow pacing caps (4.5).

D) Driver does not expose BQL -> keep txqueuelen conservative and lean harder on fq + TSQ (4.4).


3) 10-minute baseline capture

# Active qdisc and counters
sudo tc -s qdisc show dev eth0

# Key TCP output guardrail
sysctl net.ipv4.tcp_limit_output_bytes

# Socket-level view (send queues, pacing-rate visibility depends on kernel/tools)
ss -tin

# NIC stats (driver-specific names)
ethtool -S eth0

# Offload context
ethtool -k eth0

# TX queue state
ip -s link show dev eth0

If available, inspect BQL per TX queue:

ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
# often includes: limit limit_max limit_min hold_time inflight

Capture before/after snapshots to avoid blind tuning.
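The commands above can be bundled into one repeatable snapshot script for before/after diffing. A minimal sketch; `DEV` and `OUT` are illustrative defaults, and missing tools are tolerated rather than fatal:

```shell
#!/bin/sh
# Snapshot send-side queue state into a directory for later comparison.
DEV="${DEV:-eth0}"
OUT="${OUT:-/tmp/txq-$(date +%Y%m%d-%H%M%S)}"
mkdir -p "$OUT"

tc -s qdisc show dev "$DEV"            > "$OUT/qdisc.txt"   2>&1 || true
sysctl net.ipv4.tcp_limit_output_bytes > "$OUT/tsq.txt"     2>&1 || true
ss -tin                                > "$OUT/sockets.txt" 2>&1 || true
ethtool -S "$DEV"                      > "$OUT/nic.txt"     2>&1 || true
ip -s link show dev "$DEV"             > "$OUT/link.txt"    2>&1 || true

# BQL state per TX queue, when the driver exposes it
for q in /sys/class/net/"$DEV"/queues/tx-*/byte_queue_limits; do
  [ -d "$q" ] && grep -H . "$q"/* >> "$OUT/bql.txt" 2>/dev/null || true
done

echo "snapshot in $OUT"
```

Run it once before and once after each change, then diff the two directories.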


4) Step-by-step tuning sequence

4.1 Use fq as root qdisc for paced TCP

sudo tc qdisc replace dev eth0 root fq
sudo tc -s qdisc show dev eth0

Why: fq cooperates with TCP pacing and smooths burst emission. Modern kernels use EDT-style scheduling with fq for better pacing behavior.
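A quick way to confirm fq actually took effect is to grep the qdisc dump; `is_fq` below is a hypothetical helper name. With fq active, per-socket pacing also becomes visible as `pacing_rate` in `ss -tin` output:

```shell
# is_fq: hypothetical helper -- succeeds if a `tc qdisc show` dump
# has fq at the root.
is_fq() { grep -q '^qdisc fq '; }

if tc qdisc show dev eth0 2>/dev/null | is_fq; then
  echo "fq is the root qdisc"
else
  echo "fq is NOT the root qdisc"
fi
```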

4.2 Validate TSQ guardrail (tcp_limit_output_bytes)

sysctl net.ipv4.tcp_limit_output_bytes
# temporary change example:
sudo sysctl -w net.ipv4.tcp_limit_output_bytes=262144

Interpretation: lower values keep less per-socket data queued in the qdisc and driver, which favors tail latency; higher values let bulk senders keep the pipe full, which favors throughput on high-BDP paths.

Tune for your SLO (latency-first vs bulk-throughput-first), not folklore.
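To reason from the SLO rather than folklore, convert a byte budget into wire time. A worked sketch, assuming a 1 Gbit/s link and the 262144-byte example above:

```shell
# delay_ms = bytes * 8 / link_rate_bps * 1000
# 262144 bytes at 1 Gbit/s is roughly 2.1 ms of serialization delay --
# an upper bound on how long a small RPC can sit behind one socket's backlog.
bytes=262144
rate_bps=1000000000
awk -v b="$bytes" -v r="$rate_bps" 'BEGIN { printf "%.2f ms\n", b * 8 / r * 1000 }'
```

If 2 ms of added tail latency is too much for your SLO, a lower guardrail is justified; on a 10 Gbit/s link the same budget costs ten times less.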

4.3 Confirm BQL behavior on TX queues

For each TX queue (if supported):

for q in /sys/class/net/eth0/queues/tx-*; do
  echo "== $q =="
  ls "$q/byte_queue_limits" 2>/dev/null || echo "(no BQL exposed)"
  for f in limit limit_max limit_min hold_time inflight; do
    [ -f "$q/byte_queue_limits/$f" ] && printf "%s: " "$f" && cat "$q/byte_queue_limits/$f"
  done
  echo
done

You want inflight bytes to be bounded/adaptive, not persistently huge.
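If inflight stays persistently huge, some operators clamp limit_max per queue. A sketch, where `cap_bql` is a hypothetical helper and 262144 is an illustrative value, not a recommendation (needs root on a real host; validate in canary):

```shell
# cap_bql NETROOT DEV CAP -- write CAP into limit_max for every TX queue.
# NETROOT is normally /sys/class/net; parameterized so it can be dry-run
# against a fake directory tree.
cap_bql() {
  netroot="$1"; dev="$2"; cap="$3"
  for bq in "$netroot/$dev"/queues/tx-*/byte_queue_limits; do
    [ -w "$bq/limit_max" ] || continue
    printf '%s\n' "$cap" > "$bq/limit_max"
  done
}

# real usage (root required):
#   cap_bql /sys/class/net eth0 262144
```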

4.4 Keep txqueuelen sane (don’t paper over queueing)

ip link show dev eth0 | grep -o "qlen [0-9]*"
# example adjustment (validate in canary first)
sudo ip link set dev eth0 txqueuelen 1000

Huge TX queue lengths can hide congestion and inflate tail latency. Treat qlen as a control knob, not a “bigger is safer” setting.

4.5 Optional: per-socket pacing caps in app tier

Where the app supports it, SO_MAX_PACING_RATE can keep one flow from monopolizing a shared host's egress pacing budget. Use per-service defaults plus overrides for exceptional transfers.
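Where the application cannot call setsockopt itself, fq's maxrate parameter caps every flow at the qdisc instead. A sketch; the 500mbit figure is illustrative, not a recommendation:

```shell
# Cap per-flow pacing at the qdisc (applies to all flows on this device).
sudo tc qdisc replace dev eth0 root fq maxrate 500mbit
tc qdisc show dev eth0
```

Note this is device-wide and coarser than a per-socket cap: every flow gets the same ceiling.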


5) Observability checklist (what success looks like)

Track before/after:

  1. App-tier p99/p999 send latency.
  2. tc -s qdisc counters: backlog, drops, requeues.
  3. BQL inflight per TX queue.
  4. Retransmits and per-socket pacing rate from ss -tin.

Success pattern: p99 drops and flattens while throughput holds, qdisc backlog stays bounded, and drops/retransmits do not grow.


6) Common mistakes

  1. Tuning only app buffers, ignoring kernel/device queues
    Latency debt is often below the app.

  2. Disabling pacing and expecting TSQ/BQL to do everything
    These controls are complementary.

  3. Changing many knobs simultaneously
    You lose causality and rollback clarity.

  4. Using throughput-only benchmarks
    Tail SLO regressions can hide under good average Mbps.

  5. No persistence plan
    tc/sysctl changes vanish after reboot unless codified.


7) Practical rollout template

  1. Baseline: latency + tc -s + ethtool -S + sysctl snapshot.
  2. Enable fq root qdisc.
  3. Re-measure.
  4. Adjust TSQ guardrail conservatively.
  5. Verify BQL activity/support per queue.
  6. Canary under mixed tiny+bulk traffic.
  7. Persist with systemd-networkd / NetworkManager / provisioning scripts.
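Step 7 can be sketched as a staging script: generate the sysctl drop-in and a oneshot systemd unit locally, then ship both with your provisioning tool. File names, the unit name, and the install paths are assumptions to adapt:

```shell
#!/bin/sh
# Stage persistence artifacts locally; install steps shown as comments.
STAGE="${STAGE:-$(mktemp -d)}"

cat > "$STAGE/90-tsq.conf" <<'EOF'
net.ipv4.tcp_limit_output_bytes = 262144
EOF

cat > "$STAGE/fq-eth0.service" <<'EOF'
[Unit]
Description=Set fq root qdisc on eth0 (adapt device name)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/tc qdisc replace dev eth0 root fq

[Install]
WantedBy=multi-user.target
EOF

echo "staged in $STAGE"
# install (via provisioning, not by hand):
#   cp "$STAGE/90-tsq.conf" /etc/sysctl.d/ && sysctl --system
#   cp "$STAGE/fq-eth0.service" /etc/systemd/system/ && systemctl enable --now fq-eth0
```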

Closing

Send-side latency control on Linux is a queue-budget discipline problem. If you treat fq pacing, TSQ, and BQL as one control stack, you usually get better p99 behavior than ad-hoc buffer tuning or blind qlen changes.