Linux TCP Pacing + TSQ + BQL Playbook (Send-Side Latency Control)
Date: 2026-03-17
Category: knowledge
Why this matters
Many Linux latency incidents are not RX problems. They are send-side queueing problems:
- too much data sits in qdisc/device queues,
- one hot flow bloats TX rings,
- bursts leave in coarse chunks instead of smooth pacing,
- p99 latency spikes while average throughput still looks “fine”.
If RSS/IRQ tuning fixed RX-side pressure, this is the next high-leverage layer.
1) Mental model: where TX delay hides
Simplified path:
app -> socket send buffer -> TCP output -> qdisc -> driver TX ring -> NIC wire
Three controls matter most:
- TCP pacing (fq qdisc): spaces packet departures over time.
- TSQ guardrail (tcp_limit_output_bytes): caps bytes queued below TCP.
- BQL (Byte Queue Limits): dynamically limits driver TX queue backlog.
Think of them as a stack:
- fq shapes when packets leave,
- TSQ limits how far ahead TCP can dump,
- BQL prevents device queue from becoming a giant hidden buffer.
2) Fast decision matrix
A) p99 spikes + bursty egress + low drops
- Enable/check sch_fq pacing first.
- Verify the TSQ limit is sane.
- Confirm BQL is active for the NIC driver.
B) Throughput fine, but tiny RPCs suffer under bulk transfers
- Reduce excessive send-side backlog (TSQ + BQL validation).
- Prefer fq over simple fifo defaults.
- Re-check offload-induced burstiness.
C) One/few elephant flows dominate
- Use fq pacing and a per-socket pacing cap (SO_MAX_PACING_RATE) where possible.
- Keep the TSQ guardrail enabled.
- Watch if device queue inflight bytes stay high for long intervals.
D) Driver does not expose BQL
- Be conservative with txqueuelen.
- Rely more on qdisc pacing/queue discipline.
- Consider NIC/driver upgrade path for robust low-latency operation.
3) 10-minute baseline capture
# Active qdisc and counters
sudo tc -s qdisc show dev eth0
# Key TCP output guardrail
sysctl net.ipv4.tcp_limit_output_bytes
# Socket-level view (send queues, pacing-rate visibility depends on kernel/tools)
ss -tin
# NIC stats (driver-specific names)
ethtool -S eth0
# Offload context
ethtool -k eth0
# TX queue state
ip -s link show dev eth0
If available, inspect BQL per TX queue:
ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
# often includes: limit limit_max limit_min hold_time inflight
Capture before/after snapshots to avoid blind tuning.
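The snapshot step can be scripted so before/after diffs are mechanical rather than eyeballed. A minimal Python sketch (device name, paths, and the choice of tx-0 are illustrative; adjust for your NIC and queue count):

```python
#!/usr/bin/env python3
"""Capture a send-side queue-state snapshot for before/after comparison."""
import json
import subprocess
import time
from pathlib import Path

def read_if_exists(path):
    """Return stripped file contents, or None when the knob is not exposed."""
    p = Path(path)
    return p.read_text().strip() if p.is_file() else None

def run(cmd):
    """Run a command, returning stdout or None if it fails/is missing."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None

def snapshot(dev="eth0"):
    return {
        "ts": time.time(),
        "qdisc": run(["tc", "-s", "qdisc", "show", "dev", dev]),
        "tsq": read_if_exists("/proc/sys/net/ipv4/tcp_limit_output_bytes"),
        "bql_inflight": read_if_exists(
            f"/sys/class/net/{dev}/queues/tx-0/byte_queue_limits/inflight"),
    }

if __name__ == "__main__":
    print(json.dumps(snapshot(), indent=2))
```

Run it once before and once after each change, and diff the JSON files alongside your latency metrics.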
4) Step-by-step tuning sequence
4.1 Use fq as root qdisc for paced TCP
sudo tc qdisc replace dev eth0 root fq
sudo tc -s qdisc show dev eth0
Why: fq cooperates with TCP pacing and smooths burst emission. Modern kernels use EDT-style scheduling with fq for better pacing behavior.
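In automation it helps to assert that fq actually took effect at root rather than trusting the apply step. The sketch below parses the usual `qdisc fq ... root ...` line shape from `tc` output (an assumption about the output format; verify against your iproute2 version):

```python
import subprocess

def root_qdisc_is_fq(tc_output: str) -> bool:
    """True when any qdisc line reports fq installed at root."""
    for line in tc_output.splitlines():
        parts = line.split()
        if parts[:2] == ["qdisc", "fq"] and "root" in parts:
            return True
    return False

def check_dev(dev: str = "eth0") -> bool:
    """Query the live qdisc config for dev and check for fq at root."""
    out = subprocess.run(["tc", "qdisc", "show", "dev", dev],
                         capture_output=True, text=True).stdout
    return root_qdisc_is_fq(out)
```

A check like this belongs in the canary gate: fail the rollout if the expected qdisc is not in place before load testing starts.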
4.2 Validate TSQ guardrail (tcp_limit_output_bytes)
sysctl net.ipv4.tcp_limit_output_bytes
# temporary change example:
sudo sysctl -w net.ipv4.tcp_limit_output_bytes=262144
Interpretation:
- Too high: more hidden queueing below TCP, worse tail latency under contention.
- Too low: potential throughput penalty in some high-BDP paths.
Tune for your SLO (latency-first vs bulk-throughput-first), not folklore.
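That interpretation can be encoded as a simple audit check. The 128 KiB to 1 MiB band below is an illustrative assumption for this sketch, not a kernel recommendation; derive your own band from SLO experiments:

```python
from pathlib import Path

TSQ_PATH = Path("/proc/sys/net/ipv4/tcp_limit_output_bytes")

def classify_tsq(limit: int, lo: int = 128 * 1024, hi: int = 1 << 20) -> str:
    """Bucket the TSQ limit against an illustrative latency-first band.

    lo/hi are assumptions for this sketch; pick values from your own
    canary measurements rather than copying these.
    """
    if limit < lo:
        return "low: watch for throughput loss on high-BDP paths"
    if limit > hi:
        return "high: watch for hidden sub-TCP queueing and tail latency"
    return "within band"

if __name__ == "__main__":
    print(classify_tsq(int(TSQ_PATH.read_text())))
```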
4.3 Confirm BQL behavior on TX queues
For each TX queue (if supported):
for q in /sys/class/net/eth0/queues/tx-*; do
echo "== $q =="
ls "$q/byte_queue_limits" 2>/dev/null || echo "(no BQL exposed)"
for f in limit limit_max limit_min hold_time inflight; do
[ -f "$q/byte_queue_limits/$f" ] && printf "%s: " "$f" && cat "$q/byte_queue_limits/$f"
done
echo
done
You want inflight bytes to be bounded/adaptive, not persistently huge.
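The same walk is easy to script for periodic sampling. This sketch reads the byte_queue_limits files listed above; the sysfs base path is parameterized so it can be pointed at test fixtures:

```python
from pathlib import Path

def bql_report(dev: str = "eth0", base: str = "/sys/class/net"):
    """Yield (queue_name, inflight_bytes, limit_bytes) per BQL-capable TX queue."""
    for q in sorted(Path(base, dev, "queues").glob("tx-*")):
        bql = q / "byte_queue_limits"
        if not bql.is_dir():
            continue  # driver does not expose BQL for this queue
        inflight = int((bql / "inflight").read_text())
        limit = int((bql / "limit").read_text())
        yield q.name, inflight, limit
```

Sampling this at a fixed interval and alerting when inflight stays near limit for long stretches gives you the "persistently huge" signal directly.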
4.4 Keep txqueuelen sane (don’t paper over queueing)
ip link show dev eth0 | grep -o "qlen [0-9]*"
# example adjustment (validate in canary first)
sudo ip link set dev eth0 txqueuelen 1000
Huge TX queue lengths can hide congestion and inflate tail latency. Treat qlen as a control knob, not a “bigger is safer” setting.
4.5 Optional: per-socket pacing caps in app tier
Where app supports it, SO_MAX_PACING_RATE can protect shared hosts from one flow monopolizing egress pacing budget.
Use per-service defaults + overrides for exceptional transfers.
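As a Linux-only sketch, the option can be set from Python. SO_MAX_PACING_RATE is not exported by the standard socket module, so its Linux value (47) is hard-coded here; verify it against your kernel headers:

```python
import socket

# Linux value from the asm-generic socket headers; not exported by
# Python's socket module, so defined manually (verify on your platform).
SO_MAX_PACING_RATE = 47

def cap_pacing(sock: socket.socket, bytes_per_sec: int) -> int:
    """Cap fq/TCP pacing for this socket; return the kernel's view of the cap."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, bytes_per_sec)
    return sock.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)

if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        cap_pacing(s, 10 * 1024 * 1024)  # cap this flow at ~10 MiB/s
```

Note the rate is in bytes per second, and pacing enforcement depends on the qdisc in use (fq paces per-socket rates directly; with other qdiscs, TCP's internal pacing applies).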
5) Observability checklist (what success looks like)
Track before/after:
- p95/p99 request latency (and timeout/retry rates)
- qdisc backlog/overlimits from tc -s
- device TX drops/requeues from ethtool -S
- socket send queue growth patterns (ss -tin)
- CPU softirq skew (if combined with RX/TX steering work)
Success pattern:
- smoother egress behavior,
- lower tail latency during mixed load,
- no surprise rise in drops/retries.
6) Common mistakes
- Tuning only app buffers, ignoring kernel/device queues: latency debt is often below the app.
- Disabling pacing and expecting TSQ/BQL to do everything: these controls are complementary.
- Changing many knobs simultaneously: you lose causality and rollback clarity.
- Using throughput-only benchmarks: tail SLO regressions can hide under good average Mbps.
- No persistence plan: tc/sysctl changes vanish after reboot unless codified.
7) Practical rollout template
- Baseline: latency + tc -s + ethtool -S + sysctl snapshot.
- Enable fq root qdisc.
- Re-measure.
- Adjust TSQ guardrail conservatively.
- Verify BQL activity/support per queue.
- Canary under mixed tiny+bulk traffic.
- Persist with systemd-networkd / NetworkManager / provisioning scripts.
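As a persistence sketch for the systemd-networkd route (file names are illustrative, and the [FairQueueing] qdisc section requires a reasonably recent systemd; verify support on your version):

```ini
# /etc/sysctl.d/90-tsq.conf — applied at boot by systemd-sysctl
net.ipv4.tcp_limit_output_bytes = 262144

# /etc/systemd/network/10-eth0.network — installs fq as root qdisc
[Match]
Name=eth0

[FairQueueing]
Parent=root
```

On NetworkManager or script-provisioned hosts, the equivalent is a dispatcher hook or boot-time unit that re-runs the tc commands from section 4.1.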
Closing
Send-side latency control on Linux is a queue-budget discipline problem. If you treat fq pacing, TSQ, and BQL as one control stack, you usually get better p99 behavior than ad-hoc buffer tuning or blind qlen changes.