Byte Queue Limits (BQL) Oscillation & Wire-Cadence Slippage Playbook

2026-03-20 · finance

Why this matters

Many execution stacks optimize strategy logic, venue routing, and feed latency, but miss a kernel-level source of hidden cost: transmit queue-limit oscillation.

On Linux, BQL dynamically limits how many bytes can sit in each NIC TX queue. When this control loop becomes unstable (too permissive, then too tight, then permissive again), wire cadence becomes sawtoothed, alternating between mini-bursts and underfilled or starved phases.

Median latency can look acceptable while tail slippage quietly worsens.


Failure mechanism (host TX control loop -> execution tails)

  1. Application + qdisc produce bursty enqueue patterns (often amplified by offloads).
  2. Driver/NIC TX queue drains asynchronously; BQL limit adapts from completion feedback.
  3. Under unstable conditions, limit oscillates around the true operating point.
  4. Wire departure cadence alternates between mini-burst and underfill/starvation phases.
  5. Child-order timing dephases from intended schedule and queue-priority assumptions.

Result: inflation of tail implementation shortfall (IS) that is driven by host transmit-control instability, not purely by market regime.


Slippage decomposition with BQL term

For parent order \(i\):

\[ \mathrm{IS}_i = C_{\text{impact}} + C_{\text{timing}} + C_{\text{routing}} + C_{\text{bql}} \]

Where:

\[ C_{\text{bql}} = C_{\text{serialize}} + C_{\text{starve}} + C_{\text{burst-recover}} \]


Operational metrics (new)

1) BUI — Byte-Queue Utilization

\[ \mathrm{BUI}_t = \frac{\text{inflight}_t}{\max(\text{limit}_t, \epsilon)} \]

Per-queue occupancy pressure relative to the dynamic limit.

2) LOS — Limit Oscillation Score

\[ \mathrm{LOS} = \mathrm{p95}\left(\left|\Delta \log(\text{limit}_t + 1)\right|\right) \]

Captures instability in the movement of the BQL limit.

3) TSR — TX Stall Rate

\[ \mathrm{TSR} = \frac{\Delta\,\text{stall\_cnt}}{\Delta t} \]

Uses kernel BQL stall counters (where available) to quantify completion-stall episodes.
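
The three formulas above can be computed from periodic polls of one TX queue. The following is a minimal sketch, not a reference implementation: the `BqlSample` record, its field names, the nearest-rank percentile, and the default `eps` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BqlSample:
    """One poll of a single TX queue (hypothetical record layout)."""
    t: float        # sample time, seconds
    limit: int      # current BQL limit, bytes
    inflight: int   # bytes currently queued to the NIC
    stall_cnt: int  # cumulative stall counter


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def bui(sample: BqlSample, eps: float = 1.0) -> float:
    """BUI_t = inflight_t / max(limit_t, eps)."""
    return sample.inflight / max(sample.limit, eps)


def los(samples: Sequence[BqlSample]) -> float:
    """LOS = p95 of |delta log(limit_t + 1)| between consecutive polls."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    steps = [
        abs(math.log(b.limit + 1) - math.log(a.limit + 1))
        for a, b in zip(samples, samples[1:])
    ]
    return percentile(steps, 95)


def tsr(samples: Sequence[BqlSample]) -> float:
    """TSR = change in stall_cnt divided by elapsed time over the window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    dt = samples[-1].t - samples[0].t
    if dt <= 0:
        raise ValueError("samples must span a positive time interval")
    return (samples[-1].stall_cnt - samples[0].stall_cnt) / dt
```

Because LOS works on log differences, a limit that swings between 10 kB and 40 kB scores the same as one that swings between 100 kB and 400 kB, which keeps the score comparable across queues with different operating points.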

4) WCV95 — Wire Cadence Variability p95

p95 absolute deviation of inter-departure gaps from target pacing gap.
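
As a rough illustration of the definition above, WCV95 can be computed from a list of wire departure timestamps and the pacing gap the scheduler intended. The function name, the nearest-rank percentile, and the use of seconds as the unit are assumptions of this sketch.

```python
import math
from typing import Sequence


def wcv95(departure_times: Sequence[float], target_gap: float) -> float:
    """p95 of |inter-departure gap - target pacing gap| (nearest-rank)."""
    if len(departure_times) < 2:
        raise ValueError("need at least two departures")
    gaps = [b - a for a, b in zip(departure_times, departure_times[1:])]
    deviations = sorted(abs(gap - target_gap) for gap in gaps)
    rank = max(1, math.ceil(0.95 * len(deviations)))
    return deviations[rank - 1]
```

A perfectly paced stream gives 0, while a sawtooth that alternates short and long gaps around the target reports the amplitude of the sawtooth.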

5) BOT — BQL Oscillation Tax

Incremental IS in high-LOS/high-TSR windows vs matched stable windows.
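
One possible way to turn this definition into a number is a bucketed difference in mean IS. The sketch below assumes each window has already been assigned a match bucket (for example a tuple of spread, volatility, participation, and time-of-day bins) and a stable/oscillating label; the `Window` record, the basis-point unit, and the weighting by oscillating-window count are illustrative choices, and a tail statistic could be substituted for the mean.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean
from typing import Hashable, Iterable


@dataclass(frozen=True)
class Window:
    """One execution window (hypothetical record layout)."""
    bucket: Hashable   # match key built from market covariates
    oscillating: bool  # True if the window is high-LOS/high-TSR
    is_bps: float      # implementation shortfall, basis points


def bot(windows: Iterable[Window]) -> float:
    """Mean IS gap (oscillating minus stable) across matched buckets."""
    stable = defaultdict(list)
    swing = defaultdict(list)
    for w in windows:
        (swing if w.oscillating else stable)[w.bucket].append(w.is_bps)

    weighted_sum = 0.0
    weight = 0
    for bucket, osc_costs in swing.items():
        if bucket not in stable:
            continue  # no matched stable window, so no comparison possible
        diff = fmean(osc_costs) - fmean(stable[bucket])
        weighted_sum += diff * len(osc_costs)
        weight += len(osc_costs)
    if weight == 0:
        raise ValueError("no bucket has both stable and oscillating windows")
    return weighted_sum / weight
```

Buckets that contain only one regime are dropped rather than compared against the global average, so the estimate never mixes market conditions.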


What to log in production

Kernel/NIC queue layer (per TX queue)
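
On Linux, per-queue BQL state is exposed under `/sys/class/net/<dev>/queues/tx-<n>/byte_queue_limits/`. The sketch below polls those files once; it assumes the stall counters (`stall_cnt`, `stall_max`, `stall_thrs`) exist only on recent kernels and records `None` for any file that is missing or unreadable. The `read_bql` helper and its `sysfs_root` parameter are hypothetical names used for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# Files read from each byte_queue_limits directory; the stall_* entries
# are assumed to be absent on older kernels.
BQL_FIELDS = (
    "limit", "limit_min", "limit_max", "inflight", "hold_time",
    "stall_cnt", "stall_max", "stall_thrs",
)


def read_bql(
    dev: str, sysfs_root: str = "/sys/class/net"
) -> Dict[str, Dict[str, Optional[int]]]:
    """Return {tx-queue name: {field: value or None}} for one device."""
    queues_dir = Path(sysfs_root) / dev / "queues"
    if not queues_dir.is_dir():
        return {}
    snapshot: Dict[str, Dict[str, Optional[int]]] = {}
    for queue_dir in sorted(queues_dir.glob("tx-*")):
        bql_dir = queue_dir / "byte_queue_limits"
        if not bql_dir.is_dir():
            continue  # driver or device without BQL support
        values: Dict[str, Optional[int]] = {}
        for field in BQL_FIELDS:
            try:
                values[field] = int((bql_dir / field).read_text().strip())
            except (OSError, ValueError):
                values[field] = None
        snapshot[queue_dir.name] = values
    return snapshot
```

Calling `read_bql("eth0")` at a fixed interval and timestamping each snapshot yields the per-queue series that the metrics in this playbook are built from.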

Qdisc/transport layer

Execution layer


Identification strategy (causal)

  1. Match windows by spread, volatility, participation, and time-of-day bucket.
  2. Segment into BQL_STABLE vs BQL_OSCILLATING by LOS/TSR thresholds.
  3. Estimate incremental tail IS with host and symbol fixed effects.
  4. Run intervention canaries:
    • pacing/qdisc changes (e.g., fq tuning),
    • TX queue/ring tuning,
    • offload profile changes,
    • BQL bound adjustments where policy permits.
  5. Confirm BOT reduction while market covariates stay matched.

If BOT falls after host-TX interventions while market covariates stay matched, the evidence points to an infrastructure cause rather than a market-regime effect.


Regime state machine

BQL_STABLE

BQL_SWING

BQL_STALLING

BQL_SAFE_CONTAIN

Use hysteresis + minimum dwell to avoid control flapping.
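
Hysteresis and minimum dwell can be sketched as follows. The transition rules and every threshold value here are assumptions chosen for illustration, not taken from the playbook: escalation is immediate, de-escalation moves one level at a time and only after the state has been held for `min_dwell_s` with signals below the lower exit thresholds.

```python
from dataclasses import dataclass
from enum import IntEnum


class Regime(IntEnum):
    BQL_STABLE = 0
    BQL_SWING = 1
    BQL_STALLING = 2
    BQL_SAFE_CONTAIN = 3


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical values; calibrate per host and NIC."""
    los_enter: float = 0.5    # LOS at or above this enters SWING
    los_exit: float = 0.25    # LOS must fall below this to leave SWING
    tsr_enter: float = 1.0    # stalls/s at or above this enters STALLING
    tsr_exit: float = 0.2     # stalls/s must fall below this to leave STALLING
    tsr_contain: float = 5.0  # stalls/s at or above this forces SAFE_CONTAIN
    min_dwell_s: float = 10.0


class RegimeMachine:
    def __init__(self, thresholds: Thresholds = Thresholds(), start_time: float = 0.0):
        self.th = thresholds
        self.state = Regime.BQL_STABLE
        self.entered_at = start_time

    def _level(self, los: float, tsr: float, los_bar: float, tsr_bar: float) -> Regime:
        if tsr >= self.th.tsr_contain:
            return Regime.BQL_SAFE_CONTAIN
        if tsr >= tsr_bar:
            return Regime.BQL_STALLING
        if los >= los_bar:
            return Regime.BQL_SWING
        return Regime.BQL_STABLE

    def update(self, now: float, los: float, tsr: float) -> Regime:
        up = self._level(los, tsr, self.th.los_enter, self.th.tsr_enter)
        if up > self.state:
            # Escalate immediately on the stricter "enter" thresholds.
            self.state, self.entered_at = up, now
            return self.state
        # De-escalate only against the looser "exit" thresholds.
        down = self._level(los, tsr, self.th.los_exit, self.th.tsr_exit)
        dwelled = (now - self.entered_at) >= self.th.min_dwell_s
        if down < self.state and dwelled:
            self.state, self.entered_at = Regime(self.state - 1), now
        return self.state
```

Stepping down one level at a time means a host leaving BQL_SAFE_CONTAIN must pass through BQL_STALLING and BQL_SWING, each with its own dwell, before normal execution resumes.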


Control ladder

  1. Make TX queue state observable first
    • without per-queue BQL telemetry, “random venue noise” diagnosis is unreliable.
  2. Stabilize pacing upstream of NIC queue
    • use fair-queue pacing intentionally; avoid unbounded burst injection.
  3. Tune queue bounds conservatively
    • over-large limits can hide latency in driver/NIC queues.
  4. Handle offload interactions explicitly
    • TSO/GSO profiles can amplify byte-burst shape into cadence distortion.
  5. Use stall counters as hard safety signals
    • repeated completion stalls should trigger automatic defensive execution mode.
  6. Model LOS/TSR as first-class slippage features
    • include in mean + tail heads, not just dashboard alerts.

Failure drills (must run)

  1. Burst-injection drill
    • reproduce high enqueue burstiness and validate LOS/TSR detection.
  2. Pacing-canary drill
    • compare BOT before/after pacing policy changes.
  3. Bound-sensitivity drill
    • controlled limit_min/limit_max experiments with rollback plan.
  4. Stall-threshold drill
    • validate stall_thrs alerting and SAFE_CONTAIN transition behavior.

Common mistakes


Bottom line

BQL is a control loop, not just a queue knob.

When that loop oscillates, execution timing becomes non-deterministic and tail slippage rises. Treat per-queue BQL telemetry and stall signals as first-class inputs to live slippage control.

