Linux blk-mq I/O Scheduler Selection Playbook (none vs mq-deadline vs BFQ vs Kyber)
Date: 2026-03-17
Category: knowledge
Why this matters
On modern Linux, storage performance failures are often tail-latency problems, not average-throughput problems.
The wrong scheduler choice can look like:
- p99 write spikes during mixed read/write load,
- noisy-neighbor latency blowups on shared disks,
- "fast NVMe" that still feels jittery under bursty sync writes.
The right choice is workload-specific. This playbook is a practical way to choose, test, and roll out safely.
1) Quick mental model
With blk-mq, each block device's request queue can use a pluggable I/O scheduler.
Common options you’ll see in /sys/block/<dev>/queue/scheduler:
- none: minimal scheduler-level reordering.
- mq-deadline: deadline-style fairness/latency control.
- bfq: bandwidth fairness + interactive latency focus.
- kyber: latency-target driven throttling (if available in your kernel build).
The active scheduler is the one in brackets.
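A minimal helper sketch for scripting around that bracket convention (the function name is illustrative, not a standard tool):

```shell
# Extract the active (bracketed) scheduler from the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "[none] mq-deadline kyber bfq".
active_sched() {
  printf '%s\n' "$1" | grep -o '\[[^]]*\]' | tr -d '[]'
}

# Usage (device name is an example):
# active_sched "$(cat /sys/block/nvme0n1/queue/scheduler)"
```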
2) Fast decision matrix
A) NVMe / SSD, single-tenant, throughput-first
Start with: none
Why:
- device firmware + hardware queues already do heavy lifting,
- lowest scheduler overhead path,
- often best baseline for simple high-IOPS pipelines.
B) NVMe / SSD, mixed read+sync-write with tail SLOs
Start with: mq-deadline
Why:
- explicit starvation control (writes_starved),
- tunable read/write expiry windows,
- generally better predictability than pure none under contention.
C) Multi-tenant/shared host where fairness matters
Start with: mq-deadline (server), bfq (desktop/interactive-heavy)
Why:
- mq-deadline: lighter and predictable for servers,
- bfq: stronger fairness and responsiveness at the cost of extra CPU overhead.
D) Desktop/workstation interactivity under heavy background I/O
Start with: bfq
Why:
- designed to protect interactive responsiveness and soft real-time behavior.
E) You need latency targets as explicit knobs and Kyber is available
Try: kyber
Why:
- direct read/sync-write latency targets (read_lat_nsec, write_lat_nsec).
Caveat: validate carefully on your kernel/distro; the operational ecosystem is usually richer around none/mq-deadline.
3) Discovery commands (5 minutes)
# 1) list scheduler choices + current one
cat /sys/block/<dev>/queue/scheduler
# 2) rotational hint (1=HDD, 0=non-rotating)
cat /sys/block/<dev>/queue/rotational
# 3) queue depth-ish context
cat /sys/block/<dev>/queue/nr_requests
# 4) current i/o pressure and latency view (user-space)
iostat -x 1
Temporary switch (until reboot):
echo mq-deadline | sudo tee /sys/block/<dev>/queue/scheduler
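To survey all devices at once, a small sketch (the optional root argument is an assumption added so the function can be exercised against a fake sysfs tree):

```shell
# Print "<device>: <scheduler line>" for every block device.
# Defaults to /sys/block; pass another root for testing.
list_schedulers() {
  root="${1:-/sys/block}"
  for f in "$root"/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#"$root"/}            # strip the root prefix...
    dev=${dev%/queue/scheduler}  # ...and the trailing path
    printf '%s: %s\n' "$dev" "$(cat "$f")"
  done
}
```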
4) Baseline-first policy (don’t tune blind)
For each candidate scheduler, collect the same KPI bundle:
- read/write p50/p95/p99 latency,
- throughput (MB/s, IOPS),
- CPU overhead,
- queue depth and saturation (%util, await/service trends),
- application SLO metrics (not just fio numbers).
Run at least 3 load shapes:
- read-heavy,
- mixed read/write,
- bursty sync-write + background scan (tail killer scenario).
Rule: if improvement appears only in synthetic throughput but hurts app p99, reject it.
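The "tail killer" load shape could be sketched as a fio job file; the filename, size, and runtime below are placeholders to adjust for your hardware (writes are destructive to that file):

```ini
# fio sketch: bursty sync writes competing with a background scan.
[global]
ioengine=libaio
direct=1
time_based=1
runtime=60
filename=/mnt/test/fio.dat
size=4g
group_reporting=1

[sync-write-burst]
rw=randwrite
bs=4k
iodepth=1
fsync=1          ; fsync after every write to model sync-write pressure

[background-scan]
rw=read
bs=128k
iodepth=16
```

Compare the sync-write job's p99 completion latency across schedulers, not just aggregate MB/s.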
5) Practical tuning anchors
mq-deadline
Key tunables (under /sys/block/<dev>/queue/iosched/):
- read_expire
- write_expire
- fifo_batch
- writes_starved
- front_merges
Operational heuristics:
- tighter read tail SLO → consider lower read_expire / smaller fifo_batch,
- write starvation symptoms → adjust writes_starved,
- random-heavy workload where front-merge lookup is wasted → test front_merges=0.
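The heuristics above could be applied as a sketch like this; the values are illustrative starting points, not recommendations, and the function takes the iosched directory as an argument so it can be exercised against a fake tree:

```shell
# Apply example mq-deadline tunables for a tighter read tail.
# Kernel defaults per the deadline-iosched docs are noted in comments.
tune_mq_deadline() {
  qdir="$1"                        # e.g. /sys/block/sda/queue/iosched
  echo 250 > "$qdir/read_expire"   # ms until a read expires (default 500)
  echo 8   > "$qdir/fifo_batch"    # requests per batch (default 16)
  echo 2   > "$qdir/writes_starved" # read batches before a write batch (default 2)
}
```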
kyber
Primary knobs:
- read_lat_nsec
- write_lat_nsec (sync writes)
Interpretation: it throttles to hit target latency classes; too aggressive targets can tank throughput.
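A sketch of setting those targets (values in nanoseconds; the 1 ms read target is an illustrative aggressive choice, while 10 ms matches the documented sync-write default; the directory argument is for testability):

```shell
# Set kyber latency targets. Kernel defaults per the kyber-iosched
# docs: read_lat_nsec=2000000 (2 ms), write_lat_nsec=10000000 (10 ms).
set_kyber_targets() {
  qdir="$1"                              # e.g. /sys/block/nvme0n1/queue/iosched
  echo 1000000  > "$qdir/read_lat_nsec"  # 1 ms read target (aggressive)
  echo 10000000 > "$qdir/write_lat_nsec" # 10 ms sync-write target (default)
}
```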
bfq
Use when interactive fairness is worth overhead.
- If throughput dominates and latency heuristics hurt throughput, test with low-latency tuning off (low_latency=0) and compare.
- Be cautious on very high-IOPS server paths: measure scheduler CPU cost explicitly.
none
No scheduler-level fairness/reordering policy.
- Great baseline on fast, predictable dedicated devices.
- Risky when contention/noisy neighbors dominate tails.
6) ioprio reality check
Kernel docs note that I/O priorities are scheduler-dependent, currently supported by BFQ and mq-deadline.
Implication:
- If your ops model relies on ionice classes to protect critical jobs, none may remove the control surface you expected.
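A hedged wrapper sketch for deprioritizing background work; the fallback branch is an assumption added so the wrapper stays safe on minimal systems without util-linux's ionice:

```shell
# Run a command at low I/O priority: best-effort class (2), lowest
# level (7). Only effective when the device's scheduler supports
# I/O priorities (BFQ, mq-deadline); under "none" it is a no-op.
run_low_io_prio() {
  if command -v ionice >/dev/null 2>&1; then
    ionice -c 2 -n 7 "$@"
  else
    "$@"   # fallback: run unprioritized if ionice is unavailable
  fi
}
```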
7) Persistent configuration pattern
Set scheduler persistently via udev rule (example):
# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="nvme*n1", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
Then reload and trigger:
sudo udevadm control --reload-rules
sudo udevadm trigger
Always verify post-boot state in CI/boot checks.
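A minimal check sketch for that boot/CI verification (the function name is illustrative; it takes the scheduler file path so it can be tested against a plain file):

```shell
# Succeed only if the active (bracketed) scheduler in the given
# sysfs file matches the expected policy value.
check_sched() {
  file="$1" want="$2"
  grep -q "\[$want\]" "$file"
}

# Usage (device name is an example):
# check_sched /sys/block/nvme0n1/queue/scheduler none || echo "POLICY DRIFT"
```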
8) Rollout plan (safe)
- Shadow benchmark on representative hardware (same kernel + FS + mount options).
- Canary hosts (5–10%) with rollback-ready automation.
- Compare 24h diurnal traffic:
- p99 latency,
- timeout/retry rate,
- CPU usage,
- incident count.
- Promote gradually if p99 improves and no reliability regressions.
- Keep explicit rollback command + runbook.
Rollback must be one command, not a wiki adventure.
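One way to make the rollback a single verified command, as a sketch (the scheduler file is an argument so the helper can be tested against a plain file; real sysfs readback shows the list with the active entry bracketed, so a substring check works in both cases):

```shell
# Write the previous scheduler back and confirm the readback mentions it.
rollback_sched() {
  file="$1" sched="$2"
  echo "$sched" > "$file" || return 1
  grep -q "$sched" "$file"
}

# Usage (device name is an example, run with sufficient privileges):
# rollback_sched /sys/block/nvme0n1/queue/scheduler none
```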
9) Common mistakes
Treating one scheduler as universally best
- Device type + workload + contention pattern decides.
Benchmarking only max throughput
- Real incidents are usually p99 latency + retry storms.
Ignoring scheduler-dependent controls
- ionice and fairness behavior change with scheduler choice.
Forgetting persistence
- Runtime echo test passes; a reboot silently reverts it.
Tuning before establishing a clean baseline
- You can’t optimize what you didn’t measure.
10) Minimal "good enough" defaults
If you need a pragmatic starting point:
- HDD / rotational: mq-deadline
- NVMe/SSD dedicated throughput path: none
- Mixed server workload with tail SLO pain: mq-deadline first, then tune
- Interactive workstation: bfq
Then measure and adapt. No static rule beats your own traces.
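These defaults can be keyed off the rotational flag from Section 3, as a sketch (the fallback for unknown values is an assumption, not kernel policy):

```shell
# Map /sys/block/<dev>/queue/rotational (1=HDD, 0=non-rotating)
# to a pragmatic default scheduler.
default_sched() {
  case "$1" in
    1) echo mq-deadline ;;  # rotational: HDD
    0) echo none ;;         # non-rotational: NVMe/SSD dedicated path
    *) echo mq-deadline ;;  # unknown: assumed safe middle ground
  esac
}

# Usage (device name is an example):
# default_sched "$(cat /sys/block/sda/queue/rotational)"
```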
References
- Linux Kernel Docs — Switching Scheduler: https://www.kernel.org/doc/html/latest/block/switching-sched.html
- Linux Kernel Docs — Multi-Queue Block IO (blk-mq): https://www.kernel.org/doc/html/latest/block/blk-mq.html
- Linux Kernel Docs — Deadline scheduler tunables: https://www.kernel.org/doc/html/latest/block/deadline-iosched.html
- Linux Kernel Docs — Kyber tunables: https://www.kernel.org/doc/html/latest/block/kyber-iosched.html
- Linux Kernel Docs — BFQ scheduler: https://www.kernel.org/doc/html/latest/block/bfq-iosched.html
- Linux Kernel Docs — Block I/O priorities (ionice): https://www.kernel.org/doc/html/latest/block/ioprio.html
- Oracle Linux / RHEL guidance example (device-type default heuristic): https://docs.oracle.com/en/database/oracle/oracle-database/26/ladbi/setting-the-disk-io-scheduler-on-linux.html