Linux blk-mq I/O Scheduler Selection Playbook (none vs mq-deadline vs BFQ vs Kyber)
Date: 2026-03-17
Category: knowledge
Why this matters
On modern Linux, storage performance failures are often tail-latency problems, not average-throughput problems.
The wrong scheduler choice can look like:
- p99 write spikes during mixed read/write load,
- noisy-neighbor latency blowups on shared disks,
- "fast NVMe" that still feels jittery under bursty sync writes.
The right choice is workload-specific. This playbook is a practical way to choose, test, and roll out safely.
1) Quick mental model
With blk-mq, each block device's request queue can use a pluggable I/O scheduler.
Common options you’ll see in /sys/block/<dev>/queue/scheduler:
- none: minimal scheduler-level reordering.
- mq-deadline: deadline-style fairness/latency control.
- bfq: bandwidth fairness + interactive latency focus.
- kyber: latency-target driven throttling (if available in your kernel build).
The active scheduler is the one in brackets.
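A minimal helper sketch for scripting around that bracket convention (the function name is illustrative, not a standard tool):

```shell
# Extract the active (bracketed) scheduler from the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "[none] mq-deadline kyber bfq".
active_sched() {
  printf '%s\n' "$1" | grep -o '\[[^]]*\]' | tr -d '[]'
}

# Usage (device name is an example):
# active_sched "$(cat /sys/block/nvme0n1/queue/scheduler)"
```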
2) Fast decision matrix
A) NVMe / SSD, single-tenant, throughput-first
Start with: none
Why:
- device firmware + hardware queues already do heavy lifting,
- lowest scheduler overhead path,
- often best baseline for simple high-IOPS pipelines.
B) NVMe / SSD, mixed read+sync-write with tail SLOs
Start with: mq-deadline
Why:
- explicit starvation control (writes_starved),
- tunable read/write expiry windows,
- generally better predictability than pure none under contention.
C) Multi-tenant/shared host where fairness matters
Start with: mq-deadline (server), bfq (desktop/interactive-heavy)
Why:
- mq-deadline: lighter and predictable for servers,
- bfq: stronger fairness and responsiveness at the cost of extra CPU overhead.
D) Desktop/workstation interactivity under heavy background I/O
Start with: bfq
Why:
- designed to protect interactive responsiveness and soft real-time behavior.
E) You need latency targets as explicit knobs and Kyber is available
Try: kyber
Why:
- direct read/sync-write latency targets (read_lat_nsec, write_lat_nsec).
Caveat: validate carefully on your kernel/distro; the operational ecosystem is usually richer around none/mq-deadline.
3) Discovery commands (5 minutes)
# 1) list scheduler choices + current one
cat /sys/block/<dev>/queue/scheduler
# 2) rotational hint (1=HDD, 0=non-rotating)
cat /sys/block/<dev>/queue/rotational
# 3) queue depth-ish context
cat /sys/block/<dev>/queue/nr_requests
# 4) current i/o pressure and latency view (user-space)
iostat -x 1
Temporary switch (until reboot):
echo mq-deadline | sudo tee /sys/block/<dev>/queue/scheduler
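To survey all devices at once, a small sketch (the optional root argument is an assumption added so the function can be exercised against a fake sysfs tree):

```shell
# Print "<device>: <scheduler line>" for every block device.
# Defaults to /sys/block; pass another root for testing.
list_schedulers() {
  root="${1:-/sys/block}"
  for f in "$root"/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#"$root"/}            # strip the root prefix...
    dev=${dev%/queue/scheduler}  # ...and the trailing path
    printf '%s: %s\n' "$dev" "$(cat "$f")"
  done
}
```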
4) Baseline-first policy (don’t tune blind)
For each candidate scheduler, collect the same KPI bundle:
- read/write p50/p95/p99 latency,
- throughput (MB/s, IOPS),
- CPU overhead,
- queue depth and saturation (%util, await/service trends),
- application SLO metrics (not just fio numbers).
Run at least 3 load shapes:
- read-heavy,
- mixed read/write,
- bursty sync-write + background scan (tail killer scenario).
Rule: if improvement appears only in synthetic throughput but hurts app p99, reject it.
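The "tail killer" load shape could be sketched as a fio job file; the filename, size, and runtime below are placeholders to adjust for your hardware (writes are destructive to that file):

```ini
# fio sketch: bursty sync writes competing with a background scan.
[global]
ioengine=libaio
direct=1
time_based=1
runtime=60
filename=/mnt/test/fio.dat
size=4g
group_reporting=1

[sync-write-burst]
rw=randwrite
bs=4k
iodepth=1
fsync=1          ; fsync after every write to model sync-write pressure

[background-scan]
rw=read
bs=128k
iodepth=16
```

Compare the sync-write job's p99 completion latency across schedulers, not just aggregate MB/s.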
5) Practical tuning anchors
mq-deadline
Key tunables (under /sys/block/<dev>/queue/iosched/):
- read_expire
- write_expire
- fifo_batch
- writes_starved
- front_merges
Operational heuristics:
- tighter read tail SLO → consider lower read_expire / smaller fifo_batch,
- write starvation symptoms → adjust writes_starved,
- random-heavy workload where front-merge lookup is wasted → test front_merges=0.
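The heuristics above could be applied as a sketch like this; the values are illustrative starting points, not recommendations, and the function takes the iosched directory as an argument so it can be exercised against a fake tree:

```shell
# Apply example mq-deadline tunables for a tighter read tail.
# Kernel defaults per the deadline-iosched docs are noted in comments.
tune_mq_deadline() {
  qdir="$1"                        # e.g. /sys/block/sda/queue/iosched
  echo 250 > "$qdir/read_expire"   # ms until a read expires (default 500)
  echo 8   > "$qdir/fifo_batch"    # requests per batch (default 16)
  echo 2   > "$qdir/writes_starved" # read batches before a write batch (default 2)
}
```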
kyber
Primary knobs:
- read_lat_nsec
- write_lat_nsec (sync writes)
Interpretation: it throttles to hit target latency classes; too aggressive targets can tank throughput.
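A sketch of setting those targets (values in nanoseconds; the 1 ms read target is an illustrative aggressive choice, while 10 ms matches the documented sync-write default; the directory argument is for testability):

```shell
# Set kyber latency targets. Kernel defaults per the kyber-iosched
# docs: read_lat_nsec=2000000 (2 ms), write_lat_nsec=10000000 (10 ms).
set_kyber_targets() {
  qdir="$1"                              # e.g. /sys/block/nvme0n1/queue/iosched
  echo 1000000  > "$qdir/read_lat_nsec"  # 1 ms read target (aggressive)
  echo 10000000 > "$qdir/write_lat_nsec" # 10 ms sync-write target (default)
}
```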
bfq
Use when interactive fairness is worth overhead.
- If throughput dominates and latency heuristics hurt throughput, test with low-latency tuning off (low_latency=0) and compare.
- Be cautious on very high-IOPS server paths: measure scheduler CPU cost explicitly.
none
No scheduler-level fairness/reordering policy.
- Great baseline on fast, predictable dedicated devices.
- Risky when contention/noisy neighbors dominate tails.
6) ioprio reality check
Kernel docs note that I/O priorities are scheduler-dependent, currently supported by BFQ and mq-deadline.
Implication:
- If your ops model relies on ionice classes to protect critical jobs, none may remove the control surface you expected.
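A hedged wrapper sketch for deprioritizing background work; the fallback branch is an assumption added so the wrapper stays safe on minimal systems without util-linux's ionice:

```shell
# Run a command at low I/O priority: best-effort class (2), lowest
# level (7). Only effective when the device's scheduler supports
# I/O priorities (BFQ, mq-deadline); under "none" it is a no-op.
run_low_io_prio() {
  if command -v ionice >/dev/null 2>&1; then
    ionice -c 2 -n 7 "$@"
  else
    "$@"   # fallback: run unprioritized if ionice is unavailable
  fi
}
```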
7) Persistent configuration pattern
Set scheduler persistently via udev rule (example):
# /etc/udev/rules.d/60-ioscheduler.rules
ACTION=="add|change", KERNEL=="nvme*n1", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
Then reload and trigger:
sudo udevadm control --reload-rules
sudo udevadm trigger
Always verify post-boot state in CI/boot checks.
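A minimal check sketch for that boot/CI verification (the function name is illustrative; it takes the scheduler file path so it can be tested against a plain file):

```shell
# Succeed only if the active (bracketed) scheduler in the given
# sysfs file matches the expected policy value.
check_sched() {
  file="$1" want="$2"
  grep -q "\[$want\]" "$file"
}

# Usage (device name is an example):
# check_sched /sys/block/nvme0n1/queue/scheduler none || echo "POLICY DRIFT"
```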
8) Rollout plan (safe)
- Shadow benchmark on representative hardware (same kernel + FS + mount options).
- Canary hosts (5–10%) with rollback-ready automation.
- Compare 24h diurnal traffic:
- p99 latency,
- timeout/retry rate,
- CPU usage,
- incident count.
- Promote gradually if p99 improves and no reliability regressions.
- Keep explicit rollback command + runbook.
Rollback must be one command, not a wiki adventure.
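One way to make the rollback a single verified command, as a sketch (the scheduler file is an argument so the helper can be tested against a plain file; real sysfs readback shows the list with the active entry bracketed, so a substring check works in both cases):

```shell
# Write the previous scheduler back and confirm the readback mentions it.
rollback_sched() {
  file="$1" sched="$2"
  echo "$sched" > "$file" || return 1
  grep -q "$sched" "$file"
}

# Usage (device name is an example, run with sufficient privileges):
# rollback_sched /sys/block/nvme0n1/queue/scheduler none
```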
9) Common mistakes
Treating one scheduler as universally best
- Device type + workload + contention pattern decides.
Benchmarking only max throughput
- Real incidents are usually p99 latency + retry storms.
Ignoring scheduler-dependent controls
- ionice and fairness behavior change with scheduler choice.
Forgetting persistence
- Runtime echo test passes; a reboot silently reverts it.
Tuning before establishing a clean baseline
- You can’t optimize what you didn’t measure.
10) Minimal "good enough" defaults
If you need a pragmatic starting point:
- HDD / rotational: mq-deadline
- NVMe/SSD dedicated throughput path: none
- Mixed server workload with tail SLO pain: mq-deadline first, then tune
- Interactive workstation: bfq
Then measure and adapt. No static rule beats your own traces.
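These defaults can be keyed off the rotational flag from Section 3, as a sketch (the fallback for unknown values is an assumption, not kernel policy):

```shell
# Map /sys/block/<dev>/queue/rotational (1=HDD, 0=non-rotating)
# to a pragmatic default scheduler.
default_sched() {
  case "$1" in
    1) echo mq-deadline ;;  # rotational: HDD
    0) echo none ;;         # non-rotational: NVMe/SSD dedicated path
    *) echo mq-deadline ;;  # unknown: assumed safe middle ground
  esac
}

# Usage (device name is an example):
# default_sched "$(cat /sys/block/sda/queue/rotational)"
```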
References
- Linux Kernel Docs — Switching Scheduler: https://www.kernel.org/doc/html/latest/block/switching-sched.html
- Linux Kernel Docs — Multi-Queue Block IO (blk-mq): https://www.kernel.org/doc/html/latest/block/blk-mq.html
- Linux Kernel Docs — Deadline scheduler tunables: https://www.kernel.org/doc/html/latest/block/deadline-iosched.html
- Linux Kernel Docs — Kyber tunables: https://www.kernel.org/doc/html/latest/block/kyber-iosched.html
- Linux Kernel Docs — BFQ scheduler: https://www.kernel.org/doc/html/latest/block/bfq-iosched.html
- Linux Kernel Docs — Block I/O priorities (ionice): https://www.kernel.org/doc/html/latest/block/ioprio.html
- Oracle Linux / RHEL guidance example (device-type default heuristic): https://docs.oracle.com/en/database/oracle/oracle-database/26/ladbi/setting-the-disk-io-scheduler-on-linux.html