Linux PREEMPT_RT + Threaded IRQ Playbook (Practical Tail-Latency Control)
Date: 2026-03-18
Category: knowledge
Why this matters
If you already tuned CPU isolation, IRQ affinity, and NIC queues but still see stubborn p99/p999 spikes, kernel scheduling behavior itself can be the bottleneck.
PREEMPT_RT helps by making Linux much more preemptible and moving most interrupt work into schedulable threads. That gives you better control over who runs first when latency pressure hits.
It is not a free lunch:
- overhead can increase,
- throughput can drop for some workloads,
- priority mistakes can create starvation.
Treat RT as a control-plane upgrade for latency, not a blanket speed boost.
1) Mental model: from “best effort latency” to “priority-governed latency”
Without RT:
- some kernel sections stay non-preemptible longer,
- IRQ work can preempt user tasks unpredictably,
- tail latency depends heavily on burst timing.
With PREEMPT_RT:
- much more kernel code becomes preemptible,
- most hardware IRQ handlers run as kernel threads (irq/<n>-<name>),
- you can assign priorities/affinity to IRQ work like normal schedulable entities.
Practical effect: jitter sources become more visible and tunable.
2) What changes under PREEMPT_RT (operator view)
2.1 Locking behavior changes
- Many spinlock paths take on sleeping RT-mutex behavior.
- raw_spinlock remains non-sleeping for truly hard critical sections.
Implication: better bounded latency, but different contention behavior than non-RT.
2.2 Interrupt handling becomes more schedulable
- Most IRQs are threaded on RT kernels.
- A subset can remain hard IRQ (IRQF_NO_THREAD / truly critical paths).
Implication: IRQ priority and CPU placement become first-class tuning knobs.
2.3 Scheduling policy matters more than before
- SCHED_FIFO/SCHED_RR priorities now strongly shape latency outcomes.
- A bad priority hierarchy can starve housekeeping threads.
Implication: you need explicit priority policy, not ad-hoc chrt tweaks.
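One way to make the policy explicit is to wrap chrt in a small script instead of running it ad hoc. This is a sketch: the PIDs and priority numbers are illustrative placeholders, and since chrt needs CAP_SYS_NICE to take effect, the DRY_RUN guard (default on) prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: apply an explicit RT priority policy instead of ad-hoc chrt tweaks.
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be done

set_rt_prio() {
    local pid="$1" prio="$2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: chrt -f -p $prio $pid"
    else
        chrt -f -p "$prio" "$pid"   # -f = SCHED_FIFO at the given priority
    fi
}

# Illustrative PIDs; resolve real ones with e.g. pgrep -f 'irq/'
set_rt_prio 1234 80   # latency-critical NIC IRQ thread
set_rt_prio 5678 70   # gateway worker, one step below the IRQ thread
```

Keeping the hierarchy in one reviewable file is the point; the exact numbers are yours to define.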
3) When to use RT (and when not to)
Good candidates:
- strict control-loop or market-gateway style p99/p999 targets,
- systems where rare >1ms spikes are business-critical incidents,
- hosts you can dedicate to latency-sensitive services.
Use caution / reconsider:
- mixed workloads where fairness and throughput dominate,
- teams without strong scheduler/IRQ observability,
- environments where kernel variant sprawl is operationally expensive.
Rule of thumb: if tail spikes are expensive enough to justify extra ops complexity, RT is worth piloting.
4) Baseline before any RT rollout
Collect these on current (non-RT) canary hosts first:
- Request latency p50/p95/p99/p999
- CPU run-queue pressure and migration stats
- IRQ rate + per-CPU distribution
- Softirq pressure and ksoftirqd activity
- Worst-case scheduler latency (cyclictest or equivalent)
Quick checks:
uname -a
grep -E 'PREEMPT|HZ=' /boot/config-$(uname -r)
cat /proc/softirqs
cat /proc/interrupts
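Raw /proc/interrupts is hard to eyeball on wide machines. A small awk pass (a sketch, assuming the standard /proc/interrupts layout with a CPU header row) can summarize total interrupt counts per CPU to spot skewed placement before any RT work:

```shell
# Sketch: per-CPU interrupt totals from /proc/interrupts.
per_cpu_irq_totals() {
    awk 'NR==1 { for (i = 1; i <= NF; i++) cpu[i] = $i; ncpu = NF; next }
         { for (i = 2; i <= ncpu + 1; i++)
               if ($i ~ /^[0-9]+$/) total[i-1] += $i }      # skip ERR:/MIS: stubs
         END { for (i = 1; i <= ncpu; i++) printf "%s %d\n", cpu[i], total[i] }' \
        /proc/interrupts
}
per_cpu_irq_totals
```

A heavily lopsided distribution here usually means Phase A hygiene (IRQ affinity) is not done yet.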
If you skip baseline, you won’t know whether RT improved tails or just moved bottlenecks.
5) Rollout strategy (safe sequence)
Phase A: Non-RT hygiene first
Before switching kernel type, ensure these are already sane:
- IRQ affinity pinned intentionally,
- hot threads pinned to intended CPUs,
- noisy background jobs isolated,
- frequency/power policy stable enough for latency SLOs.
RT cannot compensate for chaotic placement.
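As part of Phase A, it helps to verify the frequency policy is uniform across cores. A minimal sketch (cpufreq may be absent on VMs and containers, so the function reports that case instead of failing):

```shell
# Sketch: list the distinct cpufreq governors in use, or note cpufreq absence.
governor_check() {
    if ls /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >/dev/null 2>&1; then
        sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    else
        echo "no cpufreq interface"
    fi
}
governor_check
```

More than one governor in the output, or an aggressive power-saving governor on latency-critical cores, is worth fixing before blaming the scheduler.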
Phase B: Single-host RT canary
Install RT kernel on one production-like host.
Verify kernel mode:
uname -r
grep CONFIG_PREEMPT_RT /boot/config-$(uname -r)
Track same workload side-by-side vs non-RT control host.
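The verification can be scripted so canary automation fails fast on the wrong kernel. A sketch, assuming the /sys/kernel/realtime flag that RT kernels expose, with the config grep as a fallback:

```shell
# Sketch: report whether the running kernel is PREEMPT_RT.
rt_kernel_mode() {
    if [ "$(cat /sys/kernel/realtime 2>/dev/null)" = "1" ]; then
        echo "PREEMPT_RT"
    elif grep -q '^CONFIG_PREEMPT_RT=y' "/boot/config-$(uname -r)" 2>/dev/null; then
        echo "PREEMPT_RT"
    else
        echo "non-RT"
    fi
}
rt_kernel_mode
```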
Phase C: IRQ thread policy
Inspect IRQ threads:
ps -eLo pid,cls,rtprio,pri,psr,comm | grep -E 'irq/|softirq|ksoftirqd'
Then tune progressively:
- set affinity for critical IRQ threads,
- raise priority carefully for latency-critical device IRQs,
- keep housekeeping IRQs at lower priority.
Avoid “everything high priority.” That usually creates hidden starvation.
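Affinity for threaded IRQs is still set through the usual /proc/irq interface. A sketch with illustrative IRQ and CPU numbers; the write needs root, so the DRY_RUN guard (default on) only prints the intended change:

```shell
# Sketch: pin one IRQ (and its irq/<n>-* thread) to a CPU range.
pin_irq() {
    local irq="$1" cpus="$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would write $cpus to /proc/irq/$irq/smp_affinity_list"
    else
        echo "$cpus" > "/proc/irq/$irq/smp_affinity_list"
    fi
}
pin_irq 44 "2-3"   # e.g. keep a NIC queue IRQ on dedicated CPUs 2-3
```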
Phase D: Application + IRQ priority contract
Define explicit ordering, for example:
- NIC IRQ threads: high RT priority,
- gateway/worker threads: slightly below/above depending on architecture,
- logging/metrics/exporters: normal policy.
The exact numbers matter less than consistent hierarchy.
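One way to keep the hierarchy consistent across hosts is to encode it as data rather than as scattered chrt invocations. Role names and numbers below are illustrative, not recommendations:

```shell
# Sketch: single source of truth for the priority contract.
rt_prio_for() {
    case "$1" in
        nic-irq) echo 80 ;;   # NIC IRQ threads: highest of the set
        worker)  echo 70 ;;   # gateway/worker threads: just below
        logging) echo 0  ;;   # housekeeping stays SCHED_OTHER (prio 0)
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}
rt_prio_for nic-irq   # → 80
```

Rejecting unknown roles loudly is deliberate: silent fallbacks are how ad-hoc priorities creep back in.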
6) Observability that actually catches RT failures
Latency outcomes
- request p99/p999,
- event loop delay / service time jitter,
- deadline misses per minute.
Scheduler/IRQ health
- schedstat / run-queue anomalies,
- IRQ thread runtime and migrations,
- sustained ksoftirqd CPU usage.
Starvation indicators
- sudden drops in low-priority housekeeping progress,
- delayed flush/log/telemetry paths,
- watchdog-like symptoms during burst windows.
Useful tools:
- rt-tests (cyclictest),
- trace-cmd / ftrace scheduler+irq events,
- perf sched,
- tuna (priority/affinity management).
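For a quick starvation-risk review without extra tooling, plain ps can list everything currently scheduled as FIFO/RR, highest priority first. A sketch, assuming procps-style cls/rtprio format specifiers:

```shell
# Sketch: show RT-class threads (FF = SCHED_FIFO, RR = SCHED_RR) by priority.
list_rt_threads() {
    ps -eLo cls,rtprio,psr,comm | awk '$1 == "FF" || $1 == "RR"' | sort -k2 -rn
}
list_rt_threads
```

If this list is long, or anything unexpected sits above your latency-critical threads, the priority contract has drifted.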
7) Common footguns
No baseline control host
You can’t distinguish real gain from placebo.
Over-prioritizing everything
RT priorities are relative scarcity, not badges.
Ignoring IRQ affinity after enabling RT
Threaded IRQs still need deliberate CPU placement.
Mixing latency-critical and batch jobs on the same RT cores
Determinism collapses quickly.
Judging success by average latency only
RT is a tail-latency lever; p50 may barely move.
8) Minimal canary checklist (one-week loop)
- Pick one service and one host class.
- Capture 24h non-RT baseline.
- Switch one canary host to RT kernel.
- Keep affinity identical to baseline first.
- Tune IRQ thread priority/affinity in small steps.
- Compare p99/p999 + starvation signals daily.
- Promote only if tail gains persist through peak windows.
- Keep rollback path simple (kernel fallback + config revert).
References
- Linux kernel documentation (real-time preemption): https://www.kernel.org/doc/html/latest/core-api/real-time/index.html
- Linux kernel locking docs (RT semantics context): https://www.kernel.org/doc/html/latest/locking/index.html
- Threaded IRQ background (threadirqs, IRQ threading concepts): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
- rt-tests (cyclictest) source/tools: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/
- Linux Foundation RT wiki (operational context): https://wiki.linuxfoundation.org/realtime/start