Linux PREEMPT_RT + Threaded IRQ Playbook (Practical Tail-Latency Control)
Date: 2026-03-18
Category: knowledge
Why this matters
If you already tuned CPU isolation, IRQ affinity, and NIC queues but still see stubborn p99/p999 spikes, kernel scheduling behavior itself can be the bottleneck.
PREEMPT_RT helps by making Linux much more preemptible and moving most interrupt work into schedulable threads. That gives you better control over who runs first when latency pressure hits.
It is not a free lunch:
- overhead can increase,
- throughput can drop for some workloads,
- priority mistakes can create starvation.
Treat RT as a control-plane upgrade for latency, not a blanket speed boost.
1) Mental model: from “best effort latency” to “priority-governed latency”
Without RT:
- some kernel sections stay non-preemptible longer,
- IRQ work can preempt user tasks unpredictably,
- tail latency depends heavily on burst timing.
With PREEMPT_RT:
- much more kernel code becomes preemptible,
- most hardware IRQ handlers run as kernel threads (irq/<n>-<name>),
- you can assign priorities/affinity to IRQ work like normal schedulable entities.
Practical effect: jitter sources become more visible and tunable.
2) What changes under PREEMPT_RT (operator view)
2.1 Locking behavior changes
- Many spinlock paths take on sleeping RT-mutex behavior.
- raw_spinlock remains non-sleeping for truly hard critical sections.
Implication: better bounded latency, but different contention behavior than non-RT.
2.2 Interrupt handling becomes more schedulable
- Most IRQs are threaded on RT kernels.
- A subset can remain hard IRQ (IRQF_NO_THREAD / truly critical paths).
Implication: IRQ priority and CPU placement become first-class tuning knobs.
2.3 Scheduling policy matters more than before
- SCHED_FIFO/SCHED_RR priorities now strongly shape latency outcomes.
- A bad priority hierarchy can starve housekeeping threads.
Implication: you need explicit priority policy, not ad-hoc chrt tweaks.
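One way to make the policy explicit is to wrap chrt in a small script instead of running it ad hoc. This is a sketch: the PIDs and priority numbers are illustrative placeholders, and since chrt needs CAP_SYS_NICE to take effect, the DRY_RUN guard (default on) prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: apply an explicit RT priority policy instead of ad-hoc chrt tweaks.
DRY_RUN="${DRY_RUN:-1}"   # default: only print what would be done

set_rt_prio() {
    local pid="$1" prio="$2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: chrt -f -p $prio $pid"
    else
        chrt -f -p "$prio" "$pid"   # -f = SCHED_FIFO at the given priority
    fi
}

# Illustrative PIDs; resolve real ones with e.g. pgrep -f 'irq/'
set_rt_prio 1234 80   # latency-critical NIC IRQ thread
set_rt_prio 5678 70   # gateway worker, one step below the IRQ thread
```

Keeping the hierarchy in one reviewable file is the point; the exact numbers are yours to define.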
3) When to use RT (and when not to)
Good candidates:
- strict control-loop or market-gateway style p99/p999 targets,
- systems where rare >1ms spikes are business-critical incidents,
- hosts you can dedicate to latency-sensitive services.
Use caution / reconsider:
- mixed workloads where fairness and throughput dominate,
- teams without strong scheduler/IRQ observability,
- environments where kernel variant sprawl is operationally expensive.
Rule of thumb: if tail spikes are expensive enough to justify extra ops complexity, RT is worth piloting.
4) Baseline before any RT rollout
Collect these on current (non-RT) canary hosts first:
- Request latency p50/p95/p99/p999
- CPU run-queue pressure and migration stats
- IRQ rate + per-CPU distribution
- Softirq pressure and ksoftirqd activity
- Worst-case scheduler latency (cyclictest or equivalent)
Quick checks:
uname -a
grep -E 'PREEMPT|HZ=' /boot/config-$(uname -r)
cat /proc/softirqs
cat /proc/interrupts
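Raw /proc/interrupts is hard to eyeball on wide machines. A small awk pass (a sketch, assuming the standard /proc/interrupts layout with a CPU header row) can summarize total interrupt counts per CPU to spot skewed placement before any RT work:

```shell
# Sketch: per-CPU interrupt totals from /proc/interrupts.
per_cpu_irq_totals() {
    awk 'NR==1 { for (i = 1; i <= NF; i++) cpu[i] = $i; ncpu = NF; next }
         { for (i = 2; i <= ncpu + 1; i++)
               if ($i ~ /^[0-9]+$/) total[i-1] += $i }      # skip ERR:/MIS: stubs
         END { for (i = 1; i <= ncpu; i++) printf "%s %d\n", cpu[i], total[i] }' \
        /proc/interrupts
}
per_cpu_irq_totals
```

A heavily lopsided distribution here usually means Phase A hygiene (IRQ affinity) is not done yet.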
If you skip baseline, you won’t know whether RT improved tails or just moved bottlenecks.
5) Rollout strategy (safe sequence)
Phase A: Non-RT hygiene first
Before switching kernel type, ensure these are already sane:
- IRQ affinity pinned intentionally,
- hot threads pinned to intended CPUs,
- noisy background jobs isolated,
- frequency/power policy stable enough for latency SLOs.
RT cannot compensate for chaotic placement.
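As part of Phase A, it helps to verify the frequency policy is uniform across cores. A minimal sketch (cpufreq may be absent on VMs and containers, so the function reports that case instead of failing):

```shell
# Sketch: list the distinct cpufreq governors in use, or note cpufreq absence.
governor_check() {
    if ls /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >/dev/null 2>&1; then
        sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    else
        echo "no cpufreq interface"
    fi
}
governor_check
```

More than one governor in the output, or an aggressive power-saving governor on latency-critical cores, is worth fixing before blaming the scheduler.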
Phase B: Single-host RT canary
Install RT kernel on one production-like host.
Verify kernel mode:
uname -r
grep CONFIG_PREEMPT_RT /boot/config-$(uname -r)
Track same workload side-by-side vs non-RT control host.
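The verification can be scripted so canary automation fails fast on the wrong kernel. A sketch, assuming the /sys/kernel/realtime flag that RT kernels expose, with the config grep as a fallback:

```shell
# Sketch: report whether the running kernel is PREEMPT_RT.
rt_kernel_mode() {
    if [ "$(cat /sys/kernel/realtime 2>/dev/null)" = "1" ]; then
        echo "PREEMPT_RT"
    elif grep -q '^CONFIG_PREEMPT_RT=y' "/boot/config-$(uname -r)" 2>/dev/null; then
        echo "PREEMPT_RT"
    else
        echo "non-RT"
    fi
}
rt_kernel_mode
```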
Phase C: IRQ thread policy
Inspect IRQ threads:
ps -eLo pid,cls,rtprio,pri,psr,comm | grep -E 'irq/|softirq|ksoftirqd'
Then tune progressively:
- set affinity for critical IRQ threads,
- raise priority carefully for latency-critical device IRQs,
- keep housekeeping IRQs at lower priority.
Avoid “everything high priority.” That usually creates hidden starvation.
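Affinity for threaded IRQs is still set through the usual /proc/irq interface. A sketch with illustrative IRQ and CPU numbers; the write needs root, so the DRY_RUN guard (default on) only prints the intended change:

```shell
# Sketch: pin one IRQ (and its irq/<n>-* thread) to a CPU range.
pin_irq() {
    local irq="$1" cpus="$2"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would write $cpus to /proc/irq/$irq/smp_affinity_list"
    else
        echo "$cpus" > "/proc/irq/$irq/smp_affinity_list"
    fi
}
pin_irq 44 "2-3"   # e.g. keep a NIC queue IRQ on dedicated CPUs 2-3
```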
Phase D: Application + IRQ priority contract
Define explicit ordering, for example:
- NIC IRQ threads: high RT priority,
- gateway/worker threads: slightly below/above depending on architecture,
- logging/metrics/exporters: normal policy.
The exact numbers matter less than consistent hierarchy.
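One way to keep the hierarchy consistent across hosts is to encode it as data rather than as scattered chrt invocations. Role names and numbers below are illustrative, not recommendations:

```shell
# Sketch: single source of truth for the priority contract.
rt_prio_for() {
    case "$1" in
        nic-irq) echo 80 ;;   # NIC IRQ threads: highest of the set
        worker)  echo 70 ;;   # gateway/worker threads: just below
        logging) echo 0  ;;   # housekeeping stays SCHED_OTHER (prio 0)
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}
rt_prio_for nic-irq   # → 80
```

Rejecting unknown roles loudly is deliberate: silent fallbacks are how ad-hoc priorities creep back in.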
6) Observability that actually catches RT failures
Latency outcomes
- request p99/p999,
- event loop delay / service time jitter,
- deadline misses per minute.
Scheduler/IRQ health
- schedstat / run-queue anomalies,
- IRQ thread runtime and migrations,
- sustained ksoftirqd CPU usage.
Starvation indicators
- sudden drops in low-priority housekeeping progress,
- delayed flush/log/telemetry paths,
- watchdog-like symptoms during burst windows.
Useful tools:
- rt-tests (cyclictest),
- trace-cmd / ftrace scheduler+irq events,
- perf sched,
- tuna (priority/affinity management).
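For a quick starvation-risk review without extra tooling, plain ps can list everything currently scheduled as FIFO/RR, highest priority first. A sketch, assuming procps-style cls/rtprio format specifiers:

```shell
# Sketch: show RT-class threads (FF = SCHED_FIFO, RR = SCHED_RR) by priority.
list_rt_threads() {
    ps -eLo cls,rtprio,psr,comm | awk '$1 == "FF" || $1 == "RR"' | sort -k2 -rn
}
list_rt_threads
```

If this list is long, or anything unexpected sits above your latency-critical threads, the priority contract has drifted.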
7) Common footguns
No baseline control host
You can’t distinguish real gain from placebo.
Over-prioritizing everything
RT priorities are relative scarcity, not badges.
Ignoring IRQ affinity after enabling RT
Threaded IRQs still need deliberate CPU placement.
Mixing latency-critical and batch jobs on the same RT cores
Determinism collapses quickly.
Judging success by average latency only
RT is a tail-latency lever; p50 may barely move.
8) Minimal canary checklist (one-week loop)
- Pick one service and one host class.
- Capture 24h non-RT baseline.
- Switch one canary host to RT kernel.
- Keep affinity identical to baseline first.
- Tune IRQ thread priority/affinity in small steps.
- Compare p99/p999 + starvation signals daily.
- Promote only if tail gains persist through peak windows.
- Keep rollback path simple (kernel fallback + config revert).
References
- Linux kernel documentation (real-time preemption): https://www.kernel.org/doc/html/latest/core-api/real-time/index.html
- Linux kernel locking docs (RT semantics context): https://www.kernel.org/doc/html/latest/locking/index.html
- Threaded IRQ background (threadirqs, IRQ threading concepts): https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
- rt-tests (cyclictest) source/tools: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git/
- Linux Foundation RT wiki (operational context): https://wiki.linuxfoundation.org/realtime/start