Linux Policy Routing Playbook (ip rule + fwmark for Deterministic Egress)
Date: 2026-03-22
Category: knowledge
Domain: systems / linux networking / low-latency operations
Why this matters
If you run multi-homed servers (or multiple uplinks/VLANs), destination-only routing (main table only) is often not enough.
Typical pain:
- order/quote traffic exits the “wrong” NIC,
- reply packets return on a different path,
- latency and jitter become unstable,
- sporadic packet drops appear from asymmetric-path filtering,
- incident response becomes guesswork.
Policy routing lets you make egress decisions from intent signals (source subnet, interface, fwmark, uid range), not only destination prefix.
1) Core mental model
Linux routing has two layers:
- RPDB (Routing Policy Database) via `ip rule`. Rules are checked by priority (smaller number = higher priority).
- Routing tables via `ip route`. A matched rule tells Linux which table to consult.
Default behavior is roughly:
- priority 0 → `local`
- priority 32766 → `main`
- priority 32767 → `default`
Your job is to insert policy rules before the generic main lookup where needed.
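This layering is visible directly on a stock host. A minimal sketch, assuming nothing beyond default iproute2; the subnet, table number 100, and priority 1000 in the added rule are illustrative values, not defaults:

```shell
# Inspect the built-in RPDB; an untouched host shows exactly three rules:
ip rule show
# 0:      from all lookup local
# 32766:  from all lookup main
# 32767:  from all lookup default

# Insert a policy rule ahead of the generic main lookup
# (table 100 and priority 1000 are illustrative, not defaults):
ip rule add priority 1000 from 10.0.1.0/24 lookup 100
```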
2) When to use policy routing
Use it when at least one is true:
- You need deterministic egress per strategy/service class.
- You have multiple ISP/uplink paths with different latency/cost/SLA.
- You separate management traffic from trading or market-data traffic.
- Source-based return symmetry matters (firewalls, ACLs, upstream policies).
- A/B network paths must coexist safely on one host.
If all traffic can share one default route and one operational policy, keep it simple and avoid PBR complexity.
3) Practical design patterns
Pattern A — Source-subnet based egress
Use when each service binds to a dedicated source IP/subnet.
- Rule selector: `from <subnet>`
- Action: `lookup <table_per_path>`
Best for static segmentation (e.g., market-data subnet vs execution subnet).
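A minimal sketch of Pattern A, assuming a hypothetical layout where market data binds to 10.10.1.0/24 and execution to 10.10.2.0/24, with tables `rt_md` and `rt_exec` already defined in `rt_tables`:

```shell
# Source-subnet steering; subnets, priorities, and table names are assumptions.
ip rule add priority 1000 from 10.10.2.0/24 lookup rt_exec   # execution path
ip rule add priority 1100 from 10.10.1.0/24 lookup rt_md     # market-data path
```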
Pattern B — fwmark-based egress classes
Use when policy depends on app intent, not just source IP.
- Mark packets (nftables/iptables/BPF/cgroup)
- Rule selector: `fwmark <mark>/<mask>`
- Action: `lookup <class_table>`
Best for dynamic classes (critical/live, backfill, bulk sync, telemetry).
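One way to wire Pattern B with nftables as the mark producer; the mark value 0x10, destination port 4001, and table name `rt_exec` are all illustrative assumptions:

```shell
# Mark outbound packets by intent (here: TCP to an assumed port 4001),
# then steer that mark class to its own table.
nft add table inet mangle
nft 'add chain inet mangle output { type route hook output priority mangle; }'
nft add rule inet mangle output tcp dport 4001 meta mark set 0x10
ip rule add priority 1200 fwmark 0x10/0xff lookup rt_exec
```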
Pattern C — uidrange-based service steering
Use when each service runs under a dedicated Unix user.
- Rule selector: `uidrange`
- Action: table lookup
Best for minimizing packet-marking complexity in simple hosts.
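Pattern C needs only one rule per service. A sketch assuming a hypothetical daemon running as UID 1500 and a table named `rt_mgmt`:

```shell
# Steer everything generated by one Unix user; no packet marking needed.
ip rule add priority 1300 uidrange 1500-1500 lookup rt_mgmt
```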
4) Minimal production blueprint
Define route tables in `/etc/iproute2/rt_tables`
- Example names: `rt_exec`, `rt_md`, `rt_mgmt`.
Populate each table with:
- required connected routes,
- explicit default route (or explicit non-default policy if intentionally isolated).
Create RPDB rules with explicit priorities
- Keep a visible gap strategy (e.g., 1000, 1100, 1200…) for maintainability.
Add fallback semantics deliberately
- If no policy rule matches, traffic should fall to `main` by design, not by accident.
Persist config through your network manager
- systemd-networkd / NetworkManager / netplan / distro scripts.
- Avoid “works until reboot.”
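The blueprint steps above, sketched end to end. Table IDs, addresses, and the interface name are placeholders chosen for illustration, not recommendations:

```shell
# 1) Name the tables (IDs 100-102 are arbitrary unused values):
cat >> /etc/iproute2/rt_tables <<'EOF'
100 rt_exec
101 rt_md
102 rt_mgmt
EOF

# 2) Populate each table: connected route plus an explicit default.
ip route add 10.10.2.0/24 dev eth1 src 10.10.2.5 table rt_exec
ip route add default via 10.10.2.1 dev eth1 table rt_exec

# 3) Add the rule last, using a gapped priority for maintainability.
ip rule add priority 1000 from 10.10.2.0/24 lookup rt_exec
```

Remember this is the transient form; the same routes and rules must also live in your persistent network configuration.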
5) Non-negotiable guardrails
Rule priority hygiene
- Overlapping selectors without explicit priority intent cause shadowing bugs.
- Always document why each priority exists.
Table completeness checks
- Missing default routes in custom tables can blackhole marked traffic.
Asymmetric path awareness
- Reverse-path filtering can drop valid packets in asymmetric designs.
- Validate `rp_filter` posture per interface for your threat model.
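A quick way to audit that posture; `eth1` is a placeholder. Note the kernel applies the maximum of the `all` setting and the per-interface setting:

```shell
# 0 = off, 1 = strict, 2 = loose; the effective value is
# max(conf.all.rp_filter, conf.<iface>.rp_filter).
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth1.rp_filter
# Relax only interfaces that legitimately see asymmetric paths:
sysctl -w net.ipv4.conf.eth1.rp_filter=2
```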
Conntrack/mark consistency
- If you rely on marks, ensure mark lifecycle is consistent across request/reply and NAT boundaries.
Atomic rollout
- Stage table routes first, then rules, then mark producers.
- Reverse order on rollback.
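Rollback as a sketch, reversing the rollout order (mark producers first, then rules, then table routes); the chain, priority, and table names here are hypothetical:

```shell
nft flush chain inet mangle output   # 1) stop producing marks
ip rule del priority 1200            # 2) remove the fwmark rule
ip route flush table rt_exec         # 3) finally clear the custom table
```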
6) Validation runbook (must pass before cutover)
A. Structural checks
- `ip rule show`
- `ip route show table main`
- `ip route show table <custom>`
Confirm:
- expected rule order,
- no accidental duplicate/shadow rules,
- all referenced tables exist and are populated.
B. Path simulation checks
Use route queries that include policy context (source/mark/interface) to verify expected nexthop resolution before real traffic switch.
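`ip route get` accepts policy context, so nexthop resolution can be verified without sending real traffic. The destination, source, and mark below are placeholders (the source must be an address local to the host):

```shell
# Which nexthop would a packet from this source, with this mark, take?
ip route get 192.0.2.10 from 10.10.2.5 mark 0x10
# Same destination without policy context, to confirm the fallback path:
ip route get 192.0.2.10
```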
C. Live traffic checks
- Interface packet counters per class
- Flow logs/pcap sampling for ingress/egress symmetry
- Error counters (drops, martians, invalid states)
D. Failure-injection checks
- Bring down one uplink and confirm fail behavior.
- Ensure critical class falls back as designed.
- Ensure non-critical class either degrades or blocks as designed.
7) Common failure modes
Failure mode 1: Rule shadowing by broad early match
Symptom:
- Specific class never hits intended table.
Fix:
- Move broad rules to a lower priority (numerically larger).
- Add clear specificity/priority policy.
Failure mode 2: Marked traffic blackholes
Symptom:
- Only marked packets time out.
Fix:
- Verify custom table has required connected + default routes.
- Verify mark mask/value parity across producer and RPDB rule.
Failure mode 3: Intermittent one-way connectivity
Symptom:
- SYN leaves, no stable response path; sporadic drop behavior.
Fix:
- Validate asymmetric routing assumptions.
- Recheck reverse-path filter posture and upstream return routing.
Failure mode 4: Reboot regression
Symptom:
- Everything worked manually but failed after reboot.
Fix:
- Move all rules/routes/mark logic into persistent network config or boot orchestration.
8) Operational metrics worth tracking
- per-interface egress bytes/pps by traffic class
- policy-hit distribution (which rules are actually used)
- class-level p50/p95/p99 latency and retransmits
- drop counters by reason (rp_filter, conntrack invalid, firewall)
- failover convergence time when uplink state changes
If you cannot observe policy-hit and class latency together, policy routing incidents will remain opaque.
9) Change-management checklist
Before change:
- snapshot current `ip rule` output and relevant tables
- prepare rollback script
- define success metrics and timeout budget
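A minimal snapshot script for the "before change" step; the custom table names are placeholders for whatever your host actually defines:

```shell
#!/bin/sh
# Capture RPDB and per-table routes so rollback can diff against them.
ip rule show > /tmp/rpdb.before
for t in main rt_exec rt_md rt_mgmt; do
  ip route show table "$t" > "/tmp/route.$t.before"
done
```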
During change:
- apply tables → rules → marks
- verify structural + simulation checks each step
After change:
- run synthetic probes per class
- monitor for at least one volatility window (open, close, scheduled data events)
10) Recommended default policy
For most multi-homed low-latency hosts:
- Start with source-based segmentation (simpler, auditable).
- Introduce fwmark classes only for traffic needing dynamic steering.
- Keep rule count small and intention-revealing.
- Treat RPDB like code: versioned, reviewed, and testable.
Deterministic egress is not a “network nice-to-have.” It is often the difference between stable execution and random latency incidents.
References
- ip-rule(8) manual (RPDB semantics and default rules): https://man7.org/linux/man-pages/man8/ip-rule.8.html
- ip-route(8) manual (routing table operations and route get): https://man7.org/linux/man-pages/man8/ip-route.8.html