Linux Policy Routing Playbook (ip rule + fwmark for Deterministic Egress)
Date: 2026-03-22
Category: knowledge
Domain: systems / linux networking / low-latency operations
Why this matters
If you run multi-homed servers (or multiple uplinks/VLANs), destination-only routing (main table only) is often not enough.
Typical pain:
- order/quote traffic exits the “wrong” NIC,
- reply packets return on a different path,
- latency and jitter become unstable,
- sporadic packet drops appear from asymmetric-path filtering,
- incident response becomes guesswork.
Policy routing lets you make egress decisions from intent signals (source subnet, interface, fwmark, uid range), not only destination prefix.
1) Core mental model
Linux routing has two layers:
- RPDB (Routing Policy Database) via `ip rule`. Rules are checked by priority (smaller number = higher priority).
- Routing tables via `ip route`. A matched rule tells Linux which table to consult.
Default behavior is roughly:
- priority 0 → `local`
- priority 32766 → `main`
- priority 32767 → `default`
Your job is to insert policy rules before the generic main lookup where needed.
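This layering is visible directly on a stock host. A minimal sketch, assuming nothing beyond default iproute2; the subnet, table number 100, and priority 1000 in the added rule are illustrative values, not defaults:

```shell
# Inspect the built-in RPDB; an untouched host shows exactly three rules:
ip rule show
# 0:      from all lookup local
# 32766:  from all lookup main
# 32767:  from all lookup default

# Insert a policy rule ahead of the generic main lookup
# (table 100 and priority 1000 are illustrative, not defaults):
ip rule add priority 1000 from 10.0.1.0/24 lookup 100
```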
2) When to use policy routing
Use it when at least one is true:
- You need deterministic egress per strategy/service class.
- You have multiple ISP/uplink paths with different latency/cost/SLA.
- You separate management traffic from trading or market-data traffic.
- Source-based return symmetry matters (firewalls, ACLs, upstream policies).
- A/B network paths must coexist safely on one host.
If all traffic can share one default route and one operational policy, keep it simple and avoid PBR complexity.
3) Practical design patterns
Pattern A — Source-subnet based egress
Use when each service binds to a dedicated source IP/subnet.
- Rule selector: `from <subnet>`
- Action: `lookup <table_per_path>`
Best for static segmentation (e.g., market-data subnet vs execution subnet).
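A minimal sketch of Pattern A, assuming a hypothetical layout where market data binds to 10.10.1.0/24 and execution to 10.10.2.0/24, with tables `rt_md` and `rt_exec` already defined in `rt_tables`:

```shell
# Source-subnet steering; subnets, priorities, and table names are assumptions.
ip rule add priority 1000 from 10.10.2.0/24 lookup rt_exec   # execution path
ip rule add priority 1100 from 10.10.1.0/24 lookup rt_md     # market-data path
```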
Pattern B — fwmark-based egress classes
Use when policy depends on app intent, not just source IP.
- Mark packets (nftables/iptables/BPF/cgroup)
- Rule selector: `fwmark <mark>/<mask>`
- Action: `lookup <class_table>`
Best for dynamic classes (critical/live, backfill, bulk sync, telemetry).
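One way to wire Pattern B with nftables as the mark producer; the mark value 0x10, destination port 4001, and table name `rt_exec` are all illustrative assumptions:

```shell
# Mark outbound packets by intent (here: TCP to an assumed port 4001),
# then steer that mark class to its own table.
nft add table inet mangle
nft 'add chain inet mangle output { type route hook output priority mangle; }'
nft add rule inet mangle output tcp dport 4001 meta mark set 0x10
ip rule add priority 1200 fwmark 0x10/0xff lookup rt_exec
```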
Pattern C — uidrange-based service steering
Use when each service runs under a dedicated Unix user.
- Rule selector: `uidrange`
- Action: table lookup
Best for minimizing packet-marking complexity in simple hosts.
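Pattern C needs only one rule per service. A sketch assuming a hypothetical daemon running as UID 1500 and a table named `rt_mgmt`:

```shell
# Steer everything generated by one Unix user; no packet marking needed.
ip rule add priority 1300 uidrange 1500-1500 lookup rt_mgmt
```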
4) Minimal production blueprint
Define route tables in `/etc/iproute2/rt_tables`
- Example names: `rt_exec`, `rt_md`, `rt_mgmt`.
Populate each table with:
- required connected routes,
- explicit default route (or explicit non-default policy if intentionally isolated).
Create RPDB rules with explicit priorities
- Keep a visible gap strategy (e.g., 1000, 1100, 1200…) for maintainability.
Add fallback semantics deliberately
- If no policy rule matches, traffic should fall to `main` by design, not by accident.
Persist config through your network manager
- systemd-networkd / NetworkManager / netplan / distro scripts.
- Avoid “works until reboot.”
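The blueprint steps above, sketched end to end. Table IDs, addresses, and the interface name are placeholders chosen for illustration, not recommendations:

```shell
# 1) Name the tables (IDs 100-102 are arbitrary unused values):
cat >> /etc/iproute2/rt_tables <<'EOF'
100 rt_exec
101 rt_md
102 rt_mgmt
EOF

# 2) Populate each table: connected route plus an explicit default.
ip route add 10.10.2.0/24 dev eth1 src 10.10.2.5 table rt_exec
ip route add default via 10.10.2.1 dev eth1 table rt_exec

# 3) Add the rule last, using a gapped priority for maintainability.
ip rule add priority 1000 from 10.10.2.0/24 lookup rt_exec
```

Remember this is the transient form; the same routes and rules must also live in your persistent network configuration.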
5) Non-negotiable guardrails
Rule priority hygiene
- Overlapping selectors without explicit priority intent cause shadowing bugs.
- Always document why each priority exists.
Table completeness checks
- Missing default routes in custom tables can blackhole marked traffic.
Asymmetric path awareness
- Reverse-path filtering can drop valid packets in asymmetric designs.
- Validate `rp_filter` posture per interface for your threat model.
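A quick way to audit that posture; `eth1` is a placeholder. Note the kernel applies the maximum of the `all` setting and the per-interface setting:

```shell
# 0 = off, 1 = strict, 2 = loose; the effective value is
# max(conf.all.rp_filter, conf.<iface>.rp_filter).
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth1.rp_filter
# Relax only interfaces that legitimately see asymmetric paths:
sysctl -w net.ipv4.conf.eth1.rp_filter=2
```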
Conntrack/mark consistency
- If you rely on marks, ensure mark lifecycle is consistent across request/reply and NAT boundaries.
Atomic rollout
- Stage table routes first, then rules, then mark producers.
- Reverse order on rollback.
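Rollback as a sketch, reversing the rollout order (mark producers first, then rules, then table routes); the chain, priority, and table names here are hypothetical:

```shell
nft flush chain inet mangle output   # 1) stop producing marks
ip rule del priority 1200            # 2) remove the fwmark rule
ip route flush table rt_exec         # 3) finally clear the custom table
```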
6) Validation runbook (must pass before cutover)
A. Structural checks
- `ip rule show`
- `ip route show table main`
- `ip route show table <custom>`
Confirm:
- expected rule order,
- no accidental duplicate/shadow rules,
- all referenced tables exist and are populated.
B. Path simulation checks
Use route queries that include policy context (source/mark/interface) to verify expected nexthop resolution before real traffic switch.
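`ip route get` accepts policy context, so nexthop resolution can be verified without sending real traffic. The destination, source, and mark below are placeholders (the source must be an address local to the host):

```shell
# Which nexthop would a packet from this source, with this mark, take?
ip route get 192.0.2.10 from 10.10.2.5 mark 0x10
# Same destination without policy context, to confirm the fallback path:
ip route get 192.0.2.10
```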
C. Live traffic checks
- Interface packet counters per class
- Flow logs/pcap sampling for ingress/egress symmetry
- Error counters (drops, martians, invalid states)
D. Failure-injection checks
- Bring down one uplink and confirm fail behavior.
- Ensure critical class falls back as designed.
- Ensure non-critical class either degrades or blocks as designed.
7) Common failure modes
Failure mode 1: Rule shadowing by broad early match
Symptom:
- Specific class never hits intended table.
Fix:
- Move broad rules to a lower priority (numerically larger).
- Add clear specificity/priority policy.
Failure mode 2: Marked traffic blackholes
Symptom:
- Only marked packets time out.
Fix:
- Verify custom table has required connected + default routes.
- Verify mark mask/value parity across producer and RPDB rule.
Failure mode 3: Intermittent one-way connectivity
Symptom:
- SYN leaves, no stable response path; sporadic drop behavior.
Fix:
- Validate asymmetric routing assumptions.
- Recheck reverse-path filter posture and upstream return routing.
Failure mode 4: Reboot regression
Symptom:
- Everything worked manually but failed after reboot.
Fix:
- Move all rules/routes/mark logic into persistent network config or boot orchestration.
8) Operational metrics worth tracking
- per-interface egress bytes/pps by traffic class
- policy-hit distribution (which rules are actually used)
- class-level p50/p95/p99 latency and retransmits
- drop counters by reason (rp_filter, conntrack invalid, firewall)
- failover convergence time when uplink state changes
If you cannot observe policy-hit and class latency together, policy routing incidents will remain opaque.
9) Change-management checklist
Before change:
- snapshot current `ip rule` output and relevant tables
- prepare rollback script
- define success metrics and timeout budget
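A minimal snapshot script for the "before change" step; the custom table names are placeholders for whatever your host actually defines:

```shell
#!/bin/sh
# Capture RPDB and per-table routes so rollback can diff against them.
ip rule show > /tmp/rpdb.before
for t in main rt_exec rt_md rt_mgmt; do
  ip route show table "$t" > "/tmp/route.$t.before"
done
```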
During change:
- apply tables → rules → marks
- verify structural + simulation checks each step
After change:
- run synthetic probes per class
- monitor for at least one volatility window (open, close, scheduled data events)
10) Recommended default policy
For most multi-homed low-latency hosts:
- Start with source-based segmentation (simpler, auditable).
- Introduce fwmark classes only for traffic needing dynamic steering.
- Keep rule count small and intention-revealing.
- Treat RPDB like code: versioned, reviewed, and testable.
Deterministic egress is not a “network nice-to-have.” It is often the difference between stable execution and random latency incidents.
References
- ip-rule(8) manual (RPDB semantics and default rules): https://man7.org/linux/man-pages/man8/ip-rule.8.html
- ip-route(8) manual (routing table operations and route get): https://man7.org/linux/man-pages/man8/ip-route.8.html