RPKI-First BGP Routing Hygiene Playbook (ROA/ROV/RTR + Route-Leak Guardrails)

2026-03-13 · systems

Date: 2026-03-13
Category: knowledge
Scope: Practical deployment pattern for reducing origin hijacks without causing self-inflicted routing outages.


1) Why this matters

Classic BGP trust is too permissive: a wrong origin announcement can propagate globally.

RPKI-based origin validation (ROV) gives operators a cryptographic way to check whether the origin AS is authorized for a prefix. Done well, this cuts a major class of hijacks. Done carelessly, it can create self-inflicted outages via bad ROAs.

This playbook is about deploying safely and operationally, not just enabling a checkbox.


2) What RPKI/ROV actually guarantees (and what it does not)

What it gives you

What it does not give you

Mental model: ROV is a high-value baseline control, not the whole routing-security stack.


3) Validation-state semantics you should internalize

Per origin validation logic (RFC 6811 family):
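
Each route is classified as Valid, Invalid, or NotFound against the set of Validated ROA Payloads (VRPs). The following is a minimal sketch of that classification, not any validator's actual code. The `VRP` record and the `validate_origin` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VRP:
    """Hypothetical record for one Validated ROA Payload."""
    prefix: str       # ROA prefix, e.g. "192.0.2.0/24"
    max_length: int   # longest prefix length the ROA authorizes
    asn: int          # authorized origin AS


def validate_origin(route_prefix: str, origin_asn: int, vrps: list[VRP]) -> str:
    """Classify one route as "Valid", "Invalid", or "NotFound"."""
    route = ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        roa_net = ip_network(vrp.prefix)
        # A VRP covers the route if the route lies inside the ROA prefix.
        if roa_net.version != route.version or not route.subnet_of(roa_net):
            continue
        covered = True
        # A covering VRP matches if the origin agrees and the length is allowed.
        if vrp.asn == origin_asn and route.prefixlen <= vrp.max_length:
            return "Valid"
    # Covered but never matched is Invalid; no covering VRP at all is NotFound.
    return "Invalid" if covered else "NotFound"
```

The asymmetry is the part to remember: a single covering VRP turns every non-matching announcement inside that prefix from NotFound into Invalid.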

Operationally, this means:


4) ROA authoring rules that prevent most self-outages

4.1 Prefer minimal ROAs

Authorizations should match actually-originated prefixes as tightly as possible.

4.2 Be conservative with maxLength

maxLength is operationally useful, but over-broad values expand forged-origin subprefix attack surface and increase error blast radius.
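
To make the attack surface concrete, the sketch below lists every prefix a ROA authorizes that the operator does not actually announce. The function name `unannounced_authorized` and the example prefixes are assumptions for illustration. It enumerates subnets, so it only suits small length ranges.

```python
from ipaddress import ip_network


def unannounced_authorized(roa_prefix: str, max_length: int,
                           announced: set[str]) -> list[str]:
    """Return authorized prefixes that are not announced.

    Each returned prefix would validate as Valid if an attacker announced it
    with the authorized origin AS forged at the end of the AS_PATH.
    """
    net = ip_network(roa_prefix)
    if not net.prefixlen <= max_length <= net.max_prefixlen:
        raise ValueError("max_length must lie between the prefix length and the address size")
    exposed = []
    for length in range(net.prefixlen, max_length + 1):
        for sub in net.subnets(new_prefix=length):
            if str(sub) not in announced:
                exposed.append(str(sub))
    return exposed


# A /24 that is only ever announced as a /24, but whose ROA allows up to /26.
loose = unannounced_authorized("203.0.113.0/24", 26, {"203.0.113.0/24"})
tight = unannounced_authorized("203.0.113.0/24", 24, {"203.0.113.0/24"})
print(len(loose), len(tight))
```

With maxLength equal to the announced length, the list is empty: there is no more-specific prefix an attacker can announce that still validates.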

4.3 Cover all legitimate origin ASNs

If a prefix can be originated by multiple ASNs (multi-homing, migrations, mitigation providers), issue ROAs for all legitimate origin cases.

4.4 Supernet/subnet sequencing discipline

Before publishing a supernet ROA, verify sub-allocations announced by other ASNs are correctly represented; otherwise you can accidentally make legitimate downstream announcements Invalid.
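
A pre-publication check along these lines can catch the problem before the ROA goes live. This is a sketch under assumptions: VRPs are modeled as `(prefix, max_length, asn)` tuples, announcements as `(prefix, origin_asn)` tuples taken from whatever route view you have, and `state` and `newly_invalid` are hypothetical helper names.

```python
from ipaddress import ip_network


def state(route: str, origin: int, vrps: list[tuple[str, int, int]]) -> str:
    """RFC 6811-style state of one route against (prefix, max_length, asn) tuples."""
    net = ip_network(route)
    covering = [
        (max_len, asn) for prefix, max_len, asn in vrps
        if ip_network(prefix).version == net.version
        and net.subnet_of(ip_network(prefix))
    ]
    if not covering:
        return "NotFound"
    matched = any(asn == origin and net.prefixlen <= max_len
                  for max_len, asn in covering)
    return "Valid" if matched else "Invalid"


def newly_invalid(announcements, current_vrps, proposed_vrps):
    """List announcements that become Invalid only because of the proposed ROAs."""
    broken = []
    for route, origin in announcements:
        before = state(route, origin, current_vrps)
        after = state(route, origin, current_vrps + proposed_vrps)
        if after == "Invalid" and before != "Invalid":
            broken.append((route, origin, before))
    return broken


# A supernet ROA published before the downstream's sub-allocation is covered.
seen = [("2001:db8::/32", 64500), ("2001:db8:1000::/36", 64511)]
supernet_only = [("2001:db8::/32", 32, 64500)]
print(newly_invalid(seen, [], supernet_only))
```

An empty result is the condition for publishing. In the example, adding a ROA for `2001:db8:1000::/36` with origin 64511 to the proposed set clears the list.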


5) Safe policy rollout ladder (do this in stages)

Stage 0: Instrument only

Stage 1: Preferential routing

Stage 2: Controlled rejection

Stage 3: Broad invalid rejection

This mirrors BCP guidance: keep reachability safe during partial deployment, but move toward dropping Invalid once confidence is established.
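
One way to read the ladder is as a single import-policy function whose behavior changes only for Invalid routes. The sketch below is an interpretation of the stage names, not a vendor policy: the action strings, the local-preference penalty of 50, and the `reject_sessions` allow-list are all assumptions.

```python
from enum import IntEnum


class Stage(IntEnum):
    INSTRUMENT = 0         # tag and log, change nothing
    PREFER = 1             # depreference Invalid, still accept
    CONTROLLED_REJECT = 2  # reject Invalid on selected sessions only
    BROAD_REJECT = 3       # reject Invalid everywhere


INVALID_PENALTY = -50  # assumed local-preference delta, tune per network


def import_action(stage: Stage, validation_state: str, session: str,
                  reject_sessions: frozenset = frozenset()) -> tuple[str, int]:
    """Return (action, local_pref_delta) for one received route."""
    if validation_state != "Invalid":
        # Valid and NotFound are accepted unchanged at every stage.
        return ("accept", 0)
    if stage == Stage.INSTRUMENT:
        return ("accept-and-log", 0)
    if stage == Stage.PREFER:
        return ("accept-and-log", INVALID_PENALTY)
    if stage == Stage.CONTROLLED_REJECT:
        if session in reject_sessions:
            return ("reject", 0)
        return ("accept-and-log", INVALID_PENALTY)
    return ("reject", 0)
```

Keeping the stage as an explicit parameter makes rollback a one-value change rather than a policy rewrite.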


6) Router/validator architecture that survives bad days

6.1 Multiple caches, not one

Each router should maintain RTR sessions with more than one trusted cache/validator, so that losing a single cache does not remove its validation data.

6.2 Place caches close to control plane

Reduce bootstrap and reachability dependencies; avoid circular dependencies where routing must converge before the router can reach validation data.

6.3 Protect RTR transport

Routers trust cache output; secure and harden that channel and avoid insecure inter-AS transport for router-cache sessions.

6.4 Keep caches fresh and observable

Track serial lag, cache freshness, and last successful update times. Stale-but-quiet is dangerous.
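
A freshness check might look like the sketch below. The `RtrSessionStatus` fields and both thresholds are assumptions. How you collect the values depends on your validator and router telemetry, which is not shown. Serials are compared per session only (they are not comparable across caches), and serial wraparound is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass
class RtrSessionStatus:
    """Hypothetical snapshot of one router-to-cache session."""
    cache: str
    cache_serial: int    # latest serial the cache has published
    router_serial: int   # serial the router last synchronized to
    last_update: float   # epoch seconds of the cache's last successful refresh


def freshness_alerts(sessions: list[RtrSessionStatus], now: float,
                     max_age_s: float = 3600.0,
                     max_serial_lag: int = 2) -> list[str]:
    """Return human-readable alerts for stale or lagging sessions."""
    alerts = []
    for s in sessions:
        age = now - s.last_update
        if age > max_age_s:
            alerts.append(f"{s.cache}: no successful update for {age:.0f}s")
        lag = s.cache_serial - s.router_serial
        if lag > max_serial_lag:
            alerts.append(f"{s.cache}: router is {lag} serials behind")
    # A lone stale cache is degraded redundancy; all stale is a different severity.
    if sessions and all(now - s.last_update > max_age_s for s in sessions):
        alerts.append("all caches stale")
    return alerts
```

The point of the final check is to page on silence: a cache that stops updating produces no errors on its own.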


7) Two commonly missed implementation details

7.1 "Set state, don't auto-act"

Validation state should be computed broadly, but policy actions must be explicit operator choice. Implicit vendor defaults are an outage risk.

7.2 Validate egress with the effective origin AS

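On export, policy and AS_PATH manipulation (for example private-AS stripping or AS migration rewrites) can change the origin AS that the neighbor actually sees. Egress validation should therefore use the effective origin as it appears after export policy has been applied, not the origin stored in the local table.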
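
The effect is easy to see with private-AS stripping. The sketch below assumes a deliberately simplified model: the AS_PATH is a flat AS_SEQUENCE given as a list of integers, stripping removes every private ASN, and the local AS is prepended on export. Real implementations differ in when and how they strip. `effective_egress_origin` is a hypothetical helper name.

```python
# Private-use ASN ranges (16-bit and 32-bit).
PRIVATE_RANGES = ((64512, 65534), (4200000000, 4294967294))


def is_private(asn: int) -> bool:
    return any(low <= asn <= high for low, high in PRIVATE_RANGES)


def effective_egress_origin(as_path: list[int], local_asn: int,
                            strip_private: bool) -> int:
    """Return the origin AS a neighbor will see after export processing.

    as_path is the path as held locally, leftmost AS first; an empty list
    means the route is originated locally.
    """
    path = [asn for asn in as_path if not (strip_private and is_private(asn))]
    exported = [local_asn] + path  # the local AS is prepended on eBGP export
    return exported[-1]            # origin is the rightmost AS


# A customer using private AS 65001 behind local AS 64500.
print(effective_egress_origin([65001], 64500, strip_private=True))
```

In that example the neighbor sees 64500 as the origin, so the ROA for the customer prefix has to authorize 64500, not 65001.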


8) ROV is not route-leak defense: add BGP roles/OTC

ROV primarily addresses origin legitimacy.

Route leaks (relationship-violating propagation) require complementary controls. BGP Roles and OTC signaling (RFC 9234) add in-band relationship-aware safeguards for leak prevention/detection.

Practical stack:

  1. RPKI/ROV for origin hygiene.
  2. BGP Roles/OTC for propagation hygiene.
  3. Classic import/export filters and IRR/RPKI sanity checks.
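
For the second layer, the Only-to-Customer (OTC) handling in RFC 9234 reduces to a few ingress and egress rules keyed on the neighbor's role. The sketch below is a simplified model of those rules, not a BGP implementation: roles are plain strings naming the neighbor's role, and `otc_ingress` and `otc_egress` are hypothetical names.

```python
from typing import Optional

UPSTREAM_OR_LATERAL = ("provider", "peer", "rs")


def otc_ingress(neighbor_role: str, neighbor_asn: int,
                otc: Optional[int]) -> tuple[bool, Optional[int]]:
    """Return (is_leak, otc_after_ingress) for a received route."""
    if otc is not None:
        # OTC arriving from below means the route already went "down" once.
        if neighbor_role in ("customer", "rs-client"):
            return True, otc
        # From a peer, OTC must have been set by that peer itself.
        if neighbor_role == "peer" and otc != neighbor_asn:
            return True, otc
        return False, otc
    if neighbor_role in UPSTREAM_OR_LATERAL:
        # Mark the route on behalf of a neighbor that did not set OTC.
        return False, neighbor_asn
    return False, None


def otc_egress(neighbor_role: str, local_asn: int,
               otc: Optional[int]) -> tuple[bool, Optional[int]]:
    """Return (may_advertise, otc_to_send) for a route about to be exported."""
    if otc is not None and neighbor_role in UPSTREAM_OR_LATERAL:
        return False, otc  # an OTC-marked route must not go up or sideways
    if otc is None and neighbor_role in ("customer", "peer", "rs-client"):
        return True, local_asn
    return True, otc
```

Note what this does not check: an OTC-clean route can still carry a wrong origin, which is why ROV stays in the stack alongside it.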

9) Incident playbook for "Why did this become Invalid?"

  1. Identify affected prefix + observed origin ASN.
  2. Compare current VRP/ROA set versus previous snapshot.
  3. Check maxLength mismatch first (very common).
  4. Check AS migration/private-AS stripping/policy rewrites (effective-origin mismatch).
  5. Validate cache freshness and RTR session health.
  6. Apply temporary exception only with expiry and ticketed follow-up.
  7. Post-incident: fix ROA model, add pre-change validation tests.
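
Steps 2 and 3 can be scripted if you keep periodic VRP snapshots. The sketch below assumes each snapshot is a set of `(prefix, max_length, asn)` tuples already exported from your validator. The export step and the name `covering_vrp_diff` are assumptions.

```python
from ipaddress import ip_network


def covering_vrp_diff(prefix: str,
                      previous: set[tuple[str, int, int]],
                      current: set[tuple[str, int, int]]) -> dict:
    """Diff only the VRPs that cover the affected prefix."""
    net = ip_network(prefix)

    def covers(vrp: tuple[str, int, int]) -> bool:
        roa_net = ip_network(vrp[0])
        return roa_net.version == net.version and net.subnet_of(roa_net)

    before = {v for v in previous if covers(v)}
    after = {v for v in current if covers(v)}
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Example: maxLength was tightened from 48 to 32 while a /48 is still announced.
previous = {("2001:db8::/32", 48, 64500)}
current = {("2001:db8::/32", 32, 64500), ("2001:db8:ffff::/48", 48, 64999)}
print(covering_vrp_diff("2001:db8:a::/48", previous, current))
```

A VRP that appears in both "removed" and "added" with the same prefix and ASN but a different maxLength is the maxLength mismatch from step 3.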

10) Operator checklist (short form)


References


One-line takeaway

Treat RPKI as a production control system: precise ROAs + staged policy + resilient validator architecture + leak-specific controls beats checkbox deployment every time.