BFD Fast Failure Detection — Production Deployment Playbook
Date: 2026-03-29
Category: knowledge
Audience: Network/SRE engineers running BGP/OSPF/IS-IS at scale
1) Why BFD deserves its own playbook
BFD is one of those tools that can be either:
- your fastest convergence win, or
- a flap amplifier that turns micro-jitter into control-plane churn.
The difference is almost never “BFD enabled vs disabled.” It is usually about timer discipline, failure-domain boundaries, and protocol coupling choices.
2) What BFD is (and is not)
Per RFC 5880, BFD is a low-latency liveness protocol for the forwarding path between two systems.
Important constraints from the RFC set:
- BFD is an OAM liveness signal, not a replacement for routing protocol semantics (RFC 5882).
- It is intended for network service paths (router-router, service appliances, LSP/circuit endpoints), not arbitrary internet app-to-app probing (RFC 5881/5883 applicability text).
- BFD has no discovery; clients/protocols create sessions.
Operational translation:
- Treat BFD as a fast advisory input to routing decisions.
- Keep policy and route selection in the routing protocols.
3) Core timer math you should memorize
In Asynchronous mode (the default mode most deployments actually use), local detection time is driven by remote-advertised values and negotiated intervals (RFC 5880 §6.8.4):
- Detection Time ≈ remote Detect Multiplier × negotiated remote transmit interval
- The negotiated transmit interval is the greater of the local Required Min RX and the remote Desired Min TX.
Practical shorthand:
- multiplier 3 × 300 ms ⇒ ~900 ms detect
- multiplier 3 × 100 ms ⇒ ~300 ms detect
- multiplier 5 × 50 ms ⇒ ~250 ms detect
If you only optimize interval without multiplier discipline, you can still end up unstable.
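The negotiation above is easy to get wrong in a spreadsheet, so here is the math as a small sketch (function name is mine, not from any library):

```python
def detection_time_ms(remote_detect_mult: int,
                      local_required_min_rx_ms: float,
                      remote_desired_min_tx_ms: float) -> float:
    """Local detection time in Asynchronous mode, per RFC 5880 §6.8.4.

    The peer transmits at the negotiated interval: the greater of our
    Required Min RX and its Desired Min TX. We declare the session Down
    after remote-Detect-Mult consecutive intervals with no packet.
    """
    negotiated_tx_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * negotiated_tx_ms

# Symmetric timers reproduce the shorthand table:
assert detection_time_ms(3, 300, 300) == 900
assert detection_time_ms(3, 100, 100) == 300
assert detection_time_ms(5, 50, 50) == 250
# Asymmetric timers: our slow RX floor (150 ms) wins over the peer's 100 ms TX wish.
assert detection_time_ms(3, 150, 100) == 450
```

Note that each direction negotiates independently, so the two ends of a session can have different detection times.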
4) Single-hop vs multihop: don’t blur them
Single-hop BFD (RFC 5881)
- UDP destination port: 3784 (control)
- Echo packets use UDP 3785
- TTL/Hop-Limit hardening: transmit 255; receiver discards if not 255 (especially when auth is absent)
Multihop BFD (RFC 5883)
- UDP destination port: 4784
- Echo function MUST NOT be used over multihop
- Authentication is strongly encouraged (spoofing risk grows with hop count)
Design rule:
- Keep fast/aggressive profiles primarily for single-hop directly-connected adjacencies.
- Use calmer, explicitly secured profiles for multihop sessions.
5) Mode choices in real networks
RFC 5880 modes:
- Asynchronous mode: periodic control packets; most common and simplest to operate.
- Demand mode: less periodic traffic after session establishment, but less commonly implemented/used in mainstream stacks.
- Echo function: can improve forwarding-path-focused detection but has deployment caveats.
Real-world caveat example:
- Cumulus Linux documentation explicitly notes that BFD demand mode and echo mode are not supported on that platform.
Bottom line: standardize on Asynchronous mode unless you have clear platform support and lab evidence for alternatives.
6) Recommended baseline profiles (starting points)
These are pragmatic defaults to start from, then tune per link quality and CPU budget.
Profile A — Conservative (WAN / noisy domains)
- tx/rx: 300 ms
- detect-multiplier: 3
- expected detection: ~900 ms
Profile B — Balanced (intra-DC leaf/spine)
- tx/rx: 100 ms
- detect-multiplier: 3
- expected detection: ~300 ms
Profile C — Fast but disciplined (high-quality links only)
- tx/rx: 50 ms
- detect-multiplier: 5
- expected detection: ~250 ms
Why C uses multiplier 5: you get fast detection without turning every one or two missed packets into a topology event.
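As a sanity check, the three profiles and their expected detection times can be expressed as data (profile names are illustrative):

```python
# The three baseline profiles; intervals in milliseconds, tx == rx.
PROFILES = {
    "A-conservative": {"interval_ms": 300, "detect_mult": 3},
    "B-balanced":     {"interval_ms": 100, "detect_mult": 3},
    "C-fast":         {"interval_ms": 50,  "detect_mult": 5},
}

def expected_detection_ms(profile: dict) -> int:
    # With symmetric tx/rx on both ends, the negotiated interval
    # equals the configured interval.
    return profile["interval_ms"] * profile["detect_mult"]
```

Keeping profiles as named data like this also makes ring rollouts auditable: an adjacency is always on exactly one named profile, never on ad-hoc timers.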
7) Interop reality: Common Intervals matter
RFC 7419 highlights a practical interop issue: hardware implementations may support only a subset of timer values.
Defined Common Intervals:
- 3.3 ms, 10 ms, 20 ms, 50 ms, 100 ms, 1 s
Recommendation:
- Prefer these values when standardizing profiles across mixed-vendor estates.
- Avoid “cute” custom intervals (e.g., 37 ms) unless every platform is validated.
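A simple way to enforce this in tooling is to round any requested interval up to the nearest Common Interval, so the result is never faster than a constrained hardware implementation can honor (a sketch; the function is mine):

```python
# RFC 7419 Common Intervals, in milliseconds.
COMMON_INTERVALS_MS = (3.3, 10, 20, 50, 100, 1000)

def snap_to_common_interval(desired_ms: float) -> float:
    """Round a desired interval UP to the nearest Common Interval.
    Requests slower than 1 s clamp to the 1 s maximum."""
    for iv in COMMON_INTERVALS_MS:
        if iv >= desired_ms:
            return iv
    return COMMON_INTERVALS_MS[-1]
```

For example, a "cute" 37 ms request snaps to 50 ms, and 100 ms passes through unchanged.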
8) Coupling with routing protocols: avoid double-trigger chaos
From RFC 5882 guidance:
- a single BFD session per data path should usually be shared by multiple protocol clients, and
- BFD should remain advisory, with each client deciding what action to take.
Operational pitfalls:
- BFD + aggressive IGP hello dead timers both tuned ultra-fast → redundant churn.
- BFD Down immediately tearing BGP while GR/GSHUT logic is also active → oscillation.
- Multiple protocol clients each instantiating independent BFD sessions for same path → unnecessary load.
Do this instead:
- One BFD policy owner per adjacency type.
- Explicit precedence between BFD events and protocol graceful behaviors.
- Maintenance workflows that de-risk BFD-triggered spikes (e.g., temporary relaxed profile/admin-down where needed).
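One way to make the precedence explicit is to encode it as a tiny policy function. This sketch is entirely illustrative (the names and the specific policy are assumptions, not a standard API); it reflects the shared-fate case where BFD rides the control plane, so a BFD Down during a graceful restart does not necessarily mean forwarding has failed:

```python
def on_bfd_down(peer: str, graceful_restart_in_progress: bool) -> str:
    """Hypothetical precedence policy between a BFD Down event and BGP
    graceful restart, for platforms where the BFD session shares fate
    with the restarting control plane."""
    if graceful_restart_in_progress:
        # Don't let a control-plane restart be escalated by BFD into a
        # hard teardown of still-working forwarding state.
        return "hold-routes"
    # Outside a restart window, BFD Down is treated as a path failure.
    return "tear-down-session"
```

If your BFD sessions are hardware-offloaded and independent of the control plane, the right policy is different (a BFD failure during restart really is a forwarding failure); the point is that the choice must be written down, not emergent.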
9) Security and hygiene checklist
Minimum hygiene:
- Single-hop: enforce TTL/Hop-Limit expectations (RFC 5881).
- Multihop: use authentication where possible (RFC 5883).
- Log and alert on unusual Diag reasons (e.g., Echo Function Failed vs Control Detection Time Expired).
- Treat frequent AdminDown transitions as an automation/process smell.
Modern BGP observability bonus:
- RFC 9384 defines Cease subcode “BFD Down” (value 10), improving postmortem clarity when BGP is torn down due to BFD.
10) FRR-oriented implementation notes (practical)
From FRRouting docs:
- default detect-multiplier: 3
- default tx/rx: 300 ms
- echo-mode disabled by default
- multihop sessions use port 4784 and echo is not supported there
- BGP can run with BFD strict mode, but this should be introduced carefully to avoid startup deadlocks/flap loops in unstable paths
A practical rollout sequence:
- Enable BFD with conservative profile (A).
- Validate 7-day flap rate + CPU impact.
- Move selected low-loss adjacency classes to profile B.
- Only then consider profile C for proven-clean links.
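Profile A in FRR vtysh syntax might look like the fragment below. This is a sketch, not a verified production config: the profile name, ASN, and peer address are placeholders, and you should confirm exact syntax against your FRR version's documentation.

```text
bfd
 profile conservative
  detect-multiplier 3
  receive-interval 300
  transmit-interval 300
 !
!
router bgp 65001
 neighbor 192.0.2.2 remote-as 65002
 neighbor 192.0.2.2 bfd profile conservative
```

Using a named profile rather than per-peer timers keeps the rollout sequence above auditable: moving an adjacency class from A to B is a one-line profile change.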
11) “Do not do this” list
- Don’t start at 50 ms / multiplier 3 across the entire network.
- Don’t use multihop BFD without security controls.
- Don’t enable echo mode just because it exists on one vendor.
- Don’t assume BFD instability means line instability; it can be control-plane scheduling/policing issues.
- Don’t skip failure-injection tests (drop %, jitter, burst loss) before tightening timers.
12) 20-minute validation runbook
Before production timer reduction:
- Confirm path quality SLOs (loss/jitter) match target profile assumptions.
- Check platform dataplane/control-plane BFD offload behavior.
- Verify protocol interaction (BGP/OSPF/IS-IS) in a lab with synthetic packet loss.
- Validate alerting for BFD state churn and cause codes.
- Roll out in rings; stop if flap rate crosses threshold.
After rollout:
- Compare convergence KPIs (p95 failover time) vs baseline.
- Compare churn KPIs (session flaps/hour, route update bursts).
- Keep profile if convergence gains outweigh churn cost; otherwise back off.
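The keep-or-back-off decision can be made mechanical. This is an illustrative gate (names and the default threshold are assumptions to be tuned, not recommendations):

```python
def keep_tightened_profile(flaps_per_hour: float,
                           p95_failover_ms_before: float,
                           p95_failover_ms_after: float,
                           max_flaps_per_hour: float = 1.0) -> bool:
    """Keep the faster profile only if churn stays under threshold AND
    p95 failover time actually improved versus the baseline."""
    if flaps_per_hour > max_flaps_per_hour:
        return False  # churn cost too high, back off
    return p95_failover_ms_after < p95_failover_ms_before
```

Wiring a gate like this into the ring rollout means the "stop if flap rate crosses threshold" rule is enforced by tooling rather than by someone watching a dashboard.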
13) Bottom line
BFD gives excellent convergence when you treat it as precision tooling:
- sane intervals,
- measured multipliers,
- clear protocol coupling,
- and strict scope boundaries.
If you tune it like a race car on a gravel road, it will absolutely spin out your control plane.
References
- RFC 5880 — Bidirectional Forwarding Detection (BFD)
  https://www.rfc-editor.org/rfc/rfc5880
- RFC 5881 — BFD for IPv4 and IPv6 (Single Hop)
  https://www.rfc-editor.org/rfc/rfc5881
- RFC 5882 — Generic Application of BFD
  https://www.rfc-editor.org/rfc/rfc5882
- RFC 5883 — BFD for Multihop Paths
  https://www.rfc-editor.org/rfc/rfc5883
- RFC 7419 — Common Interval Support in BFD
  https://www.rfc-editor.org/rfc/rfc7419
- RFC 7880 — Seamless BFD (S-BFD)
  https://www.rfc-editor.org/rfc/rfc7880
- RFC 8562 — BFD for Multipoint Networks
  https://www.rfc-editor.org/rfc/rfc8562
- RFC 9384 — BGP Cease NOTIFICATION Subcode for BFD Down
  https://www.rfc-editor.org/rfc/rfc9384
- FRRouting BFD documentation
  https://docs.frrouting.org/en/latest/bfd.html
- NVIDIA Cumulus Linux BFD docs
  https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-515/Layer-3/Bidirectional-Forwarding-Detection-BFD/