BFD Fast Failure Detection — Production Deployment Playbook

2026-03-29 · systems

Date: 2026-03-29
Category: knowledge
Audience: Network/SRE engineers running BGP/OSPF/IS-IS at scale

1) Why BFD deserves its own playbook

BFD is one of those tools that can be either a major convergence win or a steady source of control-plane churn.

The difference is almost never “BFD enabled vs disabled.” It is usually about timer discipline, failure-domain boundaries, and protocol coupling choices.


2) What BFD is (and is not)

Per RFC 5880, BFD is a low-latency liveness protocol for the forwarding path between two systems.

Important constraints from the RFC set:

Operational translation:


3) Core timer math you should memorize

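In Asynchronous mode (the mode most deployments actually use), the local detection time is driven by remote-advertised values and negotiated intervals (RFC 5880 §6.8.4):

  Detection Time = remote Detect Mult × max(local Required Min RX Interval, remote Desired Min TX Interval)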

Practical shorthand: when both ends are configured symmetrically, detection time ≈ interval × multiplier.

If you only optimize interval without multiplier discipline, you can still end up unstable.
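
The RFC 5880 §6.8.4 rule for Asynchronous mode (remote Detect Mult multiplied by the greater of the local Required Min RX Interval and the remote Desired Min TX Interval) can be sketched as a small helper. The function and parameter names below are illustrative, not taken from any vendor API, and all intervals are assumed to be in microseconds, as in the BFD control packet.

```python
def detection_time_us(remote_detect_mult: int,
                      local_required_min_rx_us: int,
                      remote_desired_min_tx_us: int) -> int:
    """Asynchronous-mode detection time at the local system, in microseconds."""
    if remote_detect_mult < 1:
        raise ValueError("Detect Mult must be at least 1")
    # The remote system transmits no faster than we are willing to receive
    # and no faster than it wants to send, so the slower value wins.
    agreed_remote_tx_us = max(local_required_min_rx_us, remote_desired_min_tx_us)
    return remote_detect_mult * agreed_remote_tx_us


# Local side accepts packets every 300 ms; remote offers 100 ms TX, multiplier 3.
print(detection_time_us(3, 300_000, 100_000))    # 900000
# Remote only offers 1 s TX, multiplier 3: the remote value dominates.
print(detection_time_us(3, 300_000, 1_000_000))  # 3000000
```

Note that the multiplier in the calculation is the one advertised by the peer, so an asymmetric configuration yields different detection times in each direction.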


4) Single-hop vs multihop: don’t blur them

Single-hop BFD (RFC 5881)

Multihop BFD (RFC 5883)

Design rule:


5) Mode choices in real networks

RFC 5880 modes: Asynchronous and Demand, with the Echo function available as an adjunct to either.

Real-world caveat example:

Bottom line: standardize on Asynchronous mode unless you have clear platform support and lab evidence for alternatives.


6) Recommended baseline profiles (starting points)

These are pragmatic defaults to start from, then tune per link quality and CPU budget.

Profile A — Conservative (WAN / noisy domains)

Profile B — Balanced (intra-DC leaf/spine)

Profile C — Fast but disciplined (high-quality links only)

Why C uses multiplier 5: you get fast recovery without making every 1-2 missed packets a topology event.
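
To make the multiplier trade-off concrete, here is a rough sketch that estimates how often a detection window would expire purely because of packet loss. It assumes each BFD packet is lost independently with a fixed probability, which is a simplification: real loss is often bursty, so treat the numbers as a lower bound on false-down risk. The function name and the 1% loss rate are illustrative assumptions.

```python
def false_down_probability(loss_rate: float, multiplier: int) -> float:
    """Probability that `multiplier` consecutive packets are all lost,
    assuming independent loss with probability `loss_rate` per packet."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    return loss_rate ** multiplier


# Compare multiplier 3 and multiplier 5 at an assumed 1% per-packet loss rate.
for mult in (3, 5):
    print(f"multiplier={mult}: {false_down_probability(0.01, mult):.1e}")
```

Under this model, moving from multiplier 3 to 5 at 1% loss lowers the per-window false-down probability by four orders of magnitude, at the cost of two extra intervals of detection time.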


7) Interop reality: Common Intervals matter

RFC 7419 highlights a practical interop issue: hardware implementations may support only a subset of timer values.

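Defined Common Intervals: 3.3 ms, 10 ms, 20 ms, 50 ms, 100 ms, and 1 s. RFC 7419 additionally recommends supporting a 10 s interval and multiplier values up to 255 for graceful-restart cases.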
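
A minimal sketch of checking a planned timer against the RFC 7419 Common Interval set (3.3 ms, 10 ms, 20 ms, 50 ms, 100 ms, 1 s). The helper names are hypothetical, and the "round up to the next slower Common Interval" policy is an assumption about what a cautious operator might do, not something the RFC mandates.

```python
# RFC 7419 Common Intervals, expressed in microseconds.
COMMON_INTERVALS_US = (3_300, 10_000, 20_000, 50_000, 100_000, 1_000_000)


def is_common_interval(interval_us: int) -> bool:
    """True if the interval is one of the RFC 7419 Common Intervals."""
    return interval_us in COMMON_INTERVALS_US


def next_common_interval(interval_us: int) -> int:
    """Smallest Common Interval that is not faster than the requested one."""
    for candidate in COMMON_INTERVALS_US:
        if candidate >= interval_us:
            return candidate
    raise ValueError("requested interval is slower than the largest Common Interval")


print(is_common_interval(30_000))    # False
print(next_common_interval(30_000))  # 50000
```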

Recommendation:


8) Coupling with routing protocols: avoid double-trigger chaos

From RFC 5882 guidance:

Operational pitfalls:

  1. BFD + aggressive IGP hello dead timers both tuned ultra-fast → redundant churn.
  2. BFD Down immediately tearing BGP while GR/GSHUT logic is also active → oscillation.
  3. Multiple protocol clients each instantiating independent BFD sessions for same path → unnecessary load.

Do this instead:
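
For the third pitfall, one common approach is to let routing protocols register as clients of a single shared session per path instead of each creating its own. The sketch below illustrates that idea only; the class names, the path key fields, and the reference-counting behavior are assumptions for illustration and do not describe any particular implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PathKey:
    """Identifies one forwarding path (hypothetical key fields)."""
    local_addr: str
    peer_addr: str
    interface: str


@dataclass
class SharedSession:
    key: PathKey
    clients: set = field(default_factory=set)


class SessionRegistry:
    """Keeps at most one session per path, shared by all client protocols."""

    def __init__(self) -> None:
        self._sessions: dict[PathKey, SharedSession] = {}

    def register(self, key: PathKey, client: str) -> SharedSession:
        # Reuse the existing session for this path if there is one.
        session = self._sessions.get(key)
        if session is None:
            session = SharedSession(key)
            self._sessions[key] = session
        session.clients.add(client)
        return session

    def unregister(self, key: PathKey, client: str) -> None:
        session = self._sessions.get(key)
        if session is None:
            return
        session.clients.discard(client)
        # Tear the session down only when the last client leaves.
        if not session.clients:
            del self._sessions[key]

    def session_count(self) -> int:
        return len(self._sessions)


registry = SessionRegistry()
key = PathKey("192.0.2.1", "192.0.2.2", "eth0")
registry.register(key, "ospf")
registry.register(key, "bgp")
print(registry.session_count())  # 1
```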


9) Security and hygiene checklist

Minimum hygiene:

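Modern BGP observability bonus: RFC 9384 defines a BGP Cease NOTIFICATION subcode for BFD Down, so a BFD-triggered session teardown can be told apart from other causes in logs and telemetry.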


10) FRR-oriented implementation notes (practical)

From FRRouting docs:

A practical rollout sequence:

  1. Enable BFD with conservative profile (A).
  2. Validate 7-day flap rate + CPU impact.
  3. Move selected low-loss adjacency classes to profile B.
  4. Only then consider profile C for proven-clean links.

11) “Do not do this” list


12) 20-minute validation runbook

Before production timer reduction:

  1. Confirm path quality SLOs (loss/jitter) match target profile assumptions.
  2. Check platform dataplane/control-plane BFD offload behavior.
  3. Verify protocol interaction (BGP/OSPF/IS-IS) in a lab with synthetic packet loss.
  4. Validate alerting for BFD state churn and cause codes.
  5. Roll out in rings; stop if flap rate crosses threshold.
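
The ring-rollout stop condition in step 5 can be sketched as a simple flap-rate gate. This assumes you already collect timestamps of BFD session Down transitions for the ring under test; the function names and the threshold value are placeholders to be replaced with your own telemetry source and limits.

```python
from datetime import datetime, timedelta


def flaps_per_hour(flap_times, window_end, window=timedelta(hours=1)):
    """Flap rate over the window ending at `window_end`, normalized to one hour."""
    window_start = window_end - window
    count = sum(1 for t in flap_times if window_start < t <= window_end)
    return count / (window.total_seconds() / 3600.0)


def should_halt_rollout(flap_times, window_end, max_flaps_per_hour):
    """True when the observed flap rate exceeds the agreed threshold."""
    return flaps_per_hour(flap_times, window_end) > max_flaps_per_hour


now = datetime(2026, 3, 29, 12, 0, 0)
# Three flaps in the last hour, one older flap outside the window.
flaps = [now - timedelta(minutes=m) for m in (5, 20, 50, 90)]
print(flaps_per_hour(flaps, now))            # 3.0
print(should_halt_rollout(flaps, now, 2.0))  # True
```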

After rollout:

  1. Compare convergence KPIs (p95 failover time) vs baseline.
  2. Compare churn KPIs (session flaps/hour, route update bursts).
  3. Keep profile if convergence gains outweigh churn cost; otherwise back off.
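
For the convergence comparison in step 1, a nearest-rank p95 over measured failover times is usually enough. The sketch below assumes failover samples in milliseconds gathered from your own tests; the sample values are made up for illustration, and the keep-or-back-off decision is deliberately left to the operator.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_improvement_ms(baseline_ms, candidate_ms):
    """Positive result means the candidate profile converges faster at p95."""
    return p95(baseline_ms) - p95(candidate_ms)


# Hypothetical failover measurements in milliseconds.
baseline = [900, 950, 1000, 1100, 1200, 3000]
candidate = [300, 320, 350, 400, 450, 900]
print(p95_improvement_ms(baseline, candidate))  # 2100
```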

13) Bottom line

BFD gives excellent convergence when you treat it as precision tooling: disciplined timers, clear failure-domain boundaries, and deliberate protocol coupling.

If you tune it like a race car on a gravel road, it will absolutely spin out your control plane.


References

  1. RFC 5880 — Bidirectional Forwarding Detection (BFD)
    https://www.rfc-editor.org/rfc/rfc5880
  2. RFC 5881 — BFD for IPv4 and IPv6 (Single Hop)
    https://www.rfc-editor.org/rfc/rfc5881
  3. RFC 5882 — Generic Application of BFD
    https://www.rfc-editor.org/rfc/rfc5882
  4. RFC 5883 — BFD for Multihop Paths
    https://www.rfc-editor.org/rfc/rfc5883
  5. RFC 7419 — Common Interval Support in BFD
    https://www.rfc-editor.org/rfc/rfc7419
  6. RFC 7880 — Seamless BFD (S-BFD)
    https://www.rfc-editor.org/rfc/rfc7880
  7. RFC 8562 — BFD for Multipoint Networks
    https://www.rfc-editor.org/rfc/rfc8562
  8. RFC 9384 — BGP Cease NOTIFICATION Subcode for BFD Down
    https://www.rfc-editor.org/rfc/rfc9384
  9. FRRouting BFD documentation
    https://docs.frrouting.org/en/latest/bfd.html
  10. NVIDIA Cumulus Linux BFD docs
    https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-515/Layer-3/Bidirectional-Forwarding-Detection-BFD/