BGP Fast-Failover Operations Playbook - PIC + FRR + Detection Budgets

2026-03-26 · systems

Date: 2026-03-26
Category: knowledge
Audience: Backbone/edge network engineers running large BGP fabrics

1) Why this matters

In large BGP networks, failure pain is rarely about whether an alternate path exists. It is mostly about how long forwarding takes to move after failure.

Without pre-computed repair, convergence time can scale with route volume and control-plane churn. With the right design, failover can be driven by local forwarding updates that are mostly independent of prefix count.

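That is the operational value of combining the three building blocks described in section 3: immediate local protection (IP FRR / TI-LFA), prefix-independent forwarding change (BGP PIC), and fast detection (BFD).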


2) Failure timeline mental model (where milliseconds are spent)

Treat failover latency as a budget with four buckets:

  1. Detection - link down, BFD down, or adjacencies expiring.
  2. Local repair - immediate dataplane detour by the PLR (point of local repair).
  3. FIB indirection switch - next-hop group pointer swap (PIC behavior).
  4. Control-plane cleanup - protocol reconvergence and path re-selection afterward.

If you skip (2) and (3), all traffic waits for (4), and tail loss explodes.
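
The budget can be made concrete with a small model. The following is a minimal sketch, not a measurement tool: the class name, field names, and millisecond values are all illustrative assumptions, and it deliberately simplifies the four buckets into two cases (repair pre-computed, or not).

```python
from dataclasses import dataclass


@dataclass
class FailoverBudget:
    """Per-bucket latency estimates in milliseconds (hypothetical values)."""
    detection_ms: float
    local_repair_ms: float
    fib_switch_ms: float
    control_plane_ms: float

    def loss_window_ms(self, pre_computed_repair: bool) -> float:
        """Approximate time during which traffic is dropped.

        Simplifying assumption: with pre-computed repair, forwarding
        recovers after buckets 1-3; without it, traffic waits for
        bucket 4 (control-plane reconvergence) after detection.
        """
        if pre_computed_repair:
            return self.detection_ms + self.local_repair_ms + self.fib_switch_ms
        return self.detection_ms + self.control_plane_ms


# Illustrative numbers only; replace with values measured in your network.
budget = FailoverBudget(
    detection_ms=50, local_repair_ms=5, fib_switch_ms=10, control_plane_ms=3000
)
protected = budget.loss_window_ms(pre_computed_repair=True)
unprotected = budget.loss_window_ms(pre_computed_repair=False)
print(f"protected: {protected} ms, unprotected: {unprotected} ms")
```

Even with made-up numbers, the model shows which bucket dominates: once repair is pre-computed, detection time is usually the largest remaining term.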


3) The three building blocks (and what each does)

A) IP FRR / TI-LFA: immediate local protection

Role: keep packets moving during the first moments after a failure.

B) BGP PIC: prefix-independent forwarding change

The current IETF work (draft-ietf-rtgwg-bgp-pic) describes organizing forwarding hierarchically so many prefixes share forwarding objects.
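
The idea of shared forwarding objects can be illustrated with a toy model. This is a sketch under stated assumptions, not how any vendor FIB is implemented: `NextHopGroup` and `Fib` are hypothetical names, and lookup is an exact-match dictionary rather than longest-prefix match.

```python
class NextHopGroup:
    """A shared forwarding object with a pre-installed backup (hypothetical)."""

    def __init__(self, primary: str, backup: str):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def fail_over(self) -> None:
        # One pointer swap, regardless of how many prefixes reference the group.
        self.active = self.backup


class Fib:
    """Toy FIB: prefixes point at a group instead of holding their own next hop."""

    def __init__(self):
        self.prefix_to_group = {}

    def install(self, prefix: str, group: NextHopGroup) -> None:
        self.prefix_to_group[prefix] = group

    def lookup(self, prefix: str) -> str:
        # Exact match only; a real FIB performs longest-prefix match.
        return self.prefix_to_group[prefix].active


fib = Fib()
group = NextHopGroup(primary="pe1", backup="pe2")
prefixes = [f"10.{i // 256}.{i % 256}.0/24" for i in range(1000)]
for prefix in prefixes:
    fib.install(prefix, group)

group.fail_over()  # a single update repairs all 1000 prefixes
print(all(fib.lookup(p) == "pe2" for p in prefixes))
```

The cost of the repair is one write to the group, which is why failover time in this structure does not depend on prefix count.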

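Operationally, a failure is then handled by updating the shared forwarding object once, rather than rewriting every dependent prefix.

Two practical domains: PIC Core and PIC Edge, which the rollout in section 5 enables as separate phases.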

C) Fast detection (BFD, RFC 5880)

BFD gives protocol-independent liveness with low detection latency. But overly aggressive timers can create flap storms and false positives.

Role: trigger protection quickly, but with controlled stability margins.
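
To reason about the detection budget, it helps to compute the detection time the timers imply. In RFC 5880 asynchronous mode, the local detection time is the remote system's Detect Mult multiplied by the agreed remote transmit interval, which is the greater of the local Required Min RX Interval and the remote Desired Min TX Interval. The sketch below encodes only that arithmetic; the function and parameter names are illustrative, and transmit jitter is ignored.

```python
def bfd_detection_time_ms(
    remote_detect_mult: int,
    local_required_min_rx_ms: float,
    remote_desired_min_tx_ms: float,
) -> float:
    """Asynchronous-mode detection time per RFC 5880, section 6.8.4.

    The remote system transmits at the slower of what it wants to send
    and what the local system is willing to receive.
    """
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# Example timer sets (illustrative, not recommendations).
print(bfd_detection_time_ms(3, 50, 50))    # aggressive: 3 x 50 ms
print(bfd_detection_time_ms(3, 300, 100))  # receiver-limited: 3 x 300 ms
```

Note that the slower side wins: lowering the transmit interval on one end has no effect if the peer's receive interval is higher.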


4) Preconditions before enabling PIC (non-negotiable)

  1. Path diversity exists

    • PIC helps only when alternate ECMP/backup paths exist.
  2. Recursive forwarding is already healthy

    • clean next-hop resolution in RIB/FIB,
    • no fragile recursion chains.
  3. IGP underlay supports fast repair

    • LFA/RLFA/TI-LFA policy and coverage are validated.
  4. Edge route visibility is sufficient

    • where needed, use mechanisms like ADD-PATH (RFC 7911) so alternate paths are visible before failure.
  5. Hardware/software FIB scale is understood

    • backup/repair objects consume resources; validate headroom first.
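
Precondition 1 can be checked mechanically before rollout. The following is a minimal sketch that assumes path data has already been exported into a plain dictionary of prefix to next-hop list; the export step, the function name, and the sample data are hypothetical.

```python
def prefixes_without_alternate(paths_by_prefix: dict) -> list:
    """Return prefixes that have fewer than two distinct next hops.

    paths_by_prefix maps a prefix string to a list of next-hop
    identifiers (assumed to come from your own RIB export tooling).
    """
    return sorted(
        prefix
        for prefix, next_hops in paths_by_prefix.items()
        if len(set(next_hops)) < 2
    )


# Illustrative data only.
sample = {
    "198.51.100.0/24": ["pe1", "pe2"],
    "203.0.113.0/24": ["pe1"],
    "192.0.2.0/24": ["pe3", "pe3"],  # duplicate entries are not diversity
}
print(prefixes_without_alternate(sample))
```

Prefixes in the returned list gain nothing from PIC until a second path is made visible, for example through ADD-PATH or a policy change.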

5) Practical rollout sequence (low-risk path)

Phase 0 - Baseline current failover

Measure before changes:

Phase 1 - Stabilize detection budget

Phase 2 - Deploy IGP FRR first

Phase 3 - Enable PIC Core

Phase 4 - Enable PIC Edge

Roll out in rings/cells; avoid one-shot network-wide activation.


6) Observability you need (or you are flying blind)

Dataplane

Control plane

Business/SLO layer


7) Common failure modes (seen in production)

  1. "PIC enabled" but no real alternates

    • zero benefit because topology is single-path at failure point.
  2. BFD too hot

    • false downs trigger oscillation and control-plane storms.
  3. FRR coverage gaps

    • LFA unavailable in some topologies; unprotected prefixes still wait for reconvergence.
  4. Edge policy mismatch

    • alternate egress exists physically but is rejected by policy.
  5. Scale surprises in FIB objects

    • backup groups consume TCAM/adjacency resources; partial install causes inconsistent behavior.
  6. Testing only link failure

    • node/SRLG-style events expose very different behavior.
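
Failure mode 3 can be reproduced on paper. The sketch below applies the basic loop-free condition from RFC 5286, Distance(N, D) < Distance(N, S) + Distance(S, D), to a small graph. It is an illustration under simplifying assumptions: symmetric link costs, link protection only, and a caller-supplied primary next hop. The function names and topologies are hypothetical.

```python
import heapq

INF = float("inf")


def shortest_distances(graph: dict, source: str) -> dict:
    """Dijkstra over graph[node] = {neighbor: cost}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, INF):
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            candidate = d + cost
            if candidate < dist.get(neighbor, INF):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def loop_free_alternates(graph: dict, s: str, primary: str, d: str) -> list:
    """Neighbors of s, other than the primary next hop, that satisfy
    the RFC 5286 basic loop-free inequality toward destination d."""
    dist_s = shortest_distances(graph, s)
    alternates = []
    for n in graph[s]:
        if n == primary:
            continue
        dist_n = shortest_distances(graph, n)
        if dist_n.get(d, INF) < dist_n.get(s, INF) + dist_s.get(d, INF):
            alternates.append(n)
    return sorted(alternates)


# Triangle: every destination has a loop-free alternate.
triangle = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "C": 1},
    "C": {"A": 1, "B": 1},
}
# Four-node ring with equal costs: the adjacent destination has none.
ring = {
    "A": {"B": 1, "D": 1},
    "B": {"A": 1, "C": 1},
    "C": {"B": 1, "D": 1},
    "D": {"C": 1, "A": 1},
}
print(loop_free_alternates(triangle, "A", "B", "B"))
print(loop_free_alternates(ring, "A", "B", "B"))
```

In the ring, neighbor D would send traffic for B back through A with equal cost, so the strict inequality fails. This is the kind of gap that RLFA or TI-LFA is meant to close, and it is why coverage should be computed per topology rather than assumed.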

8) Quick incident triage checklist (during real failure)

  1. Is failure detected by physical signal, BFD, or protocol timeout?
  2. Did local repair activate (LFA/RLFA/TI-LFA counters/tables)?
  3. Did next-hop group/recursion object switch immediately?
  4. Are alternates policy-eligible and installed in FIB?
  5. Is loss from dataplane congestion after reroute (not control-plane delay)?
  6. Are BFD flaps continuing after first event (instability loop)?

This order prevents wasting time blaming BGP when the real issue is underlay repair or path capacity.


9) Bottom line

Fast failover at BGP scale is not one feature toggle. It is a layered control system:

Detection budget (BFD/LOS) + local repair (FRR/TI-LFA) + prefix-independent forwarding structure (PIC) + controlled reconvergence cleanup.

If any one layer is weak, failures return to prefix-by-prefix pain.


References

  1. Internet-Draft: BGP Prefix Independent Convergence (draft-ietf-rtgwg-bgp-pic)
    https://datatracker.ietf.org/doc/draft-ietf-rtgwg-bgp-pic/
  2. RFC 5714 - IP Fast Reroute Framework
    https://www.rfc-editor.org/rfc/rfc5714
  3. RFC 5286 - Basic Specification for IP Fast Reroute: Loop-Free Alternates
    https://www.rfc-editor.org/rfc/rfc5286
  4. RFC 7490 - Remote Loop-Free Alternate (LFA) Fast Reroute (FRR)
    https://www.rfc-editor.org/rfc/rfc7490
  5. RFC 9855 - Topology Independent Fast Reroute Using Segment Routing
    https://www.rfc-editor.org/rfc/rfc9855
  6. RFC 5880 - Bidirectional Forwarding Detection (BFD)
    https://www.rfc-editor.org/rfc/rfc5880
  7. RFC 7911 - Advertisement of Multiple Paths in BGP (ADD-PATH)
    https://www.rfc-editor.org/rfc/rfc7911