BGP Planned Maintenance Playbook: GSHUT, Graceful Restart, and BFD

2026-03-26 · systems

Date: 2026-03-26
Category: knowledge
Audience: Network/platform engineers operating EBGP at scale

1) Why this matters

Planned maintenance often causes unplanned packet loss when operators simply hard-shutdown BGP sessions.

The core problem is timing mismatch:

A reliable maintenance workflow should intentionally separate:

  1. traffic drain (prefer alternate paths first),
  2. session teardown (after convergence),
  3. recovery/rejoin (without route churn spikes).

2) Correct mental model: three different tools

A) GSHUT (RFC 8326): planned drain signaling

B) Graceful Restart (RFC 4724): control-plane restart continuity

C) BFD (RFC 5880/5882): fast liveness detection

Key point: These are complementary, not substitutes.


3) Baseline maintenance sequence (vendor-neutral)

Phase 0: Preconditions (before maintenance window)

If alternates are absent, GSHUT cannot save you from loss.

Phase 1: Start drain

On the maintenance initiator for target EBGP session(s):

  1. Apply outbound policy to tag advertised paths with the GRACEFUL_SHUTDOWN well-known community (65535:0).
  2. Apply inbound policy to de-prefer paths learned on that soon-to-close session (RFC 8326 recommends a LOCAL_PREF of 0).
  3. Wait for re-advertisement + network convergence.
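
The two policies in steps 1 and 2 can be modeled in a few lines. This is a minimal sketch, not vendor configuration: the `Path` class and the function names are hypothetical, and only the community value 65535:0 and the low LOCAL_PREF come from RFC 8326.

```python
from dataclasses import dataclass, field, replace

# Well-known GRACEFUL_SHUTDOWN community from RFC 8326 (65535:0).
GRACEFUL_SHUTDOWN = (65535, 0)


@dataclass(frozen=True)
class Path:
    """Hypothetical, simplified view of a BGP path (illustration only)."""
    prefix: str
    local_pref: int = 100
    communities: frozenset = field(default_factory=frozenset)


def tag_outbound(path: Path) -> Path:
    """Outbound policy on the initiator: add GRACEFUL_SHUTDOWN to an advertised path."""
    return replace(path, communities=path.communities | {GRACEFUL_SHUTDOWN})


def deprefer_inbound(path: Path, low_pref: int = 0) -> Path:
    """Inbound policy on the initiator: lower LOCAL_PREF for paths learned on the draining session."""
    return replace(path, local_pref=low_pref)


def receiver_policy(path: Path, low_pref: int = 0) -> Path:
    """What a GSHUT-aware neighbor does on receipt: de-prefer tagged paths, leave others alone."""
    if GRACEFUL_SHUTDOWN in path.communities:
        return replace(path, local_pref=low_pref)
    return path
```

The receiver side matters as much as the initiator side: if the neighbor does not match the community and lower its own preference, the outbound tag has no effect on traffic toward you.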

Operationally, do not rush this wait. Validate real traffic drain, not just control-plane state.
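
One way to make "validate real traffic drain" concrete is to gate on measured interface throughput rather than on BGP state. The sketch below assumes you already collect per-interval bit-rate samples for the draining link; the function name, the 1 Mbit/s threshold, and the five-sample window are illustrative assumptions, not recommendations from the RFCs.

```python
from typing import Iterable


def drain_complete(samples_bps: Iterable[float],
                   threshold_bps: float = 1_000_000.0,
                   required_consecutive: int = 5) -> bool:
    """Return True only if the most recent samples are all below the threshold.

    Requiring several consecutive low samples avoids declaring the drain
    finished on a single quiet measurement interval.
    """
    recent = list(samples_bps)[-required_consecutive:]
    if len(recent) < required_consecutive:
        return False  # not enough data yet to decide
    return all(sample < threshold_bps for sample in recent)
```

A residual floor above zero is usually needed because control-plane traffic (BGP keepalives, BFD) keeps flowing on the link until the session is actually torn down.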

Phase 2: Validate drain completion

Gate conditions before shutdown:

Phase 3: Teardown

Phase 4: Bring back


4) BFD interaction during maintenance (where people get burned)

With aggressive BFD timers, maintenance transitions can trigger avoidable churn.

Practical guidance:

BFD is excellent for surprise failures; maintenance is not a surprise and should be orchestrated.
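
To judge whether a timer set is "aggressive" relative to a planned transition, it helps to compute the detection time explicitly. Per RFC 5880 (asynchronous mode), the local detection time is the remote Detect Mult times the greater of the local Required Min RX Interval and the remote Desired Min TX Interval. The function names and the notion of an "expected blip" duration below are assumptions made for illustration.

```python
def bfd_detection_time_ms(local_required_min_rx_ms: float,
                          remote_desired_min_tx_ms: float,
                          remote_detect_mult: int) -> float:
    """Asynchronous-mode detection time as seen by the local system (RFC 5880)."""
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


def bfd_may_fire(expected_blip_ms: float, detection_time_ms: float) -> bool:
    """True if a forwarding interruption of this length could trip BFD."""
    return expected_blip_ms >= detection_time_ms
```

For example, 50 ms intervals with a multiplier of 3 give a 150 ms detection time, so any maintenance step expected to interrupt forwarding for 150 ms or longer should be treated as a BFD-triggering event and sequenced after the drain.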


5) Graceful Restart: where it helps and where it does not

GR is useful if control plane restarts but forwarding can continue.

It is not a universal fix for maintenance that affects the forwarding adjacency or its capacity (e.g., interface or line card impact, physical move, hard path disruption). In those cases, planned traffic migration (the GSHUT workflow) is still required.

Use GR to reduce churn, not as an excuse to skip drain engineering.
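
Whether forwarding was actually preserved is signaled per address family in the Graceful Restart capability (RFC 4724): a 4-bit flags field and 12-bit Restart Time, followed by AFI/SAFI tuples whose flags octet carries the Forwarding State bit. The parser below is a minimal sketch of that layout for inspecting a captured capability value; the function name and return shape are hypothetical.

```python
import struct
from typing import Dict, Tuple


def parse_gr_capability(value: bytes) -> Tuple[bool, int, Dict[Tuple[int, int], bool]]:
    """Decode the value field of a Graceful Restart capability.

    Returns (restart_state_bit, restart_time_seconds,
             {(afi, safi): forwarding_state_bit}).
    """
    if len(value) < 2 or (len(value) - 2) % 4 != 0:
        raise ValueError("malformed Graceful Restart capability value")
    (head,) = struct.unpack("!H", value[:2])
    restart_state = bool(head & 0x8000)   # R bit: most significant flag bit
    restart_time = head & 0x0FFF          # lower 12 bits, in seconds
    forwarding: Dict[Tuple[int, int], bool] = {}
    for offset in range(2, len(value), 4):
        afi, safi, flags = struct.unpack("!HBB", value[offset:offset + 4])
        forwarding[(afi, safi)] = bool(flags & 0x80)  # F bit
    return restart_state, restart_time, forwarding
```

If the Forwarding State bit is clear for an address family, the peer should not assume the restarting speaker kept forwarding for it, which is exactly the case where drain engineering cannot be skipped.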


6) Observability checklist (must-have)

Track these around every maintenance event:

A maintenance playbook without these metrics is “hope-driven networking.”


7) Failure modes to preempt


8) 15-minute runbook (concise)

  1. Confirm alternates + capacity headroom.
  2. Enable outbound GSHUT tagging on target EBGP edges.
  3. Apply inbound de-preference on to-be-shutdown sessions.
  4. Wait for convergence and verify traffic migration.
  5. If stable, admin-shutdown/reset (with RFC 8203 message if supported).
  6. Complete maintenance.
  7. Restore session and normal policies/timers.
  8. Verify no prolonged imbalance/flap; close only after metrics normalize.
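
The runbook above is essentially an ordered list of gated steps, which is straightforward to encode so that a failed check stops the sequence before teardown. The runner below is a hedged sketch: `run_maintenance` and the step callables are hypothetical placeholders for whatever automation you already have, and rollback handling is deliberately left out.

```python
from typing import Callable, List, Tuple

# A step is a human-readable name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_maintenance(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure."""
    log: List[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            log.append("aborted: later steps skipped")
            break
    return log
```

Each runbook item maps to one entry, for example `("verify traffic migration", check_drain)` placed ahead of `("admin-shutdown session", shutdown_session)`, so a session is never shut down while the drain check is still failing.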

9) Bottom line

For planned EBGP maintenance, the safest pattern is:

GSHUT for controlled drain + GR where restart continuity applies + BFD tuned/coordinated for maintenance context.

Most outage pain in maintenance windows comes from using only one of these tools (or using them in the wrong order).


References

  1. RFC 8326: Graceful BGP Session Shutdown
    https://www.rfc-editor.org/rfc/rfc8326.html
  2. RFC 4724: Graceful Restart Mechanism for BGP
    https://www.rfc-editor.org/rfc/rfc4724
  3. RFC 5880: Bidirectional Forwarding Detection (BFD)
    https://www.rfc-editor.org/rfc/rfc5880
  4. RFC 5882: Generic Application of Bidirectional Forwarding Detection (BFD)
    https://www.rfc-editor.org/rfc/rfc5882
  5. RFC 8203: BGP Administrative Shutdown Communication
    https://www.rfc-editor.org/rfc/rfc8203
  6. NLNOG BGP Filter Guide: Graceful Shutdown examples
    https://bgpfilterguide.nlnog.net/guides/graceful_shutdown/