BGP Planned Maintenance Playbook: GSHUT, Graceful Restart, and BFD
Date: 2026-03-26
Category: knowledge
Audience: Network/platform engineers operating EBGP at scale
1) Why this matters
Planned maintenance often causes unplanned packet loss when operators simply hard-shutdown BGP sessions.
The core problem is timing mismatch:
- control-plane withdrawal/reconvergence takes time,
- forwarding keeps sending traffic into soon-to-die paths,
- fallback paths may not be best/visible yet.
A reliable maintenance workflow should intentionally separate:
- traffic drain (prefer alternate paths first),
- session teardown (after convergence),
- recovery/rejoin (without route churn spikes).
2) Correct mental model: three different tools
A) GSHUT (RFC 8326): planned drain signaling
- Uses well-known community GRACEFUL_SHUTDOWN = 65535:0.
- Receiver policy lowers LOCAL_PREF for tagged routes (RFC 8326 asks for a low value and recommends 0).
- Goal: move traffic away before peering goes down.
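The receiver-side behavior described above can be sketched in Python. This is a minimal model, not any vendor's policy language: `Route` and `apply_gshut_import_policy` are hypothetical names introduced for illustration, and the default LOCAL_PREF of 100 is an assumption. Only the community value 65535:0 and the target LOCAL_PREF of 0 come from RFC 8326.

```python
from dataclasses import dataclass, field, replace

# GRACEFUL_SHUTDOWN well-known community (RFC 8326) as a (high, low) 16-bit pair.
GRACEFUL_SHUTDOWN = (65535, 0)


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified view of a received BGP path."""
    prefix: str
    local_pref: int = 100  # assumed default, not mandated by any RFC
    communities: frozenset = field(default_factory=frozenset)


def community_to_int(community):
    """Encode a (high, low) community as its 32-bit value."""
    high, low = community
    return (high << 16) | low


def apply_gshut_import_policy(route, gshut_local_pref=0):
    """Lower LOCAL_PREF for paths tagged GRACEFUL_SHUTDOWN; leave others untouched."""
    if GRACEFUL_SHUTDOWN in route.communities:
        return replace(route, local_pref=gshut_local_pref)
    return route


tagged = Route("203.0.113.0/24", 200, frozenset({GRACEFUL_SHUTDOWN}))
untagged = Route("198.51.100.0/24", 200)
print(apply_gshut_import_policy(tagged).local_pref)    # 0
print(apply_gshut_import_policy(untagged).local_pref)  # 200
print(hex(community_to_int(GRACEFUL_SHUTDOWN)))        # 0xffff0000
```

The tagged path stays in the table with the lowest preference, so it remains usable as a last resort until the session actually closes.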
B) Graceful Restart (RFC 4724): control-plane restart continuity
- Helps preserve forwarding while BGP process/session restarts.
- Uses GR capability + End-of-RIB signaling.
- Best when forwarding plane is still healthy.
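To make the End-of-RIB mechanics concrete, here is a toy Python model of the receiving ("helper") side for a single peer. It is a sketch under simplifying assumptions: the class and method names are hypothetical, timers are reduced to an explicit method call, and real implementations track far more state. The retained behavior follows RFC 4724: routes are kept but marked stale when the session drops, and anything still stale at End-of-RIB (or when the peer fails to return in time) is removed.

```python
class GracefulRestartHelper:
    """Toy model of a GR helper's view of one restarting peer."""

    def __init__(self):
        self.routes = {}    # prefix -> next hop
        self.stale = set()  # prefixes retained from before the restart

    def learn(self, prefix, next_hop):
        """Install or refresh a route; a refreshed route is no longer stale."""
        self.routes[prefix] = next_hop
        self.stale.discard(prefix)

    def session_lost(self):
        """Keep forwarding on existing routes, but mark all of them stale."""
        self.stale = set(self.routes)

    def purge_stale(self):
        """Remove routes that were never refreshed after the restart."""
        for prefix in self.stale:
            del self.routes[prefix]
        self.stale.clear()

    # End-of-RIB and restart-timer expiry both end with the same purge.
    end_of_rib = purge_stale
    restart_timer_expired = purge_stale


helper = GracefulRestartHelper()
helper.learn("203.0.113.0/24", "192.0.2.1")
helper.learn("198.51.100.0/24", "192.0.2.1")
helper.session_lost()
helper.learn("203.0.113.0/24", "192.0.2.1")  # re-advertised after the restart
helper.end_of_rib()
print(sorted(helper.routes))  # ['203.0.113.0/24']
```

Between `session_lost()` and `end_of_rib()` the helper still forwards on both prefixes, which is exactly why GR only helps when the restarting router's forwarding plane is still intact.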
C) BFD (RFC 5880/5882): fast liveness detection
- Detects path failure quickly, independent of routing protocol hellos.
- Great for unplanned failures.
- Can be too aggressive during maintenance if not coordinated.
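How "aggressive" a BFD profile is comes down to simple arithmetic. The sketch below follows the asynchronous-mode rule in RFC 5880: the local detection time is the remote Detect Mult multiplied by the agreed remote transmit interval, which is the greater of the local Required Min RX and the remote Desired Min TX. The function name and the example timer values are illustrative assumptions.

```python
def bfd_detection_time_ms(remote_detect_mult, local_required_min_rx_ms,
                          remote_desired_min_tx_ms):
    """Asynchronous-mode BFD detection time, in milliseconds."""
    agreed_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_interval_ms


# A tight profile versus a relaxed profile for a maintenance window.
print(bfd_detection_time_ms(3, 50, 50))    # 150
print(bfd_detection_time_ms(3, 300, 50))   # 900
```

Raising only the local Required Min RX already slows the peer's transmit rate toward this side, which is one way to relax detection without touching the remote configuration.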
Key point: These are complementary, not substitutes.
- GSHUT handles planned drain.
- GR handles restart continuity.
- BFD handles fast failure detection.
3) Baseline maintenance sequence (vendor-neutral)
Phase 0: Preconditions (before maintenance window)
- Confirm alternate paths exist and are policy-eligible.
- Ensure inbound policy on peers honors 65535:0 by lowering LOCAL_PREF.
- Verify no downstream policy later in chain overwrites that lowered LOCAL_PREF.
- Check route-reflector visibility and best-path propagation assumptions.
If alternates are absent, GSHUT cannot save you from loss.
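The alternate-path precondition can be checked mechanically before the window opens. The following is a minimal sketch: it assumes you can export, per prefix, the set of peers offering a policy-eligible path. The input shape and the function name `prefixes_without_alternates` are hypothetical.

```python
def prefixes_without_alternates(paths_by_prefix, draining_peer):
    """Return prefixes whose only eligible paths are via the draining peer.

    paths_by_prefix maps prefix -> iterable of peer names offering an
    eligible path (hypothetical input built from your own RIB export).
    """
    at_risk = []
    for prefix, peers in paths_by_prefix.items():
        alternates = [peer for peer in peers if peer != draining_peer]
        if not alternates:
            at_risk.append(prefix)
    return sorted(at_risk)


rib = {
    "203.0.113.0/24": ["peer-a", "peer-b"],
    "198.51.100.0/24": ["peer-a"],
    "192.0.2.0/24": ["peer-b"],
}
print(prefixes_without_alternates(rib, "peer-a"))  # ['198.51.100.0/24']
```

Any prefix in the result will lose reachability at teardown regardless of how carefully the drain is run, so a non-empty list should block the maintenance.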
Phase 1: Start drain
On the maintenance initiator for target EBGP session(s):
- Apply outbound policy to tag advertised paths with GRACEFUL_SHUTDOWN.
- Apply inbound policy to de-prefer paths learned on that soon-to-close session.
- Wait for re-advertisement + network convergence.
Operationally, do not rush this wait. Validate real traffic drain, not just control-plane state.
Phase 2: Validate drain completion
Gate conditions before shutdown:
- session still up,
- route counts stable,
- primary traffic shifted to alternates,
- no new hot spots/congestion alarms on backup paths.
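These gate conditions lend themselves to an automated check. The sketch below is one possible shape, assuming telemetry has already been collected into a snapshot. `DrainSnapshot`, `drain_gate`, and every threshold (three identical route-count samples, 1% residual traffic, 80% backup utilization) are illustrative assumptions, not recommendations from the RFCs.

```python
from dataclasses import dataclass


@dataclass
class DrainSnapshot:
    """Hypothetical telemetry snapshot taken before teardown."""
    session_up: bool
    route_count_samples: list  # recent route counts, oldest first
    primary_bps: float         # traffic still on the draining link
    baseline_bps: float        # traffic on that link before the drain
    backup_utilization: dict   # link name -> utilization as a 0..1 fraction


def drain_gate(snap, stable_window=3, residual_fraction=0.01, max_backup_util=0.8):
    """Return the names of failed gate conditions; an empty list means proceed."""
    failures = []
    if not snap.session_up:
        failures.append("session_down")
    recent = snap.route_count_samples[-stable_window:]
    if len(recent) < stable_window or len(set(recent)) != 1:
        failures.append("route_counts_unstable")
    if snap.primary_bps > residual_fraction * snap.baseline_bps:
        failures.append("traffic_not_drained")
    hot = [link for link, util in sorted(snap.backup_utilization.items())
           if util > max_backup_util]
    if hot:
        failures.append("backup_hot_spots:" + ",".join(hot))
    return failures


ok = DrainSnapshot(True, [120, 118, 118, 118], 2e6, 1e9, {"link-b": 0.55})
bad = DrainSnapshot(True, [120, 119, 118], 4e8, 1e9, {"link-b": 0.92})
print(drain_gate(ok))   # []
print(drain_gate(bad))
```

Returning the list of failed conditions, rather than a single boolean, gives the operator a reason to record when a teardown is postponed.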
Phase 3: Teardown
- Perform administrative shutdown/reset of session/device.
- Optionally include BGP shutdown communication text (RFC 8203, since obsoleted by RFC 9003) for peer visibility.
Phase 4: Bring back
- Re-enable session.
- Remove maintenance-specific policy knobs.
- Confirm path re-selection is stable (no oscillation).
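One way to make "no oscillation" measurable is to count best-path changes for a prefix over successive samples after the session returns. This is a minimal sketch: the sampling source, the function names, and the allowance of a single expected change (backup back to primary) are assumptions.

```python
def best_path_changes(history):
    """Count transitions between consecutive best-path samples."""
    return sum(1 for prev, curr in zip(history, history[1:]) if prev != curr)


def is_oscillating(history, max_changes=1):
    """After rejoin, one change (backup -> primary) is expected; more suggests flapping."""
    return best_path_changes(history) > max_changes


print(is_oscillating(["peer-b", "peer-b", "peer-a", "peer-a"]))  # False
print(is_oscillating(["peer-b", "peer-a", "peer-b", "peer-a"]))  # True
```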
4) BFD interaction during maintenance (where people get burned)
With aggressive BFD timers, maintenance transitions can trigger avoidable churn.
Practical guidance:
- Prefer "explicit drain first, shutdown later" over relying on BFD-triggered failover.
- During planned work, either:
  - place relevant BFD sessions in administrative/maintenance-friendly state, or
  - use less aggressive detection while drain is in progress.
- After maintenance, restore normal BFD profile.
BFD is excellent for surprise failures; maintenance is not a surprise and should be orchestrated.
5) Graceful Restart: where it helps and where it does not
GR is useful if control plane restarts but forwarding can continue.
It is not a universal fix for maintenance that affects the forwarding adjacency or its capacity (e.g., interface or line-card impact, a physical move, hard path disruption). In those cases, planned traffic migration (the GSHUT workflow) is still required.
Use GR to reduce churn, not as an excuse to skip drain engineering.
6) Observability checklist (must-have)
Track these around every maintenance event:
Control plane
- route count deltas (pre/drain/post)
- convergence duration
- update burst size
Forwarding plane
- packet loss / micro-loss windows
- path utilization on alternates
- queue/drop counters on backup links
BFD/BGP health
- BFD state transitions and flaps
- BGP session reset reason/subcode
- End-of-RIB timing (if GR used)
A maintenance playbook without these metrics is "hope-driven networking."
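As an example of the control-plane item "route count deltas (pre/drain/post)", the sketch below compares three per-peer snapshots. The snapshot format (peer name mapped to received route count) and the function name are assumptions. How the counts are collected is left to your own tooling.

```python
def route_count_deltas(pre, drain, post):
    """Compare per-peer route counts across the three maintenance stages."""
    peers = sorted(set(pre) | set(drain) | set(post))
    return {
        peer: {
            "drain_vs_pre": drain.get(peer, 0) - pre.get(peer, 0),
            "post_vs_pre": post.get(peer, 0) - pre.get(peer, 0),
        }
        for peer in peers
    }


pre = {"peer-a": 500, "peer-b": 480}
drain = {"peer-a": 500, "peer-b": 480}
post = {"peer-a": 497, "peer-b": 480}
print(route_count_deltas(pre, drain, post))
```

A non-zero `post_vs_pre` after the session has settled is worth explaining before the change is closed, since it means the peer is not advertising what it did before the window.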
7) Failure modes to preempt
- GSHUT community sent, but receiver does not honor it.
- LOCAL_PREF lowered, then overwritten by later policy term.
- All alternates pass control-plane checks but fail capacity checks.
- BFD timers so tight they induce churn during controlled transitions.
- Operators conflate GR with planned drain and skip Phase 1/2.
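The "lowered, then overwritten" failure mode is easy to reproduce in a toy policy evaluator. The sketch assumes a simplified model in which every matching term in a chain is applied in order and the last writer wins. Real policy engines differ (many have terminating actions), so treat the term format, the names, and the example community 64500:100 as hypothetical.

```python
GSHUT = "65535:0"


def final_local_pref(terms, communities, initial=100):
    """Apply (name, match_fn, new_local_pref) terms in order; the last match wins."""
    local_pref = initial
    for _name, matches, new_value in terms:
        if matches(communities):
            local_pref = new_value
    return local_pref


honor_gshut = ("honor-gshut", lambda c: GSHUT in c, 0)
prefer_customer = ("prefer-customer", lambda c: "64500:100" in c, 200)

route_communities = {GSHUT, "64500:100"}
# GSHUT term first: the later customer term overwrites the lowered value.
print(final_local_pref([honor_gshut, prefer_customer], route_communities))  # 200
# GSHUT term last: the lowered value survives.
print(final_local_pref([prefer_customer, honor_gshut], route_communities))  # 0
```

Running a check like this against an export of the real policy chain, with a synthetic GSHUT-tagged route as input, catches the ordering problem before the maintenance window rather than during it.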
8) 15-minute runbook (concise)
- Confirm alternates + capacity headroom.
- Enable outbound GSHUT tagging on target EBGP edges.
- Apply inbound de-preference on to-be-shutdown sessions.
- Wait for convergence and verify traffic migration.
- If stable, admin-shutdown/reset (with RFC 8203 message if supported).
- Complete maintenance.
- Restore session and normal policies/timers.
- Verify no prolonged imbalance/flap; close only after metrics normalize.
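The runbook's ordering, and the rule that teardown only happens after a successful drain check, can be enforced by a small step runner. This is a skeleton under stated assumptions: the step names mirror the list above, while the callables are placeholder stubs standing in for whatever automation or manual confirmation you use.

```python
def run_maintenance(steps):
    """Run (name, action) steps in order; stop at the first action returning False.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


steps = [
    ("confirm_alternates", lambda: True),
    ("enable_gshut_tagging", lambda: True),
    ("apply_inbound_depref", lambda: True),
    ("verify_drain", lambda: False),   # stub: the gate fails in this example
    ("admin_shutdown", lambda: True),  # never reached
]
completed, failed_at = run_maintenance(steps)
print(completed)  # ['confirm_alternates', 'enable_gshut_tagging', 'apply_inbound_depref']
print(failed_at)  # verify_drain
```

Because the runner stops at the failed gate, the session is still up and tagged at that point, so the operator can either wait longer or roll back the drain policy.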
9) Bottom line
For planned EBGP maintenance, the safest pattern is:
GSHUT for controlled drain + GR where restart continuity applies + BFD tuned/coordinated for maintenance context.
Most outage pain in maintenance windows comes from using only one of these tools (or using them in the wrong order).
References
- RFC 8326: Graceful BGP Session Shutdown
  https://www.rfc-editor.org/rfc/rfc8326.html
- RFC 4724: Graceful Restart Mechanism for BGP
  https://www.rfc-editor.org/rfc/rfc4724
- RFC 5880: Bidirectional Forwarding Detection (BFD)
  https://www.rfc-editor.org/rfc/rfc5880
- RFC 5882: Generic Application of BFD
  https://www.rfc-editor.org/rfc/rfc5882
- RFC 8203: BGP Administrative Shutdown Communication
  https://www.rfc-editor.org/rfc/rfc8203
- NLNOG BGP Filter Guide: Graceful Shutdown examples
  https://bgpfilterguide.nlnog.net/guides/graceful_shutdown/