Anycast Service Operations — BGP Health Signaling, Hysteresis, and Safe Cutover Playbook

2026-03-29 · systems

Date: 2026-03-29
Category: knowledge
Audience: Network/SRE operators running global anycast services (DNS, edge API, L4/L7 ingress)

1) Why this matters

Anycast is powerful because one service IP can be announced from many sites, and routing naturally sends users to a topologically “closest” node. But “closest by BGP policy” is not always “best by latency/load/health”.

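In practice, most outages are not caused by “anycast itself” but by control-loop mismatch: a slow, global routing change is used to handle a problem that a faster, local loop should have absorbed.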

The result: traffic blackholes, path hunting, and noisy flap storms.


2) Mental model: three control loops, three timescales

Loop A — Local traffic control (fast)

Loop B — Site-level admission (medium)

Loop C — Global routing control (slow)

Rule: Do not use Loop C for events that Loop A/B can absorb.


3) Failure taxonomy (what to do first)

Type 1: Host/pod failures inside a healthy site

Type 2: Site brownout (high error rate, but not hard down)

Type 3: Site hard failure or severe packet loss

Type 4: Planned maintenance


4) Core design principles

  1. Health ≠ reachability
    A BGP session can be up while the service is unhealthy. Route export must be tied to service health, not only peering state.

  2. Prefer de-preference before withdrawal
    De-preference preserves alternate path visibility and reduces path-hunting turbulence.

  3. Use hysteresis everywhere
    Use separate thresholds for entering and leaving a degraded state, plus hold-down timers, to prevent oscillation.

  4. One controlling signal per decision
    Avoid letting several uncoordinated gates (NMS alarm, app check, manual override) race each other on the same decision.

  5. Treat Internet convergence as asynchronous and lossy
    Assume some networks converge late; design drain windows accordingly.
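
Principles 1 and 3 can be combined into a single gate that decides whether a site should keep announcing its prefix. The following is a minimal sketch, not a reference implementation: the class name `HysteresisGate`, the error-rate thresholds, and the 300-second hold-down are all assumptions chosen for illustration, and the input is taken to be a service-level error rate rather than BGP session state.

```python
from dataclasses import dataclass


@dataclass
class HysteresisGate:
    """Decides whether a site should keep announcing its anycast prefix."""

    enter_unhealthy: float = 0.05   # error rate that trips the gate (assumed value)
    exit_unhealthy: float = 0.01    # error rate required to recover (assumed value)
    hold_down_s: float = 300.0      # minimum time between state changes (assumed value)
    healthy: bool = True
    last_change: float = float("-inf")

    def update(self, error_rate: float, now: float) -> bool:
        """Feed one measurement; return True if the prefix should be announced."""
        if now - self.last_change < self.hold_down_s:
            return self.healthy  # inside hold-down: keep the current state
        if self.healthy and error_rate >= self.enter_unhealthy:
            self.healthy, self.last_change = False, now
        elif not self.healthy and error_rate <= self.exit_unhealthy:
            self.healthy, self.last_change = True, now
        return self.healthy
```

Because the exit threshold is lower than the enter threshold, a site hovering around 2% errors stays in whatever state it is already in instead of flapping. In this sketch the hold-down applies in both directions; an operator may prefer to bypass it for hard failures.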


5) Recommended policy ladder (least disruptive → most disruptive)

Stage 0: Normal

Stage 1: Soft shift

Stage 2: Hard shift

Stage 3: Withdrawal

This staged approach usually beats “panic-withdraw first.”
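
The ladder can be modeled as an ordered set of stages where each evaluation moves at most one rung. This is a sketch under assumptions: the stage names come from the list above, but `next_stage` and the one-rung-per-evaluation rule are illustrative choices, and the routing action behind each stage is deliberately left out.

```python
from enum import IntEnum


class Stage(IntEnum):
    NORMAL = 0
    SOFT_SHIFT = 1
    HARD_SHIFT = 2
    WITHDRAWAL = 3


def next_stage(current: Stage, still_degraded: bool) -> Stage:
    """Move at most one rung per evaluation: up while degraded, down once recovered."""
    if still_degraded:
        return Stage(min(current + 1, Stage.WITHDRAWAL))
    return Stage(max(current - 1, Stage.NORMAL))
```

Starting from `Stage.NORMAL`, three consecutive degraded evaluations are needed to reach `Stage.WITHDRAWAL`, which gives the less disruptive stages a chance to resolve the problem first.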


6) Practical anti-flap controls

Hysteresis template

Change budget
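
One plausible reading of a change budget is a cap on how many automated routing actions may be taken within a sliding time window. The sketch below assumes that reading; `ChangeBudget`, the limit of 3 changes, and the one-hour window are hypothetical names and values, not taken from the playbook.

```python
from collections import deque


class ChangeBudget:
    """Allows at most max_changes routing actions per sliding window."""

    def __init__(self, max_changes: int = 3, window_s: float = 3600.0):
        self.max_changes = max_changes
        self.window_s = window_s
        self._stamps = deque()  # timestamps of actions still inside the window

    def try_spend(self, now: float) -> bool:
        """Return True and record the action if budget remains, else False."""
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()  # forget actions that have aged out
        if len(self._stamps) >= self.max_changes:
            return False
        self._stamps.append(now)
        return True
```

When `try_spend` returns False, the automation would hold its current state and escalate to a human rather than issue another route change.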

Damp only the unstable edges, not everything

Maintenance guardrails


7) Safe cutover / withdrawal runbook (Internet-facing anycast)

Step 1 — Introduce same-length alternate path

Announce the prefix at the same length from the destination site or provider before touching the old announcement.

Step 2 — Wait for global convergence window

Allow 5–10 minutes (or your measured baseline) before withdrawing the old path.

Step 3 — Verify catchment movement

Check per-AS and per-region traffic and error metrics; confirm that the target site is absorbing load safely.

Step 4 — Withdraw old path if needed

Execute withdrawal only after drain success criteria are met.

Step 5 — Post-check for stragglers

Monitor late-converging networks and temporary suboptimal paths.

Why this works: the alternate route is already in place before the old one disappears, so the default-free zone (DFZ) is not forced into a frantic path search for an equivalent route.
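
Steps 1 to 4 can be expressed as a gate that must pass before the withdrawal is executed. This is a sketch under stated assumptions: `CutoverState`, `may_withdraw_old_path`, and the two success thresholds are hypothetical, and the 600-second default is simply the upper end of the 5 to 10 minute window from Step 2. Substitute your own measured baseline and drain criteria.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CutoverState:
    alternate_announced_at: Optional[float]  # time Step 1 completed, in seconds
    target_traffic_share: float              # fraction of traffic seen at the target site
    target_error_rate: float                 # error rate observed at the target site


def may_withdraw_old_path(
    state: CutoverState,
    now: float,
    convergence_window_s: float = 600.0,   # upper end of the 5-10 minute window
    min_target_share: float = 0.2,         # assumed evidence that catchment has moved
    max_target_error_rate: float = 0.01,   # assumed "absorbing load safely" criterion
) -> Tuple[bool, str]:
    """Return (allowed, reason) for withdrawing the old announcement."""
    if state.alternate_announced_at is None:
        return False, "alternate path not announced yet"
    if now - state.alternate_announced_at < convergence_window_s:
        return False, "convergence window still open"
    if state.target_traffic_share < min_target_share:
        return False, "catchment has not moved to the target site"
    if state.target_error_rate > max_target_error_rate:
        return False, "target site is not absorbing load safely"
    return True, "drain success criteria met"
```

Returning the reason alongside the decision keeps the runbook auditable: the operator or automation can log exactly which step blocked the withdrawal.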


8) Stateful protocols and “stabilized anycast”

Anycast route movement can reset active TCP flows if packets suddenly land on a different frontend lacking flow state.

Mitigations:

Design goal: keep existing flows sticky while shifting new flows first.
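
At the load-balancer tier, that goal might look like the following sketch: a flow table is consulted first so established flows keep their backend, and only new flows are steered away from backends that are draining. The names `pick_backend`, `flow_table`, and `draining` are hypothetical, and the backend choice uses a plain hash-modulo for brevity rather than a consistent-hashing scheme such as Maglev's.

```python
import hashlib
from typing import Dict, List, Set, Tuple

FlowKey = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


def pick_backend(
    flow: FlowKey,
    flow_table: Dict[FlowKey, str],
    backends: List[str],
    draining: Set[str],
) -> str:
    """Existing flows keep their backend; new flows avoid draining backends."""
    if flow in flow_table:
        return flow_table[flow]  # sticky: never move an established flow
    # Fall back to all backends if everything is marked as draining.
    candidates = [b for b in backends if b not in draining] or backends
    digest = hashlib.sha256(repr(flow).encode()).digest()
    choice = candidates[int.from_bytes(digest[:8], "big") % len(candidates)]
    flow_table[flow] = choice  # remember the decision for later packets
    return choice
```

A flow already pinned to a draining backend stays there until it ends, so the backend empties gradually as old connections close instead of being reset all at once.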


9) Observability minimum set

Routing-plane telemetry

Traffic telemetry

Reliability telemetry

If you cannot answer “which route action hurt which users?”, your control loop is under-instrumented.


10) Governance: what should be automated vs manual

Automate:

Manual approval:

Always keep:


11) Quick readiness checklist

Before relying on automatic anycast steering in production:


12) Bottom line

Reliable anycast operations are less about “faster withdrawals” and more about stable control loops:

That’s how you get low latency and low drama.


References

  1. RFC 4786 — Operation of Anycast Services
    https://www.rfc-editor.org/rfc/rfc4786

  2. RFC 7094 — Architectural Considerations of IP Anycast
    https://www.rfc-editor.org/rfc/rfc7094

  3. RFC 8326 — Graceful BGP Session Shutdown
    https://www.rfc-editor.org/rfc/rfc8326

  4. RFC 5880 — Bidirectional Forwarding Detection (BFD)
    https://www.rfc-editor.org/rfc/rfc5880

  5. RFC 1997 — BGP Communities Attribute
    https://www.rfc-editor.org/rfc/rfc1997

  6. RFC 8092 — BGP Large Communities Attribute
    https://www.rfc-editor.org/rfc/rfc8092

  7. RFC 2439 — BGP Route Flap Damping
    https://www.rfc-editor.org/rfc/rfc2439

  8. RFC 7196 — Making Route Flap Damping Usable
    https://www.rfc-editor.org/rfc/rfc7196

  9. RIPE-580 — RIPE Routing WG Recommendations on Route Flap Damping
    https://www.ripe.net/publications/docs/ripe-580/

  10. Google SRE Workbook — Managing Load (anycast/stabilized-anycast discussion)
    https://sre.google/workbook/managing-load/

  11. Cloudflare Docs — Troubleshoot routing and BGP (path hunting notes)
    https://developers.cloudflare.com/magic-transit/troubleshooting/routing-and-bgp/

  12. Cloudflare Docs — Safely withdraw a BYOIP prefix (same-length cutover method)
    https://developers.cloudflare.com/magic-transit/how-to/safely-withdraw-byoip-prefix/

  13. Google Research / NSDI’16 — Maglev: A Fast and Reliable Software Network Load Balancer
    https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/