Anycast Service Operations — BGP Health Signaling, Hysteresis, and Safe Cutover Playbook
Date: 2026-03-29
Category: knowledge
Audience: Network/SRE operators running global anycast services (DNS, edge API, L4/L7 ingress)
1) Why this matters
Anycast is powerful because one service IP can be announced from many sites, and routing naturally sends users to a topologically “closest” node. But “closest by BGP policy” is not always “best by latency/load/health”.
In practice, most outages are not caused by “anycast itself,” but by control-loop mismatch:
- app health fails in milliseconds,
- local failover reacts in sub-seconds,
- interdomain BGP convergence can take minutes,
- and human runbooks often assume all loops run at the same speed.
The result: traffic blackholes, path hunting, and noisy flap storms.
2) Mental model: three control loops, three timescales
Loop A — Local traffic control (fast)
- ECMP / local LB / service mesh
- reacts in ms–seconds
- best for host/pod/node failures
Loop B — Site-level admission (medium)
- “Should this PoP/site accept new traffic?”
- driven by synthetic checks + saturation SLOs
- reacts in seconds
Loop C — Global routing control (slow)
- BGP advertisement, prepends, community policy, withdrawals
- reacts in tens of seconds to minutes on the Internet
Rule: Do not use Loop C for events that Loop A/B can absorb.
3) Failure taxonomy (what to do first)
Type 1: Host/pod failures inside a healthy site
- Keep prefix announced.
- Let local load balancers drain/replace backends.
- Avoid route changes.
Type 2: Site brownout (high error rate, but not hard down)
- Keep site online but degrade preference (prepend / provider-community policy).
- Prefer traffic shift over hard withdrawal.
Type 3: Site hard failure or severe packet loss
- Withdraw or strongly de-preference site announcement.
- If available, trigger fast underlay detection (BFD) and automate policy fallback.
Type 4: Planned maintenance
- Use graceful draining (e.g., GRACEFUL_SHUTDOWN signaling patterns) before teardown.
- Never jump directly to hard withdrawal when avoidable.
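The taxonomy above is essentially a dispatch table. A minimal sketch in Python; the action strings are shorthand for the playbook steps, and `first_action` is an illustrative name, not a real API:

```python
# Illustrative mapping from failure type (1-4) to the least-disruptive
# first move; strings summarize the playbook steps above.
FIRST_ACTION = {
    1: "keep prefix announced; let local LB drain/replace backends",
    2: "de-preference (prepend / provider community); avoid withdrawal",
    3: "withdraw or strongly de-preference; rely on BFD-triggered fallback",
    4: "graceful drain first (GRACEFUL_SHUTDOWN pattern), then teardown",
}

def first_action(failure_type: int) -> str:
    """Return the recommended first response for a Type 1-4 event."""
    return FIRST_ACTION[failure_type]
```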
4) Core design principles
Health ≠ reachability
A BGP session can be up while the service is unhealthy. Route export must be tied to service health, not only peering state.
Prefer de-preference before withdrawal
De-preference preserves alternate-path visibility and reduces path-hunting turbulence.
Use hysteresis everywhere
Independent enter/exit thresholds plus hold-down timers prevent oscillation.
One controlling signal per decision
Avoid mixing too many uncoordinated gates (NMS alarm + app check + manual override all racing each other).
Treat Internet convergence as asynchronous and lossy
Assume some networks converge late; design drain windows accordingly.
5) Recommended policy ladder (least disruptive → most disruptive)
Stage 0: Normal
- Full advertisement from all intended sites.
Stage 1: Soft shift
- Add prepends or lower LOCAL_PREF via agreed communities toward selected upstreams.
- Objective: reduce new inflow while preserving backup reachability.
Stage 2: Hard shift
- Apply stronger de-preference and/or selective no-export scopes.
- Maintain a minimal control path if possible.
Stage 3: Withdrawal
- Withdraw prefix from impaired site only after drain window and impact verification.
- Reserve for hard failure, safety risk, or completed migration.
This staged approach usually beats “panic-withdraw first.”
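The ladder can be encoded so automation can only move one rung at a time. A sketch, assuming the four stages above; `escalate`/`deescalate` are hypothetical helper names:

```python
from enum import IntEnum

class Stage(IntEnum):
    NORMAL = 0      # full advertisement from all intended sites
    SOFT_SHIFT = 1  # prepends / lower LOCAL_PREF via agreed communities
    HARD_SHIFT = 2  # stronger de-preference, selective no-export scopes
    WITHDRAW = 3    # withdraw only after drain window + verification

def escalate(current: Stage) -> Stage:
    """Move one rung at a time; never jump straight to withdrawal."""
    return Stage(min(current + 1, Stage.WITHDRAW))

def deescalate(current: Stage) -> Stage:
    """Step back toward normal advertisement."""
    return Stage(max(current - 1, Stage.NORMAL))
```

Forcing single-rung transitions in code is one way to make "panic-withdraw first" structurally impossible for automation.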
6) Practical anti-flap controls
Hysteresis template
- Enter-degraded threshold: e.g., error rate > 2% for 60s
- Exit-degraded threshold: error rate < 0.5% for 10m
- Minimum state duration: 5–15m
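The template above can be sketched as a small gate class (Python; the defaults mirror the example thresholds, and the class name is illustrative):

```python
class HysteresisGate:
    """Asymmetric enter/exit thresholds plus a minimum state duration.
    Defaults follow the template above: enter at >2% for 60s, exit at
    <0.5% for 10m, minimum 5m in any state."""
    def __init__(self, enter_err=0.02, enter_secs=60,
                 exit_err=0.005, exit_secs=600, min_state_secs=300):
        self.enter_err, self.enter_secs = enter_err, enter_secs
        self.exit_err, self.exit_secs = exit_err, exit_secs
        self.min_state_secs = min_state_secs
        self.degraded = False
        self.state_since = float("-inf")  # when the current state was entered
        self.cond_since = None            # when the flip condition began holding

    def observe(self, now: float, error_rate: float) -> bool:
        """Feed one sample; returns True while the site is degraded."""
        if not self.degraded:
            cond, need = error_rate > self.enter_err, self.enter_secs
        else:
            cond, need = error_rate < self.exit_err, self.exit_secs
        if not cond:
            self.cond_since = None        # condition broken; restart the clock
            return self.degraded
        if self.cond_since is None:
            self.cond_since = now
        held = now - self.cond_since
        # any flip also requires the minimum state duration (anti-oscillation)
        if held >= need and (now - self.state_since) >= self.min_state_secs:
            self.degraded = not self.degraded
            self.state_since, self.cond_since = now, None
        return self.degraded
```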
Change budget
- Cap automated BGP policy flips per prefix/site per hour.
- Exceeding budget forces manual approval.
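A sliding-window counter is enough to enforce the budget. A minimal sketch, assuming per-(prefix, site) keys; `ChangeBudget` is an illustrative name:

```python
from collections import deque

class ChangeBudget:
    """Cap automated BGP policy flips per key (e.g. prefix+site) per
    sliding window. allow() returning False means the budget is spent
    and the flip needs manual approval."""
    def __init__(self, max_flips=4, window_secs=3600):
        self.max_flips, self.window = max_flips, window_secs
        self.events: dict[str, deque] = {}

    def allow(self, key: str, now: float) -> bool:
        q = self.events.setdefault(key, deque())
        while q and now - q[0] >= self.window:
            q.popleft()                 # expire flips outside the window
        if len(q) >= self.max_flips:
            return False                # budget exceeded: escalate to a human
        q.append(now)
        return True
```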
Damp only the unstable edges, not everything
- Historic RFD defaults were too aggressive in many real networks.
- If damping is used, prefer modernized/recommended profiles and scope narrowly.
Maintenance guardrails
- Graceful drain first (community-based), wait, then teardown.
- Keep rollback policy pre-staged.
7) Safe cutover / withdrawal runbook (Internet-facing anycast)
Step 1 — Introduce same-length alternate path
Advertise the same prefix length from the destination site/provider first.
Step 2 — Wait for global convergence window
Allow 5–10 minutes (or your measured baseline) before withdrawing old path.
Step 3 — Verify catchment movement
Check per-AS / per-region traffic and error metrics; ensure target site is absorbing load safely.
Step 4 — Withdraw old path if needed
Execute withdrawal only after drain success criteria are met.
Step 5 — Post-check for stragglers
Monitor late-converging networks and temporary suboptimal paths.
Why this works: it avoids forcing the DFZ into frantic path search when no equivalent route is pre-positioned.
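The steps above can be sketched as a sequential driver. The three callables are operator-supplied placeholders (router API calls and a telemetry query), not a real library:

```python
import time

def safe_cutover(advertise_new, verify_catchment, withdraw_old,
                 convergence_secs=600):
    """Hypothetical driver for the runbook above. Aborts before withdrawal
    if the drain success criteria are not met."""
    advertise_new()                   # Step 1: pre-position same-length path
    time.sleep(convergence_secs)      # Step 2: global convergence window
    if not verify_catchment():        # Step 3: did the catchment move safely?
        raise RuntimeError("drain criteria not met; keeping old path")
    withdraw_old()                    # Step 4: withdraw only after verification
    # Step 5 (watching late-converging stragglers) stays on normal monitoring
```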
8) Stateful protocols and “stabilized anycast”
Anycast route movement can reset active TCP flows if packets suddenly land on a different frontend lacking flow state.
Mitigations:
- consistent hashing at edge layer,
- connection tracking replication where feasible,
- longer drain windows before BGP teardown,
- avoid frequent route flaps caused by noisy health checks.
Design goal: keep existing flows sticky while shifting new flows first.
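Consistent hashing at the edge is what keeps surviving flows sticky. A minimal sketch using rendezvous (highest-random-weight) hashing; frontend names and the flow key format are illustrative:

```python
import hashlib

def pick_frontend(flow_key: str, frontends: list[str]) -> str:
    """Rendezvous hashing: each flow scores every frontend and picks the
    highest. Removing one frontend only remaps the flows that were on it;
    all other flows keep their assignment (stickiness under churn)."""
    def score(fe: str) -> str:
        return hashlib.sha256(f"{flow_key}|{fe}".encode()).hexdigest()
    return max(frontends, key=score)
```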
9) Observability minimum set
Routing-plane telemetry
- announce/withdraw counts by prefix/site
- BGP update rate and burstiness
- policy state transitions with reason codes
Traffic telemetry
- per-site ingress volume, SYN rate, success rate
- p50/p95 latency and retransmit indicators by source ASN/region
- catchment map deltas (who moved where)
Reliability telemetry
- route-change-to-user-impact correlation
- flap frequency per site/prefix
- mean recovery time by failure type (Type 1–4)
If you cannot answer “which route action hurt which users?”, your control loop is under-instrumented.
10) Governance: what should be automated vs manual
Automate:
- deterministic Stage 1 soft shifts with strict hysteresis
- clearly bounded Type 1/2 responses
Manual approval:
- repeated Stage 3 withdrawals in short interval
- multi-site simultaneous de-preference
- policy that changes transits/peers globally
Always keep:
- one-click rollback policy bundles
- human-readable change ledger (who/why/when)
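One way to keep the ledger both human-readable and machine-correlatable is a small record type per transition. A sketch with hypothetical field names:

```python
import time
from dataclasses import dataclass, field

@dataclass
class PolicyTransition:
    """One ledger entry per routing action, carrying a reason code so
    route changes can later be correlated with user impact."""
    prefix: str
    site: str
    from_stage: str
    to_stage: str
    reason: str     # e.g. "error_rate>2%/60s" or "planned-maintenance"
    actor: str      # "auto" or an operator identity
    ts: float = field(default_factory=time.time)
```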
11) Quick readiness checklist
Before relying on automatic anycast steering in production:
- Service health gates route export decisions (not BGP session alone)
- Soft-shift policy exists (prepend/community) before withdrawal path
- Hysteresis + hold-down + flip budget configured
- Safe same-length cutover runbook tested in game day
- Planned maintenance uses graceful drain workflow
- Catchment and per-AS impact observability is live
- Rollback tested within a single on-call shift
12) Bottom line
Reliable anycast operations are less about “faster withdrawals” and more about stable control loops:
- absorb locally first,
- shift globally in stages,
- withdraw last,
- and instrument every transition.
That’s how you get low latency and low drama.
References
RFC 4786 — Operation of Anycast Services
https://www.rfc-editor.org/rfc/rfc4786
RFC 7094 — Architectural Considerations of IP Anycast
https://www.rfc-editor.org/rfc/rfc7094
RFC 8326 — Graceful BGP Session Shutdown
https://www.rfc-editor.org/rfc/rfc8326
RFC 5880 — Bidirectional Forwarding Detection (BFD)
https://www.rfc-editor.org/rfc/rfc5880
RFC 1997 — BGP Communities Attribute
https://www.rfc-editor.org/rfc/rfc1997
RFC 8092 — BGP Large Communities Attribute
https://www.rfc-editor.org/rfc/rfc8092
RFC 2439 — BGP Route Flap Damping
https://www.rfc-editor.org/rfc/rfc2439
RFC 7196 — Making Route Flap Damping Usable
https://www.rfc-editor.org/rfc/rfc7196
RIPE-580 — RIPE Routing WG Recommendations on Route Flap Damping
https://www.ripe.net/publications/docs/ripe-580/
Google SRE Workbook — Managing Load (anycast/stabilized-anycast discussion)
https://sre.google/workbook/managing-load/
Cloudflare Docs — Troubleshoot routing and BGP (path hunting notes)
https://developers.cloudflare.com/magic-transit/troubleshooting/routing-and-bgp/
Cloudflare Docs — Safely withdraw a BYOIP prefix (same-length cutover method)
https://developers.cloudflare.com/magic-transit/how-to/safely-withdraw-byoip-prefix/
Google Research / NSDI’16 — Maglev: A Fast and Reliable Software Network Load Balancer
https://research.google/pubs/maglev-a-fast-and-reliable-software-network-load-balancer/