BGP FlowSpec DDoS Mitigation Operations Playbook
How to use FlowSpec as a precise mitigation tool without turning your control plane into a self-inflicted outage.
Why this matters
For volumetric attacks, teams usually start with destination blackholing (RTBH). It is simple and fast, but coarse.
FlowSpec gives you L3/L4 selective filtering (prefix, protocol, ports, flags, packet length, fragments, DSCP), so you can often keep legitimate traffic alive while suppressing attack flows.
But FlowSpec is dangerous when run without guardrails:
- bad or over-broad rules can drop legitimate traffic,
- broken automation can flood routers with updates,
- platform limits (TCAM/ACL/policer) can be hit at the worst possible moment.
The goal is not just “can we push rules?” but “can we push rules safely, repeatedly, and under stress?”
1) Mental model: RTBH first-aid, FlowSpec surgery
Use both, with clear intent:
- RTBH (RFC 7999 BLACKHOLE): fast blast-radius containment when precision is unavailable.
- FlowSpec (RFC 8955/8956): targeted filtering/rate-limiting/redirect actions when attack signature quality is good.
Practical policy:
- Start with a pre-approved coarse control if the link is melting.
- Move to narrower FlowSpec rules once signature confidence rises.
- Remove coarse controls quickly after precision rules stabilize.
2) Baseline architecture (minimal safe design)
Use a 3-plane layout:
- Detection plane
  - telemetry + anomaly detector generates candidate signatures
- Policy/controller plane
  - normalizes rules, enforces validation/risk checks, rate-limits announcements
- Enforcement plane
  - routers/switches receiving FlowSpec and applying hardware/software filters
Critical design point: treat the controller as a policy compiler, not a dumb relay.
3) Validation and trust model (non-negotiable)
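RFC 8955 requires feasibility checks for each FlowSpec rule against unicast reachability (a destination prefix component must be present, the rule's originator must match the originator of the best-match unicast route, and no more-specific unicast route may come from a different neighboring AS), plus revalidation whenever the best unicast path changes.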
Operationally:
- enforce strict inbound policy for FlowSpec AFI/SAFI peers,
- filter allowed actions from third parties (especially redirect/marking),
- require destination-prefix anchoring unless explicitly justified,
- continuously revalidate on route churn.
RFC 9117 relaxes parts of validation for practical topologies (for example, centralized controllers inside the same local domain and route-server realities), but relaxation should be explicitly scoped and policy-guarded, not globally permissive.
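A minimal sketch of the RFC 8955 baseline feasibility test, for illustration only. `UnicastRoute`, `is_feasible`, and the flat `rib` list are hypothetical names, the RIB is assumed to hold only the best path per prefix, and no RFC 9117 relaxation is modelled.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UnicastRoute:
    """Hypothetical best-path entry; not a real router data structure."""
    prefix: str        # e.g. "203.0.113.0/24"
    originator: str    # who originated the route
    neighbor_as: int   # neighboring AS the route was received from


def is_feasible(flow_dst: Optional[str], flow_originator: str,
                rib: Sequence[UnicastRoute]) -> bool:
    # (a) A destination prefix component must be present.
    if flow_dst is None:
        return False
    dst = ipaddress.ip_network(flow_dst)
    routes = [(ipaddress.ip_network(r.prefix), r) for r in rib]
    routes = [(n, r) for n, r in routes if n.version == dst.version]

    # Best-match unicast route: longest prefix covering the flow destination.
    covering = [(n, r) for n, r in routes if dst.subnet_of(n)]
    if not covering:
        return False
    _, best = max(covering, key=lambda item: item[0].prefixlen)

    # (b) The FlowSpec originator must match the best-match route's originator.
    if best.originator != flow_originator:
        return False

    # (c) No more-specific unicast route from a different neighboring AS.
    for net, route in routes:
        if net != dst and net.subnet_of(dst) and route.neighbor_as != best.neighbor_as:
            return False
    return True
```

In a controller, a check like this would be re-run whenever the best unicast path for the destination changes, and rules that stop passing would be withdrawn.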
4) Rule authoring guardrails (what prevents disasters)
A) Match specificity floor
Never allow first-push rules that are too broad.
Examples of safer defaults:
- destination prefix must be /24 or longer (IPv4) or /48 or longer (IPv6), i.e. no broader than those, unless an emergency override applies,
- include protocol + at least one L4 discriminator (port, flags, fragment, packet length),
- disallow "any-any" style catch-all FlowSpec patterns from automation.
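The defaults above can be sketched as a pre-push check. This is an illustrative sketch, not a reference implementation: `CandidateRule`, its field names, and `specificity_violations` are hypothetical, and the /24 and /48 floors are the example values from the list.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional

# Minimum prefix length per IP version (example values from the list above).
MIN_PREFIX_LEN = {4: 24, 6: 48}


@dataclass
class CandidateRule:
    """Hypothetical normalized rule as seen by the policy compiler."""
    dst_prefix: Optional[str] = None
    protocol: Optional[int] = None
    dst_port: Optional[int] = None
    tcp_flags: Optional[str] = None
    fragment: Optional[str] = None
    packet_length: Optional[str] = None


def specificity_violations(rule: CandidateRule,
                           emergency_override: bool = False) -> List[str]:
    """Return human-readable reasons to reject; an empty list means OK."""
    problems = []
    if rule.dst_prefix is None:
        problems.append("missing destination prefix")
    else:
        net = ipaddress.ip_network(rule.dst_prefix, strict=False)
        floor = MIN_PREFIX_LEN[net.version]
        # The override only relaxes the prefix floor, nothing else.
        if net.prefixlen < floor and not emergency_override:
            problems.append(f"prefix /{net.prefixlen} broader than /{floor}")
    if rule.protocol is None:
        problems.append("missing protocol")
    l4 = (rule.dst_port, rule.tcp_flags, rule.fragment, rule.packet_length)
    if all(value is None for value in l4):
        problems.append("missing L4 discriminator")
    return problems
```

Returning the list of reasons rather than a bare boolean keeps rejections auditable, which matters when automation is the one submitting rules.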
B) Action allowlist by trust tier
- Customer/partner submitted rules: usually drop / bounded rate-limit only.
- Internal controller rules: may use broader actions, but with stronger approval and audit.
- Redirect/marking actions: require explicit policy domain and route-target constraints.
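One way to express the tiers is a plain lookup table. The sketch below is an assumption-laden illustration: the tier names, action names, and the `route_target` parameter are hypothetical, and the rate bound for customer rate-limits is left out.

```python
from typing import Optional

# Hypothetical tier -> permitted action names.
ALLOWED_ACTIONS = {
    "customer": {"drop", "rate-limit"},
    "internal": {"drop", "rate-limit", "redirect", "mark"},
}
# Actions that must carry an explicit scope (here modelled as a route target).
SCOPED_ACTIONS = {"redirect", "mark"}


def action_permitted(tier: str, action: str,
                     route_target: Optional[str] = None) -> bool:
    # Unknown tiers get an empty allowlist, so they are denied by default.
    if action not in ALLOWED_ACTIONS.get(tier, set()):
        return False
    if action in SCOPED_ACTIONS and not route_target:
        return False
    return True
```

Default-deny for unknown tiers means a misconfigured peer fails closed rather than inheriting internal privileges.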
C) TTL + auto-expiry
Every rule gets a hard expiry (for example, 10–60 minutes unless renewed by fresh evidence).
If your detector dies, stale mitigation must not live forever.
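A minimal sketch of hard expiry with renewal, assuming the controller keeps its own deadline table. `RuleTable` and its methods are hypothetical names; the clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Dict, List, Optional


class RuleTable:
    """Tracks expiry deadlines for active rules (illustrative only)."""

    def __init__(self, default_ttl_s: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.default_ttl_s = default_ttl_s
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def install(self, rule_id: str, ttl_s: Optional[float] = None) -> None:
        ttl = self.default_ttl_s if ttl_s is None else ttl_s
        self._expires_at[rule_id] = self._clock() + ttl

    def renew(self, rule_id: str, ttl_s: Optional[float] = None) -> bool:
        """Extend a rule only if it is still active; call on fresh evidence."""
        if rule_id not in self._expires_at:
            return False
        self.install(rule_id, ttl_s)
        return True

    def sweep(self) -> List[str]:
        """Remove and return rule ids whose deadline has passed."""
        now = self._clock()
        expired = sorted(r for r, t in self._expires_at.items() if t <= now)
        for rule_id in expired:
            del self._expires_at[rule_id]
        return expired
```

The ids returned by `sweep()` are the rules to withdraw. Because renewal has to be triggered by the detector, a dead detector means rules simply age out.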
D) Dry-run / shadow evaluation
Before activating, run candidate rules against sampled flow logs/pcaps to estimate:
- expected malicious hit-rate,
- expected collateral hit-rate,
- overlap with existing controls.
Promote only if collateral estimate is within policy budget.
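The promotion test can be sketched as follows, assuming flow samples already carry a malicious/legitimate label from the detector. `FlowSample`, `shadow_evaluate`, and the 1% default budget are hypothetical; overlap with existing controls is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class FlowSample:
    """Hypothetical sampled flow record."""
    proto: int
    dst_port: int
    packets: int
    malicious: bool   # label supplied by the detector (an assumption)


def shadow_evaluate(matches: Callable[[FlowSample], bool],
                    samples: Sequence[FlowSample],
                    collateral_budget: float = 0.01) -> Dict[str, object]:
    bad_total = sum(s.packets for s in samples if s.malicious)
    good_total = sum(s.packets for s in samples if not s.malicious)
    bad_hit = sum(s.packets for s in samples if s.malicious and matches(s))
    good_hit = sum(s.packets for s in samples if not s.malicious and matches(s))
    # Rates are packet-weighted; an empty class counts as a 0.0 hit-rate.
    malicious_rate = bad_hit / bad_total if bad_total else 0.0
    collateral_rate = good_hit / good_total if good_total else 0.0
    return {
        "malicious_hit_rate": malicious_rate,
        "collateral_hit_rate": collateral_rate,
        "promote": collateral_rate <= collateral_budget,
    }
```

A candidate rule is passed in as a predicate, for example `lambda s: s.proto == 17 and s.dst_port == 53` for a UDP/53 match.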
5) Capacity engineering: rule budgets are part of DDoS defense
RFC 8955 security considerations explicitly call out device/event capacity limits.
Operate with hard budgets:
- max active rules per device/class,
- max update rate (announcements/withdraws per second),
- max concurrent new rules per incident,
- pre-reserved ACL/TCAM headroom for emergency controls.
If budget would be exceeded, degrade gracefully:
- collapse low-value specific rules into a coarser temporary control,
- prioritize top-impact signatures,
- shed non-critical actions first.
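A sketch of two of the budgets above, an active-rule cap and an update-rate cap, using a token bucket. `UpdateBudget` and its parameters are hypothetical; an emergency kill-switch would need a path that bypasses the rate limit, which is not shown.

```python
import time
from typing import Callable


class UpdateBudget:
    """Caps active rules and announce/withdraw rate (illustrative only)."""

    def __init__(self, max_active: int, updates_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_active = max_active
        self.rate = updates_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.active = 0
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for elapsed time, never exceeding the burst size.
        now = self._clock()
        self.tokens = min(self.burst, self.tokens + (now - self._last) * self.rate)
        self._last = now

    def try_announce(self) -> bool:
        self._refill()
        if self.active >= self.max_active or self.tokens < 1:
            return False
        self.tokens -= 1
        self.active += 1
        return True

    def try_withdraw(self) -> bool:
        self._refill()
        if self.active == 0 or self.tokens < 1:
            return False
        self.tokens -= 1
        self.active -= 1
        return True
```

When `try_announce()` returns False, the controller would fall back to the degradation steps listed above instead of queueing more updates.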
6) Vendor heterogeneity strategy
Real deployments are not semantically identical across vendors.
Observed ecosystem reality (see APNIC operational report):
- uneven support for fragment bits / TCP flags,
- different rule composition constraints,
- different policing granularity,
- different observability quality for dropped traffic.
So the controller must compile per device profile:
- capability matrix per platform/OS version,
- per-vendor rule simplification/fallback,
- pre-deploy compatibility tests for common attack templates.
Do not assume “accepted by BGP” means “enforced as intended”.
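Per-profile compilation might look like the sketch below. The profile names, the `CAPABILITIES` table, and the fallback policy (refuse if simplification would remove every L4 discriminator) are all assumptions for illustration; real capability data has to come from lab testing per platform and OS version.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical capability matrix: profile -> supported match components.
CAPABILITIES = {
    "vendor-a-os1": {"dst_prefix", "protocol", "dst_port",
                     "tcp_flags", "fragment", "packet_length"},
    "vendor-b-os2": {"dst_prefix", "protocol", "dst_port"},
}
L4_COMPONENTS = {"dst_port", "tcp_flags", "fragment", "packet_length"}


def compile_for(profile: str, rule: Dict[str, object]
                ) -> Tuple[str, Optional[Dict[str, object]], List[str]]:
    """Return (status, compiled_rule, dropped_components)."""
    supported = CAPABILITIES[profile]
    kept = {k: v for k, v in rule.items() if k in supported}
    dropped = sorted(k for k in rule if k not in supported)
    if not dropped:
        return "exact", kept, dropped
    # Dropping a component broadens the match; refuse if no L4 field survives.
    if not L4_COMPONENTS.intersection(kept):
        return "unsupported", None, dropped
    return "simplified", kept, dropped
```

A "simplified" result matches more traffic than the original rule, so it should go back through shadow evaluation before it is announced to that device class.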
7) Observability: metrics that actually matter
Minimum dashboards:
- rule lifecycle counts: candidate / active / expiring / failed,
- push latency: detected -> compiled -> advertised -> enforced,
- router health: CPU, memory, FIB/ACL utilization, control-plane drops,
- mitigation effectiveness: pps/bps reduction per rule,
- collateral indicators: SYN success, DNS success, HTTP 2xx/4xx/5xx shifts,
- churn indicators: announce/withdraw rate, rule flapping,
- route-validation failures and policy rejects.
Key alerts:
- sudden spike in new rules or withdraws,
- active-rule budget > 80%,
- enforcement mismatch (controller says active, device says unsupported/partial),
- user traffic KPI degradation after rule activation.
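Two of these alerts reduce to simple comparisons. The sketch below assumes the controller can collect a per-device status for each rule; the status strings, function names, and data shapes are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def enforcement_mismatches(intended: Dict[str, Set[str]],
                           reported: Dict[str, Dict[str, str]]
                           ) -> List[Tuple[str, str, str]]:
    """List (device, rule_id, status) where a rule is not fully enforced."""
    mismatches = []
    for device in sorted(intended):
        statuses = reported.get(device, {})
        for rule_id in sorted(intended[device]):
            # A rule the device never mentions is treated as "missing".
            status = statuses.get(rule_id, "missing")
            if status != "enforced":
                mismatches.append((device, rule_id, status))
    return mismatches


def budget_alert(active_rules: int, max_rules: int,
                 threshold: float = 0.8) -> bool:
    """True when active rules exceed the given fraction of the budget."""
    return active_rules > threshold * max_rules
```

Treating a silent device as a mismatch is deliberate: absence of a report is not evidence of enforcement.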
8) Safe incident workflow (runbook)
- Classify attack mode (volumetric flood vs protocol-specific abuse).
- Pick initial control:
  - RTBH for immediate containment if links saturate,
  - direct FlowSpec if signature confidence is already high.
- Compile candidate FlowSpec with policy checks.
- Shadow evaluate against recent telemetry.
- Canary announce to limited edge scope.
- Observe for 1–3 minutes (effectiveness + collateral).
- Progressive rollout to full scope.
- Auto-expire + review after attack decay.
- Post-incident cleanup: withdraw stale rules, archive metrics, update templates.
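The canary, observe, and rollout steps can be sketched as one function. This is a simplified illustration: `staged_rollout` and its callback parameters are hypothetical, the callbacks stand in for whatever the controller uses to announce, withdraw, and judge health, and the 120-second default is just one point in the 1–3 minute window.

```python
import time
from typing import Callable, Set


def staged_rollout(rule_id: str, canary: Set[str], full_scope: Set[str],
                   announce: Callable[[str, Set[str]], None],
                   withdraw: Callable[[str, Set[str]], None],
                   healthy: Callable[[str], bool],
                   observe_s: float = 120.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    # Canary: announce to a limited set of edge devices first.
    announce(rule_id, canary)
    # Observation window for effectiveness and collateral signals.
    sleep(observe_s)
    if not healthy(rule_id):
        withdraw(rule_id, canary)
        return "rolled_back"
    # Progressive rollout: only the devices not already covered by the canary.
    remaining = full_scope - canary
    if remaining:
        announce(rule_id, remaining)
    return "active"
```

The `healthy` callback is where the collateral indicators belong: SYN success, DNS success, and HTTP status shifts measured after the canary goes live.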
9) Failure lessons to institutionalize
Cloudflare's 2013 outage (see References) showed that a single bad FlowSpec filter can trigger network-wide router instability when distributed broadly.
Actionable lessons:
- include syntax acceptance tests + semantic sanity checks + platform safety checks before global propagation,
- prefer staged rollouts over immediate global fanout,
- keep emergency kill-switch to withdraw all dynamic mitigation rules quickly.
10) Compact operator checklist
- FlowSpec inbound peers are explicitly allowlisted with strict policy
- Validation behavior (RFC 8955 baseline + any RFC 9117 relaxations) is documented and audited
- Rule compiler enforces specificity floor + action allowlist + TTL
- Rule budgets (count/churn/headroom) are configured and alerted
- Vendor capability matrix is maintained per platform/version
- Canary rollout and global kill-switch are tested periodically
- Post-incident review captures collateral metrics, not only dropped bps/pps
References
- RFC 8955: Dissemination of Flow Specification Rules
https://datatracker.ietf.org/doc/html/rfc8955
- RFC 8956: Dissemination of Flow Specification Rules for IPv6
https://datatracker.ietf.org/doc/html/rfc8956
- RFC 9117: Revised Validation Procedure for BGP Flow Specifications
https://datatracker.ietf.org/doc/html/rfc9117
- RFC 7999: BLACKHOLE Community
https://datatracker.ietf.org/doc/html/rfc7999
- APNIC Blog (P. Odintsov, 2024), "The ultimate weapon against DDoS: BGP Flowspec"
https://blog.apnic.net/2024/09/18/the-ultimate-weapon-against-ddos-bgp-flowspec/
- Cloudflare (2013), "Today’s Outage Post Mortem" (FlowSpec-related router failure context)
https://blog.cloudflare.com/todays-outage-post-mortem-82515/