BGP FlowSpec DDoS Mitigation Operations Playbook

How to use FlowSpec as a precise mitigation tool without turning your control plane into a self-inflicted outage.

Why this matters

For volumetric attacks, teams usually start with destination blackholing (RTBH). It is simple and fast, but coarse.

FlowSpec gives you L3/L4 selective filtering (prefix, protocol, ports, flags, packet length, fragments, DSCP), so you can often keep legitimate traffic alive while suppressing attack flows.

But FlowSpec is dangerous when run without guardrails:

bad or over-broad rules can drop legitimate traffic,
broken automation can flood routers with updates,
platform limits (TCAM/ACL/policer) can be hit at the worst possible moment.

The goal is not just “can we push rules?” but “can we push rules safely, repeatedly, and under stress?”

1) Mental model: RTBH first-aid, FlowSpec surgery

Use both, with clear intent:

RTBH (RFC 7999 BLACKHOLE): fast blast-radius containment when precision is unavailable.
FlowSpec (RFC 8955/8956): targeted filtering/rate-limiting/redirect actions when attack signature quality is good.

Practical policy:

Start with a pre-approved coarse control if a link is melting.
Move to narrower FlowSpec rules once signature confidence rises.
Remove coarse controls quickly after precision rules stabilize.
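
The escalation policy above can be sketched as a small decision function. This is a minimal illustration, not a prescribed implementation: the name `next_actions`, both thresholds, and the `flowspec_state` labels are hypothetical.

```python
# Hypothetical thresholds; tune to your own links and detector.
SATURATION_THRESHOLD = 0.90   # link utilization treated as "melting"
CONFIDENCE_THRESHOLD = 0.80   # signature confidence needed for FlowSpec


def next_actions(link_utilization, signature_confidence, rtbh_active, flowspec_state):
    """Return the control actions to take next.

    flowspec_state is one of "none", "active", "stable" (assumed labels).
    """
    if flowspec_state == "none":
        if signature_confidence >= CONFIDENCE_THRESHOLD:
            return ["announce_flowspec"]      # precision is available
        if link_utilization >= SATURATION_THRESHOLD and not rtbh_active:
            return ["announce_rtbh"]          # coarse first aid
        return []
    if flowspec_state == "stable" and rtbh_active:
        return ["withdraw_rtbh"]              # remove the coarse control
    return []
```

Keeping the decision in one pure function makes the policy easy to review and to replay against past incidents.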

2) Baseline architecture (minimal safe design)

Use a 3-plane layout:

Detection plane
- telemetry + anomaly detector generates candidate signatures
Policy/controller plane
- normalizes rules, enforces validation/risk checks, rate-limits announcements
Enforcement plane
- routers/switches receiving FlowSpec and applying hardware/software filters

Critical design point: treat the controller as a policy compiler, not a dumb relay.

3) Validation and trust model (non-negotiable)

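RFC 8955 requires each FlowSpec route to pass feasibility checks against unicast reachability: a destination prefix component must be present, the FlowSpec originator must match the originator of the best-match unicast route for that prefix, and no more-specific unicast route may have been received from a different neighboring AS. Routes must be revalidated when the best unicast path changes.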

Operationally:

enforce strict inbound policy for FlowSpec AFI/SAFI peers,
filter allowed actions from third parties (especially redirect/marking),
require destination-prefix anchoring unless explicitly justified,
continuously revalidate on route churn.
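
A simplified sketch of the RFC 8955 feasibility logic follows. It is an illustration under assumptions, not router code: `UnicastRoute`, `best_match`, and `is_feasible` are hypothetical names, the originator is reduced to a plain string, and the AS_PATH checks and RFC 9117 relaxations are left out.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class UnicastRoute:
    prefix: str        # e.g. "203.0.113.0/24"
    originator: str    # simplified originator identity (assumption)
    neighbor_as: int   # AS the route was received from


def best_match(dest, routes: List[UnicastRoute]) -> Optional[UnicastRoute]:
    """Longest-prefix unicast route covering dest, or None."""
    covering = [r for r in routes if dest.subnet_of(ipaddress.ip_network(r.prefix))]
    return max(covering, key=lambda r: ipaddress.ip_network(r.prefix).prefixlen,
               default=None)


def is_feasible(flow_dest: Optional[str], flow_originator: str,
                routes: List[UnicastRoute]) -> bool:
    if flow_dest is None:                       # destination component is required
        return False
    dest = ipaddress.ip_network(flow_dest)
    # subnet_of() raises on mixed IP versions, so compare within one family.
    family = [r for r in routes if ipaddress.ip_network(r.prefix).version == dest.version]
    best = best_match(dest, family)
    if best is None or best.originator != flow_originator:
        return False                            # originator must match best route
    for r in family:
        net = ipaddress.ip_network(r.prefix)
        more_specific = net != dest and net.subnet_of(dest)
        if more_specific and r.neighbor_as != best.neighbor_as:
            return False                        # someone else owns part of the space
    return True
```

Re-run the check for affected rules whenever the unicast table changes; a rule that was feasible at announcement time may not stay feasible.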

RFC 9117 relaxes parts of validation for practical topologies (for example, centralized controllers inside the same local domain and route-server realities), but relaxation should be explicitly scoped and policy-guarded, not globally permissive.

4) Rule authoring guardrails (what prevents disasters)

A) Match specificity floor

Never allow first-push rules that are too broad.

Examples of safer defaults:

destination prefix must be /24 or more specific (IPv4), or /48 or more specific (IPv6), unless emergency override,
include protocol + at least one L4 discriminator (port, flags, fragment, packet length),
disallow "any-any" style catch-all FlowSpec patterns from automation.
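
A minimal sketch of these checks as a validator, reading the prefix floor as "/24 or more specific" for IPv4 and "/48 or more specific" for IPv6. `CandidateRule` and `specificity_violations` are hypothetical names, and the rule shape is an assumption.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Minimum prefix length per IP version (shorter means broader, so rejected).
MIN_PREFIXLEN = {4: 24, 6: 48}


@dataclass
class CandidateRule:
    dest_prefix: Optional[str] = None      # e.g. "198.51.100.0/24"
    protocol: Optional[int] = None         # IP protocol number, e.g. 17 for UDP
    l4: Dict[str, object] = field(default_factory=dict)  # port, flags, fragment, length


def specificity_violations(rule: CandidateRule,
                           emergency_override: bool = False) -> List[str]:
    """Return a list of reasons the rule is too broad (empty means acceptable)."""
    problems = []
    if rule.dest_prefix is None:
        problems.append("no destination prefix (catch-all)")
    else:
        net = ipaddress.ip_network(rule.dest_prefix)
        floor = MIN_PREFIXLEN[net.version]
        # The override only relaxes the prefix floor, nothing else.
        if net.prefixlen < floor and not emergency_override:
            problems.append(f"prefix /{net.prefixlen} is broader than /{floor}")
    if rule.protocol is None:
        problems.append("no protocol")
    if not rule.l4:
        problems.append("no L4 discriminator")
    return problems
```

Returning reasons rather than a boolean gives operators something to act on when automation rejects a rule mid-incident.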

B) Action allowlist by trust tier

Customer/partner submitted rules: usually drop / bounded rate-limit only.
Internal controller rules: may use broader actions, but with stronger approval and audit.
Redirect/marking actions: require explicit policy domain and route-target constraints.
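
One way to express the tiers is a static allowlist plus per-action parameter checks. The sketch below is illustrative only: the tier names, action labels, rate bounds, and the `policy_domain` / `route_target` parameter keys are all assumptions.

```python
from typing import Dict, FrozenSet

ALLOWED_ACTIONS: Dict[str, FrozenSet[str]] = {
    "customer": frozenset({"drop", "rate_limit"}),
    "internal": frozenset({"drop", "rate_limit", "redirect", "mark"}),
}
CUSTOMER_RATE_BOUNDS_BPS = (1_000_000, 10_000_000_000)  # hypothetical bounds


def action_allowed(tier, action, params,
                   allowed_domains=frozenset(), allowed_route_targets=frozenset()):
    """Return True if this tier may use this action with these parameters."""
    if action not in ALLOWED_ACTIONS.get(tier, frozenset()):
        return False                       # unknown tier or action not on the list
    if tier == "customer" and action == "rate_limit":
        low, high = CUSTOMER_RATE_BOUNDS_BPS
        bps = params.get("bps")
        return bps is not None and low <= bps <= high
    if action in ("redirect", "mark"):
        # Redirect and marking need an explicitly approved policy domain.
        if params.get("policy_domain") not in allowed_domains:
            return False
        # Redirect additionally needs an approved route target.
        if action == "redirect" and params.get("route_target") not in allowed_route_targets:
            return False
    return True
```

The default is deny: anything not named in the table, including a tier nobody defined, is rejected.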

C) TTL + auto-expiry

Every rule gets a hard expiry (for example, 10–60 minutes unless renewed by fresh evidence).

If your detector dies, stale mitigation must not live forever.
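
A minimal sketch of hard expiry with renewal, assuming the 10 to 60 minute window from the example above. `ActiveRule`, `activate_or_renew`, and `sweep_expired` are hypothetical names; time is passed in explicitly so the logic stays testable.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_TTL_S = 10 * 60   # assumed lower bound from the example window
MAX_TTL_S = 60 * 60   # assumed upper bound from the example window


@dataclass
class ActiveRule:
    rule_id: str
    expires_at: float   # absolute time in seconds


def clamp_ttl(ttl_s: float) -> float:
    """Force any requested TTL into the allowed window."""
    return max(MIN_TTL_S, min(MAX_TTL_S, ttl_s))


def activate_or_renew(rules: Dict[str, ActiveRule], rule_id: str,
                      now: float, ttl_s: float) -> None:
    """Called only when fresh evidence supports the rule."""
    rules[rule_id] = ActiveRule(rule_id, now + clamp_ttl(ttl_s))


def sweep_expired(rules: Dict[str, ActiveRule], now: float) -> List[str]:
    """Remove expired rules and return their ids so they can be withdrawn."""
    expired = [rid for rid, rule in rules.items() if rule.expires_at <= now]
    for rid in expired:
        del rules[rid]
    return expired
```

The sweep should run on a timer that does not depend on the detector, so that a dead detector still leads to withdrawal.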

D) Dry-run / shadow evaluation

Before activating, run candidate rules against sampled flow logs/pcaps to estimate:

expected malicious hit-rate,
expected collateral hit-rate,
overlap with existing controls.

Promote only if collateral estimate is within policy budget.
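
The three estimates can be computed from labeled flow samples roughly as follows. This is a sketch under assumptions: `FlowSample` and its fields are hypothetical, the malicious/legitimate label is taken from the detector, and rates are weighted by packet count.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass(frozen=True)
class FlowSample:
    protocol: int
    dst_port: int
    packets: int
    labeled_malicious: bool   # label from the detector; assumed available


def _rate(hit: int, total: int) -> float:
    return hit / total if total else 0.0


def shadow_evaluate(match: Callable[[FlowSample], bool],
                    samples: Iterable[FlowSample],
                    existing_controls: Sequence[Callable[[FlowSample], bool]] = ()
                    ) -> Dict[str, float]:
    """Estimate what the candidate rule would hit, without activating it."""
    samples = list(samples)
    mal_total = sum(s.packets for s in samples if s.labeled_malicious)
    legit_total = sum(s.packets for s in samples if not s.labeled_malicious)
    hits = [s for s in samples if match(s)]
    mal_hit = sum(s.packets for s in hits if s.labeled_malicious)
    legit_hit = sum(s.packets for s in hits if not s.labeled_malicious)
    # Share of matched packets already covered by some existing control.
    overlap = sum(s.packets for s in hits if any(c(s) for c in existing_controls))
    return {
        "malicious_hit_rate": _rate(mal_hit, mal_total),
        "collateral_hit_rate": _rate(legit_hit, legit_total),
        "overlap_rate": _rate(overlap, mal_hit + legit_hit),
    }


def should_promote(result: Dict[str, float], collateral_budget: float) -> bool:
    return result["collateral_hit_rate"] <= collateral_budget
```

The estimate is only as good as the sample and its labels, so treat a passing result as permission to canary, not as proof of safety.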

5) Capacity engineering: rule budgets are part of DDoS defense

RFC 8955 security considerations explicitly call out device/event capacity limits.

Operate with hard budgets:

max active rules per device/class,
max update rate (announcements/withdraws per second),
max concurrent new rules per incident,
pre-reserved ACL/TCAM headroom for emergency controls.

If budget would be exceeded, degrade gracefully:

collapse low-value specific rules into a coarser temporary control,
prioritize top-impact signatures,
shed non-critical actions first.
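
A minimal sketch of budget enforcement: rule admission ranked by impact, plus a token-bucket limit on update rate. All names (`DeviceBudget`, `Candidate`, `admit`, `UpdateRateLimiter`) and the use of estimated bps as the ranking key are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeviceBudget:
    max_active_rules: int
    emergency_headroom: int    # slots reserved for emergency controls


@dataclass(frozen=True)
class Candidate:
    rule_id: str
    impact_bps: float          # estimated attack traffic the rule would remove


def admit(candidates: List[Candidate], active_count: int,
          budget: DeviceBudget) -> Tuple[List[Candidate], List[Candidate]]:
    """Admit top-impact rules into the free slots; return (admitted, shed)."""
    free = max(0, budget.max_active_rules - budget.emergency_headroom - active_count)
    ranked = sorted(candidates, key=lambda c: c.impact_bps, reverse=True)
    return ranked[:free], ranked[free:]


class UpdateRateLimiter:
    """Token bucket over announcements and withdrawals."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.last is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rules returned in the shed list are the candidates for collapsing into a coarser temporary control.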

6) Vendor heterogeneity strategy

Real deployments are not semantically identical across vendors.

Observed ecosystem reality (see the APNIC blog post in References):

uneven support for fragment bits / TCP flags,
different rule composition constraints,
different policing granularity,
different observability quality for dropped traffic.

So the controller must compile per device profile:

capability matrix per platform/OS version,
per-vendor rule simplification/fallback,
pre-deploy compatibility tests for common attack templates.

Do not assume “accepted by BGP” means “enforced as intended”.
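
Per-profile compilation might look like the sketch below. The capability matrix is entirely hypothetical (the platform names and component sets do not describe any real vendor), and `compile_for` / `CompileResult` are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set, Tuple

# Hypothetical capability matrix keyed by (platform, OS version).
CAPABILITIES: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("platform_a", "1.0"): frozenset(
        {"dst_prefix", "protocol", "dst_port", "tcp_flags", "fragment", "packet_length"}),
    ("platform_b", "2.3"): frozenset(
        {"dst_prefix", "protocol", "dst_port", "packet_length"}),
}


@dataclass(frozen=True)
class CompileResult:
    status: str                  # "exact", "simplified", or "unsupported"
    rule: Optional[dict]
    dropped: Tuple[str, ...] = ()


def compile_for(profile: Tuple[str, str], rule: dict, required: Set[str]) -> CompileResult:
    """Fit a rule to one device profile, dropping only non-required components."""
    supported = CAPABILITIES.get(profile)
    if supported is None:
        return CompileResult("unsupported", None)      # unknown platform: refuse
    missing = [k for k in rule if k not in supported]
    if any(k in required for k in missing):
        return CompileResult("unsupported", None, tuple(missing))
    if missing:
        kept = {k: v for k, v in rule.items() if k in supported}
        return CompileResult("simplified", kept, tuple(missing))
    return CompileResult("exact", dict(rule))
```

A simplified rule matches more traffic than the original, so it should go back through the specificity and shadow-evaluation checks before it is announced.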

7) Observability: metrics that actually matter

Minimum dashboards:

rule lifecycle counts: candidate / active / expiring / failed,
push latency: detect -> compiled -> advertised -> enforced,
router health: CPU, memory, FIB/ACL utilization, control-plane drops,
mitigation effectiveness: pps/bps reduction per rule,
collateral indicators: SYN success, DNS success, HTTP 2xx/4xx/5xx shifts,
churn indicators: announce/withdraw rate, rule flapping,
route-validation failures and policy rejects.

Key alerts:

sudden spike in new rules or withdraws,
active-rule budget utilization > 80%,
enforcement mismatch (controller says active, device says unsupported/partial),
user traffic KPI degradation after rule activation.
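
The key alerts can be evaluated from a periodic metrics snapshot. The sketch below assumes a hypothetical snapshot dictionary; the field names, the churn multiplier, and the KPI drop threshold are illustrative, and only the 80% budget figure comes from the list above.

```python
from typing import Dict, List


def evaluate_alerts(snapshot: Dict[str, object],
                    budget_fraction: float = 0.8,
                    churn_factor: float = 5.0,
                    kpi_drop: float = 0.05) -> List[str]:
    """Return the names of alerts that fire for this snapshot."""
    alerts = []
    # Sudden spike in announcements/withdrawals relative to a baseline.
    if snapshot["updates_per_min"] > churn_factor * snapshot["baseline_updates_per_min"]:
        alerts.append("update_spike")
    # Active rules above the configured share of the device budget.
    if snapshot["active_rules"] > budget_fraction * snapshot["rule_budget"]:
        alerts.append("rule_budget_high")
    # Rules the controller believes are active but the device does not enforce.
    if set(snapshot["controller_active"]) - set(snapshot["device_enforced"]):
        alerts.append("enforcement_mismatch")
    # User-facing KPI (for example SYN success ratio) fell after activation.
    if snapshot["kpi_before"] - snapshot["kpi_after"] > kpi_drop:
        alerts.append("kpi_degradation")
    return alerts
```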

8) Safe incident workflow (runbook)

Classify attack mode (volumetric flood vs protocol-specific abuse).
Pick initial control
- RTBH for immediate containment if links saturate,
- direct FlowSpec if signature confidence is already high.
Compile candidate FlowSpec with policy checks.
Shadow evaluate against recent telemetry.
Canary announce to limited edge scope.
Observe for 1–3 minutes (effectiveness + collateral).
Progressive rollout to full scope.
Auto-expire + review after attack decay.
Post-incident cleanup: withdraw stale rules, archive metrics, update templates.
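
The canary, observe, and progressive-rollout steps can be sketched as a loop with rollback. This is an illustration only: `staged_rollout` and the three callbacks are hypothetical, and the 1 to 3 minute observation window is assumed to live inside the `healthy` callback.

```python
from typing import Callable, List, Sequence


def staged_rollout(rule_id: str,
                   stages: Sequence[str],
                   announce: Callable[[str, str], None],
                   withdraw: Callable[[str, str], None],
                   healthy: Callable[[str, str], bool]) -> bool:
    """Announce stage by stage; roll back everything on the first unhealthy stage."""
    announced: List[str] = []
    for scope in stages:
        announce(rule_id, scope)
        announced.append(scope)
        # healthy() is expected to wait out the observation window and then
        # check both effectiveness and collateral indicators.
        if not healthy(rule_id, scope):
            for done in reversed(announced):
                withdraw(rule_id, done)
            return False
    return True
```

Injecting the announce, withdraw, and health functions keeps the workflow testable without touching a router.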

9) Failure lessons to institutionalize

Cloudflare's 2013 outage (see References) showed that a single bad FlowSpec rule, in that case one matching an impossible packet-length range, can trigger network-wide router instability when distributed broadly.

Actionable lessons:

include syntax acceptance tests + semantic sanity checks + platform safety checks before global propagation,
prefer staged rollouts over immediate global fanout,
keep an emergency kill-switch that can withdraw all dynamic mitigation rules quickly.

10) Compact operator checklist

FlowSpec inbound peers are explicitly allowlisted with strict policy
Validation behavior (RFC 8955 baseline + any RFC 9117 relaxations) is documented and audited
Rule compiler enforces specificity floor + action allowlist + TTL
Rule budgets (count/churn/headroom) are configured and alerted
Vendor capability matrix is maintained per platform/version
Canary rollout and global kill-switch are tested periodically
Post-incident review captures collateral metrics, not only dropped bps/pps

References

RFC 8955 — Dissemination of Flow Specification Rules
https://datatracker.ietf.org/doc/html/rfc8955
RFC 8956 — Dissemination of Flow Specification Rules for IPv6
https://datatracker.ietf.org/doc/html/rfc8956
RFC 9117 — Revised Validation Procedure for BGP Flow Specifications
https://datatracker.ietf.org/doc/html/rfc9117
RFC 7999 — BLACKHOLE Community
https://datatracker.ietf.org/doc/html/rfc7999
APNIC Blog (P. Odintsov, 2024) — The ultimate weapon against DDoS — BGP Flowspec
https://blog.apnic.net/2024/09/18/the-ultimate-weapon-against-ddos-bgp-flowspec/
Cloudflare (2013) — Today’s Outage Post Mortem (FlowSpec-related router failure context)
https://blog.cloudflare.com/todays-outage-post-mortem-82515/