nf_conntrack Table-Pressure & Implicit-Drop Slippage Playbook

2026-03-20 · finance

Why this matters

In low-latency execution, teams usually track exchange throttles, app latency, and packet loss.

But many stacks still carry an invisible network-layer tax: stateful conntrack pressure on paths that should be deterministic.

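When nf_conntrack gets stressed (or fills), new-flow packets can be dropped before normal packet handling, producing the symptoms traced in the failure mechanism below: late or clustered acks, stale pending order state, and retry bursts.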

This often gets misdiagnosed as "random venue instability" while the root cause is local path state pressure.


Failure mechanism (kernel path -> execution)

  1. Stateful tracking is enabled on strategy/gateway traffic.
  2. Flow count and/or hash-chain pressure rises (nf_conntrack_count approaches nf_conntrack_max).
  3. New-flow packets are dropped or delayed during pressure episodes.
  4. The app-side order-state timeline becomes distorted (late or clustered acks, stale pending state).
  5. The router overreacts with urgency bursts and retries that forfeit queue priority.

Result: p95/p99 slippage rises even when median latency looks mostly fine.


Slippage decomposition with conntrack term

For parent order \(i\):

\[ IS_i = C_{\text{impact}} + C_{\text{timing}} + C_{\text{routing}} + C_{\text{ct}} \]

Where:

\[ C_{\text{ct}} = C_{\text{drop-gap}} + C_{\text{state-skew}} + C_{\text{burst-recovery}} \]


Operational metrics (new)

1) CTR: Conntrack Utilization Ratio

\[ \mathrm{CTR} = \frac{\texttt{nf\_conntrack\_count}}{\texttt{nf\_conntrack\_max}} \]

Primary stress gauge.
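A minimal sketch of sampling CTR on a Linux host, assuming the nf_conntrack module is loaded so that both values are exposed under /proc/sys/net/netfilter. The function name `read_ctr` and the `proc_dir` parameter are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Standard sysctl location for the conntrack counters (present only when
# the nf_conntrack module is loaded in this network namespace).
PROC_DIR = Path("/proc/sys/net/netfilter")


def read_ctr(proc_dir: Path = PROC_DIR) -> float:
    """Return nf_conntrack_count / nf_conntrack_max as a ratio in [0, 1]."""
    count = int((proc_dir / "nf_conntrack_count").read_text().strip())
    limit = int((proc_dir / "nf_conntrack_max").read_text().strip())
    if limit <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / limit
```

Calling `read_ctr()` on a fixed interval and exporting the value as a gauge is usually enough; the `proc_dir` argument exists only so the function can be pointed at a fixture directory in tests.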

2) HCP95: Hash-Chain Pressure p95

p95 lookup-chain length proxy (or equivalent per-host lookup cost telemetry).

3) NFD: New-Flow Drop Rate

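Rate of new-flow packets dropped during high-CTR windows, taken from kernel log lines ("nf_conntrack: table full, dropping packet") and from conntrack/firewall counters such as the drop and insert_failed columns of /proc/net/stat/nf_conntrack.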
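One way to derive NFD from counters is to difference two samples of /proc/net/stat/nf_conntrack, which holds a header row followed by one row of hexadecimal counters per CPU. The sketch below parses columns by header name because the column set varies across kernel versions. The choice to count `drop` plus `insert_failed` as NFD, and to track `early_drop` (entries evicted to make room) separately as a pressure signal, is an assumption of this example.

```python
NFD_COLUMNS = ("drop", "insert_failed")
PARSED_COLUMNS = NFD_COLUMNS + ("early_drop",)


def parse_conntrack_stat(text: str) -> dict[str, int]:
    """Sum selected per-CPU hex counters from /proc/net/stat/nf_conntrack."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    totals = {}
    for name in PARSED_COLUMNS:
        if name in header:  # skip columns this kernel does not report
            idx = header.index(name)
            totals[name] = sum(int(row[idx], 16) for row in rows)
    return totals


def nfd_rate(prev: dict[str, int], curr: dict[str, int], interval_s: float) -> float:
    """Drops per second between two parsed samples taken interval_s apart."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = sum(curr.get(k, 0) - prev.get(k, 0) for k in NFD_COLUMNS)
    # Counters reset on module reload; clamp so a reset never reads negative.
    return max(delta, 0) / interval_s
```

Typical use is to read the file once per sampling interval, keep the previous parsed sample, and emit `nfd_rate(prev, curr, interval_s)` alongside CTR.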

4) OSD: Order-State Drift

Difference between internal order-state age and reconstructed ground-truth progression age.

5) CDT: Conntrack Distortion Tax

Incremental IS during high-CTR/NFD windows vs matched low-pressure windows.


What to log in production

Kernel/netfilter layer

Transport/order-state layer

Execution outcomes


Identification strategy (causal, not anecdotal)

  1. Match windows by spread, realized vol, participation, and time-of-day.
  2. Split windows into low vs high CTR/NFD regimes.
  3. Estimate incremental tail IS with host fixed effects.
  4. Validate via canary controls:
    • conntrack bypass for trusted deterministic paths (where policy allows), or
    • increased conntrack capacity + timeout hygiene.
  5. Confirm that CDT drops while market covariates remain matched.

If CDT drops under the canary while covariates stay matched, the uplift is infra-causal (conntrack-path), not merely market-regime noise.
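The matching and regime split above can be sketched as a bucketed tail comparison. This example assumes each window has already been assigned a covariate bucket key (for instance a tuple of spread, vol, participation, and time-of-day bins, optionally with the host included as a crude stand-in for host fixed effects) and a high/low pressure flag. The function names, the nearest-rank p95, and the `min_per_cell` cutoff are illustrative choices, not a prescribed estimator.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    s = sorted(values)
    return s[max(math.ceil(0.95 * len(s)) - 1, 0)]


def cdt_matched(windows, min_per_cell=20):
    """Weighted mean of (p95 IS high-pressure - p95 IS low-pressure) per bucket.

    windows: iterable of (bucket_key, is_high_pressure, is_value) tuples.
    Returns None when no bucket has enough windows in both regimes.
    """
    cells = defaultdict(lambda: {True: [], False: []})
    for key, high, is_value in windows:
        cells[key][bool(high)].append(is_value)

    weighted_sum, weight = 0.0, 0
    for cell in cells.values():
        hi, lo = cell[True], cell[False]
        if len(hi) < min_per_cell or len(lo) < min_per_cell:
            continue  # unmatched bucket: no comparable counterfactual
        weighted_sum += (p95(hi) - p95(lo)) * len(hi)
        weight += len(hi)
    return weighted_sum / weight if weight else None
```

Running the same function on canary and control host pools gives the before/after CDT comparison for step 5; a regression with explicit host fixed effects would be the more rigorous version of this estimate.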


Regime state machine

CT_HEALTHY

CT_TIGHTENING

CT_PRESSURED

CT_SAFE_CONTAIN

Use hysteresis + minimum dwell times to avoid flapping.
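A minimal sketch of the four regimes with hysteresis and a minimum dwell time. All thresholds (`ENTER_CTR`, `EXIT_MARGIN`, `MIN_DWELL_S`) and the rule that any nonzero NFD forces at least CT_PRESSURED are hypothetical placeholders to be calibrated per deployment. Escalation is immediate, while de-escalation moves one level at a time and only after the dwell time has elapsed.

```python
from dataclasses import dataclass
from enum import IntEnum


class Regime(IntEnum):
    CT_HEALTHY = 0
    CT_TIGHTENING = 1
    CT_PRESSURED = 2
    CT_SAFE_CONTAIN = 3


# Hypothetical values: CTR needed to enter TIGHTENING / PRESSURED / SAFE_CONTAIN.
ENTER_CTR = (0.60, 0.80, 0.95)
EXIT_MARGIN = 0.10   # CTR must fall this far below the enter level to step down
MIN_DWELL_S = 30.0   # minimum time in a regime before stepping down


@dataclass
class RegimeTracker:
    state: Regime = Regime.CT_HEALTHY
    entered_at: float = 0.0

    def update(self, ctr: float, nfd: float, now: float) -> Regime:
        target = sum(ctr >= t for t in ENTER_CTR)
        if nfd > 0:
            target = max(target, int(Regime.CT_PRESSURED))
        if target > self.state:
            # Escalate immediately: pressure should never wait on a dwell timer.
            self.state, self.entered_at = Regime(target), now
        elif self.state > Regime.CT_HEALTHY:
            exit_below = ENTER_CTR[self.state - 1] - EXIT_MARGIN
            dwelled = now - self.entered_at >= MIN_DWELL_S
            if dwelled and target < self.state and ctr < exit_below:
                # Step down a single level, then restart the dwell clock.
                self.state, self.entered_at = Regime(self.state - 1), now
        return self.state
```

The asymmetry is deliberate: a late downgrade costs a little execution urgency, while a late upgrade lets retry bursts run into a pressured table.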


Control ladder

  1. Scope conntrack intentionally
    • avoid tracking traffic classes that do not require stateful firewall semantics (subject to security policy)
  2. Right-size buckets/max together
    • raising nf_conntrack_max without also growing nf_conntrack_buckets lengthens hash chains, which moves pressure from table-full drops to lookup cost rather than removing it
  3. Timeout hygiene by protocol profile
    • reduce stale-entry residency for your actual traffic mix
  4. Isolate noisy traffic domains
    • prevent unrelated connection churn from consuming execution-path headroom
  5. Model pressure features directly
    • include CTR/NFD/OSD signals in slippage mean + tail heads
  6. Fail-safe execution behavior
    • when CT_PRESSURED, enforce anti-burst guards and tighten retry budgets
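The fail-safe step in the ladder above can be sketched as a per-parent-order retry guard keyed by regime. The `POLICY` table (retry budget and minimum spacing per regime) uses made-up numbers and the class name `RetryGuard` is hypothetical; the point is only that budget and spacing tighten as the regime worsens.

```python
from dataclasses import dataclass

# Hypothetical policy: regime -> (max retries per parent order,
#                                 minimum seconds between retries)
POLICY = {
    "CT_HEALTHY": (5, 0.0),
    "CT_TIGHTENING": (3, 0.05),
    "CT_PRESSURED": (1, 0.25),
    "CT_SAFE_CONTAIN": (0, float("inf")),
}


@dataclass
class RetryGuard:
    used: int = 0
    last_retry_at: float = float("-inf")

    def allow(self, regime: str, now: float) -> bool:
        """Return True and consume budget if a retry is permitted right now."""
        budget, min_gap = POLICY[regime]
        if self.used >= budget:
            return False  # budget for this regime already spent
        if now - self.last_retry_at < min_gap:
            return False  # anti-burst guard: too soon after the last retry
        self.used += 1
        self.last_retry_at = now
        return True
```

Because `used` carries across regime changes, an order that already spent retries while healthy gets no fresh budget when the host becomes pressured.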

Failure drills (must run)

  1. Table-pressure replay drill
    • synthetic flow surge to verify detection + state transitions
  2. Bypass canary drill
    • policy-compliant conntrack bypass test on a small host pool
  3. Capacity-step drill
    • controlled bucket/max adjustments and CDT response measurement
  4. Rollback drill
    • deterministic revert path for netfilter parameter changes
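For the rollback drill, one possible approach is to snapshot the relevant sysctls before a change and generate the revert commands from the difference afterwards. The list of tracked parameters below is an assumption (extend it to whatever the change actually touches), and the helper names are illustrative. The sketch only produces `sysctl -w` command strings; it does not apply them.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")

# Assumed set of parameters under change control for this drill.
TRACKED = (
    "nf_conntrack_max",
    "nf_conntrack_buckets",
    "nf_conntrack_tcp_timeout_established",
)


def snapshot(proc_dir: Path = PROC_DIR) -> dict[str, str]:
    """Capture current values of the tracked conntrack sysctls."""
    return {
        name: (proc_dir / name).read_text().strip()
        for name in TRACKED
        if (proc_dir / name).exists()
    }


def revert_plan(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Commands that restore every tracked value that differs from baseline."""
    return [
        f"sysctl -w net.netfilter.{name}={baseline[name]}"
        for name in sorted(baseline)  # sorted for a deterministic order
        if current.get(name) != baseline[name]
    ]
```

Storing the baseline snapshot with the change ticket lets the drill verify that `revert_plan(baseline, snapshot())` is empty once the rollback has run.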

Common mistakes


Bottom line

Conntrack pressure can silently distort order-state time and create avoidable slippage tails.

If execution infrastructure is stateful by default, conntrack health must be a first-class feature in both observability and slippage control.

