FIX Session Reliability Playbook

2026-03-06 · finance

Date: 2026-03-06
Category: knowledge (trading infrastructure / operations)

Why this playbook exists

Many desks obsess over alpha and execution models but lose real money to boring session-layer failures.

When FIX session hygiene is weak, you get delayed fills, rejected cancels, phantom exposure, and painful post-trade reconciliation. This is an operations-first guide to keep FIX stable under stress.


Core objective

Treat FIX session correctness as a risk control system, not a networking detail.

A healthy session must guarantee:

  1. Liveness (we detect broken links quickly)
  2. Ordering (sequence integrity)
  3. Recoverability (deterministic replay after gaps)
  4. Idempotent state convergence (same truth on both sides)

Failure taxonomy (what actually breaks)

1) Half-open TCP connection

One side believes the session is alive, but packets have stopped flowing, typically because a NAT or firewall on the path has dropped the connection without notifying either endpoint.

Symptoms: no execution reports, no logout, no hard socket error.
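
A minimal sketch of detecting this condition from inbound silence instead of waiting for a socket error. It assumes the usual pattern of sending TestRequest(1) after roughly one heartbeat interval of silence and declaring the link dead if nothing arrives within another interval. The names LivenessMonitor and Liveness and the 1.2 grace factor are illustrative assumptions, not values from any venue spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Liveness(Enum):
    OK = "ok"
    SEND_TEST_REQUEST = "send_test_request"
    DEAD = "dead"


@dataclass
class LivenessMonitor:
    heartbeat_interval_s: float                    # negotiated HeartBtInt(108)
    grace_factor: float = 1.2                      # assumed slack for transmission delay
    last_inbound_s: float = 0.0                    # clock reading at the last inbound message
    test_request_sent_s: Optional[float] = None    # set while a TestRequest is outstanding

    def on_inbound(self, now_s: float) -> None:
        """Any inbound message (not only Heartbeat) proves the link is alive."""
        self.last_inbound_s = now_s
        self.test_request_sent_s = None

    def check(self, now_s: float) -> Liveness:
        limit = self.heartbeat_interval_s * self.grace_factor
        if self.test_request_sent_s is not None:
            # A TestRequest is outstanding: dead if the reply window has passed,
            # otherwise keep waiting.
            if now_s - self.test_request_sent_s > limit:
                return Liveness.DEAD
            return Liveness.OK
        if now_s - self.last_inbound_s > limit:
            self.test_request_sent_s = now_s
            return Liveness.SEND_TEST_REQUEST
        return Liveness.OK
```

The caller would pass a monotonic clock reading (for example time.monotonic()) so that wall-clock adjustments cannot mask or fake silence.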

2) Sequence drift

MsgSeqNum diverges after process restart, manual reset mismatch, or missed resend window.

Symptoms: repeating ResendRequest(2), excessive Reject(3), session flapping.
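
To make the drift cases concrete, here is a small sketch of how an inbound MsgSeqNum(34) can be classified against the expected value. It follows the standard session-layer rules (higher than expected means a gap, lower than expected is only acceptable with PossDupFlag(43)=Y). The names SeqAction and classify_inbound are hypothetical, and special handling for SequenceReset(4) and Logon(A) is deliberately left out.

```python
from enum import Enum


class SeqAction(Enum):
    PROCESS = "process"                    # in sequence: apply and advance expected
    REQUEST_RESEND = "request_resend"      # gap: ask for the missing range
    IGNORE_DUPLICATE = "ignore_duplicate"  # already-seen message flagged PossDup
    FATAL_TOO_LOW = "fatal_too_low"        # unexpected low number: stop and escalate


def classify_inbound(expected: int, received: int, poss_dup: bool) -> SeqAction:
    """Decide what to do with one inbound message based on its sequence number."""
    if received == expected:
        return SeqAction.PROCESS
    if received > expected:
        # Messages expected..received-1 are missing.
        return SeqAction.REQUEST_RESEND
    # received < expected
    return SeqAction.IGNORE_DUPLICATE if poss_dup else SeqAction.FATAL_TOO_LOW
```

A session that keeps landing in REQUEST_RESEND or FATAL_TOO_LOW after every logon is the flapping pattern described above.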

3) Resend storm

A large gap triggers a replay flood, and live traffic interleaved with replayed messages is hard to process in the right order.

Symptoms: latency spikes, parser backlog, stale order-state decisions.

4) Duplicate business events

Replay or dedup bugs cause the same ExecutionReport(8) to be applied twice.

Symptoms: position mismatch, double PnL attribution, false alerts.

5) Daily boundary mistakes

Session reset policy differs by venue/broker; one side resets sequence, the other does not.

Symptoms: immediate morning rejects, repeated logons/logouts.


Minimum operational contract

Track these per session in metrics/logs:

If these are missing, incident response becomes guesswork.


Session health SLOs (practical defaults)

For liquid-market live trading sessions:

Tune per venue, but keep explicit targets.


Sequence and replay safety rules

Rule 1: Never “quick-fix” sequence numbers with ad-hoc manual edits during market hours

Manual number jumps can unstick one session and corrupt downstream state.

Rule 2: Distinguish transport recovery from business-state recovery

Receiving replayed messages is not enough; application state must converge (open qty, leaves qty, cumulative qty, status timeline).
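
One way to check business-state convergence is to compare the locally rebuilt view of each order with an independent reference such as a drop copy. The sketch below is only illustrative: OrderView and divergences are hypothetical names, and the field set is an assumption that a real desk would extend.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OrderView:
    order_qty: int
    cum_qty: int
    leaves_qty: int
    ord_status: str   # OrdStatus(39) value as received


def divergences(local: OrderView, reference: OrderView) -> list:
    """Return the names of the fields where the two views disagree."""
    return [
        f.name
        for f in fields(OrderView)
        if getattr(local, f.name) != getattr(reference, f.name)
    ]


local = OrderView(order_qty=1000, cum_qty=400, leaves_qty=600, ord_status="1")
drop_copy = OrderView(order_qty=1000, cum_qty=600, leaves_qty=400, ord_status="1")
mismatched = divergences(local, drop_copy)   # a missed fill shows up here
```

An empty result for every open order is a reasonable gate before treating recovery as complete.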

Rule 3: Deduplicate by stable business identifiers

For execution events, dedupe primarily by (session, ExecID) with guard rails for broker-specific behavior.
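
A minimal sketch of this rule, keyed on (session, ExecID). ExecDeduper is a hypothetical name, and the in-memory set is an assumption for brevity: a production version would persist the seen keys and might widen the key (for example with a trading date) as a broker-specific guard rail.

```python
class ExecDeduper:
    """Tracks which (session, ExecID) pairs have already been applied."""

    def __init__(self) -> None:
        self._seen = set()

    def first_time(self, session_id: str, exec_id: str) -> bool:
        """Return True exactly once per (session_id, exec_id)."""
        key = (session_id, exec_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


dedupe = ExecDeduper()
position = 0
# The second report is a replayed copy of the first.
for session_id, exec_id, signed_qty in [
    ("BROKER_A", "E1", 100),
    ("BROKER_A", "E1", 100),
    ("BROKER_A", "E2", -40),
]:
    if dedupe.first_time(session_id, exec_id):
        position += signed_qty
```

Keying on the session as well as ExecID avoids collisions when two counterparties happen to issue the same identifier.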

Rule 4: Keep resend windows bounded

For very large gaps, use chunked replay windows to avoid parser/CPU spikes and event-loop starvation.
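
The chunking itself is simple arithmetic. In the sketch below, each returned pair would become the BeginSeqNo(7) and EndSeqNo(16) of one ResendRequest(2), issued only after the previous chunk has been processed. The function name resend_chunks and the chunk size are assumptions; some counterparties impose their own limits.

```python
def resend_chunks(begin: int, end: int, chunk_size: int) -> list:
    """Split the inclusive range [begin, end] into inclusive (lo, hi) chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if end < begin:
        return []
    return [
        (lo, min(lo + chunk_size - 1, end))
        for lo in range(begin, end + 1, chunk_size)
    ]


chunks = resend_chunks(101, 350, 100)
# chunks == [(101, 200), (201, 300), (301, 350)]
```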

Rule 5: Persist sequence/state atomically

Crash-safe persistence should ensure sequence checkpoint and applied business event index move together.
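
One way to get this property is to write both pieces in a single database transaction. The sketch below uses Python's built-in sqlite3 module; the table names, open_store, and apply_exec are hypothetical, and a real engine would more likely use its own message store.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint ("
        "session TEXT PRIMARY KEY, next_in_seq INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_exec ("
        "session TEXT NOT NULL, exec_id TEXT NOT NULL, "
        "PRIMARY KEY (session, exec_id))"
    )
    conn.commit()
    return conn


def apply_exec(conn: sqlite3.Connection, session: str, seq: int, exec_id: str) -> bool:
    """Record the event and advance the checkpoint in one transaction.

    Returns True if the event was new, False if it had already been applied.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_exec (session, exec_id) VALUES (?, ?)",
            (session, exec_id),
        )
        newly_applied = cur.rowcount == 1
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (session, next_in_seq) VALUES (?, ?)",
            (session, seq + 1),
        )
    return newly_applied
```

Because both writes commit together, a crash can never leave the sequence checkpoint ahead of the applied events or the reverse. The caller would update in-memory positions only when apply_exec returns True.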


Reconnect state machine (reference)

  1. DISCONNECTED
    • socket down or liveness timeout
  2. CONNECTING
    • TCP/TLS setup
  3. LOGON_NEGOTIATION
    • Logon(A), heartbeat interval agreement, reset flags
  4. RECOVERY
    • gap detection + ResendRequest handling + dedupe
  5. LIVE
    • normal flow
  6. DEGRADED
    • replay backlog or elevated processing lag (trade with tighter safety caps)
  7. SAFE
    • cannot guarantee correctness; throttle/pause strategy participation

Use hysteresis: make it harder to leave DEGRADED or SAFE than to enter them.
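
A compact sketch of the states above with one form of hysteresis: a single unhealthy sample moves LIVE to DEGRADED, but returning to LIVE requires several consecutive healthy samples. The transition table and the streak length of 3 are assumptions for illustration, as are the names State, ALLOWED, and Session.

```python
from enum import Enum, auto


class State(Enum):
    DISCONNECTED = auto()
    CONNECTING = auto()
    LOGON_NEGOTIATION = auto()
    RECOVERY = auto()
    LIVE = auto()
    DEGRADED = auto()
    SAFE = auto()


# Assumed transition table; adjust to the desk's own policy.
ALLOWED = {
    State.DISCONNECTED: {State.CONNECTING},
    State.CONNECTING: {State.LOGON_NEGOTIATION, State.DISCONNECTED},
    State.LOGON_NEGOTIATION: {State.RECOVERY, State.DISCONNECTED},
    State.RECOVERY: {State.LIVE, State.DEGRADED, State.SAFE, State.DISCONNECTED},
    State.LIVE: {State.DEGRADED, State.SAFE, State.DISCONNECTED},
    State.DEGRADED: {State.LIVE, State.SAFE, State.DISCONNECTED},
    State.SAFE: {State.RECOVERY, State.DISCONNECTED},
}


class Session:
    def __init__(self, exit_streak: int = 3) -> None:
        self.state = State.DISCONNECTED
        self.exit_streak = exit_streak   # healthy samples needed to leave DEGRADED
        self._healthy = 0

    def transition(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target
        self._healthy = 0

    def on_health_sample(self, healthy: bool) -> None:
        if self.state is State.LIVE and not healthy:
            self.transition(State.DEGRADED)          # entering is immediate
        elif self.state is State.DEGRADED:
            self._healthy = self._healthy + 1 if healthy else 0
            if self._healthy >= self.exit_streak:    # leaving needs a clean streak
                self.transition(State.LIVE)
```

Leaving SAFE is intentionally not automated here: in this sketch it only happens through an explicit transition back to RECOVERY.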


Incident runbook (fast path)

A) No fills for suspiciously long period

  1. Check heartbeat/test-request RTT and last inbound timestamp.
  2. If liveness uncertain: force reconnect (graceful logout if possible).
  3. Enter RECOVERY; block new parent slices until sequence aligned.
  4. Rebuild per-order state from replay + drop copy.
  5. Release throttled flow only after state checks pass.

B) Continuous ResendRequest loop

  1. Inspect reset policy mismatch (ResetSeqNumFlag, daily reset schedule).
  2. Verify persisted seq checkpoint vs counterparty expectation.
  3. If the mismatch is still unresolved after N attempts, escalate to SAFE mode and to the broker desk.
  4. Avoid repeated blind reconnect loops; they amplify queue congestion.
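
Steps 3 and 4 amount to a bounded retry budget. The sketch below adds exponential backoff between attempts, which is an assumption on top of the runbook rather than part of it. recover_with_budget and try_align are hypothetical names, with try_align standing in for one logon-and-gap-fill attempt.

```python
import time
from typing import Callable


def recover_with_budget(
    try_align: Callable[[], bool],
    max_attempts: int,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "LIVE" if sequences align within the budget, otherwise "SAFE"."""
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        if try_align():
            return "LIVE"
        if attempt < max_attempts:
            sleep(delay)                          # back off instead of reconnecting blindly
            delay = min(delay * 2, max_delay_s)   # cap the wait between attempts
    return "SAFE"                                 # budget exhausted: escalate to humans
```

Passing sleep as a parameter keeps the function testable without real delays.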

C) Duplicate execution suspicion

  1. Freeze automated position-dependent tactic switching.
  2. Run a duplicate detector on (ExecID, side, qty, price, TransactTime).
  3. Reconcile with drop copy / back-office feed.
  4. Correct position snapshot before re-enabling normal control loop.
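
The detector in step 2 can be a simple batch scan over the captured execution reports. In this sketch the reports are plain dictionaries with hypothetical key names, and prices are kept as strings to avoid float comparison issues; find_duplicates is an illustrative name.

```python
from collections import Counter


def find_duplicates(reports: list) -> list:
    """Return each (exec_id, side, qty, price, transact_time) key seen more than once."""
    keys = [
        (r["exec_id"], r["side"], r["qty"], r["price"], r["transact_time"])
        for r in reports
    ]
    return [key for key, count in Counter(keys).items() if count > 1]


reports = [
    {"exec_id": "E1", "side": "1", "qty": 100, "price": "10.25",
     "transact_time": "20260306-14:30:00.123"},
    {"exec_id": "E2", "side": "2", "qty": 40, "price": "10.30",
     "transact_time": "20260306-14:30:01.456"},
    {"exec_id": "E1", "side": "1", "qty": 100, "price": "10.25",
     "transact_time": "20260306-14:30:00.123"},
]
suspects = find_duplicates(reports)
```

Any key returned here is a candidate for the reconciliation in step 3, not proof of a double-applied fill on its own.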

Pre-open checklist (daily)

If any fail, run reduced-risk mode at open.


Good alerting (what to page on)

Page-worthy:

Not page-worthy alone:


Chaos tests worth automating

  1. Drop inbound heartbeats for N intervals
  2. Inject sequence gap bursts (small + large)
  3. Force process crash between seq checkpoint and business apply
  4. Reorder delayed replay chunks
  5. Duplicate a subset of execution reports
  6. Simulate daily reset mismatch across counterparties

Promote code only if state converges correctly under all six.
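
As an illustration of test 5, the sketch below injects duplicates (and optionally drops) into a stream of execution events and checks that a deduplicating consumer still reaches the same position. inject_faults and position_after are hypothetical helpers, and the (exec_id, signed_qty) event shape is a simplifying assumption.

```python
import random


def inject_faults(messages, rng, dup_rate=0.0, drop_rate=0.0):
    """Return a copy of the stream with some messages dropped or duplicated."""
    out = []
    for msg in messages:
        if rng.random() < drop_rate:
            continue              # simulate a lost message
        out.append(msg)
        if rng.random() < dup_rate:
            out.append(msg)       # simulate a replayed duplicate
    return out


def position_after(messages):
    """Apply (exec_id, signed_qty) events once each, keyed on exec_id."""
    seen, position = set(), 0
    for exec_id, signed_qty in messages:
        if exec_id in seen:
            continue
        seen.add(exec_id)
        position += signed_qty
    return position


clean = [("E1", 100), ("E2", -40), ("E3", 25)]
noisy = inject_faults(clean, random.Random(7), dup_rate=0.5)
converged = position_after(noisy) == position_after(clean)
```

Seeding the random generator keeps a failing run reproducible, which matters when these tests gate promotion.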


Desk-level takeaway

A fast strategy on a fragile FIX session is fake speed.

Most “mysterious execution incidents” are session reliability incidents in disguise. If liveness, sequence integrity, replay discipline, and dedupe are explicit SLO-governed controls, you prevent avoidable slippage and far nastier position errors.