FIX Session Reliability Playbook

Date: 2026-03-06
Category: knowledge (trading infrastructure / operations)

Why this playbook exists

Many desks obsess over alpha and execution models but lose real money to boring session-layer failures:

silent disconnects,
sequence gaps,
bad resend handling,
unclean daily resets,
duplicate order state after reconnect.

When FIX session hygiene is weak, you get delayed fills, rejected cancels, phantom exposure, and painful post-trade reconciliation. This is an operations-first guide to keep FIX stable under stress.

Core objective

Treat FIX session correctness as a risk control system, not a networking detail.

A healthy session must guarantee:

Liveness (we detect broken links quickly)
Ordering (sequence integrity)
Recoverability (deterministic replay after gaps)
Idempotent state convergence (same truth on both sides)

Failure taxonomy (what actually breaks)

1) Half-open TCP connection

One side still believes the session is alive, but packets have stopped flowing because of a NAT or firewall problem on the path.

Symptoms: no execution reports, no logout, no hard socket error.

2) Sequence drift

MsgSeqNum diverges after process restart, manual reset mismatch, or missed resend window.

Symptoms: repeating ResendRequest(2), excessive Reject(3), session flapping.
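
A minimal sketch of how an inbound MsgSeqNum can be classified against the expected value. The function name and the action labels are hypothetical. The rules are a simplified reading of standard FIX session behavior: a higher number opens a gap, and a lower number without PossDupFlag set is treated as a serious error.

```python
from enum import Enum


class SeqAction(Enum):
    PROCESS = "process"            # in order: apply and advance the expected seq
    REQUEST_RESEND = "resend"      # gap: ask the counterparty to replay the missing range
    IGNORE_DUPLICATE = "ignore"    # replayed message that was already processed
    DISCONNECT = "disconnect"      # lower than expected and not flagged as a possible duplicate


def classify_inbound(expected: int, received: int, poss_dup: bool) -> SeqAction:
    """Decide what to do with an inbound message based on its MsgSeqNum."""
    if received == expected:
        return SeqAction.PROCESS
    if received > expected:
        return SeqAction.REQUEST_RESEND
    # received < expected: only acceptable when the sender marked it PossDupFlag=Y
    return SeqAction.IGNORE_DUPLICATE if poss_dup else SeqAction.DISCONNECT
```

Logging every non-PROCESS outcome per session makes drift visible before it turns into flapping.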

3) Resend storm

A large gap triggers a replay flood, and live traffic then competes with replayed messages for the same parser and queues.

Symptoms: latency spikes, parser backlog, stale order-state decisions.

4) Duplicate business events

Replay or dedup bugs cause the same ExecutionReport(8) to be applied twice.

Symptoms: position mismatch, double PnL attribution, false alerts.

5) Daily boundary mistakes

Session reset policy differs by venue/broker; one side resets sequence, the other does not.

Symptoms: immediate morning rejects, repeated logons/logouts.

Minimum operational contract

Track these per session in metrics/logs:

session_id (SenderCompID/TargetCompID[/SubIDs])
connection state (connected, logon_sent, logon_acked, established)
inbound/outbound current sequence number
heartbeat/test-request round-trip times
resend volume (messages_replayed, gap_count)
duplicate-drop counts by business key (ExecID, ClOrdID, OrderID)
message processing lag (ingest timestamp vs application timestamp)
logon/logout reason codes and text

If these are missing, incident response becomes guesswork.
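
One way to keep these fields together is a per-session snapshot emitted as a structured log line. This is a sketch only: the class name, field names, and units are assumptions, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionSnapshot:
    session_id: str            # e.g. "SenderCompID->TargetCompID" (format is an assumption)
    state: str                 # connected, logon_sent, logon_acked, established
    next_in_seq: int
    next_out_seq: int
    heartbeat_rtt_ms: float
    messages_replayed: int
    gap_count: int
    duplicates_dropped: int
    processing_lag_ms: float
    last_logout_text: str = ""

    def to_log_line(self) -> str:
        """Serialize with sorted keys so log lines are stable and diffable."""
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting one such line on every state change and on a fixed timer gives incident responders a timeline to read instead of a guess.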

Session health SLOs (practical defaults)

For liquid-market live trading sessions:

Heartbeat miss detection: <= 1.5 × HeartBtInt
TestRequest response timeout: <= 2 × HeartBtInt
Reconnect completion p95: < 5s
Replay completion p95 (normal gap): < 2s
Duplicate application rate: 0 (hard requirement)
Morning session ready at least 5 minutes before strategy start

Tune per venue, but keep explicit targets.
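
The two liveness budgets above can be sketched as a small monitor driven by a monotonic clock. The class and the returned action strings are hypothetical, and the sketch treats any inbound message as proof of life rather than matching the TestReqID, which a real engine would do.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LivenessMonitor:
    heartbt_int: float                           # negotiated HeartBtInt, in seconds
    last_inbound: float                          # monotonic time of the last inbound message
    test_request_sent_at: Optional[float] = None

    def on_inbound(self, now: float) -> None:
        """Any inbound message resets both timers (simplification)."""
        self.last_inbound = now
        self.test_request_sent_at = None

    def poll(self, now: float) -> str:
        """Return the action the session layer should take at time `now`."""
        if self.test_request_sent_at is not None:
            # A TestRequest is outstanding: allow up to 2 x HeartBtInt for a reply.
            if now - self.test_request_sent_at > 2 * self.heartbt_int:
                return "disconnect"
            return "waiting"
        # No inbound traffic for more than 1.5 x HeartBtInt: probe the link.
        if now - self.last_inbound > 1.5 * self.heartbt_int:
            self.test_request_sent_at = now
            return "send_test_request"
        return "ok"
```

With HeartBtInt = 30s, silence is probed after 45s and the link is declared dead 60s after the probe if nothing arrives.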

Sequence and replay safety rules

Rule 1: Never “quick-fix” sequence by ad-hoc manual edits during market hours

Manual number jumps can unstick one session while corrupting downstream state.

Rule 2: Distinguish transport recovery from business-state recovery

Receiving replayed messages is not enough; application state must converge (open qty, leaves qty, cumulative qty, status timeline).
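
A convergence check can be as simple as comparing per-order state against an independent reference such as a drop copy. The sketch below assumes both sides are keyed by ClOrdID and reduced to three fields; the type and function names are hypothetical.

```python
from typing import NamedTuple


class OrderState(NamedTuple):
    cum_qty: int
    leaves_qty: int
    ord_status: str   # FIX OrdStatus value, e.g. "1" partially filled, "2" filled


def diverged_orders(local: dict[str, OrderState],
                    reference: dict[str, OrderState]) -> list[str]:
    """Return sorted ClOrdIDs whose state differs or that exist on one side only."""
    all_ids = local.keys() | reference.keys()
    return sorted(oid for oid in all_ids if local.get(oid) != reference.get(oid))
```

An empty result is the signal that business-state recovery, not just transport recovery, has finished.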

Rule 3: Deduplicate by stable business identifiers

For execution events, dedupe primarily by (session, ExecID) with guard rails for broker-specific behavior.
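
A minimal in-memory sketch of dedupe keyed by (session, ExecID). The class name is hypothetical, and a production version would need a bounded or persisted store plus whatever broker-specific guard rails apply.

```python
class ExecDeduper:
    """Tracks which (session_id, ExecID) pairs have already been applied."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()
        self.dropped = 0   # feeds the duplicate-drop metric

    def should_apply(self, session_id: str, exec_id: str) -> bool:
        key = (session_id, exec_id)
        if key in self._seen:
            self.dropped += 1
            return False
        self._seen.add(key)
        return True
```

The same ExecID on a different session is treated as a distinct event, since ExecID uniqueness is only assumed per session here.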

Rule 4: Keep resend windows bounded

For very large gaps, use chunked replay windows to avoid parser/CPU spikes and event-loop starvation.
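
Chunking can be sketched as splitting the missing range into inclusive windows, each of which would become one ResendRequest (BeginSeqNo/EndSeqNo pair). The function name and the max_chunk parameter are assumptions.

```python
def resend_chunks(begin: int, end: int, max_chunk: int) -> list[tuple[int, int]]:
    """Split the inclusive range [begin, end] into windows of at most max_chunk messages."""
    if max_chunk < 1:
        raise ValueError("max_chunk must be at least 1")
    chunks = []
    start = begin
    while start <= end:
        stop = min(start + max_chunk - 1, end)
        chunks.append((start, stop))
        start = stop + 1
    return chunks


# A gap of 250 messages replayed in windows of 100:
print(resend_chunks(101, 350, 100))   # [(101, 200), (201, 300), (301, 350)]
```

Requesting the next window only after the previous one has been fully applied keeps the replay from starving live processing.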

Rule 5: Persist sequence/state atomically

Crash-safe persistence should ensure that the sequence checkpoint and the applied business event index advance together, in a single atomic write.
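
One way to get that atomicity is to keep both records in the same transactional store. The sketch below uses SQLite from the Python standard library; the table layout and function names are assumptions chosen for illustration.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint ("
                 "session_id TEXT PRIMARY KEY, next_in_seq INTEGER NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS applied_exec ("
                 "session_id TEXT NOT NULL, exec_id TEXT NOT NULL, "
                 "PRIMARY KEY (session_id, exec_id))")
    conn.commit()
    return conn


def apply_execution(conn: sqlite3.Connection, session_id: str,
                    seq: int, exec_id: str) -> bool:
    """Record the execution and advance the checkpoint in one transaction.

    Returns True if the execution was new, False if it was a duplicate.
    The checkpoint advances in both cases.
    """
    with conn:  # commits on success, rolls back if either statement raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_exec VALUES (?, ?)",
            (session_id, exec_id))
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint VALUES (?, ?)",
            (session_id, seq + 1))
    return cur.rowcount == 1
```

A crash before the commit leaves neither the checkpoint nor the applied-event record changed, so the message is simply processed again after restart.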

Reconnect state machine (reference)

DISCONNECTED
- socket down or liveness timeout
CONNECTING
- TCP/TLS setup
LOGON_NEGOTIATION
- Logon(A), heartbeat interval agreement, reset flags
RECOVERY
- gap detection + ResendRequest handling + dedupe
LIVE
- normal flow
DEGRADED
- replay backlog or elevated processing lag (trade with tighter safety caps)
SAFE
- cannot guarantee correctness; throttle/pause strategy participation

Use hysteresis: make it harder to leave DEGRADED or SAFE than to enter them, so the session does not flap between states.
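
A sketch of hysteresis for the LIVE/DEGRADED pair only, driven by processing lag. The thresholds, the streak length, and the class name are hypothetical: entry takes one bad sample, exit takes a run of good ones below a stricter threshold.

```python
from enum import Enum


class State(Enum):
    LIVE = "LIVE"
    DEGRADED = "DEGRADED"


class HysteresisGate:
    def __init__(self, enter_lag_ms: float, exit_lag_ms: float, exit_streak: int) -> None:
        self.enter_lag_ms = enter_lag_ms   # lag above this enters DEGRADED immediately
        self.exit_lag_ms = exit_lag_ms     # lag must stay below this to count as healthy
        self.exit_streak = exit_streak     # consecutive healthy samples needed to leave
        self.state = State.LIVE
        self._healthy = 0

    def observe(self, lag_ms: float) -> State:
        if self.state is State.LIVE:
            if lag_ms > self.enter_lag_ms:
                self.state = State.DEGRADED
                self._healthy = 0
        else:
            # Any unhealthy sample resets the streak.
            self._healthy = self._healthy + 1 if lag_ms < self.exit_lag_ms else 0
            if self._healthy >= self.exit_streak:
                self.state = State.LIVE
        return self.state
```

The same pattern extends to SAFE with a longer streak or an explicit operator acknowledgement before leaving.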

Incident runbook (fast path)

A) No fills for suspiciously long period

Check heartbeat/test-request RTT and last inbound timestamp.
If liveness is uncertain, force a reconnect (with a graceful logout if possible).
Enter RECOVERY; block new parent slices until sequence aligned.
Rebuild per-order state from replay + drop copy.
Release throttled flow only after state checks pass.

B) Continuous ResendRequest loop

Inspect reset policy mismatch (ResetSeqNumFlag, daily reset schedule).
Verify persisted seq checkpoint vs counterparty expectation.
If the mismatch is unresolved after N attempts, switch to SAFE mode and escalate to the broker desk.
Avoid repeated blind reconnect loops; they amplify queue congestion.

C) Duplicate execution suspicion

Freeze automated position-dependent tactic switching.
Run duplicate detector on (ExecID, side, qty, price, transactTime).
Reconcile with drop copy / back-office feed.
Correct position snapshot before re-enabling normal control loop.
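
The duplicate detector in the second step can be sketched as a count over the composite key. The dict keys below use FIX field names (LastQty and LastPx standing in for qty and price), which is an assumption about how executions are stored locally.

```python
from collections import Counter


def find_duplicates(execs: list[dict]) -> list[tuple]:
    """Return each (ExecID, Side, LastQty, LastPx, TransactTime) key seen more than once."""
    keys = [
        (e["ExecID"], e["Side"], e["LastQty"], e["LastPx"], e["TransactTime"])
        for e in execs
    ]
    return [key for key, count in Counter(keys).items() if count > 1]
```

Keys that repeat here but carry different ExecIDs in the drop copy point to a broker-side issue rather than a local replay bug, so the reconciliation step still matters.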

Pre-open checklist (daily)

Session reset policy verified per venue/broker
Persistent seq store integrity check passed
Last session clean logout status verified
Synthetic heartbeat/test-request check passed
Replay drill (small controlled gap) passes in staging or dry-mode
Alert channels and on-call escalation path confirmed

If any check fails, run in reduced-risk mode at the open.

Good alerting (what to page on)

Page-worthy:

liveness failure > timeout budget
reconnect attempts exceed threshold
resend gap > configured max window
replay lag breaching execution decision budget
duplicate business event detected
sequence reset mismatch at session start

Not page-worthy alone:

single transient reconnect with full automatic recovery under SLO

Chaos tests worth automating

Drop inbound heartbeats for N intervals
Inject sequence gap bursts (small + large)
Force process crash between seq checkpoint and business apply
Reorder delayed replay chunks
Duplicate a subset of execution reports
Simulate daily reset mismatch across counterparties

Promote code only if state converges correctly under all six.
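
As an illustration of one of these tests, the sketch below duplicates a subset of fills, shuffles them, and checks that the position still converges. Fills are reduced to (ExecID, signed quantity) pairs, and both helper names are hypothetical.

```python
import random


def apply_fills(fills: list[tuple[str, int]], dedupe: bool = True) -> int:
    """Apply (exec_id, signed_qty) fills and return the resulting position."""
    seen: set[str] = set()
    position = 0
    for exec_id, signed_qty in fills:
        if dedupe and exec_id in seen:
            continue
        seen.add(exec_id)
        position += signed_qty
    return position


def with_duplicates(fills: list[tuple[str, int]], fraction: float, seed: int) -> list[tuple[str, int]]:
    """Return fills plus a random subset of repeats, in shuffled order."""
    rng = random.Random(seed)   # seeded so a failing run can be reproduced
    dupes = [f for f in fills if rng.random() < fraction]
    noisy = fills + dupes
    rng.shuffle(noisy)
    return noisy


fills = [("E1", 100), ("E2", -40), ("E3", 25)]
noisy = with_duplicates(fills, fraction=0.5, seed=7)
assert apply_fills(noisy) == apply_fills(fills)   # position converges despite duplicates
```

Running the same input with dedupe disabled is a useful negative control: if the position still matches, the test did not actually inject any duplicates.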

Desk-level takeaway

A fast strategy on a fragile FIX session is fake speed.

Most “mysterious execution incidents” are session reliability incidents in disguise. If liveness, sequence integrity, replay discipline, and dedupe are explicit SLO-governed controls, you prevent avoidable slippage and far nastier position errors.