FIX Session Reliability Playbook
Date: 2026-03-06
Category: knowledge (trading infrastructure / operations)
Why this playbook exists
Many desks obsess over alpha and execution models but lose real money from boring session-layer failures:
- silent disconnects,
- sequence gaps,
- bad resend handling,
- unclean daily resets,
- duplicate order state after reconnect.
When FIX session hygiene is weak, you get delayed fills, rejected cancels, phantom exposure, and painful post-trade reconciliation. This is an operations-first guide to keep FIX stable under stress.
Core objective
Treat FIX session correctness as a risk control system, not a networking detail.
A healthy session must guarantee:
- Liveness (we detect broken links quickly)
- Ordering (sequence integrity)
- Recoverability (deterministic replay after gaps)
- Idempotent state convergence (same truth on both sides)
Failure taxonomy (what actually breaks)
1) Half-open TCP connection
One side thinks the session is alive; packets stop flowing due to NAT/firewall path issues.
Symptoms: no execution reports, no logout, no hard socket error.
2) Sequence drift
MsgSeqNum diverges after process restart, manual reset mismatch, or missed resend window.
Symptoms: repeating ResendRequest(2), excessive Reject(3), session flapping.
3) Resend storm
Large gap triggers a replay flood, and replayed messages compete with live traffic for the same parser and queues.
Symptoms: latency spikes, parser backlog, stale order-state decisions.
4) Duplicate business events
Replay/dedup bugs cause the same ExecutionReport(8) to be applied more than once.
Symptoms: position mismatch, double PnL attribution, false alerts.
5) Daily boundary mistakes
Session reset policy differs by venue/broker; one side resets sequence, the other does not.
Symptoms: immediate morning rejects, repeated logons/logouts.
Minimum operational contract
Track these per session in metrics/logs:
- session_id (SenderCompID/TargetCompID[/SubIDs])
- connection state (connected, logon_sent, logon_acked, established)
- inbound/outbound current sequence number
- heartbeat/test-request round-trip times
- resend volume (messages_replayed, gap_count)
- duplicate-drop counts by business key (ExecID, ClOrdID, OrderID)
- message processing lag (ingest timestamp vs application timestamp)
- logon/logout reason codes and text
If these are missing, incident response becomes guesswork.
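As a rough illustration, the fields above could be carried in one per-session record. This is a minimal sketch: the class name `SessionMetrics`, its field names, units, and the two helper methods are assumptions for illustration, not part of any FIX engine's API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SessionMetrics:
    """Hypothetical per-session record; all names and units are illustrative."""
    session_id: str                      # e.g. "SENDER/TARGET" built from the CompIDs
    connection_state: str = "connected"  # connected, logon_sent, logon_acked, established
    in_seq: int = 1                      # next expected inbound MsgSeqNum
    out_seq: int = 1                     # next outbound MsgSeqNum
    last_test_request_rtt_ms: float = 0.0
    messages_replayed: int = 0
    gap_count: int = 0
    duplicate_drops: Dict[str, int] = field(default_factory=dict)  # keyed by business key name
    processing_lag_ms: float = 0.0       # ingest timestamp vs application timestamp
    last_logout_reason: str = ""

    def record_gap(self, expected: int, received: int) -> None:
        # A gap exists only when the inbound sequence number is ahead of what we expected.
        if received > expected:
            self.gap_count += 1

    def record_duplicate(self, key_name: str) -> None:
        # Count dropped duplicates per business key (ExecID, ClOrdID, OrderID).
        self.duplicate_drops[key_name] = self.duplicate_drops.get(key_name, 0) + 1
```

Exporting one such record per session to the metrics pipeline gives incident responders a single place to look first.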
Session health SLOs (practical defaults)
For liquid-market live trading sessions:
- Heartbeat miss detection: <= 1.5 × HeartBtInt
- TestRequest response timeout: <= 2 × HeartBtInt
- Reconnect completion p95: < 5s
- Replay completion p95 (normal gap): < 2s
- Duplicate application rate: 0 (hard requirement)
- Morning session ready at least 5 minutes before strategy start
Tune per venue, but keep explicit targets.
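The two liveness targets can be sketched as a small monitor. This assumes a hypothetical `LivenessMonitor` class driven by a monotonic clock, and it assumes the TestRequest timeout is measured from the moment the probe is sent; neither detail comes from a specific FIX engine.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Liveness(Enum):
    OK = "ok"
    SEND_TEST_REQUEST = "send_test_request"
    DEAD = "dead"


@dataclass
class LivenessMonitor:
    """Hypothetical helper; all times are seconds on a monotonic clock."""
    heartbt_int: float
    last_inbound: float = 0.0
    test_request_sent_at: Optional[float] = None

    def on_inbound(self, now: float) -> None:
        # Any inbound message proves the link is alive and clears a pending probe.
        self.last_inbound = now
        self.test_request_sent_at = None

    def check(self, now: float) -> Liveness:
        if self.test_request_sent_at is not None:
            # Probe outstanding: give up after 2 x HeartBtInt without a reply.
            if now - self.test_request_sent_at > 2 * self.heartbt_int:
                return Liveness.DEAD
            return Liveness.OK
        if now - self.last_inbound > 1.5 * self.heartbt_int:
            # Silence exceeded the miss budget: caller should send TestRequest(1).
            self.test_request_sent_at = now
            return Liveness.SEND_TEST_REQUEST
        return Liveness.OK
```

A caller would invoke `check` on a timer, send a TestRequest when asked to, and force a reconnect on `DEAD`.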
Sequence and replay safety rules
Rule 1: Never “quick-fix” sequence by ad-hoc manual edits during market hours
Manual number jumps can unstick one session and corrupt downstream state.
Rule 2: Distinguish transport recovery from business-state recovery
Receiving replayed messages is not enough; application state must converge (open qty, leaves qty, cumulative qty, status timeline).
Rule 3: Deduplicate by stable business identifiers
For execution events, dedupe primarily by (session, ExecID) with guard rails for broker-specific behavior.
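A minimal sketch of Rule 3, assuming a hypothetical in-memory `ExecDeduper`. A production version would need persistence and bounded memory, and some brokers reuse ExecID across days or sessions, so the key may need extra components.

```python
from typing import Set, Tuple


class ExecDeduper:
    """Hypothetical in-memory dedupe keyed by (session_id, ExecID)."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()
        self.duplicates_dropped = 0

    def should_apply(self, session_id: str, exec_id: str) -> bool:
        # Return True the first time a key is seen; count and reject repeats.
        key = (session_id, exec_id)
        if key in self._seen:
            self.duplicates_dropped += 1
            return False
        self._seen.add(key)
        return True
```

The same ExecID arriving on a different session is treated as a distinct event, which matches the (session, ExecID) key described above.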
Rule 4: Keep resend windows bounded
For very large gaps, use chunked replay windows to avoid parser/CPU spikes and event-loop starvation.
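One way to bound a resend window is to split the gap into inclusive ranges and request them one at a time. The function name `resend_chunks` and the `max_chunk` parameter are illustrative; the yielded pairs would map to BeginSeqNo and EndSeqNo on each ResendRequest(2).

```python
from typing import Iterator, Tuple


def resend_chunks(begin: int, end: int, max_chunk: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (BeginSeqNo, EndSeqNo) ranges covering begin..end."""
    if max_chunk < 1:
        raise ValueError("max_chunk must be >= 1")
    start = begin
    while start <= end:
        # Cap each range at max_chunk messages, never past the end of the gap.
        stop = min(start + max_chunk - 1, end)
        yield (start, stop)
        start = stop + 1
```

For example, a gap of 101..350 with `max_chunk=100` yields (101, 200), (201, 300), (301, 350). Requesting the next range only after the previous one has been fully applied keeps the replay backlog bounded.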
Rule 5: Persist sequence/state atomically
Crash-safe persistence should ensure sequence checkpoint and applied business event index move together.
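A sketch of Rule 5 using SQLite from the Python standard library, so that the sequence checkpoint and the applied-event index commit or roll back together. The table layout and the function names `open_store` and `apply_exec` are assumptions for illustration, not a prescribed schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two hypothetical tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint ("
        "session_id TEXT PRIMARY KEY, next_in_seq INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_execs ("
        "session_id TEXT NOT NULL, exec_id TEXT NOT NULL, "
        "PRIMARY KEY (session_id, exec_id))"
    )
    conn.commit()
    return conn


def apply_exec(conn: sqlite3.Connection, session_id: str, seq_num: int, exec_id: str) -> bool:
    """Record the event and advance the checkpoint in one transaction.

    Returns False if this (session_id, exec_id) was already applied.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_execs VALUES (?, ?)",
            (session_id, exec_id),
        )
        is_new = cur.rowcount == 1
        conn.execute(
            "INSERT OR IGNORE INTO checkpoint VALUES (?, 0)", (session_id,)
        )
        # MAX keeps the checkpoint from moving backwards on replayed messages.
        conn.execute(
            "UPDATE checkpoint SET next_in_seq = MAX(next_in_seq, ?) "
            "WHERE session_id = ?",
            (seq_num + 1, session_id),
        )
    return is_new
```

Because both writes share one transaction, a crash between them cannot leave a checkpoint that claims an event was consumed while the business index says it was not.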
Reconnect state machine (reference)
- DISCONNECTED: socket down or liveness timeout
- CONNECTING: TCP/TLS setup
- LOGON_NEGOTIATION: Logon(A), heartbeat interval agreement, reset flags
- RECOVERY: gap detection + ResendRequest handling + dedupe
- LIVE: normal flow
- DEGRADED: replay backlog or elevated processing lag (trade with tighter safety caps)
- SAFE: cannot guarantee correctness; throttle/pause strategy participation
Use hysteresis: harder to leave DEGRADED/SAFE than to enter.
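The hysteresis idea can be sketched for the LIVE/DEGRADED pair alone: one bad sample is enough to enter DEGRADED, while several consecutive healthy samples below a stricter threshold are needed to leave. The class name `HealthGate` and all thresholds are assumed values for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    LIVE = "LIVE"
    DEGRADED = "DEGRADED"


@dataclass
class HealthGate:
    """Hypothetical two-state gate; thresholds are illustrative, not recommendations."""
    enter_lag_ms: float = 50.0  # assumed: lag above this enters DEGRADED at once
    exit_lag_ms: float = 20.0   # assumed: lag must be below this to count as healthy
    exit_samples: int = 3       # assumed: consecutive healthy samples needed to leave
    state: State = State.LIVE
    _healthy_streak: int = 0

    def observe(self, lag_ms: float) -> State:
        if self.state is State.LIVE:
            if lag_ms > self.enter_lag_ms:
                self.state = State.DEGRADED
                self._healthy_streak = 0
        else:
            if lag_ms < self.exit_lag_ms:
                self._healthy_streak += 1
                if self._healthy_streak >= self.exit_samples:
                    self.state = State.LIVE
            else:
                # Any unhealthy sample resets the streak, making exit harder than entry.
                self._healthy_streak = 0
        return self.state
```

The gap between the entry and exit thresholds, plus the streak requirement, stops the session from flapping between modes when lag hovers near a single limit.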
Incident runbook (fast path)
A) No fills for suspiciously long period
- Check heartbeat/test-request RTT and last inbound timestamp.
- If liveness uncertain: force reconnect (graceful logout if possible).
- Enter RECOVERY; block new parent slices until sequence aligned.
- Rebuild per-order state from replay + drop copy.
- Release throttled flow only after state checks pass.
B) Continuous ResendRequest loop
- Inspect reset policy mismatch (ResetSeqNumFlag, daily reset schedule).
- Verify persisted seq checkpoint vs counterparty expectation.
- If mismatch unresolved in N attempts, escalate to SAFE mode and broker desk.
- Avoid repeated blind reconnect loops; they amplify queue congestion.
C) Duplicate execution suspicion
- Freeze automated position-dependent tactic switching.
- Run duplicate detector on (ExecID, side, qty, price, transactTime).
- Reconcile with drop copy / back-office feed.
- Correct position snapshot before re-enabling normal control loop.
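The duplicate detector in runbook C can be sketched as a count over the composite key. This assumes execution reports are available as dictionaries of string fields and maps "qty" and "price" to the FIX fields LastQty and LastPx; both the mapping and the function name `find_duplicates` are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str, str, str]


def find_duplicates(reports: Iterable[Dict[str, str]]) -> List[Key]:
    """Return composite keys that appear more than once, in first-seen order."""
    counts = Counter(
        (r["ExecID"], r["Side"], r["LastQty"], r["LastPx"], r["TransactTime"])
        for r in reports
    )
    return [key for key, n in counts.items() if n > 1]
```

Any key returned here is a candidate for a double-applied fill and should be checked against the drop copy before positions are corrected.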
Pre-open checklist (daily)
- Session reset policy verified per venue/broker
- Persistent seq store integrity check passed
- Last session clean logout status verified
- Synthetic heartbeat/test-request check passed
- Replay drill (small controlled gap) passes in staging or dry-mode
- Alert channels and on-call escalation path confirmed
If any fail, run reduced-risk mode at open.
Good alerting (what to page on)
Page-worthy:
- liveness failure > timeout budget
- reconnect attempts exceed threshold
- resend gap > configured max window
- replay lag breaching execution decision budget
- duplicate business event detected
- sequence reset mismatch at session start
Not page-worthy alone:
- single transient reconnect with full automatic recovery under SLO
Chaos tests worth automating
- Drop inbound heartbeats for N intervals
- Inject sequence gap bursts (small + large)
- Force process crash between seq checkpoint and business apply
- Reorder delayed replay chunks
- Duplicate a subset of execution reports
- Simulate daily reset mismatch across counterparties
Promote code only if state converges correctly under all six.
Desk-level takeaway
A fast strategy on a fragile FIX session is fake speed.
Most “mysterious execution incidents” are session reliability incidents in disguise. If liveness, sequence integrity, replay discipline, and dedupe are explicit SLO-governed controls, you prevent avoidable slippage and far nastier position errors.