Parent-Child Desync Orphan-Child-Order Slippage Playbook
Date: 2026-03-15
Category: research
Focus: Modeling and controlling slippage when parent schedulers lose authoritative state and orphan child orders keep trading.
1) Why this failure mode matters
Execution stacks usually assume one invariant:
parent state == live child-order state
In production, that invariant can fail during short network partitions, gateway restarts, drop-copy lag, or timeout-driven retries:
- parent believes child orders are canceled/expired,
- one or more child orders are still live at venue/broker,
- parent launches replacement flow,
- effective participation doubles (or worse),
- cleanup happens late and at worse prices.
This creates a slippage branch that often gets mislabeled as "market moved." In reality, it is state-desynchronization slippage.
2) Mechanism map
2.1 Desync entry points
Common triggers:
- cancel ACK timeout interpreted as cancel success,
- session reconnect with incomplete open-order snapshot,
- divergent order-status semantics between the OMS view and the broker view,
- duplicate ClOrdID mapping after failover,
- delayed drop-copy reconciliation.
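The first trigger above (a cancel ACK timeout read as cancel success) can be avoided by giving the timeout its own outcome. The following is a minimal sketch, not a reference implementation: the status strings, the `CancelOutcome` enum, and both function names are hypothetical and would need mapping to the actual gateway's messages.

```python
from enum import Enum


class CancelOutcome(Enum):
    CONFIRMED = "confirmed"  # venue acknowledged the cancel
    REJECTED = "rejected"    # venue rejected the cancel request
    UNKNOWN = "unknown"      # deadline passed with no response
    PENDING = "pending"      # still inside the timeout window


def classify_cancel(ack_status, elapsed_ms, timeout_ms):
    """Classify a cancel attempt without guessing on timeout.

    ack_status is None while no response has arrived (assumed convention).
    """
    if ack_status == "canceled":
        return CancelOutcome.CONFIRMED
    if ack_status == "rejected":
        return CancelOutcome.REJECTED
    if elapsed_ms >= timeout_ms:
        # A timeout says nothing about the venue state: the child may be live.
        return CancelOutcome.UNKNOWN
    return CancelOutcome.PENDING


def may_launch_replacement(outcome):
    # Only a confirmed cancel frees the quantity for a replacement child.
    return outcome is CancelOutcome.CONFIRMED


timed_out = classify_cancel(None, elapsed_ms=600, timeout_ms=500)
print(timed_out, may_launch_replacement(timed_out))
```

With this split, an UNKNOWN outcome routes to reconciliation instead of to a replacement launch.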
2.2 Orphan lifecycle
- Birth: parent loses linkage to a live child.
- Amplification: new child orders launched against already-live residuals.
- Collision: unintended self-competition or self-trade-prevention rejects.
- Forced unwind: emergency flatten/cleanup in thin liquidity windows.
The expensive part is usually not the orphan itself, but the late discovery + urgent correction loop.
3) Cost decomposition
Let total execution cost be:
[ C_{total} = C_{base} + C_{overparticipation} + C_{collision} + C_{cleanup} ]
Where:
- (C_{base}): expected cost without desync,
- (C_{overparticipation}): extra impact from unintended volume,
- (C_{collision}): self-competition / reject-retry churn cost,
- (C_{cleanup}): forced cancel/flatten/re-hedge cost.
Expected-value framing by state:
[ \mathbb{E}[C] = p_S C_S + p_D C_D ]
- (p_S): synchronized-state probability,
- (p_D): desync probability.
The goal is not only lowering (p_D), but shrinking (C_D) through fast containment.
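A small worked example makes the two levers concrete. It assumes the two states are exhaustive, so p_S = 1 - p_D, and the probabilities and costs (in basis points) are purely illustrative numbers, not measurements.

```python
def expected_cost(p_desync, cost_sync_bps, cost_desync_bps):
    """E[C] = p_S * C_S + p_D * C_D, with p_S = 1 - p_D (assumed)."""
    p_sync = 1.0 - p_desync
    return p_sync * cost_sync_bps + p_desync * cost_desync_bps


# Illustrative inputs only.
baseline = expected_cost(0.02, 5.0, 60.0)          # about 6.10 bps
halve_probability = expected_cost(0.01, 5.0, 60.0)  # about 5.55 bps
halve_desync_cost = expected_cost(0.02, 5.0, 30.0)  # about 5.50 bps

print(round(baseline, 2), round(halve_probability, 2), round(halve_desync_cost, 2))
```

Under these made-up inputs, halving C_D through faster containment saves about as much as halving p_D, which is why both levers deserve attention.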
4) Feature set for modeling
4.1 State-consistency features
- open_order_count_oms - open_order_count_broker
- unknown_exec_report_rate
- cancel_ack_timeout_rate
- snapshot_rebuild_age_ms
- dropcopy_lag_ms_p95
- clordid_reuse_collision_count
4.2 Orphan-risk features
- unlinked_live_notional
- suspect_child_age_ms
- parent_generation_gap (active parent version mismatch)
- live_child_without_parent_ratio
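Several of these features fall out of a plain set difference between the OMS view and a broker open-order snapshot. The sketch below assumes a simplified input shape (a set of child IDs from the OMS, and a dict of child ID to live notional from the broker) and treats any broker-live child unknown to the OMS as an orphan; the function name and input shapes are hypothetical.

```python
def orphan_features(oms_open_ids, broker_open_notional):
    """Derive orphan-risk features from two views of open orders.

    oms_open_ids: set of child IDs the OMS believes are open.
    broker_open_notional: dict of child ID -> live notional per broker snapshot.
    """
    # Children live at the broker that the OMS no longer tracks.
    orphan_ids = sorted(set(broker_open_notional) - set(oms_open_ids))
    unlinked = sum(broker_open_notional[c] for c in orphan_ids)
    live_count = len(broker_open_notional)
    return {
        "open_order_count_gap": len(oms_open_ids) - live_count,
        "orphan_ids": orphan_ids,
        "unlinked_live_notional": unlinked,
        "live_child_without_parent_ratio": (
            len(orphan_ids) / live_count if live_count else 0.0
        ),
    }


features = orphan_features(
    {"c1", "c2"},
    {"c1": 10_000.0, "c2": 5_000.0, "c3": 7_500.0},
)
print(features)
```

A negative count gap together with non-zero unlinked notional is the signature of a live child the parent has lost track of.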
4.3 Market interaction features
- spread z-score at containment time,
- top-of-book depth percentile,
- short-horizon volatility,
- time-to-close / auction proximity.
Desync is most costly when containment occurs in thin/high-volatility intervals.
5) Operational metrics
5.1 OER — Orphan Exposure Ratio
[ OER = \frac{\text{unlinked\_live\_notional}}{\text{active\_parent\_notional} + \epsilon} ]
Direct measure of hidden live risk.
5.2 RDL — Reconciliation Delay Lag
[ RDL = \text{p95}\left(t_{\text{recognized\_by\_parent}} - t_{\text{live\_at\_venue}}\right) ]
How long the parent is blind to true state.
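One possible way to compute RDL from reconstructed timestamps is sketched below. It measures each lag as parent recognition time minus venue time, so a blind interval is positive, and it uses the nearest-rank percentile, which is only one of several common p95 definitions. The event tuple layout and function names are assumptions.

```python
def p95_nearest_rank(values):
    """95th percentile by the nearest-rank method (1-based rank)."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def reconciliation_lags_ms(events):
    """events: iterable of (t_live_at_venue_ms, t_recognized_by_parent_ms)."""
    return [recognized - live for live, recognized in events]


# Twenty synthetic events with lags of 10, 20, ..., 200 ms.
events = [(1_000, 1_000 + 10 * i) for i in range(1, 21)]
rdl_ms = p95_nearest_rank(reconciliation_lags_ms(events))
print(rdl_ms)
```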
5.3 DPP — Duplicate Participation Pressure
[ DPP = \frac{\text{realized\_participation}}{\text{target\_participation}} ]
DPP > 1 indicates overparticipation from desync/retry overlap.
5.4 OST — Orphan Slippage Tax
[ OST = \frac{C_{overparticipation} + C_{collision} + C_{cleanup}}{\text{executed\_notional}} ]
Primary KPI for this regime.
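The three ratio metrics (OER, DPP, OST) translate directly into code. This is a minimal sketch: the function names, the epsilon default, and the sample inputs are illustrative assumptions, and the cost terms are taken to be in the same currency as executed notional.

```python
def orphan_exposure_ratio(unlinked_live_notional, active_parent_notional, eps=1e-9):
    # eps guards against division by zero when no parent is active.
    return unlinked_live_notional / (active_parent_notional + eps)


def duplicate_participation_pressure(realized_participation, target_participation):
    return realized_participation / target_participation


def orphan_slippage_tax(c_overparticipation, c_collision, c_cleanup, executed_notional):
    return (c_overparticipation + c_collision + c_cleanup) / executed_notional


# Illustrative inputs only.
oer = orphan_exposure_ratio(250_000.0, 1_000_000.0)
dpp = duplicate_participation_pressure(0.18, 0.10)
ost = orphan_slippage_tax(300.0, 50.0, 150.0, 1_000_000.0)

print(f"OER={oer:.2f} DPP={dpp:.2f} OST={ost * 1e4:.1f} bps")
```

In this made-up case DPP is 1.8, meaning the stack traded at nearly twice its target rate, and the desync-specific cost is 5 bps of executed notional.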
6) State machine and controls
SYNCED
- normal scheduling/routing,
- full tactic set enabled.
DESYNC_SUSPECT
Triggered when OER or RDL crosses watch threshold.
- freeze new aggressive child launches,
- enforce low-risk passive mode,
- request authoritative broker open-order snapshot.
ORPHAN_CONTAINMENT
Triggered on confirmed orphan(s).
- block parent generation rollover,
- issue idempotent cancel sweep by account/symbol/side scope,
- cap participation at strict emergency ceiling,
- avoid replace storms until state converges.
RECONCILED_RECOVERY
- resume child launches gradually,
- require OER/RDL clear for hold period (hysteresis),
- post-incident attribution + symbol/venue confidence update.
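The four states can be sketched as a pure transition function. This is one possible reading of the controls above, not a definitive design: the threshold values, the `clear_ticks` counter (consecutive observations with both metrics below watch level, maintained by the caller), and the choice to leave DESYNC_SUSPECT through RECONCILED_RECOVERY are all assumptions.

```python
from enum import Enum


class State(Enum):
    SYNCED = "SYNCED"
    DESYNC_SUSPECT = "DESYNC_SUSPECT"
    ORPHAN_CONTAINMENT = "ORPHAN_CONTAINMENT"
    RECONCILED_RECOVERY = "RECONCILED_RECOVERY"


def next_state(state, oer, rdl_ms, orphan_confirmed, clear_ticks,
               oer_watch=0.05, rdl_watch_ms=500, hold_ticks=3):
    """Return the next state; all thresholds are hypothetical defaults."""
    if orphan_confirmed:
        return State.ORPHAN_CONTAINMENT  # confirmed orphans override everything
    breached = oer > oer_watch or rdl_ms > rdl_watch_ms
    if state is State.SYNCED:
        return State.DESYNC_SUSPECT if breached else State.SYNCED
    if state is State.DESYNC_SUSPECT:
        return State.DESYNC_SUSPECT if breached else State.RECONCILED_RECOVERY
    if state is State.ORPHAN_CONTAINMENT:
        return State.ORPHAN_CONTAINMENT if breached else State.RECONCILED_RECOVERY
    # RECONCILED_RECOVERY: hysteresis, stay until metrics hold clear long enough.
    if breached:
        return State.DESYNC_SUSPECT
    return State.SYNCED if clear_ticks >= hold_ticks else State.RECONCILED_RECOVERY


s = next_state(State.SYNCED, oer=0.10, rdl_ms=100, orphan_confirmed=False, clear_ticks=0)
print(s)
```

Keeping the function pure (no I/O, no clock) makes it easy to replay against reconstructed incident timelines.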
7) Practical modeling approach
- Reconstruct truth timeline from OMS, gateway logs, broker reports, drop-copy.
- Label incidents (no_desync, suspect, confirmed_orphan, cleanup).
- Estimate hazard of orphan creation after timeout/reconnect/failover events.
- Simulate policies:
  - immediate retry,
  - retry with authoritative snapshot gate,
  - containment-first then relaunch.
- Evaluate tail outcomes (q95 OST, max DPP, incident duration).
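As a starting point for the hazard step, the empirical orphan frequency per trigger type can be computed from labeled events. This is a crude stand-in for a proper hazard model (it ignores time-since-event and covariates), and the event tuple layout and sample data are hypothetical.

```python
from collections import Counter


def orphan_rate_by_trigger(events):
    """events: iterable of (trigger_type, orphan_created) pairs.

    Returns the fraction of events of each trigger type that were
    followed by a confirmed orphan.
    """
    totals = Counter()
    orphans = Counter()
    for trigger, orphan_created in events:
        totals[trigger] += 1
        if orphan_created:
            orphans[trigger] += 1
    return {trigger: orphans[trigger] / totals[trigger] for trigger in totals}


# Synthetic labels for illustration only.
labeled = [
    ("timeout", True), ("timeout", False), ("timeout", False), ("timeout", False),
    ("reconnect", True), ("reconnect", False),
    ("failover", False),
]
rates = orphan_rate_by_trigger(labeled)
print(rates)
```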
8) 30-day rollout plan
Week 1 — Data contract + identifiers
- enforce immutable parent_id, child_id, generation_id,
- guarantee idempotency keys for cancel/retry flows,
- unify OMS↔broker status code mapping.
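One way to get idempotent cancel/retry flows is to derive the key deterministically from the immutable identifiers, so a retry after a timeout or failover produces the same key and can be deduplicated. The sketch below is an assumption-laden illustration: the key layout, the `CancelSender` class, and the in-memory dedupe set are hypothetical, and a real system would persist sent keys across restarts.

```python
import hashlib


def idempotency_key(parent_id, child_id, generation_id, action):
    """Deterministic key: same identifiers and action always hash the same."""
    raw = "|".join([parent_id, child_id, str(generation_id), action])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CancelSender:
    """Sends each logical cancel at most once (in-memory sketch)."""

    def __init__(self):
        self._sent = set()
        self.wire_log = []  # stands in for the actual gateway call

    def send_cancel(self, parent_id, child_id, generation_id):
        key = idempotency_key(parent_id, child_id, generation_id, "cancel")
        if key in self._sent:
            return False  # duplicate retry suppressed
        self._sent.add(key)
        self.wire_log.append(key)
        return True


sender = CancelSender()
first = sender.send_cancel("P-1", "C-7", 3)
retry = sender.send_cancel("P-1", "C-7", 3)
print(first, retry, len(sender.wire_log))
```

Including generation_id in the key keeps a cancel for an old parent generation distinct from one issued after rollover.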
Week 2 — Shadow detection
- compute OER/RDL/DPP/OST without behavior change,
- detect venue/broker-specific desync hotspots,
- validate incident reconstruction quality.
Week 3 — Containment policy pilot
- activate DESYNC_SUSPECT gates on small traffic slice,
- require snapshot confirmation before relaunch,
- compare OST tails vs control.
Week 4 — Scale + runbooks
- roll out containment globally with per-venue thresholds,
- add pager alerts for OER and RDL spikes,
- codify kill-switch + recovery runbook.
9) Common anti-patterns
- Assuming cancel timeout == cancel success.
- Relaunching children before authoritative open-order sync.
- Measuring only average slippage (tail desync episodes hidden).
- Reusing order IDs across failover boundaries.
- Treating drop-copy as optional for intraday truth.
10) Bottom line
When parent and live child state diverge, your execution stack can trade more than intended without noticing immediately.
That hidden overparticipation is preventable slippage. Model desync risk explicitly, monitor orphan exposure in real time, and enforce containment-first recovery. If you can reduce reconciliation lag and incident tail size, you cut both impact cost and operational surprises.