Parent-Child Desync Orphan-Child-Order Slippage Playbook

Date: 2026-03-15
Category: research
Focus: Modeling and controlling slippage when parent schedulers lose authoritative state and orphan child orders keep trading.

1) Why this failure mode matters

Execution stacks usually assume one invariant:

parent state == live child-order state

In production, that invariant can fail during short network partitions, gateway restarts, drop-copy lag, or timeout-driven retries. A typical failure sequence:

parent believes child orders are canceled/expired,
one or more child orders are still live at venue/broker,
parent launches replacement flow,
effective participation doubles (or worse),
cleanup happens late and at worse prices.

This creates a slippage branch that often gets mislabeled as "market moved." In reality, it is state-desynchronization slippage.

2) Mechanism map

2.1 Desync entry points

Common triggers:

cancel ACK timeout interpreted as cancel success,
session reconnect with incomplete open-order snapshot,
divergent semantics between OMS view vs broker view,
duplicate ClOrdID mapping after failover,
delayed drop-copy reconciliation.
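
The first trigger above can be illustrated with a child-order status model in which a cancel timeout resolves to an explicit unknown state rather than to "canceled". This is a minimal sketch, not a description of any particular OMS; the names ChildStatus, resolve_cancel, and may_launch_replacement are hypothetical.

```python
from enum import Enum


class ChildStatus(Enum):
    LIVE = "live"
    PENDING_CANCEL = "pending_cancel"
    CANCEL_UNKNOWN = "cancel_unknown"  # timeout: venue state not confirmed
    CANCELED = "canceled"


def resolve_cancel(ack_received: bool, timed_out: bool) -> ChildStatus:
    """Map the outcome of a cancel attempt to a child status.

    Only an explicit ACK yields CANCELED. A timeout yields CANCEL_UNKNOWN,
    because the order may still be live at the venue or broker.
    """
    if ack_received:
        return ChildStatus.CANCELED
    if timed_out:
        return ChildStatus.CANCEL_UNKNOWN
    return ChildStatus.PENDING_CANCEL


def may_launch_replacement(status: ChildStatus) -> bool:
    """Allow replacement flow only once the old child is confirmed gone."""
    return status is ChildStatus.CANCELED
```

With this shape, a timed-out cancel blocks the replacement launch until an authoritative source (for example, a broker open-order snapshot) settles the child's status.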

2.2 Orphan lifecycle

Birth: parent loses linkage to a live child.
Amplification: new child orders launched against already-live residuals.
Collision: unintended self-competition or self-trade-prevention rejects.
Forced unwind: emergency flatten/cleanup in thin liquidity windows.

The expensive part is usually not the orphan itself, but the late discovery + urgent correction loop.

3) Cost decomposition

Let total execution cost be:

[ C_{total} = C_{base} + C_{overparticipation} + C_{collision} + C_{cleanup} ]

Where:

(C_{base}): expected cost without desync,
(C_{overparticipation}): extra impact from unintended volume,
(C_{collision}): self-competition / reject-retry churn cost,
(C_{cleanup}): forced cancel/flatten/re-hedge cost.

Expected-value framing by state:

[ \mathbb{E}[C] = p_S C_S + p_D C_D ]

(p_S): probability of the synchronized state,
(p_D = 1 - p_S): probability of the desync state,
(C_S), (C_D): expected cost conditional on each state.

The goal is not only to lower (p_D) but also to shrink (C_D) through fast containment.
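
A small worked example makes the two levers concrete. The sketch below assumes the two states are exhaustive (p_S = 1 - p_D); every number is illustrative, in basis points, and not taken from any dataset.

```python
def expected_cost(p_desync: float, cost_synced: float, cost_desync: float) -> float:
    """Two-state expected cost: E[C] = p_S * C_S + p_D * C_D, with p_S = 1 - p_D."""
    if not 0.0 <= p_desync <= 1.0:
        raise ValueError("p_desync must be in [0, 1]")
    return (1.0 - p_desync) * cost_synced + p_desync * cost_desync


# Illustrative inputs only: C_S = 5 bps, C_D = 60 bps, p_D = 2%.
baseline = expected_cost(0.02, 5.0, 60.0)

# Lever 1: halve the desync probability.
fewer_incidents = expected_cost(0.01, 5.0, 60.0)

# Lever 2: keep p_D, but contain incidents faster so C_D drops to 25 bps.
faster_containment = expected_cost(0.02, 5.0, 25.0)
```

Under these made-up inputs the baseline is about 6.1 bps, halving p_D gives about 5.55 bps, and faster containment gives about 5.4 bps, so shrinking C_D can matter as much as preventing incidents.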

4) Feature set for modeling

4.1 State-consistency features

open_order_count_oms - open_order_count_broker
unknown_exec_report_rate
cancel_ack_timeout_rate
snapshot_rebuild_age_ms
dropcopy_lag_ms_p95
clordid_reuse_collision_count

4.2 Orphan-risk features

unlinked_live_notional
suspect_child_age_ms
parent_generation_gap (active parent version mismatch)
live_child_without_parent_ratio
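
Several of these features fall out of a direct comparison between the broker's live-order view and the child IDs the OMS still links to an active parent. The sketch below assumes a simplified snapshot record; BrokerOrder and its fields are hypothetical, not a real broker schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrokerOrder:
    child_id: str
    notional: float
    live_since_ms: int  # time the broker first reported the order as live


def orphan_risk_features(
    broker_live: list[BrokerOrder],
    oms_open_child_ids: set[str],
    now_ms: int,
) -> dict[str, float]:
    """Compare broker-live orders against the OMS view of open children."""
    # A broker-live order the OMS no longer tracks is a suspected orphan.
    unlinked = [o for o in broker_live if o.child_id not in oms_open_child_ids]
    return {
        "open_order_count_diff": len(oms_open_child_ids) - len(broker_live),
        "unlinked_live_notional": sum(o.notional for o in unlinked),
        "suspect_child_age_ms": max(
            (now_ms - o.live_since_ms for o in unlinked), default=0
        ),
        "live_child_without_parent_ratio": (
            len(unlinked) / len(broker_live) if broker_live else 0.0
        ),
    }
```

A negative open_order_count_diff means the broker sees more live orders than the OMS does, which is the direction that produces orphans.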

4.3 Market interaction features

spread z-score at containment time,
top-of-book depth percentile,
short-horizon volatility,
time-to-close / auction proximity.

Desync is most costly when containment occurs in thin/high-volatility intervals.

5) Operational metrics

5.1 OER — Orphan Exposure Ratio

[ \text{OER} = \frac{\text{unlinked\_live\_notional}}{\text{active\_parent\_notional} + \epsilon} ]

Direct measure of hidden live risk.

5.2 RDL — Reconciliation Delay Lag

[ \text{RDL} = \text{p95}\left(t_{\text{recognized\_by\_parent}} - t_{\text{live\_at\_venue}}\right) ]

How long the parent is blind to true state.
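
One way to compute RDL from reconstructed timestamps uses only the standard library. In this sketch the lag is the time between an order going live at the venue and the parent recognizing it; the (live, recognized) pair layout is an assumption about how the reconstructed timeline is stored.

```python
import statistics


def reconciliation_lags_ms(events: list[tuple[int, int]]) -> list[int]:
    """events holds (t_live_at_venue_ms, t_recognized_by_parent_ms) pairs."""
    return [recognized - live for live, recognized in events]


def rdl_p95(lags_ms: list[int]) -> float:
    """95th percentile of reconciliation lag, linearly interpolated."""
    if len(lags_ms) < 2:
        raise ValueError("need at least two lag observations")
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(lags_ms, n=100, method="inclusive")[94]
```

Clock skew between venue and OMS timestamps can produce negative lags; whether to clamp, drop, or alert on them is a policy choice this sketch leaves open.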

5.3 DPP — Duplicate Participation Pressure

[ \text{DPP} = \frac{\text{realized\_participation}}{\text{target\_participation}} ]

DPP > 1 indicates overparticipation from desync/retry overlap.

5.4 OST — Orphan Slippage Tax

[ \text{OST} = \frac{C_{overparticipation} + C_{collision} + C_{cleanup}}{\text{executed\_notional}} ]

Primary KPI for this regime.
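
The three ratio metrics (OER, DPP, OST) translate directly into code. The sketch below is a plain transcription of the formulas in this section; the default epsilon and the input validation are assumptions added for safety.

```python
EPS = 1e-9  # assumed small constant that guards against division by zero


def oer(unlinked_live_notional: float, active_parent_notional: float,
        eps: float = EPS) -> float:
    """Orphan Exposure Ratio: hidden live notional relative to parent notional."""
    return unlinked_live_notional / (active_parent_notional + eps)


def dpp(realized_participation: float, target_participation: float) -> float:
    """Duplicate Participation Pressure: values above 1 mean overparticipation."""
    if target_participation <= 0:
        raise ValueError("target_participation must be positive")
    return realized_participation / target_participation


def ost(c_overparticipation: float, c_collision: float, c_cleanup: float,
        executed_notional: float) -> float:
    """Orphan Slippage Tax: desync-attributable cost per unit executed notional."""
    if executed_notional <= 0:
        raise ValueError("executed_notional must be positive")
    return (c_overparticipation + c_collision + c_cleanup) / executed_notional
```

For example, a realized participation of 16% against an 8% target gives a DPP of about 2, the doubling case described in section 1.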

6) State machine and controls

`SYNCED`

normal scheduling/routing,
full tactic set enabled.

`DESYNC_SUSPECT`

Triggered when OER or RDL crosses its watch threshold.

freeze new aggressive child launches,
enforce low-risk passive mode,
request authoritative broker open-order snapshot.

`ORPHAN_CONTAINMENT`

Triggered on confirmed orphan(s).

block parent generation rollover,
issue idempotent cancel sweep by account/symbol/side scope,
cap participation at strict emergency ceiling,
avoid replace storms until state converges.

`RECONCILED_RECOVERY`

resume child launches gradually,
require OER and RDL to stay clear for a hold period (hysteresis),
post-incident attribution + symbol/venue confidence update.
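
The four states can be sketched as a small controller. The threshold values, the tick-based hold period, and the rule that a cleared suspect state passes through recovery before returning to `SYNCED` are all assumptions made for illustration; the playbook itself does not fix them.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    SYNCED = "SYNCED"
    DESYNC_SUSPECT = "DESYNC_SUSPECT"
    ORPHAN_CONTAINMENT = "ORPHAN_CONTAINMENT"
    RECONCILED_RECOVERY = "RECONCILED_RECOVERY"


@dataclass
class DesyncController:
    oer_watch: float = 0.01      # hypothetical watch threshold
    rdl_watch_ms: float = 500.0  # hypothetical watch threshold
    hold_ticks: int = 3          # clear ticks required before resuming (hysteresis)
    state: State = State.SYNCED
    clear_ticks: int = 0

    def step(self, oer: float, rdl_ms: float, confirmed_orphans: int) -> State:
        breached = oer > self.oer_watch or rdl_ms > self.rdl_watch_ms
        if confirmed_orphans > 0:
            # Confirmed orphans force containment from any state.
            self.state, self.clear_ticks = State.ORPHAN_CONTAINMENT, 0
        elif self.state is State.SYNCED:
            if breached:
                self.state = State.DESYNC_SUSPECT
        elif self.state in (State.DESYNC_SUSPECT, State.ORPHAN_CONTAINMENT):
            if not breached:
                self.state, self.clear_ticks = State.RECONCILED_RECOVERY, 0
        else:  # RECONCILED_RECOVERY
            if breached:
                self.state, self.clear_ticks = State.DESYNC_SUSPECT, 0
            else:
                self.clear_ticks += 1
                if self.clear_ticks >= self.hold_ticks:
                    self.state = State.SYNCED
        return self.state

    def may_launch_aggressive(self) -> bool:
        """Aggressive child launches are allowed only when fully synced."""
        return self.state is State.SYNCED
```

The hold counter resets on any new breach, so a flapping metric cannot bounce the scheduler straight back to the full tactic set.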

7) Practical modeling approach

Reconstruct truth timeline from OMS, gateway logs, broker reports, drop-copy.
Label incidents (no_desync, suspect, confirmed_orphan, cleanup).
Estimate hazard of orphan creation after timeout/reconnect/failover events.
Simulate policies:
- immediate retry,
- retry with authoritative snapshot gate,
- containment-first then relaunch.
Evaluate tail outcomes (q95 OST, max DPP, incident duration).
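
The estimation and evaluation steps can be prototyped with simple counting before any heavier model is fitted. In the sketch below a crude empirical orphan rate per trigger type stands in for a proper hazard model, and policies are ranked by q95 OST; the event layout and function names are hypothetical.

```python
import statistics
from collections import Counter


def orphan_rate_by_trigger(events: list[tuple[str, bool]]) -> dict[str, float]:
    """events holds (trigger_type, led_to_orphan) pairs from labeled incidents."""
    totals: Counter = Counter()
    orphans: Counter = Counter()
    for trigger, led_to_orphan in events:
        totals[trigger] += 1
        if led_to_orphan:
            orphans[trigger] += 1
    return {trigger: orphans[trigger] / totals[trigger] for trigger in totals}


def q95(values: list[float]) -> float:
    """95th percentile, linearly interpolated (needs at least two values)."""
    return statistics.quantiles(values, n=100, method="inclusive")[94]


def rank_policies_by_tail(
    ost_by_policy: dict[str, list[float]],
) -> list[tuple[str, float]]:
    """Return (policy, q95 OST) pairs sorted from best to worst tail."""
    scored = ((policy, q95(osts)) for policy, osts in ost_by_policy.items())
    return sorted(scored, key=lambda pair: pair[1])
```

Ranking on q95 rather than the mean keeps the comparison focused on the rare, expensive episodes that this regime produces.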

8) 30-day rollout plan

Week 1 — Data contract + identifiers

enforce immutable parent_id, child_id, generation_id,
guarantee idempotency keys for cancel/retry flows,
unify OMS↔broker status code mapping.
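
The Week 1 identifier and idempotency items can be sketched as follows. The key format, the SHA-256 choice, and the receiver-side ledger are illustrative assumptions, not a prescribed wire protocol.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes the identifiers immutable after creation
class OrderRef:
    parent_id: str
    child_id: str
    generation_id: int


def cancel_idempotency_key(ref: OrderRef) -> str:
    """Deterministic key: every retry of the same cancel yields the same key."""
    raw = f"cancel|{ref.parent_id}|{ref.child_id}|{ref.generation_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CancelLedger:
    """Receiver-side record that applies each cancel key at most once."""

    def __init__(self) -> None:
        self._applied: set[str] = set()

    def apply(self, key: str) -> bool:
        """Return True on first application, False for a duplicate retry."""
        if key in self._applied:
            return False
        self._applied.add(key)
        return True
```

Because the key depends only on the immutable identifiers, a sender can safely resend a cancel after a timeout, and the receiver treats the resend as a no-op instead of a second action.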

Week 2 — Shadow detection

compute OER/RDL/DPP/OST without behavior change,
detect venue/broker-specific desync hotspots,
validate incident reconstruction quality.

Week 3 — Containment policy pilot

activate DESYNC_SUSPECT gates on small traffic slice,
require snapshot confirmation before relaunch,
compare OST tails vs control.

Week 4 — Scale + runbooks

roll out containment globally with per-venue thresholds,
add pager alerts for OER and RDL spikes,
codify kill-switch + recovery runbook.

9) Common anti-patterns

Assuming cancel timeout == cancel success.
Relaunching children before authoritative open-order sync.
Measuring only average slippage, which hides tail desync episodes.
Reusing order IDs across failover boundaries.
Treating drop-copy as optional for intraday truth.

10) Bottom line

When parent and live child state diverge, your execution stack can trade more than intended without noticing immediately.

That hidden overparticipation is preventable slippage. Model desync risk explicitly, monitor orphan exposure in real time, and enforce containment-first recovery. If you can reduce reconciliation lag and incident tail size, you cut both impact cost and operational surprises.