Inference Timeout + Fallback Policy Drift Slippage Playbook

2026-03-24 · finance

Category: research
Scope: How model-serving timeouts and fallback routing policies create hidden execution-cost drift in live slippage stacks

Why this matters

Many modern execution engines now place an ML policy in the critical path (child-order aggression, venue ranking, or cancel/replace timing).

When inference latency spikes or dependencies fail, systems typically fail over to a fallback policy (heuristic router, stale model snapshot, or “safe default” schedule).

That fallback is usually tested for uptime, but not always for cost-shape equivalence.

Result in production: execution quality quietly degrades while availability dashboards stay green. This is a classic "operationally healthy, economically degraded" regime.


Failure mechanism (operator timeline)

  1. Primary policy is used under normal inference latency.
  2. Latency variance rises (GC, network, feature-store lag, hot shard, GPU queueing).
  3. Timeout threshold is crossed for a growing share of requests.
  4. Fallback policy starts serving decisions.
  5. Fallback has different aggression/venue/cancel logic than primary.
  6. During volatile windows, this policy gap amplifies cost and markout.
  7. TCA attributes damage to “market noise” unless fallback telemetry is joined.

Key point: this is not only model quality decay. It is control-path regime switching.
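A small simulation makes step 2→3 of the timeline concrete: two latency distributions with the same median but different tail weight produce wildly different timeout-miss rates. All numbers below (timeout, lognormal parameters) are illustrative assumptions, not measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
TIMEOUT_MS = 50.0

# Two hypothetical latency regimes with the same ~20 ms median but different tails.
calm = rng.lognormal(mean=3.0, sigma=0.3, size=10_000)
pressured = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

def fallback_activation_ratio(latencies_ms, timeout_ms):
    """Share of decisions that miss the inference timeout and get served by fallback."""
    return float(np.mean(np.asarray(latencies_ms) > timeout_ms))

far_calm = fallback_activation_ratio(calm, TIMEOUT_MS)
far_pressured = fallback_activation_ratio(pressured, TIMEOUT_MS)
```

Median latency alone looks healthy in both regimes; only the tail-driven miss rate reveals the regime switch.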


Extend slippage decomposition with fallback-regime term

[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{fallback}}_{\text{timeout/failover policy tax}} ]

Practical approximation:

[ IS_{fallback,t} \approx a\cdot FAR_t + b\cdot TSR_t + c\cdot PMG_t + d\cdot SAD_t + e\cdot URX_t ]

Where FAR, TSR, PMG, SAD, and URX are the serving-regime metrics defined in the next section, and a through e are coefficients fit on matched cohorts.

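Once these metrics are logged per window, the approximation can be fit by ordinary least squares. A minimal sketch on synthetic data, where the distributions and "true" coefficients are pure assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical per-window regime metrics, columns ordered as in the equation above.
X = np.column_stack([
    rng.uniform(0, 1, n),     # FAR
    rng.poisson(2, n),        # TSR
    rng.exponential(1, n),    # PMG
    rng.uniform(0, 1, n),     # SAD
    rng.normal(0, 1, n),      # URX
])
true_coefs = np.array([4.0, 0.5, 1.2, 2.0, 0.8])  # assumed bps sensitivities
is_fallback = X @ true_coefs + rng.normal(0, 0.5, n)

# Least-squares estimates of a..e.
coefs, *_ = np.linalg.lstsq(X, is_fallback, rcond=None)
```

In production the regressors are correlated (FAR and TSR spike together), so regularization or careful cohort matching is usually needed before reading the coefficients causally.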

Production metrics to add

1) Fallback Activation Ratio (FAR)

[ FAR = \frac{\#\,\text{orders served by fallback}}{\#\,\text{total eligible orders}} ]

Track by symbol-liquidity bucket, session phase, and urgency tier.
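A minimal sketch of that per-bucket tracking, assuming a per-order log with a boolean fallback flag (column names are hypothetical):

```python
import pandas as pd

# Hypothetical per-order log joining serving telemetry with order context.
orders = pd.DataFrame({
    "liq_bucket": ["large", "large", "small", "small", "small"],
    "session":    ["open", "mid", "open", "open", "close"],
    "served_by_fallback": [0, 0, 1, 1, 0],
})

# FAR per (liquidity bucket, session phase): mean of the 0/1 fallback flag.
far_by_bucket = (
    orders.groupby(["liq_bucket", "session"])["served_by_fallback"]
    .mean()
    .rename("FAR")
)
```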

2) Timeout Switch Rate (TSR)

[ TSR = \frac{\#\,\text{primary}\rightarrow\text{fallback switches}}{\text{minute}} ]

High TSR usually indicates instability rather than a single outage.
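One way to compute TSR from an ordered per-decision regime log (a sketch; the label values and timestamp units are assumptions):

```python
import numpy as np

def timeout_switch_rate(regimes, timestamps_s):
    """primary->fallback switches per minute over the observed span.

    regimes: per-decision "primary"/"fallback" labels in time order.
    timestamps_s: matching timestamps in seconds.
    """
    regimes = np.asarray(regimes)
    switches = np.sum((regimes[:-1] == "primary") & (regimes[1:] == "fallback"))
    minutes = (timestamps_s[-1] - timestamps_s[0]) / 60.0
    return switches / minutes

# A flapping window: six switches in eleven seconds, vs one clean failover.
flapping = ["primary", "fallback"] * 6
tsr = timeout_switch_rate(flapping, list(range(12)))
```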

3) Policy Mismatch Gap (PMG)

Compute comparable action vectors from primary and fallback:

[ PMG = E\left[\lVert a_{primary} - a_{fallback} \rVert_2\right] ]

4) Shadow-Action Divergence (SAD)

Run primary in shadow during fallback episodes and measure divergence frequency:

[ SAD = P\left(\text{primary action class} \neq \text{fallback action class}\right) ]
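PMG and SAD can be computed from the same shadow logs. A sketch with hypothetical action vectors (aggression, participation, venue score) and discrete action classes, all values illustrative:

```python
import numpy as np

# Hypothetical comparable action vectors from primary and fallback policies.
a_primary = np.array([[0.8, 0.10, 0.5],
                      [0.6, 0.08, 0.4],
                      [0.9, 0.12, 0.7]])
a_fallback = np.array([[0.5, 0.10, 0.5],
                       [0.6, 0.08, 0.4],
                       [0.4, 0.05, 0.2]])

# PMG: mean L2 distance between primary and fallback action vectors.
pmg = np.linalg.norm(a_primary - a_fallback, axis=1).mean()

# SAD: frequency of discrete action-class disagreement during shadow runs.
primary_class = np.array(["aggressive", "passive", "aggressive"])
fallback_class = np.array(["passive", "passive", "passive"])
sad = float(np.mean(primary_class != fallback_class))
```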

5) Fallback Cost Uplift (FCU)

Matched-cohort uplift vs primary-serving windows:

[ FCU_{q95} = q95(IS\mid fallback) - q95(IS\mid primary) ]

Also compute mean and q99.
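A quantile-uplift sketch on synthetic matched-cohort slippage samples, which also shows why the mean alone hides the damage (distribution parameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical matched-cohort slippage in bps: similar mean, fatter fallback tail.
is_primary = rng.normal(2.0, 3.0, 5_000)
is_fallback = rng.normal(2.5, 6.0, 5_000)

def fallback_cost_uplift(is_fb, is_pr, q):
    """Quantile difference between fallback-serving and primary-serving windows."""
    return np.quantile(is_fb, q) - np.quantile(is_pr, q)

fcu_mean = is_fallback.mean() - is_primary.mean()
fcu_q95 = fallback_cost_uplift(is_fallback, is_primary, 0.95)
fcu_q99 = fallback_cost_uplift(is_fallback, is_primary, 0.99)
```

Here the mean uplift is small while the q95/q99 uplift is several bps: exactly the tail-only degradation pattern this playbook targets.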

6) Urgency-Regime Interaction (URX)

[ URX = \Delta IS\big|_{\text{high urgency, fallback}} - \Delta IS\big|_{\text{low urgency, fallback}} ]

Fallback damage is often nonlinear in urgency.


Queueing lens (why this explodes abruptly)

Even modest load increases can push inference into a nonlinear waiting regime.

So a small latency shift can produce a large FAR/TSR jump, causing sudden slippage regime breaks.
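A one-line queueing model makes the nonlinearity concrete. Under an M/M/1 approximation (an assumption; real serving is closer to G/G/k with batching), mean queueing delay is

[ W_q = \frac{\rho}{\mu(1-\rho)}, \qquad \rho = \frac{\lambda}{\mu} ]

so moving utilization from 0.80 to 0.95 multiplies W_q by nearly 5x while offered load rises less than 20%. Timeout-miss probability inherits this hyperbolic sensitivity.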


Modeling architecture

Stage 1: serving-regime classifier

Inputs (rolling windows): inference latency tails, FAR, TSR, and queue depth.

Output: per-window serving-regime probabilities, in particular p_fallback.
Stage 2: conditional slippage model

Estimate slippage conditional on regime and urgency:

[ \Delta IS \sim \beta_1\,\text{urgency} + \beta_2\,p_{fallback} + \beta_3\,(\text{urgency}\times p_{fallback}) + \beta_4\,PMG ]

This explicitly separates market-driven cost from fallback-driven cost.


Controller state machine

GREEN — PRIMARY_STABLE: primary policy serving normally; run unrestricted primary parameters.

YELLOW — LATENCY_PRESSURE: latency tails rising; run a guarded primary policy.

ORANGE — FALLBACK_ACTIVE: fallback serving a material share of decisions; apply fallback-specific caps.

RED — FLAPPING_OR_DEGRADED: rapid primary↔fallback switching or sustained degradation; run the failsafe policy.

Use hysteresis + minimum dwell time to avoid control oscillation.
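One way to implement hysteresis plus minimum dwell around this state machine (thresholds mirror the pseudocode sketch later in this playbook; the dwell and hysteresis values are assumptions):

```python
class RegimeController:
    """GREEN..RED controller with anti-flap guardrails."""

    ORDER = ["GREEN", "YELLOW", "ORANGE", "RED"]
    ENTER = {"GREEN": 0.0, "YELLOW": 0.1, "ORANGE": 0.3, "RED": 0.7}

    def __init__(self, min_dwell_s=30.0, hysteresis=0.05):
        self.state = "GREEN"
        self.last_switch_t = float("-inf")
        self.min_dwell_s = min_dwell_s
        self.hysteresis = hysteresis

    def _target(self, p_fallback):
        # Highest state whose entry threshold is met.
        return max((s for s in self.ORDER if p_fallback >= self.ENTER[s]),
                   key=self.ORDER.index)

    def update(self, p_fallback, now_s):
        target = self._target(p_fallback)
        if target == self.state:
            return self.state
        if now_s - self.last_switch_t < self.min_dwell_s:
            return self.state  # minimum dwell: ignore early flips
        if self.ORDER.index(target) < self.ORDER.index(self.state):
            # de-escalation requires a hysteresis margin below the entry threshold
            if p_fallback > self.ENTER[self.state] - self.hysteresis:
                return self.state
        self.state = target
        self.last_switch_t = now_s
        return self.state
```

Escalations are immediate (subject to dwell); de-escalations require both dwell time and a margin, so a p_fallback hovering at a threshold cannot toggle the regime every tick.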


Engineering mitigations (high ROI first)

  1. Design fallback for economic parity, not just availability
    Evaluate fallback against slippage KPIs, not only uptime/SLA.

  2. Shadow primary during fallback windows
    Persist counterfactual actions to quantify PMG/SAD continuously.

  3. Timeout budget by session regime
    Tight open/close budgets, looser midday budgets when safe.

  4. Fallback-specific risk envelope
    Apply stricter participation/aggression caps when fallback is active.

  5. Anti-flap guardrails
    Avoid rapid primary↔fallback toggles; toggling itself raises cost variance.

  6. Join serving telemetry into TCA
    No fallback fields in TCA = chronic misattribution.
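Mitigation 6 is often a one-line join once per-decision telemetry is persisted. A sketch with hypothetical table and column names:

```python
import pandas as pd

# Hypothetical tables: TCA fills and per-decision serving telemetry.
tca = pd.DataFrame({
    "order_id": [1, 2, 3],
    "slippage_bps": [1.2, 7.8, 0.9],
})
serving = pd.DataFrame({
    "order_id": [1, 2, 3],
    "regime": ["primary", "fallback", "primary"],
    "p_fallback": [0.02, 0.81, 0.05],
})

# Join serving regime onto TCA so cost can be conditioned on regime.
tca_labeled = tca.merge(serving, on="order_id", how="left", validate="one_to_one")
by_regime = tca_labeled.groupby("regime")["slippage_bps"].mean()
```

Without this join, fallback-driven cost lands in the unexplained residual and gets read as "market noise".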


Validation protocol

  1. Label each child-order decision with serving regime (primary/fallback/flapping).
  2. Build matched cohorts by symbol, spread, volatility, urgency, and participation.
  3. Compare mean, q95, q99 slippage and markout by regime.
  4. Run canary mitigations (timeout tuning, anti-flap hysteresis, fallback caps).
  5. Promote only if tail improvement persists without completion failure.

Practical observability checklist

  - Per-decision serving-regime label (primary/fallback/flapping) persisted with the order record.
  - FAR and TSR broken out by symbol-liquidity bucket, session phase, and urgency tier.
  - Shadow-primary actions logged during fallback episodes, feeding PMG and SAD.
  - FCU and URX recomputed on matched cohorts.
  - Serving telemetry joined into TCA.

Success criterion: reduced q95/q99 slippage during degraded serving windows, not just fewer 5xx or timeouts.


Pseudocode sketch

serving = get_serving_metrics()  # latency tails, FAR, TSR, queue depth
p_fallback = regime_model.predict_proba(serving)["fallback"]

if p_fallback < 0.1:
    state = "GREEN"
    params = primary_policy()
elif p_fallback < 0.3:
    state = "YELLOW"
    params = guarded_primary_policy()
elif p_fallback < 0.7:
    state = "ORANGE"
    params = fallback_capped_policy()
else:
    state = "RED"
    params = failsafe_policy()

action = execute(params)
log(state=state, p_fallback=p_fallback, action=action)

Bottom line

Inference fallback is often treated as a reliability concern, but in execution systems it is also a microstructure-cost concern.

If your slippage model ignores serving-regime transitions, you will under-measure tail risk exactly when markets are least forgiving.

