Inference Timeout + Fallback Policy Drift Slippage Playbook
Date: 2026-03-24
Category: research
Scope: How model-serving timeouts and fallback routing policies create hidden execution-cost drift in live slippage stacks
Why this matters
Many modern execution engines now place an ML policy in the critical path (child-order aggression, venue ranking, or cancel/replace timing).
When inference latency spikes or dependencies fail, systems typically fail over to a fallback policy (heuristic router, stale model snapshot, or “safe default” schedule).
That fallback is usually tested for uptime, but not always for cost-shape equivalence.
Result in production:
- completion reliability may stay acceptable,
- headline uptime may look healthy,
- but implementation shortfall tails silently widen.
This is a classic “operationally healthy, economically degraded” regime.
Failure mechanism (operator timeline)
- Primary policy is used under normal inference latency.
- Latency variance rises (GC, network, feature-store lag, hot shard, GPU queueing).
- Timeout threshold is crossed for a growing share of requests.
- Fallback policy starts serving decisions.
- Fallback has different aggression/venue/cancel logic than primary.
- During volatile windows, this policy gap amplifies cost and markout.
- TCA attributes damage to “market noise” unless fallback telemetry is joined.
Key point: this is not only model quality decay. It is control-path regime switching.
Extend slippage decomposition with fallback-regime term
[ IS = IS_{market} + IS_{impact} + IS_{timing} + IS_{fees} + \underbrace{IS_{fallback}}_{\text{timeout/failover policy tax}} ]
Practical approximation:
[ IS_{fallback,t} \approx a\cdot FAR_t + b\cdot TSR_t + c\cdot PMG_t + d\cdot SAD_t + e\cdot URX_t ]
Where:
- (FAR): fallback activation ratio,
- (TSR): timeout switch rate,
- (PMG): policy mismatch gap (primary vs fallback action delta),
- (SAD): shadow-action divergence,
- (URX): urgency × fallback interaction.
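The linear approximation above is straightforward to evaluate once the five telemetry series exist. A minimal sketch, with hand-set coefficients standing in for values that would actually be fit by matched-cohort regression:

```python
import numpy as np

# Hypothetical coefficients a..e (bps per unit); in practice these come
# from regression against matched-cohort slippage, not from hand-tuning.
coeffs = np.array([12.0, 0.8, 5.0, 4.0, 9.0])
features = np.array([
    0.15,  # FAR_t: fallback activation ratio
    3.0,   # TSR_t: primary->fallback switches per minute
    0.4,   # PMG_t: policy mismatch gap (action-vector L2 delta)
    0.25,  # SAD_t: shadow-action divergence probability
    0.10,  # URX_t: urgency x fallback interaction
])

# IS_fallback,t ~ a*FAR + b*TSR + c*PMG + d*SAD + e*URX
is_fallback_bps = float(coeffs @ features)
```

The point of the linear form is interpretability: each term attributes a slice of the fallback tax to one observable serving symptom.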
Production metrics to add
1) Fallback Activation Ratio (FAR)
[ FAR = \frac{\#\,\text{orders served by fallback}}{\#\,\text{total eligible orders}} ]
Track by symbol-liquidity bucket, session phase, and urgency tier.
2) Timeout Switch Rate (TSR)
[ TSR = \frac{\#\,\text{primary→fallback switches}}{\text{minute}} ]
High TSR usually indicates instability rather than a single outage.
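Both ratios fall out of the same per-order decision log. A sketch, assuming each decision is logged as a `(minute, serving_path)` pair (the log shape is illustrative):

```python
from collections import Counter

# Hypothetical per-order decision log: (minute, serving_path).
log = [
    (0, "primary"), (0, "primary"), (0, "fallback"),
    (1, "fallback"), (1, "primary"), (1, "fallback"),
    (2, "primary"), (2, "primary"),
]

# FAR: share of eligible orders served by the fallback policy.
served = Counter(path for _, path in log)
far = served["fallback"] / sum(served.values())

# TSR: primary->fallback transitions per minute of log span.
switches = sum(
    1 for (_, prev), (_, cur) in zip(log, log[1:])
    if prev == "primary" and cur == "fallback"
)
minutes = log[-1][0] - log[0][0] + 1
tsr = switches / minutes
```

In production these would be computed per symbol-liquidity bucket and session phase rather than over a flat log.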
3) Policy Mismatch Gap (PMG)
Compute comparable action vectors from primary and fallback:
- target participation,
- urgency score,
- venue weights,
- cancel/replace threshold.
[ PMG = E\left[\lVert a_{primary} - a_{fallback} \rVert_2\right] ]
4) Shadow-Action Divergence (SAD)
Run primary in shadow during fallback episodes and measure divergence frequency:
[ SAD = P\left(\text{primary action class} \neq \text{fallback action class}\right) ]
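PMG and SAD both come from the same shadow logs: paired primary/fallback action vectors recorded during fallback episodes. A minimal sketch, with a made-up four-element action vector and a hypothetical urgency-based class bucketing:

```python
import numpy as np

# Hypothetical paired action vectors logged during fallback episodes:
# [participation, urgency, venue_weight, cancel_threshold]
primary = np.array([
    [0.10, 0.8, 0.6, 0.02],
    [0.12, 0.9, 0.5, 0.02],
    [0.08, 0.3, 0.7, 0.05],
])
fallback = np.array([
    [0.06, 0.5, 0.4, 0.04],
    [0.06, 0.5, 0.4, 0.04],
    [0.06, 0.5, 0.4, 0.04],  # static heuristic: same action every time
])

# PMG: mean L2 distance between primary and fallback action vectors.
pmg = float(np.linalg.norm(primary - fallback, axis=1).mean())

# SAD: fraction of decisions whose discretized action class differs.
def action_class(a):
    return "aggressive" if a[1] >= 0.5 else "passive"

sad = float(np.mean([action_class(p) != action_class(f)
                     for p, f in zip(primary, fallback)]))
```

Note how a static fallback can have modest PMG on average yet still diverge in class exactly on the low-urgency or high-urgency decisions that matter.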
5) Fallback Cost Uplift (FCU)
Matched-cohort uplift vs primary-serving windows:
[ FCU_{q95} = q95(IS\mid fallback) - q95(IS\mid primary) ]
Also compute mean and q99.
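A sketch of the FCU computation on synthetic cohorts, constructed so that fallback windows share the primary's median but carry a fatter right tail (the distributions are illustrative, not calibrated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matched-cohort slippage samples (bps).
is_primary = rng.normal(3.0, 2.0, 5000)
is_fallback = np.concatenate([
    rng.normal(3.0, 2.0, 4500),
    rng.normal(3.0, 8.0, 500),   # tail-widening fallback episodes
])

def uplift(q):
    """Quantile uplift of fallback-serving windows over primary."""
    return float(np.quantile(is_fallback, q) - np.quantile(is_primary, q))

fcu = {
    "mean": float(is_fallback.mean() - is_primary.mean()),
    "q95": uplift(0.95),
    "q99": uplift(0.99),
}
```

This is exactly the "operationally healthy, economically degraded" signature: the mean uplift is near zero while q95 and especially q99 uplift are clearly positive.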
6) Urgency-Regime Interaction (URX)
[ URX = \Delta IS\big|_{\text{high urgency},\,fallback} - \Delta IS\big|_{\text{low urgency},\,fallback} ]
Fallback damage is often nonlinear in urgency.
Queueing lens (why this explodes abruptly)
Even modest load increases can push inference into a nonlinear waiting regime.
- Little’s Law: (L=\lambda W) ties in-flight queue depth to latency.
- Under high utilization, tail wait-time rises faster than mean.
- Timeout probability can jump sharply once p95/p99 nears timeout budget.
So a small latency shift can produce a large FAR/TSR jump, causing sudden slippage regime breaks.
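The nonlinearity is easy to see with a toy M/M/1 model, where the sojourn time is exponential with rate μ(1 − ρ), so the tail beyond a fixed timeout budget is exp(−μ(1 − ρ)t). The service time and timeout budget below are illustrative:

```python
import math

def timeout_prob(rho, service_ms=20.0, timeout_ms=100.0):
    """P(sojourn time > timeout) for an M/M/1 server at utilization rho.

    Sojourn time ~ Exp(mu - lambda) = Exp(mu * (1 - rho)),
    where mu = 1 / mean service time.
    """
    mu = 1.0 / service_ms
    return math.exp(-mu * (1.0 - rho) * timeout_ms)

for rho in (0.70, 0.90, 0.95):
    print(f"utilization {rho:.2f}: P(timeout) = {timeout_prob(rho):.3f}")
```

Moving utilization from 0.70 to 0.95 more than triples the timeout probability under these numbers, which is the abrupt FAR/TSR jump described above.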
Modeling architecture
Stage 1: serving-regime classifier
Inputs (rolling windows):
- p50/p95/p99 inference latency,
- timeout count,
- FAR/TSR,
- queue depth/concurrency,
- dependency health (feature store / RPC / cache).
Output:
- (P(\text{PRIMARY}))
- (P(\text{FALLBACK}))
- (P(\text{FLAPPING}))
Stage 2: conditional slippage model
Estimate slippage conditional on regime and urgency:
[ \Delta IS \sim \beta_1\,urgency + \beta_2\,p_{fallback} + \beta_3\,(urgency\times p_{fallback}) + \beta_4\,PMG ]
This explicitly separates market-driven cost from fallback-driven cost.
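The Stage 2 regression is an ordinary least-squares fit with an explicit interaction column. A sketch on synthetic data with known coefficients (the true betas here are invented to make the interaction effect visible):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
urgency = rng.uniform(0, 1, n)
p_fallback = rng.uniform(0, 1, n)
pmg = rng.uniform(0, 0.5, n)

# Hypothetical ground truth with a strong urgency x fallback interaction.
delta_is = (2.0 * urgency + 3.0 * p_fallback
            + 6.0 * urgency * p_fallback + 4.0 * pmg
            + rng.normal(0, 0.5, n))

# Design matrix matching the regression above: beta_1..beta_4.
X = np.column_stack([urgency, p_fallback, urgency * p_fallback, pmg])
beta, *_ = np.linalg.lstsq(X, delta_is, rcond=None)
```

A large recovered beta_3 relative to beta_1 and beta_2 is the quantitative statement that fallback damage is concentrated in high-urgency flow.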
Controller state machine
GREEN — PRIMARY_STABLE
- FAR near zero, stable latency tails.
- Normal adaptive execution.
YELLOW — LATENCY_PRESSURE
- p95/p99 rising, occasional switches.
- Actions:
- preemptive load-shedding for non-critical requests,
- tighten timeout observability,
- increase shadow logging density.
ORANGE — FALLBACK_ACTIVE
- FAR/TSR elevated, PMG non-trivial.
- Actions:
- cap aggression slope,
- freeze high-variance venue reranking,
- shift to smoother participation schedule.
RED — FLAPPING_OR_DEGRADED
- frequent regime toggling or high FCU tail breach.
- Actions:
- lock into conservative fail-safe execution profile,
- suspend unstable adaptive knobs,
- page on-call and preserve forensic traces.
Use hysteresis + minimum dwell time to avoid control oscillation.
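The hysteresis and dwell-time rule can be sketched as a small controller. Thresholds mirror the GREEN/YELLOW/ORANGE/RED bands above; the exit thresholds and `min_dwell` knob are illustrative assumptions:

```python
class RegimeController:
    """Escalates immediately, de-escalates only after min_dwell calm ticks.

    ENTER/EXIT threshold pairs give hysteresis: the exit threshold sits
    below the entry threshold, so the state does not flap at a boundary.
    """

    LEVELS = ["GREEN", "YELLOW", "ORANGE", "RED"]
    ENTER = [0.0, 0.10, 0.30, 0.70]   # escalate when p_fallback >= ENTER[i]
    EXIT = [0.0, 0.05, 0.20, 0.55]    # de-escalate when p_fallback < EXIT[level]

    def __init__(self, min_dwell=5):
        self.level = 0
        self.min_dwell = min_dwell
        self.calm_ticks = 0

    def update(self, p_fallback):
        # Escalate to the highest band whose entry threshold is crossed.
        target = max(i for i in range(4) if p_fallback >= self.ENTER[i])
        if target > self.level:
            self.level = target
            self.calm_ticks = 0
        elif self.level > 0 and p_fallback < self.EXIT[self.level]:
            self.calm_ticks += 1
            if self.calm_ticks >= self.min_dwell:
                self.level -= 1
                self.calm_ticks = 0
        else:
            self.calm_ticks = 0
        return self.LEVELS[self.level]
```

De-escalation by one band at a time, gated on consecutive calm observations, is what prevents the controller itself from becoming a second source of flapping.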
Engineering mitigations (high ROI first)
- Design fallback for economic parity, not just availability: evaluate fallback against slippage KPIs, not only uptime/SLA.
- Shadow primary during fallback windows: persist counterfactual actions to quantify PMG/SAD continuously.
- Timeout budget by session regime: tight open/close budgets, looser midday budgets when safe.
- Fallback-specific risk envelope: apply stricter participation/aggression caps when fallback is active.
- Anti-flap guardrails: avoid rapid primary↔fallback toggles; toggling itself raises cost variance.
- Join serving telemetry into TCA: no fallback fields in TCA = chronic misattribution.
Validation protocol
- Label each child-order decision with serving regime (primary/fallback/flapping).
- Build matched cohorts by symbol, spread, volatility, urgency, and participation.
- Compare mean, q95, q99 slippage and markout by regime.
- Run canary mitigations (timeout tuning, anti-flap hysteresis, fallback caps).
- Promote only if tail improvement persists without completion failure.
Practical observability checklist
- inference latency histogram (p50/p95/p99) by model + symbol bucket
- timeout count and timeout reason taxonomy
- FAR/TSR by session phase (open/midday/close)
- shadow primary outputs during fallback episodes
- action deltas (primary vs fallback) for aggression/venue/cancel behavior
- matched-cohort TCA with regime labels
Success criterion: reduced q95/q99 slippage during degraded serving windows, not just fewer 5xx or timeouts.
Pseudocode sketch
serving = get_serving_metrics()  # latency tails, FAR, TSR, queue depth
p_fallback = regime_model.predict_proba(serving)["fallback"]

if p_fallback < 0.1:
    state = "GREEN"
    params = primary_policy()
elif p_fallback < 0.3:
    state = "YELLOW"
    params = guarded_primary_policy()
elif p_fallback < 0.7:
    state = "ORANGE"
    params = fallback_capped_policy()
else:
    state = "RED"
    params = failsafe_policy()

action = execute(params)
log(state=state, p_fallback=p_fallback, action=action)
Bottom line
Inference fallback is often treated as a reliability concern, but in execution systems it is also a microstructure-cost concern.
If your slippage model ignores serving-regime transitions, you will under-measure tail risk exactly when markets are least forgiving.
References
- Dean, J., & Barroso, L. A. (2013). The Tail at Scale. Communications of the ACM. https://www.barroso.org/publications/TheTailAtScale.pdf
- NVIDIA Triton Inference Server, Dynamic Batching and Optimization docs (latency/throughput tradeoffs): https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/batcher.html
- Fowler, M. Circuit Breaker pattern (degraded-mode control primitive): https://martinfowler.com/bliki/CircuitBreaker.html
- Little, J. D. C. (1961). A Proof for the Queueing Formula: L = λW. Operations Research.
- Almgren, R., & Chriss, N. (2000). Optimal Execution of Portfolio Transactions.