Cgroup CPU Quota Throttling Burst Slippage Playbook

2026-03-18 · finance


Why this exists

Execution stacks are increasingly containerized. That helps isolation, but it also introduces a subtle slippage tax: CFS quota throttling (cpu.max) can pause a latency-critical strategy exactly when microstructure is moving.

The host can still look “fine” on average CPU usage while the strategy experiences throttle pauses, post-unthrottle catch-up bursts, and fat dispatch-latency tails.

This playbook treats quota throttling as a first-class slippage factor.


Core failure mode

In cgroup v2, cpu.max = "<quota> <period>" grants <quota> microseconds of CPU runtime per <period> microseconds. If the budget is exhausted, runnable threads in that cgroup are throttled until the next period refill.
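The cpu.max semantics above can be checked straight from the cgroup filesystem. A minimal sketch (helper names are illustrative, not from this playbook); the stall bound assumes a single saturating thread that burns the whole budget at the start of a period:

```python
def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max: "<quota> <period>" in microseconds.
    <quota> may be the literal string "max" (no limit)."""
    quota_s, period_s = text.split()
    quota = None if quota_s == "max" else int(quota_s)
    return quota, int(period_s)

def worst_case_stall_usec(quota, period):
    """Upper bound on one throttle pause for a single saturating thread:
    burn the whole budget at period start, then wait for the refill.
    (Illustrative bound; multi-CPU cgroups can have quota > period.)"""
    if quota is None or quota >= period:
        return 0
    return period - quota
```

With "20000 100000" (20% of one CPU), the bound is an 80 ms pause per period, which is exactly the kind of stall that lands inside a decision / amend / cancel loop.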

For an execution engine, this creates a repeated pattern:

  1. Budget depletion during bursty compute/IO windows.
  2. Throttle pause in decision / amend / cancel loop.
  3. Unthrottle surge at period boundary.
  4. Dispatch clumping (late child-order burst).
  5. Queue reset + adverse selection.

Result: tail implementation shortfall worsens, even if mean completion stays acceptable.


Slippage decomposition with quota-throttle term

For parent order (i):

[ IS_i = C_{spread} + C_{impact} + C_{opportunity} + C_{throttle} ]

Where:

[ C_{throttle} = C_{pause} + C_{catchup} + C_{queue\_decay} + C_{timing\_alias} ]


Production observability (minimum)

1) cgroup CPU telemetry

From cpu.stat (cgroup v2): deltas of nr_periods, nr_throttled, and throttled_usec (plus nr_bursts and burst_usec on kernels with CFS burst support).
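A minimal reader for those counters, assuming the standard cgroup v2 "key value" line format of cpu.stat (the field names are as documented by the kernel; the helper name is ours):

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat ("key value" per line) into a dict.
    Throttling-relevant keys: nr_periods, nr_throttled, throttled_usec
    (plus nr_bursts / burst_usec on kernels with CFS burst support)."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key and value.strip().isdigit():
            stats[key] = int(value)
    return stats
```

In production this would read /sys/fs/cgroup/<path>/cpu.stat on a timer and emit deltas to the metrics pipeline.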

2) Scheduler / pressure context: CPU pressure (PSI via cpu.pressure) and run-queue delay where available.

3) Execution-path telemetry: decision-to-wire dispatch latency, child-order send timestamps, and inter-send gaps.

4) Outcome telemetry: passive fill ratio, queue-position proxies, and per-parent implementation shortfall.


Desk metrics to track

Define these over rolling windows (e.g., 1m / 5m):

  1. TRR (Throttle Ratio Rate)

[ TRR = \frac{\Delta \text{nr\_throttled}}{\Delta \text{nr\_periods}} ]

  2. TDR (Throttle Duty Ratio)

[ TDR = \frac{\Delta \text{throttled\_usec}}{\Delta \text{window\_usec}} ]

  3. PBA (Period-Boundary Aliasing)

Correlation between dispatch spikes and quota period refill boundaries.

  4. CBI (Catch-up Burst Index)

Post-unthrottle child-send rate divided by baseline send rate.

  5. QDL (Queue Decay Loss)

Passive fill-ratio drop conditioned on throttle events vs matched non-throttle windows.
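The TRR, TDR, and CBI definitions above can be computed directly from two cpu.stat snapshots; a sketch (function names are illustrative):

```python
def throttle_metrics(prev, curr, window_usec):
    """Rolling-window throttle metrics from two cpu.stat snapshots.
    TRR = share of quota periods that hit the throttle;
    TDR = share of wall-clock time spent throttled."""
    d_periods = curr["nr_periods"] - prev["nr_periods"]
    d_throttled = curr["nr_throttled"] - prev["nr_throttled"]
    d_thr_usec = curr["throttled_usec"] - prev["throttled_usec"]
    return {
        "TRR": d_throttled / d_periods if d_periods else 0.0,
        "TDR": d_thr_usec / window_usec if window_usec else 0.0,
    }

def catchup_burst_index(post_unthrottle_rate, baseline_rate):
    """CBI: child-order send rate just after unthrottle vs baseline rate."""
    return post_unthrottle_rate / baseline_rate if baseline_rate else float("inf")
```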


Modeling approach

Use baseline + throttle overlay architecture.

Stage A: Baseline slippage model

Normal spread/impact/fill model with market-state features.

Stage B: Quota-throttle uplift model

Predict the incremental throttle uplift using throttle features from the desk metrics above (TRR, TDR, PBA, CBI, QDL) together with order context.

Final estimate:

[ \hat{IS}_{final} = \hat{IS}_{baseline} + \Delta\hat{IS}_{throttle} ]

Train with matched market windows (same symbol/session/volatility buckets) to isolate infra-induced uplift from market confounders.
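One hedged way to implement the matched-window comparison: bucket windows by market state and difference the mean IS of throttled vs non-throttled windows within each bucket. This is a crude estimator of the uplift term, assuming buckets capture the main market confounders:

```python
from collections import defaultdict

def throttle_uplift_by_bucket(windows):
    """Estimate the throttle IS uplift per market-state bucket.

    windows: iterable of (bucket_key, throttled: bool, is_bps: float),
    where bucket_key encodes symbol/session/volatility bucket.
    Returns {bucket: mean IS(throttled) - mean IS(non-throttled)} for
    buckets that contain both kinds of windows; unmatched buckets are
    dropped rather than guessed at.
    """
    acc = defaultdict(lambda: [0.0, 0, 0.0, 0])  # [thr_sum, thr_n, base_sum, base_n]
    for key, throttled, is_bps in windows:
        s = acc[key]
        if throttled:
            s[0] += is_bps; s[1] += 1
        else:
            s[2] += is_bps; s[3] += 1
    return {k: s[0] / s[1] - s[2] / s[3] for k, s in acc.items() if s[1] and s[3]}
```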


Controller state machine

State 1 — NORMAL

Action: normal pacing.

State 2 — QUOTA_EDGE

Action: reduce replace churn, smooth child spacing, raise passive selectivity.

State 3 — THROTTLE_BURST

Action: cap burst size, disable panic catch-up, enforce inter-send minimum gap.

State 4 — SAFE_CONTAIN

Action: conservative completion mode, stricter aggression cap, optional route to non-throttled host pool.

Use hysteresis + minimum dwell times to prevent flapping.
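A minimal sketch of the hysteresis + dwell logic, driven by TRR alone for brevity (thresholds and dwell length are illustrative placeholders, not tuned values):

```python
NORMAL, QUOTA_EDGE, THROTTLE_BURST, SAFE_CONTAIN = range(4)

class ThrottleController:
    """Escalate immediately on rising TRR; step down one level at a time,
    and only after a minimum dwell with TRR comfortably below the band."""
    ENTER = {QUOTA_EDGE: 0.05, THROTTLE_BURST: 0.20, SAFE_CONTAIN: 0.50}
    EXIT_MARGIN = 0.5   # must fall below ENTER * EXIT_MARGIN to step down
    MIN_DWELL = 5       # ticks in a state before any downgrade

    def __init__(self):
        self.state = NORMAL
        self.dwell = 0

    def update(self, trr):
        self.dwell += 1
        target = NORMAL
        for s in (QUOTA_EDGE, THROTTLE_BURST, SAFE_CONTAIN):
            if trr >= self.ENTER[s]:
                target = s
        if target > self.state:                       # escalate immediately
            self.state, self.dwell = target, 0
        elif target < self.state and self.dwell >= self.MIN_DWELL:
            if trr < self.ENTER[self.state] * self.EXIT_MARGIN:
                self.state, self.dwell = self.state - 1, 0
        return self.state
```

Asymmetric transitions (fast up, slow down) are what prevent the flapping the text warns about.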


Mitigation ladder

  1. Right-size quota headroom

    • Avoid sizing from average CPU use; size from p95/p99 burst demand.
    • Explicitly budget risk checks + serialization + retry spikes.
  2. Tune quota period deliberately

    • Shorter period reduces max single stall, but can increase scheduling overhead.
    • Validate period choice with latency-tail experiments, not CPU-average metrics.
  3. Use cpu.weight + topology isolation together

    • Quota alone is a blunt tool.
    • Combine fair-share (cpu.weight) and cpuset isolation for critical paths.
  4. Throttle-aware pacer

    • Detect recent throttle events and suppress catch-up bursts.
    • Prefer bounded repayment over immediate backlog flush.
  5. Host-pool policy

    • Keep urgent flow off aggressively capped multi-tenant pools.
    • Separate “latency-critical” and “batch-contended” deployment classes.
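The "bounded repayment" idea from the throttle-aware pacer step can be sketched as a schedule that caps post-unthrottle sends at a multiple of the baseline rate instead of flushing the backlog in one clump (parameter values are illustrative, not a spec):

```python
def bounded_repayment_schedule(backlog, baseline_rate, burst_cap=1.5, tick_sec=0.1):
    """Spread a post-unthrottle backlog of child orders over ticks,
    capping the send rate at burst_cap x the baseline rate.
    Returns the per-tick send counts."""
    per_tick = max(1, int(round(baseline_rate * burst_cap * tick_sec)))
    schedule = []
    while backlog > 0:
        n = min(per_tick, backlog)
        schedule.append(n)
        backlog -= n
    return schedule
```

A naive flush would emit the whole backlog in tick one; bounding the drain trades a slightly longer repayment for a much smaller dispatch clump at the period boundary.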

Validation drills (must run)

  1. Quota squeeze drill

    • Temporarily tighten cpu.max; verify uplift detector catches IS tail rise.
  2. Period sensitivity drill

    • Sweep period values (e.g., 100ms -> 50ms -> 20ms) with fixed effective quota.
    • Compare q95 dispatch latency, CBI, and q95 IS.
  3. Catch-up policy A/B

    • Compare a naive backlog flush against a bounded repayment policy.
    • Select policy by q95 IS + completion stability, not mean IS alone.
  4. Confounder separation drill

    • Distinguish quota-throttle signatures from network/venue incidents.

Anti-patterns

  • Sizing quota from average CPU usage instead of burst demand.
  • Panic catch-up: flushing the full backlog immediately after an unthrottle.
  • Treating cpu.max as the only isolation control (no cpu.weight / cpuset).
  • Writing off throttle-induced tail slippage as market noise.

Practical rollout checklist

  • Export cpu.stat deltas and the desk metrics (TRR, TDR, PBA, CBI, QDL) per host pool.
  • Stand up the baseline + throttle-overlay slippage model on matched windows.
  • Deploy the controller state machine with hysteresis and minimum dwell times.
  • Run all four validation drills before trusting live mitigation.

Bottom line

In containerized execution systems, cgroup quota is not just a resource-control setting; it is a microstructure timing control.

If you do not model and govern quota-throttle bursts explicitly, you will keep paying a hidden tail-slippage tax that looks like “market noise” but is mostly self-inflicted.

