Coordinated Omission in Latency Benchmarks: A Practical Detection & Mitigation Playbook

2026-03-08 · software

Category: knowledge
Domain: software / performance engineering / reliability

Why this matters

Many load tests underreport tail latency exactly when systems are unhealthy.

If your generator waits for a slow response before sending the next request, it silently stops sampling the bad period. This is coordinated omission (CO): the measurement process unintentionally synchronizes with the SUT and omits pain.

Result: p99/p99.9 can look great while users are having a terrible time.


One-line definition

Coordinated omission = a benchmarking artifact in which request generation and latency sampling are coupled to the completion of prior requests, so slow periods suppress new arrivals and the worst samples go unrecorded.


Mental model: open vs closed workload

Closed model (CO-prone by default)

Each virtual user sends a request, waits for the response (plus any think time), then sends the next. Arrival rate is coupled to SUT speed: when the SUT slows down, the generator slows down with it and stops sampling the slow period.

Open model (CO-resistant)

Requests are issued on an independent schedule (constant or Poisson arrivals), regardless of whether earlier responses have returned. A slow SUT causes queueing, which shows up as measured latency instead of disappearing.

Rule of thumb: if real-world arrivals do not wait for your responses (independent users, upstream callers, scheduled jobs), benchmark with an open model.
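Under illustrative assumptions (a caller-supplied blocking `send` function, one thread per scheduled request), the two loops can be sketched as:

```python
import threading
import time

def closed_loop(send, n_requests):
    """Closed model: the next request waits for the previous response.
    If the SUT stalls, the generator stalls with it (CO-prone)."""
    latencies = []
    for _ in range(n_requests):
        start = time.monotonic()
        send()  # blocks until the response arrives
        latencies.append(time.monotonic() - start)
    return latencies

def open_loop(send, interval_s, n_requests):
    """Open model: requests are issued on a fixed schedule regardless of
    how long earlier responses take (CO-resistant)."""
    latencies = []
    lock = threading.Lock()
    t0 = time.monotonic()
    threads = []
    for i in range(n_requests):
        planned = t0 + i * interval_s
        time.sleep(max(0.0, planned - time.monotonic()))
        def worker(planned=planned):
            send()
            with lock:
                # measure from the *planned* send time, not the actual one
                latencies.append(time.monotonic() - planned)
        t = threading.Thread(target=worker)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return latencies
```

Note that the open loop keeps issuing requests even while earlier ones are in flight; a stalled SUT inflates the recorded latencies instead of silencing the generator.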


Why CO destroys tail truth

Suppose the target arrival interval is 10ms (100 rps) and the SUT stalls for 2 seconds.

Under an open model, roughly 200 requests arrive during the stall and experience waits ranging from ~2s down to 10ms. A closed-loop generator instead blocks on the stalled request, records a single 2s sample, and never sends the other ~199. This massively compresses tail probability mass and makes percentiles look safer than reality.
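A toy simulation of exactly this scenario (numbers from the example above; the 1ms baseline service time and 60s run length are added assumptions) makes the distortion concrete:

```python
# 100 rps target (10 ms interval), a 60 s run in which the SUT stalls for 2 s.
# Base service time is 1 ms; arrivals during the stall wait for it to end.

def pctl(samples, p):
    """Simple percentile: value at index floor(p * n) of the sorted samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p * len(s)))]

interval_ms, run_ms, stall_start_ms, stall_ms, base_ms = 10, 60_000, 30_000, 2_000, 1

# Closed loop: the generator blocks on the stalled request, so the stall
# yields exactly ONE bad sample and ~200 planned arrivals are never sent.
closed = []
t = 0
while t < run_ms:
    lat = stall_ms if t == stall_start_ms else base_ms
    closed.append(lat)
    t += max(interval_ms, lat)  # next send waits for this response

# Open loop: arrivals keep coming every 10 ms; each arrival during the
# stall waits until the stall ends, then takes the base service time.
open_ = []
for t in range(0, run_ms, interval_ms):
    if stall_start_ms <= t < stall_start_ms + stall_ms:
        open_.append((stall_start_ms + stall_ms - t) + base_ms)
    else:
        open_.append(base_ms)

print(f"closed p99 = {pctl(closed, 0.99)} ms, open p99 = {pctl(open_, 0.99)} ms")
```

The closed-loop p99 stays at the 1ms baseline, while the open-model p99 lands well over a second: the same stall, two wildly different stories.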


Practical symptoms that you likely have CO

  1. Latency tails barely move when you intentionally inject pauses/stalls.
  2. Throughput drops during degradation, but percentile charts remain oddly smooth.
  3. Your tool uses fixed VU loops and offers no constant-arrival executor for your scenario.
  4. Results look much better than production telemetry at similar offered load.
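A quick self-check from your generator's actual send timestamps (the helper name and the 1.5x tolerance are assumptions): if a meaningful fraction of inter-send gaps exceed the target interval, sends were being paced by responses, i.e. the loop was coordinating with the SUT.

```python
def schedule_drift(send_times_s, target_interval_s, tolerance=1.5):
    """Return the fraction of inter-send gaps that exceeded the target
    interval by more than `tolerance` times."""
    gaps = [b - a for a, b in zip(send_times_s, send_times_s[1:])]
    late = sum(1 for g in gaps if g > target_interval_s * tolerance)
    return late / len(gaps) if gaps else 0.0

# Example: a 10 ms schedule where one response stalled the sender for ~2 s.
times = [i * 0.010 for i in range(100)] + [3.0 + i * 0.010 for i in range(100)]
print(f"{schedule_drift(times, 0.010):.1%} of gaps missed the schedule")
```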

Mitigation ladder (in order)

1) Pick the right load model first

For arrival-rate/SLO studies, prefer:

  1. Open-model / constant-arrival-rate executors (e.g., wrk2, k6's constant-arrival-rate, Vegeta's fixed-rate attacks).
  2. Fixed-interval or Poisson schedules that match how production traffic actually arrives.
  3. Enough generator concurrency headroom that the schedule never blocks on in-flight requests.

2) Record latency relative to planned send time

Measure latency from when the request should have been sent (its scheduled slot), not only from the actual send timestamp.

That captures queueing delay caused by missed schedule slots.
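A sketch of recording both views per request (names are assumptions; `call` is the blocking request function): the service latency is what the server took, the schedule-relative latency is what a user arriving on time would have experienced.

```python
import time

def timed_call(planned_send_s, call):
    """Time one request two ways: from actual send, and from planned send."""
    actual_send = time.monotonic()
    call()
    done = time.monotonic()
    service_latency = done - actual_send      # server-side view
    response_latency = done - planned_send_s  # includes missed-slot queueing
    return service_latency, response_latency
```

When the generator falls behind schedule, `response_latency` grows even if each individual call stays fast, which is exactly the delay CO hides.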

3) Use CO-aware histogram recording

HdrHistogram supports correction methods such as recordValueWithExpectedInterval and corrected copy/add variants (e.g., copyCorrectedForCoordinatedOmission).

Conceptually, when an observed latency L exceeds expected interval I, add synthetic samples at L-I, L-2I, ... down to I.

This approximates omitted arrivals that experienced long waits.

4) Publish full percentile spectrum

Always expose p50, p90, p99, p99.9, p99.99, and the observed max.

Do not rely on median or p95-only dashboards.

5) Run controlled “truth tests”

Inject deterministic stalls (e.g., 1s pause every 30s) and verify your tooling reflects expected tail inflation. If not, instrumentation is lying.


Tooling notes

  1. wrk2 extends wrk with a constant-throughput mode and records latency from the intended send time, correcting for CO.
  2. k6's constant-arrival-rate and ramping-arrival-rate executors implement an open model; plain VU loops are closed.
  3. Gatling distinguishes open and closed injection profiles explicitly.
  4. JMeter thread groups are closed-loop by default.
  5. HdrHistogram ports exist for Java, C, Go, Rust, Python, and other languages.


Minimal benchmark protocol (production-worthy)

  1. Define objective: saturation curve, SLO at fixed offered load, or capacity frontier.
  2. Choose workload model matching objective (open for arrival-rate realism).
  3. Fix offered load schedule (warmup, steady-state, stress steps).
  4. Capture both offered and achieved rates.
  5. Log full latency histogram per interval (not just summarized percentiles).
  6. Report CO-corrected and raw views when possible.
  7. Inject synthetic pauses to validate measurement honesty.
  8. Compare benchmark tails with production traces for sanity.
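Step 5 of the protocol can be sketched with a coarse per-interval histogram (bucket bounds and class names here are illustrative assumptions; a production setup would use an HdrHistogram per interval):

```python
from collections import defaultdict

# Illustrative power-of-two bucket upper bounds, in milliseconds.
BUCKETS_MS = (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048)

def bucket(latency_ms):
    """Map a latency to its bucket's upper bound."""
    for b in BUCKETS_MS:
        if latency_ms <= b:
            return b
    return float("inf")  # overflow bucket: never discard the max

class IntervalHistograms:
    """One coarse latency histogram per one-second interval."""
    def __init__(self):
        # interval index -> {bucket upper bound -> count}
        self.hists = defaultdict(lambda: defaultdict(int))

    def record(self, t_s, latency_ms):
        self.hists[int(t_s)][bucket(latency_ms)] += 1

    def worst_bucket(self, interval):
        h = self.hists[interval]
        return max(h) if h else None
```

Keeping the full per-interval distribution means a later CO correction or percentile re-computation needs no re-run, only re-aggregation.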

Common mistakes

  1. Using closed-loop VU tests to claim fixed-RPS SLOs
    → invalid inference.

  2. Plotting only p50/p95
    → hides user-impactful outliers.

  3. Ignoring max values as “noise”
    → often throwing away the signal.

  4. Generator saturation mistaken for SUT saturation
    → benchmarking the client, not the server.

  5. No pause/fault injection in benchmark validation
    → cannot detect CO failure mode.


Decision cheatsheet

  1. Validating an SLO at a fixed offered load → open model + schedule-relative timing + CO-corrected histograms.
  2. Finding maximum sustainable throughput → a closed loop is acceptable, but report it as capacity, not latency-at-rate.
  3. Modeling interactive sessions with think time → a closed model can be realistic; still validate with pause injection.


Implementation snippet (conceptual)

Given expected interval I:
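A minimal sketch in plain Python (milliseconds assumed; a real implementation would record into a histogram rather than build a list):

```python
def corrected_record(samples_ms, expected_interval_ms):
    """For each observed latency L exceeding the expected interval I,
    also emit synthetic samples L-I, L-2I, ... down to I, approximating
    the arrivals the stalled generator never sent."""
    out = []
    for L in samples_ms:
        out.append(L)
        synthetic = L - expected_interval_ms
        while synthetic >= expected_interval_ms:
            out.append(synthetic)
            synthetic -= expected_interval_ms
    return out
```

For example, a single 35ms sample at a 10ms expected interval expands to 35, 25, and 15ms, standing in for the two arrivals that would have waited behind it.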

This reconstructs omitted samples implied by schedule misses.


KPI set to track benchmark integrity

  1. Offered vs. achieved rate, per interval.
  2. Count (or fraction) of missed schedule slots.
  3. Full percentile spectrum including max, in both raw and CO-corrected views.
  4. Delta between corrected and raw tails (a large delta means heavy omission).

If these are absent, you are probably optimizing against comforting artifacts.


One-line takeaway

Before tuning your service, verify your benchmark is not coordinating away the very tail events your users actually feel.

