Online Change-Point Detection for Streaming Observability Playbook
Date: 2026-02-28
Category: knowledge
Domain: software / observability / reliability engineering
Why this matters
Static thresholds (e.g., alert when p95 latency > 300ms) miss two painful realities:
- baseline shifts (deploys, traffic mix, infra changes),
- subtle-but-persistent drifts that stay below fixed alarms until they become incidents.
Change-point detection gives an operator-grade answer to: "Did the process regime change?"
- not just "is this datapoint large?"
- but "is this still the same distribution/system state?"
This is the difference between catching degradation early vs discovering it during customer impact.
Problem framing (operator view)
For a metric stream (x_t) (e.g., queue wait p95 every minute), detect abrupt or gradual structural changes in:
- mean level,
- variance/tail heaviness,
- trend slope,
- event rate.
In production, the objective is usually:
- Low detection delay for real shifts,
- Bounded false alarms (pager sanity),
- Recoverable decisions (safe rollback/degrade first).
Detector families you should actually use
1) CUSUM (fast mean-shift detector)
Best when you care about small persistent mean shifts and have a reasonably stable baseline.
One-sided upward CUSUM:
[ S_t = \max\big(0, S_{t-1} + (x_t - \mu_0) - k\big) ]
Alert if (S_t > h).
- (\mu_0): in-control mean
- (k): reference slack (often tied to target shift size)
- (h): decision threshold
Practical notes:
- maintain separate up/down CUSUM for regressions vs surprising improvements,
- normalize metrics (z-score or robust scale) before combining streams,
- tune on false alarms/day and median detection delay, not vibes.
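The recursion above fits in a few lines. A minimal sketch of the one-sided upward detector; the parameter values are illustrative, not tuned:

```python
class Cusum:
    """One-sided upward CUSUM; run a mirrored copy on -x for downward shifts."""
    def __init__(self, mu0, k, h):
        self.mu0 = mu0  # in-control mean
        self.k = k      # reference slack, roughly half the shift you want to catch
        self.h = h      # decision threshold
        self.s = 0.0
    def update(self, x):
        self.s = max(0.0, self.s + (x - self.mu0) - self.k)
        return self.s > self.h

det = Cusum(mu0=100.0, k=5.0, h=30.0)
stream = [100, 102, 98, 101] + [115] * 10  # persistent +15 shift after 4 points
alarms = [det.update(x) for x in stream]
# the shift accumulates (15 - 5) = 10 per step, crossing h=30 on its 4th point
```

Note how the slack k absorbs ordinary noise during the baseline points, so S_t stays pinned at zero until the genuine shift begins.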
2) EWMA control chart (smooth + robust to noise)
Useful when raw metrics are noisy and you want smoothed shift detection.
[ z_t = \lambda x_t + (1-\lambda) z_{t-1}, \quad 0 < \lambda \le 1 ]
Smaller (\lambda) = more smoothing, slower reaction.
Operationally:
- great for infra metrics with jitter,
- weaker than CUSUM for very tiny persistent shifts,
- simple and cheap enough for broad deployment.
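A minimal sketch pairing the EWMA recursion with the standard asymptotic control limit L * sigma * sqrt(lambda / (2 - lambda)); mu0, sigma, and L are assumed known from a baseline window:

```python
import math

class EwmaChart:
    """EWMA recursion with the standard asymptotic control limit."""
    def __init__(self, lam, mu0, sigma, L=3.0):
        self.lam, self.mu0, self.sigma, self.L = lam, mu0, sigma, L
        self.z = mu0  # start the smoother at the in-control mean
    def update(self, x):
        self.z = self.lam * x + (1 - self.lam) * self.z
        # asymptotic limit: L * sigma * sqrt(lam / (2 - lam))
        limit = self.L * self.sigma * math.sqrt(self.lam / (2 - self.lam))
        return abs(self.z - self.mu0) > limit

chart = EwmaChart(lam=0.2, mu0=0.0, sigma=1.0)
alarms = [chart.update(x) for x in [0.0] * 5 + [3.0] * 5]
```

With lam=0.2 the limit works out to 1.0, so the smoothed statistic crosses it on the second post-shift point rather than the first: that lag is the price of noise immunity.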
3) Page-Hinkley (stream drift without strict stationarity assumptions)
Classic online detector for mean drift using cumulative deviations from running average.
The typical form tracks the cumulative sum of centered deviations, (m_t = \sum_{i \le t} (x_i - \bar{x}_i - \delta)), and triggers when (m_t) rises more than a threshold (\lambda) above its running minimum.
Why operators like it:
- no full offline training needed,
- low memory,
- works well as a guardrail trigger for retraining or rollback workflows.
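One common formulation, sketched below, keeps only a running mean, the cumulative deviation, and its minimum; delta and threshold values are illustrative:

```python
class PageHinkley:
    """Page-Hinkley test for an upward mean drift."""
    def __init__(self, delta, threshold):
        self.delta = delta          # tolerated magnitude of ordinary fluctuation
        self.threshold = threshold  # alarm when the excursion exceeds this (lambda)
        self.n, self.mean = 0, 0.0
        self.cum, self.cum_min = 0.0, 0.0
    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n   # running average
        self.cum += x - self.mean - self.delta  # cumulative centered deviation
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold

ph = PageHinkley(delta=0.1, threshold=5.0)
alarms = [ph.update(x) for x in [0.0] * 20 + [2.0] * 20]
```

Constant memory regardless of stream length, which is why it ships well as a broad guardrail.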
4) BOCPD (Bayesian Online Change-Point Detection)
Tracks posterior over run length (time since last change), not just one scalar alarm statistic.
Core idea:
- recursively update (P(r_t \mid x_{1:t})),
- combine hazard prior (expected regime duration) + predictive likelihood,
- alarm when posterior mass shifts toward short run lengths.
Great when you need:
- uncertainty-aware detection,
- interpretable confidence about regime changes,
- richer controller logic than binary alarm/no-alarm.
Trade-off: more modeling choices and compute than CUSUM/EWMA.
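A minimal run-length filter is still compact for a Gaussian likelihood with known variance and a conjugate Normal prior on the mean; the hazard and prior values below are illustrative, and real deployments need the modeling choices noted above:

```python
import numpy as np

def bocpd_gaussian(data, hazard=0.01, mu0=0.0, kappa0=1.0, sigma2=1.0):
    """Run-length posterior for a Normal likelihood with known variance sigma2
    and a Normal prior on the mean with pseudo-count strength kappa0."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    mu = np.array([mu0])        # posterior mean under each run-length hypothesis
    kappa = np.array([kappa0])  # effective observation count per hypothesis
    for t, x in enumerate(data, start=1):
        var = sigma2 * (1 + 1 / kappa)  # predictive variance per hypothesis
        pred = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        R[t, 1:t + 1] = R[t - 1, :t] * pred * (1 - hazard)  # run continues
        R[t, 0] = (R[t - 1, :t] * pred).sum() * hazard      # change-point
        R[t] /= R[t].sum()
        mu = np.concatenate(([mu0], (kappa * mu + x) / (kappa + 1)))
        kappa = np.concatenate(([kappa0], kappa + 1))
    return R

data = np.concatenate([np.zeros(30), np.full(30, 5.0)])
R = bocpd_gaussian(data)
# before the change the longest run length dominates; just after it, short ones do
```

The alarm rule in practice thresholds the posterior mass on short run lengths, which gives the uncertainty-aware signal a binary CUSUM cannot.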
5) ADWIN-style adaptive windows (concept drift in data streams)
Maintains variable-length window and statistically cuts old data when distribution shift is detected.
Good fit for:
- streaming ML monitoring,
- nonstationary environments with unknown drift speed,
- automatic adaptation without fixed-size window guessing.
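A simplified sketch of the window-cut idea: real ADWIN uses exponential bucketing for logarithmic cost, while this brute-force version checks every split and assumes values scaled to [0, 1] so the Hoeffding-style bound applies:

```python
import math
from collections import deque

class AdaptiveWindow:
    """Simplified ADWIN-style cut: compare every prefix/suffix split of the
    window and drop the old part when the means differ beyond a bound."""
    def __init__(self, delta=0.002):
        self.delta = delta  # confidence parameter for the cut test
        self.window = deque()
    def update(self, x):
        self.window.append(x)
        drift = False
        changed = True
        while changed and len(self.window) >= 2:
            changed = False
            vals = list(self.window)
            total, s0 = sum(vals), 0.0
            for i in range(1, len(vals)):
                s0 += vals[i - 1]
                n0, n1 = i, len(vals) - i
                m_harm = 1 / (1 / n0 + 1 / n1)  # harmonic sub-window size
                eps = math.sqrt(math.log(4 / self.delta) / (2 * m_harm))
                if abs(s0 / n0 - (total - s0) / n1) > eps:
                    for _ in range(n0):  # cut the stale prefix
                        self.window.popleft()
                    drift = changed = True
                    break
        return drift

adw = AdaptiveWindow()
alarms = [adw.update(x) for x in [0.1] * 40 + [0.9] * 40]
```

After a cut, the surviving window contains only post-change data, so the detector adapts its own baseline without any fixed window-size guess.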
Quick selection matrix
- Need ultra-fast mean shift detection, low complexity: CUSUM
- Need smoother low-noise alarming at scale: EWMA
- Need lightweight online drift trigger: Page-Hinkley
- Need posterior/probabilistic regime beliefs: BOCPD
- Need auto-adaptive windows for concept drift: ADWIN
In practice, a hybrid stack usually wins:
- CUSUM/EWMA for pager-grade service metrics,
- ADWIN/BOCPD for model-performance and distribution-drift channels.
What to monitor (don’t detect on one metric)
Use a compact multi-channel panel:
- User impact metrics: error rate, latency p95/p99, timeout rate
- Load/context metrics: traffic rate, burstiness, request mix
- System internals: queue wait, saturation, retry amplification
- Model/data metrics (if ML in loop): PSI/KS/calibration drift
Then classify change events as:
- impact-only,
- load-only,
- infra-only,
- model-only,
- coupled (highest risk).
Coupled changes are where hidden incidents begin.
Implementation blueprint (production-safe)
Step 1) Build baseline channels per metric
For each metric stream:
- seasonality normalization (hour-of-day/day-of-week where relevant),
- robust scaling (median/MAD or winsorized z-score),
- missing-data handling policy (hold, interpolate, skip update).
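The robust-scaling piece of Step 1 might look like the following; the 1.4826 factor makes MAD a consistent estimate of the standard deviation under normality:

```python
import numpy as np

def robust_zscore(history, x):
    """Scale a new point by median/MAD of a trailing baseline window."""
    h = np.asarray(history, dtype=float)
    med = np.median(h)
    mad = np.median(np.abs(h - med))
    scale = 1.4826 * mad if mad > 0 else 1.0  # guard against constant windows
    return (x - med) / scale

history = [10, 11, 9, 10, 12, 8, 10]  # trailing window for one metric channel
z_center = robust_zscore(history, 10)
z_shift = robust_zscore(history, 10 + 1.4826)
```

Median/MAD keeps a single outlier in the baseline window from inflating the scale, which a plain mean/stddev z-score does not.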
Step 2) Run dual detectors
Per critical metric, run:
- fast detector (CUSUM or Page-Hinkley),
- stable detector (EWMA or BOCPD confidence threshold).
Raise only when one of these holds:
- both detectors confirm, or
- fast detector exceeds a higher threshold with persistence.
This cuts false pages from single-detector quirks.
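The dual-detector gate reduces to a small predicate; the threshold names here are illustrative:

```python
def raise_alarm(fast_score, fast_h, fast_h_high, stable_alarm, persisted):
    """Alarm iff both detectors agree, or the fast detector clears a higher
    bar and has already persisted for the required number of windows."""
    both_confirm = fast_score > fast_h and stable_alarm
    fast_alone = fast_score > fast_h_high and persisted
    return both_confirm or fast_alone

# fast detector fired alone at its normal threshold: suppressed
assert not raise_alarm(5.0, fast_h=4.0, fast_h_high=8.0,
                       stable_alarm=False, persisted=False)
```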
Step 3) Add persistence + hysteresis
Alarm if condition persists for (N) windows.
Clear only after recovery holds for (M) windows where (M > N).
Avoids alert flapping in high-variance periods.
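A sketch of the persistence-plus-hysteresis gate, with N raise windows and M clear windows as described above:

```python
class HysteresisGate:
    """Raise after the condition holds N consecutive windows; clear only
    after M consecutive clean windows, with M > N to prevent flapping."""
    def __init__(self, n_raise, m_clear):
        assert m_clear > n_raise
        self.n, self.m = n_raise, m_clear
        self.bad = self.good = 0
        self.active = False
    def update(self, condition):
        if condition:
            self.bad += 1
            self.good = 0
        else:
            self.good += 1
            self.bad = 0
        if not self.active and self.bad >= self.n:
            self.active = True
        elif self.active and self.good >= self.m:
            self.active = False
        return self.active

g = HysteresisGate(n_raise=2, m_clear=4)
states = [g.update(c) for c in
          [True, True, False, True, False, False, False, False]]
```

Note the single clean window mid-sequence does not clear the alarm; only the final run of four does.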
Step 4) Route to explicit actions
Map alarms into state machine:
- GREEN: observe only
- WATCH: increase sampling/logging, freeze risky rollouts
- AMBER: partial mitigation (rate shaping, brownout, canary rollback)
- RED: hard protective action (rollback, traffic shed, model fallback)
Detection without action mapping is dashboard theater.
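One way to encode the mapping; the action names are hypothetical placeholders for real automation hooks, and escalation moves one level at a time rather than jumping straight to RED:

```python
from enum import Enum

class Level(Enum):
    GREEN = 0
    WATCH = 1
    AMBER = 2
    RED = 3

# hypothetical action names; wire these to real automation in practice
ACTIONS = {
    Level.GREEN: ["observe"],
    Level.WATCH: ["increase_sampling", "freeze_rollouts"],
    Level.AMBER: ["rate_shape", "canary_rollback"],
    Level.RED: ["rollback", "shed_traffic", "model_fallback"],
}

def step(level, alarm_confirmed):
    """Escalate on a confirmed alarm, de-escalate otherwise, one level at a time."""
    delta = 1 if alarm_confirmed else -1
    return Level(min(max(level.value + delta, 0), Level.RED.value))
```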
Calibration protocol (how to avoid pager spam)
Define SLO-like detector targets:
- max false pages/week,
- target median detection delay,
- max miss rate on replayed historical incidents.
Then calibrate with three datasets:
- Quiet periods (for false alarm estimation),
- Known incident windows (for delay/recall),
- Synthetic injections (controlled shift magnitudes, variance jumps, trend breaks).
Track:
- false_alarm_rate_per_day
- detection_delay_minutes
- alert_precision
- alert_recall
- time_in_alert_state
If precision is low, first increase persistence/hysteresis before blindly raising raw thresholds.
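The synthetic-injection leg of the calibration protocol can be sketched as a replay harness; it assumes a detector object with an update(x) -> bool method, and the embedded CUSUM plus all parameter values are illustrative:

```python
import random

class Cusum:
    """Minimal normalized CUSUM (k and h are illustrative, untuned)."""
    def __init__(self, k=0.5, h=8.0):
        self.k, self.h, self.s = k, h, 0.0
    def update(self, x):
        self.s = max(0.0, self.s + x - self.k)
        return self.s > self.h

def evaluate(detector_factory, n_trials=50, n_pre=200, n_post=100, shift=1.0):
    """Inject a mean shift at n_pre into N(0,1) noise; report the per-trial
    false-alarm rate and the median detection delay in samples."""
    rng = random.Random(0)  # fixed seed for reproducible calibration runs
    false_alarms, delays = 0, []
    for _ in range(n_trials):
        det = detector_factory()
        for t in range(n_pre + n_post):
            x = rng.gauss(shift if t >= n_pre else 0.0, 1.0)
            if det.update(x):
                if t < n_pre:
                    false_alarms += 1
                else:
                    delays.append(t - n_pre)
                break
    delays.sort()
    median_delay = delays[len(delays) // 2] if delays else None
    return false_alarms / n_trials, median_delay

far, delay = evaluate(Cusum)
```

Sweeping shift magnitude and threshold here yields the delay-versus-false-page curves needed to pick thresholds against the budget.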
Common failure modes
No seasonality normalization
- daily traffic cycle appears as fake drift.
Detector on one aggregated metric only
- misses segment-specific breakage (region, endpoint, tenant).
No separation of load shift vs quality degradation
- pages infra team for expected demand spikes.
No post-alert label loop
- detector never improves because every alert is forgotten.
No action policy
- alarms fire but nobody knows whether to rollback, scale, or wait.
Minimal rollout plan (2 weeks)
Week 1:
- instrument normalized per-minute streams,
- backtest CUSUM + EWMA on last 30-90 days,
- choose initial thresholds using false-page budget.
Week 2:
- run monitor-only in prod,
- attach incident labels (true/false/benign-change),
- enable WATCH/AMBER automations first,
- delay RED auto-actions until precision is validated.
Goal is boring reliability, not mathematically perfect detectors.
Practical checklist
- Are baselines seasonality-aware?
- Do we run at least two complementary detectors?
- Do alerts require persistence/hysteresis?
- Do we distinguish load/context shift from service degradation?
- Is there a labeled feedback loop for threshold retuning?
- Does each alert severity map to a predefined action?
If any answer is “no,” drift detection is likely decorative.
References (researched)
- Adams, R. P., & MacKay, D. J. C. (2007). Bayesian Online Changepoint Detection. https://arxiv.org/abs/0710.3742
- CUSUM overview and historical references: https://en.wikipedia.org/wiki/CUSUM
- EWMA control chart background: https://en.wikipedia.org/wiki/Exponentially_weighted_moving_average
- ADWIN (adaptive windowing drift detector, practical API/docs): https://riverml.xyz/dev/api/drift/ADWIN/
- Page-Hinkley (practical online drift detector docs): https://riverml.xyz/dev/api/drift/PageHinkley/