SLO Burn-Rate Alerting Playbook (Multi-Window, Multi-Burn)
Date: 2026-02-28
Category: knowledge (software / SRE)
Why this matters
Threshold alerts like "5xx > 1% for 5m" are easy, but they page too often and miss the incidents that quietly burn your monthly reliability budget. Burn-rate alerting ties paging directly to error-budget depletion speed, which is much closer to business risk.
The practical goal:
- page quickly for fast, dangerous burns,
- avoid noise from tiny short spikes,
- and still catch slower sustained degradation.
Core model
For an SLO target T (e.g., 99.9%):
- Error budget fraction: B = 1 - T (for 99.9%, B = 0.001)
- Observed error rate over a window W: E(W)
- Burn rate over window W: BR(W) = E(W) / B
Interpretation:
- BR = 1 means you consume budget exactly at the sustainable pace.
- BR = 10 means budget burns 10x faster than planned.
- Time-to-exhaustion ≈ (SLO window length) / BR.
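The core model fits in a few lines. A minimal sketch (Python; function names are illustrative, not from any library):

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate BR(W) = E(W) / B, where B = 1 - T is the error budget."""
    budget = 1.0 - slo_target
    return error_rate / budget

def hours_to_exhaustion(br: float, slo_window_hours: float = 30 * 24) -> float:
    """Approximate time until the entire budget is gone at this burn rate."""
    return slo_window_hours / br

# 1% observed errors against a 99.9% target: 0.01 / 0.001
br = burn_rate(error_rate=0.01, slo_target=0.999)
print(br)                        # ~10: burning 10x faster than planned
print(hours_to_exhaustion(br))   # ~72 hours: a 30-day budget gone in ~3 days
```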
Recommended starting thresholds (Google SRE Workbook)
For a 30-day SLO, common paging baselines:
Fast burn page
- long window: 1h
- short window: 5m
- burn-rate threshold: 14.4
- roughly corresponds to ~2% budget consumption in 1h
Slow burn page
- long window: 6h
- short window: 30m
- burn-rate threshold: 6
- roughly corresponds to ~5% budget consumption in 6h
Ticket/non-page signal
- long window: 3d
- threshold near 1
- catches persistent drift before monthly breach risk accumulates
The key pattern is AND within each pair (long + short), then OR across pairs. This reduces flapping while keeping reset time practical.
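The quoted budget fractions follow directly from threshold × window / SLO window. A quick check for the 30-day numbers (Python; constant name is mine):

```python
SLO_WINDOW_H = 30 * 24  # 720 hours in a 30-day SLO window

def budget_consumed(threshold: float, window_hours: float) -> float:
    """Fraction of total error budget spent if the burn rate holds at
    `threshold` for `window_hours`."""
    return threshold * window_hours / SLO_WINDOW_H

print(budget_consumed(14.4, 1))  # fast burn: ~0.02, i.e. ~2% of budget in 1h
print(budget_consumed(6, 6))     # slow burn: ~0.05, i.e. ~5% of budget in 6h
```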
PromQL templates
1) Availability SLO (bad/total)
Assume:
- metric: http_requests_total{service="api"}
- 5xx responses counted as bad
Note: aggregate with sum() before dividing; without it, PromQL's one-to-one vector matching pairs each 5xx series with itself and the ratio is always 1.
Fast burn (1h AND 5m):
(
sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
/
sum(rate(http_requests_total{service="api"}[1h]))
> (14.4 * 0.001)
)
and
(
sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="api"}[5m]))
> (14.4 * 0.001)
)
Slow burn (6h AND 30m):
(
sum(rate(http_requests_total{service="api",status=~"5.."}[6h]))
/
sum(rate(http_requests_total{service="api"}[6h]))
> (6 * 0.001)
)
and
(
sum(rate(http_requests_total{service="api",status=~"5.."}[30m]))
/
sum(rate(http_requests_total{service="api"}[30m]))
> (6 * 0.001)
)
Final page condition:
(fast_burn_expr) or (slow_burn_expr)
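The AND-within-pair, OR-across-pairs logic reduces to a small decision function. A sketch in Python (the per-window error rates are assumed inputs, e.g. results of the queries above):

```python
BUDGET = 0.001  # B for a 99.9% SLO

def should_page(err: dict) -> bool:
    """err maps window name -> observed error rate over that window."""
    fast = err["1h"] > 14.4 * BUDGET and err["5m"] > 14.4 * BUDGET
    slow = err["6h"] > 6 * BUDGET and err["30m"] > 6 * BUDGET
    return fast or slow

# Short spike only: 5m is hot but 1h is not -> no page (noise suppressed)
print(should_page({"1h": 0.002, "5m": 0.08, "6h": 0.001, "30m": 0.001}))   # False
# Sustained fast burn: both fast windows above 1.44% errors -> page
print(should_page({"1h": 0.03, "5m": 0.05, "6h": 0.004, "30m": 0.004}))    # True
```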
2) Latency SLO (histogram bucket)
If the SLO is "99.9% of requests ≤ 750ms", the latency error rate over 1h is:
1 - (
sum(rate(http_request_duration_seconds_bucket{service="api",le="0.75"}[1h]))
/
sum(rate(http_request_duration_seconds_bucket{service="api",le="+Inf"}[1h]))
)
Use the same burn-rate pattern, substituting this latency error-rate expression for the availability error-rate expression.
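In terms of raw bucket counts, the latency error rate is 1 minus the fraction of requests at or under the target bucket. A sketch with illustrative counts:

```python
def latency_error_rate(le_target: float, le_inf: float) -> float:
    """Fraction of requests slower than the SLO target: 1 minus the ratio of
    the le='0.75' bucket rate to the le='+Inf' (total) bucket rate."""
    return 1.0 - le_target / le_inf

# 99,880 of 100,000 requests finished within 750ms -> ~0.12% too slow;
# against a 99.9% latency SLO that is a burn rate of roughly 0.0012 / 0.001 = 1.2
print(latency_error_rate(99_880, 100_000))
```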
How to pick custom thresholds (instead of copying 14.4 / 6 blindly)
Two practical methods:
Recovery-time method
- If your team can usually recover in R hours,
- and the SLO window is S hours,
- choose threshold ≈ S / R (any faster burn would exhaust the budget before you can recover).
Budget-consumption method
- Decide the fraction F of budget you are willing to spend over the alert's long window W.
- Threshold ≈ (S * F) / W.
Example: for a 7-day SLO (S = 168h), paging when 80% of the budget would go over a 4h long window:
(168 * 0.8) / 4 = 33.6
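Both methods reduce to one-line formulas. A sketch that checks the 33.6 example (Python; function names and the 48h recovery time are illustrative inputs, not prescriptions):

```python
def threshold_recovery(slo_window_h: float, recovery_h: float) -> float:
    """Max tolerable burn rate if recovery typically takes recovery_h hours:
    the budget must outlive the recovery, so BR < S / R."""
    return slo_window_h / recovery_h

def threshold_budget(slo_window_h: float, fraction: float,
                     alert_window_h: float) -> float:
    """Burn rate at which `fraction` of the budget goes in the long window."""
    return slo_window_h * fraction / alert_window_h

print(threshold_recovery(720, 48))    # 30-day SLO, ~48h recovery -> 15.0
print(threshold_budget(168, 0.8, 4))  # the worked example above -> 33.6
```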
Production gotchas
Low-traffic false signals
Tiny denominators inflate rates. Add minimum traffic guards (e.g., require request volume floor) or route these services to synthetic checks + ticketing.Mislabeled "bad" events
If business-acceptable 4xx are treated as bad, youโll page on non-incidents.Summary metrics for latency SLOs
Percentile summaries are hard to aggregate correctly across instances; prefer histogram buckets for burn-rate math.No action policy attached
Burn-rate alerts without explicit response policy become expensive dashboards. Map each severity to concrete actions.
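The traffic-floor guard can be made concrete: skip the burn-rate judgment entirely when the window has too few requests to be meaningful. A sketch (Python; the 1000-request floor is an assumed example, tune it per service):

```python
from typing import Optional

def guarded_burn_rate(errors: float, requests: float, budget: float,
                      min_requests: float = 1000) -> Optional[float]:
    """Return the burn rate, or None when traffic is below the floor.
    Low-volume windows produce meaningless rates: 1 error in 10 requests
    reads as a 100x burn against a 99.9% SLO."""
    if requests < min_requests:
        return None  # route to synthetic checks / ticketing instead
    return (errors / requests) / budget

print(guarded_burn_rate(1, 10, 0.001))        # None: too little traffic to judge
print(guarded_burn_rate(120, 50_000, 0.001))  # ~2.4: a real signal
```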
Runbook contract (minimum)
For each burn-rate alert, define:
- Owner (primary + secondary)
- Immediate actions (rollback, traffic shift, feature flag off, autoscaling override)
- Abort criteria (when to escalate incident command)
- Recovery checks (burn rate below threshold on both windows for N minutes)
- Post-incident tasks (error-budget accounting + prevention ticket)
7-day rollout plan
- Instrument clean good/bad counters and latency histograms.
- Backtest candidate rules against last 30โ90 days.
- Start as non-paging notifications for 1 week.
- Measure precision/recall and adjust traffic floors.
- Enable paging for fast burn only.
- Add slow burn paging once false positives are controlled.
- Add non-page budget trend ticketing for reliability planning.
Quick decision cheat sheet
- Need fast incident detection without alert storms? → multi-window burn-rate paging
- Need a planning signal, not pager load? → error-budget/ticket alerts
- Low-traffic service? → traffic floor + synthetic checks + ticket-first
References
Google SRE Workbook: Alerting on SLOs
https://sre.google/workbook/alerting-on-slos/
Grafana Labs: How to implement multi-window, multi-burn-rate alerts with Grafana Cloud
https://grafana.com/blog/how-to-implement-multi-window-multi-burn-rate-alerts-with-grafana-cloud/
Datadog: Proactively monitor service performance with SLO alerts
https://www.datadoghq.com/blog/monitor-service-performance-with-slo-alerts/