SLO Burn-Rate Alerting Playbook (Multi-Window, Multi-Burn)

2026-02-28 · software / SRE


Why this matters

Threshold alerts like "5xx > 1% for 5m" are easy, but they page too often and miss the incidents that quietly burn your monthly reliability budget. Burn-rate alerting ties paging directly to error-budget depletion speed, which is much closer to business risk.

The practical goal:

  • page fast when the budget is burning quickly enough to threaten the SLO
  • surface slow, persistent burn as tickets rather than pages
  • keep noise low enough that every page is actionable

Core model

For an SLO target T (e.g., 99.9%):

  • error budget = 1 - T (0.001 for 99.9%)
  • burn rate = observed error rate / error budget

Interpretation:

  • burn rate 1: the budget is spent at exactly the sustainable pace and runs out at the end of the SLO window
  • burn rate 14.4 on a 30-day window: if sustained, the budget is gone in 720 / 14.4 = 50 hours

Recommended starting thresholds (Google SRE Workbook)

For a 30-day SLO, common paging baselines:

  1. Fast burn page

    • long window: 1h
    • short window: 5m
    • burn-rate threshold: 14.4
    • roughly corresponds to ~2% budget consumption in 1h
  2. Slow burn page

    • long window: 6h
    • short window: 30m
    • burn-rate threshold: 6
    • roughly corresponds to ~5% budget consumption in 6h
  3. Ticket/non-page signal

    • long window: 3d
    • threshold near 1
    • catches persistent drift before monthly breach risk accumulates
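The "roughly corresponds to" figures above are just arithmetic; a quick check, assuming a 30-day (720-hour) SLO window:

```python
def budget_consumed(burn_rate: float, hours: float,
                    slo_window_hours: float = 720.0) -> float:
    """Fraction of the total error budget burned if `burn_rate` is
    sustained for `hours` of a `slo_window_hours` SLO window."""
    return burn_rate * hours / slo_window_hours

print(round(budget_consumed(14.4, 1), 4))  # fast burn: 0.02 -> ~2% in 1h
print(round(budget_consumed(6, 6), 4))     # slow burn: 0.05 -> 5% in 6h
print(round(budget_consumed(1, 72), 4))    # 3d near 1x: 0.1 -> 10% of budget
```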

The key pattern is AND within each pair (long + short), then OR across pairs. This reduces flapping while keeping reset time practical.


PromQL templates

1) Availability SLO (bad/total)

Assume:

  • SLO target 99.9%, so the allowed error fraction (budget) is 0.001
  • requests counted in http_requests_total, labeled by service and status
  • 5xx responses are the "bad" events

Fast burn (14.4x over both 1h and 5m):

(
  rate(http_requests_total{service="api",status=~"5.."}[1h])
  /
  rate(http_requests_total{service="api"}[1h])
  > (14.4 * 0.001)
)
and
(
  rate(http_requests_total{service="api",status=~"5.."}[5m])
  /
  rate(http_requests_total{service="api"}[5m])
  > (14.4 * 0.001)
)

Slow burn (6x over both 6h and 30m):

(
  rate(http_requests_total{service="api",status=~"5.."}[6h])
  /
  rate(http_requests_total{service="api"}[6h])
  > (6 * 0.001)
)
and
(
  rate(http_requests_total{service="api",status=~"5.."}[30m])
  /
  rate(http_requests_total{service="api"}[30m])
  > (6 * 0.001)
)

Final page condition:

(fast_burn_expr) or (slow_burn_expr)

2) Latency SLO (histogram bucket)

If the SLO is "99.9% of requests ≤ 750ms":

1 - (
  rate(http_request_duration_seconds_bucket{service="api",le="0.75"}[1h])
  /
  rate(http_request_duration_seconds_bucket{service="api",le="+Inf"}[1h])
)

Use the same multi-window burn-rate pattern, substituting this latency error-rate expression for the availability error rate (the query above shows the 1h window; repeat it for 5m, 6h, and 30m).
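In raw numbers, the bucket arithmetic is simply (a toy illustration, not the Prometheus API):

```python
def latency_error_fraction(within_slo: float, total: float) -> float:
    """Share of requests slower than the SLO threshold, from cumulative
    histogram counts: 1 - (count of requests <= 750ms) / (all requests)."""
    return 1.0 - within_slo / total

# 9980 of 10000 requests under 750ms -> 0.2% too slow, i.e. a 2x burn
# against a 0.1% latency error budget.
print(round(latency_error_fraction(9980, 10000), 4))  # -> 0.002
```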


How to pick custom thresholds (instead of copying 14.4 / 6 blindly)

Two practical methods:

  1. Recovery-time method

    • If your team can usually recover in R hours,
    • with SLO window S hours,
    • choose threshold ≈ S / R.
  2. Budget-consumption method

    • Decide fraction F of budget you are willing to spend over alert long window W.
    • Threshold ≈ (S * F) / W.

Example: 7-day SLO (S = 168h); to alert when a 4h long window projects 80% budget use: (168 * 0.8) / 4 = 33.6.
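Both methods in code (a sketch; the function names are illustrative, not from any library):

```python
def threshold_recovery(slo_window_h: float, recovery_h: float) -> float:
    """Recovery-time method: the burn rate that would exhaust the
    entire budget within your typical recovery time R."""
    return slo_window_h / recovery_h

def threshold_budget(slo_window_h: float, budget_fraction: float,
                     long_window_h: float) -> float:
    """Budget-consumption method: the burn rate at which the long
    window consumes fraction F of the budget."""
    return slo_window_h * budget_fraction / long_window_h

print(round(threshold_budget(168, 0.8, 4), 4))   # the 33.6 example above
print(round(threshold_budget(720, 0.02, 1), 4))  # recovers the classic 14.4
print(round(threshold_recovery(720, 50), 4))     # 30d window, 50h recovery -> 14.4
```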


Production gotchas

  1. Low-traffic false signals
    Tiny denominators inflate rates: one failed request out of five reads as a 20% error rate. Add a minimum traffic guard (for example, require sum(rate(http_requests_total{service="api"}[1h])) to clear a volume floor before the burn condition can fire), or route low-traffic services to synthetic checks + ticketing.

  2. Mislabeled "bad" events
    If business-acceptable 4xx are treated as bad, you'll page on non-incidents.

  3. Summary metrics for latency SLOs
    Percentile summaries are hard to aggregate correctly across instances; prefer histogram buckets for burn-rate math.

  4. No action policy attached
    Burn-rate alerts without explicit response policy become expensive dashboards. Map each severity to concrete actions.
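Gotcha 1 in miniature (the floor of 1000 requests per window is a made-up example, not a recommendation):

```python
def guarded_burn_rate(errors: float, requests: float,
                      slo_target: float = 0.999,
                      min_requests: float = 1000.0):
    """Burn rate with a traffic floor: return None when volume is too
    low for the error rate to be statistically meaningful."""
    if requests < min_requests:
        return None  # route to synthetic checks / ticketing instead
    return (errors / requests) / (1.0 - slo_target)

print(guarded_burn_rate(1, 5))                 # None: 20% "error rate" on 5 requests
print(round(guarded_burn_rate(30, 10000), 2))  # 3.0: a real, mild burn
```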


Runbook contract (minimum)

For each burn-rate alert, define:

  • an owner and an expected response time per severity
  • first diagnostic steps (dashboards, logs, recent deploys to check)
  • mitigation options (rollback, traffic shift, disable a feature flag)
  • an escalation path if the burn rate does not drop after mitigation
  • criteria for downgrading or silencing the alert


7-day rollout plan

  1. Instrument clean good/bad counters and latency histograms.
  2. Backtest candidate rules against last 30โ€“90 days.
  3. Start as non-paging notifications for 1 week.
  4. Measure precision/recall and adjust traffic floors.
  5. Enable paging for fast burn only.
  6. Add slow burn paging once false positives are controlled.
  7. Add non-page budget trend ticketing for reliability planning.
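Step 2 can start as a simple offline replay; a minimal sketch of backtesting the fast-burn rule over a per-minute error-rate series (window handling is simplified to trailing averages; all names are illustrative):

```python
def trailing_avg(series, minutes):
    """Trailing average of the last `minutes` samples at each point."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - minutes + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def fast_burn_pages(error_rates, slo_target=0.999, threshold=14.4):
    """Replay the fast-burn rule (1h AND 5m) over per-minute error rates
    and return the indices (minutes) at which it would have paged."""
    budget = 1.0 - slo_target
    long_w = trailing_avg(error_rates, 60)
    short_w = trailing_avg(error_rates, 5)
    return [i for i in range(len(error_rates))
            if long_w[i] > threshold * budget and short_w[i] > threshold * budget]

# 90 min of 0.1% background errors, a 30-min spike at 5%, then recovery:
series = [0.001] * 90 + [0.05] * 30 + [0.001] * 60
pages = fast_burn_pages(series)
print(pages[0])  # fires partway into the spike, once the 1h window catches up
```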

Quick decision cheat sheet

  • both fast-burn windows over 14.4x: page now; treat as an active incident
  • both slow-burn windows over 6x: page; routine mitigation usually suffices
  • multi-day burn near 1x: ticket; investigate during working hours
  • burn below 1x: no action; the budget is recovering


References

  • Google SRE Workbook, "Alerting on SLOs" chapter (source of the multi-window, multi-burn pattern and the 14.4 / 6 baselines)