SLO Burn-Rate Alerting Playbook (Multi-Window, Multi-Burn)

2026-02-28 · software / SRE


Why this matters

Threshold alerts like "5xx > 1% for 5m" are easy, but they page too often and miss the incidents that quietly burn your monthly reliability budget. Burn-rate alerting ties paging directly to error-budget depletion speed, which is much closer to business risk.

The practical goal:

  • page fast when the budget is burning quickly enough to threaten the SLO
  • surface slow, persistent burn as tickets rather than pages
  • keep noise low enough that every page is actionable

Core model

For an SLO target T (e.g., 99.9%):

  • error budget = 1 - T (0.001 for 99.9%)
  • burn rate = observed error rate / error budget

Interpretation:

  • burn rate 1: the budget is spent at exactly the sustainable pace and runs out at the end of the SLO window
  • burn rate 14.4 on a 30-day window: if sustained, the budget is gone in 720 / 14.4 = 50 hours

Recommended starting thresholds (Google SRE Workbook)

For a 30-day SLO, common paging baselines:

  1. Fast burn page

    • long window: 1h
    • short window: 5m
    • burn-rate threshold: 14.4
    • roughly corresponds to ~2% budget consumption in 1h
  2. Slow burn page

    • long window: 6h
    • short window: 30m
    • burn-rate threshold: 6
    • roughly corresponds to ~5% budget consumption in 6h
  3. Ticket/non-page signal

    • long window: 3d
    • threshold near 1
    • catches persistent drift before monthly breach risk accumulates
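The "roughly corresponds to" figures above are just arithmetic; a quick check, assuming a 30-day (720-hour) SLO window:

```python
def budget_consumed(burn_rate: float, hours: float,
                    slo_window_hours: float = 720.0) -> float:
    """Fraction of the total error budget burned if `burn_rate` is
    sustained for `hours` of a `slo_window_hours` SLO window."""
    return burn_rate * hours / slo_window_hours

print(round(budget_consumed(14.4, 1), 4))  # fast burn: 0.02 -> ~2% in 1h
print(round(budget_consumed(6, 6), 4))     # slow burn: 0.05 -> 5% in 6h
print(round(budget_consumed(1, 72), 4))    # 3d near 1x: 0.1 -> 10% of budget
```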

The key pattern is AND within each pair (long + short), then OR across pairs. This reduces flapping while keeping reset time practical.


PromQL templates

1) Availability SLO (bad/total)

Assume:

  • SLO target 99.9%, so the allowed error fraction (budget) is 0.001
  • requests counted in http_requests_total, labeled by service and status
  • 5xx responses are the "bad" events

Fast burn (14.4x over both 1h and 5m):

(
  rate(http_requests_total{service="api",status=~"5.."}[1h])
  /
  rate(http_requests_total{service="api"}[1h])
  > (14.4 * 0.001)
)
and
(
  rate(http_requests_total{service="api",status=~"5.."}[5m])
  /
  rate(http_requests_total{service="api"}[5m])
  > (14.4 * 0.001)
)

Slow burn (6x over both 6h and 30m):

(
  rate(http_requests_total{service="api",status=~"5.."}[6h])
  /
  rate(http_requests_total{service="api"}[6h])
  > (6 * 0.001)
)
and
(
  rate(http_requests_total{service="api",status=~"5.."}[30m])
  /
  rate(http_requests_total{service="api"}[30m])
  > (6 * 0.001)
)

Final page condition:

(fast_burn_expr) or (slow_burn_expr)

2) Latency SLO (histogram bucket)

If the SLO is "99.9% of requests ≤ 750ms":

1 - (
  rate(http_request_duration_seconds_bucket{service="api",le="0.75"}[1h])
  /
  rate(http_request_duration_seconds_bucket{service="api",le="+Inf"}[1h])
)

Use the same multi-window burn-rate pattern, substituting this latency error-rate expression for the availability error rate (the query above shows the 1h window; repeat it for 5m, 6h, and 30m).
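In raw numbers, the bucket arithmetic is simply (a toy illustration, not the Prometheus API):

```python
def latency_error_fraction(within_slo: float, total: float) -> float:
    """Share of requests slower than the SLO threshold, from cumulative
    histogram counts: 1 - (count of requests <= 750ms) / (all requests)."""
    return 1.0 - within_slo / total

# 9980 of 10000 requests under 750ms -> 0.2% too slow, i.e. a 2x burn
# against a 0.1% latency error budget.
print(round(latency_error_fraction(9980, 10000), 4))  # -> 0.002
```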


How to pick custom thresholds (instead of copying 14.4 / 6 blindly)

Two practical methods:

  1. Recovery-time method

    • If your team can usually recover in R hours,
    • with SLO window S hours,
    • choose threshold ≈ S / R.
  2. Budget-consumption method

    • Decide fraction F of budget you are willing to spend over alert long window W.
    • Threshold ≈ (S * F) / W.

Example: 7-day SLO (S = 168h); to alert when a 4h long window projects 80% budget use: (168 * 0.8) / 4 = 33.6.
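Both methods in code (a sketch; the function names are illustrative, not from any library):

```python
def threshold_recovery(slo_window_h: float, recovery_h: float) -> float:
    """Recovery-time method: the burn rate that would exhaust the
    entire budget within your typical recovery time R."""
    return slo_window_h / recovery_h

def threshold_budget(slo_window_h: float, budget_fraction: float,
                     long_window_h: float) -> float:
    """Budget-consumption method: the burn rate at which the long
    window consumes fraction F of the budget."""
    return slo_window_h * budget_fraction / long_window_h

print(round(threshold_budget(168, 0.8, 4), 4))   # the 33.6 example above
print(round(threshold_budget(720, 0.02, 1), 4))  # recovers the classic 14.4
print(round(threshold_recovery(720, 50), 4))     # 30d window, 50h recovery -> 14.4
```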


Production gotchas

  1. Low-traffic false signals
    Tiny denominators inflate rates: one failed request out of five reads as a 20% error rate. Add a minimum traffic guard (for example, require sum(rate(http_requests_total{service="api"}[1h])) to clear a volume floor before the burn condition can fire), or route low-traffic services to synthetic checks + ticketing.

  2. Mislabeled "bad" events
    If business-acceptable 4xx are treated as bad, you'll page on non-incidents.

  3. Summary metrics for latency SLOs
    Percentile summaries are hard to aggregate correctly across instances; prefer histogram buckets for burn-rate math.

  4. No action policy attached
    Burn-rate alerts without explicit response policy become expensive dashboards. Map each severity to concrete actions.
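Gotcha 1 in miniature (the floor of 1000 requests per window is a made-up example, not a recommendation):

```python
def guarded_burn_rate(errors: float, requests: float,
                      slo_target: float = 0.999,
                      min_requests: float = 1000.0):
    """Burn rate with a traffic floor: return None when volume is too
    low for the error rate to be statistically meaningful."""
    if requests < min_requests:
        return None  # route to synthetic checks / ticketing instead
    return (errors / requests) / (1.0 - slo_target)

print(guarded_burn_rate(1, 5))                 # None: 20% "error rate" on 5 requests
print(round(guarded_burn_rate(30, 10000), 2))  # 3.0: a real, mild burn
```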


Runbook contract (minimum)

For each burn-rate alert, define:

  • an owner and an expected response time per severity
  • first diagnostic steps (dashboards, logs, recent deploys to check)
  • mitigation options (rollback, traffic shift, disable a feature flag)
  • an escalation path if the burn rate does not drop after mitigation
  • criteria for downgrading or silencing the alert


7-day rollout plan

  1. Instrument clean good/bad counters and latency histograms.
  2. Backtest candidate rules against last 30โ€“90 days.
  3. Start as non-paging notifications for 1 week.
  4. Measure precision/recall and adjust traffic floors.
  5. Enable paging for fast burn only.
  6. Add slow burn paging once false positives are controlled.
  7. Add non-page budget trend ticketing for reliability planning.
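Step 2 can start as a simple offline replay; a minimal sketch of backtesting the fast-burn rule over a per-minute error-rate series (window handling is simplified to trailing averages; all names are illustrative):

```python
def trailing_avg(series, minutes):
    """Trailing average of the last `minutes` samples at each point."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - minutes + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def fast_burn_pages(error_rates, slo_target=0.999, threshold=14.4):
    """Replay the fast-burn rule (1h AND 5m) over per-minute error rates
    and return the indices (minutes) at which it would have paged."""
    budget = 1.0 - slo_target
    long_w = trailing_avg(error_rates, 60)
    short_w = trailing_avg(error_rates, 5)
    return [i for i in range(len(error_rates))
            if long_w[i] > threshold * budget and short_w[i] > threshold * budget]

# 90 min of 0.1% background errors, a 30-min spike at 5%, then recovery:
series = [0.001] * 90 + [0.05] * 30 + [0.001] * 60
pages = fast_burn_pages(series)
print(pages[0])  # fires partway into the spike, once the 1h window catches up
```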

Quick decision cheat sheet

  • both fast-burn windows over 14.4x: page now; treat as an active incident
  • both slow-burn windows over 6x: page; routine mitigation usually suffices
  • multi-day burn near 1x: ticket; investigate during working hours
  • burn below 1x: no action; the budget is recovering


References

  • Google SRE Workbook, "Alerting on SLOs" chapter (source of the multi-window, multi-burn pattern and the 14.4 / 6 baselines)