Webhook Reliability Playbook (Security, Idempotency, Ordering, Recovery)

2026-02-26 · software

Date: 2026-02-26
Category: knowledge
Domain: backend integrations / event-driven systems

Why this matters

Webhook incidents are rarely about one bug. They are usually a compound failure:

If your webhook endpoint is “just an HTTP POST handler,” it will eventually fail in production.


First principles

Treat webhooks as:

  1. Untrusted input (must authenticate every request)
  2. At-least-once delivery (duplicates are normal)
  3. Potentially out-of-order (ordering is not guaranteed)
  4. Eventually consistent signals (payload may be stale vs provider source of truth)

This mindset avoids most expensive mistakes.


Production architecture (recommended)

1) Ingest layer (fast path)

On request:

  1. Verify signature using raw request body and provider secret/public key.
  2. Extract provider delivery ID + event ID + event type + event timestamp.
  3. Persist raw payload + metadata to an append-only ingress table/log.
  4. Enqueue internal job.
  5. Return 2xx quickly.

Goal: decouple provider delivery latency from business processing latency.

GitHub, for example, expects a 2xx response within 10 seconds of receiving a delivery; if the server takes longer, GitHub terminates the connection and counts the delivery as failed.
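The five fast-path steps can be sketched as a framework-agnostic handler. This is a minimal illustration, not a definitive implementation: the `ingress_log` schema, the `x-delivery-id` header, the `id` / `type` / `created` payload fields, and the in-process `queue.Queue` are all assumptions standing in for your provider's real names and a durable queue. Signature checking is passed in as a `verify` callable so the sketch stays provider-neutral.

```python
import json
import queue
import sqlite3
import time
from typing import Callable, Mapping, Optional


def _text(value: object) -> Optional[str]:
    """Store identifiers as text, keeping missing values as NULL."""
    return None if value is None else str(value)


def make_ingest_handler(
    db: sqlite3.Connection,
    jobs: "queue.Queue[int]",
    verify: Callable[[bytes, Mapping[str, str]], bool],
) -> Callable[[bytes, Mapping[str, str]], int]:
    """Build a handler that maps (raw_body, headers) to an HTTP status code."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS ingress_log ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " delivery_id TEXT, event_id TEXT, event_type TEXT,"
        " event_ts TEXT, received_at REAL, raw_body BLOB)"
    )

    def handle(raw_body: bytes, headers: Mapping[str, str]) -> int:
        # 1. Verify the signature over the raw bytes, before any parsing.
        if not verify(raw_body, headers):
            return 401
        # 2. Extract identifiers (header and field names are hypothetical).
        try:
            event = json.loads(raw_body)
        except ValueError:
            return 400
        if not isinstance(event, dict):
            return 400
        # 3. Persist the raw payload and metadata; rows are only ever inserted.
        cur = db.execute(
            "INSERT INTO ingress_log (delivery_id, event_id, event_type,"
            " event_ts, received_at, raw_body) VALUES (?, ?, ?, ?, ?, ?)",
            (
                headers.get("x-delivery-id"),
                _text(event.get("id")),
                _text(event.get("type")),
                _text(event.get("created")),
                time.time(),
                raw_body,
            ),
        )
        db.commit()
        # 4. Enqueue only the row id; workers load the payload from the log.
        jobs.put(cur.lastrowid)
        # 5. Acknowledge immediately; business logic runs later in a worker.
        return 202

    return handle
```

The handler does no business logic at all, so its latency is bounded by one insert and one enqueue regardless of how slow downstream processing becomes.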

2) Processing layer (slow path)

Worker consumes queue and applies idempotent business logic:

3) Recovery layer


Data model that actually works

A) Ingress log (immutable)

B) Dedup key

Use a unique constraint like:

Keep delivery-attempt IDs too, but dedupe on event identity for side effects.
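One way to express this split, sketched with SQLite: a unique constraint on event identity decides whether side effects may run, while every delivery attempt is still recorded. The table names, column names, and the `(provider, event_id)` key are assumptions for illustration; use whatever identity your provider guarantees to be stable across retries.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_events (
    provider TEXT NOT NULL,
    event_id TEXT NOT NULL,
    UNIQUE (provider, event_id)
);
CREATE TABLE IF NOT EXISTS delivery_attempts (
    provider    TEXT NOT NULL,
    delivery_id TEXT NOT NULL,
    event_id    TEXT NOT NULL
);
"""


def claim_event(db: sqlite3.Connection, provider: str,
                event_id: str, delivery_id: str) -> bool:
    """Record the delivery attempt; return True only for the first
    delivery of this event identity."""
    # Every attempt is logged, including duplicates and retries.
    db.execute(
        "INSERT INTO delivery_attempts (provider, delivery_id, event_id)"
        " VALUES (?, ?, ?)",
        (provider, delivery_id, event_id),
    )
    # The unique constraint turns a duplicate event into a no-op insert.
    cur = db.execute(
        "INSERT OR IGNORE INTO processed_events (provider, event_id)"
        " VALUES (?, ?)",
        (provider, event_id),
    )
    db.commit()
    return cur.rowcount == 1
```

A worker would call `claim_event` first and skip side effects when it returns `False`.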

C) Idempotent side effects ledger

For each external action (email, shipment, entitlement change):

Never fire side effects without recording/reusing idempotency keys.
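A side-effects ledger might look like the following sketch. The `side_effects` table, the `event_id:action` key format, and the `run_once` helper are hypothetical. The key is also handed to the effect itself so that a downstream service that accepts idempotency keys can dedupe on its side; that matters because a crash between performing the effect and marking it done would cause a retry with the same key.

```python
import sqlite3
from typing import Callable

LEDGER_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS side_effects ("
    " idempotency_key TEXT PRIMARY KEY,"
    " action TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " result TEXT)"
)


def run_once(db: sqlite3.Connection, event_id: str, action: str,
             effect: Callable[[str], str]) -> str:
    """Run `effect` at most once per (event, action) as far as the ledger
    knows, and return the recorded result on repeat calls."""
    key = f"{event_id}:{action}"
    row = db.execute(
        "SELECT status, result FROM side_effects WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    if row is not None and row[0] == "done":
        return row[1]  # already performed: reuse the stored result

    # Record intent before acting so the key exists even if we crash.
    db.execute(
        "INSERT OR IGNORE INTO side_effects (idempotency_key, action, status)"
        " VALUES (?, ?, 'pending')",
        (key, action),
    )
    db.commit()

    # Pass the same key downstream so a retried call can be deduped there.
    result = effect(key)

    db.execute(
        "UPDATE side_effects SET status = 'done', result = ?"
        " WHERE idempotency_key = ?",
        (result, key),
    )
    db.commit()
    return result
```

Rows left in `pending` after a crash are a useful signal for manual review, since the ledger alone cannot tell whether the external action happened.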


Ordering strategy (realistic, not fantasy)

Strict global ordering is usually fragile for webhooks. Prefer one of these:

  1. Fetch-before-process

    • Treat webhook as a hint.
    • Fetch latest object state from provider API and apply state transition from source-of-truth.
  2. Conditional upsert with freshness guard

    • Upsert only if incoming event_time (or version counter) is newer.
  3. Per-entity sequence check (if provider gives monotonic version/sequence)

    • Apply only if incoming_seq > stored_seq.

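For many systems, combining (1) and (2) is the most robust option: the fetch supplies current state, and the freshness guard stops a late or duplicate delivery from overwriting it.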
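The freshness guard in (2) and the sequence check in (3) can both be written as a single conditional upsert. The sketch below assumes a hypothetical `entities` table and an integer `version` (an event timestamp would work the same way), and relies on SQLite's `ON CONFLICT ... DO UPDATE ... WHERE` syntax, available from SQLite 3.24.

```python
import sqlite3
from typing import Tuple

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS entities ("
    " entity_id TEXT PRIMARY KEY,"
    " state TEXT NOT NULL,"
    " version INTEGER NOT NULL)"
)


def apply_if_newer(db: sqlite3.Connection, entity_id: str,
                   state: str, version: int) -> Tuple[str, int]:
    """Insert the entity, or update it only when the incoming version is
    strictly newer. Returns the (state, version) stored afterwards."""
    db.execute(
        "INSERT INTO entities (entity_id, state, version) VALUES (?, ?, ?) "
        "ON CONFLICT(entity_id) DO UPDATE SET "
        "state = excluded.state, version = excluded.version "
        # Stale and duplicate events fail this guard and change nothing.
        "WHERE excluded.version > entities.version",
        (entity_id, state, version),
    )
    db.commit()
    return db.execute(
        "SELECT state, version FROM entities WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
```

Because the comparison happens inside one statement, two workers racing on the same entity cannot interleave a read and a write around the check.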


Security controls (minimum bar)

  1. Mandatory signature verification

    • Reject requests with missing or invalid signatures.
  2. Replay window check

    • Enforce a timestamp tolerance (e.g., 5 minutes).
  3. Raw-body verification

    • Verify against the raw bytes, before any JSON parsing, mutation, or reformatting.
  4. HTTPS only + secret rotation

    • Use a dual-key overlap window so old and new secrets both verify during rotation.
  5. Optional IP allowlisting

    • Useful defense-in-depth, but not a replacement for signatures.

Standard Webhooks guidance: sign the string message_id.timestamp.payload (the three fields joined by periods). The message ID supports idempotency and the timestamp supports replay defense.
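A simplified sketch of these controls, using HMAC-SHA256 over `message_id.timestamp.payload` with a timestamp tolerance and a list of accepted secrets for rotation. It deliberately omits header parsing, signature version prefixes, and secret decoding, so treat it as an illustration of the checks rather than a complete Standard Webhooks implementation; the function names and the `now` parameter are assumptions added for testability.

```python
import base64
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 5 * 60  # replay window from the list above


def sign(secret: bytes, msg_id: str, timestamp: int, payload: bytes) -> str:
    """Return base64(HMAC-SHA256(secret, "msg_id.timestamp.payload"))."""
    signed = msg_id.encode() + b"." + str(timestamp).encode() + b"." + payload
    digest = hmac.new(secret, signed, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(secrets: Iterable[bytes], msg_id: str, timestamp: int,
           payload: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Check the replay window, then the signature against every active
    secret (old and new secrets overlap during rotation)."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future
    for secret in secrets:
        expected = sign(secret, msg_id, timestamp, payload)
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(expected.encode(), signature.encode()):
            return True
    return False
```

`payload` must be the exact bytes received on the wire; re-serializing parsed JSON can change whitespace or key order and break verification.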


Provider-specific reliability notes

GitHub

Stripe

Design implication: your runbook cannot assume all providers retry the same way.


Failure-state runbook

Incident: retry storm / endpoint saturation

  1. Keep signature validation on (do not disable auth during incident).
  2. Switch ingest endpoint to queue-only mode (minimal parsing, immediate 2xx after durable enqueue).
  3. Apply worker concurrency caps + backoff to downstream dependencies.
  4. Monitor queue depth and max event age (minutes behind).

Incident: processing bug caused bad side effects

  1. Pause affected event types.
  2. Patch worker logic.
  3. Replay from ingress log for bounded time window.
  4. Use side-effect idempotency ledger to avoid duplicate external actions.
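Steps 3 and 4 can be sketched as a bounded replay over the ingress log. The `ingress_log` columns (`id`, `event_type`, `received_at`, `raw_body`) and the `process` callable are assumptions; `process` stands for the patched worker logic, which is expected to consult the side-effect ledger itself so that already-completed external actions are not repeated.

```python
import sqlite3
from typing import Callable, Sequence


def replay_window(db: sqlite3.Connection, start_ts: float, end_ts: float,
                  event_types: Sequence[str],
                  process: Callable[[str, bytes], None]) -> int:
    """Re-run `process` for logged events of the given types received in
    [start_ts, end_ts), in arrival order. Returns the number replayed."""
    if not event_types:
        return 0
    placeholders = ", ".join("?" for _ in event_types)
    rows = db.execute(
        "SELECT event_type, raw_body FROM ingress_log"
        " WHERE received_at >= ? AND received_at < ?"
        f" AND event_type IN ({placeholders})"
        " ORDER BY received_at, id",
        (start_ts, end_ts, *event_types),
    ).fetchall()
    for event_type, raw_body in rows:
        # The ingress log is never modified; replay only reads from it.
        process(event_type, raw_body)
    return len(rows)
```

Bounding the window and the event types keeps a replay from reprocessing events that the bug never touched.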

Incident: downtime caused event gaps

  1. Recover missed events through provider redelivery mechanisms (for GitHub, a script that redelivers failed deliveries; for Stripe, listing events through the API).
  2. Reconcile by polling provider source-of-truth for critical entities.
  3. Backfill missing state transitions.

Observability: metrics that matter

Track at least:

SLO suggestion:


Practical checklist

Before going live, confirm:

If any checkbox is missing, webhook reliability is not production-ready yet.

