Webhook Reliability Playbook (Security, Idempotency, Ordering, Recovery)

Date: 2026-02-26
Category: knowledge
Domain: backend integrations / event-driven systems

Why this matters

Webhook incidents are rarely about one bug. They are usually a compound failure:

duplicate deliveries trigger duplicate side effects,
out-of-order events overwrite newer state,
slow handlers time out and create retry storms,
no replay tooling means missed events stay missed.

If your webhook endpoint is “just an HTTP POST handler,” it will eventually fail in production.

First principles

Treat webhooks as:

Untrusted input (must authenticate every request)
At-least-once delivery (duplicates are normal)
Potentially out-of-order (ordering is not guaranteed)
Eventually consistent signals (payload may be stale vs provider source of truth)

This mindset avoids most expensive mistakes.

Production architecture (recommended)

1) Ingest layer (fast path)

On request:

Verify signature using raw request body and provider secret/public key.
Extract provider delivery ID + event ID + event type + event timestamp.
Persist raw payload + metadata to an append-only ingress table/log.
Enqueue internal job.
Return 2xx quickly.

Goal: decouple provider delivery latency from business processing latency.

For example, GitHub expects a 2xx response within 10 seconds; a slower response is treated as a failed delivery.
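The five fast-path steps can be sketched as one function. This is a minimal sketch, not a provider integration: it assumes a hex-encoded HMAC-SHA256 signature over the raw body, uses SQLite as a stand-in for the ingress log and queue.Queue as a stand-in for the job queue, and assumes the payload carries "id" and "type" fields. The names handle_delivery and INGRESS_DDL are hypothetical.

```python
import hashlib
import hmac
import json
import queue
import sqlite3
import time

# Hypothetical, trimmed-down ingress log; a real one would carry every
# field from the data model.
INGRESS_DDL = """
CREATE TABLE IF NOT EXISTS ingress_log (
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,
    event_id TEXT,
    event_type TEXT,
    received_at REAL NOT NULL,
    payload_raw BLOB NOT NULL
)
"""


def handle_delivery(db, jobs, secret, provider, raw_body, signature_hex):
    """Handle one delivery and return the HTTP status code to send back."""
    # 1. Verify the signature over the raw bytes, before any parsing.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401

    # 2. Extract only the metadata needed for routing and dedupe.
    #    The "id" and "type" field names are assumptions.
    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400
    if not isinstance(event, dict):
        return 400

    # 3. Persist the untouched payload, then 4. enqueue a pointer to it.
    cur = db.execute(
        "INSERT INTO ingress_log "
        "(provider, event_id, event_type, received_at, payload_raw) "
        "VALUES (?, ?, ?, ?, ?)",
        (provider, event.get("id"), event.get("type"), time.time(), raw_body),
    )
    db.commit()
    jobs.put(cur.lastrowid)

    # 5. Acknowledge with a 2xx; business logic runs later in a worker.
    return 202
```

The handler does no business work at all, so its latency is bounded by one signature check and one insert.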

2) Processing layer (slow path)

Worker consumes queue and applies idempotent business logic:

dedupe by stable event key,
apply ordering guards (timestamp/version checks),
execute side effects with idempotency keys,
mark processing status (processing → processed / failed / dead-letter).
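A worker step with these status transitions might look like the following. It is a sketch under stated assumptions: Job, run_job, and MAX_ATTEMPTS are hypothetical names, the in-memory set stands in for a durable dedupe store, and a retry budget of 3 is an arbitrary choice.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical retry budget before dead-lettering


@dataclass
class Job:
    event_key: str
    payload: dict
    status: str = "enqueued"
    attempts: int = 0


def run_job(job, handler, seen_keys):
    """Run one job and return its new status."""
    # Dedupe: an already-processed event is acknowledged without side effects.
    if job.event_key in seen_keys:
        job.status = "processed"
        return job.status

    job.status = "processing"
    job.attempts += 1
    try:
        handler(job.payload)
    except Exception:
        # Retry later until the budget is spent, then park in the DLQ.
        job.status = "dead_letter" if job.attempts >= MAX_ATTEMPTS else "failed"
        return job.status

    seen_keys.add(job.event_key)
    job.status = "processed"
    return job.status
```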

3) Recovery layer

replay failed jobs from internal DLQ,
redeliver from provider APIs when available,
reconcile from provider source-of-truth API for gap repair.

Data model that actually works

A) Ingress log (immutable)

provider (stripe/github/...)
delivery_id (provider delivery attempt id if available)
event_id (provider logical event id)
event_type
event_time
received_at
payload_raw
signature_verified (bool)
status (received/enqueued/processing/processed/failed/dead_letter); the only field updated after insert, so payload and metadata stay immutable

B) Dedup key

Use a unique constraint like:

preferred: (provider, event_id)
fallback (if no stable event id): hash of provider + canonical payload identity fields

Keep delivery-attempt IDs too, but dedupe on event identity for side effects.
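One way to enforce this is to let the database reject duplicates. The sketch below uses SQLite; the table name processed_events and the helpers event_key and claim_event are hypothetical, and the fallback hash assumes the caller knows which payload fields identify the event.

```python
import hashlib
import sqlite3

DEDUPE_DDL = """
CREATE TABLE IF NOT EXISTS processed_events (
    provider TEXT NOT NULL,
    event_key TEXT NOT NULL,
    delivery_id TEXT,
    UNIQUE (provider, event_key)
)
"""


def event_key(provider, event_id, identity_fields=None):
    """Prefer the provider's event id; otherwise hash the identity fields."""
    if event_id:
        return event_id
    if not identity_fields:
        raise ValueError("need an event_id or identity fields")
    # Sort keys so the same fields always hash to the same value.
    canonical = "|".join(f"{k}={identity_fields[k]}" for k in sorted(identity_fields))
    return hashlib.sha256(f"{provider}|{canonical}".encode()).hexdigest()


def claim_event(db, provider, key, delivery_id=None):
    """Return True only for the first claim of (provider, key)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO processed_events (provider, event_key, delivery_id) "
        "VALUES (?, ?, ?)",
        (provider, key, delivery_id),
    )
    db.commit()
    return cur.rowcount == 1
```

A worker would call claim_event first and skip side effects when it returns False. The delivery ID is stored for debugging but is not part of the constraint.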

C) Idempotent side effects ledger

For each external action (email, shipment, entitlement change):

idempotency_key
action_type
target_id
result
executed_at

Never fire side effects without recording/reusing idempotency keys.
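The ledger can be sketched as a reserve-then-execute wrapper. This assumes SQLite, a string result, and hypothetical names (side_effects, run_once). It reserves the key before running the action, so a concurrent caller sees the reservation and does not fire the action again; a crash between reserving and executing would need separate cleanup, which is not shown.

```python
import sqlite3
import time

LEDGER_DDL = """
CREATE TABLE IF NOT EXISTS side_effects (
    idempotency_key TEXT PRIMARY KEY,
    action_type TEXT NOT NULL,
    target_id TEXT NOT NULL,
    result TEXT,
    executed_at REAL
)
"""


def run_once(db, key, action_type, target_id, action):
    """Execute action at most once per idempotency key; return its result."""
    cur = db.execute(
        "INSERT OR IGNORE INTO side_effects (idempotency_key, action_type, target_id) "
        "VALUES (?, ?, ?)",
        (key, action_type, target_id),
    )
    db.commit()
    if cur.rowcount == 0:
        # Key already reserved or executed: reuse the recorded result.
        row = db.execute(
            "SELECT result FROM side_effects WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row[0]

    try:
        result = action()
    except Exception:
        # Release the reservation so a later retry can run the action.
        db.execute("DELETE FROM side_effects WHERE idempotency_key = ?", (key,))
        db.commit()
        raise

    db.execute(
        "UPDATE side_effects SET result = ?, executed_at = ? WHERE idempotency_key = ?",
        (result, time.time(), key),
    )
    db.commit()
    return result
```

When the downstream API accepts its own idempotency key, passing the same key through gives a second layer of protection.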

Ordering strategy (realistic, not fantasy)

Strict global ordering is usually fragile for webhooks. Prefer one of these:

1. Fetch-before-process
- Treat webhook as a hint.
- Fetch latest object state from provider API and apply state transition from source-of-truth.
2. Conditional upsert with freshness guard
- Upsert only if incoming event_time (or version counter) is newer.
3. Per-entity sequence check (if provider gives monotonic version/sequence)
- Apply only if incoming_seq > stored_seq.

For many systems, (1) + (2) together is the most robust.
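The freshness guard fits in a single SQL statement. The sketch below assumes SQLite 3.24 or newer (for ON CONFLICT ... DO UPDATE) and a hypothetical entity_state table; apply_event is an illustrative name.

```python
import sqlite3

STATE_DDL = """
CREATE TABLE IF NOT EXISTS entity_state (
    entity_id TEXT PRIMARY KEY,
    state TEXT NOT NULL,
    event_time INTEGER NOT NULL
)
"""

# The WHERE clause turns the update into a no-op for stale events.
UPSERT = """
INSERT INTO entity_state (entity_id, state, event_time)
VALUES (?, ?, ?)
ON CONFLICT (entity_id) DO UPDATE
SET state = excluded.state, event_time = excluded.event_time
WHERE excluded.event_time > entity_state.event_time
"""


def apply_event(db, entity_id, state, event_time):
    """Apply the event if it is newer; return the state now stored."""
    db.execute(UPSERT, (entity_id, state, event_time))
    db.commit()
    row = db.execute(
        "SELECT state FROM entity_state WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return row[0]
```

With a strict ">" comparison, two events carrying the same timestamp resolve to whichever arrived first. A provider version counter avoids that tie when one is available.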

Security controls (minimum bar)

Signature verification mandatory
- reject unsigned/invalid signatures.
Replay window check
- enforce timestamp tolerance (e.g., 5 minutes).
Raw-body verification
- verify before JSON mutation/reformatting.
HTTPS only + secret rotation
- dual-key overlap window for rotation.
Optional IP allowlisting
- useful defense-in-depth, but not a replacement for signatures.

Standard Webhooks guidance: sign the message ID, timestamp, and payload joined by periods (message_id.timestamp.payload). The message ID supports idempotency and the timestamp supports replay defense.
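Signature verification, the replay window, and dual-key rotation combine naturally in one check. This is a simplified sketch in the spirit of that signing scheme, not a full implementation of the specification: header parsing and secret encoding are omitted, and sign, verify, and TOLERANCE_SECONDS are hypothetical names.

```python
import base64
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # 5-minute replay window


def sign(secret, msg_id, timestamp, payload):
    """Base64 HMAC-SHA256 over 'msg_id.timestamp.payload' (payload as raw bytes)."""
    signed = f"{msg_id}.{timestamp}.".encode() + payload
    digest = hmac.new(secret, signed, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(secrets, msg_id, timestamp, payload, signature, now=None):
    """Accept if the timestamp is fresh and any active secret matches."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    # During rotation, secrets holds both the new and the old key.
    return any(
        hmac.compare_digest(
            sign(secret, msg_id, timestamp, payload).encode(), signature.encode()
        )
        for secret in secrets
    )
```

Passing [new_key, old_key] during the overlap window lets senders switch keys without dropped deliveries; the old key is removed once the window closes.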

Provider-specific reliability notes

GitHub

Delivery is considered failed if you do not respond within 10 seconds.
GitHub does not automatically redeliver failed deliveries.
You should run scheduled redelivery/recovery using GitHub webhook delivery APIs.
The X-GitHub-Delivery header is a unique identifier for each delivery; a redelivery reuses the original value, so it also works as a dedupe key.

Stripe

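In live mode, Stripe automatically retries undelivered webhook events for up to 3 days, with exponential backoff.
If you catch up manually while Stripe is still retrying, ignore retried deliveries of events you have already processed, but still return a 2xx so Stripe stops retrying them.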
Stripe docs also emphasize quick 2xx responses and asynchronous handling.

Design implication: your runbook cannot assume all providers retry the same way.
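One way to keep the runbook honest is to encode each provider's behavior as data. The sketch below records only the facts stated above; ProviderPolicy, POLICIES, and recovery_plan are hypothetical names, and None means "not stated here" rather than "no limit".

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ProviderPolicy:
    response_deadline: Optional[timedelta]  # None: not stated in this playbook
    auto_retry_window: Optional[timedelta]  # None: provider does not auto-retry


POLICIES = {
    "github": ProviderPolicy(
        response_deadline=timedelta(seconds=10), auto_retry_window=None
    ),
    "stripe": ProviderPolicy(
        response_deadline=None, auto_retry_window=timedelta(days=3)
    ),
}


def recovery_plan(provider, outage):
    """Pick a recovery action for an outage of the given duration."""
    policy = POLICIES[provider]
    if policy.auto_retry_window is None:
        return "redeliver via provider API"
    if outage <= policy.auto_retry_window:
        return "wait for provider retries"
    return "list and reprocess missed events"
```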

Failure-state runbook

Incident: retry storm / endpoint saturation

Keep signature validation on (do not disable auth during incident).
Switch ingest endpoint to queue-only mode (minimal parsing, immediate 2xx after durable enqueue).
Apply worker concurrency caps + backoff to downstream dependencies.
Monitor queue depth and max event age (minutes behind).
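The concurrency cap and backoff can be sketched with the standard library alone. The cap of 4, the base delay, and the 60-second ceiling are arbitrary assumptions, and backoff_delay and call_downstream are hypothetical names.

```python
import random
import threading

MAX_CONCURRENCY = 4  # hypothetical cap on parallel downstream calls
_slots = threading.BoundedSemaphore(MAX_CONCURRENCY)


def backoff_delay(attempt, base=0.5, cap=60.0):
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_downstream(fn, *args):
    """Run fn while holding one of the limited concurrency slots."""
    with _slots:
        return fn(*args)
```

Jitter spreads retries out so that workers recovering at the same moment do not hit the dependency in lockstep.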

Incident: processing bug caused bad side effects

Pause affected event types.
Patch worker logic.
Replay from ingress log for bounded time window.
Use side-effect idempotency ledger to avoid duplicate external actions.
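A bounded replay is a range query over the ingress log. The sketch below assumes a SQLite log with event_type, received_at, and payload_raw columns; REPLAY_DDL and replay_window are hypothetical names, and process stands for the patched worker logic.

```python
import sqlite3

REPLAY_DDL = """
CREATE TABLE IF NOT EXISTS ingress_log (
    id INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    received_at REAL NOT NULL,
    payload_raw BLOB NOT NULL
)
"""


def replay_window(db, event_type, start, end, process):
    """Re-run process over events received in [start, end), oldest first."""
    rows = db.execute(
        "SELECT id, payload_raw FROM ingress_log "
        "WHERE event_type = ? AND received_at >= ? AND received_at < ? "
        "ORDER BY received_at, id",
        (event_type, start, end),
    ).fetchall()
    for row_id, payload_raw in rows:
        process(row_id, payload_raw)
    return len(rows)
```

The half-open window makes consecutive replays easy to chain without processing a boundary event twice.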

Incident: downtime caused event gaps

Recover through provider APIs: redeliver failed GitHub deliveries with a script, or list recent Stripe events and process the ones you missed.
Reconcile by polling provider source-of-truth for critical entities.
Backfill missing state transitions.
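Gap detection reduces to a set difference between what the provider lists and what the ingress log holds. In this sketch, find_gaps is a hypothetical helper and the "id" and "created" field names are assumptions about the listed events.

```python
def find_gaps(provider_events, seen_event_ids):
    """Return provider events that were never ingested, oldest first."""
    missing = [e for e in provider_events if e["id"] not in seen_event_ids]
    return sorted(missing, key=lambda e: e["created"])
```

Feeding the result through the normal processing path, rather than a special backfill path, keeps dedupe and ordering guards in effect.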

Observability: metrics that matter

Track at least:

ingestion success rate,
signature verification failures,
2xx latency at ingress,
queue depth,
oldest queued event age,
dedupe hit rate,
processing success/failure by event type,
DLQ size and age,
replay volume per incident.

SLO suggestion:

Ingress availability: >= 99.9%
p95 ingress response: < 500 ms
Max event age: < 5 minutes (normal), < 30 minutes (degraded)
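The latency and event-age targets can be checked mechanically. The sketch below uses a nearest-rank p95 and takes raw samples as input; p95 and check_slos are hypothetical names, and a real setup would normally read these values from a metrics system.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def check_slos(latencies_ms, enqueued_at, now, degraded=False):
    """Compare ingress latency and oldest queued event age to the targets."""
    oldest_age_min = (now - min(enqueued_at)) / 60 if enqueued_at else 0.0
    age_limit_min = 30 if degraded else 5
    return {
        "p95_ingress_ok": p95(latencies_ms) < 500,
        "max_event_age_ok": oldest_age_min < age_limit_min,
    }
```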

Practical checklist

Before going live, confirm:

Signature verification on raw body is implemented and tested.
Replay-window timestamp checks are enforced.
Durable ingress + async queue path exists.
Event dedupe key has DB unique constraint.
Side effects are idempotent with key ledger.
Out-of-order protection exists (fetch-before-process and/or freshness guard).
Redelivery/reconciliation scripts are automated.
Dashboard includes queue age + DLQ + dedupe metrics.
Game-day test includes duplicate, delayed, reordered, and replayed events.

If any item above cannot be confirmed, the webhook integration is not production-ready yet.

References (researched)

GitHub Docs — Best practices for using webhooks
https://docs.github.com/en/webhooks/using-webhooks/best-practices-for-using-webhooks
GitHub Docs — Handling failed webhook deliveries
https://docs.github.com/en/webhooks/using-webhooks/handling-failed-webhook-deliveries
Stripe Docs — Receive Stripe events in your webhook endpoint
https://docs.stripe.com/webhooks
Stripe Docs — Process undelivered webhook events
https://docs.stripe.com/webhooks/process-undelivered-events
Standard Webhooks Specification
https://github.com/standard-webhooks/standard-webhooks/blob/main/spec/standard-webhooks.md
Svix Blog — Why You Can’t Guarantee Webhook Ordering
https://www.svix.com/blog/guaranteeing-webhook-ordering/
Hookdeck Blog — Webhooks at Scale: Best Practices and Lessons Learned
https://hookdeck.com/blog/webhooks-at-scale