Webhook Reliability Playbook (Security, Idempotency, Ordering, Recovery)
Date: 2026-02-26
Category: knowledge
Domain: backend integrations / event-driven systems
Why this matters
Webhook incidents are rarely about one bug. They are usually a compound failure:
- duplicate deliveries trigger duplicate side effects,
- out-of-order events overwrite newer state,
- slow handlers time out and create retry storms,
- no replay tooling means missed events stay missed.
If your webhook endpoint is “just an HTTP POST handler,” it will eventually fail in production.
First principles
Treat webhooks as:
- Untrusted input (must authenticate every request)
- At-least-once delivery (duplicates are normal)
- Potentially out-of-order (ordering is not guaranteed)
- Eventually consistent signals (payload may be stale vs provider source of truth)
This mindset avoids most expensive mistakes.
Production architecture (recommended)
1) Ingest layer (fast path)
On request:
- Verify signature using raw request body and provider secret/public key.
- Extract provider delivery ID + event ID + event type + event timestamp.
- Persist raw payload + metadata to an append-only ingress table/log.
- Enqueue internal job.
- Return 2xx quickly.
Goal: decouple provider delivery latency from business processing latency.
GitHub recommends responding within 10 seconds; otherwise delivery is considered failed.
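The fast path above can be sketched as one small function. This is a minimal sketch, assuming an HMAC-SHA256 hex signature over the raw body and JSON payloads with "id" and "type" fields. The names handle_webhook, INGRESS_LOG, and JOB_QUEUE are hypothetical, and the in-memory list and queue stand in for a durable log and a real job queue.

```python
import hashlib
import hmac
import json
import queue
import time

# Hypothetical in-memory stand-ins; production needs a durable log and queue.
INGRESS_LOG: list = []
JOB_QUEUE: queue.Queue = queue.Queue()


def handle_webhook(raw_body: bytes, signature_hex: str, secret: bytes) -> int:
    """Return the HTTP status code the endpoint should respond with."""
    # Verify against the raw bytes, before any JSON parsing or reformatting.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return 401

    try:
        event = json.loads(raw_body)
    except ValueError:
        return 400
    if not isinstance(event, dict):
        return 400

    # Persist first, then enqueue only a pointer to the stored record.
    record = {
        "event_id": event.get("id"),        # assumed field name
        "event_type": event.get("type"),    # assumed field name
        "received_at": time.time(),
        "payload_raw": raw_body,
        "signature_verified": True,
        "status": "received",
    }
    INGRESS_LOG.append(record)
    JOB_QUEUE.put(len(INGRESS_LOG) - 1)
    record["status"] = "enqueued"
    return 200
```

No business logic runs in the request path here; everything after the enqueue belongs to the worker.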
2) Processing layer (slow path)
Worker consumes queue and applies idempotent business logic:
- dedupe by stable event key,
- apply ordering guards (timestamp/version checks),
- execute side effects with idempotency keys,
- mark processing status (processing → processed/failed/dead-letter).
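The status transitions can be illustrated with a small worker step. This is a sketch under stated assumptions: jobs are plain dicts, the dedupe store is an in-memory set, and MAX_ATTEMPTS is an arbitrary illustrative value. The names process_job and MAX_ATTEMPTS are hypothetical.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: tune per workload


def process_job(job: dict, handler: Callable[[dict], None], seen: set) -> str:
    """Advance one job through processing -> processed/failed/dead_letter."""
    key = (job["provider"], job["event_id"])
    if key in seen:
        # Duplicate event: acknowledge it without running side effects again.
        job["status"] = "processed"
        return job["status"]

    job["status"] = "processing"
    job["attempts"] = job.get("attempts", 0) + 1
    try:
        handler(job)
    except Exception:
        # Park the job in the dead-letter state once retries are exhausted.
        exhausted = job["attempts"] >= MAX_ATTEMPTS
        job["status"] = "dead_letter" if exhausted else "failed"
        return job["status"]

    seen.add(key)
    job["status"] = "processed"
    return job["status"]
```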
3) Recovery layer
- replay failed jobs from internal DLQ,
- redeliver from provider APIs when available,
- reconcile from provider source-of-truth API for gap repair.
Data model that actually works
A) Ingress log (immutable)
- provider (stripe/github/...)
- delivery_id (provider delivery attempt id if available)
- event_id (provider logical event id)
- event_type
- event_time
- received_at
- payload_raw
- signature_verified (bool)
- status (received/enqueued/processing/processed/failed/dead_letter)
B) Dedup key
Use a unique constraint like:
- preferred: (provider, event_id)
- fallback (if no stable event id): hash of provider + canonical payload identity fields
Keep delivery-attempt IDs too, but dedupe on event identity for side effects.
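One way to enforce the dedup key is to let the database arbitrate. The sketch below uses SQLite for illustration only; the table name processed_events and the function claim_event are hypothetical, and the ingress table is trimmed to a few columns.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ingress_log (
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,
    delivery_id TEXT,              -- kept for audit, not used for dedupe
    event_id TEXT NOT NULL,
    payload_raw BLOB NOT NULL
);
CREATE TABLE processed_events (
    provider TEXT NOT NULL,
    event_id TEXT NOT NULL,
    PRIMARY KEY (provider, event_id)  -- the dedup key
);
"""


def claim_event(conn: sqlite3.Connection, provider: str, event_id: str) -> bool:
    """Return True only for the first caller to claim this logical event."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_events (provider, event_id) VALUES (?, ?)",
        (provider, event_id),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the key already existed.
    return cur.rowcount == 1
```

A worker would call claim_event before running side effects and skip the event when it returns False. The ingress log still records every delivery attempt.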
C) Idempotent side effects ledger
For each external action (email, shipment, entitlement change):
- idempotency_key
- action_type
- target_id
- result
- executed_at
Never fire side effects without recording/reusing idempotency keys.
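A ledger lookup before each action might look like the following. This is a single-process sketch, assuming SQLite and a key format such as "<event_id>:<action>"; the names run_once and side_effects are hypothetical. It does not cover a crash between the action and the ledger write, so where the downstream API accepts idempotency keys, the same key should be passed along as well.

```python
import sqlite3
import time
from typing import Callable

LEDGER_SCHEMA = """
CREATE TABLE side_effects (
    idempotency_key TEXT PRIMARY KEY,
    action_type TEXT NOT NULL,
    target_id TEXT NOT NULL,
    result TEXT,
    executed_at REAL
);
"""


def run_once(
    conn: sqlite3.Connection,
    key: str,
    action_type: str,
    target_id: str,
    action: Callable[[], str],
) -> str:
    """Run `action` at most once per idempotency key; reuse the stored result."""
    row = conn.execute(
        "SELECT result FROM side_effects WHERE idempotency_key = ?", (key,)
    ).fetchone()
    if row is not None:
        return row[0]  # already executed: return the recorded result

    result = action()
    conn.execute(
        "INSERT INTO side_effects VALUES (?, ?, ?, ?, ?)",
        (key, action_type, target_id, result, time.time()),
    )
    conn.commit()
    return result
```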
Ordering strategy (realistic, not fantasy)
Strict global ordering is usually fragile for webhooks. Prefer one of these:
1) Fetch-before-process
- Treat webhook as a hint.
- Fetch latest object state from provider API and apply state transition from source-of-truth.
2) Conditional upsert with freshness guard
- Upsert only if incoming event_time (or version counter) is newer.
3) Per-entity sequence check (if provider gives monotonic version/sequence)
- Apply only if incoming_seq > stored_seq.
For many systems, (1) + (2) together is the most robust.
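The freshness guard can be expressed as a guarded write. The sketch below assumes SQLite and a numeric event_time; the table entities and the function apply_if_newer are hypothetical names. Events with an equal timestamp are treated as stale here, which is a design choice rather than a rule.

```python
import sqlite3

ENTITY_SCHEMA = """
CREATE TABLE entities (
    entity_id TEXT PRIMARY KEY,
    state TEXT NOT NULL,
    event_time REAL NOT NULL
);
"""


def apply_if_newer(
    conn: sqlite3.Connection, entity_id: str, state: str, event_time: float
) -> bool:
    """Write `state` only if this event is newer than what is stored."""
    # First sighting of the entity: a plain insert succeeds.
    cur = conn.execute(
        "INSERT OR IGNORE INTO entities (entity_id, state, event_time) "
        "VALUES (?, ?, ?)",
        (entity_id, state, event_time),
    )
    if cur.rowcount == 1:
        conn.commit()
        return True

    # Existing entity: the WHERE clause is the freshness guard.
    cur = conn.execute(
        "UPDATE entities SET state = ?, event_time = ? "
        "WHERE entity_id = ? AND event_time < ?",
        (state, event_time, entity_id, event_time),
    )
    conn.commit()
    return cur.rowcount == 1
```

The same shape works for the per-entity sequence check by swapping event_time for a provider-supplied sequence number.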
Security controls (minimum bar)
Signature verification mandatory
- reject unsigned/invalid signatures.
Replay window check
- enforce timestamp tolerance (e.g., 5 minutes).
Raw-body verification
- verify before JSON mutation/reformatting.
HTTPS only + secret rotation
- dual-key overlap window for rotation.
Optional IP allowlisting
- useful defense-in-depth, but not a replacement for signatures.
Standard Webhooks guidance: sign message_id.timestamp.payload, where message ID helps idempotency and timestamp helps replay defense.
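Signature verification, the replay window, and dual-key rotation can be combined in one check. The sketch below loosely follows the message_id.timestamp.payload scheme with HMAC-SHA256 and a base64 signature. Header parsing and secret encoding are left out, and the function names and the 300-second tolerance are illustrative assumptions, not the specification's API.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional, Sequence

TOLERANCE_SECONDS = 300  # the 5-minute replay window used as an example above


def sign(secret: bytes, msg_id: str, timestamp: int, payload: bytes) -> str:
    """Sign `msg_id.timestamp.payload` over the raw payload bytes."""
    signed = msg_id.encode() + b"." + str(timestamp).encode() + b"." + payload
    digest = hmac.new(secret, signed, hashlib.sha256).digest()
    return base64.b64encode(digest).decode()


def verify(
    secrets: Sequence[bytes],
    msg_id: str,
    timestamp: int,
    payload: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Accept if the timestamp is fresh and any active secret matches."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # outside the replay window
    # Several secrets are accepted so old and new keys overlap during rotation.
    return any(
        hmac.compare_digest(
            sign(s, msg_id, timestamp, payload).encode(), signature.encode()
        )
        for s in secrets
    )
```

During rotation, pass both the new and the old secret; once the overlap window closes, drop the old one from the list.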
Provider-specific reliability notes
GitHub
- Delivery is considered failed if you do not respond within 10 seconds.
- GitHub does not automatically redeliver failed deliveries.
- You should run scheduled redelivery/recovery using GitHub webhook delivery APIs.
- X-GitHub-Delivery can be used as a unique delivery identifier.
Stripe
- In live mode, Stripe automatically retries undelivered webhook events for up to 3 days, with exponential backoff.
- During manual catch-up, ignore events you have already processed, but still return a success response so Stripe stops retrying them.
- Stripe docs also emphasize quick 2xx responses and asynchronous handling.
Design implication: your runbook cannot assume all providers retry the same way.
Failure-state runbook
Incident: retry storm / endpoint saturation
- Keep signature validation on (do not disable auth during incident).
- Switch ingest endpoint to queue-only mode (minimal parsing, immediate 2xx after durable enqueue).
- Apply worker concurrency caps + backoff to downstream dependencies.
- Monitor queue depth and max event age (minutes behind).
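For the backoff step, one common choice is exponential backoff with full jitter, so retrying workers do not hit a recovering dependency in lockstep. This is a minimal sketch; the function name and the default base and cap values are illustrative assumptions.

```python
import random
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Full jitter: a random delay between 0 and min(cap, base * 2**attempt).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling
```

The rng parameter exists only so the delay can be made deterministic in tests.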
Incident: processing bug caused bad side effects
- Pause affected event types.
- Patch worker logic.
- Replay from ingress log for bounded time window.
- Use side-effect idempotency ledger to avoid duplicate external actions.
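Selecting the replay set from the ingress log can be as simple as a filter over event type and a half-open time window. The sketch below assumes rows are dicts with event_type and received_at fields, matching the ingress log fields described earlier; select_for_replay is a hypothetical name.

```python
from typing import Iterable, List, Set


def select_for_replay(
    ingress_rows: Iterable[dict],
    event_types: Set[str],
    start: float,
    end: float,
) -> List[dict]:
    """Pick affected rows with start <= received_at < end, oldest first."""
    selected = [
        row
        for row in ingress_rows
        if row["event_type"] in event_types and start <= row["received_at"] < end
    ]
    # Replay in arrival order so ordering guards see events as they first came in.
    return sorted(selected, key=lambda row: row["received_at"])
```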
Incident: downtime caused event gaps
- Recover from provider redelivery APIs (GitHub script / Stripe event listing).
- Reconcile by polling provider source-of-truth for critical entities.
- Backfill missing state transitions.
Observability: metrics that matter
Track at least:
- ingestion success rate,
- signature verification failures,
- 2xx latency at ingress,
- queue depth,
- oldest queued event age,
- dedupe hit rate,
- processing success/failure by event type,
- DLQ size and age,
- replay volume per incident.
SLO suggestion:
- Ingress availability: >= 99.9%
- p95 ingress response: < 500ms
- Max event age: < 5 minutes (normal), < 30 minutes (degraded)
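Two of these numbers are easy to compute from data the pipeline already holds. The helpers below are a sketch, assuming enqueue timestamps and latency samples are available as plain lists; the function names are hypothetical and the percentile uses the nearest-rank method.

```python
from typing import List, Sequence


def oldest_event_age_seconds(enqueued_at: Sequence[float], now: float) -> float:
    """Age of the oldest event still queued; 0.0 when the queue is empty."""
    return max((now - t for t in enqueued_at), default=0.0)


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the given samples."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered: List[float] = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

An alert on the normal-mode target would then compare oldest_event_age_seconds against 300 and p95 of ingress latency against 0.5 seconds.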
Practical checklist
Before going live, confirm:
- Signature verification on raw body is implemented and tested.
- Replay-window timestamp checks are enforced.
- Durable ingress + async queue path exists.
- Event dedupe key has DB unique constraint.
- Side effects are idempotent with key ledger.
- Out-of-order protection exists (fetch-before-process and/or freshness guard).
- Redelivery/reconciliation scripts are automated.
- Dashboard includes queue age + DLQ + dedupe metrics.
- Game-day test includes duplicate, delayed, reordered, and replayed events.
If any checkbox is missing, webhook reliability is not production-ready yet.
References (researched)
- GitHub Docs: Best practices for using webhooks
https://docs.github.com/en/webhooks/using-webhooks/best-practices-for-using-webhooks
- GitHub Docs: Handling failed webhook deliveries
https://docs.github.com/en/webhooks/using-webhooks/handling-failed-webhook-deliveries
- Stripe Docs: Receive Stripe events in your webhook endpoint
https://docs.stripe.com/webhooks
- Stripe Docs: Process undelivered webhook events
https://docs.stripe.com/webhooks/process-undelivered-events
- Standard Webhooks Specification
https://github.com/standard-webhooks/standard-webhooks/blob/main/spec/standard-webhooks.md
- Svix Blog: Why You Can’t Guarantee Webhook Ordering
https://www.svix.com/blog/guaranteeing-webhook-ordering/
- Hookdeck Blog: Webhooks at Scale: Best Practices and Lessons Learned
https://hookdeck.com/blog/webhooks-at-scale