Event-Time Watermarks & Late Data Playbook

2026-02-27 · software

Purpose: Practical guide for designing stream pipelines that stay correct under out-of-order/late events without exploding latency or state cost.


1) Why this matters

Most real-time bugs are time bugs.

If you aggregate by processing time only, you get low latency but non-deterministic results during backpressure, retries, partition skew, or replay. If you aggregate by event time with no guardrails, you can get unbounded state, stalled outputs, or silent data drops.

The core engineering trade-off is between three properties:

  1. Freshness (how quickly results are emitted),
  2. Completeness (how much out-of-order/late data is included),
  3. Cost (how long state must stay resident).

You can’t maximize all three simultaneously. Watermark strategy is the control surface.


2) Mental model (keep this in your head)

Event time vs processing time

Event time is when the event actually happened (a timestamp carried in the payload); processing time is when the pipeline observes it. A healthy pipeline treats processing time as runtime reality, and event time as analytical truth.

Watermark

Operationally, a watermark is a statement like:

“I believe events older than T are mostly complete.”

Important: watermark is not a perfect truth guarantee. It is a bounded-risk contract.
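As a toy illustration of that contract (not any specific engine's API), a bounded-out-of-orderness watermark can be sketched as:

```python
from datetime import datetime, timedelta

def watermark(observed_event_times, lag):
    """Bounded out-of-orderness: trail the newest observed event time
    by a fixed lag. Events older than this are considered 'mostly complete'."""
    if not observed_event_times:
        return None
    return max(observed_event_times) - lag

events = [
    datetime(2026, 2, 27, 12, 0, 5),
    datetime(2026, 2, 27, 12, 0, 1),   # arrived out of order
]
wm = watermark(events, lag=timedelta(seconds=10))
# wm trails the newest event (12:00:05) by the configured 10 s lag
```

The lag is the bounded risk: the larger it is, the fewer events fall behind the watermark, at the cost of later window closure.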

Late data

Any event whose timestamp is older than the current watermark is late. The late-handling policy decides whether it is:

  1. Included via allowed lateness,
  2. Routed to side output / dead-letter stream,
  3. Dropped.
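That three-way decision can be sketched with plain numbers as timestamps (names are illustrative):

```python
def route_event(event_time, watermark, allowed_lateness):
    """Decide the fate of an event relative to the current watermark."""
    if event_time >= watermark:
        return "on_time"                    # not late at all
    if event_time >= watermark - allowed_lateness:
        return "late_included"              # policy 1: within allowed lateness
    return "side_output"                    # policy 2 (or "dropped", policy 3)
```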

3) The 5 knobs that decide everything

1. Timestamp quality

Extract event time from the payload itself, not from arrival time; clock skew, missing timestamps, and producer-side buffering all corrupt every downstream decision.

2. Watermark lag

Rule of thumb: start from observed inter-arrival delay percentiles, not gut feeling.
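For instance, delay percentiles can be estimated from observed ingest_time minus event_time gaps; a nearest-rank sketch:

```python
def delay_percentiles(delays_seconds, qs=(50, 90, 99)):
    """Nearest-rank percentile estimate of observed event-time delays."""
    s = sorted(delays_seconds)
    return {q: s[int(q / 100 * (len(s) - 1))] for q in qs}

# e.g. with observed delays of 1..100 seconds:
pcts = delay_percentiles(range(1, 101))
# a watermark lag near pcts[99] covers roughly 99% of arrivals
```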

3. Window type and size

Window size multiplies with watermark lag to define how long state stays resident.
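A back-of-the-envelope sketch of that relationship (all figures illustrative):

```python
def state_residency_seconds(window_s, watermark_lag_s, allowed_lateness_s):
    """A window's state must survive roughly window + watermark lag +
    allowed lateness before it can be safely purged."""
    return window_s + watermark_lag_s + allowed_lateness_s

# 10 min windows, 1 min watermark lag, 2 min allowed lateness:
residency = state_residency_seconds(600, 60, 120)   # 780 s per window
```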

4. Allowed lateness

How long a closed window keeps accepting stragglers. It buys completeness at the price of longer state retention; keep it as small as the business allows.

5. Late-data routing

Never drop silently. Route very-late records to a side output / dead-letter stream that carries enough context (key, window, lateness) for the correction job to reconcile them later.

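A sketch of the context such a routed record could carry (field names are illustrative, not a standard schema):

```python
def dead_letter_record(event, current_watermark, reason="too_late"):
    """Wrap a too-late event with enough context to reconcile it later."""
    return {
        "event": event,
        "watermark_at_arrival": current_watermark,
        "lateness_s": current_watermark - event["event_time"],
        "reason": reason,
    }

rec = dead_letter_record({"event_time": 80, "key": "a"}, current_watermark=90)
```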

4) Architecture pattern (production default)

  1. Bronze (raw immutable): append-only source events with event_time + ingest_time.
  2. Silver (streaming aggregate): event-time windows + watermark + bounded lateness.
  3. Late lane: side output for too-late data.
  4. Correction job: periodic backfill/reconciliation from bronze + late lane.
  5. Gold serving: upsert/merge serving table with versioned updates.

This pattern separates low-latency serving from eventual correction, avoiding false “exactly right now” promises.
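The gold-layer step (versioned upserts) can be sketched with an in-memory table, where a newer version of a window's aggregate always wins:

```python
def upsert(gold, updates):
    """Merge versioned window aggregates into a serving table (dict sketch).
    Both maps use (key, window) -> (version, value); newest version wins."""
    for kw, (version, value) in updates.items():
        current = gold.get(kw)
        if current is None or version > current[0]:
            gold[kw] = (version, value)
    return gold

gold = {("user_a", "w1"): (1, 10)}
# a correction job re-emits w1 with a newer version, plus a fresh window
upsert(gold, {("user_a", "w1"): (2, 12), ("user_b", "w1"): (1, 5)})
```

Because updates are idempotent and versioned, the correction job can safely re-emit overlapping results.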


5) Failure modes you should expect

A) Watermark freeze (no progress)

Symptoms:

  1. Event-time windows never close; output stalls while input keeps flowing,
  2. Watermark lag grows without bound.

Likely causes:

  1. An idle partition or source holding back the global watermark (it advances as the minimum across inputs),
  2. A producer with a stuck or badly skewed clock.

Mitigations:

  1. Configure idle-source / idle-partition timeouts,
  2. Monitor the watermark per partition, not just globally,
  3. Alert when watermark lag exceeds a threshold.
B) Catch-up replay drops historical truth

Symptoms:

  1. After a replay/backfill, aggregates are smaller than a batch recomputation,
  2. Late-drop counters spike during catch-up.

Mitigations:

  1. Use a dedicated replay mode with a larger watermark lag and allowed lateness,
  2. Reconcile from the bronze layer and late lane via the correction job.

C) Sink semantics mismatch

Symptoms:

  1. Duplicates or lost updates in the serving table,
  2. Window results corrected upstream that never change downstream.

Mitigations:

  1. Match window output mode to the sink (append vs upsert vs retract-aware),
  2. Use idempotent upserts keyed by (key, window) with a version column.


6) SLOs and observability checklist

Track these by stream and by key segment:

  1. Watermark lag (processing time minus watermark),
  2. Event-time delay distribution (p50/p90/p99/p99.9),
  3. Late-event rate, split by outcome (included / side output / dropped),
  4. State size and checkpoint duration,
  5. End-to-end freshness of the serving table.

Set guardrails:

  1. Alert when the watermark stops advancing,
  2. Alert when the late-drop rate exceeds its budget,
  3. Alert when state size or checkpoint duration trends up without bound.

7) Tuning playbook (step-by-step)

  1. Baseline delay distribution from raw stream (p50/p90/p99/p99.9 by source).
  2. Set initial watermark lag near p99 delay.
  3. Choose allowed lateness for business criticality (often smaller than you think).
  4. Define sink semantics (append vs upsert vs retract-aware).
  5. Run shadow pipeline with stricter and looser configs.
  6. Compare:
    • freshness,
    • late loss,
    • cost (state/checkpoint).
  7. Promote config via staged rollout; keep replay mode documented.
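Step 6's comparison can be reduced to a simple selection rule, e.g. the cheapest config that meets the late-loss budget (names and thresholds illustrative):

```python
def pick_config(candidates, max_late_loss=0.001):
    """Among shadow-run configs, pick the cheapest one meeting the budget."""
    ok = [c for c in candidates if c["late_loss"] <= max_late_loss]
    return min(ok, key=lambda c: c["state_gb"])["name"] if ok else None

winner = pick_config([
    {"name": "strict", "late_loss": 0.0100, "state_gb": 1},
    {"name": "loose",  "late_loss": 0.0005, "state_gb": 4},
    {"name": "mid",    "late_loss": 0.0008, "state_gb": 2},
])
```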

8) Minimal policy template
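One possible shape for such a policy, as a hedged sketch (field names and values are illustrative, not any engine's schema):

```yaml
stream: orders
timestamp:
  field: event_time        # from the payload, never arrival time
watermark:
  lag: 120s                # start near observed p99 delay
late_data:
  allowed_lateness: 300s
  too_late: side_output    # never silent drop
sink:
  mode: upsert             # keyed by (key, window), versioned
correction:
  sources: [bronze, late_lane]
  schedule: periodic
```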


9) What “exactly-once” does not save you from

Exactly-once guarantees in stream engines usually protect state/sink consistency under retries/checkpoint recovery. They do not decide your event-time truth boundary.

You can still be exactly-once and wrong if watermark/lateness policies are misconfigured.
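A toy simulation of that point: delivery is perfect (each event processed exactly once), yet the watermark lag alone decides whether a straggler makes it into the window:

```python
def count_in_window(arrival_order, lag, window_end=10):
    """Count events for window [0, window_end). Each event is processed
    exactly once, but is dropped if the watermark already closed the window."""
    wm = float("-inf")
    count = 0
    for t in arrival_order:            # event timestamps, in arrival order
        wm = max(wm, t - lag)          # watermark trails the newest event
        if t < window_end and wm < window_end:
            count += 1
    return count

arrivals = [2, 9, 15, 8]               # the event at t=8 arrives after t=15
# same exactly-once input, different answers:
# lag=3 closes the window before 8 arrives; lag=10 keeps it open
```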


10) Practical defaults (good first production pass)

  1. Watermark lag near the observed p99 event-time delay,
  2. Allowed lateness well below the watermark lag, sized to business criticality,
  3. Side output for too-late records; never silent drops,
  4. Idempotent upsert sink keyed by (key, window),
  5. Periodic correction job reconciling from bronze + late lane.
