Event-Time Watermarks & Late Data Playbook
Date: 2026-02-27
Category: software
Purpose: Practical guide for designing stream pipelines that stay correct under out-of-order/late events without exploding latency or state cost.
1) Why this matters
Most real-time bugs are time bugs.
If you aggregate by processing time only, you get low latency but non-deterministic results during backpressure, retries, partition skew, or replay. If you aggregate by event time with no guardrails, you can get unbounded state, stalled outputs, or silent data drops.
The core engineering trade-off is:
- Correctness (include late/out-of-order events)
- Freshness (emit results quickly)
- Cost (state size, checkpoint/recovery overhead)
You can’t maximize all three simultaneously. Watermark strategy is the control surface.
2) Mental model (keep this in your head)
Event time vs processing time
- Event time: when the event actually happened at the source.
- Processing time: when your engine observes it.
A healthy pipeline treats processing time as runtime reality, and event time as analytical truth.
Watermark
Operationally, a watermark is a statement like:
“I believe events older than T are mostly complete.”
Important: a watermark is not a perfect completeness guarantee. It is a bounded-risk contract.
Late data
Any event whose timestamp is older than the current watermark is late. The late-handling policy decides whether it is:
- Included via allowed lateness,
- Routed to side output / dead-letter stream,
- Dropped.
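The watermark contract and the three-way late-handling decision can be sketched in a few lines of engine-agnostic Python. This is an illustrative bounded-out-of-orderness model (watermark = max event time seen minus a fixed lag); the 60 s lag and 30 s allowed lateness are example values, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class WatermarkTracker:
    """Bounded-out-of-orderness watermark: max event time seen minus a fixed lag."""
    lag_s: float                      # assumed out-of-order bound
    allowed_lateness_s: float = 0.0   # extra grace after the watermark
    _max_event_time: float = float("-inf")

    @property
    def watermark(self) -> float:
        return self._max_event_time - self.lag_s

    def classify(self, event_time_s: float) -> str:
        """Label the event against the current watermark, then advance it."""
        wm_before = self.watermark
        self._max_event_time = max(self._max_event_time, event_time_s)
        if event_time_s >= wm_before:
            return "on_time"
        if event_time_s >= wm_before - self.allowed_lateness_s:
            return "late_included"   # still allowed to update its window
        return "very_late"           # route to a side output, never drop silently

tracker = WatermarkTracker(lag_s=60, allowed_lateness_s=30)
labels = [tracker.classify(t) for t in [100, 160, 130, 90, 50]]
```

Note how the event at t=90 is late but within the grace window, while t=50 falls behind even the allowed lateness and goes to the late lane.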
3) The 5 knobs that decide everything
1. Timestamp quality
- Use source event timestamp if trustworthy.
- If upstream clocks are noisy, add ingest timestamp too (dual timestamp schema).
- Monitor clock drift distribution (p50/p95/p99).
2. Watermark lag
- Larger lag -> better completeness, higher latency/state.
- Smaller lag -> lower latency, more late drops.
Rule of thumb: start from observed inter-arrival delay percentiles, not gut feeling.
3. Window type and size
- Tumbling for stable reporting,
- Sliding for smoother signals,
- Session for bursty user/activity behavior.
Window size adds to watermark lag (plus allowed lateness) to define how long state stays resident.
4. Allowed lateness
- Extra grace after watermark for updates.
- Should be short and explicit.
- Pair with update/upsert sink semantics when possible.
5. Late-data routing
Never drop silently. Route very-late records to:
- a late_events topic/table,
- a periodic correction batch,
- a quality dashboard.
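Knob 2’s rule of thumb (derive watermark lag from observed delay percentiles) is easy to make concrete. A hypothetical helper below picks an initial lag from samples of (ingest_time − event_time); the nearest-rank percentile, p99 target, and 20% headroom are all illustrative assumptions:

```python
def percentile(sorted_vals, p):
    """Nearest-rank percentile (no interpolation; assumes sorted input)."""
    k = round(p / 100 * len(sorted_vals)) - 1
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def suggest_watermark_lag(delays_s, pct=99, headroom=1.2):
    """Initial watermark lag = observed delay percentile plus ~20% headroom.

    delays_s: samples of (ingest_time - event_time) from the raw stream.
    Measured percentiles beat gut feeling; tune pct/headroom per stream.
    """
    return percentile(sorted(delays_s), pct) * headroom

# Illustrative delay samples in seconds
delays = [0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 45.0, 90.0]
lag_s = suggest_watermark_lag(delays)
```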
4) Architecture pattern (production default)
- Bronze (raw immutable): append-only source events with event_time + ingest_time.
- Silver (streaming aggregate): event-time windows + watermark + bounded lateness.
- Late lane: side output for too-late data.
- Correction job: periodic backfill/reconciliation from bronze + late lane.
- Gold serving: upsert/merge serving table with versioned updates.
This pattern separates low-latency serving from eventual correction, avoiding false “exactly right now” promises.
5) Failure modes you should expect
A) Watermark freeze (no progress)
Symptoms:
- Window outputs stop even though ingestion continues.
- Checkpoint size grows.
Likely causes:
- An idle partition or source pinning the global watermark.
- Skewed key or stalled upstream shard.
Mitigations:
- Enable idle source handling.
- Per-partition watermarking where supported.
- Alert on watermark stagnation duration.
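The freeze mechanic is easiest to see in code: a global watermark is typically the minimum over per-partition watermarks, so one silent partition pins it indefinitely. The sketch below mirrors the idea behind idleness detection (as in Flink’s idle-source handling) without being any engine’s actual implementation:

```python
def global_watermark(partition_wms, last_seen, now_s, idle_timeout_s):
    """Minimum watermark over *active* partitions.

    partition_wms: {partition: current watermark}
    last_seen:     {partition: processing time of its last event}
    Partitions silent longer than idle_timeout_s are excluded so they
    cannot pin the global watermark.
    """
    active = [wm for p, wm in partition_wms.items()
              if now_s - last_seen[p] <= idle_timeout_s]
    # If everything is idle, fall back to the most advanced partition.
    return min(active) if active else max(partition_wms.values())

wms = {"p0": 1000, "p1": 990, "p2": 400}    # p2 stalled long ago
seen = {"p0": 5000, "p1": 5001, "p2": 3000}
wm = global_watermark(wms, seen, now_s=5002, idle_timeout_s=60)
```

Without the idleness check, the result would be 400 and every downstream window would stall; with it, p2 is excluded and the watermark advances to 990.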
B) Catch-up replay drops historical truth
Symptoms:
- After outage replay, old events are discarded as “too late”.
Mitigations:
- Temporarily widen watermark/allowed lateness for replay mode,
- or run dedicated backfill pipeline and merge corrections.
C) Sink semantics mismatch
Symptoms:
- Late updates are computed, but the sink is append-only -> duplicates/wrong totals.
Mitigations:
- Use upsert-capable sink for mutable windows,
- or explicitly model revision records (window_id, version, is_final).
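A minimal sketch of what those revision records buy you: merging by highest version per window recovers upsert semantics even from an append-only log. The record fields follow the (window_id, version, is_final) model above; the "total" field is a hypothetical aggregate:

```python
def latest_revisions(records):
    """Collapse revision records to the newest version per window (upsert sketch).

    Each record: {"window_id": ..., "version": int, "is_final": bool, "total": ...}
    An append-only sink keeps every row; merging by highest version yields
    the corrected totals instead of duplicates.
    """
    latest = {}
    for rec in records:
        cur = latest.get(rec["window_id"])
        if cur is None or rec["version"] > cur["version"]:
            latest[rec["window_id"]] = rec
    return latest

rows = [
    {"window_id": "w1", "version": 1, "is_final": False, "total": 10},
    {"window_id": "w1", "version": 2, "is_final": True,  "total": 12},  # late update
    {"window_id": "w2", "version": 1, "is_final": True,  "total": 7},
]
merged = latest_revisions(rows)
```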
6) SLOs and observability checklist
Track these by stream and by key segment:
- Watermark lag (event_time_now - watermark)
- Late-event ratio (overall + by lateness bucket)
- Side-output late volume
- Window finalize delay
- State bytes / key cardinality
- Reprocessing correction delta (% changed records)
Set guardrails:
- Watermark lag p95 > threshold for N minutes -> page
- Late ratio spike > baseline x2 -> investigate upstream timing/drift
- Correction delta > tolerance -> data quality incident
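Those three guardrails map directly to alert actions. A hypothetical evaluator (thresholds and message strings are placeholders; a real setup would live in your alerting system, not application code):

```python
def evaluate_guardrails(wm_lag_p95_s, lag_threshold_s,
                        late_ratio, late_baseline,
                        correction_delta, delta_tolerance):
    """Turn the three stream-SLO guardrails into alert actions."""
    alerts = []
    if wm_lag_p95_s > lag_threshold_s:
        alerts.append("page: watermark lag p95 over threshold")
    if late_ratio > 2 * late_baseline:
        alerts.append("investigate: late ratio spiked above 2x baseline")
    if correction_delta > delta_tolerance:
        alerts.append("incident: reprocessing correction delta over tolerance")
    return alerts

alerts = evaluate_guardrails(wm_lag_p95_s=900, lag_threshold_s=600,
                             late_ratio=0.05, late_baseline=0.04,
                             correction_delta=0.001, delta_tolerance=0.01)
```

In this example only the watermark-lag guardrail fires; the late ratio is elevated but still under twice its baseline.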
7) Tuning playbook (step-by-step)
- Measure the baseline delay distribution from the raw stream (p50/p90/p99/p99.9 by source).
- Set initial watermark lag near p99 delay.
- Choose allowed lateness based on business criticality (often smaller than you think).
- Define sink semantics (append vs upsert vs retract-aware).
- Run shadow pipeline with stricter and looser configs.
- Compare:
- freshness,
- late loss,
- cost (state/checkpoint).
- Promote config via staged rollout; keep replay mode documented.
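The shadow-comparison step can be sketched as a per-axis scoreboard. This is a hypothetical helper with made-up metric names and numbers; a real rollout decision would weigh these axes against SLOs rather than picking per-axis winners:

```python
def compare_shadow_runs(strict, loose):
    """Compare two shadow-pipeline runs on freshness, late loss, and cost.

    Each run: {"freshness_p95_s": ..., "late_loss_ratio": ..., "state_gb": ...}
    Lower is better on every axis; ties go to the strict config.
    """
    return {axis: ("strict" if strict[axis] <= loose[axis] else "loose")
            for axis in ("freshness_p95_s", "late_loss_ratio", "state_gb")}

strict_run = {"freshness_p95_s": 30,  "late_loss_ratio": 0.02,  "state_gb": 4}
loose_run  = {"freshness_p95_s": 120, "late_loss_ratio": 0.001, "state_gb": 11}
winners = compare_shadow_runs(strict_run, loose_run)
```

Here the strict (smaller-lag) config wins on freshness and cost but loses data to the late lane, which is exactly the trade-off the playbook asks you to quantify before promoting.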
8) Minimal policy template
- Event timestamp source: event_time (fallback: ingest_time + quality flag)
- Watermark lag: X minutes
- Allowed lateness: Y minutes
- Very-late handling: side output late_events
- Sink mode: upsert by (window_start, window_end, entity_key)
- Reconciliation cadence: hourly/daily
- Incident trigger: watermark stagnation > Z minutes
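The template could also be captured as machine-readable config. A hypothetical Python rendering (field names are my own; X, Y, and Z are deliberately left unset so each stream must fill them explicitly):

```python
LATE_DATA_POLICY = {
    "event_timestamp_source": "event_time",
    "timestamp_fallback": {"field": "ingest_time", "quality_flag": True},
    "watermark_lag_minutes": None,        # X: set from observed delay percentiles
    "allowed_lateness_minutes": None,     # Y: business-criticality decision
    "very_late_handling": {"mode": "side_output", "target": "late_events"},
    "sink_mode": {"type": "upsert",
                  "keys": ["window_start", "window_end", "entity_key"]},
    "reconciliation_cadence": "hourly",   # or "daily"
    "watermark_stagnation_alert_minutes": None,  # Z: incident trigger
}
```

Keeping the unset values as explicit None forces a per-stream decision at deploy time instead of a silent default.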
9) What “exactly-once” does not save you from
Exactly-once guarantees in stream engines usually protect state/sink consistency under retries/checkpoint recovery. They do not decide your event-time truth boundary.
You can still be exactly-once and wrong if watermark/lateness policies are misconfigured.
10) Practical defaults (good first production pass)
- Prefer event-time windows for analytics/business metrics.
- Keep watermark policy explicit in code and runbook.
- Treat late events as first-class data, not noise.
- Add correction lane early; retrofitting later is expensive.
- Review delay distribution quarterly (traffic patterns drift).
References
- Databricks Structured Streaming docs — watermarks and state/output semantics:
  https://docs.databricks.com/aws/en/structured-streaming/watermarks
- Confluent Cloud for Flink docs — event time, processing time, watermark behavior:
  https://docs.confluent.io/cloud/current/flink/concepts/timely-stream-processing.html
- Apache Flink docs — watermark strategy and window lifecycle/allowed lateness:
  https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/
  https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/windows/#allowed-lateness