Kafka Exactly-Once Semantics Playbook (Idempotent Producer, Transactions, and Real-World Limits)

Date: 2026-03-24
Category: knowledge / software

TL;DR

Start with the idempotent producer for single-stream duplicate prevention (enable.idempotence=true).
Use transactions only when you must do consume-transform-produce (or multi-partition writes) atomically and can afford the operational complexity.
For consumers, EOS is incomplete unless they read with isolation.level=read_committed.
Kafka EOS is not magic across external systems (DB, REST calls). For end-to-end correctness, combine with outbox/inbox/idempotency keys.

1) What “exactly once” in Kafka really means

In Kafka, “exactly once” is a scoped guarantee:

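Producer idempotence: the broker de-duplicates retried appends from the same producer session, per partition.
Transactions: writes across multiple partitions (and consumer offset commits) become visible atomically, all or nothing.
Read committed: consumers with isolation.level=read_committed receive only records from committed transactions; aborted writes are filtered out.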

It does not mean “exactly once across every side effect in your architecture.”
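
To make the first scope concrete, here is a toy model of sequence-number de-duplication. It is a simplified illustration, not the broker's actual implementation: ToyPartitionLog and its fields are hypothetical names, and the real broker tracks more state (for example producer epochs and a window of recent batches).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartitionLog:
    """Hypothetical, simplified model of one partition's idempotent append path."""
    records: list = field(default_factory=list)
    # Highest sequence number accepted so far, keyed by producer id.
    last_seq: dict = field(default_factory=dict)

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # A retry of something already written: acknowledge, do not re-append.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier batch is missing; reject instead of reordering.
            return "out_of_order"
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


log = ToyPartitionLog()
results = [
    log.append(producer_id=7, seq=0, value="a"),
    log.append(producer_id=7, seq=0, value="a"),  # retried after a lost ack
    log.append(producer_id=7, seq=1, value="b"),
    log.append(producer_id=8, seq=0, value="a"),  # new session: not de-duplicated
]
```

The last call is the important one: a restarted producer that gets a new identity and re-sends "a" is not recognized as a duplicate, which is why idempotence alone is a per-session guarantee.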

2) Decision ladder: which guarantee level to choose

Situation	Recommended mode	Why
Fire-and-forget telemetry, occasional loss acceptable	at-most-once-ish	Lowest overhead
Most event pipelines, duplicates acceptable with consumer dedup	at-least-once + idempotent consumers	Simpler ops
Need duplicate-safe producer writes per partition	idempotent producer	Cheap reliability upgrade
Need atomic consume→produce or multi-topic atomic publish	transactions + read_committed	True Kafka-level EOS
Need DB + Kafka atomicity	transactional outbox (+ CDC/inbox)	Kafka transaction alone is insufficient
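
The last row deserves a concrete picture. Below is a minimal sketch of the outbox write path, using SQLite from the Python standard library as a stand-in database. The orders and outbox tables, the place_order and relay_once helpers, and the publish callback are all hypothetical names chosen for illustration. The point is that the business row and the event row commit in one local DB transaction, and a separate relay publishes to Kafka afterwards.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        topic TEXT NOT NULL,
        msg_key TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: str, amount: int) -> None:
    # One local DB transaction: both rows commit or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox (topic, msg_key, payload) VALUES (?, ?, ?)",
            ("orders.events", order_id, json.dumps({"id": order_id, "amount": amount})),
        )


def relay_once(publish) -> int:
    """Publish unpublished rows in order; `publish` stands in for a producer send."""
    rows = conn.execute(
        "SELECT seq, topic, msg_key, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, msg_key, payload in rows:
        publish(topic, msg_key, payload)  # may repeat if we crash before the UPDATE
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


place_order("o-1", 500)
sent = []
relay_once(lambda topic, msg_key, payload: sent.append((topic, msg_key)))
```

Note that the relay is at-least-once by construction (a crash between publish and the UPDATE re-sends the row), so downstream consumers still need de-duplication.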

3) Producer idempotence baseline (default-safe)

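Use idempotence as the default baseline for production producers. Kafka 3.0+ clients enable it by default, but setting it explicitly guards against older clients and conflicting overrides.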

Typical safety constraints:

enable.idempotence=true
acks=all
retries>0 (often effectively large, bounded by delivery.timeout.ms)
max.in.flight.requests.per.connection<=5 (required for idempotent ordering guarantees)

Practical note: idempotence protects retries, but by itself it does not provide atomic cross-partition semantics.
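
One way to validate these constraints in CI or at startup is a small check over the effective (fully resolved) producer config. This is a sketch under assumptions: idempotence_problems is a hypothetical helper, and it expects every key to be present, so a missing key is reported and not silently defaulted.

```python
def idempotence_problems(cfg: dict) -> list:
    """Return human-readable problems; an empty list means the constraints hold."""
    checks = {
        "enable.idempotence": lambda v: str(v).lower() == "true",
        "acks": lambda v: str(v) in ("all", "-1"),
        "retries": lambda v: int(v) > 0,
        "max.in.flight.requests.per.connection": lambda v: 1 <= int(v) <= 5,
    }
    problems = []
    for key, is_safe in checks.items():
        if key not in cfg:
            problems.append(f"{key}: missing from effective config")
        elif not is_safe(cfg[key]):
            problems.append(f"{key}: unsafe value {cfg[key]!r}")
    return problems


good = {
    "enable.idempotence": "true",
    "acks": "all",
    "retries": 2147483647,
    "max.in.flight.requests.per.connection": 5,
}
bad = {**good, "acks": "1", "max.in.flight.requests.per.connection": 10}

good_report = idempotence_problems(good)
bad_report = idempotence_problems(bad)
```

Here good_report is empty, while bad_report flags both acks and max.in.flight.requests.per.connection.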

4) Transactional producer pattern (consume → process → produce)

Core lifecycle:

initTransactions() once at startup
per batch/window:
- beginTransaction()
- produce output records
- sendOffsetsToTransaction(...)
- commitTransaction()
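on failure: abortTransaction() for abortable errors; on fatal errors (such as ProducerFencedException, OutOfOrderSequenceException, or AuthorizationException) close the producer instead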
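
The per-batch part of this lifecycle can be sketched as a single function. This is an illustration under assumptions, not a production loop: process_batch and RecordingProducer are hypothetical names, the producer is duck-typed with method names that follow the confluent-kafka Python Producer, and init_transactions() is assumed to have been called once at startup.

```python
def process_batch(producer, records, transform, output_topic, offsets, group_metadata):
    """Run one consume-transform-produce batch inside a single transaction.

    `offsets` and `group_metadata` come from the consumer that read `records`.
    Returns True on commit, False if the transaction was aborted.
    """
    producer.begin_transaction()
    try:
        for record in records:
            producer.produce(output_topic, transform(record))
        # Input offsets are committed atomically with the output records.
        producer.send_offsets_to_transaction(offsets, group_metadata)
        producer.commit_transaction()
        return True
    except Exception:
        # Simplification: real code must separate abortable errors from fatal
        # ones (for example a fenced producer), where abort is not valid.
        producer.abort_transaction()
        return False


class RecordingProducer:
    """Hypothetical test double that records calls instead of talking to Kafka."""

    def __init__(self):
        self.calls = []

    def begin_transaction(self):
        self.calls.append("begin")

    def produce(self, topic, value):
        self.calls.append(("produce", topic, value))

    def send_offsets_to_transaction(self, offsets, group_metadata):
        self.calls.append("offsets")

    def commit_transaction(self):
        self.calls.append("commit")

    def abort_transaction(self):
        self.calls.append("abort")


fake = RecordingProducer()
ok = process_batch(fake, ["a", "b"], str.upper, "out", offsets=[], group_metadata=None)
```

When the function returns False, the caller still has to rewind the consumer to the last committed offsets before retrying, otherwise the aborted batch is skipped.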

Operational rules

transactional.id must be stable and unique per logical producer identity.
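Reusing the same transactional.id across concurrent live instances triggers fencing: the newest instance to call initTransactions() wins, and the older one fails with ProducerFencedException.
Tune transaction.timeout.ms (default 60 s; the broker rejects values above transaction.max.timeout.ms) to match realistic batch duration + jitter, not wishful latency.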
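
A minimal sketch of both rules, assuming each logical instance has a stable ordinal (for example from the orchestrator). The helper names and the "-txn-" naming scheme are hypothetical; the 900000 ms default cap mirrors the broker default for transaction.max.timeout.ms and should be replaced with your cluster's actual value.

```python
def transactional_id(app: str, instance_ordinal: int) -> str:
    """Stable across restarts of the same logical instance, unique across instances."""
    if instance_ordinal < 0:
        raise ValueError("instance_ordinal must be non-negative")
    return f"{app}-txn-{instance_ordinal}"


def transaction_timeout_ms(p99_batch_ms: int, jitter_ms: int, broker_max_ms: int = 900_000) -> int:
    """Budget = observed p99 batch duration plus jitter, capped by the broker limit."""
    return min(p99_batch_ms + jitter_ms, broker_max_ms)


tid = transactional_id("orders-enricher", 3)       # "orders-enricher-txn-3"
timeout = transaction_timeout_ms(20_000, 10_000)   # 30000
```

A random value generated at startup would be unique but not stable, so a restarted instance could not fence its previous incarnation or recover that incarnation's open transaction.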

5) Consumer-side requirement many teams miss

If downstream readers use read_uncommitted (the consumer default), they can observe records from aborted or still-open transactions, which nullifies your EOS intent.

For transactional topics used in correctness-sensitive paths:

set isolation.level=read_committed
verify downstream libraries/connectors also honor this

Remember: visibility semantics are part of the guarantee, not an optional detail.
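
The difference between the two isolation levels can be shown with a toy log model. This is a simplified sketch, not the consumer's real fetch logic: the tuple layout and the visible function are hypothetical, and transaction ids are assumed never to be reused.

```python
def visible(log, isolation_level):
    """Return the values a consumer would see under the given isolation level.

    Log entries are ("data", txn_id_or_None, value), ("commit", txn_id),
    or ("abort", txn_id).
    """
    data = [(i, e) for i, e in enumerate(log) if e[0] == "data"]
    if isolation_level == "read_uncommitted":
        return [e[2] for _, e in data]
    committed = {e[1] for e in log if e[0] == "commit"}
    aborted = {e[1] for e in log if e[0] == "abort"}
    finished = committed | aborted
    # Reads stop at the first record of any transaction that is still open.
    open_offsets = [i for i, e in data if e[1] is not None and e[1] not in finished]
    last_stable = min(open_offsets, default=len(log))
    return [
        e[2]
        for i, e in data
        if i < last_stable and (e[1] is None or e[1] in committed)
    ]


log = [
    ("data", "t1", "a"),
    ("data", "t2", "b"),
    ("abort", "t2"),
    ("commit", "t1"),
    ("data", None, "c"),   # non-transactional write
    ("data", "t3", "d"),   # t3 is still open
    ("data", None, "e"),   # sits behind the open transaction
]

uncommitted_view = visible(log, "read_uncommitted")  # ["a", "b", "c", "d", "e"]
committed_view = visible(log, "read_committed")      # ["a", "c"]
```

Note that "e" is hidden in the committed view even though it is not transactional: it becomes readable only once t3 commits or aborts, which is also why long transactions delay downstream consumers.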

6) Kafka Streams shortcut

For Kafka Streams applications (3.0+, running against brokers 2.5+), EOS is generally enabled via:

processing.guarantee=exactly_once_v2

This is usually safer than hand-rolling low-level producer transaction choreography, especially for topology-style pipelines.

7) Post-2.6 improvement: why EOS got more practical

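A major historical pain point was scaling transactional producers across consumer-group rebalances: the original design effectively required one transactional producer per input partition. KIP-447 (brokers 2.5+, Kafka Streams support from 2.6) lets a producer pass consumer group metadata to sendOffsetsToTransaction, so one producer per processing thread is enough. That made EOS much friendlier for high-partition stream workloads.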

Meaning: EOS is still not free, but it is much less of a “toy only” feature than in early versions.

8) Failure modes and how to preempt them

Zombie/fenced producer surprises
- Symptom: producer suddenly fails transactional calls after failover/restart race.
- Fix: strict ownership of each transactional.id; clean instance orchestration.
Long transactions causing lag/visibility pain
- Symptom: read_committed consumers stall, because they cannot read past the first record of a still-open transaction (the last stable offset).
- Fix: reduce transaction duration; commit more frequently with bounded batch size.
EOS assumed across DB side effects
- Symptom: “Kafka says exactly once” but DB rows duplicated/missing on retries.
- Fix: outbox + idempotent consumers/inbox, or explicit dedup keys.
Connector mismatch
- Symptom: source/sink connectors violate expected guarantee.
- Fix: verify connector-specific EOS support and exact config semantics.
Throughput collapse from over-transactionalization
- Symptom: high coordinator overhead, poor batching.
- Fix: apply EOS only to flows that need atomic semantics.
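
For the "EOS assumed across DB side effects" case, an inbox table is one common fix. Below is a minimal sketch using SQLite from the Python standard library as a stand-in database. The inbox and balances tables, the apply_once helper, and the message id format are hypothetical; the technique is that the dedup marker and the business effect commit in the same DB transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inbox (message_id TEXT PRIMARY KEY);
    CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER NOT NULL);
    INSERT INTO balances VALUES ('acct-1', 0);
""")


def apply_once(message_id: str, account: str, delta_cents: int) -> bool:
    """Apply the effect unless this message id was already processed."""
    try:
        with conn:  # dedup marker and business effect commit together
            conn.execute("INSERT INTO inbox VALUES (?)", (message_id,))
            conn.execute(
                "UPDATE balances SET cents = cents + ? WHERE account = ?",
                (delta_cents, account),
            )
        return True
    except sqlite3.IntegrityError:
        # Redelivery: the primary key rejected the marker, so nothing changed.
        return False


# Simulate the same event being delivered three times.
outcomes = [apply_once("evt-42", "acct-1", 250) for _ in range(3)]
balance = conn.execute(
    "SELECT cents FROM balances WHERE account = 'acct-1'"
).fetchone()[0]
```

After three deliveries of the same event, only the first one changes the balance. The message id must be derived from the event itself (for example a business key or topic-partition-offset), not generated by the consumer.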

9) Minimal production checklist

Producers explicitly configured for idempotence (and validated in effective config).
Transactional flows have deterministic, stable transactional.id mapping.
Consumers on transactional topics use read_committed where correctness matters.
Transaction duration SLO defined (p95/p99) and alerting in place.
Clear “EOS boundary” docs: what is guaranteed, what is not.
External side effects protected with outbox/inbox/idempotency keys.
Replay test proves no duplicate business effect under crash/restart/retry scenarios.

10) Operator metrics to watch

Producer transaction abort/commit rates
Transaction duration distribution
Consumer lag around transactional topics (especially burst periods)
Rebalance frequency (for stream apps)
Duplicate-business-event counters (downstream truth metric)

If you only monitor broker health and not duplicate business outcomes, you are flying blind.

11) Practical rollout strategy

Enable idempotence everywhere first (low-risk win).
Choose one high-value pipeline for transactions.
Add read_committed consumers + shadow validation.
Run chaos tests: broker failover, process restart, rebalance storms.
Promote gradually and keep explicit rollback path to at-least-once + dedup.

References

Apache Kafka KIP-98: Exactly Once Delivery and Transactional Messaging
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
Apache Kafka KIP-447: Producer scalability for exactly once semantics
https://cwiki.apache.org/confluence/display/KAFKA/KIP-447:+Producer+scalability+for+exactly+once+semantics
Apache Kafka producer configuration reference
https://kafka.apache.org/38/generated/producer_config.html
Kafka transactions deep dive (Strimzi)
https://strimzi.io/blog/2023/05/03/kafka-transactions/
Confluent delivery semantics overview
https://docs.confluent.io/kafka/design/delivery-semantics.html