Kafka Exactly-Once Semantics Playbook (Idempotent Producer, Transactions, and Real-World Limits)
Date: 2026-03-24
Category: knowledge / software
TL;DR
- Start with the idempotent producer for single-stream duplicate prevention (`enable.idempotence=true`).
- Use transactions only when you must atomically do consume-transform-produce (or multi-partition writes) and can pay the operational complexity.
- For consumers, EOS is incomplete unless they read with `isolation.level=read_committed`.
- Kafka EOS is not magic across external systems (DB, REST calls). For end-to-end correctness, combine it with outbox/inbox/idempotency keys.
1) What “exactly once” in Kafka really means
In Kafka, “exactly once” is a scoped guarantee:
- Producer idempotence: broker de-duplicates retried appends from the same producer session.
- Transactions: writes across multiple partitions (and offset commits) are committed atomically.
- Read committed: consumers configured for it do not see writes from aborted transactions.
It does not mean “exactly once across every side effect in your architecture.”
2) Decision ladder: which guarantee level to choose
| Situation | Recommended mode | Why |
|---|---|---|
| Fire-and-forget telemetry, occasional loss acceptable | at-most-once-ish | Lowest overhead |
| Most event pipelines, duplicates acceptable with consumer dedup | at-least-once + idempotent consumers | Simpler ops |
| Need duplicate-safe producer writes per partition | idempotent producer | Cheap reliability upgrade |
| Need atomic consume→produce or multi-topic atomic publish | transactions + read_committed | True Kafka-level EOS |
| Need DB + Kafka atomicity | transactional outbox (+ CDC/inbox) | Kafka transaction alone is insufficient |
3) Producer idempotence baseline (default-safe)
Use idempotence as the default baseline for production producers.
Typical safety constraints:
- `enable.idempotence=true`
- `acks=all`
- `retries > 0` (often effectively large, bounded by `delivery.timeout.ms`)
- `max.in.flight.requests.per.connection <= 5` (required for idempotent ordering guarantees)
Practical note: idempotence protects retries, but by itself it does not provide atomic cross-partition semantics.
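The constraints above map to a small producer config; a minimal sketch in properties form (the broker address is a placeholder). Since Kafka 3.0 the Java client defaults to `enable.idempotence=true` and `acks=all`, but setting the values explicitly keeps the effective config auditable.

```properties
# Idempotent producer baseline
bootstrap.servers=localhost:9092          # placeholder
enable.idempotence=true
acks=all
retries=2147483647                        # effectively unbounded; delivery.timeout.ms is the real cap
delivery.timeout.ms=120000
max.in.flight.requests.per.connection=5   # must stay <= 5 for idempotent ordering
```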
4) Transactional producer pattern (consume → process → produce)
Core lifecycle:
- `initTransactions()` once at startup
- per batch/window:
  - `beginTransaction()`
  - produce output records
  - `sendOffsetsToTransaction(...)`
  - `commitTransaction()`
- on failure: `abortTransaction()`
Operational rules
- `transactional.id` must be stable and unique per logical producer identity.
- Reusing the same `transactional.id` across concurrent live instances causes fencing behavior.
- Tune `transaction.timeout.ms` to match realistic batch duration + jitter, not wishful latency.
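The core lifecycle above can be sketched as a loop. To keep the control flow visible without a running broker, `TxProducer` below is a hypothetical in-memory stand-in whose method names mirror the Java client; it only records the call sequence and is not a real Kafka client.

```python
class TxProducer:
    """Hypothetical stand-in for a transactional Kafka producer.

    Real clients expose the same lifecycle methods (initTransactions,
    beginTransaction, sendOffsetsToTransaction, commit/abortTransaction);
    this mock only records the order in which they are called.
    """
    def __init__(self, transactional_id):
        self.transactional_id = transactional_id  # stable per logical producer identity
        self.calls = []

    def init_transactions(self):       self.calls.append("init")
    def begin_transaction(self):       self.calls.append("begin")
    def send(self, topic, value):      self.calls.append(f"send:{topic}")
    def send_offsets_to_transaction(self, offsets, group_id):
        self.calls.append("offsets")
    def commit_transaction(self):      self.calls.append("commit")
    def abort_transaction(self):       self.calls.append("abort")


def process_batch(producer, records, group_id):
    """Consume-transform-produce: output records and input offsets
    commit atomically, or the whole batch is aborted."""
    producer.begin_transaction()
    try:
        for rec in records:
            producer.send("output-topic", rec.upper())    # the "transform" step
        offsets = {("input-topic", 0): len(records)}      # toy offset map
        producer.send_offsets_to_transaction(offsets, group_id)
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()                      # on failure: abort, then retry/fail
        raise


producer = TxProducer("orders-etl-0")   # id is illustrative
producer.init_transactions()            # once at startup
process_batch(producer, ["a", "b"], group_id="orders-etl")
```

The key detail is that `send_offsets_to_transaction` puts the consumer's progress inside the same transaction as the output records, which is what makes the consume-produce pair atomic at the Kafka level.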
5) Consumer-side requirement many teams miss
If downstream readers use `read_uncommitted` (the default), they can observe aborted records and nullify your EOS intent.
For transactional topics used in correctness-sensitive paths:
- set `isolation.level=read_committed`
- verify that downstream libraries/connectors also honor this
Remember: visibility semantics are part of the guarantee, not an optional detail.
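The consumer side is a one-line setting; a minimal sketch in properties form (group id and broker address are placeholders, and disabling auto-commit is a common companion choice rather than part of the visibility guarantee itself):

```properties
bootstrap.servers=localhost:9092      # placeholder
group.id=orders-readmodel             # placeholder
isolation.level=read_committed        # hide records from aborted transactions
enable.auto.commit=false              # commit offsets only after processing succeeds
```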
6) Kafka Streams shortcut
For Kafka Streams applications, EOS is generally enabled via:
`processing.guarantee=exactly_once_v2`
This is usually safer than hand-rolling low-level producer transaction choreography, especially for topology-style pipelines.
7) Post-2.6 improvement: why EOS got more practical
A major historical pain point was scaling transactional producers with consumer-group rebalances. Improvements (notably KIP-447) reduced this mismatch and made EOS friendlier for high-partition stream workloads.
Meaning: EOS is still not free, but much less of a “toy only” feature than early versions.
8) Failure modes and how to preempt them
Zombie/fenced producer surprises
- Symptom: producer suddenly fails transactional calls after failover/restart race.
- Fix: strict ownership of each `transactional.id`; clean instance orchestration.
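One way to keep that ownership strict is to derive each `transactional.id` deterministically from a stable unit of work, such as the input partition, so the id follows the work rather than the process. The naming scheme below is illustrative, not a Kafka convention (and with KIP-447-era clients a per-instance id is often sufficient):

```python
def transactional_id(app: str, input_topic: str, partition: int) -> str:
    """Derive a stable transactional.id from the work unit.

    Whichever instance currently owns the partition owns the id; when a
    restarted instance reclaims it, the broker bumps the producer epoch
    and fences the old (zombie) producer.
    """
    return f"{app}-{input_topic}-{partition}"


# Same work unit -> same id across restarts and rebalances:
assert transactional_id("orders-etl", "orders", 3) == "orders-etl-orders-3"
# Different partitions -> distinct ids, so live instances never share one:
assert transactional_id("orders-etl", "orders", 3) != transactional_id("orders-etl", "orders", 4)
```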
Long transactions causing lag/visibility pain
- Symptom: downstream sees delayed progression due to transaction boundaries.
- Fix: reduce transaction duration; commit more frequently with bounded batch size.
EOS assumed across DB side effects
- Symptom: “Kafka says exactly once” but DB rows duplicated/missing on retries.
- Fix: outbox + idempotent consumers/inbox, or explicit dedup keys.
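The inbox/dedup-key half of that fix can be sketched as a processed-keys check that shares a DB transaction with the business effect. The table names and key scheme below are illustrative; SQLite stands in for whatever database holds the side effect:

```python
import sqlite3

def apply_once(conn, event_id, apply_effect):
    """Idempotent consumer: record event_id in the same DB transaction
    as the business effect, so a redelivered event becomes a no-op."""
    cur = conn.cursor()
    try:
        # The primary key on event_id makes the dedup check atomic with the insert.
        cur.execute("INSERT INTO processed_events(event_id) VALUES (?)", (event_id,))
    except sqlite3.IntegrityError:
        return False           # duplicate delivery: skip the side effect
    apply_effect(cur)          # business write shares the transaction
    conn.commit()              # marker + effect commit (or roll back) together
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events(event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances(total INTEGER)")
conn.execute("INSERT INTO balances VALUES (0)")
conn.commit()

credit = lambda cur: cur.execute("UPDATE balances SET total = total + 10")
apply_once(conn, "evt-1", credit)
apply_once(conn, "evt-1", credit)   # redelivery: no second credit
```

This is exactly the property the replay test in the checklist below should assert: re-running the same event id must not change the business outcome.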
Connector mismatch
- Symptom: source/sink connectors violate expected guarantee.
- Fix: verify connector-specific EOS support and exact config semantics.
Throughput collapse from over-transactionalization
- Symptom: high coordinator overhead, poor batching.
- Fix: apply EOS only to flows that need atomic semantics.
9) Minimal production checklist
- Producers explicitly configured for idempotence (and validated in effective config).
- Transactional flows have a deterministic, stable `transactional.id` mapping.
- Consumers on transactional topics use `read_committed` where correctness matters.
- Transaction duration SLO defined (p95/p99) and alerting in place.
- Clear “EOS boundary” docs: what is guaranteed, what is not.
- External side effects protected with outbox/inbox/idempotency keys.
- Replay test proves no duplicate business effect under crash/restart/retry scenarios.
10) Operator metrics to watch
- Producer transaction abort/commit rates
- Transaction duration distribution
- Consumer lag around transactional topics (especially burst periods)
- Rebalance frequency (for stream apps)
- Duplicate-business-event counters (downstream truth metric)
If you only monitor broker health and not duplicate business outcomes, you are flying blind.
11) Practical rollout strategy
- Enable idempotence everywhere first (low-risk win).
- Choose one high-value pipeline for transactions.
- Add `read_committed` consumers + shadow validation.
- Run chaos tests: broker failover, process restart, rebalance storms.
- Promote gradually and keep explicit rollback path to at-least-once + dedup.
References
- Apache Kafka KIP-98: Exactly Once Delivery and Transactional Messaging
  https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
- Apache Kafka KIP-447: Producer scalability for exactly once semantics
  https://cwiki.apache.org/confluence/display/KAFKA/KIP-447:+Producer+scalability+for+exactly+once+semantics
- Apache Kafka producer configuration reference
  https://kafka.apache.org/38/generated/producer_config.html
- Kafka transactions deep dive (Strimzi)
  https://strimzi.io/blog/2023/05/03/kafka-transactions/
- Confluent delivery semantics overview
  https://docs.confluent.io/kafka/design/delivery-semantics.html