Your understanding sounds correct. One follow-up: even the idempotent producer by itself gives you "exactly-once", because for many use cases what matters is not writing duplicates into a topic.
Thus, I would not say that you need transactions to get exactly-once (but I guess it depends on what your exact understanding of the term is). Transactions are for atomic multi-partition writes, which is more than "just" exactly-once.

Also note that for a "producer only" pattern, you need to check which data was written successfully in both cases, in order to resume correctly. The full power of transactions only comes with a consume-process-write pattern (as used in Kafka Streams), because it allows you to couple the read path and the write path, and thus it's easy to recover in a consistent way.

For the "producer only" pattern, there is still manual work required to figure out exactly where the producer left off. Neither the idempotent producer nor the transactional producer can detect if you call `send()` for the same data twice. Thus, on recovery, you need to figure out which pre-crash `send()` calls were successful and which was the first unsuccessful one (and only call `send()` again for the unsuccessful ones).

-Matthias

On 2/21/19 3:15 AM, Greenhorn Techie wrote:
> Thanks Matthias for the answers and the update to FAQ. I understand
> exactly-once semantics much better now.
>
> In summary, producer side idempotence can be used on its own using
> enable.idempotence parameter (which underneath uses PID and sequence
> number combo). However, if exactly-once semantics is needed, we need to
> use enable.idempotence along with transactional.id on the producer side.
> ’transactional.id’ acts more as a supersede and hence nullifies
> and handles all the (ill)effects of PID - scenarios like producer
> restarts, crashes etc.
>
> Please reply if my understanding is incorrect.
>
> Thanks
>
>
> On 20 February 2019 at 23:57:17, Matthias J. Sax (matth...@confluent.io) wrote:
>
>> Done. Feel free to extend/correct/complete etc.
>>
>> -Matthias
>>
>> On 2/20/19 9:56 AM, Guozhang Wang wrote:
>> > Since we've seen quite a lot of questions recently about EOS on the
>> > mailing list. I think it worth adding an FAQ entry here:
>> >
>> > https://cwiki.apache.org/confluence/display/KAFKA/FAQ
>> >
>> > So that we can refer future questions to the page than answering them
>> > repeatedly. @Matthias J Sax <matth...@confluent.io> : would you
>> > like to do it?
>> >
>> >
>> > Guozhang
>> >
>> > On Tue, Feb 19, 2019 at 3:12 PM Matthias J. Sax <matth...@confluent.io> wrote:
>> >
>> > Even if the question was sent 4 times to the mailing list, I am only
>> > answering is exactly-once (sorry for the bad joke -- could not
>> > resist...)
>> >
>> >
>> > You have to distinguish between "idempotent producer" and
>> > "transactional producer".
>> >
>> > If you enable idempotent writes (config `enable.idempotence`), your
>> > producer will get a cluster wide unique PID assigned. This PID, together
>> > with the sequence number, is used broker side to de-duplicate messages
>> > on write (in case the producer retries). Different producers can use the
>> > same sequence numbers, so PID are used to distinguish different
>> > producers and get unique PID-seqNum pairs.
>> >
>> > Idempotent writes, apply to single messages in isolation only. Consumer
>> > side, there is no change because no transactions are used
>> > (`isolation.level` config has no impact).
>> >
>> >
>> > If you want to write multiple message in an atomic manner (ie, write all
>> > 5 messages or none of them), you would need to use transactions. For
>> > this case you also assign a `transactional.id` producer side and should
>> > configure consumers with `read_committed` mode. The `transactional.id`
>> > is required, to abort in-flight transactions, in case a producer has an
>> > open transaction, crashes, and is restarted. (A PID is not sufficient,
>> > because it's lost on a crash). When there is an open transaction, and a
>> > producer crashes and is restarted, the broker will detect the open
>> > transaction (ie, same `transactional.id`) and abort it automatically.
>> >
>> > For compacted topics or multi-segment transactions are no special case.
>> > They work like regular transactions.
>> >
>> >
>> > -Matthias
>> >
>> >
>> > On 2/19/19 5:14 AM, Greenhorn Techie wrote:
>> > > Hi,
>> > >
>> > > Our data getting into Kafka is transactional in nature and hence I am
>> > > trying to understand EOS better. My present understanding is as below:
>> > >
>> > > It is mentioned that when producer starts, it will have a new PID, but only
>> > > valid till the session. Does that mean, is it a pre-requisite to have the
>> > > same / single producer session for exactly-once guarantees? I presume it is
>> > > not required. As per my understanding, this is where transactionl.id comes
>> > > into picture which is user defined and hence can survive producer restarts.
>> > >
>> > > I have few questions regarding the same:
>> > >
>> > > 1. If the above statement is correct, why do we need PID in the first place
>> > > and instead use transactionl.id all over?
>> > > 2. I understand that sequence number is something that is generated by
>> > > producer and increases monotonically. Does that mean, the sequence number
>> > > changes across producer restarts along with a new PID?
>> > > 3. Is PID meant mainly for idempotence where as transactional.id is for
>> > > transactional support?
>> > > 4. On the consumer side, only one config parameter is defined i.e.
>> > > isolation.level. For EOS, I presume this needs to be set to
>> > > ‘read_committed’ only. For EOS, it should never be set to ‘read_uncommitted’
>> > > 5. What is the impact of setting ‘enable.idempotence’ to true without
>> > > setting ‘transactional.id’ on the producer side? Does it have any
>> > > (side)effect?
>> > > 6. How does EOS work for compacted topics? Will the EOS behaviour be any
>> > > different for compacted topics?
>> > > 7. How does EOS work when transactions are written to two different log
>> > > segments?
>> > >
>> > > Can anyone please help me understand the nuances around EOS guarantees?
>> > >
>> > > Thanks
>> > >
>> >
>> >
>> >
>> > --
>> > -- Guozhang
>>
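
To make the PID/sequence-number de-duplication from the quoted answer more concrete, here is a toy model of a single partition log. It is only a sketch and not Kafka code: the class `DedupSketch` and its methods are hypothetical names, and it ignores details of the real broker such as sequence gaps and batching. It shows that a retry with the same PID and sequence number is dropped, that two producers may reuse the same sequence number, and that a restarted producer with a new PID is not de-duplicated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one partition log. NOT Kafka code; all names are hypothetical.
public class DedupSketch {
    private final List<String> log = new ArrayList<>();
    // Highest sequence number appended so far, per producer id (PID).
    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    // Returns true if the record was appended, false if it was dropped
    // because this (PID, sequence number) pair was already written.
    public boolean append(long pid, int seq, String value) {
        int last = lastSeqByPid.getOrDefault(pid, -1);
        if (seq <= last) {
            return false; // a retry of an earlier send
        }
        log.add(value);
        lastSeqByPid.put(pid, seq);
        return true;
    }

    public List<String> log() {
        return log;
    }

    public static void main(String[] args) {
        DedupSketch partition = new DedupSketch();
        partition.append(1L, 0, "a"); // first write
        partition.append(1L, 0, "a"); // retry with same PID and seq: dropped
        partition.append(2L, 0, "b"); // other producer, same seq: kept
        partition.append(3L, 0, "a"); // restarted producer, new PID: appended again
        System.out.println(partition.log()); // prints [a, b, a]
    }
}
```

The last call is the case discussed in this thread: once the PID is lost on a crash, the broker has no way to tell that the same data is being sent a second time.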
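
The manual recovery step for the "producer only" pattern described at the top of this reply could look roughly like the following. This is a sketch under assumptions that are not part of the thread: each record carries the index of its source item, the in-memory list `topic` stands in for reading the topic back after a crash, and `ResumeSketch`, `firstUnsentIndex` and `run` are hypothetical names. A real application would need its own way to find the last successfully written record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of resuming a producer after a crash. NOT Kafka code.
public class ResumeSketch {
    // Stand-in for a topic partition; each entry is "sourceIndex:payload".
    public static final List<String> topic = new ArrayList<>();

    // Inspect what was written to find the first source index still unsent.
    public static int firstUnsentIndex() {
        if (topic.isEmpty()) {
            return 0;
        }
        String last = topic.get(topic.size() - 1);
        return Integer.parseInt(last.substring(0, last.indexOf(':'))) + 1;
    }

    // Sends source records starting at the resume point. Stops early after
    // crashAfter successful sends to simulate a crash (negative = no crash).
    public static void run(List<String> source, int crashAfter) {
        int sent = 0;
        for (int i = firstUnsentIndex(); i < source.size(); i++) {
            if (sent == crashAfter) {
                return; // simulated crash
            }
            topic.add(i + ":" + source.get(i));
            sent++;
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("a", "b", "c", "d");
        run(source, 2);  // crashes after two successful writes
        run(source, -1); // restart: resumes at source index 2
        System.out.println(topic); // prints [0:a, 1:b, 2:c, 3:d]
    }
}
```

Without the lookup in `firstUnsentIndex`, the restarted run would start again at index 0 and write "a" and "b" twice, which neither the idempotent nor the transactional producer would detect.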