[ https://issues.apache.org/jira/browse/KAFKA-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruno Cadonna reopened KAFKA-18067:
-----------------------------------

We had to revert the fix for this bug (https://github.com/apache/kafka/pull/19078) because it introduced a blocking bug for AK 4.0. The fix prevented Kafka Streams from re-initializing its transactional producer under exactly-once semantics, which led to an infinite loop of {{ProducerFencedException}}s with corresponding rebalances. For example:

# A network partition happens that causes a transaction to time out.
# The transactional producer is fenced due to an invalid producer epoch.
# Kafka Streams closes the tasks dirty and re-joins the group, i.e., a rebalance is triggered.
# The transactional producer is NOT re-initialized and does NOT get a new producer epoch.
# Processing starts, but the transactional producer is immediately fenced while attempting to start a new transaction because of the invalid producer epoch.
# Step 3 is repeated.

> Kafka Streams can leak Producer client under EOS
> ------------------------------------------------
>
>                 Key: KAFKA-18067
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18067
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: TengYao Chi
>            Priority: Major
>              Labels: newbie, newbie++
>             Fix For: 4.0.0
>
>
> Under certain conditions Kafka Streams can end up closing a producer client twice and creating a new one that is then never closed.
>
> During a StreamThread's shutdown, the TaskManager is closed first, through which the thread's producer client is also closed. Later on we call #unsubscribe on the main consumer, which can result in the #onPartitionsLost callback being invoked and ultimately trying to reset/reinitialize the StreamsProducer if EOS is enabled. This in turn includes closing the current producer and creating a new one. And since the current producer was already closed, we end up closing that client twice and never closing the newly created producer.
>
> Ideally we would just skip the reset/reinitialize process entirely when invoked during shutdown. This solves the two problems here (the leaked client and the double close), while also removing the unnecessary overhead of creating an entirely new client just to throw it away.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
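
For readers less familiar with the EOS internals, below is a minimal, hypothetical sketch of what "re-initializing the transactional producer" in step 4 of the comment above amounts to. The class and method names are illustrative only and are not the actual StreamsProducer API; only the producer client calls ({{initTransactions()}}, {{beginTransaction()}}, {{close()}}) are real.

{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Hypothetical wrapper around a transactional producer; NOT the real
// StreamsProducer, just an illustration of the re-initialization step.
public class TransactionalProducerHolder {

    private final Properties config; // must contain a transactional.id
    private Producer<byte[], byte[]> producer;

    public TransactionalProducerHolder(final Properties config) {
        this.config = config;
        this.producer = createAndInitTransactions();
    }

    private Producer<byte[], byte[]> createAndInitTransactions() {
        final Producer<byte[], byte[]> newProducer =
            new KafkaProducer<>(config, new ByteArraySerializer(), new ByteArraySerializer());
        // initTransactions() registers the transactional.id with the broker and
        // obtains a fresh producer epoch, which is exactly the step that was
        // skipped in step 4 above, leaving the fenced epoch in place.
        newProducer.initTransactions();
        return newProducer;
    }

    // What should happen when tasks are closed dirty after a fencing error (step 3):
    // throw away the fenced producer and create a freshly initialized one.
    public void reinitialize() {
        producer.close();
        producer = createAndInitTransactions();
    }

    // Without reinitialize(), this call fails with a ProducerFencedException
    // on every attempt after the rebalance (step 5).
    public void startNewTransaction() {
        producer.beginTransaction();
    }
}
{code}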
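
The fix direction proposed in the quoted report (skip the reset/reinitialize entirely when it is triggered during shutdown) could, under the same assumptions, look roughly like the guard below, added to the hypothetical holder class above. Again, the field and method names are made up for illustration and do not reflect the actual Kafka Streams change.

{code:java}
    // Hypothetical guard, not the actual Kafka Streams fix: once the thread is
    // shutting down, the TaskManager has already closed the producer, so a
    // reset would close it a second time and leak the replacement client.
    private volatile boolean shuttingDown = false;

    public void shutdown() {
        shuttingDown = true;
        producer.close();
    }

    // Invoked from the onPartitionsLost path under EOS.
    public void resetOnPartitionsLost() {
        if (shuttingDown) {
            return; // producer already closed; do not create a client that nobody will close
        }
        producer.close();
        producer = createAndInitTransactions();
    }
{code}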