[ 
https://issues.apache.org/jira/browse/KAFKA-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Cadonna reopened KAFKA-18067:
-----------------------------------

We had to revert the fix for this bug 
(https://github.com/apache/kafka/pull/19078) because it introduced a blocking 
bug for AK 4.0. 
The fix prevented Kafka Streams from re-initializing its transactional 
producer under exactly-once semantics. That led to an infinite loop of 
{{ProducerFencedException}}s with corresponding rebalances.

For example:
# A network partition happens and causes a transaction to time out.
# The transactional producer is fenced due to an invalid producer epoch.
# Kafka Streams closes the tasks dirty and re-joins the group, i.e., a 
rebalance is triggered. 
# The transactional producer is NOT re-initialized and does NOT get a new 
producer epoch. 
# Processing starts, but the transactional producer is immediately fenced 
when it attempts to start a new transaction because of the invalid producer 
epoch.
# Step 3 is repeated, and the cycle continues indefinitely.
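The loop above can be sketched as a minimal, self-contained simulation. All class and method names here are illustrative, not the actual Kafka Streams internals; the point is only that a stale client-side epoch keeps throwing on every attempt until the producer is re-initialized:

```java
// Minimal illustration of the fencing loop: the broker bumps the producer
// epoch when a transaction times out, but the client keeps its stale epoch
// unless it re-initializes. All names are hypothetical, not Kafka internals.
public class FencingLoopSketch {

    static class ProducerFencedException extends RuntimeException {}

    static int brokerEpoch = 0;   // epoch the broker expects
    static int clientEpoch = 0;   // epoch the producer currently holds

    static void transactionTimesOut() {
        brokerEpoch++;            // broker fences the old epoch (steps 1-2)
    }

    static void beginTransaction() {
        if (clientEpoch != brokerEpoch) {
            throw new ProducerFencedException();  // step 5
        }
    }

    static void reInitializeProducer() {
        clientEpoch = brokerEpoch; // the re-init that the reverted fix skipped
    }

    // Returns how many attempts were fenced before giving up.
    static int runWithoutReInit(int maxAttempts) {
        transactionTimesOut();
        int fenced = 0;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                beginTransaction();   // processing starts (step 5)
                return fenced;        // only reachable with a fresh epoch
            } catch (ProducerFencedException e) {
                fenced++;             // tasks closed dirty, rebalance (step 3)
                // bug: producer is NOT re-initialized here (step 4)
            }
        }
        return fenced;                // fenced every time: the infinite loop
    }

    public static void main(String[] args) {
        int fenced = runWithoutReInit(5);
        System.out.println("fenced attempts without re-init: " + fenced); // 5
        reInitializeProducer();       // correct behavior: pick up a new epoch
        beginTransaction();           // now succeeds
        System.out.println("succeeds after re-initialization");
    }
}
```

Every iteration fences again because step 4 never runs; re-initializing once breaks the cycle immediately.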

> Kafka Streams can leak Producer client under EOS
> ------------------------------------------------
>
>                 Key: KAFKA-18067
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18067
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: TengYao Chi
>            Priority: Major
>              Labels: newbie, newbie++
>             Fix For: 4.0.0
>
>
> Under certain conditions Kafka Streams can end up closing a producer client 
> twice and creating a new one that is then never closed.
> During a StreamThread's shutdown, the TaskManager is closed first, through 
> which the thread's producer client is also closed. Later on we call 
> #unsubscribe on the main consumer, which can result in the #onPartitionsLost 
> callback being invoked and ultimately trying to reset/reinitialize the 
> StreamsProducer if EOS is enabled. This in turn includes closing the current 
> producer and creating a new one. And since the current producer was already 
> closed, we end up closing that client twice and never closing the newly 
> created producer.
> Ideally we would just skip the reset/reinitialize process entirely when 
> invoked during shutdown. This solves both problems (the leaked client and 
> the double close), while also removing the unnecessary overhead of creating 
> an entirely new client just to throw it away.
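The double close and leak described above, and the shutdown guard the description proposes, can be sketched as follows. Names like `SketchStreamsProducer` and `FakeProducer` are hypothetical simplifications, not the actual Kafka Streams classes:

```java
// Illustrates the leak: resetting after close closes the old producer twice
// and creates a replacement that nobody ever closes. A shutdown flag guards
// the reset path. All names are hypothetical, not Kafka Streams internals.
public class ProducerLeakSketch {

    static class FakeProducer {
        int closeCalls = 0;
        boolean closed = false;
        void close() { closeCalls++; closed = true; }
    }

    static class SketchStreamsProducer {
        FakeProducer current = new FakeProducer();
        boolean shuttingDown = false;

        // TaskManager close path: producer is closed here first.
        void closeForShutdown() {
            shuttingDown = true;
            current.close();
        }

        // Stands in for the reset triggered via the onPartitionsLost
        // callback under EOS (after #unsubscribe on the main consumer).
        void resetProducer(boolean guarded) {
            if (guarded && shuttingDown) {
                return;                   // proposed fix: skip during shutdown
            }
            current.close();              // double close if already shut down
            current = new FakeProducer(); // leaked: never closed afterwards
        }
    }

    public static void main(String[] args) {
        SketchStreamsProducer p = new SketchStreamsProducer();
        FakeProducer original = p.current;
        p.closeForShutdown();
        p.resetProducer(false);                  // unguarded: the bug
        System.out.println(original.closeCalls); // 2: closed twice
        System.out.println(p.current.closed);    // false: leaked client
    }
}
```

With the guard enabled, the reset is a no-op during shutdown: the original producer is closed exactly once and no throwaway client is created.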



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
