[ https://issues.apache.org/jira/browse/KAFKA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manikumar resolved KAFKA-9140.
------------------------------
      Assignee: Guozhang Wang
    Resolution: Fixed

> Consumer gets stuck rejoining the group indefinitely
> ----------------------------------------------------
>
>                 Key: KAFKA-9140
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 2.4.0
>            Reporter: Sophie Blee-Goldman
>            Assignee: Guozhang Wang
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: debug.tgz, info.tgz, kafka-data-logs-1.tgz,
>                      kafka-data-logs-2.tgz, server-start-stdout-stderr.log.tgz,
>                      streams.log.tgz
>
>
> There seems to be a race condition that can cause a rejoining member to get
> stuck initiating a rejoin indefinitely. The relevant client logs are attached
> (streams.log.tgz; all other attachments are broker logs), but basically the
> client repeats this message (and nothing else) continuously until it is
> killed or shut down:
>
> {code:java}
> [2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> {code}
>
> This log message was added as part of the bugfix
> ([PR 7460|https://github.com/apache/kafka/pull/7460]) for a related race
> condition: KAFKA-8104.
>
> This issue was uncovered by the Streams version probing upgrade test, which
> fails with varying frequency. Here are the failure rates for the system test
> runs so far:
>
> trunk (cooperative): 1/1 and 2/10 failures
> 2.4 (cooperative): 0/10 and 1/15 failures
> trunk (eager): 0/10 failures
>
> I've kicked off some high-repeat runs to complete overnight and hopefully
> shed more light.
>
> Note that I have also kicked off runs of both 2.4 and trunk with the PR for
> KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug
> that was fixed by [PR 7460|https://github.com/apache/kafka/pull/7460]. It is
> therefore unclear whether [PR 7460|https://github.com/apache/kafka/pull/7460]
> introduced a new race condition/bug, or merely uncovered an existing one that
> previously would have been masked by the KAFKA-8104 failure occurring first.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)