Sophie Blee-Goldman created KAFKA-9140:
------------------------------------------

             Summary: Consumer gets stuck rejoining the group indefinitely
                 Key: KAFKA-9140
                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
             Project: Kafka
          Issue Type: Bug
          Components: clients, consumer
    Affects Versions: 2.4.0
            Reporter: Sophie Blee-Goldman


There seems to be a race condition that can cause a rejoining member to get stuck initiating a rejoin indefinitely. The relevant logs are attached, but basically it repeats this message (and nothing else) continuously until killed/shutdown:

{code:java}
[2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
{code}

This message was added as part of the bugfix (PR #7460) for a related race condition, KAFKA-8104.

This issue was uncovered by the Streams version probing upgrade test, which fails intermittently. Here are the failure rates for the system test runs so far:

trunk (cooperative): 1/1 and 2/10 failures
2.4 (cooperative): 0/10 and 1/15 failures
trunk (eager): 0/10 failures

I've kicked off some high-repeat runs to complete overnight, which will hopefully shed more light.

Note that I have also kicked off runs of both 2.4 and trunk with the PR for KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug that was fixed by PR #7460. It is therefore unclear whether PR #7460 introduced a new race condition, or merely uncovered an existing one that previously would have been masked because the test failed first on KAFKA-8104.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)