Hello.

I got very helpful advice from Guozhang, and we now have a ready fix and a reproducer.
This PR fixes a very long-lived Kafka Consumer bug. Please join the review.

[1] https://issues.apache.org/jira/browse/KAFKA-8104
[2] https://github.com/apache/kafka/pull/7460

On Mon, 07/10/2019 at 21:37 +0300, Nikolay Izhikov wrote:
> Hello.
>
> We have the KAFKA-8104 "Consumer cannot rejoin to the group after rebalancing"
> issue [1].
> It reproduces in many production environments.
>
> I prepared a reproducer and a fix [2] for this issue.
> But I need assistance with a "fair" reproducer.
>
> Please help me with the review and the "fair" reproducer:
>
> The PR contains a fix for a race condition between the "consumer thread" and
> the "consumer coordinator heartbeat thread".
>
> Conditions for reproducing:
>
> 1. The consumer thread initiates a rejoin to the group because of a commit
> timeout. It calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to
> `sendJoinGroupRequest`.
> 2. `JoinGroupResponseHandler` writes the new generation data to
> `AbstractCoordinator.this.generation` and leaves the `synchronized` section.
> 3. The heartbeat thread executes `maybeLeaveGroup` and clears the generation
> data via `resetGenerationOnLeaveGroup`.
> 4. The consumer thread executes `onJoinComplete(generation.generationId,
> generation.memberId, generation.protocol, memberAssignment);` with the
> cleared generation data. This leads to the corresponding exception.
>
> The race is fixed with a condition in `maybeLeaveGroup`: if there is an
> ongoing rejoin process in the consumer thread, there is no reason to reset
> the generation data and send a `LeaveGroupRequest` from the heartbeat thread.
>
> This PR contains an "unfair" reproducer.
> It is implemented with a `CountDownLatch` that imitates the described race in
> the `AbstractCoordinator` code.
>
> [1] https://issues.apache.org/jira/browse/KAFKA-8104
> [2] https://github.com/apache/kafka/pull/7460
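For anyone who wants to see the interleaving from steps 1-4 concretely, here is a minimal sketch in Java. It is a simplified model, not the actual `AbstractCoordinator` code and not the code from the PR: the `Coordinator` class, the `rejoining` flag, the `guardEnabled` switch, and the method names `startRejoin`, `onJoinResponse`, and `completeJoin` are all hypothetical and exist only for illustration. Like the "unfair" reproducer, it uses `CountDownLatch` to force the heartbeat thread to run between the join response and the join completion.

```java
import java.util.concurrent.CountDownLatch;

public class RejoinRaceSketch {
    // Hypothetical stand-in for the coordinator's generation data.
    static final class Generation {
        static final Generation NO_GENERATION = new Generation(-1, "");
        final int generationId;
        final String memberId;

        Generation(int generationId, String memberId) {
            this.generationId = generationId;
            this.memberId = memberId;
        }
    }

    // Hypothetical, heavily simplified model of the coordinator state.
    static final class Coordinator {
        private Generation generation = Generation.NO_GENERATION;
        private boolean rejoining = false;
        private final boolean guardEnabled;

        Coordinator(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        // Step 1: the consumer thread starts a rejoin.
        synchronized void startRejoin() {
            rejoining = true;
        }

        // Step 2: the join response stores the new generation, then the lock is released.
        synchronized void onJoinResponse(int generationId, String memberId) {
            generation = new Generation(generationId, memberId);
        }

        // Step 3: the heartbeat thread tries to leave the group.
        synchronized void maybeLeaveGroup() {
            if (guardEnabled && rejoining) {
                return; // assumed shape of the fix: skip the reset while a rejoin is in progress
            }
            generation = Generation.NO_GENERATION;
        }

        // Step 4: the consumer thread completes the join with whatever generation is stored.
        synchronized int completeJoin() {
            rejoining = false;
            return generation.generationId;
        }
    }

    // Runs the forced interleaving and returns the generation id seen in step 4.
    static int run(boolean guardEnabled) throws InterruptedException {
        Coordinator coordinator = new Coordinator(guardEnabled);
        CountDownLatch joinResponseHandled = new CountDownLatch(1);
        CountDownLatch heartbeatDone = new CountDownLatch(1);

        Thread heartbeat = new Thread(() -> {
            try {
                joinResponseHandled.await();   // wait until step 2 has finished
                coordinator.maybeLeaveGroup(); // step 3
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                heartbeatDone.countDown();
            }
        });
        heartbeat.start();

        coordinator.startRejoin();                 // step 1
        coordinator.onJoinResponse(5, "member-1"); // step 2
        joinResponseHandled.countDown();
        heartbeatDone.await();                     // force step 3 before step 4
        int seen = coordinator.completeJoin();     // step 4
        heartbeat.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard: " + run(false)); // prints "without guard: -1"
        System.out.println("with guard: " + run(true));     // prints "with guard: 5"
    }
}
```

Without the guard, step 4 sees the cleared generation (id -1 in this model); with the guard, it sees the generation written in step 2. The latches make the ordering deterministic, which is exactly why such a reproducer is "unfair": it demonstrates the interleaving but does not show how often it occurs under real scheduling.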