[ https://issues.apache.org/jira/browse/KAFKA-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987279#comment-16987279 ]
ASF GitHub Bot commented on KAFKA-9184:
---------------------------------------

kkonstantine commented on pull request #7771: KAFKA-9184: Redundant task creation and periodic rebalances after zombie Connect worker rejoins the group
URL: https://github.com/apache/kafka/pull/7771

Zombie workers, defined as workers that lose connectivity with the Kafka broker coordinator and get kicked out of the group but don't experience a JVM restart, have been keeping their tasks running. This side effect is more disruptive with the new Incremental Cooperative rebalance protocol. When such workers return:

a) they join the group with their existing assignment, which leads to redundant tasks running in the Connect cluster, and
b) they interfere with the computation of lost tasks, which before this fix would prevent the scheduled rebalance delay from being reset correctly back to 0. This results in periodic rebalances.

This fix focuses on resolving the above side effects as follows:

* Each worker now tracks its connectivity with the broker coordinator in a non-blocking manner. This allows the worker to detect that the broker coordinator is unreachable. The timeout is set to be equal to the heartbeat interval. If the connection remains inactive for this long, the worker will proactively stop all its connectors and tasks and will keep attempting to connect to the coordinator.
* The incremental cooperative assignor will keep the delay at a positive value as long as it can detect lost tasks. If the set of tasks that are computed as lost becomes empty, the delay will be set to zero and no additional rebalancing will be scheduled.

Besides the test included in this PR, the improvements are being tested with a framework that deploys a Connect cluster on Docker images and introduces network partitions between all or selected workers and the Kafka brokers.
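The two behaviours described above can be illustrated with a minimal Java sketch. It is not taken from the PR: the class names `CoordinatorWatchdog` and `RebalanceDelay`, their methods, and the idea of passing the clock in as a parameter are all hypothetical, and the real assignor's delay handling is likely more involved than the single assignment shown here.

```java
import java.util.Set;

// Hypothetical sketch: names are illustrative, not Kafka Connect APIs.
final class CoordinatorWatchdog {
    private final long timeoutMs;        // assumed equal to the heartbeat interval
    private long lastContactMs;          // last time the coordinator was known to be reachable
    private boolean tasksRunning = true; // whether this worker still runs its connectors and tasks

    CoordinatorWatchdog(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastContactMs = nowMs;
    }

    // Record a successful exchange with the coordinator.
    void onCoordinatorContact(long nowMs) {
        lastContactMs = nowMs;
    }

    // Non-blocking check intended to be called from the worker's main loop.
    // Returns true once per outage, at the point where the worker should
    // proactively stop all of its connectors and tasks.
    boolean shouldStopTasks(long nowMs) {
        if (tasksRunning && nowMs - lastContactMs >= timeoutMs) {
            tasksRunning = false;
            return true;
        }
        return false;
    }

    // After a successful rejoin the worker starts from a clean slate.
    void onRejoin(long nowMs) {
        lastContactMs = nowMs;
        tasksRunning = true;
    }

    boolean tasksRunning() {
        return tasksRunning;
    }
}

// Hypothetical sketch of the assignor-side rule: the delay stays positive
// only while lost tasks are detected and drops to zero otherwise.
final class RebalanceDelay {
    private final long maxDelayMs; // assumed upper bound for the scheduled delay
    private long delayMs = 0L;

    RebalanceDelay(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    long update(Set<String> lostTasks) {
        delayMs = lostTasks.isEmpty() ? 0L : maxDelayMs;
        return delayMs;
    }

    long delayMs() {
        return delayMs;
    }
}
```

Passing the current time in explicitly keeps the check non-blocking and easy to test; a worker loop would call `shouldStopTasks(System.currentTimeMillis())` on each iteration and keep retrying the coordinator connection regardless of the result.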
### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Redundant task creation and periodic rebalances after zombie worker rejoins the group
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9184
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9184
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.4.0, 2.3.2
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Blocker
>             Fix For: 2.4.0, 2.3.2
>
>
> First reported here:
> https://stackoverflow.com/questions/58631092/kafka-connect-assigns-same-task-to-multiple-workers
> There seems to be an issue with task reassignment when a worker rejoins after
> an unsuccessful join request. The worker seems to be outside the group for a
> generation, but when it joins again the same task is running on more than one
> worker.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)