[ 
https://issues.apache.org/jira/browse/KAFKA-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987279#comment-16987279
 ] 

ASF GitHub Bot commented on KAFKA-9184:
---------------------------------------

kkonstantine commented on pull request #7771: KAFKA-9184: Redundant task 
creation and periodic rebalances after zombie Connect worker rejoins the group
URL: https://github.com/apache/kafka/pull/7771
 
 
   Zombie workers, defined as workers that lose connectivity with the Kafka 
broker coordinator and get kicked out of the group but don't experience a JVM 
restart, have been keeping their tasks running. This side effect is more 
disruptive with the new Incremental Cooperative rebalance protocol. When such 
workers return: 
   a) they join the group with their existing assignment, which leads to 
redundant tasks running in the Connect cluster, and
   b) they interfere with the computation of lost tasks, which before this fix 
prevented the scheduled rebalance delay from being reset correctly back to 
0. This results in periodic rebalances. 
   
   This fix focuses on resolving the above side effects as follows: 
   * Each worker now tracks its connectivity with the broker coordinator in a 
non-blocking manner. This allows the worker to detect that the broker 
coordinator is unreachable. The timeout is set to be equal to the heartbeat 
interval. If the connection remains inactive for that long, the worker will 
proactively stop all its connectors and tasks and will keep attempting to 
connect to the coordinator. 
   * The incremental cooperative assignor will keep the delay at a positive 
value as long as it can detect lost tasks. If the set of tasks that are 
computed as lost becomes empty, the delay will be set to zero and no additional 
rebalancing will be scheduled. 
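
   The two rules above can be sketched roughly as follows. This is only an 
illustration, not the code in this PR: `ZombieWorkerSketch`, 
`CoordinatorTracker`, `shouldStopWork`, and `nextRebalanceDelayMs` are 
hypothetical names, and the choice to restart the delay at a configured maximum 
when lost tasks are first seen is an assumption. 

```java
import java.util.Set;

/** Hypothetical sketch; these are NOT the actual Kafka Connect classes. */
public class ZombieWorkerSketch {

    /** Tracks how long the broker coordinator has been unreachable. */
    static final class CoordinatorTracker {
        // Timeout is taken to be equal to the heartbeat interval, as described above.
        private final long timeoutMs;
        // -1 means the coordinator is currently considered reachable.
        private long unreachableSinceMs = -1L;

        CoordinatorTracker(long heartbeatIntervalMs) {
            this.timeoutMs = heartbeatIntervalMs;
        }

        /**
         * Non-blocking check, meant to be called on each poll-loop iteration.
         * Returns true once the coordinator has been unreachable for the full
         * timeout, i.e. when the worker should stop its connectors and tasks.
         */
        boolean shouldStopWork(boolean coordinatorReachable, long nowMs) {
            if (coordinatorReachable) {
                unreachableSinceMs = -1L; // connection is healthy, reset the clock
                return false;
            }
            if (unreachableSinceMs < 0) {
                unreachableSinceMs = nowMs; // first time we notice the outage
            }
            return nowMs - unreachableSinceMs >= timeoutMs;
        }
    }

    /**
     * Assignor-side rule: the delay stays positive while lost tasks are
     * detected and drops to zero as soon as the lost set is empty.
     * Restarting at maxDelayMs is an assumption made for this sketch.
     */
    static long nextRebalanceDelayMs(Set<String> lostTasks,
                                     long remainingDelayMs,
                                     long maxDelayMs) {
        if (lostTasks.isEmpty()) {
            return 0L; // nothing lost: no further rebalance is scheduled
        }
        return remainingDelayMs > 0 ? remainingDelayMs : maxDelayMs;
    }
}
```

   In this sketch the worker keeps calling `shouldStopWork` while it retries 
the connection, so a later successful reconnect resets the tracker without a 
JVM restart. 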
   
   Besides the test included in this PR, the improvements are being tested with 
a framework that deploys a Connect cluster on Docker images and introduces 
network partitions between all or selected workers and the Kafka brokers. 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Redundant task creation and periodic rebalances after zombie worker rejoins 
> the group
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9184
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9184
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.4.0, 2.3.2
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Blocker
>             Fix For: 2.4.0, 2.3.2
>
>
> First reported here: 
> https://stackoverflow.com/questions/58631092/kafka-connect-assigns-same-task-to-multiple-workers
> There seems to be an issue with task reassignment when a worker rejoins after 
> an unsuccessful join request. The worker seems to be outside the group for a 
> generation, but when it joins again the same task is running in more than one 
> worker.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
