Konstantine Karantasis created KAFKA-9849:
---------------------------------------------

             Summary: Fix issue with worker.unsync.backoff.ms creating zombie 
workers when incremental cooperative rebalancing is used
                 Key: KAFKA-9849
                 URL: https://issues.apache.org/jira/browse/KAFKA-9849
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
    Affects Versions: 2.4.1, 2.3.1, 2.5.0
            Reporter: Konstantine Karantasis
            Assignee: Konstantine Karantasis


{{worker.unsync.backoff.ms}} is a property that was introduced a while ago when 
eager (stop-the-world) rebalancing was the only option for Connect workers. The 
goal of this property is to avoid triggering consecutive rebalances when a 
worker fails to catch up with the config topic in time and therefore 
voluntarily leaves the group with a {{LeaveGroupRequest}}.

With incremental cooperative rebalancing, this backoff 
({{worker.unsync.backoff.ms}}), whose default value equals the default value of 
{{scheduled.rebalance.max.delay.ms}} (5 min), might end up turning a worker 
into a zombie worker that retains its tasks but stays out of the group. By 
backing off from rebalancing, this worker leaves the leader of the group no 
option but to reassign its tasks, which are considered lost, to other members 
of the group if the worker does not return before 
{{scheduled.rebalance.max.delay.ms}} expires.
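
For illustration only, the overlap described above can be seen in a worker 
configuration fragment like the following. The two timeout values are the 
defaults mentioned above (5 minutes each); the {{connect.protocol}} line is an 
assumption about how incremental cooperative rebalancing is enabled on the 
affected versions, not something stated in this report.

{noformat}
# Assumed setting: enables incremental cooperative rebalancing
connect.protocol=compatible

# How long an out-of-sync worker backs off before rejoining (default: 5 min)
worker.unsync.backoff.ms=300000

# How long the leader waits before reassigning tasks it considers lost
# (default: 5 min)
scheduled.rebalance.max.delay.ms=300000
{noformat}

Because both timers default to the same value, a worker that backs off for the 
full period may rejoin only after the leader has already reassigned its tasks.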

Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing 
storms in the presence of intermittent connectivity issues with eager 
rebalancing. However, when incremental cooperative rebalancing is used, this 
property might inadvertently make workers operate as zombie workers that keep 
running tasks while they are out of the group.

Of course, a good tradeoff needs to be struck: the protocol should not become 
too eager again, but workers should also not turn into zombies when the 
connection to the broker coordinator is lost only briefly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
