Jason Gustafson created KAFKA-9232:
--------------------------------------

             Summary: Coordinator heartbeat completion does not work for JoinGroup v3
                 Key: KAFKA-9232
                 URL: https://issues.apache.org/jira/browse/KAFKA-9232
             Project: Kafka
          Issue Type: Bug
            Reporter: Jason Gustafson
            Assignee: Sophie Blee-Goldman


For older versions of the JoinGroup API, the coordinator implements a static 
5-minute timeout for new members. This timeout is implemented using the 
heartbeat purgatory, and we expect the delayed operation to be force 
completed if the member successfully joins. This is implemented in 
GroupCoordinator with the following logic:

{code:scala}
            group.maybeInvokeJoinCallback(member, joinResult)
            completeAndScheduleNextHeartbeatExpiration(group, member)
            member.isNew = false
{code}

However, heartbeat completion depends on this check:

{code:scala}
  def shouldKeepAlive(deadlineMs: Long): Boolean = {
    if (isAwaitingJoin)
      !isNew || latestHeartbeat + GroupCoordinator.NewMemberJoinTimeoutMs > deadlineMs
    else awaitingSyncCallback != null ||
      latestHeartbeat + sessionTimeoutMs > deadlineMs
  }
{code}
Since we invoke the join callback first, the member is no longer awaiting 
join by the time we attempt heartbeat completion, so we fall through to the 
second branch. This only returns true when the latest heartbeat plus the 
session timeout exceeds the deadline. The deadline in this case depends only 
on the statically configured new member timeout, which means the heartbeat 
cannot complete until about 5 minutes have passed. If the member falls out of 
the group before then, the heartbeat ultimately expires, which may trigger a 
spurious rebalance.
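
The timing argument above can be sketched with a small standalone model. This is only an illustration under stated assumptions, not the actual Kafka classes: `HeartbeatSketch`, `Member`, and `canComplete` are hypothetical names, Boolean flags stand in for the join and sync callbacks, and the sketch assumes that the delayed heartbeat completes only when `shouldKeepAlive` returns true, that the deadline is the time the member was added plus the new member timeout, and that the latest heartbeat is refreshed to the current time just before the check.

```scala
// Simplified, hypothetical model of the check above; not the actual Kafka classes.
object HeartbeatSketch {
  // 5 minutes, matching the static new-member timeout described in the issue.
  val NewMemberJoinTimeoutMs: Long = 5 * 60 * 1000L

  final case class Member(
      isNew: Boolean,
      isAwaitingJoin: Boolean, // stands in for "join callback not yet invoked"
      awaitingSync: Boolean,   // stands in for awaitingSyncCallback != null
      latestHeartbeat: Long,
      sessionTimeoutMs: Long) {

    // Same shape as shouldKeepAlive in the issue, using the flags above.
    def shouldKeepAlive(deadlineMs: Long): Boolean =
      if (isAwaitingJoin)
        !isNew || latestHeartbeat + NewMemberJoinTimeoutMs > deadlineMs
      else
        awaitingSync || latestHeartbeat + sessionTimeoutMs > deadlineMs
  }

  // Assumptions: completion requires shouldKeepAlive to be true, the deadline is
  // addedAtMs + NewMemberJoinTimeoutMs, and latestHeartbeat equals nowMs.
  def canComplete(addedAtMs: Long, nowMs: Long, sessionTimeoutMs: Long,
                  callbackAlreadyInvoked: Boolean): Boolean = {
    val member = Member(
      isNew = true, // isNew is only cleared after the completion attempt
      isAwaitingJoin = !callbackAlreadyInvoked,
      awaitingSync = false,
      latestHeartbeat = nowMs,
      sessionTimeoutMs = sessionTimeoutMs)
    member.shouldKeepAlive(addedAtMs + NewMemberJoinTimeoutMs)
  }
}
```

Under these assumptions, with a 10-second session timeout and a member added at time 0, `HeartbeatSketch.canComplete(0L, 1000L, 10000L, callbackAlreadyInvoked = true)` is false, and the result only becomes true once `nowMs` exceeds 290000, which is roughly the 5-minute delay described above. For contrast, the same call with `callbackAlreadyInvoked = false` takes the first branch and returns true.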

Newer versions of the protocol are not affected by this bug because we return 
immediately the first time a member joins the group.






