[ https://issues.apache.org/jira/browse/KAFKA-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601949#comment-17601949 ]
Guozhang Wang commented on KAFKA-13766:
---------------------------------------

Inside onCompleteJoin, the block starting with

{{// trigger the awaiting join group response callback for all the members after rebalancing}}

indicates that once we are in the completing rebalance phase, we’ve re-enabled the heartbeat with the session timeout. I.e., in that phase we effectively have two timers: {{completeAndScheduleNextHeartbeatExpiration(group, member)}} and {{schedulePendingSync(group)}}. Whichever triggers first, we fail the member and re-trigger the rebalance. And since session.timeout is in general smaller than the rebalance timeout, we would hit the former if there’s a delay on assignment.

> Use `max.poll.interval.ms` as the timeout during complete-rebalance phase
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-13766
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13766
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, group-coordinator
>            Reporter: Guozhang Wang
>            Assignee: David Jacot
>            Priority: Major
>              Labels: new-rebalance-should-fix
>
> The lifetime of a consumer can be categorized in three phases:
> 1) During normal processing, the broker expects a heartbeat request periodically
> from the consumer, and that is timed by the `session.timeout.ms`.
> 2) During the prepare_rebalance, the broker expects a join-group request
> to be received within the rebalance.timeout, which is piggy-backed as the
> `max.poll.interval.ms`.
> 3) During the complete_rebalance, the broker expects a sync-group
> request to be received, again within the `session.timeout.ms`.
> So during different phases of the life of the consumer, a different timeout
> is used to bound the timer.
> Nowadays with the cooperative rebalance protocol, we can still return records and
> process them in the middle of a rebalance from {{consumer.poll}}. In that
> case, for phase 3) we should also use the `max.poll.interval.ms` to bound the
> timer, which is in practice larger than `session.timeout.ms`.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
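
To make the timer interaction from the comment concrete, here is a minimal sketch of how the two timers bound the wait for the sync-group request. It is not coordinator code: the class name, method names, and the timeout values in `main` are hypothetical and chosen only for illustration, and the "proposed" bound reflects one reading of the issue description (use the rebalance timeout alone in phase 3).

```java
// Hypothetical illustration only; none of these names exist in Kafka.
public class CompleteRebalanceTimers {

    /** The two timers the comment says are armed during complete-rebalance. */
    public enum Timer { HEARTBEAT_EXPIRATION, PENDING_SYNC }

    /**
     * Which timer fires first when both are armed at the same instant.
     * Heartbeat expiration is bounded by the session timeout, the pending
     * sync by the rebalance timeout. On a tie this sketch arbitrarily
     * reports the heartbeat timer.
     */
    public static Timer firstToFire(long sessionTimeoutMs, long rebalanceTimeoutMs) {
        return sessionTimeoutMs <= rebalanceTimeoutMs
                ? Timer.HEARTBEAT_EXPIRATION
                : Timer.PENDING_SYNC;
    }

    /** Effective bound today: whichever of the two timers expires first. */
    public static long currentSyncBoundMs(long sessionTimeoutMs, long rebalanceTimeoutMs) {
        return Math.min(sessionTimeoutMs, rebalanceTimeoutMs);
    }

    /** Bound suggested by the issue: the rebalance timeout (max.poll.interval.ms). */
    public static long proposedSyncBoundMs(long sessionTimeoutMs, long rebalanceTimeoutMs) {
        return rebalanceTimeoutMs;
    }

    public static void main(String[] args) {
        // Illustrative values, not taken from the issue.
        long sessionTimeoutMs = 45_000L;
        long rebalanceTimeoutMs = 300_000L;

        System.out.println("first to fire: "
                + firstToFire(sessionTimeoutMs, rebalanceTimeoutMs));
        System.out.println("current bound (ms): "
                + currentSyncBoundMs(sessionTimeoutMs, rebalanceTimeoutMs));
        System.out.println("proposed bound (ms): "
                + proposedSyncBoundMs(sessionTimeoutMs, rebalanceTimeoutMs));
    }
}
```

With a session timeout smaller than the rebalance timeout, the heartbeat expiration is the timer that fails the member, which matches the behaviour the comment describes when assignment is delayed.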