[ https://issues.apache.org/jira/browse/KAFKA-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078541#comment-16078541 ]

Guozhang Wang commented on KAFKA-4237:
--------------------------------------

I think option 2 is not actually safe and we need to think a bit more, since we cannot guarantee that the client will always resend the JoinGroup request with the same fields (subscription, protocol, etc.). However, when we end the prepare rebalance phase we need all of this information to send to the leader, and if a later JoinGroup changed such information, we need to either stop the current rebalance or ignore it, right?

> Avoid long request timeout for the consumer
> -------------------------------------------
>
>                 Key: KAFKA-4237
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4237
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer
>            Reporter: Jason Gustafson
>
> In the consumer rebalance protocol, the JoinGroup can stay in purgatory on
> the server for as long as the rebalance timeout. For the Java client, that
> means that the request timeout must be at least as large as the rebalance
> timeout (which is governed by {{max.poll.interval.ms}} since KIP-62 and
> {{session.timeout.ms}} before then). By default, since 0.10.1, this is 5
> minutes plus some change, which makes the clients slow to detect some hard
> failures.
> To fix this, two options come to mind:
> 1. Right now, all request APIs are limited by the same request timeout in
> {{NetworkClient}}, but there's not really any reason why this must be so. We
> could use a separate timeout for the JoinGroup request (the implementation
> of this is straightforward:
> https://github.com/confluentinc/kafka/pull/108/files).
> 2. Alternatively, we could prevent the server from holding the JoinGroup in
> purgatory for such a long time. Instead, it could return early from the
> JoinGroup (say before the session timeout has expired) with an error code
> (e.g. REBALANCE_IN_PROGRESS), which tells the client that it should just
> resend the JoinGroup.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
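
The concern raised in the comment above can be made concrete with a small sketch. It models, under stated assumptions, a coordinator in the prepare rebalance phase that receives a re-sent JoinGroup from a member it has already recorded. Every name here (JoinGroupResendSketch, JoinMetadata, Policy, Outcome, onJoinGroup, metadataForLeader) is hypothetical and chosen for illustration; none of it is Kafka's actual coordinator code, and the "restart" behavior of discarding all collected metadata is an assumption rather than something the ticket specifies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model of a coordinator in the prepare rebalance phase (not Kafka code). */
final class JoinGroupResendSketch {

    /** Simplified stand-in for the fields a member supplies in a JoinGroup request. */
    public static final class JoinMetadata {
        final String protocol;
        final List<String> subscription;

        public JoinMetadata(String protocol, List<String> subscription) {
            this.protocol = protocol;
            this.subscription = Collections.unmodifiableList(new ArrayList<>(subscription));
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof JoinMetadata)) {
                return false;
            }
            JoinMetadata other = (JoinMetadata) o;
            return protocol.equals(other.protocol) && subscription.equals(other.subscription);
        }

        @Override
        public int hashCode() {
            return Objects.hash(protocol, subscription);
        }
    }

    /** The two choices named in the comment: stop the current rebalance, or ignore the change. */
    public enum Policy { RESTART_ON_CHANGE, IGNORE_CHANGE }

    public enum Outcome { RECORDED, UNCHANGED, CHANGED_RESTART, CHANGED_IGNORED }

    private final Map<String, JoinMetadata> pending = new HashMap<>();
    private final Policy policy;

    public JoinGroupResendSketch(Policy policy) {
        this.policy = policy;
    }

    /** Handles a first or re-sent JoinGroup from the given member. */
    public Outcome onJoinGroup(String memberId, JoinMetadata metadata) {
        JoinMetadata previous = pending.get(memberId);
        if (previous == null) {
            pending.put(memberId, metadata);
            return Outcome.RECORDED;
        }
        if (previous.equals(metadata)) {
            return Outcome.UNCHANGED; // a resend with identical fields is harmless
        }
        if (policy == Policy.RESTART_ON_CHANGE) {
            // Assumption: restarting discards everything collected so far,
            // so the other members would have to join again.
            pending.clear();
            pending.put(memberId, metadata);
            return Outcome.CHANGED_RESTART;
        }
        // IGNORE_CHANGE: keep the first metadata; the leader will not see the new fields.
        return Outcome.CHANGED_IGNORED;
    }

    /** Snapshot of what would be sent to the leader when the phase ends. */
    public Map<String, JoinMetadata> metadataForLeader() {
        return Collections.unmodifiableMap(new HashMap<>(pending));
    }
}
```

Neither policy is free in this model: restarting throws away the progress of the rebalance, while ignoring the change hands the leader metadata that no longer matches what the member last sent.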