[ 
https://issues.apache.org/jira/browse/KAFKA-4237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078541#comment-16078541
 ] 

Guozhang Wang commented on KAFKA-4237:
--------------------------------------

I think option 2 is not actually safe, and we need to think about it a bit 
more: we cannot guarantee that the client will always resend the JoinGroup 
request with the same fields (subscription, protocol, etc.). When we end the 
prepare-rebalance phase, we need all of this information to send to the 
leader, so if a later JoinGroup changes it, we need to either stop the 
current rebalance or ignore the new request, right?
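
To make the two choices concrete, here is a rough sketch of the decision the 
coordinator would face when a member resends JoinGroup during the prepare 
phase. This is not the actual GroupCoordinator code: the class names, the 
return strings, and the "restart" behavior (dropping what was collected so 
far so that the other members rejoin) are assumptions made only for 
illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JoinGroupRequest:
    # Hypothetical, simplified view of the fields a client sends in JoinGroup.
    member_id: str
    protocol_type: str
    protocols: tuple  # ((protocol name, subscription metadata bytes), ...)


@dataclass
class PrepareRebalanceState:
    # Hypothetical model of what the coordinator holds while preparing a rebalance.
    joined: dict = field(default_factory=dict)  # member_id -> recorded request
    restarts: int = 0

    def on_join(self, request, on_change="restart"):
        previous = self.joined.get(request.member_id)
        if previous is None:
            # First JoinGroup from this member in the current rebalance.
            self.joined[request.member_id] = request
            return "ACCEPTED"
        if previous == request:
            # Resend with identical fields: nothing to reconcile.
            return "UNCHANGED"
        if on_change == "ignore":
            # Keep the fields recorded first; the leader never sees the new ones.
            return "IGNORED"
        # Assumed "stop the current rebalance" policy: discard what was
        # collected and start over from the changed request.
        self.joined = {request.member_id: request}
        self.restarts += 1
        return "RESTARTED"
```

With "ignore", the leader may assign partitions based on a subscription the 
member no longer wants; with "restart", a member that keeps changing its 
fields could keep the group from ever completing the rebalance.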

> Avoid long request timeout for the consumer
> -------------------------------------------
>
>                 Key: KAFKA-4237
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4237
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer
>            Reporter: Jason Gustafson
>
> In the consumer rebalance protocol, the JoinGroup can stay in purgatory on 
> the server for as long as the rebalance timeout. For the Java client, that 
> means that the request timeout must be at least as large as the rebalance 
> timeout (which is governed by {{max.poll.interval.ms}} since KIP-62 and 
> {{session.timeout.ms}} before then). By default, since 0.10.1, this is 5 
> minutes plus some change, which makes the clients slow to detect some hard 
> failures.
> To fix this, two options come to mind:
> 1. Right now, all request APIs are limited by the same request timeout in 
> {{NetworkClient}}, but there's not really any reason why this must be so. We 
> could use a separate timeout for the JoinGroup request (the implementation 
> of this is straightforward: 
> https://github.com/confluentinc/kafka/pull/108/files).
> 2. Alternatively, we could prevent the server from holding the JoinGroup in 
> purgatory for such a long time. Instead, it could return early from the 
> JoinGroup (say before the session timeout has expired) with an error code 
> (e.g. REBALANCE_IN_PROGRESS), which tells the client that it should just 
> resend the JoinGroup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
