Hi Guozhang,

Thanks for the clarification. Regarding point 2, I think the key issue
is not whether users can control the maximum time to wait from their
code, but whether the network client can detect that a connection
attempt cannot complete and then try a healthy node. In the producer's
Sender, even though selector.poll can time out, the next iteration still
does not close the pending connection and try another node.

In our test environment, QA shut down one of the leader nodes. The
producer's requests to that node time out, so the client closes the
node's connection and then requests metadata. But sometimes the node
chosen for the metadata request is the same node that was shut down.
While the client is connecting to that node to fetch the metadata, the
network client marks the node's state as CONNECTING, but because the
node is down, the socket cannot tell that the connection attempt is
dead. Although selector.poll takes a timeout parameter, it does not
close the connection, so on the next call to "networkclient.maybeUpdate"
the client finds that isAnyNodeConnecting is true and does not connect
to any healthy node to get the metadata. It takes several minutes for
the client to notice that the connection attempt has timed out and to
try another node.

So I want to add a connect.timeout parameter, so that the selector can
detect that a connection attempt has timed out and close the connection.
The timeout value currently passed to `selector.poll()` does not seem to
be able to do this.

Thanks,
David

------------------ Original Message ------------------
From: "Guozhang Wang";<wangg...@gmail.com>;
Sent: May 16, 2017, 1:51
To: "dev@kafka.apache.org"<dev@kafka.apache.org>;
Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client

Hi David,

I may have been a bit confused before, so just clarifying a few things:

1. As you mentioned, a client always first tries to establish a
connection with a broker node before it sends any request to it. After
the connection is established, it will either continuously send many
requests (e.g. produce) or just a single request (e.g.
metadata) to the broker, so these two phases are indeed different.

2. In the connected phase, connections.max.idle.ms is used to
automatically disconnect the socket if no requests have been sent or
received during that period of time. In the connecting phase, we always
try to create the socket via "socketChannel.connect" in a non-blocking
call and then check whether the connection has been established. All the
callers of this function (in either the producer or the consumer) have a
timeout parameter, as in `selector.poll()`, and that timeout is set
either by calculations based on metadata.expiration.time and the backoff
for producer#sender, or by the value passed directly from
consumer#poll(timeout). So although there is no config that controls it
directly, users can still control the maximum time to wait from their
code.

I originally thought your scenario was about the connected phase, but
now I think you are talking about the connecting phase. For that case, I
still feel that the timeout value currently passed to `selector.poll()`,
which is controllable from user code, should be sufficient?

Guozhang

On Sun, May 14, 2017 at 2:37 AM, <254479...@qq.com> wrote:

> Hi Guozhang,
>
> Sorry for the delay, and thanks for the question. These look like two
> different parameters to me:
> connect.timeout.ms: applies only to the connecting phase; once the
> connection is established, this parameter is no longer used.
> connections.max.idle.ms: currently does not apply in the connecting
> phase (a connection is added to the expiry manager only when select
> returns readyKeys > 0); after the connection is established, it is
> used to check whether the connection is still alive within some period
> of time.
>
> Even if we changed connections.max.idle.ms to also cover the
> connecting phase, we could not set this parameter to a small value
> such as 5 seconds. Because the client may be busy sending messages to
> other nodes, it would be disconnected after 5 seconds, which is why
> the default value of connections.max.idle.ms is set to a larger time.
> We should have two parameters to control the connecting-phase
> behavior and the connected-phase behavior, don't you think?
>
> Thanks,
>
> David
>
> ------------------ Original Message ------------------
> From: "Guozhang Wang";<wangg...@gmail.com>;
> Sent: May 6, 2017, 7:52
> To: "dev@kafka.apache.org"<dev@kafka.apache.org>;
> Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
> Hello David,
>
> Thanks for the KIP. For the described issue, I'm wondering if it can
> be resolved by tuning the CONNECTIONS_MAX_IDLE_MS_CONFIG
> (connections.max.idle.ms) on the client side? Default is 9 minutes.
>
> Guozhang
>
> On Tue, May 2, 2017 at 8:22 AM, <254479...@qq.com> wrote:
>
> > Hi all,
> >
> > Currently in our test environment, we found that after one of the
> > broker nodes crashes (reboot or OS crash), the client may still be
> > connecting to the crashed node to send a metadata request or another
> > request, and it takes several minutes to notice that the connection
> > has timed out and then try another node. As a result, the client may
> > still be unaware of the metadata change after several minutes.
> >
> > So I want to add a connect timeout on the client. Please take a look
> > at:
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-148%3A+Add+a+connect+timeout+for+client
> >
> > Regards,
> >
> > David
>
> --
> -- Guozhang

--
-- Guozhang
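
For illustration, the connecting-phase timeout discussed in this thread
could be tracked roughly as follows. This is only a minimal sketch, not
the KIP-148 implementation or any existing Kafka client code: the class
ConnectTimeoutTracker and its methods are hypothetical names, and the
constructor argument stands in for the proposed connect.timeout.ms. The
idea is to record when each non-blocking connect starts and, after each
poll, expire any attempt that has been pending for at least the timeout
so that another node can be tried.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the Kafka client): remembers when each
// non-blocking connect started and reports attempts that have timed out.
class ConnectTimeoutTracker {
    // Stands in for the proposed connect.timeout.ms setting (assumption).
    private final long connectTimeoutMs;
    // Node id -> time (ms) at which the connect attempt was started.
    private final Map<String, Long> connectStartMs = new HashMap<>();

    ConnectTimeoutTracker(long connectTimeoutMs) {
        this.connectTimeoutMs = connectTimeoutMs;
    }

    // Call when a non-blocking connect to the node is initiated.
    void connecting(String nodeId, long nowMs) {
        connectStartMs.put(nodeId, nowMs);
    }

    // Call when the connect completes or the connection is closed.
    void finished(String nodeId) {
        connectStartMs.remove(nodeId);
    }

    // Returns the nodes whose connect has been pending for at least the
    // timeout and stops tracking them; the caller is expected to close
    // their channels and pick another node.
    List<String> expire(long nowMs) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : connectStartMs.entrySet()) {
            if (nowMs - entry.getValue() >= connectTimeoutMs) {
                expired.add(entry.getKey());
            }
        }
        for (String nodeId : expired) {
            connectStartMs.remove(nodeId);
        }
        return expired;
    }

    // True while at least one connect attempt is still pending.
    boolean isAnyNodeConnecting() {
        return !connectStartMs.isEmpty();
    }
}
```

Under these assumptions, a client loop would call expire(now) after every
poll, close the channels of the returned nodes, and mark them as
disconnected, so that a pending connect to a dead node no longer keeps a
check like isAnyNodeConnecting true for minutes.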