Hi Colin,

I think the exponential backoff should still apply. Thanks for the explanation.

Thanks,
David
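To make the backoff discussion concrete, here is a minimal sketch of capped exponential backoff. It assumes the delay doubles after each consecutive failed connection attempt, starting from a base value and never exceeding a maximum; the two arguments stand in for reconnect.backoff.ms and reconnect.backoff.max.ms. The class and method names are hypothetical, jitter is left out, and this is not the NetworkClient code.

```java
public class ReconnectBackoffSketch {

    // Hypothetical helper (an assumption for illustration, not Kafka's implementation):
    // returns the delay before the next reconnect attempt. The delay starts at baseMs,
    // doubles for each additional consecutive failure, and is capped at maxMs.
    static long backoffMs(long baseMs, long maxMs, int failures) {
        long delay = Math.min(baseMs, maxMs);
        for (int i = 1; i < failures && delay < maxMs; i++) {
            delay = Math.min(maxMs, delay * 2);
        }
        return delay;
    }

    public static void main(String[] args) {
        long baseMs = 50;   // stands in for reconnect.backoff.ms
        long maxMs = 1000;  // stands in for reconnect.backoff.max.ms
        for (int failures = 1; failures <= 7; failures++) {
            System.out.println(failures + " failure(s) -> wait "
                    + backoffMs(baseMs, maxMs, failures) + " ms");
        }
    }
}
```

With these example values the delay grows 50, 100, 200, 400, 800 ms and then stays at the 1000 ms cap, so a connect timeout only bounds each individual attempt while the backoff bounds how often attempts are made.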
------------------ Original Message ------------------
From: "Colin McCabe" <cmcc...@apache.org>
Sent: 2017-06-13, 1:43
To: "dev" <dev@kafka.apache.org>
Subject: Re: Reply: Re: [DISCUSS] KIP-148: Add a connect timeout for client

Just a note: KIP-144 added exponential backoff for broker reconnect attempts, configured via reconnect.backoff.max.ms.

cheers,
Colin

On Sat, Jun 10, 2017, at 08:42, David wrote:
> ------------------ Original Message ------------------
> From: "David" <254479...@qq.com>
> Sent: 2017-06-04, 6:05
> To: "dev" <dev@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
> > I guess one obvious question is, how does this interact with retries? Does it result in a failure getting delivered to the end user more quickly if connecting is impossible the first few times we try? Does exponential backoff still apply?
>
> Yes, with retries the end user will be able to connect more quickly. After a produce request fails because of a timeout, the network client closes the connection and starts connecting to the leastLoadedNode. If that node does not respond, we quickly close the connection attempt within the specified timeout and try another node.
>
> As for the exponential backoff, do you mean TCP's exponential backoff or the NetworkClient's exponential backoff? It seems the NetworkClient has no exponential backoff (only the reconnect.backoff.ms parameter).
>
> Thanks,
> David
>
> ------------------ Original Message ------------------
> From: "Colin McCabe" <cmcc...@apache.org>
> Sent: 2017-05-31, 2:44
> To: "dev" <dev@kafka.apache.org>
> Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
>
> On Mon, May 29, 2017, at 15:46, Guozhang Wang wrote:
> > On Wed, May 24, 2017 at 9:59 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > On Tue, May 23, 2017, at 19:07, Guozhang Wang wrote:
> > > > I think using a single config to cover end-to-end latency with connecting and request round-trip may not be the most appropriate since 1) some requests may need much more time than others since they are parked (fetch request with long polling, join group request etc) or throttled,
> > >
> > > Hmm. My proposal was to implement _both_ end-to-end timeouts and per-call timeouts. In that case, some requests needing much more time than others should not be a concern, since we can simply set a higher per-call timeout on the requests we think will need more time.
> > >
> > > > and 2) some requests are prerequisites of others, like the group request to discover the coordinator before the fetch offset request, and implementation wise these request send/receive is embedded in latter ones, hence it is not clear if the `request.timeout.ms` should cover just a single RPC or more.
> > >
> > > As far as I know, the request timeout has always covered a single RPC. If we want to implement a higher level timeout that spans multiple RPCs, we can set the per-call timeouts appropriately. For example:
> > >
> > > > long deadline = System.currentTimeMillis() + 60000;
> > > > callA(callTimeout = deadline - System.currentTimeMillis())
> > > > callB(callTimeout = deadline - System.currentTimeMillis())
> >
> > I may have misunderstood your previous email. Just clarifying:
> >
> > 1) On the client we already have some configs for controlling end-to-end timeout, e.g. "max.block.ms" on producer controls how long "send()" and "partitionsFor()" will block for, and inside such API calls multiple request round trips may be sent, and for the first request round trip, a connecting phase may or may not be included. All of these are covered in this "max.block.ms" timeout today. However, as we discussed before not all request round trips have similar latency expectation, so it is better to make a per-request "request.timeout.ms" and the overall "max.block.ms" would need to be at least the max of them.
>
> That makes sense.
>
> Just to be clear, when you say "per-request timeout" are you talking about a timeout that can be different for each request? (This doesn't exist today, but has been proposed.) Or are you talking about request.timeout.ms, the single timeout that currently applies to all requests in NetworkClient?
>
> > 2) Now back to the question whether we should make "request.timeout.ms" include potential connection phase as well: assume we are going to add the per-request "request.timeout.ms" as suggested above, then we may still have a tight bound on how long connecting should take. For example, let's say we make "joingroup.request.timeout.ms" (or "fetch.request.timeout.ms" to be large since we want really long polling behavior) to be a large value, say 200 seconds, then if the client is trying to connect to the broker while sending the request, and the broker has died, then we may still be blocked waiting for 30 seconds while I think David's motivation is to fail-fast in these cases.
>
> Thanks for the explanation. I think I understand better now. David wants to be able to have a long timeout for waiting for the server to process the request, but a shorter timeout for waiting for the connection to be established. In that case, implementing the additional timeout makes sense.
>
> I guess one obvious question is, how does this interact with retries? Does it result in a failure getting delivered to the end user more quickly if connecting is impossible the first few times we try? Does exponential backoff still apply?
>
> best,
> Colin
>
> > > > So no matter whether we add a `connect.timeout.ms` in addition to `request.timeout.ms`, we should consider adding per-request-type timeout value, and make `request.timeout.ms` a global default; if we add the `connect.timeout.ms` the per-request value is only for the round trip, otherwise it is supposed to include the connecting time. Personally I'd prefer the first option to add a universal `connect.timeout.ms`, and in another KIP consider adding per-request-type timeout overrides.
> > >
> > > Why have a special case for time spent connecting, though? Why would the user care where the time went, as long as the timeout was met? It feels like this is just a hack because we couldn't raise request.timeout.ms to the value that it "should" have been at for the shorter requests. As someone already commented, it's confusing to have all these knobs that we don't really need.
> >
> > I think that is exactly what David cares (please correct me if I'm wrong): for some request I would like to wait long enough for it to be completed, like join-group request; while at the same time if it has encountered some issues while trying to connect to the broker to send the join group request, I want to be notified sooner.
> >
> > > > BTW if the consumer issue is the only cause that we are having a high default value, I'd suggest we separate the consumer rebalance timeout and not piggy-back on the session timeout. Then we can set the default `request.timeout.ms` to a smaller value, like 10 secs.
> > > > This is orthogonal to this KIP discussion and we can continue this in a separate thread.
> > >
> > > +1
> > >
> > > cheers,
> > > Colin
> > >
> > > > Guozhang
> > > >
> > > > On Tue, May 23, 2017 at 3:31 PM, Colin McCabe <cmcc...@apache.org> wrote:
> > > > > Another note-- it would be really nice if timeouts were end-to-end, rather than being set for particular phases of an RPC. From a user point of view, a 30 second timeout should mean that the call either succeeds or fails after 30 seconds, regardless of how much time is spent looking for metadata, connecting to brokers, waiting for brokers, etc. This is implemented in AdminClient by setting a deadline when the call is first created and referring to that afterwards.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Tue, May 23, 2017, at 13:18, Colin McCabe wrote:
> > > > > > In the AdminClient, we allow setting per-call timeouts. The global timeout is just a default. It seems like that is really what we should do in the producer and consumer as well, rather than having a lot of special cases for timeouts in connecting vs. other call states. Then join requests could have a 5 minute timeout, but other requests could have a shorter one. Thoughts?
> > > > > >
> > > > > > Cheers,
> > > > > > Colin
> > > > > >
> > > > > > On Tue, May 23, 2017, at 04:27, Rajini Sivaram wrote:
> > > > > > > Guozhang,
> > > > > > >
> > > > > > > At the moment we don't have a connect timeout. And the behaviour suggested in the KIP is useful to address this.
> > > > > > >
> > > > > > > We do however have a request.timeout.ms.
> > > > > > > This is the amount of time it would take to detect a crashed broker if the broker crashed after a connection was established. Unfortunately in the consumer, this was increased to more than 5 minutes since JoinRequest can take up to max.poll.interval.ms, which has a default of 5 minutes. Since the whole point of this timeout is to detect a crashed broker, 5 minutes is too large.
> > > > > > >
> > > > > > > My suggestion was to use request.timeout.ms to also detect connection timeouts to a crashed broker - implement the behavior suggested in the KIP without adding a new config parameter. As Ismael has said, this will need to fix request.timeout.ms in the consumer.
> > > > > > >
> > > > > > > On Mon, May 22, 2017 at 1:23 PM, Simon Souter <sim...@cakesolutions.net> wrote:
> > > > > > > > The following tickets are probably relevant to this KIP:
> > > > > > > >
> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3457
> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-1894
> > > > > > > > https://issues.apache.org/jira/browse/KAFKA-3834
> > > > > > > >
> > > > > > > > On 22 May 2017 at 16:30, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
> > > > > > > > > Ismael,
> > > > > > > > >
> > > > > > > > > Yes, agree. My concern was that a connection can be shut down uncleanly at any time. If a client is in the middle of a request, then it times out after min(request.timeout.ms, tcp-timeout).
> > > > > > > > > If we add another config option connect.timeout.ms, then we will sometimes wait for min(connect.timeout.ms, tcp-timeout) and sometimes for min(request.timeout.ms, tcp-timeout), depending on connection state. One config option feels neater to me.
> > > > > > > > >
> > > > > > > > > On Mon, May 22, 2017 at 11:21 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> > > > > > > > > > Rajini,
> > > > > > > > > >
> > > > > > > > > > For this to have the desired effect, we'd probably need to lower the default request.timeout.ms for the consumer and fix the underlying reason why it is a little over 5 minutes at the moment.
> > > > > > > > > >
> > > > > > > > > > Ismael
> > > > > > > > > >
> > > > > > > > > > On Mon, May 22, 2017 at 4:15 PM, Rajini Sivaram <rajinisiva...@gmail.com> wrote:
> > > > > > > > > > > Hi David,
> > > > > > > > > > >
> > > > > > > > > > > Sorry, what I meant was: Can you reuse the existing configuration option request.timeout.ms, instead of adding a new config, and add the behaviour that you have proposed in the KIP for the connection phase using this timeout? I think the timeout for connection is useful. I am not sure we need another configuration option to implement it.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Rajini
> > > > > > > > > > >
> > > > > > > > > > > On Mon, May 22, 2017 at 11:06 AM, David <254479...@qq.com> wrote:
> > > > > > > > > > > > Hi Rajini.
> > > > > > > > > > > >
> > > > > > > > > > > > When a Kafka node's machine is shut down or its network is down, the connecting phase cannot use request.timeout.ms, because the client has not sent a request yet. With no response at the NIO level, the selector will not close the connection attempt, so the client will not choose another healthy node to get the metadata.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > David
> > > > > > > > > > > >
> > > > > > > > > > > > ------------------ Original Message ------------------
> > > > > > > > > > > > From: "Rajini Sivaram" <rajinisiva...@gmail.com>
> > > > > > > > > > > > Sent: 2017-05-22, 20:17
> > > > > > > > > > > > To: "dev" <dev@kafka.apache.org>
> > > > > > > > > > > > Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
> > > > > > > > > > > >
> > > > > > > > > > > > Hi David,
> > > > > > > > > > > >
> > > > > > > > > > > > Is there a reason why you wouldn't want to use request.timeout.ms as the timeout parameter for connections? Then you would use the same timeout for connected and connecting phases when shutdown is unclean. You could still use the timeout to ensure that the next metadata request is sent to another node.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Rajini
> > > > > > > > > > > >
> > > > > > > > > > > > On Sun, May 21, 2017 at 9:51 AM, David <254479...@qq.com> wrote:
> > > > > > > > > > > > > Hi Guozhang,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the clarification. Regarding point 2, I think the key issue is not whether users can control the maximum time to wait inside their code, but whether the network client can detect that a connection attempt cannot complete and try a healthy node instead. In the producer's sender, even though selector.poll can time out, the next iteration still does not close the previous connection attempt and try another healthy node.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In our test environment, QA shut down one of the leader nodes. The producer's send request times out, the client closes that node's connection, and then it requests the metadata. But sometimes the node chosen for the metadata request is also the shutdown node.
> > > > > > > > > > > > > When connecting to the shutdown node to get the metadata, the client is in the connecting phase: the network client marks the node's state as CONNECTING, but because the node is shut down, the socket cannot tell that the connection attempt is broken. Although selector.poll has a timeout parameter, it will not close the connection, so the next time "networkclient.maybeUpdate" runs, it checks isAnyNodeConnecting and then does not connect to any healthy node to get the metadata. It takes several minutes to notice that the connection attempt has timed out and to try another node.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So I want to add a connect.timeout parameter, so that the selector can detect that the connection attempt has timed out and close the connection. It seems the timeout value currently passed to `selector.poll()` cannot do this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > ------------------ Original Message ------------------
> > > > > > > > > > > > > From: "Guozhang Wang" <wangg...@gmail.com>
> > > > > > > > > > > > > Sent: 2017-05-16, 1:51
> > > > > > > > > > > > > To: "dev@kafka.apache.org" <dev@kafka.apache.org>
> > > > > > > > > > > > > Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi David,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I may have been a bit confused before, just clarifying a few things:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. As you mentioned, a client will always try to first establish the connection with a broker node before it tries to send any request to it. And after the connection is established, it will either continuously send many requests (e.g. produce) or just a single request (e.g. metadata) to the broker, so these two phases are indeed different.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. In the connected phase, connections.max.idle.ms is used to auto-disconnect the socket if no requests have been sent / received during that period of time; in the connecting phase, we always try to create the socket via "socketChannel.connect" in a non-blocking call, and then check if the connection has been established, but all the callers of this function (in either producer or consumer) have a timeout parameter as in `selector.poll()`, and the timeout parameter is set either by calculations based on metadata.expiration.time and backoff for producer#sender, or by directly passed values from consumer#poll(timeout), so although there is no config directly controlling that, users can still control the maximum time to wait inside their code.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I originally thought your scenario was more about the connected phase, but now I feel you are talking about the connecting phase.
> > > > > > > > > > > > > For that case, I still feel the timeout value currently passed in `selector.poll()`, which is controllable from user code, should be sufficient?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, May 14, 2017 at 2:37 AM, David <254479...@qq.com> wrote:
> > > > > > > > > > > > > > Hi Guozhang,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Sorry for the delay, and thanks for the question. These look like two different parameters to me:
> > > > > > > > > > > > > > connect.timeout.ms: only applies to the connecting phase; once the connection is established this parameter is not used.
> > > > > > > > > > > > > > connections.max.idle.ms: currently does not apply in the connecting phase (a connection is only added to the expiry manager once select returns readyKeys > 0); after the connection is established, it checks whether the connection has been active within that time.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Even if we change connections.max.idle.ms to also cover the connecting phase, we cannot set this parameter to a small value, such as 5 seconds.
> > > > > > > > > > > > > > Because the client may be busy sending messages to other nodes, an idle connection would be disconnected after 5 seconds, which is why the default value of connections.max.idle.ms is set to a larger time. We should have two parameters, one to control the connecting phase behavior and one for the connected phase behavior. Do you agree?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ------------------ Original Message ------------------
> > > > > > > > > > > > > > From: "Guozhang Wang" <wangg...@gmail.com>
> > > > > > > > > > > > > > Sent: 2017-05-06, 7:52
> > > > > > > > > > > > > > To: "dev@kafka.apache.org" <dev@kafka.apache.org>
> > > > > > > > > > > > > > Subject: Re: [DISCUSS] KIP-148: Add a connect timeout for client
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello David,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the KIP. For the described issue, I'm wondering if it can be resolved by tuning the CONNECTIONS_MAX_IDLE_MS_CONFIG (connections.max.idle.ms) on the client side? Default is 9 minutes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Guozhang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, May 2, 2017 at 8:22 AM, David <254479...@qq.com> wrote:
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently in our test environment, we found that after one of the broker nodes crashes (reboot or OS crash), the client may still be trying to connect to the crashed node to send a metadata request or other request, and it needs several minutes to notice that the connection attempt has timed out before it tries another node to send the request. As a result, the client may still not be aware of the metadata change after several minutes.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So I want to add a connect timeout on the client, please take a look at:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-148%3A+Add+a+connect+timeout+for+client
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > > > --
> > > > > > > > Simon Souter
> > > >
> > > > --
> > > > -- Guozhang
> >
> > --
> > -- Guozhang
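
The connect timeout proposed at the start of this thread can be illustrated with plain java.nio. The following is only a sketch under stated assumptions, not the NetworkClient or Selector code from Kafka: the class name, the method names `connect` and `connectToAny`, and the parameter `connectTimeoutMs` are all hypothetical. It starts a non-blocking connect, waits on a selector until a deadline, closes the channel if the deadline passes, and lets the caller move on to the next candidate node.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ConnectTimeoutSketch {

    // Hypothetical helper: completes a non-blocking connect within connectTimeoutMs.
    // On timeout or failure the channel is closed so the caller can try another node.
    static SocketChannel connect(SocketAddress address, long connectTimeoutMs) throws IOException {
        SocketChannel channel = SocketChannel.open();
        boolean connected = false;
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            connected = channel.connect(address); // may complete immediately for local peers
            if (!connected) {
                channel.register(selector, SelectionKey.OP_CONNECT);
                long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(connectTimeoutMs);
                while (!connected) {
                    long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - System.nanoTime());
                    if (remainingMs <= 0) {
                        throw new SocketTimeoutException(
                                "connect to " + address + " timed out after " + connectTimeoutMs + " ms");
                    }
                    selector.select(remainingMs);         // wait for OP_CONNECT or the deadline
                    selector.selectedKeys().clear();
                    connected = channel.finishConnect();  // throws if the connect failed
                }
            }
            return channel;
        } finally {
            if (!connected) {
                channel.close(); // do not leave a half-open attempt behind
            }
        }
    }

    // Hypothetical helper: tries each candidate node in order and returns the first
    // channel that connects within the timeout.
    static SocketChannel connectToAny(List<? extends SocketAddress> nodes, long connectTimeoutMs)
            throws IOException {
        IOException last = new IOException("no candidate nodes");
        for (SocketAddress node : nodes) {
            try {
                return connect(node, connectTimeoutMs);
            } catch (IOException e) {
                last = e; // remember the failure and fall through to the next node
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        // A local listener stands in for a healthy broker.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             SocketChannel ch = connectToAny(List.of(server.getLocalSocketAddress()), 2000)) {
            System.out.println("connected: " + ch.isConnected());
        }
    }
}
```

The point of the separate deadline is the one made throughout the thread: the wait for the connection to be established is bounded independently of any request timeout, so a dead node costs at most `connectTimeoutMs` before the next node is tried.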