Hi Cheng, 1) Thanks for the update, the KIP now says ` socket.connections.setup.timeout.ms*`*, which sounds good.
2) Not sure 10s is a good default because it unnecessarily times out connections, only to attempt re-connecting to the same broker (except in the leastLoadedNode case where it would be useful to have a lower timeout). Couldn't we just use 30s (request timeout) as the default value, retaining the current behaviour? Or alternatively, only apply the timeout where it makes sense to have a lower timeout (i.e. a timeout for leastLoadedNode)? Regards, Rajini On Thu, May 14, 2020 at 5:00 AM Cheng Tan <c...@confluent.io> wrote: > Hi Rajini, > > Thanks for the comments. > > > I think > > they started off as connection timeouts but now include authentication > time > > as well. Have we considered using similar configs for this case? > > > The new config I proposed is focusing on the connections to unreachable > servers. The timeout count won’t involved the authentication. I think > “socket.” prefix make sense. I’ll change it. Colin mentioned that adding > the key word “setup” can help people understand that this config only > applies to the connection setup. What about “socket.connection.setup.ms”? > > > The KIP proposes 10s as the default. What does this mean for typical > > connections like a produce request going to the leader? > > The new timeout config applies to each connection. It’s at the > NetworkClient level and won’t consider the underlying connection logic. > Specifically, by default, every connection will have 10 seconds to become > “connected” from “connecting”, which implies the corresponding socket > channel is now connected (SocketChanel.finishConnect() returns True), no > matter what the request logic and abstraction is. > > Please let me know if these make sense to you or if you have more > thoughts. Thank you. > > Best, -Cheng > > > 在 2020年5月13日,上午7:11,Rajini Sivaram <rajinisiva...@gmail.com> 写道: > > > > Hi Cheng, > > > > Thanks for the KIP, sounds like a good improvement. A couple of comments: > > > > 1) We currently have client connection timeouts on the broker with > configs > > named `xxx.socket.timeout.ms` (e.g. controller.socket.timeout.ms). I > think > > they started off as connection timeouts but now include authentication > time > > as well. Have we considered using similar configs for this case? We may > > want to prefix the new config with `socket.` anyway - something along the > > lines of `socket.connection.timeout.ms` if it is just the connection > time. > > > > 2) The KIP proposes 10s as the default. What does this mean for typical > > connections like a produce request going to the leader? Instead of one > > connection attempt to the leader, we want three separate connection > > attempts within the request timeout to the leader? > > > > Regards, > > > > Rajini > > > > > >> On Thu, May 7, 2020 at 11:51 PM Jose Garcia Sancio < > jsan...@confluent.io> > >> wrote: > >> Cheng, > >> Thanks for the KIP and the detailed proposal section. LGTM! > >>>> On Thu, May 7, 2020 at 3:38 PM Cheng Tan <c...@confluent.io> wrote: > >>>> I think more about the potential wider use cases, I modified the > >> proposal to target all the connection. Thanks. > >>> - Best, - Cheng Tan > >>>> On May 7, 2020, at 1:41 AM, Cheng Tan <c...@confluent.io> wrote: > >>>> Hi Colin, > >>>> Sorry for the confusion. I’m proposing to implement timeout in the > >> NetworkClient.leastLoadedNode() when iterating all the cached node. The > >> alternative I can think is to implement the timeout in > NetworkClient.poll() > >>>> I’d prefer to implement in the leastLoadedNode(). Here’re the reasons: > >>>> Usually when clients send a request, they will asking the network > >> client to send the request to a specific node. In this case, the > >> connection.setup.timeout won’t matter too much because the client > doesn’t > >> want to try other nodes for that specific request. The request level > >> timeout would be enough. The metadata fetcher fetches the nodes status > >> periodically so the clients can reassign the request to another node > after > >> timeout. > >>>> Consumer, producer, and AdminClient are all using leastLoadedNode() > >> for metadata fetch, where the connection setup timeout can play an > >> important role. Unlike other requests can refer to the metadata for node > >> condition, the metadata requests can only blindly choose a node for > retry > >> in the worst scenario. We want to make sure the client can get the > metadata > >> smoothly and as soon as possible. As a result, we need this > >> connection.setup.timeout. > >>>> Implementing the timeout in poll() or anywhere else might need an > >> extra iteration of all nodes, which might downgrade the network client > >> performance. > >>>> I also updated the KIP content and KIP status. Please let me know if > >> the above ideas make sense. Thanks. > >>>> Best, - Cheng Tan > >>>>> On May 4, 2020, at 5:26 PM, Colin McCabe <cmcc...@apache.org > <mailto: > >> cmcc...@apache.org>> wrote: > >>>>> Hi Cheng, > >>>>> On the KIP page, it lists this KIP as "draft." It seems like "under > >> discussion" is appropriate here, right? > >>>>>> Currently, the initial socket connection timeout is depending on > >> Linux kernel setting > >>>>>> tcp_syn_retries. The timeout value is 2 ^ (tcp_sync_retries + 1) - 1 > >> seconds. For the > >>>>>> reasons below, we want to control the client-side socket timeout > >> directly using > >>>>>> configuration files > >>>>> Linux is just one example of an OS that Kafka could run on, right? > >> You could also be running on MacOS, for example. > >>>>>> I'm proposing to do a lazy socket connection time out. That is, we > >> only check if > >>>>>> we need to timeout a socket when we consider the corresponding node > >> as a > >>>>>> candidate in the node provider. > >>>>> The NodeProvider is an AdminClient abstraction, right? Why wouldn't > >> we implement a connection setup timeout for all clients, not just > >> AdminClient? > >>>>> best, > >>>>> Colin > >>>>> On Mon, May 4, 2020, at 13:18, Colin McCabe wrote: > >>>>>> Hmm. A big part of the reason behind the KIP is that the default > >>>>>> connection timeout behavior of the OS doesn't work for Kafka, right? > >>>>>> For example, on Linux, if we wait 127 seconds for a connection > >> attempt > >>>>>> to time out, we won't get a chance to make another attempt in most > >>>>>> cases. So I think it makes sense to set a shorter default. > >>>>>> best, > >>>>>> Colin > >>>>>> On Mon, May 4, 2020, at 09:44, Jose Garcia Sancio wrote: > >>>>>>> Thanks for the KIP Cheng, > >>>>>>>> The default value will be 10 seconds. > >>>>>>> I think we should make the default the current behavior. Meaning > the > >>>>>>> default should leverage the default connect timeout from the > >> operating > >>>>>>> system. > >>>>>>>> Proposed Changes > >>>>>>> I don't fully understand this section. It seems like it is mainly > >>>>>>> focused on the problem with the current implementation. Can you > >>>>>>> explain how the proposed changes solve the problem? > >>>>>>> Thanks. > >>>>>>> -- > >>>>>>> -Jose > >> -- > >> -Jose >