Hi Rajini, Thanks for the comments.
> I think > they started off as connection timeouts but now include authentication time > as well. Have we considered using similar configs for this case? The new config I proposed is focusing on the connections to unreachable servers. The timeout count won’t involved the authentication. I think “socket.” prefix make sense. I’ll change it. Colin mentioned that adding the key word “setup” can help people understand that this config only applies to the connection setup. What about “socket.connection.setup.ms”? > The KIP proposes 10s as the default. What does this mean for typical > connections like a produce request going to the leader? The new timeout config applies to each connection. It’s at the NetworkClient level and won’t consider the underlying connection logic. Specifically, by default, every connection will have 10 seconds to become “connected” from “connecting”, which implies the corresponding socket channel is now connected (SocketChanel.finishConnect() returns True), no matter what the request logic and abstraction is. Please let me know if these make sense to you or if you have more thoughts. Thank you. Best, -Cheng > 在 2020年5月13日,上午7:11,Rajini Sivaram <rajinisiva...@gmail.com> 写道: > > Hi Cheng, > > Thanks for the KIP, sounds like a good improvement. A couple of comments: > > 1) We currently have client connection timeouts on the broker with configs > named `xxx.socket.timeout.ms` (e.g. controller.socket.timeout.ms). I think > they started off as connection timeouts but now include authentication time > as well. Have we considered using similar configs for this case? We may > want to prefix the new config with `socket.` anyway - something along the > lines of `socket.connection.timeout.ms` if it is just the connection time. > > 2) The KIP proposes 10s as the default. What does this mean for typical > connections like a produce request going to the leader? Instead of one > connection attempt to the leader, we want three separate connection > attempts within the request timeout to the leader? > > Regards, > > Rajini > > >> On Thu, May 7, 2020 at 11:51 PM Jose Garcia Sancio <jsan...@confluent.io> >> wrote: >> Cheng, >> Thanks for the KIP and the detailed proposal section. LGTM! >>>> On Thu, May 7, 2020 at 3:38 PM Cheng Tan <c...@confluent.io> wrote: >>>> I think more about the potential wider use cases, I modified the >> proposal to target all the connection. Thanks. >>> - Best, - Cheng Tan >>>> On May 7, 2020, at 1:41 AM, Cheng Tan <c...@confluent.io> wrote: >>>> Hi Colin, >>>> Sorry for the confusion. I’m proposing to implement timeout in the >> NetworkClient.leastLoadedNode() when iterating all the cached node. The >> alternative I can think is to implement the timeout in NetworkClient.poll() >>>> I’d prefer to implement in the leastLoadedNode(). Here’re the reasons: >>>> Usually when clients send a request, they will asking the network >> client to send the request to a specific node. In this case, the >> connection.setup.timeout won’t matter too much because the client doesn’t >> want to try other nodes for that specific request. The request level >> timeout would be enough. The metadata fetcher fetches the nodes status >> periodically so the clients can reassign the request to another node after >> timeout. >>>> Consumer, producer, and AdminClient are all using leastLoadedNode() >> for metadata fetch, where the connection setup timeout can play an >> important role. Unlike other requests can refer to the metadata for node >> condition, the metadata requests can only blindly choose a node for retry >> in the worst scenario. We want to make sure the client can get the metadata >> smoothly and as soon as possible. As a result, we need this >> connection.setup.timeout. >>>> Implementing the timeout in poll() or anywhere else might need an >> extra iteration of all nodes, which might downgrade the network client >> performance. >>>> I also updated the KIP content and KIP status. Please let me know if >> the above ideas make sense. Thanks. >>>> Best, - Cheng Tan >>>>> On May 4, 2020, at 5:26 PM, Colin McCabe <cmcc...@apache.org <mailto: >> cmcc...@apache.org>> wrote: >>>>> Hi Cheng, >>>>> On the KIP page, it lists this KIP as "draft." It seems like "under >> discussion" is appropriate here, right? >>>>>> Currently, the initial socket connection timeout is depending on >> Linux kernel setting >>>>>> tcp_syn_retries. The timeout value is 2 ^ (tcp_sync_retries + 1) - 1 >> seconds. For the >>>>>> reasons below, we want to control the client-side socket timeout >> directly using >>>>>> configuration files >>>>> Linux is just one example of an OS that Kafka could run on, right? >> You could also be running on MacOS, for example. >>>>>> I'm proposing to do a lazy socket connection time out. That is, we >> only check if >>>>>> we need to timeout a socket when we consider the corresponding node >> as a >>>>>> candidate in the node provider. >>>>> The NodeProvider is an AdminClient abstraction, right? Why wouldn't >> we implement a connection setup timeout for all clients, not just >> AdminClient? >>>>> best, >>>>> Colin >>>>> On Mon, May 4, 2020, at 13:18, Colin McCabe wrote: >>>>>> Hmm. A big part of the reason behind the KIP is that the default >>>>>> connection timeout behavior of the OS doesn't work for Kafka, right? >>>>>> For example, on Linux, if we wait 127 seconds for a connection >> attempt >>>>>> to time out, we won't get a chance to make another attempt in most >>>>>> cases. So I think it makes sense to set a shorter default. >>>>>> best, >>>>>> Colin >>>>>> On Mon, May 4, 2020, at 09:44, Jose Garcia Sancio wrote: >>>>>>> Thanks for the KIP Cheng, >>>>>>>> The default value will be 10 seconds. >>>>>>> I think we should make the default the current behavior. Meaning the >>>>>>> default should leverage the default connect timeout from the >> operating >>>>>>> system. >>>>>>>> Proposed Changes >>>>>>> I don't fully understand this section. It seems like it is mainly >>>>>>> focused on the problem with the current implementation. Can you >>>>>>> explain how the proposed changes solve the problem? >>>>>>> Thanks. >>>>>>> -- >>>>>>> -Jose >> -- >> -Jose