Re: [DISCUSS] KIP-601: Configurable socket connection timeout

Rajini Sivaram Thu, 14 May 2020 06:45:25 -0700

Hi Cheng,

1) Thanks for the update,  the KIP now says `
socket.connections.setup.timeout.ms*`*, which sounds good.


2) Not sure 10s is a good default because it unnecessarily times out
connections, only to attempt re-connecting to the same broker (except in
the leastLoadedNode case where it would be useful to have a lower timeout).
Couldn't we just use 30s (request timeout) as the default value, retaining
the current behaviour? Or alternatively, only apply the timeout where it
makes sense to have a lower timeout (i.e. a timeout for leastLoadedNode)?

Regards,

Rajini

On Thu, May 14, 2020 at 5:00 AM Cheng Tan <c...@confluent.io> wrote:

> Hi Rajini,
>
> Thanks for the comments.
>
> > I think
> > they started off as connection timeouts but now include authentication
> time
> > as well. Have we considered using similar configs for this case?
>
>
> The new config I proposed is focusing on the connections to unreachable
> servers. The timeout count won’t involved the authentication. I think
> “socket.” prefix make sense. I’ll change it. Colin mentioned that adding
> the key word “setup” can help people understand that this config only
> applies to the connection setup. What about “socket.connection.setup.ms”?
>
> > The KIP proposes 10s as the default. What does this mean for typical
> > connections like a produce request going to the leader?
>
> The new timeout config applies to each connection. It’s at the
> NetworkClient level and won’t consider the underlying connection logic.
> Specifically, by default, every connection will have 10 seconds to become
> “connected” from “connecting”, which implies the corresponding socket
> channel is now connected (SocketChanel.finishConnect() returns True), no
> matter what the request logic and abstraction is.
>
> Please let me know if these make sense to you or if you have more
> thoughts. Thank you.
>
> Best, -Cheng
>
> > 在 2020年5月13日，上午7:11，Rajini Sivaram <rajinisiva...@gmail.com> 写道：
> >
> > Hi Cheng,
> >
> > Thanks for the KIP, sounds like a good improvement. A couple of comments:
> >
> > 1) We currently have client connection timeouts on the broker with
> configs
> > named `xxx.socket.timeout.ms` (e.g. controller.socket.timeout.ms). I
> think
> > they started off as connection timeouts but now include authentication
> time
> > as well. Have we considered using similar configs for this case? We may
> > want to prefix the new config with `socket.` anyway - something along the
> > lines of `socket.connection.timeout.ms` if it is just the connection
> time.
> >
> > 2) The KIP proposes 10s as the default. What does this mean for typical
> > connections like a produce request going to the leader? Instead of one
> > connection attempt to the leader, we want three separate connection
> > attempts within the request timeout to the leader?
> >
> > Regards,
> >
> > Rajini
> >
> >
> >> On Thu, May 7, 2020 at 11:51 PM Jose Garcia Sancio <
> jsan...@confluent.io>
> >> wrote:
> >> Cheng,
> >> Thanks for the KIP and the detailed proposal section. LGTM!
> >>>> On Thu, May 7, 2020 at 3:38 PM Cheng Tan <c...@confluent.io> wrote:
> >>>> I think more about the potential wider use cases, I modified the
> >> proposal to target all the connection. Thanks.
> >>> - Best, - Cheng Tan
> >>>> On May 7, 2020, at 1:41 AM, Cheng Tan <c...@confluent.io> wrote:
> >>>> Hi Colin,
> >>>> Sorry for the confusion. I’m proposing to implement timeout in the
> >> NetworkClient.leastLoadedNode() when iterating all the cached node. The
> >> alternative I can think is to implement the timeout in
> NetworkClient.poll()
> >>>> I’d prefer to implement in the leastLoadedNode(). Here’re the reasons:
> >>>> Usually when clients send a request, they will asking the network
> >> client to send the request to a specific node. In this case, the
> >> connection.setup.timeout won’t matter too much because the client
> doesn’t
> >> want to try other nodes for that specific request. The request level
> >> timeout would be enough. The metadata fetcher fetches the nodes status
> >> periodically so the clients can reassign the request to another node
> after
> >> timeout.
> >>>> Consumer, producer, and AdminClient are all using leastLoadedNode()
> >> for metadata fetch, where the connection setup timeout can play an
> >> important role. Unlike other requests can refer to the metadata for node
> >> condition, the metadata requests can only blindly choose a node for
> retry
> >> in the worst scenario. We want to make sure the client can get the
> metadata
> >> smoothly and as soon as possible. As a result, we need this
> >> connection.setup.timeout.
> >>>> Implementing the timeout in poll() or anywhere else might need an
> >> extra iteration of all nodes, which might downgrade the network client
> >> performance.
> >>>> I also updated the KIP content and KIP status. Please let me know if
> >> the above ideas make sense. Thanks.
> >>>> Best, - Cheng Tan
> >>>>> On May 4, 2020, at 5:26 PM, Colin McCabe <cmcc...@apache.org
> <mailto:
> >> cmcc...@apache.org>> wrote:
> >>>>> Hi Cheng,
> >>>>> On the KIP page, it lists this KIP as "draft."  It seems like "under
> >> discussion" is appropriate here, right?
> >>>>>> Currently, the initial socket connection timeout is depending on
> >> Linux kernel setting
> >>>>>> tcp_syn_retries. The timeout value is 2 ^ (tcp_sync_retries + 1) - 1
> >> seconds. For the
> >>>>>> reasons below, we want to control the client-side socket timeout
> >> directly using
> >>>>>> configuration files
> >>>>> Linux is just one example of an OS that Kafka could run on, right?
> >> You could also be running on MacOS, for example.
> >>>>>> I'm proposing to do a lazy socket connection time out. That is, we
> >> only check if
> >>>>>> we need to timeout a socket when we consider the corresponding node
> >> as a
> >>>>>> candidate in the node provider.
> >>>>> The NodeProvider is an AdminClient abstraction, right?  Why wouldn't
> >> we implement a connection setup timeout for all clients, not just
> >> AdminClient?
> >>>>> best,
> >>>>> Colin
> >>>>> On Mon, May 4, 2020, at 13:18, Colin McCabe wrote:
> >>>>>> Hmm.  A big part of the reason behind the KIP is that the default
> >>>>>> connection timeout behavior of the OS doesn't work for Kafka, right?
> >>>>>> For example, on Linux, if we wait 127 seconds for a connection
> >> attempt
> >>>>>> to time out, we won't get a chance to make another attempt in most
> >>>>>> cases.  So I think it makes sense to set a shorter default.
> >>>>>> best,
> >>>>>> Colin
> >>>>>> On Mon, May 4, 2020, at 09:44, Jose Garcia Sancio wrote:
> >>>>>>> Thanks for the KIP Cheng,
> >>>>>>>> The default value will be 10 seconds.
> >>>>>>> I think we should make the default the current behavior. Meaning
> the
> >>>>>>> default should leverage the default connect timeout from the
> >> operating
> >>>>>>> system.
> >>>>>>>> Proposed Changes
> >>>>>>> I don't fully understand this section. It seems like it is mainly
> >>>>>>> focused on the problem with the current implementation. Can you
> >>>>>>> explain how the proposed changes solve the problem?
> >>>>>>> Thanks.
> >>>>>>> --
> >>>>>>> -Jose
> >> --
> >> -Jose
>

Re: [DISCUSS] KIP-601: Configurable socket connection timeout

Reply via email to