Hi Cheng,

Thanks for working on this.  Looks good.

How about "socket.connection.setup.timeout.ms" and 
"socket.connection.setup.timeout.max.ms" (not connections with an S)?

+1 (binding)

best,
Colin


On Wed, Jun 3, 2020, at 06:24, Rajini Sivaram wrote:
> Hi Cheng,
> 
> Thanks for the updates, looks good.
> 
> +1 (binding)
> 
> Regards,
> 
> Rajini
> 
> On Wed, Jun 3, 2020 at 8:53 AM Cheng Tan <c...@confluent.io> wrote:
> 
> > Dear Rajini,
> >
> > Thanks for the feedback.
> >
> > 1)
> > Because "request.timeout.ms" only affects in-flight requests, after the
> > API NetworkClient.ready() is invoked, the connection won't get closed after
> > "request.timeout.ms” hits. Before
> >         a) the SocketChannel is connected
> >         b) ssl handshake finished
> >         c) authentication has finished (sasl)
> > clients cannot invoke NetworkClient.send() to send any request, which
> > means no in-flight request targeting to the connection will be added.
> >
> >
> > 2)
> > I think a default value of 127 seconds make sense, which meets the timeout
> > indirectly specified by the default value of “tcp.syn.retries”. I’ve added
> > this into the KIP proposal.
> >
> >
> > 3)
> > Every time the timeout hits, the timeout value of the next connection try
> > will increase.
> >
> > The timeout will hit iff a connection stays at the `connecting` state
> > longer than the timeout value, as indicated by
> > ClusterConnectionStates.NodeConnectionState. The connection state of a node
> > will change iff `SelectionKey.OP_CONNECT` is detected by
> > `nioSelector.Select()`. The connection state may transit from `connecting`
> > to
> >
> >         a) `disconnected` when SocketChannel.finishConnect() throws
> > IOException.
> >         b) `connected` when SocketChannel.finishConnect() return TRUE.
> >
> > In other words, the timeout will hit and increase iff the interested
> > SelectionKey.OP_CONNECT doesn't trigger before the timeout arrives, which
> > means, for example, network congestion, failure of the ARP request, packet
> > filtering, routing error, or a silent discard may happen. (I didn’t read
> > the Java NIO source code. Please correct me the case when OP_CONNECT won’t
> > get triggered if I’m wrong)
> >
> >
> > 4)
> >
> > A) Connection timeout dominates both request timeout and API timeout
> >
> > When connection timeout hits, the connection will be closed. The client
> > will be notified either by the responses constructed by NetworkClient or
> > the callbacks attached to the request. As a result, the request failure
> > will be handled before either connection timeout or API timeout arrives.
> >
> >
> > B) Neither request timeout or API timeout dominates connection timeout
> >
> > i) Request timeout: Because request timeout only affects in-flight
> > requests, after the API NetworkClient.ready() is invoked, the connection
> > won't get closed after "request.timeout.ms” hits. Before
> >         1. the SocketChannel is connected
> >         2. SSL handshake finished
> >         3. authentication has finished (SASL)
> > , clients won't be able to invoke NetworkClient.send() to send any
> > request, which means no in-flight request targeting to the connection will
> > be added.
> >
> > ii) API timeout: In AdminClient, API timeout acts by putting a smaller and
> > smaller timeout value to the chain of requests in a same API. After the API
> > timeout hits, the retry logic won't close any connection. In consumer, API
> > timeout acts as a whole by putting a limit to the code block executing
> > time. The retry logic won't close any connection as well.
> >
> >
> > Conclusion:
> >
> > Thanks again for the long feedback and I’m always enjoying them. I’ve
> > supplement the above discussion into the KIP proposal. Please let me know
> > what you think.
> >
> >
> > Best, - Cheng Tan
> >
> >
> > > On Jun 2, 2020, at 3:01 AM, Rajini Sivaram <rajinisiva...@gmail.com>
> > wrote:
> > >
> > > Hi Cheng,
> > >
> > > Not sure if the discussion should move back to the DISCUSS thread. I
> > have a
> > > few questions:
> > >
> > > 1) The KIP motivation says that in some cases `request.timeout.ms`
> > doesn't
> > > timeout connections properly and as a result it takes 127s to detect a
> > > connection failure. This sounds like a bug rather than a limitation of
> > the
> > > current approach. Can you explain the scenarios where this occurs?
> > >
> > > 2) I think the current proposal is to use non-exponential 10s connection
> > > timeout as default with the option to use exponential timeout. So
> > > connection timeouts for every connection attempt will be between 8s and
> > 12s
> > > by default. Is that correct? Should we use a default max timeout to
> > enable
> > > exponential timeout by default since 8s seems rather small?
> > >
> > > 3) What is the scope of `failures` used to determine connection timeout
> > > with exponential timeouts? Will we always use 10s followed by 20s every
> > > time a connection is attempted?
> > >
> > > 4) It will be good if we can include two flows with the relationship
> > > between various timeouts in the KIP. One with a fixed node like a typical
> > > produce/consume request to the leader and another that uses
> > > `leastLoadedNode` like a metadata request. Having the comparison between
> > > the current and proposed behaviour w.r.t all configurable timeouts (the
> > two
> > > new connection timeouts, request timeout, api timeout etc.) will be
> > useful.
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> >
>

Reply via email to