TCP_USER_TIMEOUT hard-closes the connection when the peer has been
unresponsive for too long, aborting all in-progress requests. That forces a
tradeoff: set the timeout too short and you abort valid requests; set it
too long and you keep using a dead connection. You allude to this when you
say that TCP_USER_TIMEOUT shouldn't be set to a low value in an unreliable
network environment.
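
For concreteness, here's a minimal Go sketch of the keep-alive
configuration you describe below. The durations are illustrative, not
recommendations, and the function name is just for the example:

    package main

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials/insecure"
        "google.golang.org/grpc/keepalive"
    )

    // dial opens a gRPC client connection with keep-alive enabled.
    func dial(target string) (*grpc.ClientConn, error) {
        return grpc.NewClient(target,
            grpc.WithTransportCredentials(insecure.NewCredentials()),
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                // Send a keep-alive PING after 30s with no activity.
                Time: 30 * time.Second,
                // Abort the connection if the PING isn't acknowledged
                // within 10s; per your note, grpc-go also sets
                // TCP_USER_TIMEOUT to this value.
                Timeout: 10 * time.Second,
            }),
        )
    }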

We don't want to close a connection with an in-flight request on it. If we
have N long-running requests with no response and the user cancels N-1 of
them, we should maintain the connection until the final request receives a
response or is canceled by the user. However, we might want to create a new
connection for new requests if the existing connection appears to be
unresponsive. TCP_USER_TIMEOUT does not provide a mechanism to do that.
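
A rough sketch of the policy I have in mind (hypothetical types, not any
existing API): mark a connection suspect when a health probe fails, stop
routing new requests to it, and close it only once the last in-flight
request finishes:

    package pool

    import "sync"

    // conn tracks the health and usage of one HTTP/2 connection.
    type conn struct {
        mu       sync.Mutex
        inFlight int  // requests started but not yet finished
        suspect  bool // health check failed; don't send new requests
    }

    // usable reports whether new requests may be routed to this
    // connection; when it returns false, the pool dials a fresh one.
    func (c *conn) usable() bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        return !c.suspect
    }

    // finish records a request completing (response received, or the
    // user canceled it). The connection is closed only once it is both
    // suspect and fully drained.
    func (c *conn) finish() (closeNow bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.inFlight--
        return c.suspect && c.inFlight == 0
    }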

On Mon, Dec 2, 2024 at 4:02 PM Yuri Golobokov <golobo...@google.com> wrote:

>
> A common connection failure mode is for a server to become entirely
>> unresponsive
>
> This should be caught by TCP_USER_TIMEOUT. If you enable gRPC keep-alive,
> then normally TCP_USER_TIMEOUT will be set to the value of
> keepAliveTimeout (at least in Java/Go, AFAIK). You can then set
> keepAliveTimeout to, say, 10 seconds to detect unresponsive connections
> within 10 seconds of sending any frame. But please note it is not
> recommended to set TCP_USER_TIMEOUT to such a low value in an unreliable
> (e.g. mobile) network environment.
>
> On Mon, Dec 2, 2024 at 3:51 PM 'Damien Neil' via grpc.io <
> grpc-io@googlegroups.com> wrote:
>
>> Even one minute is really too long.
>>
>> A common connection failure mode is for a server to become entirely
>> unresponsive, due to a backend restarting or load balancing shifting
>> traffic off a cluster entirely. For HTTP/1 traffic, this results in a
>> single failed request on a connection. Abandoning an HTTP/1 request renders
>> the connection unusable for future requests, so the connection is discarded
>> and replaced with a new one. For HTTP/2 traffic, however, there is no
>> natural limit to the number of requests that can be sent on a
>> dead/unresponsive connection: when a request times out, the client sends an
>> RST_STREAM, and the connection becomes immediately available to take an
>> additional request. There's no acknowledgement of RST_STREAM frames, so
>> sending one doesn't provide any information about whether the lack of
>> response to a request is because the server is generally unresponsive, or
>> because the request is still being processed.
>>
>> Sending a PING frame along with an RST_STREAM allows a client to
>> distinguish between an unresponsive server and a slow response.
>>
>> Delay that check by one minute, and we have a one-minute period during
>> which we might be directing traffic to a dead server. That's an eternity.
>>
>>> I question if that gets you what you need. If you start three requests at
>>> the same time with timeouts of 1s, 2s, 3s, then you'll still run afoul of
>>> the limit.
>>
>>
>> Send a PING along with the RST_STREAM for the first request to be
>> canceled, and the PING response confirms that all three requests have
>> arrived at the server. We can then skip sending a PING when canceling the
>> remaining requests.
>>
>> On Mon, Dec 2, 2024 at 2:57 PM Eric Anderson <ej...@google.com> wrote:
>>
>>> On Mon, Dec 2, 2024 at 2:19 PM 'Damien Neil' via grpc.io <
>>> grpc-io@googlegroups.com> wrote:
>>>
>>>> I learned of this in https://go.dev/issue/70575, which is an issue
>>>> filed against Go's HTTP/2 client, caused by a new health check we'd added:
>>>> When a request times out or is canceled, we send an RST_STREAM frame for it.
>>>> Servers don't respond to RST_STREAM, so we bundle the RST_STREAM with a
>>>> PING frame to confirm that the server is still alive and responsive. In the
>>>> event many requests are canceled at once, we send only one PING for the
>>>> batch.
>>>>
>>>
>>> Our keepalive does something similar, but it is time-based: if it has
>>> been X amount of time since the connection last received anything, then a
>>> PING to check the connection is fair. The problem is only the
>>> "aggressive" PING rate by the client. The client is doing exactly what
>>> the server was trying to prevent: "overzealous" connection checking. I do
>>> think it is more appropriate to base it on a connection-level timer
>>> instead of a per-request time, although you probably don't have a
>>> connection-level time to auto-tune to, whereas you do get feedback from
>>> requests timing out.
>>>
>>> I'm wary of tying keepalive checks to resets/deadlines, as those are
>>> load-shedding operations and people can have aggressive deadlines or cancel
>>> aggressively as a matter of course. In addition, TCP_USER_TIMEOUT with
>>> the RST_STREAM gets you a lot of the same value without requiring
>>> additional ACK packets.
>>>
>>> Note that I do think the 5 minutes is too large, but that's all I was
>>> able to get agreement for. Compared to 2 hours it is short... I really
>>> wanted a bit shy of 1 minute, since 1 minute is the magic inactivity
>>> timeout for many home NATs and some cloud LBs.
>>>
>>> I think that gRPC servers should reset the ping strike count when they
>>>> *receive* a HEADERS or DATA frame.
>>>>
>>>
>>> I'm biased against the idea, as that's the rough behavior of a certain
>>> server, and it was nothing but useless and a pain. HEADERS and DATA
>>> really have nothing to do with monitoring the connection, so it seems
>>> strange to let the client choose when to reset the counter. For BDP
>>> monitoring, we need the counter to be reset when the server sends DATA,
>>> so that PINGs can be used to adjust the client's receive window size. And
>>> I know of an implementation that sent unnecessary frames just to reset
>>> the counter so it could send PINGs.
>>>
>>> I question if that gets you what you need. If you start three requests
>>> at the same time with timeouts of 1s, 2s, 3s, then you'll still run afoul
>>> of the limit.
>>>
