> A common connection failure mode is for a server to become entirely
> unresponsive
This should be caught by TCP_USER_TIMEOUT. If you enable gRPC keep-alive, TCP_USER_TIMEOUT will normally be set to the value of keepAliveTimeout (at least in Java/Go, AFAIK). You can then set keepAliveTimeout to, say, 10 seconds to detect an unresponsive connection within 10 seconds of sending any frame. But please note that it is not recommended to set TCP_USER_TIMEOUT to such low values in an unreliable (e.g. mobile) network environment.

On Mon, Dec 2, 2024 at 3:51 PM 'Damien Neil' via grpc.io <grpc-io@googlegroups.com> wrote:

> Even one minute is really too long.
>
> A common connection failure mode is for a server to become entirely
> unresponsive, due to a backend restarting or load balancing shifting
> traffic off a cluster entirely. For HTTP/1 traffic, this results in a
> single failed request on a connection. Abandoning an HTTP/1 request
> renders the connection unusable for future requests, so the connection is
> discarded and replaced with a new one. For HTTP/2 traffic, however, there
> is no natural limit to the number of requests which can be sent to a
> dead/unresponsive connection: when a request times out, the client sends
> an RST_STREAM, and the connection immediately becomes available to take
> an additional request. There's no acknowledgement of RST_STREAM frames,
> so sending one doesn't provide any information about whether the lack of
> response to a request is because the server is generally unresponsive, or
> because the request is still being processed.
>
> Sending a PING frame along with an RST_STREAM allows a client to
> distinguish between an unresponsive server and a slow response.
>
> Delay that check by one minute, and we have a one-minute period during
> which we might be directing traffic to a dead server. That's an eternity.
>
>> I question if that gets you what you need. If you start three requests
>> at the same time with timeouts of 1s, 2s, 3s, then you'll still run
>> afoul of the limit.
> Send a PING along with the RST_STREAM for the first request to be
> cancelled, and the ping response confirms that all three requests have
> arrived at the server. We can then skip sending a PING when cancelling
> the remaining requests.
>
> On Mon, Dec 2, 2024 at 2:57 PM Eric Anderson <ej...@google.com> wrote:
>
>> On Mon, Dec 2, 2024 at 2:19 PM 'Damien Neil' via grpc.io
>> <grpc-io@googlegroups.com> wrote:
>>
>>> I learned of this in https://go.dev/issue/70575, which is an issue
>>> filed against Go's HTTP/2 client, caused by a new health check we'd
>>> added: when a request times out or is canceled, we send a RST_STREAM
>>> frame for it. Servers don't respond to RST_STREAM, so we bundle the
>>> RST_STREAM with a PING frame to confirm that the server is still alive
>>> and responsive. In the event many requests are canceled at once, we
>>> send only one PING for the batch.
>>
>> Our keepalive does something similar, but is time-based. If it has been
>> X amount of time since the last receipt, then a PING checking the
>> connection is fair. The problem is only the "aggressive" PING rate by
>> the client. The client is doing exactly what the server was wanting to
>> prevent: "overzealous" connection checking. I do think it is more
>> appropriate to base it off a connection-level time instead of a
>> per-request time, although you probably don't have a connection-level
>> time to auto-tune to, whereas you do get feedback from requests timing
>> out.
>>
>> I'm wary of tying keepalive checks to resets/deadlines, as those are
>> load-shedding operations and people can have aggressive deadlines or
>> cancel aggressively in the normal course of operation. In addition,
>> TCP_USER_TIMEOUT with the RST_STREAM gets you a lot of the same value
>> without requiring additional ACK packets.
>>
>> Note that I do think the 5 minutes is too large, but that's all I was
>> able to get agreement for. Compared to 2 hours it is short...
>> I really wanted a bit shy of 1 minute, as 1 minute is the magic
>> inactivity timeout for many home NATs and some cloud LBs.
>>
>>> I think that gRPC servers should reset the ping strike count when they
>>> *receive* a HEADERS or DATA frame.
>>
>> I'm biased against the idea, as that's the rough behavior of a certain
>> server, and it was nothing but useless and a pain. HEADERS and DATA
>> really have nothing to do with monitoring the connection, so it seems
>> strange to let the client choose when to reset the counter. For BDP
>> monitoring, we need it to be reset when the server *sends* DATA, so that
>> PINGs can be used to adjust the client's receive window size. And I know
>> of an implementation that sent unnecessary frames just to reset the
>> counter so it could send PINGs.
>>
>> I question if that gets you what you need. If you start three requests
>> at the same time with timeouts of 1s, 2s, 3s, then you'll still run
>> afoul of the limit.