Yes, stopSession is the unary request that is failing. I can't really post the full logs of B and C here, as they contain some sensitive data.
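The DEBUG entries below are the HTTP/2 frame logs emitted by the shaded Netty handlers inside grpc-java. With a plain java.util.logging backend they can be turned on roughly like this (just an illustration -- our real setup uses a different logging backend, so the actual configuration differs):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public final class GrpcFrameLogging {
    // Hold a strong reference so the level/handler configuration is not garbage-collected.
    private static final Logger GRPC_NETTY =
            Logger.getLogger("io.grpc.netty.shaded.io.grpc.netty");

    public static void enable() {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL);
        GRPC_NETTY.setLevel(Level.FINE);   // the NettyServerHandler/NettyClientHandler frame logs are emitted at debug/FINE
        GRPC_NETTY.addHandler(handler);
    }

    private GrpcFrameLogging() {
    }
}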
The occurrence I am looking at has the following in B log at the time the unary request is made - this is the request coming in from A:

2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND SETTINGS: ack=false settings={MAX_CONCURRENT_STREAMS=1000, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND SETTINGS: ack=false settings=[id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842]
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND SETTINGS: ack=true
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND HEADERS: streamId=1 headers=GrpcHttp2RequestHeaders[:path: /ra.hub.grpc.api.v1.SessionService/stopSession, :authority: 10.178.148.210:10004, :method: POST, :scheme: http, te: trailers, grpc-accept-encoding: identity,deflate,gzip, accept-encoding: identity, grpc-timeout: 59996m, user-agent: grpc-node-js/1.9.12, content-type: application/grpc] padding=0 endStream=false
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND DATA: streamId=1 padding=0 endStream=false length=63 bytes=000000003a0a2434623235373935342d626161632d336538332d396132612d65396536323763343362393612120a0a6875622d6575732d3261120455532d32
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND PING: ack=false bytes=1234
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND DATA: streamId=1 padding=0 endStream=true length=0 bytes=
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND SETTINGS: ack=true
2024-11-19 07:55:05.091 [DEBUG] (grpc-default-executor-284) org.dett.common.grpc.communication.MessageLoggingServerInterceptor: - {"method":"ra.hub.grpc.api.v1.SessionService/stopSession","logId":"ad24cfeb-c99b-43f1-9891-634c08e1a7af","message-direction":"in","message":"{\n \"uuid\": \"4b257954-baac-3e83-9a2a-e9e627c43b96\",\n \"hubId\": {\n \"name\": \"C2a\",\n \"cluster\": \"US-2\"\n }\n}","address":"/10.178.148.95:45842"}
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - Processing stopSession request #31, uuid: "4b257954-baac-3e83-9a2a-e9e627c43b96" hubId { name: "C2a" cluster: "US-2" }
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.communication.SystemChannelProvider: - Provided channel: C2a/US-2, keep alive time: 20000, keep alive timeout: 2000
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - Stream stub created for C2a/US-2; channel state: READY
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND PING: ack=true bytes=1234

Then there are a number of INBOUND DATA messages from C's IP address on two different stream IDs, as the streaming-call updates come in, but no outgoing messages to C are logged. Later B says:

2024-11-19 07:55:25.085 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - onError() invoked for stopSession request #31: io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled

On the C side, I can see a lot of outgoing messages on the two streams mentioned above being sent to B. All of the OUTBOUND DATA frames carry the same connection id and only those two stream IDs; nothing else. I guess I could post that traffic here, but it's rather trivial. This traffic from C to B goes on for the whole 20 seconds that the cancelled unary RPC is pending.
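For context, this is roughly how B obtains the channel and issues the unary call to C (a simplified sketch, not our actual code: SessionServiceGrpc would be the class grpc-java generates for ra.hub.grpc.api.v1.SessionService, while StopSessionRequest/StopSessionResponse and the provider call are approximate names):

import io.grpc.Context;
import io.grpc.Deadline;
import io.grpc.ManagedChannel;
import io.grpc.stub.StreamObserver;

// The channel is created once per hub by SystemChannelProvider and cached ("Provided channel: C2a/US-2").
ManagedChannel channel = systemChannelProvider.provide(hubId);   // hypothetical method name

// "Stream stub created for C2a/US-2; channel state: READY"
SessionServiceGrpc.SessionServiceStub stub = SessionServiceGrpc.newStub(channel);

// Propagate the deadline that arrived with A's request (grpc-timeout was ~60 s).
Deadline deadline = Context.current().getDeadline();

stub.withDeadline(deadline).stopSession(request, new StreamObserver<StopSessionResponse>() {
    @Override public void onNext(StopSessionResponse response) { /* pass the result back to A */ }
    @Override public void onError(Throwable t) {
        // After ~20 s this fires with "CANCELLED: RPC cancelled", and nothing was ever sent on the wire.
    }
    @Override public void onCompleted() { }
});

Nothing in this path creates a new connection; the cached channel reports state READY, so it is reused as-is.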
Is there anything specific we are looking for in the debug logs?

Cheers
Jiri

On Tuesday, December 3, 2024 at 8:03:21 PM UTC+1, Yuri Golobokov wrote:

> Is the /ra.hub.grpc.api.v1.SessionService/stopSession the unary call in question?
> From the log it looks completely normal and successfully processed.
> It would be nice to have debug logs from C _and_ B from the time the unary call is received on B until it is canceled.
>
> re: keepAliveWithoutCalls -- the streaming calls you have between B-C are the in-flight calls, so you should always have some calls; that's why changing keepAliveWithoutCalls does not make a difference in your case.
>
> What is your client LB policy?
>
> On Tue, Dec 3, 2024 at 1:38 AM Jiří Krutil <kruti...@gmail.com> wrote:
>
>> A quick update: turns out enabling *keepAliveWithoutCalls* for component B does not resolve the issue.
>>
>> I would like to understand what kind of guarantees gRPC actually provides here. Is RPC call delivery on a channel on a best-effort basis?
>> If that is the case, it sounds like the only way to resolve this is to add application-level health checking?
>>
>> Cheers
>> Jiri
>>
>> On Wednesday, November 27, 2024 at 1:44:05 PM UTC+1, Jiří Krutil wrote:
>>
>>> I cannot confirm the exact number of open streaming calls on C during the last occurrence, but my assumption is just a few.
>>>
>>> I have tried processing one of the unary calls now with debug logging enabled on C and this shows:
>>>
>>> 2024-11-27 13:17:37.088 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND SETTINGS: ack=false settings={MAX_CONCURRENT_STREAMS=2147483647, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
>>> 2024-11-27 13:17:37.089 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
>>> 2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
>>> 2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND SETTINGS: ack=true
>>> 2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
>>> 2024-11-27 13:17:37.091 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND SETTINGS: ack=true
>>> 2024-11-27 13:17:37.092 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND HEADERS: streamId=3 headers=GrpcHttp2RequestHeaders[:path: /ra.hub.grpc.api.v1.SessionService/stopSession, :authority: 10.40.80.78, :method: POST, :scheme: http, te: trailers, content-type: application/grpc, user-agent: grpc-java-netty/1.66.0, grpc-accept-encoding: gzip, grpc-timeout: 59969004u] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
>>> 2024-11-27 13:17:37.110 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND DATA: streamId=3 padding=0 endStream=true length=63 bytes=000000003a0a2435373239663665342d396633662d343738382d383132372d32343633636338663339666412120a0a6875622d6575732d3161120455532d31
>>> 2024-11-27 13:17:37.110 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=false bytes=1234
>>> 2024-11-27 13:17:37.111 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=true bytes=1234
>>> 2024-11-27 13:17:37.112 [INFO] (grpc-default-executor-3) org.dett.router.grpc.RouterSessionGrpcService: - Processing stopSession request {uuid: "5729f6e4-9f3f-4788-8127-2463cc8f39fd" hubId { name: "hub-eus-1a" cluster: "US-1" }}
>>> 2024-11-27 13:17:37.115 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[:status: 200, content-type: application/grpc, grpc-encoding: identity, grpc-accept-encoding: gzip] padding=0 endStream=false
>>> 2024-11-27 13:17:37.115 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND DATA: streamId=3 padding=0 endStream=false length=5 bytes=0000000000
>>> 2024-11-27 13:17:37.116 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[grpc-status: 0] padding=0 endStream=true
>>> 2024-11-27 13:17:37.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=false bytes=1234
>>> 2024-11-27 13:17:37.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=true bytes=1234
>>> 2024-11-27 13:17:57.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=false bytes=1111
>>> 2024-11-27 13:17:57.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=true bytes=1111
>>>
>>> I think this indicates *max_concurrent_streams* for C as a server is unlimited.
>>>
>>> Maybe I have oversimplified in the original post. We actually have 2 instances of A, 2 instances of B and N instances of C.
>>> One observation was that at one time we had a stale channel B1>C2 and a stale channel B2>C4, while all other channels were working.
>>> In other words, the problem does not affect all channels from a given B instance or all channels to a given C instance. It seems to be channel-specific.
>>>
>>> Just to check one related idea I had -- is it possible that enabling *keepAliveWithoutCalls* on B could make this issue go away? The reasoning is that if the gRPC client on B sees incoming traffic from C, it may never probe the channel to C and thus won't detect the issue without this option.
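>>> The change I have in mind would look roughly like this (a simplified sketch assuming grpc-java's NettyChannelBuilder; the keep-alive values just mirror what SystemChannelProvider logs, and host/port stand in for C's address):
>>>
>>> import io.grpc.ManagedChannel;
>>> import io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder;
>>> import java.util.concurrent.TimeUnit;
>>>
>>> // Hypothetical helper on B for building the channel to a C instance.
>>> ManagedChannel buildChannelToC(String host, int port) {
>>>     return NettyChannelBuilder.forAddress(host, port)
>>>             .keepAliveTime(20, TimeUnit.SECONDS)      // "keep alive time: 20000" in the log
>>>             .keepAliveTimeout(2, TimeUnit.SECONDS)    // "keep alive timeout: 2000" in the log
>>>             .keepAliveWithoutCalls(true)              // the new bit: ping even with no outstanding RPCs
>>>             .usePlaintext()                           // the traffic above is :scheme: http
>>>             .build();
>>> }
>>>
>>> If we went this way, C's NettyServerBuilder would presumably also need permitKeepAliveTime(...)/permitKeepAliveWithoutCalls(true) so that the more frequent pings are not rejected with too_many_pings.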
>>> On Tuesday, November 26, 2024 at 8:53:01 PM UTC+1, Yuri Golobokov wrote:
>>>
>>>> How many server streaming calls are open to C at the time of the issue?
>>>> Does C limit the maximum number of concurrent streams? (you can see it in the debug logs for the SETTINGS frame coming from C)
>>>>
>>>> On Tuesday, November 26, 2024 at 7:22:16 AM UTC-8 Jiří Krutil wrote:
>>>>
>>>>> When running our Java application in MS Azure, we sometimes observe very strange behavior, which appears as if a long-lived gRPC channel was working only in one direction and was not delivering any RPC calls in the opposite direction.
>>>>>
>>>>> Our setup is that we have three apps connected via gRPC:
>>>>>
>>>>> A -> B -> C
>>>>>
>>>>> B usually has a long-lived server streaming gRPC request to C open, consuming updates from C. When the issue occurs, updates from C are still streaming to B, but no new unary requests made by B make it to C.
>>>>>
>>>>> The unary requests made by B are originated by A. B receives the request from A and sends a unary request to C with a deadline copied from the original request. After 20 seconds, B sees an "RPC cancelled" event, which I believe comes from A in response to some kind of timeout.
>>>>>
>>>>> The problem occurs randomly, and when it occurs, the channel never recovers.
>>>>>
>>>>> Debug logging seems to show that when B receives the request from A, it creates a new stub using an existing (cached) channel and attempts to send a request to C, but that request is actually never sent.
>>>>>
>>>>> If I make B forget the cached channel and create a new one, the unary request to C works fine.
>>>>>
>>>>> We have keepAlive enabled on these channels, so I am surprised that potential issues with the underlying connection are not detected by the keepAlive mechanism. Is it possible that, since traffic is steadily flowing in the direction from C to B, B never pings C to see whether communication in the opposite direction works as well?
>>>>>
>>>>> I suppose we could work around this by adding application-level health checking for every channel, but I thought this was already taken care of by gRPC.
>>>>>
>>>>> Any suggestions would be appreciated.
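>>>>> In case it clarifies what I mean by application-level health checking, I am thinking of something along these lines (a rough sketch using the standard grpc.health.v1 service from the grpc-services artifact; C does not expose it today, so the names here are illustrative):
>>>>>
>>>>> import io.grpc.ManagedChannel;
>>>>> import io.grpc.StatusRuntimeException;
>>>>> import io.grpc.health.v1.HealthCheckRequest;
>>>>> import io.grpc.health.v1.HealthGrpc;
>>>>> import java.util.concurrent.TimeUnit;
>>>>>
>>>>> final class ChannelHealthProbe {
>>>>>     // Called periodically for every cached channel; if the probe fails,
>>>>>     // B would drop the channel so the next request builds a fresh one.
>>>>>     boolean channelStillUsable(ManagedChannel channel) {
>>>>>         try {
>>>>>             HealthGrpc.newBlockingStub(channel)
>>>>>                     .withDeadlineAfter(5, TimeUnit.SECONDS)
>>>>>                     .check(HealthCheckRequest.getDefaultInstance());
>>>>>             return true;
>>>>>         } catch (StatusRuntimeException e) {
>>>>>             // A DEADLINE_EXCEEDED here would match the one-way channel symptom described above.
>>>>>             return false;
>>>>>         }
>>>>>     }
>>>>> }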