[ https://issues.apache.org/jira/browse/KUDU-3582 ]
Michael Smith deleted comment on KUDU-3582:
-------------------------------------

    was (Author: JIRAUSER288956):
I think this is probably a case where we tried to use sidecars after calling DiscardTransfer - https://github.com/apache/kudu/blob/1.17.0/src/kudu/rpc/inbound_call.h#L218-L220 - which has some big caveats.

> Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
> -------------------------------------------------------------------
>
>                 Key: KUDU-3582
>                 URL: https://issues.apache.org/jira/browse/KUDU-3582
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>            Reporter: Wenzhe Zhou
>            Priority: Major
>
> An Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar()
> to read a serialized Thrift object from KRPC, then deserializes it. (See
> GetSidecar() at
> https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)
> In a customer-reported case, extra workloads were added to an Impala cluster,
> which caused long delays for KRPCs between Impala daemons. The long delays
> caused KRPCs to be cancelled, and hence Impala query failures.
> {code:java}
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): took 1h. Error: Aborted:
> {code}
> The extra workloads were then removed and the Impala cluster was restarted.
> During the restart, many Impala daemons crashed. The stack traces from the
> core files and the log messages show that the Impala daemons received
> incomplete data from the KRPC sidecar. The incomplete data did not cause a
> Thrift deserialization failure, so the valid-looking but incomplete data was
> not detected and handled properly.
> See the Impala Jira: IMPALA-13107. The issue could not be reproduced locally.
> A quick fix on the Impala side was merged to mitigate the crashes. The issue
> still needs further investigation in the KRPC internals.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
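
The description above notes that truncated sidecar data still deserialized without error, so the receiver had no signal that anything was missing. One generic way a receiver could detect this is to have the sender declare the payload length and to reject any buffer whose size disagrees. The following is a minimal sketch of that idea only. The framing format and the names FramePayload and ReadFramedPayload are hypothetical; they are not part of the Kudu RPC API, and this is not the fix that was merged in Impala.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical framing (an assumption, not Kudu's wire format):
// a 4-byte little-endian payload length followed by the payload bytes.
std::vector<uint8_t> FramePayload(const std::string& payload) {
  const uint32_t n = static_cast<uint32_t>(payload.size());
  std::vector<uint8_t> out;
  out.reserve(4 + payload.size());
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<uint8_t>((n >> shift) & 0xFF));
  }
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}

// Returns the payload only when the buffer holds exactly the declared number
// of bytes. A short (or over-long) buffer yields std::nullopt instead of being
// handed to a deserializer that might accept it silently.
std::optional<std::string> ReadFramedPayload(const uint8_t* data, size_t len) {
  if (data == nullptr || len < 4) return std::nullopt;
  const uint32_t declared = static_cast<uint32_t>(data[0]) |
                            (static_cast<uint32_t>(data[1]) << 8) |
                            (static_cast<uint32_t>(data[2]) << 16) |
                            (static_cast<uint32_t>(data[3]) << 24);
  if (len - 4 != declared) return std::nullopt;
  return std::string(reinterpret_cast<const char*>(data + 4), declared);
}
```

With a check of this kind, a buffer that arrives short is rejected before deserialization rather than producing a partially populated object.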