[ https://issues.apache.org/jira/browse/KUDU-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenzhe Zhou updated KUDU-3582:
------------------------------
    Description:
Impala executors call the KRPC sidecar API RpcContext::GetInboundSidecar() to read a serialized Thrift object from KRPC and then deserialize it. (See GetSidecar() at https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)

In a customer-reported case, extra workloads were added to an Impala cluster, which caused long delays for KRPCs between Impala daemons. The long delays caused KRPCs to be cancelled, and hence Impala query failures.
{code:java}
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): took 1h. Error: Aborted:
{code}
Then the extra workloads were removed and the Impala cluster was restarted. During the restart, many Impala daemons crashed.
The stack traces from the core files and the log messages show that the Impala daemons received incomplete data from the KRPC sidecar. The incomplete data did not cause a Thrift deserialization failure, so the valid but incomplete data was not detected and handled properly.

See the Impala Jira IMPALA-13107. The issue could not be reproduced locally. A quick fix on the Impala side was merged to mitigate the crashes. The issue needs further investigation inside KRPC.

> Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
> -------------------------------------------------------------------
>
>                 Key: KUDU-3582
>                 URL: https://issues.apache.org/jira/browse/KUDU-3582
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>            Reporter: Wenzhe Zhou
>            Priority: Major
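
The description notes that the incomplete sidecar data did not cause a deserialization failure. As a rough illustration of how that can happen in general, here is a minimal C++ sketch. It does not use KRPC or Thrift: the toy wire format (a sequence of length-prefixed records) and the names ParseRecords, IsComplete, and MakeSampleBuffer are hypothetical, invented only for this example, and the check shown is not the fix that was merged for IMPALA-13107.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Toy wire format (an assumption for illustration, unrelated to Thrift):
// each record is one length byte followed by that many payload bytes.
// Returns std::nullopt only when a record is cut off in the middle.
std::optional<std::vector<std::string>> ParseRecords(
    const std::vector<uint8_t>& buf) {
  std::vector<std::string> records;
  size_t pos = 0;
  while (pos < buf.size()) {
    const size_t len = buf[pos++];
    if (len > buf.size() - pos) return std::nullopt;  // truncated mid-record
    records.emplace_back(buf.begin() + pos, buf.begin() + pos + len);
    pos += len;
  }
  return records;
}

// Hypothetical guard: parsing succeeded, but did we receive everything the
// sender intended? `expected_count` would have to be known independently of
// the buffer itself, for example from a field in the request.
bool IsComplete(const std::optional<std::vector<std::string>>& parsed,
                size_t expected_count) {
  return parsed.has_value() && parsed->size() == expected_count;
}

// Builds a full buffer holding the two records "ab" and "cde".
std::vector<uint8_t> MakeSampleBuffer() {
  return {2, 'a', 'b', 3, 'c', 'd', 'e'};
}
```

In this sketch, a buffer cut exactly at a record boundary (the first three bytes of MakeSampleBuffer()) still parses without error and yields one record instead of two, so only the separate IsComplete check catches the loss. A buffer cut in the middle of a record is rejected by the parser itself.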