Wenzhe Zhou created KUDU-3582:
---------------------------------

             Summary: Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
                 Key: KUDU-3582
                 URL: https://issues.apache.org/jira/browse/KUDU-3582
             Project: Kudu
          Issue Type: Bug
          Components: rpc
            Reporter: Wenzhe Zhou


The Impala executor calls the KRPC sidecar API RpcContext::GetInboundSidecar() to 
read a serialized Thrift object from KRPC, then deserializes it. (See 
GetSidecar() at
https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)

In a customer-reported case, extra workloads were added to an Impala cluster, 
which caused long delays for KRPCs between Impala daemons. The long delays 
caused KRPCs to be cancelled, and hence Impala queries failed.
{code:java}
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC 
to 10.243.38.160:27000 
(fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: 
Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state 
ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 
10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 
is cancelled in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC 
to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): 
took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is 
cancelled in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 
10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is 
cancelled in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC 
to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): 
took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled 
in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 
10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is 
cancelled in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC 
to 10.243.38.160:27000 
(fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: 
Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state 
ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 
10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 
is cancelled in state ON_OUTBOUND_QUEUE 
impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521
 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC 
to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): 
took 1h. Error: Aborted:
{code}

The extra workloads were then removed and the Impala cluster was restarted. 
During the restart, many Impala daemons crashed. The stack traces from the 
core files and the log messages show that the Impala daemons received 
incomplete data from a KRPC sidecar. The incomplete data did not cause a Thrift 
deserialization failure, so data that deserialized successfully but was 
incomplete was not detected and handled properly.
See Impala Jira: IMPALA-13107. The issue could not be reproduced locally. 
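
To illustrate the failure mode described above, here is a minimal, self-contained sketch of how a truncated buffer can decode "successfully" and how a post-decode completeness check could catch it. This is not Kudu, Impala, or Thrift code: the wire format (a 32-bit count followed by 32-bit items) and the names DecodedMsg, Encode, LenientDecode and IsComplete are all hypothetical, chosen only to show the idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical toy message: a declared item count followed by the items.
struct DecodedMsg {
  uint32_t declared_count = 0;
  std::vector<uint32_t> items;
};

// Serializes the items into the toy format (host byte order, for brevity).
std::vector<uint8_t> Encode(const std::vector<uint32_t>& items) {
  std::vector<uint8_t> buf(sizeof(uint32_t) * (items.size() + 1));
  uint32_t count = static_cast<uint32_t>(items.size());
  std::memcpy(buf.data(), &count, sizeof(count));
  for (size_t i = 0; i < items.size(); ++i) {
    std::memcpy(buf.data() + sizeof(uint32_t) * (i + 1), &items[i],
                sizeof(uint32_t));
  }
  return buf;
}

// Lenient decoder (assumed behavior for illustration): it stops quietly when
// the input runs out and still reports success, so a truncated buffer yields
// a well-formed but incomplete message.
bool LenientDecode(const uint8_t* buf, size_t len, DecodedMsg* out,
                   size_t* consumed) {
  if (len < sizeof(uint32_t)) return false;
  std::memcpy(&out->declared_count, buf, sizeof(uint32_t));
  size_t pos = sizeof(uint32_t);
  out->items.clear();
  for (uint32_t i = 0;
       i < out->declared_count && pos + sizeof(uint32_t) <= len; ++i) {
    uint32_t value = 0;
    std::memcpy(&value, buf + pos, sizeof(value));
    out->items.push_back(value);
    pos += sizeof(value);
  }
  *consumed = pos;
  return true;
}

// Post-decode validation: the message is accepted only if every declared
// item is present and the decoder consumed the whole buffer.
bool IsComplete(const DecodedMsg& msg, size_t consumed, size_t len) {
  return msg.items.size() == msg.declared_count && consumed == len;
}
```

With a buffer cut short after the first item, LenientDecode() still returns true, but IsComplete() returns false, so the caller can reject the message instead of using partial data. Whether a comparable check is feasible for the real sidecar payload depends on the actual message layout, which this sketch does not model.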

A quick fix was merged on the Impala side to mitigate the crashes. This issue 
still needs further investigation in the KRPC internals. 






--
This message was sent by Atlassian Jira
(v8.20.10#820010)
