[ 
https://issues.apache.org/jira/browse/KUDU-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851227#comment-17851227
 ] 

Wenzhe Zhou commented on KUDU-3582:
-----------------------------------

Copied from the Slack channel:

Alexey Serbin:  I took a quick look at 
https://issues.apache.org/jira/browse/KUDU-3582 and it's not clear to me what 
the bug is about.  Did you expect to get complete and consistent data from a 
cancelled RPC?
Wenzhe Zhou:  We don't need complete and consistent data for a cancelled RPC, 
but we expect the sidecar API to return an error if the data is incomplete.
Alexey Serbin:  Ah, I see.  Is that something about getting non-OK from 
RpcSidecar::ParseSidecars() or CallResponse::GetSidecar()?  Or is the idea to 
introduce some extra API to report on inconsistent sidecar data?
Wenzhe Zhou:  RpcSidecar::ParseSidecars() is not directly called by Impala 
code. Impala calls RpcContext.GetSidecar(). It would be good to return an 
error from RpcContext.GetSidecar() when the sidecar's data is inconsistent.
Alexey Serbin:  OK, I see.  Do you know for sure that the inconsistency in the 
sidecar's data was due to the RPC being cancelled, or is it not yet clear what 
actually happened?
Wenzhe Zhou: It's not yet clear what actually happened. But lots of RPCs were 
being cancelled due to long delays. I guess that might be the cause.
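
To make the request in the exchange above concrete, here is a minimal sketch of a sidecar lookup that returns an error instead of a short slice. It is not Kudu's implementation or API: `SketchStatus` and `GetSidecarChecked` are hypothetical names, and the layout (a list of start offsets into one payload, plus a total length announced by the sender) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a status type; this is not kudu::Status.
struct SketchStatus {
  bool ok;
  std::string message;
};

// Hypothetical helper: returns the idx-th sidecar as a view into `payload`,
// or an error if the received bytes do not cover the whole slice.
// Assumed layout: offsets[i] is where sidecar i starts; sidecar i ends where
// sidecar i+1 starts, and the last one ends at `declared_len`, the total
// payload length announced by the sender.
SketchStatus GetSidecarChecked(std::string_view payload,
                               const std::vector<uint32_t>& offsets,
                               size_t declared_len,
                               size_t idx,
                               std::string_view* out) {
  assert(out != nullptr);
  if (idx >= offsets.size()) {
    return {false, "sidecar index out of range"};
  }
  // Fewer bytes arrived than the sender declared: report it to the caller.
  if (payload.size() < declared_len) {
    return {false, "payload shorter than declared length"};
  }
  const size_t begin = offsets[idx];
  const size_t end =
      (idx + 1 < offsets.size()) ? offsets[idx + 1] : declared_len;
  if (begin > end || end > declared_len) {
    return {false, "sidecar offsets are inconsistent"};
  }
  *out = payload.substr(begin, end - begin);
  return {true, ""};
}
```

With a check of this kind, a caller would see a non-OK result for a truncated payload and would not go on to deserialize a partial buffer.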


> Incomplete sidecar data returned by RpcContext::GetInboundSidecar()
> -------------------------------------------------------------------
>
>                 Key: KUDU-3582
>                 URL: https://issues.apache.org/jira/browse/KUDU-3582
>             Project: Kudu
>          Issue Type: Bug
>          Components: rpc
>            Reporter: Wenzhe Zhou
>            Priority: Major
>
> An Impala executor calls the KRPC sidecar API 
> RpcContext::GetInboundSidecar() to read a serialized Thrift object from 
> KRPC, then deserializes it. (See GetSidecar() at
> https://github.com/apache/impala/blob/master/be/src/rpc/sidecar-util.h#L60-L67)
> In a customer-reported case, extra workloads were added to an Impala cluster, 
> which caused long delays for KRPCs between Impala daemons. The long delays 
> caused KRPCs to be cancelled, and hence Impala query failures.
> {code:java}
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.383988 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b751966300000632): took 59m57s. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384006 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384631 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b751966300000735): took 59m57s. Error: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.384668 1182314 kudu-status-util.h:55] EndDataStream() to 10.34.163.32:27000 failed: Aborted: EndDataStream RPC to 10.34.163.32:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420662 1182313 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.36.21:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033a): took 1h. Error: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420683 1182313 kudu-status-util.h:55] EndDataStream() to 10.243.36.21:27000 failed: Aborted: EndDataStream RPC to 10.243.36.21:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420779 1182322 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.243.38.160:27000 (fragment_instance_id=9940332ce09828fd:b75196630000033b): took 1h. Error: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.420799 1182322 kudu-status-util.h:55] EndDataStream() to 10.243.38.160:27000 failed: Aborted: EndDataStream RPC to 10.243.38.160:27000 is cancelled in state ON_OUTBOUND_QUEUE
> impalad.bdnyr019x21t1.nam.nsroot.net.impala.log.INFO.20240520-210047.1181040:I0521 05:40:09.421937 1182314 krpc-data-stream-sender.cc:405] Slow EndDataStream RPC to 10.34.163.32:27000 (fragment_instance_id=9940332ce09828fd:b75196630000043e): took 1h. Error: Aborted:
> {code}
> Then the extra workloads were removed and the Impala cluster was restarted. 
> During the restart, lots of Impala daemons crashed. The stack traces from 
> the core files and the log messages show that the Impala daemons received 
> incomplete data from a KRPC sidecar. The incomplete data did not cause a 
> Thrift deserialization failure, so the valid but incomplete data was not 
> detected and handled properly.
> See Impala Jira: IMPALA-13107. The issue could not be reproduced locally. 
> A quick fix on the Impala side was merged to mitigate the crashes. This 
> issue needs further investigation in the KRPC internals.
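
The description notes that incomplete data can still deserialize without error. The sketch below illustrates that general failure mode with a toy wire format. It is an assumption made for illustration and says nothing about how Thrift or the merged Impala fix actually behave. `ParseRecords` and `ParseAndValidate` are hypothetical names, and the out-of-band expected count is likewise an assumed mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy wire format (not Thrift): a sequence of 4-byte host-endian integers
// with no record count and no terminator. Parsing fails only if the buffer
// ends in the middle of a record, so a buffer cut at a record boundary
// parses "successfully" with fewer records.
bool ParseRecords(const uint8_t* buf, size_t len, std::vector<uint32_t>* out) {
  assert(out != nullptr);
  if (len % sizeof(uint32_t) != 0) return false;
  out->clear();
  for (size_t pos = 0; pos < len; pos += sizeof(uint32_t)) {
    uint32_t value;
    std::memcpy(&value, buf + pos, sizeof(value));
    out->push_back(value);
  }
  return true;
}

// Hypothetical receiver-side guard: the sender communicates the expected
// record count separately, and the receiver rejects a payload that parses
// cleanly but contains a different number of records.
bool ParseAndValidate(const uint8_t* buf, size_t len, size_t expected_count,
                      std::vector<uint32_t>* out) {
  return ParseRecords(buf, len, out) && out->size() == expected_count;
}
```

In this toy setup, a 12-byte payload truncated to 8 bytes still passes `ParseRecords`, while `ParseAndValidate` with an expected count of 3 rejects it. A parser that succeeds is therefore not enough to show that the data was complete.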



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
