[ https://issues.apache.org/jira/browse/HIVE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225082#comment-14225082 ]

Marcelo Vanzin commented on HIVE-8956:
--------------------------------------

This is ok if it unblocks something right now. For the code, I'd suggest using 
{{System.nanoTime()}} to calculate durations, since it's monotonic. And use 
{{long}} instead of {{int}}.
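
Just to illustrate what I mean (a hypothetical sketch, not the patch itself;
the class and field names here are placeholders):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: compute the wait duration from the monotonic clock
// (System.nanoTime()) using longs throughout, instead of currentTimeMillis/int.
class SubmissionTimeout {
  private volatile boolean submitted;          // set when JobSubmitted arrives

  void markSubmitted() { submitted = true; }

  void await(long timeoutMs) throws TimeoutException, InterruptedException {
    long start = System.nanoTime();
    while (!submitted) {
      long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsedMs >= timeoutMs) {
        throw new TimeoutException("No JobSubmitted message within " + timeoutMs + " ms");
      }
      Thread.sleep(100);                       // simple polling; wait/notify would do as well
    }
  }
}
{code}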

But I think a better approach is needed here. Currently the {{JobSubmitted}} 
message seems to only be sent when you use Spark's async APIs to submit a Spark 
job. A remote client job that does not use those APIs would never generate that 
message. Also, the backend uses a thread pool to execute jobs - so if you're 
queueing up multiple jobs, you may hit this timeout.

I think we need more fine-grained, remote-client-level events for tracking job 
progress. For example, adding {{JobReceived}} and {{JobStarted}} messages would 
be a good start ({{JobResult}} already covers the "job finished" case). I think 
these two extra messages should be enough to cover the problems described in 
this bug.
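
Roughly what I have in mind, as a sketch only (the class names and fields are a
proposal, not existing classes, and where they hook into the protocol is open):

{code:java}
import java.io.Serializable;

// Proposed messages (sketch): the remote context would send JobReceived when a
// job lands in its queue and JobStarted when a worker thread actually begins
// executing it; JobResult already signals completion or failure.
class JobReceived implements Serializable {
  final String jobId;
  JobReceived(String jobId) { this.jobId = jobId; }
}

class JobStarted implements Serializable {
  final String jobId;
  JobStarted(String jobId) { this.jobId = jobId; }
}
{code}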

> Hive hangs while some error/exception happens beyond job execution[Spark 
> Branch]
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-8956
>                 URL: https://issues.apache.org/jira/browse/HIVE-8956
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Rui Li
>              Labels: Spark-M3
>         Attachments: HIVE-8956.1-spark.patch
>
>
> The remote Spark client communicates with the remote Spark context 
> asynchronously. If an error/exception is thrown during job execution in the 
> remote Spark context, it is wrapped and sent back to the remote Spark client; 
> but if an error/exception is thrown outside job execution, e.g. job 
> serialization fails, the remote Spark client never learns what is going on in 
> the remote Spark context and hangs.
> Setting a timeout on the remote Spark client side may not be a great idea, as 
> we are not sure how long the query will run on the Spark cluster. We need a 
> way to check whether the job has failed (over its whole life cycle) in the 
> remote Spark context.



