[ https://issues.apache.org/jira/browse/HIVE-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225082#comment-14225082 ]
Marcelo Vanzin commented on HIVE-8956:
--------------------------------------

This is ok if it unblocks something right now. For the code, I'd suggest using {{System.nanoTime()}} to calculate durations, since it's monotonic, and using {{long}} instead of {{int}} (a sketch of that pattern follows the quoted issue below).

But I think a better approach is needed here. Currently the {{JobSubmitted}} message seems to be sent only when you use Spark's async APIs to submit a Spark job; a remote client job that does not use those APIs would never generate that message. Also, the backend uses a thread pool to execute jobs, so if you're queueing up multiple jobs, you may hit this timeout.

I think we need more fine-grained remote client-level events for tracking job progress. E.g., adding {{JobReceived}} and {{JobStarted}} messages would be a good start ({{JobResult}} already covers the "job finished" case). I think these two extra messages should be enough to cover the problems described in this bug (see the second sketch below).

> Hive hangs while some error/exception happens beyond job execution [Spark Branch]
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-8956
>                 URL: https://issues.apache.org/jira/browse/HIVE-8956
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Rui Li
>              Labels: Spark-M3
>         Attachments: HIVE-8956.1-spark.patch
>
> The remote Spark client communicates with the remote Spark context asynchronously. If an error/exception is thrown during job execution in the remote Spark context, it is wrapped and sent back to the remote Spark client. But if an error/exception is thrown outside job execution, for example when job serialization fails, the remote Spark client never learns what happened in the remote Spark context, and it hangs.
> Setting a timeout on the remote Spark client side may not be a great idea, as we are not sure how long a query will run on the Spark cluster. We need to find a way to check whether a job has failed (over its whole life cycle) in the remote Spark context.
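For reference, a minimal Java sketch of the duration-measurement pattern suggested in the comment above. The polling loop, the timeout value, and the {{jobSubmitted()}} check are hypothetical placeholders for illustration, not code from the actual patch:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ElapsedTimeSketch {
  // Hypothetical timeout value; not taken from the patch.
  static final long TIMEOUT_MILLIS = 30_000;

  static void waitForJobSubmitted() throws TimeoutException, InterruptedException {
    // System.nanoTime() is monotonic: it is unaffected by wall-clock
    // adjustments, unlike System.currentTimeMillis(), so it is the safe
    // choice for measuring durations and enforcing timeouts.
    long start = System.nanoTime();
    while (!jobSubmitted()) {
      // Keep the arithmetic in long: nanosecond deltas overflow int
      // after roughly 2.1 seconds.
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsedMillis > TIMEOUT_MILLIS) {
        throw new TimeoutException("No JobSubmitted message after " + TIMEOUT_MILLIS + " ms");
      }
      Thread.sleep(100);
    }
  }

  // Placeholder for checking whether the JobSubmitted message has arrived.
  static boolean jobSubmitted() { return false; }
}
{code}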
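And a rough sketch of what the proposed lifecycle messages could look like. The class shapes and field names are assumptions for illustration; the actual remote Spark client protocol may define its messages differently:

{code:java}
import java.io.Serializable;

// Sent as soon as the backend receives the job from the wire,
// before the job waits in the executor thread pool. This lets the
// client distinguish "never arrived" from "queued behind other jobs".
class JobReceived implements Serializable {
  final String jobId;
  JobReceived(String jobId) { this.jobId = jobId; }
}

// Sent when an executor thread actually begins running the job,
// so queueing delay in the thread pool is not mistaken for a hang.
class JobStarted implements Serializable {
  final String jobId;
  JobStarted(String jobId) { this.jobId = jobId; }
}

// JobResult (already in the protocol) covers the "job finished" case,
// carrying either the result or the wrapped error/exception.
{code}

With these events, failures that happen before job execution (e.g., a serialization error) surface as a missing {{JobReceived}}/{{JobStarted}} within a short, bounded window, instead of requiring a timeout sized for the whole query.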