[ 
https://issues.apache.org/jira/browse/HIVE-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607431#comment-16607431
 ] 

Sahil Takiar commented on HIVE-20506:
-------------------------------------

The general idea makes sense to me. To confirm my understanding, this change 
will essentially do the following:
* Parse the {{spark-submit}} logs and look for the YARN application id
* Create a {{YarnClient}} and check the state of the YARN app
* If the app is in the {{ACCEPTED}} state (which means it has been acknowledged 
by YARN, but hasn't actually been started yet), extend the timeout until it 
transitions out of that state

Is that correct?
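The first step above (scanning the {{spark-submit}} logs for the YARN application id) could be sketched roughly as follows; the class and method names here are hypothetical, not from the patch, and assume the standard {{application_<clusterTimestamp>_<sequence>}} id format that {{yarn.Client}} logs on submission:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scan a line of spark-submit output for a YARN
// application id of the form application_<clusterTimestamp>_<sequence>.
public class SparkSubmitLogParser {

    private static final Pattern APP_ID =
        Pattern.compile("(application_\\d+_\\d+)");

    // Returns the first application id found on the line, if any.
    public static Optional<String> findApplicationId(String logLine) {
        Matcher m = APP_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Once the id is extracted, it could be handed to a {{YarnClient}} (via {{getApplicationReport}}) to check whether the app is still in {{ACCEPTED}}.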

If that's the case, then I just have a few comments:
* Rather than extending the timeout, why not just create two separate ones? One 
timeout for launching {{bin/spark-submit}} --> app = ACCEPTED and another from 
app = RUNNING --> connection established.
** We probably don't want to change the meaning of the current timeout for 
backwards compatibility, so maybe we could deprecate the existing one and 
replace it with two new ones?
* Is there any way to avoid creating a {{YarnClient}}? I guess this is 
mitigated slightly by the fact that the client is only created if the timeout 
is triggered.
** Just concerned about the overhead of creating a {{YarnClient}} + would this 
work on a secure cluster?
** {{bin/spark-submit}} should print out something like {{Application report 
for ... (state: ACCEPTED)}}; perhaps we could parse the state from the logs?
* Can we move all the changes in {{RpcServer}} to a separate class? That class 
is really meant to act as a generic RPC framework that is relatively 
independent of the HoS logic
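The log-parsing alternative suggested above could look something like this; again a hypothetical sketch, assuming the {{Application report for ... (state: XXX)}} lines that {{yarn.Client}} prints while polling, so no {{YarnClient}} would be needed:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pull the reported YARN state out of the
// "Application report for <appId> (state: <STATE>)" lines emitted
// by spark-submit, avoiding the cost of creating a YarnClient.
public class YarnStateLogParser {

    private static final Pattern STATE =
        Pattern.compile("Application report for \\S+ \\(state: (\\w+)\\)");

    // Returns the state token (e.g. "ACCEPTED", "RUNNING") if present.
    public static Optional<String> parseState(String logLine) {
        Matcher m = STATE.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

As long as the parsed state stays {{ACCEPTED}}, the launch timeout could keep being extended without ever touching the YARN RM API directly.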

> HOS times out when cluster is full while Hive-on-MR waits
> ---------------------------------------------------------
>
>                 Key: HIVE-20506
>                 URL: https://issues.apache.org/jira/browse/HIVE-20506
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>            Priority: Major
>         Attachments: HIVE-20506-CDH5.14.2.patch, HIVE-20506.1.patch, Screen 
> Shot 2018-09-07 at 8.10.37 AM.png
>
>
> My understanding is as follows:
> Hive-on-MR when the cluster is full will wait for resources to be available 
> before submitting a job. This is because the hadoop jar command is the 
> primary mechanism Hive uses to know if a job is complete or failed.
>  
> Hive-on-Spark will timeout after {{SPARK_RPC_CLIENT_CONNECT_TIMEOUT}} because 
> the RPC client in the AppMaster doesn't connect back to the RPC Server in 
> HS2. 
> This is a behavior difference it'd be great to close.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
