[ https://issues.apache.org/jira/browse/HIVE-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607431#comment-16607431 ]
Sahil Takiar commented on HIVE-20506:
-------------------------------------

The general idea makes sense to me. To confirm my understanding, this change will essentially do the following:

* Parse the {{spark-submit}} logs and look for the YARN application id
* Create a {{YarnClient}} and check the state of the YARN app
* If the app is in the {{ACCEPTED}} state (which means it has been acknowledged by YARN, but hasn't actually started yet), extend the timeout for as long as the app stays in that state, until it transitions out of it

Is that correct? If that's the case, then I just have a few comments:

* Rather than extending the timeout, why not just create two separate ones? One timeout for launching {{bin/spark-submit}} --> app = ACCEPTED, and another for app = RUNNING --> connection established (see the two-phase sketch at the end of this message).
** We probably don't want to change the meaning of the current timeout, for backwards compatibility, so maybe we could deprecate the existing one and replace it with two new ones?
* Is there any way to avoid creating a {{YarnClient}}? I guess this is mitigated slightly by the fact that you only create the client if the timeout is triggered.
** I'm just concerned about the overhead of creating a {{YarnClient}}, and whether this would work on a secure cluster (see the {{YarnClient}} sketch below).
** {{bin/spark-submit}} should print out something like {{Application report for ... (state: ACCEPTED)}}; perhaps we can parse the state from the logs instead (see the log-parsing sketch below)?
* Can we move all the changes in {{RpcServer}} to a separate class? That class is really meant to act as a generic RPC framework that is relatively independent of the HoS logic.

> HOS times out when cluster is full while Hive-on-MR waits
> ---------------------------------------------------------
>
>          Key: HIVE-20506
>          URL: https://issues.apache.org/jira/browse/HIVE-20506
>      Project: Hive
>   Issue Type: Improvement
>     Reporter: Brock Noland
>     Assignee: Brock Noland
>     Priority: Major
>  Attachments: HIVE-20506-CDH5.14.2.patch, HIVE-20506.1.patch, Screen Shot 2018-09-07 at 8.10.37 AM.png
>
> My understanding is as follows:
>
> Hive-on-MR, when the cluster is full, will wait for resources to become available before submitting a job. This is because the {{hadoop jar}} command is the primary mechanism Hive uses to know whether a job has completed or failed.
>
> Hive-on-Spark will time out after {{SPARK_RPC_CLIENT_CONNECT_TIMEOUT}} because the RPC client in the AppMaster doesn't connect back to the RPC Server in HS2.
>
> This is a behavior difference it'd be great to close.
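For reference, here's a minimal sketch of the log-parsing idea above. The class and method names are hypothetical; the only assumptions are that {{bin/spark-submit}} prints the application id in the standard {{application_<clusterTimestamp>_<sequence>}} form and a report line like {{Application report for ... (state: ACCEPTED)}}:

{code:java}
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the YARN application id and the reported
// application state from a single line of spark-submit output.
final class SparkSubmitLogParser {
  // YARN app ids look like application_1536300000000_0042
  private static final Pattern APP_ID =
      Pattern.compile("(application_\\d+_\\d+)");
  // e.g. "Application report for application_... (state: ACCEPTED)"
  private static final Pattern STATE =
      Pattern.compile("\\(state: ([A-Z_]+)\\)");

  static Optional<String> findApplicationId(String line) {
    Matcher m = APP_ID.matcher(line);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  static Optional<String> findReportedState(String line) {
    Matcher m = STATE.matcher(line);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }
}
{code}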
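And a sketch of the {{YarnClient}} check, assuming Hadoop 2.8+ for {{ApplicationId.fromString}} (older releases would go through {{ConverterUtils}}). Note that on a secure cluster this call needs valid credentials for the ResourceManager, which is part of my concern above:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;

// Sketch: ask the ResourceManager whether the app is still queued (ACCEPTED).
final class YarnStateProbe {
  static boolean isAccepted(Configuration conf, String appIdStr)
      throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    try {
      client.init(conf);
      client.start();
      ApplicationId appId = ApplicationId.fromString(appIdStr); // Hadoop 2.8+
      YarnApplicationState state =
          client.getApplicationReport(appId).getYarnApplicationState();
      return state == YarnApplicationState.ACCEPTED;
    } finally {
      client.stop();
    }
  }
}
{code}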
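Finally, the two-phase timeout split I'm suggesting, as a sketch. Everything here is hypothetical: {{AppStateProbe}}, the {{connected}} future standing in for whatever the RPC layer completes when the remote driver calls back into HS2, and the two timeout values, which would be the two new config keys replacing the deprecated one. The point is just that the second clock only starts once the app leaves {{ACCEPTED}}:

{code:java}
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical two-phase wait; ignores states before ACCEPTED for brevity.
final class TwoPhaseWait {

  interface AppStateProbe {
    boolean isAccepted() throws Exception; // YarnClient- or log-based
  }

  static void await(AppStateProbe probe, Future<Void> connected,
      long acceptTimeoutMs, long connectTimeoutMs) throws Exception {
    long acceptDeadline =
        System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(acceptTimeoutMs);
    // Phase 1: the app is queued; only the first timeout is ticking.
    while (probe.isAccepted()) {
      if (System.nanoTime() > acceptDeadline) {
        throw new TimeoutException("app stuck in ACCEPTED (cluster full?)");
      }
      TimeUnit.SECONDS.sleep(1);
    }
    // Phase 2: the app has left ACCEPTED; the remote driver now has
    // connectTimeoutMs to open its RPC connection back to HS2.
    connected.get(connectTimeoutMs, TimeUnit.MILLISECONDS);
  }
}
{code}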