... from a (probably obsolete) Q-Chem user guide found on the Web:

***
"To run parallel Q-Chem using a batch scheduler such as PBS, users may have to modify the mpirun command in $QC/bin/parallel.csh depending on whether or not the MPI implementation requires the -machinefile option to be given. For further details users should read $QC/README.Parallel, and contact Q-Chem if any problems are encountered (email: support@q-chem.com). Parallel users should also read the above section on using serial Q-Chem. Users can also run Q-Chem's coupled-cluster calculations in parallel on multi-core architectures. Please see Section 5.12 for details."
***

Guesses:

1) Q-Chem is launched by a set of scripts provided by Q-chem.com (or something along those lines), and the mpiexec command line is buried somewhere in those scripts, not directly visible to the user. Right?

2) Look for the -machinefile switch in their script $QC/bin/parallel.csh and replace it with -hostfile $PBS_NODEFILE, roughly as sketched below.
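For illustration, the kind of edit guess 2 describes might look roughly like this. I don't have the script in front of me, so the mpirun line shown is only an assumption about what a typical parallel.csh contains, not the real thing, and $NPROC, $QCPROG and $QCIN are placeholder names:

# Locate where Q-Chem's wrapper script builds its launch command:
grep -n machinefile $QC/bin/parallel.csh

# If it contains something along the lines of (hypothetical):
#   mpirun -machinefile $QC/bin/machines -np $NPROC $QCPROG $QCIN
# change the node-list option so it uses the file Torque hands out:
#   mpirun -hostfile $PBS_NODEFILE -np $NPROC $QCPROG $QCIN

Keep a backup of the original script before touching it.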
My two cents,
Gus Correa

On Aug 10, 2013, at 3:03 PM, Gustavo Correa wrote:

> Hi Lee-Ping
>
> I know nothing about Q-Chem, but I was confused by these sentences:
>
> "That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes."
>
> "I do not observe this error when submitting single-node jobs."
>
> "Since my jobs are only parallel over the node they're running on, I believe that a node file of any kind is unnecessary."
>
> Are you trying to run MPI jobs across several nodes or inside a single node?
>
> ***
>
> Anyway, as far as I know, if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun command will look for the $PBS_NODEFILE to learn in which node(s) it should launch the MPI processes, regardless of whether you are using one node or more than one node.
>
> You didn't send your mpiexec command line (which would help), but assuming that Q-Chem allows some level of standard mpiexec command options, you could force passing the $PBS_NODEFILE to it.
>
> Something like this (for two nodes with 8 cores each):
>
> #PBS -q myqueue
> #PBS -l nodes=2:ppn=8
> #PBS -N myjob
> cd $PBS_O_WORKDIR
> ls -l $PBS_NODEFILE
> cat $PBS_NODEFILE
>
> mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable <parameters to Q-chem>
>
> I hope this helps,
> Gus Correa
>
> On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:
>
>> Hi there,
>>
>> Recently, I've begun some calculations on a cluster where I submit a multiple-node job to the Torque batch system, and the job executes multiple single-node parallel tasks. That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes.
>>
>> Some background: the actual program being executed is Q-Chem 4.0. I use OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile and this is the last version of OpenMPI that this version of Q-Chem is known to work with.
>>
>> My jobs are failing with the error message below; I do not observe this error when submitting single-node jobs. From reading the mailing list archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php), I believe it is looking for a PBS node file somewhere. Since my jobs are only parallel over the node they're running on, I believe that a node file of any kind is unnecessary.
>>
>> My question is: why is OpenMPI behaving differently when I submit a multi-node job compared to a single-node job? How does OpenMPI detect that it is running under a multi-node allocation? Is there a way I can change OpenMPI's behavior so it always thinks it's running on a single node, regardless of the type of job I submit to the batch system?
>>
>> Thank you,
>>
>> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
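One more thought on Lee-Ping's last question (making OpenMPI behave as if each task were a plain single-node run): an untested possibility, assuming the Q-Chem wrapper picks up the normal environment, is to exclude Open MPI's Torque components via MCA parameters set in the job script before Q-Chem is invoked, which avoids editing parallel.csh at all:

# Untested sketch: tell Open MPI not to use its Torque (tm) components,
# the ones named in the ras_tm_module.c / plm_tm_module.c errors above,
# so each mpirun falls back to a local, single-node launch.
export OMPI_MCA_ras=^tm
export OMPI_MCA_plm=^tm
# (bash syntax; in csh use: setenv OMPI_MCA_ras '^tm' etc.)

Whether that actually clears those errors would need a quick test; passing the $PBS_NODEFILE through as suggested earlier is the more conventional fix.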