Hi Gus,

Thank you for your reply.  I want to run MPI jobs inside a single node, but
due to the resource allocation policies on the clusters, I could get many
more resources if I submit multiple-node "batch jobs".  Once I have a
multiple-node batch job, I can use a command like "pbsdsh" to run a
single-node MPI job on each node that is allocated to me.  Thus, the MPI
jobs on the different nodes run independently and are unaware of one
another.
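
Roughly speaking, my batch script looks something like the sketch below (the
queue name, node counts, and the name of the per-node driver script are just
placeholders for my actual setup):

#PBS -q myqueue
#PBS -l nodes=4:ppn=8
#PBS -N qchem-ensemble
cd $PBS_O_WORKDIR
# Run one copy of the driver script on each unique node in the allocation
# ("-u" = one task per host); each copy launches an independent,
# single-node Q-Chem / MPI calculation.
pbsdsh -u $PBS_O_WORKDIR/run_one_node.sh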

The actual call to mpirun is nontrivial to capture, because Q-Chem has a
complicated series of wrapper scripts that ultimately calls mpirun.  Since
the jobs fail almost immediately, I only have a small window to view the
actual command through "ps" or something similar.
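
One thing I can do is poll "ps" from a second shell on the compute node right
after the job starts, something like this (crude, and the log file name is
arbitrary):

# Snapshot any mpirun command lines that appear during the first ~30 seconds.
for i in $(seq 1 30); do
    ps -ef | grep '[m]pirun' >> mpirun_cmdlines.log
    sleep 1
done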

Another option is for me to compile OpenMPI without Torque / PBS support.
If I do that, then it won't look for the node file anymore.  Is this
correct? 
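
If so, I believe the relevant configure flag is "--without-tm", i.e. something
like (untested; the install prefix is just an example):

./configure --prefix=$HOME/openmpi-1.4.2-no-tm --without-tm
make && make install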

I will try your suggestions and get back to you.  Thanks!

- Lee-Ping

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gustavo Correa
Sent: Saturday, August 10, 2013 12:04 PM
To: Open MPI Users
Subject: Re: [OMPI users] Error launching single-node tasks from
multiple-node job.

Hi Lee-Ping

I know nothing about Q-Chem, but I was confused by these sentences:

"That is to say, these tasks are intended to use OpenMPI parallelism on each
node, but no parallelism across nodes. "

"I do not observe this error when submitting single-node jobs."

"Since my jobs are only parallel over the node they're running on, I believe
that a node file of any kind is unnecessary. "

Are you trying to run MPI jobs across several nodes or inside a single node?

***

Anyway, as far as I know,
if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun
command will look for $PBS_NODEFILE to learn on which node(s) it should
launch the MPI processes, regardless of whether you are using one node or
more than one.

You didn't send your mpiexec command line (which would help), but assuming
that Q-Chem accepts standard mpiexec command-line options, you could pass
the $PBS_NODEFILE to it explicitly.

Something like this (for two nodes with 8 cores each):

#PBS -q myqueue
#PBS -l nodes=2:ppn=8
#PBS -N myjob
cd $PBS_O_WORKDIR
ls -l $PBS_NODEFILE
cat $PBS_NODEFILE

mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable \
    <parameters to Q-chem>

I hope this helps,
Gus Correa

On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:

> Hi there,
>  
> Recently, I've begun some calculations on a cluster where I submit a
> multiple node job to the Torque batch system, and the job executes multiple
> single-node parallel tasks.  That is to say, these tasks are intended to use
> OpenMPI parallelism on each node, but no parallelism across nodes.
>  
> Some background: The actual program being executed is Q-Chem 4.0.  I use
> OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile,
> and this is the last version of OpenMPI that this version of Q-Chem is
> known to work with.
>  
> My jobs are failing with the error message below; I do not observe this
> error when submitting single-node jobs.  From reading the mailing list
> archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php),
> I believe it is looking for a PBS node file somewhere.  Since my jobs are
> only parallel over the node they're running on, I believe that a node file
> of any kind is unnecessary.
>  
> My question is: Why is OpenMPI behaving differently when I submit a
> multi-node job compared to a single-node job?  How does OpenMPI detect that
> it is running under a multi-node allocation?  Is there a way I can change
> OpenMPI's behavior so it always thinks it's running on a single node,
> regardless of the type of job I submit to the batch system?
>  
> Thank you,
>  
> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>  
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
