Hi Gijsbert

This looks more like a Torque issue than an OpenMPI one, though not necessarily so.
ClusterResources has decent documentation for Torque:
http://www.clusterresources.com/pages/products/torque-resource-manager.php

1) To verify that Torque and OpenMPI work together, first try
a non-MPI executable, e.g.:

#PBS -lnodes=4:ppn=4

mpiexec -np 16 hostname
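
A fuller version of that test, as a rough sketch only: it uses the
failing request (nodes=4:ppn=1) and prints how many slots Torque
granted on each host before launching anything. The job name and the
helper function slots_per_host are names I made up for illustration.
It assumes your OpenMPI was built with Torque (tm) support, in which
case mpiexec without -np should start one process per granted slot.

```shell
#!/bin/sh
#PBS -l nodes=4:ppn=1
#PBS -N mpi_hostname_check

# Hypothetical helper: print "<slots> <host>" for each host in a node file.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Inside a Torque job, PBS_NODEFILE lists one line per granted slot.
# Outside a job the variable is unset and nothing below runs.
if [ -n "${PBS_NODEFILE:-}" ]; then
    slots_per_host "$PBS_NODEFILE"
    # No -np here: with tm support, OpenMPI takes the slot count from Torque.
    mpiexec hostname
fi
```

With nodes=4:ppn=1 you would expect four lines of the form
"1 <hostname>". If the node file shows something else, the problem is
on the Torque side rather than in OpenMPI.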

Use full path to

2) Check your ${TORQUE}/server_priv/nodes file.
It should be something like this:

node01 np=4
node02 np=4
node03 np=4
node04 np=4
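
If you want a quick sanity check of that file, something along these
lines could add up the np values. This is only a sketch: total_np is
a name I invented, and it assumes the simple "hostname np=N" layout
shown above (a line without np= is counted as one core, which is the
Torque default).

```shell
# Sum the np= values in a Torque nodes file and print the total core count.
total_np() {
    awk '
        /^[ \t]*(#|$)/ { next }     # skip comment lines and blank lines
        {
            np = 1                  # a node without np= counts as one core
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4)
            total += np
        }
        END { print total + 0 }
    ' "$1"
}
```

You would call it as: total_np ${TORQUE}/server_priv/nodes
For the four-node file above it should print 16.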

3) Verify that the pbs_mom daemons are working on all nodes
(service pbs_mom status)
In some setups this is called "pbs" instead of "pbs_mom".

4) Check that your PATH is being set correctly
for both Torque and OpenMPI, if they are installed in non-standard places.
You can try for instance:

mpirun -np 16 hostname
mpirun -np 16 sh -c 'echo $PATH'
mpirun -np 16 which mpiexec
mpirun -np 16 which qsub

Your PATH is probably set in your .bashrc/.cshrc file,
and it must be set *on all nodes*.
If your home directory is mounted over NFS on the nodes,
this is one single file.
However, if the home directories are local to each node,
then you need one file per node.
Sometimes this is done instead in specific files in
/etc/profile.d, e.g. torque.[sh,csh].
Yet another alternative is "environment modules".
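
For the /etc/profile.d route, a file such as torque.sh might look
roughly like the sketch below. Both install prefixes (/opt/torque and
/opt/openmpi) and the two variable names are placeholders of mine, so
substitute the locations used on your cluster.

```shell
# /etc/profile.d/torque.sh (sketch; both prefixes below are hypothetical)
TORQUE_HOME=/opt/torque
OPENMPI_HOME=/opt/openmpi

# Put the Torque and OpenMPI commands ahead of anything else on PATH.
export PATH="${TORQUE_HOME}/bin:${OPENMPI_HOME}/bin:${PATH}"

# Make the OpenMPI shared libraries visible; skip the trailing colon
# when LD_LIBRARY_PATH was empty to begin with.
export LD_LIBRARY_PATH="${OPENMPI_HOME}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

The same file has to exist on every node, unless /etc/profile.d
itself sits on a shared filesystem.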

I hope this helps,
Gus Correa

Gijsbert Wiesenekker wrote:
I have a four-node quad core cluster. I am running OpenMPI (version 1.4.2) jobs 
with Torque (version 2.4.8). I can submit jobs using
#PBS -lnodes=4:ppn=4
And 16 processes are launched. However if I use
#PBS -lnodes=4:ppn=1 or
#PBS -lnodes=4
The call to MPI_Init is successful, but the call to MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs)
hangs and never returns.

Any ideas? Any workarounds?

Gijsbert


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
