Umm...are you saying that your $PBS_NODEFILE contains the following:
xserve01.local np=8
xserve02.local np=8
If so, that could be part of the problem - it isn't the standard
notation we are expecting to see in that file. What Torque normally
provides is one line for each slot, so we would expect to see
"xserve01.local" repeated 8 times, followed by "xserve02.local"
repeated 8 times. Given the different syntax, we may not be parsing
the file correctly. How was this file created?
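One quick sanity check (just a sketch, assuming a standard Torque
setup) would be to dump the file from inside the job itself, near the
top of the script:

echo $PBS_NODEFILE
cat $PBS_NODEFILE

With nodes=2:ppn=8 we would expect 16 lines there - xserve01.local
eight times, then xserve02.local eight times.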
Also, could you clarify what node mpirun is executing on?
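Adding "hostname" at the top of that same script would answer this.
Assuming a reasonably recent Open MPI, you could also run with

mpirun --display-allocation --display-map MyProg

to see what allocation Open MPI read from Torque and where it decided
to place the processes.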
Ralph
On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:
Hi All,
I've been trying to get Torque/PBS to work on my OS X 10.5.7 cluster
with Open MPI (after finding that Xgrid was pretty flaky about
connections). I *think* this is an MPI problem (perhaps via
operator error!).
If I submit an Open MPI job with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
PBS reserves the two nodes (verified via "pbsnodes -a" and the job
output), but mpirun runs the whole job on the second of the two
nodes.
If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes....
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong, or how to debug this properly?
There is nothing suspicious in the server or mom logs.
Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/