Umm...are you saying that your $PBS_NODEFILE contains the following:

xserve01.local np=8
xserve02.local np=8

If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would expect to see "xserve01.local" repeated 8 times, followed by "xserve02.local" repeated 8 times. Given the different syntax, we may not be parsing the file correctly. How was this file created?
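For reference, here's a quick sketch of what a correct $PBS_NODEFILE for nodes=2:ppn=8 should contain and how to sanity-check it (built in a temp file here; inside a real job you'd point uniq at $PBS_NODEFILE itself — hostnames taken from your mail):

```shell
# A correct $PBS_NODEFILE for nodes=2:ppn=8 has one line per slot,
# 16 lines total: the hostname repeated once per slot.
nodefile=$(mktemp)
for i in $(seq 8); do echo xserve01.local; done >  "$nodefile"
for i in $(seq 8); do echo xserve02.local; done >> "$nodefile"

# Count consecutive duplicates -- should report 8 of each hostname.
uniq -c "$nodefile"
# (inside a real job, run: uniq -c $PBS_NODEFILE)
rm -f "$nodefile"
```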

Also, could you clarify what node mpirun is executing on?
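In the meantime, adding a couple of display options to your job script would show what allocation Open MPI actually read from Torque and where it plans to put each process (a sketch of your script with the flags added; MyProg is your program):

```shell
#PBS -l nodes=2:ppn=8

# Show what Torque handed us...
cat $PBS_NODEFILE

# ...then what Open MPI thinks the allocation and process placement are.
mpirun --display-allocation --display-map MyProg
```

If the allocation printout shows only one node, that confirms the nodefile isn't being parsed the way we expect.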

Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:


Hi All,

I've been trying to get Torque/PBS to work on my OS X 10.5.7 cluster with Open MPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!)

If I submit the Open MPI job with:


#PBS -l nodes=2:ppn=8

mpirun MyProg


PBS locks off two of the nodes, checked via "pbsnodes -a" and the job output. But mpirun runs the whole job on the second of the two nodes.

If I run the same job w/o qsub (i.e. using ssh):

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes....

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong or how to debug this properly? There is nothing suspicious in the server or mom logs.

Thanks for any help,

Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users