On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote:

Hi Ralph,

On Aug 10, 2009, at 13:04, Ralph Castain wrote:

Umm...are you saying that your $PBS_NODEFILE contains the following:

No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local

each repeated 8 times.  So that seems to be working....
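
A quick way to double-check those counts from inside a job (just a one-liner sketch) is:

sort $PBS_NODEFILE | uniq -c

which should report 8 lines each for xserve01.local and xserve02.local.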

Good!



xserve01.local np=8
xserve02.local np=8

If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would expect to see "xserve01.local" repeated 8 times, followed by "xserve02.local" repeated 8 times. Given the different syntax, we may not be parsing the file correctly. How was this file created?

The file I am referring to above is the $TORQUEHOME/server_priv/nodes file, which I created by hand based on my understanding of the docs at:

http://www.clusterresources.com/torquedocs/nodeconfig.shtml

OMPI doesn't care about that file - only Torque looks at it.



Also, could you clarify what node mpirun is executing on?

mpirun seems to only run on xserve02

The job I'm running is just creating a file:

#!/bin/bash

H=`hostname`
echo $H
sleep 10
uptime >&  $H.txt

In the stdout, the echo $H returns "xserve02.local" 16 times, and only xserve02.local.txt gets created...

Again, if I run with "ssh" outside of PBS, I get the expected response.

Try running:

mpirun --display-allocation -pernode --display-map hostname

This will tell us what OMPI is seeing in terms of the nodes available to it. Based on what you show above, it should see both of your nodes. By forcing OMPI to put one proc/node, you'll be directing it to use both nodes for the job. You should see this in the reported map.
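
If it's easier to run that through the queue, a minimal submission script along these lines should do it (the #PBS line and the use of $PBS_O_WORKDIR are just a sketch based on your setup):

#!/bin/bash
#PBS -l nodes=2:ppn=8
# run the diagnostic from the directory the job was submitted from
cd $PBS_O_WORKDIR
mpirun --display-allocation -pernode --display-map hostname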

If we then see both procs run on the same node, I would suggest reconfiguring OMPI with --enable-debug, and then rerunning the above command with an additional flag:

-mca plm_base_verbose 5

which will show us all the ugly details of what OMPI is telling Torque to do. Since OMPI works fine with Torque on Linux, my guess is that there is something about the Torque build for the Mac that is a little different and is causing problems.
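
Roughly, that sequence would look something like the following (the install prefix is just a placeholder; reuse whatever configure options you built with originally):

./configure --enable-debug --prefix=/opt/openmpi <your original configure options>
make all install

# then rerun the diagnostic with the extra verbosity flag
mpirun --display-allocation -pernode --display-map -mca plm_base_verbose 5 hostname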

Ralph




Thanks,  Jody




Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:


Hi All,

I've been trying to get Torque/PBS to work on my OS X 10.5.7 cluster with Open MPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps operator error!).

If I submit openMPI with:


#PBS -l nodes=2:ppn=8

mpirun MyProg
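
For reference, a minimal submission script along these lines shows the behavior (MyProg and the working-directory handling are just a sketch):

#!/bin/bash
#PBS -l nodes=2:ppn=8
# launch from the submission directory so MyProg and its output are found
cd $PBS_O_WORKDIR
mpirun MyProg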


PBS locks off the two nodes (checked via "pbsnodes -a" and the job output), but mpirun runs the whole job on the second of the two nodes.

If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes....

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong, or how to debug this properly? There is nothing suspicious in the server or mom logs.

Thanks for any help,

Jody





--
Jody Klymak
http://web.uvic.ca/~jklymak/



