Hi Ralph,

On Aug 10, 2009, at 1:04 PM, Ralph Castain wrote:

Umm...are you saying that your $PBS_NODEFILE contains the following:

No, if I put "cat $PBS_NODEFILE" in the PBS script, I get:
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local

each repeated 8 times.  So that seems to be working....
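A quick way to double-check those per-host slot counts (a sketch using standard Unix tools, run inside the PBS script):

sort $PBS_NODEFILE | uniq -c

which should print a count of 8 next to each hostname.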


xserve01.local np=8
xserve02.local np=8

If so, that could be part of the problem - it isn't the standard notation we are expecting to see in that file. What Torque normally provides is one line for each slot, so we would expect to see "xserve01.local" repeated 8 times, followed by "xserve02.local" repeated 8 times. Given the different syntax, we may not be parsing the file correctly. How was this file created?

The file I am referring to above is the $TORQUEHOME/server_priv/nodes file, which I created by hand based on my understanding of the docs at:

http://www.clusterresources.com/torquedocs/nodeconfig.shtml


Also, could you clarify what node mpirun is executing on?

mpirun seems to run only on xserve02.

The job I'm running just creates a file:

#!/bin/bash

# Record the host this instance runs on, then write its uptime
# (stdout and stderr, via >&) to <hostname>.txt.
H=`hostname`
echo $H
sleep 10
uptime >& $H.txt

In the stdout, echo $H returns "xserve02.local" 16 times, and only xserve02.local.txt gets created...

Again, if I run with ssh outside of PBS, I get the expected response.
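One way to see the allocation mpirun actually detects inside the job (a sketch; --display-allocation and --display-map are standard Open MPI mpirun options, and hostname just stands in for the real program):

#!/bin/bash
#PBS -l nodes=2:ppn=8

# Print the node allocation Open MPI reads from Torque, and the
# process-to-node map, before anything is launched.
mpirun --display-allocation --display-map hostname

If the allocation only lists xserve02, mpirun isn't picking up the Torque allocation at all.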


Thanks,  Jody




Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:


Hi All,

I've been trying to get Torque/PBS to work on my OS X 10.5.7 cluster with Open MPI (after finding that Xgrid was pretty flaky about connections). I *think* this is an MPI problem (perhaps via operator error!).

If I submit the Open MPI job with:


#PBS -l nodes=2:ppn=8

mpirun MyProg


PBS locks off the two nodes (verified via "pbsnodes -a" and via the job output), but mpirun runs the whole job on the second of the two nodes.

If I run the same job w/o qsub (i.e., launching over ssh):

mpirun -n 16 -host xserve01,xserve02 MyProg

it runs fine on all the nodes....

My /var/spool/torque/server_priv/nodes file looks like:

xserve01.local np=8
xserve02.local np=8


Any idea what could be going wrong, or how to debug this properly? There is nothing suspicious in the server or mom logs.
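(One quick check, as a sketch: ompi_info ships with Open MPI and lists the components it was built with. If the build lacks Torque/tm support, mpirun won't read the Torque allocation:

ompi_info | grep tm

If no tm entries show up for the ras/plm frameworks, Open MPI was likely configured without --with-tm.)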

Thanks for any help,

Jody





--
Jody Klymak
http://web.uvic.ca/~jklymak/




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jody Klymak
http://web.uvic.ca/~jklymak/



