Hi Ralph,
On Aug 10, 2009, at 13:04 PM, Ralph Castain wrote:
Umm...are you saying that your $PBS_NODEFILE contains the following:
No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local
each repeated 8 times. So that seems to be working....
xserve01.local np=8
xserve02.local np=8
If so, that could be part of the problem - it isn't the standard
notation we are expecting to see in that file. What Torque normally
provides is one line for each slot, so we would expect to see
"xserve01.local" repeated 8 times, followed by "xserve02.local"
repeated 8 times. Given the different syntax, we may not be parsing
the file correctly. How was this file created?
The file I am referring to above is the $TORQUEHOME/server_priv/nodes
file, which I created by hand based on my understanding of the docs
at:
http://www.clusterresources.com/torquedocs/nodeconfig.shtml
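The one-line-per-slot layout Ralph describes for $PBS_NODEFILE can be checked with a quick tally of lines per host. The following is only a minimal sketch; count_slots is a hypothetical helper name, not part of Torque or Open MPI, and it assumes the file has one hostname per line.

```shell
#!/bin/bash
# Tally how many slots (lines) each host has in a Torque-style node file.
# count_slots is an illustrative helper, not a Torque or Open MPI command.
# Usage: count_slots FILE   (inside a job, FILE would be "$PBS_NODEFILE")
count_slots() {
    # sort groups identical hostnames together, uniq -c counts each group,
    # and awk prints "host count" without uniq's leading padding.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}
```

Called as count_slots "$PBS_NODEFILE" inside the job script, a file with each of the two hosts listed 8 times would be reported as "xserve01.local 8" and "xserve02.local 8".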
Also, could you clarify what node mpirun is executing on?
mpirun seems to only run on xserve02
The job I'm running is just creating a file:
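#!/bin/bash
# Record which host this process is running on.
H=$(hostname)
echo "$H"
sleep 10
# Write uptime output (stdout and stderr) to a per-host file.
uptime > "$H.txt" 2>&1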
In the stdout, the echo $H returns
"xserve02.local" 16 times and only xserve02.local.txt gets created...
Again, if I run with "ssh" outside of pbs I get the expected response.
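One way to narrow this down might be to hand the node file to mpirun explicitly instead of relying on Open MPI to pick up the Torque allocation on its own. The sketch below only prints the command it would run; build_mpirun_cmd and its two parameters are hypothetical names, and it assumes the node file has one line per slot, as shown above. -np and -hostfile are standard mpirun options.

```shell
#!/bin/bash
# Print (without running) an mpirun command that names the node file explicitly.
# build_mpirun_cmd, nodefile and prog are illustrative names only.
build_mpirun_cmd() {
    nodefile=$1
    prog=$2
    # One line per slot, so the line count is the number of processes.
    # tr strips the padding that some wc implementations (e.g. on OS X) add.
    nprocs=$(wc -l < "$nodefile" | tr -d ' ')
    echo "mpirun -np $nprocs -hostfile $nodefile $prog"
}
```

Inside the job script this would be called as build_mpirun_cmd "$PBS_NODEFILE" MyProg. If the printed command spreads the job over both nodes while a bare "mpirun MyProg" does not, that would hint that the Torque allocation is not being detected, though that is a guess rather than a diagnosis.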
Thanks, Jody
Ralph
On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:
Hi All,
I've been trying to get torque pbs to work on my OS X 10.5.7
cluster with openMPI (after finding that Xgrid was pretty flaky
about connections). I *think* this is an MPI problem (perhaps via
operator error!)
If I submit an Open MPI job with:
#PBS -l nodes=2:ppn=8
mpirun MyProg
pbs locks off the two nodes, as checked via "pbsnodes -a" and
the job output. But mpirun runs the whole job on the second of the
two nodes.
If I run the same job w/o qsub (i.e. using ssh)
mpirun -n 16 -host xserve01,xserve02 MyProg
it runs fine on all the nodes....
My /var/spool/torque/server_priv/nodes file looks like:
xserve01.local np=8
xserve02.local np=8
Any idea what could be going wrong or how to debug this properly?
There is nothing suspicious in the server or mom logs.
Thanks for any help,
Jody
--
Jody Klymak
http://web.uvic.ca/~jklymak/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users