Hi Zhiliang
First thing to check is that your Torque system is defining and
setting the environment variables we expect from a Torque system.
It is quite possible that your Torque system isn't configured as we
expect.
Can you run a job and send us the output from "printenv | grep PBS"?
We should see a PBS jobid, the name of the file containing the names
of the allocated nodes, etc.
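For reference, on a Torque system configured the way we expect, the
output usually looks something like this (the job ID, host names, and
paths below are illustrative):

  $ printenv | grep PBS
  PBS_JOBID=1234.headnode.example.org
  PBS_NODEFILE=/var/spool/torque/aux/1234.headnode.example.org
  PBS_ENVIRONMENT=PBS_BATCH
  PBS_O_WORKDIR=/home/zhiliang

In particular, PBS_NODEFILE should point to the file containing the
names of the allocated nodes.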
Since you are able to run with -machinefile, my guess is that your
system isn't setting those environment variables as we expect. In
that case, you will have to keep specifying the machinefile by hand.
Thanks
Ralph
On Sep 28, 2008, at 7:02 PM, Zhiliang Hu wrote:
I have asked this question on the TorqueUsers list. Responses from
that list suggest that the question be asked on this list:
The situation is:
I can submit my jobs as in:
qsub -l nodes=6:ppn=2 /path/to/mpi_program
where "mpi_program" is:
/path/to/mpirun -np 12 /path/to/my_program
-- however, all the processes ran on the head node (one time, on the
first compute node). The jobs do complete anyway.
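(For completeness, the job script amounts to something like this --
paths are placeholders:

  #!/bin/sh
  #PBS -l nodes=6:ppn=2
  # change to the directory the job was submitted from
  cd $PBS_O_WORKDIR
  # with tm support, mpirun should learn the allocated nodes from
  # Torque itself, so no -machinefile is given here:
  /path/to/mpirun -np 12 /path/to/my_program
)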
While mpirun can run on its own when given a "-machinefile", it was
pointed out by Glen among others, and also on this web site http://wiki.hpc.ufl.edu/index.php/Common_Problems
(I got the same error as the last example on that web page), that
it's not a good idea to provide a machinefile, since that's "already
handled by OpenMPI and Torque".
My question is: why aren't OpenMPI and Torque distributing the jobs
across all the nodes?
ps 1:
OpenMPI was configured and installed with the "--with-tm" option,
and "ompi_info" does show these lines:
MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
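(For reference, a quick way to check for these two components,
assuming ompi_info is in the PATH:

  ompi_info | grep tm
)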
ps 2:
"/path/to/mpirun -np 12 -machinefile /path/to/machinefile /path/to/
my_program"
works normal (send jobs to all nodes).
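(In case it matters, the machinefile is just the list of allocated
hosts, one per line, with a slot count -- host names illustrative:

  node01 slots=2
  node02 slots=2
  node03 slots=2
  node04 slots=2
  node05 slots=2
  node06 slots=2
)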
Thanks,
Zhiliang