On 11/22/13 11:15 AM, Ralph Castain wrote:
On Nov 22, 2013, at 10:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
Am 22.11.2013 um 18:56 schrieb Jason Gans:
On 11/22/13 10:47 AM, Reuti wrote:
Hi,
Am 22.11.2013 um 17:32 schrieb Gans, Jason D:
I would like to run an instance of my application on every *core*
of a small cluster. I am using Torque 2.5.12 to run jobs on the
cluster. The cluster in question is a heterogeneous collection of
machines that are all past their prime. Specifically, the number
of cores ranges from 2-8. Here is the Torque "nodes" file:
n0000 np=2
n0001 np=2
n0002 np=8
n0003 np=8
n0004 np=2
n0005 np=2
n0006 np=2
n0007 np=4
When I use openmpi-1.6.3, I can oversubscribe nodes but the tasks
are allocated to nodes without regard to the number of cores on
each node (specified by the "np=xx" in the nodes file). For
example, when I run "mpirun -np 24 hostname", mpirun places three
instances of "hostname" on each node, despite the fact that some
nodes only have two processors and some have more.
Did you submit the job itself by requesting 24 cores for it, too?
-- Reuti
Since there are only 8 Torque nodes in the cluster, I submitted the
job by requesting 8 nodes, i.e. "qsub -I -l nodes=8".
No, AFAICT it's necessary to request 24 there too. To investigate this
further, it would also be good to have your job script copy the
$PBS_NODEFILE to your home directory for later inspection, i.e. to check
whether you are already getting the correct values there.
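As a sketch of that suggestion (the copy destination below is just a
placeholder), the job script could look like:

#!/bin/sh
# Keep a copy of the node file for later inspection
cp $PBS_NODEFILE $HOME/pbs_nodefile.$PBS_JOBID
# Torque writes one line per slot, so this count should equal the
# total number of cores granted to the job
wc -l $PBS_NODEFILE
mpirun hostname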
Not really - we take the number of slots on each node and add them
together.
Question: is that a copy/paste of the actual PBS_NODEFILE? It doesn't
look right to me - there is supposed to be one node entry for each
slot. In other words, it should have looked like this:
n0000
n0000
n0001
n0001
n0002
n0002
...
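As a quick check (assuming standard shell tools), counting the entries per
node in that file should show the per-node slot counts:

sort $PBS_NODEFILE | uniq -c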
That is what I expected -- however, the $PBS_NODEFILE lists each node
just once.
-- Reuti
Is there a way to have Open MPI "gracefully" oversubscribe nodes by
allocating instances based on the "np=xx" information in the
Torque nodes file? Or is this a Torque problem?
p.s. I do get the desired behavior when I run *without* Torque and
specify the following machine file to mpirun:
n0000 slots=2
n0001 slots=2
n0002 slots=8
n0003 slots=8
n0004 slots=2
n0005 slots=2
n0006 slots=2
n0007 slots=4
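For comparison, a sketch of the corresponding invocation, assuming the
machine file above is saved as "machines" (the file name is just a
placeholder):

mpirun --hostfile machines -np 24 hostname

With the slots=xx values given, Open MPI fills each node up to its slot
count before oversubscribing, which is the desired behavior described above.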
Regards,
Jason