On 22.11.2013, at 19:34, Jason Gans wrote:

> On 11/22/13 11:18 AM, Lloyd Brown wrote:
>> As far as I understand, mpirun will assign processes to hosts in the
>> hostlist ($PBS_NODEFILE) sequentially, and if it runs out of hosts in
>> the list, it starts over at the top of the file.
>> 
>> Theoretically, you should be able to request specific hostnames, and the
>> processor counts per hostname, in your Torque submit request.  I'm not
>> sure if this is exactly right (we don't use Torque here anymore, and I'm going
>> off memory), but it should be approximately correct:
>> 
>>> qsub -l nodes=n0000:2+n0001:2+n0002:8+n0003:8+n0004:2+n0005:2+n0006:2+n0007:4 ...
> Thanks! This is awkward, but it did the trick. To get the desired behavior I first
> had to provide a "fake" nodes file to Torque (where all of the nodes were listed
> as having a large number of processors, i.e. np=8). Now I can submit jobs using:
> 
> qsub -I -l nodes=n0000:ppn=2+n0001:ppn=2+n0002:ppn=8+...
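For reference, a minimal sketch of that setup, pieced together from the node names and core counts given later in this thread (the exact file contents and full command line are illustrative, not copied from Jason's cluster):

   # "fake" Torque nodes file: every node advertised with np=8
   n0000 np=8
   n0001 np=8
   n0002 np=8
   n0003 np=8
   n0004 np=8
   n0005 np=8
   n0006 np=8
   n0007 np=8

   # interactive submission requesting the real per-node core counts
   qsub -I -l nodes=n0000:ppn=2+n0001:ppn=2+n0002:ppn=8+n0003:ppn=8+n0004:ppn=2+n0005:ppn=2+n0006:ppn=2+n0007:ppn=4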

This shouldn't be necessary when Torque knows the number of cores in each
machine and you request all 24 of them, as suggested.
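A sketch of such a request, assuming the scheduler in use honors Torque's "procs" resource (scheduler support for it varies):

   qsub -I -l procs=24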

-- Reuti


> 
> and get the expected behavior (including the expected $PBS_NODEFILE, where the
> name of each node appears "ppn" number of times).
> 
> Thanks to everyone who responded!
> 
> Regards,
> 
> Jason
>> Granted, that's awkward, but I'm not sure if there's another way in
>> Torque to request different numbers of processors per node.  You might
>> ask on the Torque Users list.  They might tell you to change the nodes
>> file to reflect the number of actual processes you want on each node,
>> rather than the number of physical processors on the hosts.  Whether
>> this works for you depends on whether you want this type of
>> oversubscription to happen all the time, or on a per-job basis, etc.
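A sketch of that alternative, with hypothetical np values chosen to match the process counts you want on each node rather than the physical core counts:

   n0000 np=4
   n0001 np=4
   n0002 np=16
   (and so on for the remaining nodes)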
>> 
>> 
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> 
>> On 11/22/2013 11:11 AM, Gans, Jason D wrote:
>>> I have tried the 1.7 series (specifically 1.7.3) and I get the same
>>> behavior.
>>> 
>>> When I run "mpirun -oversubscribe -np 24 hostname", three instances of
>>> "hostname" are run on each node.
>>> 
>>> The contents of the $PBS_NODEFILE are:
>>> n0007
>>> n0006
>>> n0005
>>> n0004
>>> n0003
>>> n0002
>>> n0001
>>> n0000
>>> 
>>> but, since I have compiled OpenMPI with "--with-tm", it appears
>>> that OpenMPI is not using the $PBS_NODEFILE (which I tested by modifying
>>> the Torque pbs_mom to write a $PBS_NODEFILE that contained "slot=xx"
>>> information for each node; mpirun complained when I did this).
>>> 
>>> Regards,
>>> 
>>> Jason
>>> 
>>> ------------------------------------------------------------------------
>>> *From:* users [users-boun...@open-mpi.org] on behalf of Ralph Castain
>>> [r...@open-mpi.org]
>>> *Sent:* Friday, November 22, 2013 11:04 AM
>>> *To:* Open MPI Users
>>> *Subject:* Re: [OMPI users] Oversubscription of nodes with Torque and
>>> OpenMPI
>>> 
>>> Really shouldn't matter - this is clearly a bug in OMPI if it is doing
>>> mapping as you describe. Out of curiosity, have you tried the 1.7
>>> series? Does it behave the same?
>>> 
>>> I can take a look at the code later today and try to figure out what
>>> happened.
>>> 
>>> On Nov 22, 2013, at 9:56 AM, Jason Gans <jg...@lanl.gov> wrote:
>>> 
>>>> On 11/22/13 10:47 AM, Reuti wrote:
>>>>> Hi,
>>>>> 
>>>>> On 22.11.2013, at 17:32, Gans, Jason D wrote:
>>>>> 
>>>>>> I would like to run an instance of my application on every *core* of
>>>>>> a small cluster. I am using Torque 2.5.12 to run jobs on the
>>>>>> cluster. The cluster in question is a heterogeneous collection of
>>>>>> machines that are all past their prime. Specifically, the number of
>>>>>> cores ranges from 2-8. Here is the Torque "nodes" file:
>>>>>> 
>>>>>> n0000 np=2
>>>>>> n0001 np=2
>>>>>> n0002 np=8
>>>>>> n0003 np=8
>>>>>> n0004 np=2
>>>>>> n0005 np=2
>>>>>> n0006 np=2
>>>>>> n0007 np=4
>>>>>> 
>>>>>> When I use openmpi-1.6.3, I can oversubscribe nodes but the tasks
>>>>>> are allocated to nodes without regard to the number of cores on each
>>>>>> node (specified by the "np=xx" in the nodes file). For example, when
>>>>>> I run "mpirun -np 24 hostname", mpirun places three instances of
>>>>>> "hostname" on each node, despite the fact that some nodes only have
>>>>>> two processors and some have more.
>>>>> Did you also request 24 cores when you submitted the job itself?
>>>>> 
>>>>> -- Reuti
>>>> Since there are only 8 Torque nodes in the cluster, I submitted the
>>>> job by requesting 8 nodes, i.e. "qsub -I -l nodes=8".
>>>>> 
>>>>>> Is there a way to have OpenMPI "gracefully" oversubscribe nodes by
>>>>>> allocating instances based on the "np=xx" information in the Torque
>>>>>> nodes file? Or is this a Torque problem?
>>>>>> 
>>>>>> p.s. I do get the desired behavior when I run *without* Torque and
>>>>>> specify the following machine file to mpirun:
>>>>>> 
>>>>>> n0000 slots=2
>>>>>> n0001 slots=2
>>>>>> n0002 slots=8
>>>>>> n0003 slots=8
>>>>>> n0004 slots=2
>>>>>> n0005 slots=2
>>>>>> n0006 slots=2
>>>>>> n0007 slots=4
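A minimal sketch of the corresponding invocation, assuming the machine file above is saved as ./machines (the file name is illustrative):

   mpirun --hostfile ./machines -np 24 hostname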
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Jason
>>>>>> 
>>>>>> 
>>>>>> 