On Nov 22, 2013, at 10:26 AM, Jason Gans <jg...@lanl.gov> wrote:

> On 11/22/13 11:15 AM, Ralph Castain wrote:
>> 
>> On Nov 22, 2013, at 10:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>> 
>>> Am 22.11.2013 um 18:56 schrieb Jason Gans:
>>> 
>>>> On 11/22/13 10:47 AM, Reuti wrote:
>>>>> Hi,
>>>>> 
>>>>> Am 22.11.2013 um 17:32 schrieb Gans, Jason D:
>>>>> 
>>>>>> I would like to run an instance of my application on every *core* of a 
>>>>>> small cluster. I am using Torque 2.5.12 to run jobs on the cluster. The 
>>>>>> cluster in question is a heterogeneous collection of machines that are 
>>>>>> all past their prime. Specifically, the number of cores ranges from 2-8. 
>>>>>> Here is the Torque "nodes" file:
>>>>>> 
>>>>>> n0000 np=2
>>>>>> n0001 np=2
>>>>>> n0002 np=8
>>>>>> n0003 np=8
>>>>>> n0004 np=2
>>>>>> n0005 np=2
>>>>>> n0006 np=2
>>>>>> n0007 np=4
>>>>>> 
>>>>>> When I use openmpi-1.6.3, I can oversubscribe nodes but the tasks are 
>>>>>> allocated to nodes without regard to the number of cores on each node 
>>>>>> (specified by the "np=xx" in the nodes file). For example, when I run 
>>>>>> "mpirun -np 24 hostname", mpirun places three instances of "hostname" on 
>>>>>> each node, despite the fact that some nodes only have two processors and 
>>>>>> some have more.
>>>>> Did you submit the job itself requesting 24 cores as well?
>>>>> 
>>>>> -- Reuti
>>>> Since there are only 8 Torque nodes in the cluster, I submitted the job by 
>>>> requesting 8 nodes, i.e. "qsub -I -l nodes=8".
>>> 
>>> No, AFAICT it's necessary to request 24 there too. To investigate further, 
>>> it would also be good to have your job script copy the $PBS_NODEFILE to 
>>> your home directory for later inspection, i.e. to check whether you are 
>>> already getting the correct values there.
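>>> 
>>> Something like this near the top of the job script would do (a minimal 
>>> sketch; the destination filename is arbitrary):
>>> 
>>>   cp "$PBS_NODEFILE" "$HOME/pbs_nodefile.$PBS_JOBID"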
>> 
>> Not really - we take the number of slots on each node and add them together.
>> 
>> Question: is that a copy/paste of the actual PBS_NODEFILE? It doesn't look 
>> right to me - there is supposed to be one node entry for each slot. In other 
>> words, it should have looked like this:
>> 
>>> n0000
>>> n0000
>>> n0001
>>> n0001
>>> n0002
>>> n0002
>> ...
>> 
> That is what I expected -- however, the $PBS_NODEFILE lists each node just 
> once.

It's been a while since I used Torque, but I suspect Reuti is right - you have 
to ask for 24 slots. It sounds like Torque is only assigning you one slot per 
node.
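
Something like this might do it (a sketch; it assumes your Torque server is 
configured the common way, mapping the "nodes" count onto virtual processors, 
i.e. np slots, rather than physical machines - check with your admin if the 
job just sits in the queue):

  qsub -I -l nodes=24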


>> 
>> 
>>> 
>>> -- Reuti
>>> 
>>>>> 
>>>>> 
>>>>>> Is there a way to have Open MPI "gracefully" oversubscribe nodes by 
>>>>>> allocating instances based on the "np=xx" information in the Torque 
>>>>>> nodes file? Or is this a Torque problem?
>>>>>> 
>>>>>> p.s. I do get the desired behavior when I run *without* Torque and 
>>>>>> specify the following machine file to mpirun:
>>>>>> 
>>>>>> n0000 slots=2
>>>>>> n0001 slots=2
>>>>>> n0002 slots=8
>>>>>> n0003 slots=8
>>>>>> n0004 slots=2
>>>>>> n0005 slots=2
>>>>>> n0006 slots=2
>>>>>> n0007 slots=4
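>>>>>> 
>>>>>> e.g. (a sketch, assuming the above is saved as a file named "machines" 
>>>>>> -- the name is arbitrary):
>>>>>> 
>>>>>>   mpirun -np 24 --hostfile machines hostname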
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> Jason