Just to be clear: I take it that the first entry is the host name, and the 
second is the number of slots allocated on that host?

FWIW: I see the problem. Our parser was apparently written assuming every line 
was a unique host, so it doesn't even check to see if there is duplication. 
Easy fix - can shoot it to you today.
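
A minimal sketch of the idea, in Python rather than the actual C allocator code: sum the slots of every line that names the same host. It assumes the format Joshua posted below (host name, slot count, then further fields that are ignored here). The name `aggregate_slots` is hypothetical and is not part of Open MPI.

```python
from collections import OrderedDict


def aggregate_slots(lines):
    """Sum slot counts per host across all queue entries.

    Assumed line format: <host> <slots> <queue>@<host> <fourth field>
    Only the first two fields are used.
    """
    slots = OrderedDict()  # keeps hosts in first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        # Add to the running total instead of treating each line as a new host.
        slots[host] = slots.get(host, 0) + count
    return slots


hostfile = """\
iq103 8 lab.q@iq103 <NULL>
iq103 1 test.q@iq103 <NULL>
iq104 8 lab.q@iq104 <NULL>
iq104 1 test.q@iq104 <NULL>
opt221 2 lab.q@opt221 <NULL>
opt221 1 test.q@opt221 <NULL>
"""

for host, count in aggregate_slots(hostfile.splitlines()).items():
    print(host, count)
# prints:
# iq103 9
# iq104 9
# opt221 3
```

With the duplicates merged, each host appears once with its total slot count, in the order it was first seen.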

On Mar 15, 2012, at 6:53 AM, Reuti wrote:

> On Mar 15, 2012, at 5:22 AM, Joshua Baker-LePain wrote:
> 
>> On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote
>> 
>>> On Mar 14, 2012, at 5:44 PM, Reuti wrote:
>> 
>>>> (I was just typing when Ralph's message came in: I can confirm this. To 
>>>> avoid it, Open MPI would have to collect all lines from the hostfile 
>>>> which refer to the same machine. SGE creates entries for each 
>>>> queue/host pair in the machine file).
>>> 
>>> Hmmm…I can take a look at the allocator module and see why we aren't doing 
>>> it. Would the host names be the same for the two queues?
>> 
>> I can't speak authoritatively like Reuti can, but here's what a hostfile
>> looks like on my cluster (note that all our name resolution is done via 
>> /etc/hosts -- there's no DNS involved):
>> 
>> iq103 8 lab.q@iq103 <NULL>
>> iq103 1 test.q@iq103 <NULL>
>> iq104 8 lab.q@iq104 <NULL>
>> iq104 1 test.q@iq104 <NULL>
>> opt221 2 lab.q@opt221 <NULL>
>> opt221 1 test.q@opt221 <NULL>
> 
> Yes, exactly: this needs to be parsed, and the slots of all entries for one 
> and the same machine added up.
> 
> If you need it instantly, it could be put in a wrapper for start_proc_args of 
> the PE (and Open MPI compiled without SGE support), so that a custom-built 
> machinefile can be used. In this case the rsh or ssh call also needs to be 
> caught.
> 
> Often the opposite is desired in an SGE setup: tune it so that all slots are 
> coming from one queue only.
> 
> But I still wonder whether it is possible to tune your setup in a similar 
> way: allow one slot more in the high priority queue (long.q) in case it's a 
> parallel job, with an RQS (assuming 8 cores with one core oversubscription):
> 
> limit queues long.q pes * to slots=9
> limit queues long.q to slots=8
> 
> while you have an additional short.q (the low priority queue) there with one 
> slot. The overall limit is still set on an exechost level to 9. The PE is 
> then only attached to long.q.
> 
> -- Reuti
> 
> PS: In your example you also had the case of 2 slots in the low priority 
> queue; what is the actual setup in your cluster?
> 
> 
>>>> @Ralph: it could work if SGE would have a facility to request the desired 
>>>> queue in `qrsh -inherit ...`, because then the $TMPDIR would be unique for 
>>>> each orted again (assuming it's using different ports for each).
>>> 
>>> Gotcha! I suspect getting the allocator to handle this cleanly is the 
>>> better solution, though.
>> 
>> If I can help (testing patches, e.g.), let me know.
>> 
>> -- 
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

