Just to be clear: I take it that the first entry is the host name, and the second is the number of slots allocated on that host?
FWIW: I see the problem. Our parser was apparently written assuming every line was a unique host, so it doesn't even check to see if there is duplication. Easy fix - can shoot it to you today.

On Mar 15, 2012, at 6:53 AM, Reuti wrote:

> On 15.03.2012 at 05:22, Joshua Baker-LePain wrote:
>
>> On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote
>>
>>> On Mar 14, 2012, at 5:44 PM, Reuti wrote:
>>
>>>> (I was just typing when Ralph's message came in: I can confirm this. To
>>>> avoid it, it would mean for Open MPI to collect all lines from the
>>>> hostfile which are on the same machine. SGE creates entries for each
>>>> queue/host pair in the machine file).
>>>
>>> Hmmm…I can take a look at the allocator module and see why we aren't doing
>>> it. Would the host names be the same for the two queues?
>>
>> I can't speak authoritatively like Reuti can, but here's what a hostfile
>> looks like on my cluster (note that all our name resolution is done via
>> /etc/hosts -- there's no DNS involved):
>>
>> iq103 8 lab.q@iq103 <NULL>
>> iq103 1 test.q@iq103 <NULL>
>> iq104 8 lab.q@iq104 <NULL>
>> iq104 1 test.q@iq104 <NULL>
>> opt221 2 lab.q@opt221 <NULL>
>> opt221 1 test.q@opt221 <NULL>
>
> Yes, exactly: this needs to be parsed, adding up all entries therein for
> one and the same machine.
>
> If you need it instantly, it could be put in a wrapper for start_proc_args of
> the PE (and Open MPI compiled without SGE support), so that a custom-built
> machinefile can be used. In this case the rsh or ssh call also needs to be
> caught.
>
> Often the opposite is desired in an SGE setup: tune it so that all slots are
> coming from one queue only.
>
> But I still wonder whether it is possible to tune your setup in a similar
> way: allow one slot more in the high priority queue (long.q) in case it's a
> parallel job, with an RQS (assuming 8 cores with one core oversubscription):
>
> limit queues long.q pes * to slots=9
> limit queues long.q to slots=8
>
> while you have an additional short.q (the low priority queue) there with one
> slot. The overall limit is still set on an exechost level to 9. The PE is
> then only attached to long.q.
>
> -- Reuti
>
> PS: In your example you also had the case of 2 slots in the low priority
> queue, what is the actual setup in your cluster?
>
>
>>>> @Ralph: it could work if SGE would have a facility to request the desired
>>>> queue in `qrsh -inherit ...`, because then the $TMPDIR would be unique for
>>>> each orted again (assuming it's using different ports for each).
>>>
>>> Gotcha! I suspect getting the allocator to handle this cleanly is the
>>> better solution, though.
>>
>> If I can help (testing patches, e.g.), let me know.
>>
>> --
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
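
For illustration only, the aggregation discussed in this thread (adding up all entries for one and the same machine) could look roughly like the Python sketch below. This is not the Open MPI allocator code, which is not shown in this thread; the function name aggregate_slots is hypothetical, and the sketch assumes the first field of each line is the host name and the second is the slot count, as in Joshua's example hostfile.

```python
def aggregate_slots(lines):
    """Sum slot counts per host, keeping hosts in first-seen order.

    Assumed line format (from the example hostfile in this thread):
        <host> <slots> <queue@host> <processor range>
    Only the first two fields are used.
    """
    totals = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # A host may appear once per queue; accumulate instead of
        # treating every line as a unique host.
        totals[host] = totals.get(host, 0) + slots
    return totals


if __name__ == "__main__":
    hostfile = [
        "iq103 8 lab.q@iq103 <NULL>",
        "iq103 1 test.q@iq103 <NULL>",
        "iq104 8 lab.q@iq104 <NULL>",
        "iq104 1 test.q@iq104 <NULL>",
        "opt221 2 lab.q@opt221 <NULL>",
        "opt221 1 test.q@opt221 <NULL>",
    ]
    for host, slots in aggregate_slots(hostfile).items():
        print(host, slots)
```

With the example hostfile, each of iq103 and iq104 ends up with 9 slots and opt221 with 3, instead of six separate host entries.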