Hi,

On 16.11.2010, at 14:07, Ralph Castain wrote:
> Perhaps I'm missing it, but it seems to me that the real problem lies in the interaction between SGE and OMPI during OMPI's two-phase launch. The verbose output shows that SGE dutifully allocated the requested number of cores on each node. However, OMPI launches only one process on each node (the ORTE daemon), which SGE "binds" to a single core since that is what it was told to do.
>
> Since SGE never sees the local MPI procs spawned by ORTE, it can't assign bindings to them. The ORTE daemon senses its local binding (i.e., to a single core in the allocation), and subsequently binds all its local procs to that core.
>
> I believe all you need to do is tell SGE to:
>
> 1. allocate a specified number of cores on each node to your job

This is currently the bug in the "slot <=> core" relation in SGE, which has to be removed, updated, or clarified. For now the slot and core counts are out of sync AFAICS.

> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE does this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* of the assigned cores (which is what we need right now, as the processes in Open MPI will be forks of the orte daemon). I filed an issue about such a situation a long time ago; a "limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this setting should then also change the core allocation of the master process):

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

> 3. tell OMPI to --bind-to-core.
>
> In other words, tell SGE to allocate a certain number of cores on each node, but to bind each proc to all of them (i.e., don't bind a proc to a specific core). I'm pretty sure that is a standard SGE option today (at least, I know it used to be). I don't believe any patch or devel work is required (to either SGE or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will work today, but any other case won't be distributed in the correct way.
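For illustration, a rough (untested) sketch of the case that does work today; the PE name, job script, and binary are made up:

    # hypothetical PE with a fixed allocation rule (only the relevant lines shown)
    $ qconf -sp mpi4
    pe_name            mpi4
    slots              999
    allocation_rule    4
    control_slaves     TRUE
    job_is_first_task  FALSE

    # 16 slots = 4 hosts x 4 slots each, plus a matching request of 4 cores per host
    $ qsub -pe mpi4 16 -binding linear:4 job.sh

    # inside job.sh: let Open MPI bind the ranks within the 4 granted cores
    mpirun --bind-to-core -np $NSLOTS ./mpi_prog

Here the number of cores SGE grants per host equals the number of processes the orte daemon will fork there, so the binding comes out right. With $fill_up or $round_robin the slots per host can differ from the -binding amount, and then you get the misbinding described above.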
-- Reuti


> On Tue, Nov 16, 2010 at 4:07 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 16.11.2010, at 10:26, Chris Jewell wrote:
>
> > Hi all,
> >
> >> On 11/15/2010 02:11 PM, Reuti wrote:
> >>> Just to give my understanding of the problem:
> >>>>
> >>>>>> Sorry, I am still trying to grok all your emails as to what the problem is that you are trying to solve. So is the issue trying to have two jobs with processes on the same node be able to bind their processes to different resources, like core 1 for the first job and cores 2 and 3 for the 2nd job?
> >>>>>>
> >>>>>> --td
> >>
> >> You can't get 2 slots on a machine, as it's limited by the core count to one here, so such a slot allocation shouldn't occur at all.
> >
> > So to clarify: the current -binding <binding_strategy>:<binding_amount> allocates binding_amount cores to each sge_shepherd process associated with a job_id. There appears to be only one sge_shepherd process per job_id per execution node, so all child processes run on these allocated cores. This is irrespective of the number of slots allocated to the node.
> >
> > I agree with Reuti that the binding_amount parameter should be a maximum number of bound cores per node, with the actual number determined by the number of slots allocated per node. FWIW, an alternative approach might be to have another binding_type ('slot', say) that automatically allocates one core per slot.
> >
> > Of course, a complex situation might arise if a user submits a combined MPI/multithreaded job, but then I guess we're into the realm of setting allocation_rule.
>
> IIRC there was a discussion on the [GE users] list about this, to get a uniform distribution across all slave nodes for such jobs, as e.g. $OMP_NUM_THREADS will also be set to the same value on all slave nodes for hybrid jobs. Otherwise it would be necessary to adjust SGE so that the "-builtin-" startup method sets this value automatically on each node to the locally granted slot count. For now a fixed allocation rule of 1, 2, 4 or whatever must be used, and you have to submit requesting a wildcard PE to get any of these defined PEs for an even distribution; then you don't care whether it's two times two slots, one time four slots, or four times one slot.
>
> In my understanding, any type of parallel job should always request and get a total number of slots equal to the cores it needs to execute, independent of whether these are threads, forks, or any hybrid type of job. Otherwise any resource planning and reservation will most likely fail. Nevertheless, there might exist rare cases where you submit an exclusive serial job but create threads/forks in the end. But such a setup should be an exception, not the default.
>
> > Is it going to be worth looking at creating a patch for this?
>
> Absolutely.
>
> > I don't know much of the internals of SGE -- would it be hard work to do? I've not that much time to dedicate towards it, but I could put some effort in if necessary...
>
> I don't know the exact coding for it, but if it's for now a plain "copy" of the binding list, then it should become a loop that builds a list of cores from the original specification until all granted slots have got a core allocated.
>
> -- Reuti
>
> > Chris
> >
> > --
> > Dr Chris Jewell
> > Department of Statistics
> > University of Warwick
> > Coventry
> > CV4 7AL
> > UK
> > Tel: +44 (0)24 7615 0778
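P.S. To illustrate the wildcard PE idea mentioned above (sketch only, the PE names are made up): one PE per fixed allocation rule, and the job requests them with a wildcard so the scheduler can pick whichever gives an even distribution:

    # hypothetical PEs, differing only in their fixed allocation_rule
    $ qconf -sp mpi_2 | grep allocation_rule
    allocation_rule    2
    $ qconf -sp mpi_4 | grep allocation_rule
    allocation_rule    4

    # 4 slots via any PE matching "mpi_*": either 2 hosts x 2 slots or 1 host x 4 slots
    $ qsub -pe "mpi_*" 4 job.sh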
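P.P.S. Regarding the loop over the binding list: roughly what I have in mind, as a plain C sketch (not the actual SGE source, all names invented):

    #include <stdio.h>

    /* Instead of binding the shepherd to only the cores named in the -binding
     * request, keep walking the cores of the host until every granted slot
     * has its own core (or the host runs out of cores). */
    static int expand_linear_binding(int start_core, int cores_on_host,
                                     int granted_slots, int *core_list)
    {
        int n = 0;
        for (int c = start_core; c < cores_on_host && n < granted_slots; c++)
            core_list[n++] = c;
        return n;                 /* number of cores actually assigned */
    }

    int main(void)
    {
        int cores[8];
        /* e.g. a "linear" request starting at core 0, a host with 8 cores,
         * and 4 slots granted on this host */
        int n = expand_linear_binding(0, 8, 4, cores);
        for (int i = 0; i < n; i++)
            printf("slot %d -> core %d\n", i, cores[i]);
        return 0;
    }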