Hi Ralph,

On 16.11.2010 at 15:40, Ralph Castain wrote:

> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE
> > does this automatically to constrain the procs to running on only those
> > cores.
>
> This is another "bug/feature" in SGE: it's a matter of discussion whether
> the shepherd should get exactly one core (in case you use more than one
> `qrsh` per node) for each call, or *all* cores assigned (which we need right
> now, as the processes in Open MPI will be forks of the orte daemon). I filed
> an issue about such a situation a long time ago, and
> "limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this
> setting should then also change the core allocation of the master process):
>
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>
> I believe this is indeed the crux of the issue

Fantastic to share the same view.

> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each
> > node, but to bind each proc to all of them (i.e., don't bind a proc to a
> > specific core). I'm pretty sure that is a standard SGE option today (at
> > least, I know it used to be). I don't believe any patch or devel work is
> > required (to either SGE or OMPI).
>
> When you use a fixed allocation_rule and a matching -binding request it will
> work today. But any other case won't be distributed in the correct way.
>
> Is it possible to not include the -binding request? If SGE is told to use a
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't
> the orted see itself bound to two specific cores on each node?

When you leave out the -binding, all jobs are allowed to run on any core.

> We would then be okay as the spawned children of orted would inherit its
> binding. Just don't tell mpirun to bind the processes and the threads of
> those MPI procs will be able to operate across the provided cores.
>
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no
> -binding given), but doesn't bind the orted to any two specific cores? If so,
> then that would be a problem as the orted would think itself unconstrained.
> If I understand the thread correctly, you're saying that this is what happens
> today - true?

Exactly. It won't apply any binding at all, and orted would consider itself
unbound, i.e. limited only by the number of slots it should use on that node.

-- Reuti
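
To see which of the two cases applies on a given node, one could run a small check inside the job (for example in the job script) and compare the CPU set the process is allowed to run on with the node's core count. The following is only a minimal sketch, assuming a Linux execution host where Python's `os.sched_getaffinity` is available; the helper names `describe_binding` and `current_affinity` are made up for illustration and are not part of SGE or Open MPI.

```python
import os


def describe_binding(allowed, total):
    """Classify a set of allowed CPU ids against the node's core count.

    `allowed` is the set of CPU ids the process may run on and `total`
    is the number of cores on the node (both supplied by the caller).
    """
    if len(allowed) >= total:
        # The process may run on every core: no binding was applied.
        return "unbound"
    return "bound to %d of %d cores" % (len(allowed), total)


def current_affinity():
    """Return the CPU set of this process, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return os.sched_getaffinity(0)
    return None


if __name__ == "__main__":
    allowed = current_affinity()
    total = os.cpu_count()
    if allowed is None or total is None:
        print("CPU affinity is not available on this platform")
    else:
        print(sorted(allowed), describe_binding(allowed, total))
```

On Linux a child created by fork inherits its parent's CPU affinity mask, so a check like this run as a child of the shepherd or of orted reports the same set the parent sees. Without a -binding request one would expect it to report "unbound", as described above.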