Hi Ralph,

On 16.11.2010 at 15:40, Ralph Castain wrote:

> > 2. have SGE bind procs it launches to -all- of those cores. I believe SGE 
> > does this automatically to constrain the procs to running on only those 
> > cores.
> 
> This is another "bug/feature" in SGE: it's a matter of discussion whether 
> the shepherd should get exactly one core (in case you use more than one 
> `qrsh` per node) for each call, or *all* cores assigned (which we need right 
> now, as the processes in Open MPI will be forks of the orte daemon). I filed 
> an issue about such a situation a long time ago, and a 
> "limit_to_one_qrsh_per_host yes/no" setting in the PE definition would do 
> (this setting should then also change the core allocation of the master 
> process):
> 
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
> 
> I believe this is indeed the crux of the issue

Fantastic that we share the same view.


> > 3. tell OMPI to --bind-to-core.
> >
> > In other words, tell SGE to allocate a certain number of cores on each 
> > node, but to bind each proc to all of them (i.e., don't bind a proc to a 
> > specific core). I'm pretty sure that is a standard SGE option today (at 
> > least, I know it used to be). I don't believe any patch or devel work is 
> > required (to either SGE or OMPI).
> 
> When you use a fixed allocation_rule and a matching -binding request it will 
> work today. But any other case won't be distributed in the correct way.
> 
> Is it possible to not include the -binding request? If SGE is told to use a 
> fixed allocation_rule, and to allocate (for example) 2 cores/node, then won't 
> the orted see itself bound to two specific cores on each node?

When you leave out -binding, every job is allowed to run on any core.


> We would then be okay as the spawned children of orted would inherit its 
> binding. Just don't tell mpirun to bind the processes and the threads of 
> those MPI procs will be able to operate across the provided cores.
> 
> Or does SGE only allocate 2 cores/node in that case (i.e., allocate, but no 
> -binding given), but doesn't bind the orted to any two specific cores? If so, 
> then that would be a problem as the orted would think itself unconstrained. 
> If I understand the thread correctly, you're saying that this is what happens 
> today - true?

Exactly. It won't apply any binding at all, and orted would consider itself 
unconstrained, i.e. limited only by the number of slots it should use on that 
node.
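
For illustration only, here is a minimal sketch (plain Python on Linux, not 
taken from SGE or Open MPI) of the two points above: a child process inherits 
the core mask of its parent, which is why a bound orted would pass its binding 
on to the MPI procs, and a process whose mask covers every core looks 
unconstrained. The helper names affinity_of_child and looks_unbound are made 
up for this example, and comparing against os.cpu_count() is only a rough 
heuristic, not what orted actually does.

```python
import ast
import os
import subprocess
import sys


def affinity_of_child():
    """Spawn a child Python process and return the set of cores it may run on."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(sorted(os.sched_getaffinity(0)))"],
        capture_output=True, text=True, check=True,
    )
    # The child prints a Python list literal such as "[0, 1]".
    return set(ast.literal_eval(result.stdout.strip()))


def looks_unbound(mask):
    """Heuristic (an assumption of this sketch): a mask that covers every
    CPU the OS reports means nobody applied a binding."""
    return len(mask) >= (os.cpu_count() or 1)


# os.sched_getaffinity is available on Linux only.
parent = os.sched_getaffinity(0)
child = affinity_of_child()

print("parent mask:", sorted(parent))
print("child inherits the parent's mask:", child == parent)
print("parent looks unbound:", looks_unbound(parent))
```

Run inside a job submitted without -binding, the last line would be expected 
to report an unbound process; with a binding in effect, parent and child 
should both show the same restricted mask.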

-- Reuti
