On 16.11.2010, at 17:35, Terry Dontje wrote:

> On 11/16/2010 10:59 AM, Reuti wrote:
>> On 16.11.2010, at 15:26, Terry Dontje wrote:
>>
>>>>> <snip>
>>>>> 1. allocate a specified number of cores on each node to your job
>>>>>
>>>> this is currently the bug in the "slot <=> core" relation in SGE, which has to be removed, updated or clarified. For now the slot and core counts are out of sync AFAICS.
>>>>
>>> Technically this isn't a bug but a gap in the allocation rules. I think the solution is a new allocation rule.
>>>
>> Yes, you can phrase it this way. But what do you mean by "new allocation rule"?
>
> The proposal of having a slot allocation rule that forces the number of cores allocated on each node to equal the number of slots.
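(For context: the allocation rule is a field of the parallel environment definition, shown by `qconf -sp <pe_name>`; it currently accepts $fill_up, $round_robin, $pe_slots or a fixed number of slots per host, and that field is where a new core-aware rule would plug in. A minimal sketch of a PE with a fixed rule of 4 slots per host; the PE name "mpi_4" and the slot limit are made-up values for illustration:)

    pe_name            mpi_4
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    NONE
    stop_proc_args     NONE
    allocation_rule    4
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE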
Yep. But then you would end up with $round_robin_cores, $fill_up_cores, $pe_slots_cores and with a fixed value "4 cores". Maybe an additional flag would be more suitable.

>> The slot allocation should follow the specified cores?
>
> The other way around I think.

Yep, agreed.

>>>>> 2. have SGE bind procs it launches to -all- of those cores. I believe SGE does this automatically to constrain the procs to running on only those cores.
>>>>>
>>>> This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* assigned cores (which we need right now, as the processes in Open MPI will be forks of the orted). About such a situation I filed an issue a long time ago, and a "limit_to_one_qrsh_per_host yes/no" switch in the PE definition would do (this setting should then also change the core allocation of the master process):
>>>>
>>>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254
>>>
>>> Isn't it almost required to have the shepherd bind to all the cores so that the orted inherits that binding?
>>>
>> Yes, for orted. But if you have any other (legacy) application which calls `qrsh` N times to an exechost when it got N slots thereon, then only one core should be bound to each of the started shepherds.
>>
> Blech. Not sure of the solution for that but I see what you are saying now :-).

:-)

>>>>> 3. tell OMPI to --bind-to-core.
>>>>>
>>>>> In other words, tell SGE to allocate a certain number of cores on each node, but to bind each proc to all of them (i.e., don't bind a proc to a specific core). I'm pretty sure that is a standard SGE option today (at least, I know it used to be). I don't believe any patch or devel work is required (to either SGE or OMPI).
>>>>>
>>>> When you use a fixed allocation_rule and a matching -binding request it will work today. But any other case won't be distributed in the correct way.
>>>>
>>> Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?
>>>
>> You posted the two cases yesterday. Do we agree that both cases aren't correct, or do you think it's a correct allocation for both cases? Even if it could be "repaired" in Open MPI, it would be better to fix the generated 'pe' PE hostfile and 'set' allocation, i.e. the "slot <=> cores" relation.
>>
> So I am not a GE type of guy, but from what I've been led to believe, what happened is correct (in some form of correct). That is, in case one we asked for a core allocation of 1 core per node, and a core allocation of 2 cores in the other case. That is what we were given. The fact that we distributed the slots in a non-uniform manner I am not sure is GE's fault. Note I can understand where it may seem non-intuitive and not nice for people wanting to do things like this.
>
>>> In the original case of 7 nodes and their processes: if we do -binding pe linear:2 and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the nodes to bind to one core, and the two processes on the 7th node to each be bound to different cores on the same machine.
>>>
>> Yes, possibly it could be repaired this way (for now I have no free machines to play with). But then the cores "reserved" by the "-binding pe linear:2" are lost for other processes on these 6 nodes, and the core count gets out of sync with the slot count.
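(Aside, to make the "fixed allocation_rule and a matching -binding request" combination above concrete, a submission along these lines should work today; the PE name "mpi_4", the slot count, and the script names are made-up values for illustration:)

    # Request 8 slots from a PE whose allocation_rule is the fixed
    # value 4, and ask for 4 successive cores per host. With the
    # binding_instance "pe", SGE does not bind anything itself but
    # records the chosen cores in the pe_hostfile for Open MPI:
    qsub -pe mpi_4 8 -binding pe linear:4 job.sh

    # inside job.sh: have Open MPI bind each rank to one core and
    # print the resulting bindings:
    mpirun --bind-to-core --report-bindings ./a.out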
> Right, if you want to right-size the number of cores allocated to the slots allocated on each node, then we are stuck unless a new allocation rule is made.

Great.

>>> Can we get the full output of such a run with -report-bindings turned on? I think we would find out that things actually are happening correctly, except for the fact that 6 of the nodes have 2 cores allocated but only one is being bound to by a process.
>>>
>> You mean, to accept the current behavior as the intended one, since in the end, with only one job running on these machines, we get what we asked for - despite the fact that cores are lost for other processes?
>>
> Yes, that is what I mean. I first would like to prove, at least to myself, that things are working the way we think they are. I believe the discussion of recovering the lost cores is the next step. Either we redefine what -binding linear:X means in light of slots, we make a new allocation rule -binding slots:X, or we live with the lost cores. Note, the "we" here is loosely used. I am by no means the keeper of GE and just injected myself into this discussion because, like Ralph, I have dealt with binding and I work for Oracle, which develops GE. Just to be clear: I do not work in the Grid Engine group, but I have talked with them about this thread, which has, for good or bad, formed my opinion above.

Okay. But the discussion is archived here and we got some statements which will also be of use for other developers.

-- Reuti