Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts

Terry Dontje Tue, 16 Nov 2010 11:37:18 -0500

On 11/16/2010 10:59 AM, Reuti wrote:

On 16.11.2010 at 15:26, Terry Dontje wrote:

<snip>
1. allocate a specified number of cores on each node to your job

this is currently the bug in the "slot<=>  core" relation in SGE, which has to 
be removed, updated or clarified. For now slot and core count are out of sync AFAICS.

Technically this isn't a bug but a gap in the allocation rule.  I think the 
solution is a new allocation rule.

Yes, you can phrase it this way. But what do you mean by "new allocation rule"?

The proposal is to have a slot allocation rule that forces the number of cores allocated on each node to equal the number of slots.

The slot allocation should follow the specified cores?

The other way around I think.

2. have SGE bind procs it launches to -all- of those cores. I believe SGE does 
this automatically to constrain the procs to running on only those cores.

This is another "bug/feature" in SGE: it's a matter of discussion whether the shepherd 
should get exactly one core for each call (in case you use more than one `qrsh` per node), or *all* 
cores assigned (which we need right now, as the processes in Open MPI will be forks of the orte 
daemon). I filed an issue about such a situation a long time ago, and 
"limit_to_one_qrsh_per_host yes/no" in the PE definition would do (this setting should 
then also change the core allocation of the master process):


http://gridengine.sunsource.net/issues/show_bug.cgi?id=1254

Isn't it almost required to have the shepherd bind to all the cores so that the 
orted inherits that binding?

Yes, for orted. But if you have any other (legacy) application which 
uses `qrsh` N times to an exechost when you got N slots thereon, then only one 
core should be bound to each of the started shepherds.

Blech. Not sure of the solution for that, but I see what you are saying now :-).

3. tell OMPI to --bind-to-core.

In other words, tell SGE to allocate a certain number of cores on each node, 
but to bind each proc to all of them (i.e., don't bind a proc to a specific 
core). I'm pretty sure that is a standard SGE option today (at least, I know it 
used to be). I don't believe any patch or devel work is required (to either SGE 
or OMPI).

When you use a fixed allocation_rule and a matching -binding request it will 
work today. But any other case won't be distributed in the correct way.

Ok, so what is the "correct" way, and are we sure it isn't distributed correctly?

You posted the two cases yesterday. Do we agree that both cases aren't correct, or do you think it's a 
correct allocation for both cases? Even if it could be "repaired" in Open MPI, it would be 
better to fix the generated 'pe' PE hostfile and 'set' allocation, i.e. the "slot<=>  
cores" relation.

So I am not a GE type of guy, but from what I've been led to believe, what happened is correct (in some form of correct). That is, in one case we asked for a core allocation of 1 core per node, and in the other case a core allocation of 2 cores. That is what we were given. The fact that we distributed the slots in a non-uniform manner I am not sure is GE's fault. Note I can understand where it may seem non-intuitive and not nice for people wanting to do things like this.

In the original case of 7 nodes and processes, if we do -binding pe linear:2 
and add -bind-to-core to mpirun, I'd actually expect the processes on 6 of the 
nodes to bind to one core each, and the 7th node with 2 processes to have each of 
those processes bound to different cores on the same machine.

Yes, possibly it could be repaired this way (for now I have no free machines to play with). But 
then the cores "reserved" by the "-binding pe linear:2" are lost for other 
processes on these 6 nodes, and the core count gets out of sync with the slot count.
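
To put a number on the "lost" cores, here is a minimal sketch in Python. It is only a toy model of the accounting, not SGE or Open MPI code. It assumes the slot distribution described above (one slot on each of six nodes, two on the seventh), that "-binding pe linear:2" reserves two cores on every node, and that each process ends up bound to one core. The names `slots_per_node` and `core_usage` are made up for illustration.

```python
# Toy model of the case discussed above; not SGE or Open MPI code.
# Assumption: six nodes got 1 slot each and the seventh got 2 slots.
slots_per_node = [1, 1, 1, 1, 1, 1, 2]

# Assumption: "-binding pe linear:2" reserves 2 cores on every node.
cores_reserved_per_node = 2


def core_usage(slots, cores_reserved):
    """Return a (bound, idle) pair per node, with one core bound per process."""
    usage = []
    for node_slots in slots:
        bound = min(node_slots, cores_reserved)
        usage.append((bound, cores_reserved - bound))
    return usage


usage = core_usage(slots_per_node, cores_reserved_per_node)
total_idle = sum(idle for _, idle in usage)

for node, (bound, idle) in enumerate(usage, start=1):
    print(f"node {node}: bound={bound} idle={idle}")
print("reserved but idle cores:", total_idle)
```

Under these assumptions each of the six single-slot nodes keeps one reserved core that no process is bound to, which is the slot/core mismatch being discussed.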

Right, if you want to rightsize the amount of cores allocated to slots allocated on each node, then we are stuck unless a new allocation rule is made.

Can we get a full output of such a run with -report-bindings turned on?  I 
think we should find out that things actually are happening correctly, except 
for the fact that 6 of the nodes have 2 cores allocated but only one is 
being bound to by a process.

You mean to accept the current behavior as the intended one, since in the end, 
with only one job running on these machines, we get what we asked for, 
despite the fact that cores are lost for other processes?

Yes, that is what I mean. I first would like to prove, at least to myself, that things are working the way we think they are. I believe the discussion of recovering the lost cores is the next step. Either we redefine what -binding linear:X means in light of slots, we make a new allocation rule -binding slots:X, or we live with the lost cores. Note, the "we" here is loosely used. I am by no means the keeper of GE and just injected myself in this discussion because, like Ralph, I have dealt with binding and I work for Oracle, which develops GE. Just to be clear, I do not work in the Grid Engine group but have talked with them about this thread, which has, for good or bad, formed my opinion above.
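
For comparison, a rough Python sketch of two of these options. It is purely illustrative: "-binding slots:X" does not exist today and its semantics are not defined in this thread, so `reserve_matching_slots` below simply assumes a rule where the cores reserved on a node equal the slots granted there. All function names are hypothetical, and the slot distribution is the one from the 7-node case.

```python
# Illustrative comparison only; none of this is SGE code.

def reserve_linear(slots, x):
    """Model of today's "-binding linear:X": X cores on every node."""
    return [x for _ in slots]


def reserve_matching_slots(slots):
    """Assumed model of a new rule: cores reserved equal slots granted."""
    return list(slots)


def idle_cores(slots, reserved):
    """Reserved cores with no process bound (one core per process)."""
    return sum(max(r - s, 0) for s, r in zip(slots, reserved))


# Assumption: six nodes with 1 slot each, a seventh with 2 slots.
slots = [1, 1, 1, 1, 1, 1, 2]

idle_linear = idle_cores(slots, reserve_linear(slots, 2))
idle_matching = idle_cores(slots, reserve_matching_slots(slots))

print("linear:2 idle cores:", idle_linear)
print("cores-follow-slots idle cores:", idle_matching)
```

The third option, living with the lost cores, corresponds to keeping the linear:2 result as it is.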


--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com <mailto:terry.don...@oracle.com>
