"SLIM H.A." <h.a.s...@durham.ac.uk> writes:

> We switched on hyper threading on our cluster with two eight core
> sockets per node (32 threads per node).

Assuming that's Xeon-ish hyperthreading, the best advice is not to.  It
will typically hurt performance of HPC applications, not least if it
defeats core binding, and it is likely to cause confusion with resource
managers.  If there are specific applications which benefit from it,
under Linux you can enable it on the relevant cores for the duration of
jobs which ask for it.
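
Untested sketch, assuming the usual Linux sysfs layout (paths and CPU
numbering vary by system), of offlining the extra hardware threads so
they can be brought back per-job:

  # run as root, e.g. from an init or prolog script: take every
  # hardware thread that is not the first sibling of its core offline
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
      sib=$cpu/topology/thread_siblings_list
      [ -f "$sib" ] || continue
      first=$(sed 's/[,-].*//' "$sib")    # first sibling of this core
      [ "${cpu##*cpu}" = "$first" ] || echo 0 > "$cpu/online"
  done

Writing 1 back to the same online files re-enables them for jobs which
want HT.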

> We configured  gridengine with 16 slots per node to allow the 16 extra
> threads for kernel process use

Have you actually measured that?  We did, and we switch off HT at boot
time.  We've never had cause to turn it on, though there might be a few
jobs which could use it.
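
(For what it's worth, lscpu from util-linux shows the state directly;
on nodes like yours I'd expect something like

  $ lscpu | grep -E 'Thread|Core|Socket'
  Thread(s) per core:    2
  Core(s) per socket:    8
  Socket(s):             2

with HT on, and 1 thread per core with it off.)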

> but this apparently does not work. Printout of the gridengine hostfile
> shows that for a 32 slots job, 16 slots are placed on each of two
> nodes as expected. Including the openmpi --display-map option shows
> that all 32 processes are incorrectly placed on the head node. Here is
> part of the output

If OMPI is scheduling by thread, then that's what you'd expect.  (As far
as I know, SGE will DTRT, binding a core per slot in that case, but
I'll look at bug reports if not.)
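
If you want the binding explicit, you can request it at submit time,
something like (check qsub(1) for the exact syntax in your version;
"mpi" is a stand-in for your PE name):

  qsub -pe mpi 32 -binding pe linear:16 job.sh

where the "pe" type puts the binding information into $PE_HOSTFILE for
the MPI to pick up, rather than having SGE apply it itself.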

> I found some related mailings about a new warning in 1.8.2 about 
> oversubscription and  I tried a few options to avoid the use of the extra 
> threads for MPI tasks by openmpi without success, e.g. variants of
>
> --cpus-per-proc 1 
> --bind-to-core 
>
> and some others. Gridengine treats hw threads as cores==slots (?)

What a slot means is up to you, but if you want to do core binding at
all sensibly, it needs to correspond to a core.  You can fiddle things
in the job itself (see the recent thread that Mark started about OMPI
--np != SGE NSLOTS).
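
E.g., if you do hand SGE one slot per hardware thread, something like
this in the job script should give one rank per physical core
(untested; a.out stands in for your program):

  mpirun -np $((NSLOTS / 2)) --map-by core --bind-to core ./a.out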

> but the content of $PE_HOSTFILE suggests it distributes the slots
> sensibly  so it seems there is an option for openmpi required to get
> 16 cores per node?

I'm not sure precisely what you want, but with OMPI 1.8 you should be
able to lay out the job by core.  That may require exclusive node
access, which makes SGE core binding a null operation.
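
With 1.8, something along these lines should do it (a.out is a
placeholder):

  mpirun --map-by ppr:16:node --bind-to core --report-bindings ./a.out

--report-bindings will show you where the ranks actually ended up.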

> I tried both 1.8.2, 1.8.3 and also 1.6.5.
>
> Thanks for some clarification that anyone can give.

The above is for the current SGE with a recent hwloc.  If Durham are
still using an ancient version, it may not apply, but that should be
irrelevant with -l exclusive or a fixed-count PE.
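
For reference, the relevant lines of a fixed-count PE would look
something like this (qconf -sp output; the name and slot total are
placeholders):

  pe_name            mpi16
  slots              9999
  allocation_rule    16
  control_slaves     TRUE
  job_is_first_task  FALSE

and -l exclusive needs the usual boolean "exclusive" complex with the
EXCL relation configured on the hosts.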
