"SLIM H.A." <h.a.s...@durham.ac.uk> writes: > We switched on hyper threading on our cluster with two eight core > sockets per node (32 threads per node).
Assuming that's Xeon-ish hyperthreading, the best advice is not to.  It
will typically hurt performance of HPC applications, not least if it
defeats core binding, and it is likely to cause confusion with resource
managers.  If there are specific applications which benefit from it,
under Linux you can switch it on on the relevant cores for the duration
of jobs which ask for it (a rough sketch of one way is appended below).

> We configured gridengine with 16 slots per node to allow the 16 extra
> threads for kernel process use

Have you actually measured that?  We did, and we switch off HT at boot
time.  We've never had cause to turn it on, though there might be a few
jobs which could use it.

> but this apparently does not work. Printout of the gridengine hostfile
> shows that for a 32 slots job, 16 slots are placed on each of two
> nodes as expected. Including the openmpi --display-map option shows
> that all 32 processes are incorrectly placed on the head node. Here is
> part of the output

If OMPI is scheduling by thread, then that's what you'd expect.  (As
far as I know, SGE will do the right thing, binding a core per slot in
that case, but I'll look at bug reports if not.)

> I found some related mailings about a new warning in 1.8.2 about
> oversubscription and I tried a few options to avoid the use of the extra
> threads for MPI tasks by openmpi without success, e.g. variants of
>
> --cpus-per-proc 1
> --bind-to-core
>
> and some others. Gridengine treats hw threads as cores==slots (?)

What a slot is is up to you, but if you want to do core binding at all
sensibly, it needs to correspond to a core.  You can fiddle things in
the job itself (see the recent thread that Mark started about OMPI
--np != SGE NSLOTS).

> but the content of $PE_HOSTFILE suggests it distributes the slots
> sensibly so it seems there is an option for openmpi required to get
> 16 cores per node?

I'm not sure precisely what you want, but with OMPI 1.8 you should be
able to lay out the job by core if that's what you want.  That may
require exclusive node access, which makes SGE core binding a null
operation.

> I tried both 1.8.2, 1.8.3 and also 1.6.5.
>
> Thanks for some clarification that anyone can give.

The above is for the current SGE with a recent hwloc.  If Durham are
still using an ancient version, it may not apply, but that should be
irrelevant with -l exclusive or a fixed-count PE.
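
For what it's worth, roughly what I mean by a fixed-count PE plus core
binding looks like the following.  The PE name mpi16, job.sh and a.out
are made up for illustration, and the binding amount assumes your
16-core nodes; adjust to your site:

  # hypothetical PE giving 16 slots per node (as shown by "qconf -sp mpi16")
  pe_name            mpi16
  slots              9999
  allocation_rule    16
  control_slaves     TRUE
  job_is_first_task  FALSE

  # submission asking for two whole nodes' worth of cores, with SGE
  # core binding
  qsub -pe mpi16 32 -binding linear:16 job.sh

  # in job.sh, with OMPI 1.8.x, lay out 16 processes per node, each
  # bound to a core
  mpirun --map-by ppr:16:node --bind-to core ./a.out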
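
On switching HT on per job: assuming the stock Linux sysfs layout and
the usual Intel numbering, where each core's thread_siblings_list reads
like "3,19" with the sibling as the second field, you can toggle the
sibling threads from a queue prolog/epilog.  A rough sketch, not
something we actually run:

  # prolog, for a job which asks for HT: bring any offline (sibling)
  # threads online (needs root)
  for f in /sys/devices/system/cpu/cpu[0-9]*/online; do
      echo 1 > "$f"
  done

  # epilog: take the HT sibling of each core offline again
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
      n=${cpu##*cpu}
      # second field of thread_siblings_list is the sibling; empty for
      # threads which are themselves siblings or already offline
      sib=$(cut -s -d, -f2 "$cpu/topology/thread_siblings_list" 2>/dev/null)
      if [ -n "$sib" ] && [ "$sib" != "$n" ]; then
          echo 0 > /sys/devices/system/cpu/cpu$sib/online
      fi
  done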