On Mar 25, 2017, at 3:04 AM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> I’m not sure about this. It was my understanding that HyperThreading is 
> implemented as a second set of e.g. registers that share execution units. 
> There’s no division of the resources between the hardware threads, but rather 
> the execution units switch between the two threads as they stall (e.g. cache 
> miss, hazard/dependency, misprediction, …) — kind of like a context switch, 
> but much cheaper. As long as there’s nothing being scheduled on the other 
> hardware thread, there’s no impact on the performance. Moreover, turning HT 
> off in the BIOS doesn’t make more resources available to the now-single hardware 
> thread.

Here's an old post on this list where I cited a paper from the Intel Technology 
Journal.  The paper is pretty old at this point (2002, I believe?), but it was 
published near the beginning of Intel's HT technology:

    https://www.mail-archive.com/hwloc-users@lists.open-mpi.org/msg01135.html

The paper is attached to that post; see, in particular, the section 
"Single-task and multi-task modes".

All this being said, I'm a software wonk with a decent understanding of 
hardware, but I don't closely follow the specific details of every processor 
generation.  So if Haswell / Broadwell / Skylake processors, for example, are 
substantially different from the HT architecture described in that paper, 
please feel free to correct me!

> This matches our observations on our cluster — there was no 
> statistically-significant change in performance between having HT turned off 
> in the BIOS and turning the second hardware thread of each core off in Linux. 
> We run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell (all 
> dual-socket Xeon E5s), and KNL, and this appears to hold true across all of these.

These are very complex architectures; the impacts of enabling/disabling HT are 
going to be highly specific to both the platform and application.
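
(As a quick sanity check on which configuration a node is actually in, here's a 
rough, untested sketch using the hwloc C API that prints how many PUs each core 
currently exposes.  Since hwloc builds its topology from online PUs by default, 
it should report 1 PU per core when the second hardware thread of each core has 
been taken offline in Linux.)

    #include <stdio.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        int i, ncores;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            /* hwloc only sees online PUs by default, so an offlined second
               hardware thread won't be counted here */
            printf("core %u: %d PU(s)\n", core->logical_index,
                   (int) hwloc_get_nbobjs_inside_cpuset_by_type(topo, core->cpuset,
                                                                HWLOC_OBJ_PU));
        }

        hwloc_topology_destroy(topo);
        return 0;
    }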

> Moreover, having the second hardware thread turned on in Linux but not used 
> by batch jobs (by cgroup-ing them to just one hardware thread of each core) 
> substantially reduced the performance impact and jitter from the OS — by ~10% 
> in at least one synchronisation-heavy application. This is likely because the 
> kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …) 
> on the second, unused hardware thread of each core, which were then run when 
> the batch job’s processes stalled the CPU’s execution units. This is with 
> both a CentOS 6.x kernel and a custom (tickless) 7.2 kernel.

Yes, that's a pretty clever use of HT in an HPC environment.  But be aware that 
doing this cuts into the on-core pipeline resources that applications could 
otherwise use.  In your setup, it sounds like this is still a net performance 
win (which is pretty sweet).  But that may not be a universal effect.
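
If it helps anyone experiment with the same idea, here's a rough, untested 
sketch using the hwloc C API that binds the calling process to just the first 
PU (hardware thread) of each core.  Ben's setup used cgroups at the job level, 
so treat this as an illustration of the concept rather than his exact 
mechanism:

    #include <stdio.h>
    #include <stdlib.h>
    #include <hwloc.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_bitmap_t set;
        int i, ncores;
        char *str;

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        set = hwloc_bitmap_alloc();
        ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            /* keep only the lowest-numbered PU (hardware thread) of each core */
            hwloc_bitmap_set(set, hwloc_bitmap_first(core->cpuset));
        }

        hwloc_bitmap_asprintf(&str, set);
        printf("binding to first PU of each core: %s\n", str);
        free(str);

        /* bind this process (and its future children) to that cpuset */
        if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) < 0)
            perror("hwloc_set_cpubind");

        hwloc_bitmap_free(set);
        hwloc_topology_destroy(topo);
        return 0;
    }

In practice you'd apply the equivalent restriction through the batch system / 
cpuset cgroup for the whole job, of course, rather than per-process.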

This is probably a +3 on the existing trend from the prior emails in this 
thread: "As always, experiment to find the best for your hardware and jobs."  
;-)

-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users