On Mar 25, 2017, at 3:04 AM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>
> I’m not sure about this. It was my understanding that HyperThreading is
> implemented as a second set of e.g. registers that share execution units.
> There’s no division of the resources between the hardware threads, but rather
> the execution units switch between the two threads as they stall (e.g. cache
> miss, hazard/dependency, misprediction, …) — kind of like a context switch,
> but much cheaper. As long as there’s nothing being scheduled on the other
> hardware thread, there’s no impact on the performance. Moreover, turning HT
> off in the BIOS doesn’t make more resources available to the now-single
> hardware thread.
Here's an old post on this list where I cited a paper from the Intel
Technology Journal. The paper is pretty old at this point (2002, I believe?),
but it was published near the beginning of the HT technology at Intel:

https://www.mail-archive.com/hwloc-users@lists.open-mpi.org/msg01135.html

The paper is attached to that post; see, in particular, the section
"Single-task and multi-task modes".

All this being said, I'm a software wonk with a decent understanding of
hardware, but I don't closely follow the specific details of every processor
generation. So if Haswell / Broadwell / Skylake processors, for example, are
substantially different from the HT architecture described in that paper,
please feel free to correct me!

> This matches our observations on our cluster — there was no
> statistically-significant change in performance between having HT turned off
> in the BIOS and turning the second hardware thread of each core off in Linux.
> We run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell (all
> dual-socket Xeon E5s), and KNL, and this appears to hold true across all of
> these.

These are very complex architectures; the impacts of enabling/disabling HT are
going to be highly specific to both the platform and the application.

> Moreover, having the second hardware thread turned on in Linux but not used
> by batch jobs (by cgroup-ing them to just one hardware thread of each core)
> substantially reduced the performance impact and jitter from the OS — by ~10%
> in at least one synchronisation-heavy application. This is likely because the
> kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …)
> on the second, unused hardware thread of each core, which were then run when
> the batch job’s processes stalled the CPU’s execution units. This is with
> both a CentOS 6.x kernel and a custom (tickless) 7.2 kernel.

Yes, that's a pretty clever use of HT in an HPC environment. But be aware
that, to do this, you are cutting into on-core pipeline depth that
applications could otherwise use. In your setup, it sounds like this is still
a net performance win (which is pretty sweet), but that may not be a universal
effect.

This is probably a +3 on the existing trend from the prior emails in this
thread: "As always, experiment to find the best for your hardware and jobs."
;-)

-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
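
For concreteness, here is a minimal sketch of the per-core pinning Ben
describes above. It is not from the original thread: it assumes a Linux sysfs
topology layout and uses Python's os.sched_setaffinity() on the calling
process rather than the cpuset cgroups a batch system would apply, but the
idea is the same: restrict work to the first hardware thread of each core and
leave the sibling threads free for OS tasks.

    #!/usr/bin/env python3
    # Sketch only (not from the original thread): bind the current process to
    # the first hardware thread of each core. Assumes Linux sysfs topology
    # files and os.sched_setaffinity(), which are Linux-specific.
    import glob
    import os

    def first_siblings():
        """Return the lowest-numbered CPU of every sibling-thread group."""
        firsts = set()
        pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
        for path in glob.glob(pattern):
            with open(path) as f:
                text = f.read().strip()   # e.g. "0,36" or "0-1"
            cpus = []
            for part in text.split(","):
                if "-" in part:
                    lo, hi = part.split("-")
                    cpus.extend(range(int(lo), int(hi) + 1))
                else:
                    cpus.append(int(part))
            firsts.add(min(cpus))         # keep only the first thread of the core
        return firsts

    if __name__ == "__main__":
        cpus = first_siblings()
        if not cpus:
            raise SystemExit("no sysfs CPU topology found (Linux only)")
        os.sched_setaffinity(0, cpus)     # 0 == the calling process
        print("Bound to first hardware threads:", sorted(cpus))

A production setup would typically enforce the same restriction with cpuset
cgroups applied by the batch system, as Ben did, so that every job process is
confined to the first thread of each core rather than relying on a cooperating
process to bind itself.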