Hi Jeff,

> On 25 Mar 2017, at 10:31 am, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> When you enable HT, a) there's 2 hardware threads active, and b) most of the
> resources in the core are effectively split in half and assigned to each
> hardware thread. When you disable HT, a) there's only 1 hardware thread, and
> b) the resources of the core are allocated to that one hardware thread.
I’m not sure about this. It was my understanding that HyperThreading is implemented as a second set of architectural state (e.g. registers) that shares the core’s execution units. There’s no static division of resources between the hardware threads; rather, the execution units switch between the two threads as they stall (e.g. on a cache miss, hazard/dependency, misprediction, …), rather like a context switch but much cheaper. As long as nothing is scheduled on the other hardware thread, there’s no impact on performance. Moreover, turning HT off in the BIOS doesn’t make more resources available to the now-single hardware thread.

This matches our observations on our cluster: there was no statistically significant difference in performance between having HT turned off in the BIOS and turning the second hardware thread of each core off in Linux. We run a mix of architectures (Sandy, Ivy, Haswell, and Broadwell, all dual-socket Xeon E5s, plus KNL), and this appears to hold true across all of them.

Moreover, having the second hardware thread turned on in Linux but not used by batch jobs (by cgroup-ing them to just one hardware thread of each core) substantially reduced the performance impact and jitter from the OS: by ~10% in at least one synchronisation-heavy application. This is likely because the kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …) on the second, unused hardware thread of each core, so they ran while the batch job’s processes had stalled the CPU’s execution units. This is with both a CentOS 6.x kernel and a custom (tickless) 7.2 kernel.

Given these results, we now leave HT on in both the BIOS and the OS, and cgroup batch jobs to either one or all hardware threads of the allocated cores based on a PBS resource request. Most jobs don’t request or benefit from the extra hardware threads, but some (e.g. very I/O-heavy ones) do.

>> My personal experience is, that it depends not only application, but also on
>> the way how you oversubscribe.
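For what it’s worth, here’s a minimal sketch of the "one hardware thread per core" scheme we use: given each core’s sibling-thread list (as Linux reports in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list), keep only the lowest-numbered thread of each core and render that as a cpuset.cpus string. The function names and the example topology (siblings 16 apart) are illustrative, not from any particular tool:

```python
def primary_threads(sibling_lists):
    """Given each core's hardware-thread IDs (one list per core, as parsed
    from thread_siblings_list in sysfs), return the lowest-numbered thread
    of each core, i.e. the CPUs a batch job should be confined to."""
    return sorted({min(core) for core in sibling_lists})

def cpuset_spec(cpus):
    """Render a CPU list as a cpuset.cpus-style string, e.g. [0, 1] -> '0,1'."""
    return ",".join(str(c) for c in cpus)

# Example: two cores whose sibling threads are numbered 16 apart,
# a common layout on dual-socket Xeons.
cores = [[0, 16], [1, 17]]
print(cpuset_spec(primary_threads(cores)))  # -> 0,1
```

Writing the resulting string into the job cgroup’s cpuset.cpus leaves the sibling threads online for the kernel’s own tasks while keeping them away from the job.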
> +1

+2. As always, experiment to find what works best for your hardware and jobs.

Cheers,
Ben
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users