Hi Jeff,

> On 25 Mar 2017, at 10:31 am, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> When you enable HT, a) there's 2 hardware threads active, and b) most of the 
> resources in the core are effectively split in half and assigned to each 
> hardware thread.  When you disable HT, a) there's only 1 hardware thread, and 
> b) the resources of the core are allocated to that one hardware thread.

I’m not sure about this. It was my understanding that HyperThreading is 
implemented as a second set of e.g. registers that share execution units. 
There’s no division of the resources between the hardware threads, but rather 
the execution units switch between the two threads as they stall (e.g. cache 
miss, hazard/dependency, misprediction, …) — kind of like a context switch, but 
much cheaper. As long as there’s nothing being scheduled on the other hardware 
thread, there’s no impact on the performance. Moreover, turning HT off in the 
BIOS doesn’t make more resources available to the now-single hardware thread.

This matches our observations on our cluster — there was no 
statistically-significant change in performance between having HT turned off in 
the BIOS and turning the second hardware thread of each core off in Linux. We 
run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell (all 
dual-socket Xeon E5s), and KNL, and this appears to hold true across all of these.
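
For anyone wanting to try the "turn the second hardware thread off in Linux" 
variant, here is a minimal sketch of how the secondary hardware threads could 
be identified. It assumes the standard Linux sysfs topology files 
(/sys/devices/system/cpu/cpuN/topology/thread_siblings_list); the function 
names are hypothetical and this is not necessarily how it was done on the 
cluster described above.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,16' or '0-1' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def secondary_threads(siblings_by_cpu):
    """Return every CPU that is not the lowest-numbered sibling of its core.

    siblings_by_cpu maps a CPU id to the raw contents of that CPU's
    thread_siblings_list file.
    """
    secondary = set()
    for text in siblings_by_cpu.values():
        siblings = parse_cpu_list(text)
        secondary.update(siblings[1:])  # keep the first thread, flag the rest
    return sorted(secondary)


def read_siblings(sysfs_root="/sys/devices/system/cpu"):
    """Read thread_siblings_list for each CPU that exposes a topology directory."""
    result = {}
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "topology", "thread_siblings_list"
    )
    for path in glob.glob(pattern):
        cpu_dir = path.split(os.sep)[-3]  # e.g. 'cpu12'
        with open(path) as handle:
            result[int(cpu_dir[3:])] = handle.read()
    return result
```

Calling secondary_threads(read_siblings()) gives the list of CPU ids to take 
offline; doing so means writing 0 to /sys/devices/system/cpu/cpuN/online as 
root for each of them.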

Moreover, having the second hardware thread turned on in Linux but not used by 
batch jobs (by cgroup-ing them to just one hardware thread of each core) 
substantially reduced the performance impact and jitter from the OS — by ~10% 
in at least one synchronisation-heavy application. This is likely because the 
kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs, Ganglia, PBS, …) on 
the second, unused hardware thread of each core, which were then run when the 
batch job’s processes stalled the CPU’s execution units. This is with both a 
CentOS 6.x kernel and a custom (tickless) 7.2 kernel.

Given these results, we now leave HT on in both the BIOS and OS, and cgroup 
batch jobs to either one or all hardware threads of the allocated cores based 
on a PBS resource request. Most jobs don’t request or benefit from the extra 
hardware threads, but some (e.g. very I/O-heavy) do.
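
The one-or-all choice can be sketched as follows. This is only an 
illustration: the core-to-thread mapping, the function names, and the 
all_threads flag are assumptions, and how PBS actually passes the request 
through is not shown. The resulting string is in the format that a cpuset 
cgroup's cpuset.cpus file accepts.

```python
def to_cpu_list(cpus):
    """Format CPU ids as a compact kernel-style list, e.g. [0, 1, 2, 8] -> '0-2,8'."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def job_cpuset(core_threads, allocated_cores, all_threads=False):
    """Build the cpuset string for a batch job.

    core_threads:    hypothetical mapping of core id -> hardware thread (CPU) ids
    allocated_cores: core ids the scheduler gave to the job
    all_threads:     True if the job requested every hardware thread of its cores
    """
    cpus = []
    for core in allocated_cores:
        threads = sorted(core_threads[core])
        # Default: only the first hardware thread of each core goes to the job,
        # leaving the sibling free for OS tasks.
        cpus.extend(threads if all_threads else threads[:1])
    return to_cpu_list(cpus)
```

With a layout where core N has hardware threads N and N+16, a job allocated 
cores 0 to 2 would be confined to CPUs 0-2 by default, or to 0-2 plus 16-18 if 
it asked for all hardware threads.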

>> My personal experience is, that it depends not only application, but also on 
>> the way how you oversubscribe.
> 
> +1


+2

As always, experiment to find the best for your hardware and jobs.

Cheers,
Ben

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
