On 3/24/2017 6:10 PM, Reuti wrote:
Hi,

On 24.03.2017 at 20:39, Jeff Squyres (jsquyres) wrote:

Limiting MPI processes to hyperthreads *helps*, but current-generation Intel 
hyperthreads are not as powerful as cores (they have roughly half the resources 
of a core), so -- depending on your application and your exact system setup -- 
you will almost certainly see performance degradation running N MPI processes 
across N hyperthreads vs. across N cores.  You can try it yourself by running 
the same-size application over N cores on a single machine, and then running 
the same application over N hyperthreads (i.e., N/2 cores) on the same machine.
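
To make that experiment concrete with Open MPI's built-in binding options (a 
sketch; the rank count and application name are placeholders, and the flags 
assume a reasonably recent Open MPI):

    # N ranks, one per physical core (HT siblings left idle):
    mpirun -np 16 --map-by core --bind-to core ./my_app

    # The same N ranks packed onto N hardware threads, i.e. N/2 cores,
    # treating each hyperthread as a schedulable CPU:
    mpirun -np 16 --use-hwthread-cpus --map-by hwthread --bind-to hwthread ./my_app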

[…]

- Disabling HT in the BIOS means that the one hardware thread left in each core 
will get all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will 
statically be allocated roughly half the core's resources (buffers, queues, 
processor units, etc.).
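
One way to check from Linux which of the two modes the BIOS is in (a sketch; 
the output is machine-dependent):

    # "Thread(s) per core: 2" means HT is enabled in the BIOS:
    lscpu | grep -i 'thread(s) per core'

    # Hardware threads sharing a physical core with CPU 0:
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list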

Do you have a reference for the two topics above (sure, I will try it myself 
next week)? My understanding was that there is no dedicated HT core: using all 
hardware threads will not give the result that the "real" cores get N x 100% 
and the HT ones an additional N x 50% (or the like). Rather, the scheduler 
inside the CPU balances the resources between the two faces of a single core, 
and both are equal.


[…]
Spoiler alert: many people have looked at this.  In *most* (but not all) cases, 
using HT is not a performance win for MPI/HPC codes that are designed to run 
processors at 100%.

I think it was also on this mailing list that someone mentioned that the 
pipelines in the CPU are reorganized when you switch HT off: as only half of 
them would be needed, these resources are then bound to the real cores as well, 
extending their performance. Similar to, but not exactly, what Jeff mentions above.

Another aspect is that even if HT does not really double the performance, one 
might get 150%. And if you pay per CPU hour, it can be worth having it 
switched on.
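
To make that arithmetic concrete (illustrative numbers): a job that needs 100 
node-hours with HT off would need roughly 100 / 1.5 ≈ 67 node-hours if HT 
raises aggregate throughput to 150%, so at an unchanged price per node-hour the 
cost per unit of work drops by about a third.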

My personal experience is that it depends not only on the application, but also 
on the way you oversubscribe. Using all cores for a single MPI application 
leads to the effect that all processes are doing the same stuff at the same 
time (at least often) and fight for the same part of the CPU, which essentially 
becomes a bottleneck. But using each half of a CPU for two (or even more) 
applications allows better interleaving of the demand for resources. To allow 
this in the best way: no taskset or binding to cores; let the Linux kernel and 
the CPU do their best - YMMV.
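
With Open MPI, such an unbound co-scheduled launch could look like this (a 
sketch; rank counts and application names are placeholders):

    # Two applications sharing one node, each left unbound so the
    # Linux kernel can interleave their resource demands:
    mpirun -np 8 --bind-to none ./app_one &
    mpirun -np 8 --bind-to none ./app_two &
    wait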

-- Reuti
_______________________________________________
HT implementations vary in some of the details to which you refer.
The most severe limitation of disabling HT on Intel CPUs of the last 5 years 
has been that half of the hardware ITLB entries remain inaccessible. This was 
not supposed to be a serious limitation for many HPC applications.

Applications where each thread needs all of L1, or all of the fill buffers 
(cache lines pending update), aren't so suitable for HT. Intel compilers have 
some ability at -O3 to adjust automatic loop fission and fusion for 
applications with high fill-buffer demand, which requires that there be just 1 
thread using those buffers. In practice, HT actually reduces the rate at which 
FPU instructions may be issued on Intel "big core" CPUs.

HT together with MPI usually requires effective HT-aware pinning. It seems 
unusual for MPI ranks to share cores effectively simply under control of 
kernel scheduling (although Linux is more capable here than Windows). I agree 
that explicit use of taskset under MPI should by now have been superseded by 
the binding options implemented by several MPI implementations, including 
Open MPI.

--
Tim Prince
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
