On 3/24/2017 6:10 PM, Reuti wrote:
Hi,
Am 24.03.2017 um 20:39 schrieb Jeff Squyres (jsquyres):
Limiting MPI processes to hyperthreads *helps*, but current-generation Intel
hyperthreads are not as powerful as cores (they have roughly half the resources
of a core), so -- depending on your application and your exact system setup --
you will almost certainly see performance degradation when running N MPI
processes across N hyperthreads rather than across N cores. You can try it
yourself by running the same-size application over N cores on a single machine,
and then running the same application over N hyperthreads (i.e., N/2 cores) on
the same machine.
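A minimal sketch of such a comparison, using Python's Linux-only `os.sched_setaffinity` to pin one busy worker per logical CPU. The CPU numbers you pass in are assumptions -- check your machine's topology (e.g. in `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`) to know which logical CPUs are HT siblings of the same core:

```python
import os
import time
import multiprocessing as mp

# Linux-only sketch; "fork" keeps the workers from re-importing __main__.
_ctx = mp.get_context("fork")

def spin(iters: int) -> None:
    # A fixed amount of arithmetic work, so contention for a shared core
    # shows up as a longer wall-clock time.
    x = 0
    for i in range(iters):
        x += i * i

def pinned_spin(cpu: int, iters: int) -> None:
    # Pin this worker to a single logical CPU before doing the work.
    os.sched_setaffinity(0, {cpu})
    spin(iters)

def run_pinned(cpus, iters: int = 2_000_000) -> float:
    """Run one spinning worker per logical CPU in `cpus`; return the
    wall-clock time until all workers finish."""
    start = time.perf_counter()
    procs = [_ctx.Process(target=pinned_spin, args=(c, iters)) for c in cpus]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start
```

Comparing `run_pinned` over two distinct cores against two HT siblings of one core will usually show the sibling pair taking noticeably longer for the same amount of work.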
[…]
- Disabling HT in the BIOS means that the one hardware thread left in each core
will get all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will
statically be allocated roughly half the core's resources (buffers, queues,
processor units, etc.).
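One way to see which of the two situations a given machine is in: Linux sysfs lists the hardware threads that share each physical core, so each core reports two siblings with HT enabled in the BIOS and one with it disabled. A minimal sketch (the sysfs path is Linux-specific):

```python
from pathlib import Path

def thread_siblings(cpu: int) -> list[int]:
    """Logical CPUs sharing a physical core with `cpu` (Linux sysfs)."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list")
    siblings = []
    # The file holds entries like "0,4" or "0-1".
    for part in path.read_text().strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            siblings.extend(range(int(lo), int(hi) + 1))
        else:
            siblings.append(int(part))
    return siblings

# Two entries per core: HT is on; one entry: HT is off (or the CPU has none).
print(thread_siblings(0))
```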
Do you have a reference for the two points above? (I will certainly try it
myself next week.) My understanding was that there is no dedicated HT core:
using all logical CPUs does not give the result that the real cores run at
N x 100% while the HT ones add N x 50% (or the like). Rather, the scheduler
inside the CPU balances resources between the two faces of a single core, and
both are equal.
[…]
Spoiler alert: many people have looked at this. In *most* (but not all) cases,
using HT is not a performance win for MPI/HPC codes that are designed to run
processors at 100%.
I think it was also on this mailing list that someone mentioned that the
pipelines in the CPU are reorganized when you switch HT off: only half of them
would be needed, and those resources are then bound to the real cores too,
extending their performance. Similar to, but not exactly, what Jeff mentions
above.
Another aspect is that even if HT does not really double the performance, one
might get 150%. And if you pay per CPU hour, it can be worth having it switched
on.
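Taking the hypothetical 150% figure above, and assuming billing per node-hour (an assumption; per-logical-CPU billing would change the conclusion), the arithmetic is:

```python
# Hypothetical figures, not measurements: HT on yields 1.5x the node
# throughput at an unchanged node-hour price.
throughput_ht_off = 1.0    # work units per node-hour
throughput_ht_on = 1.5
price_per_node_hour = 1.0

cost_per_unit_off = price_per_node_hour / throughput_ht_off
cost_per_unit_on = price_per_node_hour / throughput_ht_on
print(cost_per_unit_on / cost_per_unit_off)  # 2/3: a third cheaper per unit of work
```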
My personal experience is that it depends not only on the application, but also
on the way you oversubscribe. Using all cores for a single MPI application
leads to the effect that all processes do the same work at the same time (at
least often) and fight for the same part of the CPU, which essentially becomes
a bottleneck. But giving each of two (or even more) applications half of the
CPUs allows better interleaving of their demands for resources. To allow this
to work best: no taskset or binding to cores; let the Linux kernel and the CPU
do their best - YMMV.
-- Reuti
_______________________________________________
HT implementations vary in some of the details to which you refer.
The most severe limitation of disabling HT on Intel CPUs of the last 5 years
has been that half of the hardware ITLB entries remain inaccessible. This is
supposedly not a serious limitation for many HPC applications.
Applications where each thread needs all of the L1 cache, or all of the fill
buffers (buffers for cache lines pending update), aren't well suited to HT.
At -O3, Intel compilers have some ability to adjust automatic loop fission and
fusion for applications with high fill-buffer demand, which requires that just
one thread be using those buffers.
In practice, HT actually reduces the rate at which FPU instructions can be
issued on Intel "big core" CPUs.
HT together with MPI usually requires effective HT-aware pinning. It seems
unusual for MPI ranks to share cores effectively simply under the control of
kernel scheduling (although Linux is more capable here than Windows). I agree
that explicit use of taskset under MPI should by now be superseded by the
binding options implemented by several MPI libraries, including Open MPI.
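For reference, `taskset` and the MPI binding options both come down to the `sched_setaffinity` system call; a minimal Linux-only sketch of the same operation from Python (logical CPU 0 is an arbitrary choice):

```python
import os

# The affinity mask this process started with (under an MPI launcher's
# binding options it would already be a subset of the machine's CPUs).
original = os.sched_getaffinity(0)

# Restrict the process to logical CPU 0 -- the programmatic equivalent of
# `taskset -c 0`. HT-aware pinning means picking CPUs from distinct cores.
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# Restore the original mask.
os.sched_setaffinity(0, original)
```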
--
Tim Prince
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users