Hi,
Very interesting discussion about the impact of HT. I was not aware of the
potential difference between turning off HT in the BIOS vs. in the OS.
However, this was not the main issue in my original message. I was expecting
the performance degradation with oversubscription, and I can also agree that
the performance when using HT depends on the application. What puzzles me is
the performance difference between OpenMPI 1.10.1 (and prior versions) and
OpenMPI 1.10.2 (and later versions) in my oversubscription experiments, i.e.
82 seconds vs. 111 seconds. Note that the two experiments have the same degree
of oversubscription (36 processes over 28 cores) and the same HT configuration
(the same processors allowed in the cpuset mask). In addition, the performance
difference is consistent across executions. Given this, the non-determinism of
oversubscription is not enough to explain the difference, and there must be
some implementation issue in OpenMPI 1.10.2 that was not present in version
1.10.1.
Thanks
PS. About the use of taskset: I tried the --cpu-set flag of mpirun (which, as
far as I understand, should have the same effect), but it was not working
correctly on my system, as processes were scheduled on processors not included
in the cpu list.
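For reference, these are the two forms I expected to behave the same (the
--cpu-set line is a reconstruction of what I tried, not a verbatim copy of my
command line):

$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36   # pinning done externally by taskset
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 --cpu-set 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36    # same restriction requested via mpirun itself

With the second form, I saw ranks running on processors outside 0-27.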
On 24/03/2017 20:39, Jeff Squyres (jsquyres) wrote:
Performance goes out the window if you oversubscribe your machines (i.e., run
more MPI processes than cores). The effect of oversubscription is
non-deterministic.
(for the next few paragraphs, assume that HT is disabled in the BIOS -- i.e.,
that there's only 1 hardware thread on each core)
Open MPI uses spinning to check for progress, meaning that any one process will
peg a core at 100%. When you run N MPI processes (where N <= num_cores), then
each process can run at 100% and run as fast as the cores allow.
When you run M MPI processes (where M > num_cores), then, by definition, some
processes will have to yield their position on a core to let another process run.
This means that they will react to MPI/network traffic more slowly than if
they had an entire core to themselves (a similar effect occurs with the
computational part of the app).
Limiting MPI processes to hyperthreads *helps*, but current generation Intel
hyperthreads are not as powerful as full cores (they have roughly half the
resources of a core), so -- depending on your application and your exact
system setup -- you will almost certainly see performance degradation when
running N MPI processes across N hyperthreads (i.e., N/2 cores) vs. across N
cores. You can try it yourself: run the same size application over N cores on
a single machine, and then run the same application over N hyperthreads (i.e.,
N/2 cores) on the same machine.
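For example, something along these lines with mpirun's own mapping/binding
options (just a sketch, assuming a node like yours with 28 cores / 56 hardware
threads, and using the bt.C.16 binary from your runs):

mpirun -np 16 --map-by core --bind-to core ./bt.C.16          # 16 processes, one per core (16 cores used)
mpirun -np 16 --map-by hwthread --bind-to hwthread ./bt.C.16  # the same 16 processes packed onto 16 hardware threads (8 cores used)

The second run will typically come out slower, for the reasons above.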
You can use mpirun's binding options to bind to hyperthreads or cores, too --
you don't have to use taskset (which can be fairly confusing, given the
differences between physical and logical numbering of Linux virtual processor
IDs). And/or you might want to look at the hwloc project to get nice pictures
of the topology of your machine, and look at hwloc-bind as a simpler-to-use
alternative to taskset.
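For example (just a sketch -- adjust the core range to whatever lstopo shows
for your machine):

lstopo                               # draws the sockets / cores / hardware threads / caches of the machine
hwloc-bind core:0-27 -- <command>    # run <command> bound to cores 0-27, using hwloc's *logical* indexes

Because hwloc-bind takes logical indexes, you don't have to worry about how
the BIOS/Linux happened to number the hardware threads.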
Also be aware that there's a (big) difference between enabling/disabling HT in
the BIOS and enabling/disabling HT in the OS:
- Disabling HT in the BIOS means that the one hardware thread left in each core
will get all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will
statically be allocated roughly half the core's resources (buffers, queues,
processor units, etc.).
- When HT is enabled in the BIOS and you enable HT in the OS, then Linux
assigns one virtual processor ID to each HT.
- When HT is enabled in the BIOS and you disable HT in the OS, then Linux
simply does not schedule anything to run on half the virtual processor IDs
(e.g., the 2nd hardware thread in each core). This is NOT the same thing as
disabling HT in the BIOS -- those HTs are still enabled and have half the
core's resources; Linux is just choosing not to use them.
Make sense?
Hence, if you're testing whether your applications will work well with HT or
not, you need to enable/disable HT in the BIOS to get a proper test.
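A quick way to see which situation you're in from Linux (sysfs paths as found
on typical distros; cpu28 is just an example ID of a sibling hardware thread):

cat /sys/devices/system/cpu/present    # every hardware thread the firmware exposed -- with HT disabled in the BIOS, the siblings aren't listed at all
cat /sys/devices/system/cpu/online     # the subset Linux is actually scheduling on
echo 0 | sudo tee /sys/devices/system/cpu/cpu28/online   # an OS-level "disable": the sibling goes offline, but per the above the core's resources stay split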
Spoiler alert: many people have looked at this. In *most* (but not all) cases,
using HT is not a performance win for MPI/HPC codes that are designed to run
processors at 100%.
On Mar 24, 2017, at 6:45 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
Hello,
I'm running experiments with the BT NAS benchmark on OpenMPI. I've identified
a very weird performance degradation of OpenMPI v1.10.2 (and later versions)
when the system is oversubscribed. In particular, note the performance
difference between 1.10.2 and 1.10.1 when running 36 MPI processes over 28
CPUs.
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 111.71
The performance when the system is undersubscribed (i.e. 16 MPI processes over
28 CPUs) seems pretty similar in both versions:
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 96.78
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 99.35
Any idea of what is happening?
Thanks
PS. As the system has 28 cores with hyperthreading enabled, I use taskset to
ensure that only one thread per core is used.
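One way to check that CPUs 0-27 really map to distinct cores (Linux sysfs; the
"0,28" pairing in the comments is only what it looks like here, the numbering
is platform dependent):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # prints e.g. "0,28": OS CPUs 0 and 28 are the two hardware threads of one core
for c in /sys/devices/system/cpu/cpu[0-9]*; do echo "$(basename $c): $(cat $c/topology/thread_siblings_list)"; done   # same check for every CPU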
PS2. I have also tested versions 1.10.6, 2.0.1 and 2.0.2, and the degradation
occurs there as well.
http://bsc.es/disclaimer
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users