Hi,
Very interesting discussion about the impact of HT. I was not aware of the
potential difference between turning off HT in the BIOS vs. in the OS.
However, this was not the main issue in my original message. I was expecting
the performance degradation with oversubscription, and I can also agree that
the performance when using HT depends on the application. What puzzles me is
the performance difference between OpenMPI 1.10.1 (and prior versions) and
OpenMPI 1.10.2 (and later versions) in my oversubscription experiments, i.e.
82 seconds vs. 111 seconds. Note that the two experiments have the same degree
of oversubscription (36 processes over 28 cores) and the same HT configuration
(the same processors allowed in the cpuset mask). In addition, the performance
difference is consistent across executions. Given this, the non-determinism of
oversubscription is not enough to explain the difference, and there must be
some implementation issue in OpenMPI 1.10.2 that was not present in version
1.10.1.
Thanks
PS. About the use of taskset: I tried the --cpu-set flag of mpirun (which, as
far as I understand, should have the same effect), but it was not working
correctly on my system, as processes were scheduled on processors not included
in the cpu list.
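For reference, these are the two forms I expected to behave the same (the
--cpu-set line is a reconstruction of what I tried, not a verbatim copy of my
command line):

$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36   # pinning done externally by taskset
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 --cpu-set 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36    # same restriction requested via mpirun itself

With the second form, I saw ranks running on processors outside 0-27.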
On 24/03/2017 20:39, Jeff Squyres (jsquyres) wrote:
Performance goes out the window if you oversubscribe your machines (i.e., run
more MPI processes than cores). The effect of oversubscription is
non-deterministic.
(for the next few paragraphs, assume that HT is disabled in the BIOS -- i.e.,
that there's only 1 hardware thread on each core)
Open MPI uses spinning to check for progress, meaning that any one process will
peg a core at 100%. When you run N MPI processes (where N <= num_cores), then
each process can run at 100% and run as fast as the cores allow.
When you run M MPI processes (where M > num_cores), then, by definition, some
processes will have to yield their position on a core to let another process run.
This means that they will react to MPI/network traffic more slowly than if
they had an entire core to themselves (a similar effect occurs with the
computational part of the app).
Limiting MPI processes to hyperthreads *helps*, but current generation Intel
hyperthreads are not as powerful as full cores (they have roughly half the
resources of a core), so -- depending on your application and your exact
system setup -- you will almost certainly see performance degradation when
running N MPI processes across N hyperthreads (i.e., N/2 cores) vs. across N
cores. You can try it yourself: run the same size application over N cores on
a single machine, and then run the same application over N hyperthreads (i.e.,
N/2 cores) on the same machine.
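For example, something along these lines with mpirun's own mapping/binding
options (just a sketch, assuming a node like yours with 28 cores / 56 hardware
threads, and using the bt.C.16 binary from your runs):

mpirun -np 16 --map-by core --bind-to core ./bt.C.16          # 16 processes, one per core (16 cores used)
mpirun -np 16 --map-by hwthread --bind-to hwthread ./bt.C.16  # the same 16 processes packed onto 16 hardware threads (8 cores used)

The second run will typically come out slower, for the reasons above.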
You can use mpirun's binding options to bind to hyperthreads or cores, too --
you don't have to use taskset (which can be fairly confusing, given the
differences between physical and logical numbering of Linux virtual processor
IDs). And/or you might want to look at the hwloc project to get nice pictures
of the topology of your machine, and look at hwloc-bind as a simpler-to-use
alternative to taskset.
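For example (just a sketch -- adjust the core range to whatever lstopo shows
for your machine):

lstopo                               # draws the sockets / cores / hardware threads / caches of the machine
hwloc-bind core:0-27 -- <command>    # run <command> bound to cores 0-27, using hwloc's *logical* indexes

Because hwloc-bind takes logical indexes, you don't have to worry about how
the BIOS/Linux happened to number the hardware threads.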
Also be aware that there's a (big) difference between enabling/disabling HT in
the BIOS and enabling/disabling HT in the OS:
- Disabling HT in the BIOS means that the one hardware thread left in each core
will get all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will
statically be allocated roughly half the core's resources (buffers, queues,
processor units, etc.).
- When HT is enabled in the BIOS and you enable HT in the OS, then Linux
assigns one virtual processor ID to each HT.
- When HT is enabled in the BIOS and you disable HT in the OS, then Linux
simply does not schedule anything to run on half the virtual processor IDs
(e.g., the 2nd hardware thread in each core). This is NOT the same thing as
disabling HT in the BIOS -- those HTs are still enabled and have half the
core's resources; Linux is just choosing not to use them.
Make sense?
Hence, if you're testing whether your applications will work well with HT or
not, you need to enable/disable HT in the BIOS to get a proper test.
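A quick way to see which situation you're in from Linux (sysfs paths as found
on typical distros; cpu28 is just an example ID of a sibling hardware thread):

cat /sys/devices/system/cpu/present    # every hardware thread the firmware exposed -- with HT disabled in the BIOS, the siblings aren't listed at all
cat /sys/devices/system/cpu/online     # the subset Linux is actually scheduling on
echo 0 | sudo tee /sys/devices/system/cpu/cpu28/online   # an OS-level "disable": the sibling goes offline, but per the above the core's resources stay split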
Spoiler alert: many people have looked at this. In *most* (but not all) cases,
using HT is not a performance win for MPI/HPC codes that are designed to run
processors at 100%.
On Mar 24, 2017, at 6:45 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
Hello,
I'm running experiments with the BT NAS benchmark on OpenMPI. I've identified
a very weird performance degradation of OpenMPI v1.10.2 (and later versions)
when the system is oversubscribed. In particular, note the performance
difference between 1.10.2 and 1.10.1 when running 36 MPI processes over 28
CPUs.
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 111.71
The performance when the system is undersubscribed (i.e. 16 MPI processes over
28 CPUs) seems pretty similar in both versions:
$HOME/openmpi-bin-1.10.1/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 96.78
$HOME/openmpi-bin-1.10.2/bin/mpirun -np 16 taskset -c 0-27
$HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 99.35
Any idea of what is happening?
Thanks
PS. As the system has 28 cores with hyperthreading enabled, I use taskset to
ensure that only one thread per core is used.
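One way to check that CPUs 0-27 really map to distinct cores (Linux sysfs; the
"0,28" pairing in the comments is only what it looks like here, the numbering
is platform dependent):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # prints e.g. "0,28": OS CPUs 0 and 28 are the two hardware threads of one core
for c in /sys/devices/system/cpu/cpu[0-9]*; do echo "$(basename $c): $(cat $c/topology/thread_siblings_list)"; done   # same check for every CPU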
PS2. I have also tested versions 1.10.6, 2.0.1 and 2.0.2, and the degradation
occurs there as well.
http://bsc.es/disclaimer
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users