Performance goes out the window if you oversubscribe your machines (i.e., run 
more MPI processes than cores).  The effect of oversubscription is 
non-deterministic.

(for the next few paragraphs, assume that HT is disabled in the BIOS -- i.e., 
that there's only 1 hardware thread on each core)

Open MPI uses spinning to check for progress, meaning that any one process will 
peg a core at 100%.  When you run N MPI processes (where N <= num_cores), then 
each process can run at 100% and run as fast as the cores allow.

When you run M MPI processes (where M > num_cores), then, by definition, some 
processes will have to yield their position on a core to let another process 
run.  This means that they will react to MPI/network traffic slower than if 
they had an entire core to themselves (a similar effect occurs with the 
computational part of the app).
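
If you really do need to oversubscribe (e.g., quick functional testing on a laptop), you can at least tell Open MPI to yield the processor instead of aggressively spinning.  A minimal sketch -- the exact option / MCA parameter names and their defaults vary a bit across Open MPI releases, so check ompi_info / mpirun(1) for your installation:

  # Allow more processes than slots, and hint that processes should call
  # sched_yield() while waiting for progress ("degraded" mode):
  mpirun -np 36 --oversubscribe --mca mpi_yield_when_idle 1 ./bt.C.36

This won't make an oversubscribed run fast -- it just tends to make it less pathological.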

Limiting MPI processes to hyperthreads *helps*, but current-generation Intel 
hyperthreads are not as powerful as full cores (each has roughly half the 
resources of a core), so -- depending on your application and your exact system 
setup -- you will almost certainly see a performance degradation when running N 
MPI processes across N hyperthreads (i.e., N/2 cores) vs. across N full cores.  
You can try it yourself by running the same-size application over N cores on a 
single machine, and then running the same application over N hyperthreads 
(i.e., N/2 cores) on the same machine.
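
With the binding/mapping options in recent mpirun versions (1.8 and later), a back-to-back comparison might look something like the sketch below (./my_app is a placeholder for your benchmark; check mpirun(1) for the exact spellings on your version):

  # N processes, one per core (one hardware thread per core in use):
  mpirun -np 14 --map-by core --bind-to core --report-bindings ./my_app

  # Same N processes, one per hardware thread (i.e., packed onto N/2 cores):
  mpirun -np 14 --map-by hwthread --bind-to hwthread --report-bindings ./my_app

--report-bindings prints where each rank actually landed, which is a good sanity check that you're measuring what you think you're measuring.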

You can use mpirun's binding options to bind to hyperthreads or cores, too -- 
you don't have to use taskset (which can be fairly confusing, given the 
differences between physical and logical numbering of Linux virtual processor 
IDs).  And/or you might want to look at the hwloc project to get nice pictures 
of the topology of your machine, and look at hwloc-bind as a simpler-to-use 
alternative to taskset.
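
For instance (a sketch, assuming the hwloc command-line tools are installed; package names differ by distro):

  # Draw the machine topology (sockets, cores, hardware threads, caches):
  lstopo topo.png

  # Bind a command to (all hardware threads of) logical core 3:
  hwloc-bind core:3 -- ./my_app

  # Show the binding of the current process/shell:
  hwloc-bind --get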

Also be aware that there's a (big) difference between enabling/disabling HT in 
the BIOS and enabling/disabling HT in the OS:

- Disabling HT in the BIOS means that the one hardware thread left in each core 
will get all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads will 
statically be allocated roughly half the core's resources (buffers, queues, 
processor units, etc.).

- When HT is enabled in the BIOS and you enable HT in the OS, then Linux 
assigns one virtual processor ID to each HT.
- When HT is enabled in the BIOS and you disable HT in the OS, then Linux 
simply does not schedule anything to run on half the virtual processor IDs 
(e.g., the 2nd hardware thread in each core).  This is NOT the same thing as 
disabling HT in the BIOS -- those HTs are still enabled and still have half the 
core's resources; Linux is just choosing not to use them (see the sketch below).
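
A minimal sketch of how to see (and change) what Linux is doing, using the standard lscpu / sysfs interfaces -- the CPU IDs below are hypothetical and depend on your machine's numbering:

  # How many hardware threads per core does the OS see?
  lscpu | grep -i 'thread(s) per core'

  # Which virtual processor IDs share a physical core with CPU 0?
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

  # "Disable" one hardware thread at the OS level (Linux just stops
  # scheduling on it; the core's resources stay statically split):
  echo 0 | sudo tee /sys/devices/system/cpu/cpu28/online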

Make sense?

Hence, if you're testing whether your applications will work well with HT or 
not, you need to enable/disable HT in the BIOS to get a proper test.

Spoiler alert: many people have looked at this.  In *most* (but not all) cases, 
using HT is not a performance win for MPI/HPC codes that are designed to run 
processors at 100%.



> On Mar 24, 2017, at 6:45 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
> 
> Hello,
> 
> I'm running experiments with BT NAS benchmark on OpenMPI. I've identified a 
> very weird performance degradation of OpenMPI v1.10.2 (and later versions) 
> when the system is oversubscribed. In particular, note the performance 
> difference between 1.10.2 and 1.10.1 when running 36 MPI processes over 28 
> CPUs.
> 
> > $HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27 
> > $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds =  82.79
> > $HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 
> > $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds =  111.71
> 
> The performance when the system is undersubscribed (i.e. 16 MPI processes 
> over 28 CPUs) seems pretty similar in both versions:
> 
> > $HOME/openmpi-bin-1.10.1/bin/mpirun -np 16 taskset -c 0-27 
> > $HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds =  96.78
> > $HOME/openmpi-bin-1.10.2/bin/mpirun -np 16 taskset -c 0-27 
> > $HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds =  99.35
> 
> Any idea of what is happening?
> 
> Thanks
> 
> PS. As the system has 28 cores with hyperthreading enabled, I use taskset to 
> ensure that only one thread per core is used.
> PS2. I have tested also versions 1.10.6, 2.0.1 and 2.0.2, and the degradation 
> also occurs.
> 
> http://bsc.es/disclaimer


-- 
Jeff Squyres
jsquy...@cisco.com
