Performance goes out the window if you oversubscribe your machines (i.e., run more MPI processes than cores), and exactly how badly it degrades is non-deterministic.
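As a quick sanity check before picking -np, you can ask Linux how many physical cores vs. hyperthreads it sees (these are generic Linux commands; the 2 x 14 layout and the "0,28" sibling pair in the comments are just an illustration, not necessarily your box):

  lscpu | grep -E 'Thread|Core|Socket'
      # e.g.:  Thread(s) per core: 2
      #        Core(s) per socket: 14
      #        Socket(s):          2
      # i.e., 28 physical cores, 56 Linux virtual processor IDs

  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
      # lists the virtual processor IDs that are hyperthread siblings
      # on core 0 (e.g., "0,28")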
(For the next few paragraphs, assume that HT is disabled in the BIOS -- i.e., that there's only 1 hardware thread on each core.)

Open MPI uses spinning to check for progress, meaning that any one process will peg a core at 100%. When you run N MPI processes (where N <= num_cores), each process can run at 100% and go as fast as the cores allow. When you run M MPI processes (where M > num_cores), then, by definition, some processes have to yield their place on a core to let another process run. This means they react to MPI/network traffic more slowly than if they had an entire core to themselves (a similar effect occurs with the computational part of the app).

Limiting MPI processes to hyperthreads *helps*, but current-generation Intel hyperthreads are not as powerful as cores (they have roughly half the resources of a core), so -- depending on your application and your exact system setup -- you will almost certainly see a performance degradation when running N MPI processes across N hyperthreads (i.e., N/2 cores) vs. across N full cores. You can try it yourself: run the same size application over N cores on a single machine, and then run the same application over N hyperthreads (i.e., N/2 cores) on the same machine.

You can use mpirun's binding options to bind to hyperthreads or cores, too -- you don't have to use taskset (which can be fairly confusing because of the differences between physical and logical numbering of Linux virtual processor IDs). And/or you might want to look at the hwloc project to get nice pictures of the topology of your machine, and at hwloc-bind as a simpler-to-use alternative to taskset. (Some example command lines are sketched below.)

Also be aware of the difference between enabling and disabling hyperthreads: there's a (big) difference between enabling/disabling HT in the BIOS and enabling/disabling HT in the OS.

- Disabling HT in the BIOS means that the one hardware thread left in each core gets all of the core's resources (buffers, queues, processor units, etc.).
- Enabling HT in the BIOS means that each of the 2 hardware threads is statically allocated roughly half of the core's resources (buffers, queues, processor units, etc.).
- When HT is enabled in the BIOS and you enable HT in the OS, Linux assigns one virtual processor ID to each hardware thread.
- When HT is enabled in the BIOS and you disable HT in the OS, Linux simply does not schedule anything to run on half of the virtual processor IDs (e.g., the 2nd hardware thread in each core). This is NOT the same thing as disabling HT in the BIOS -- those hardware threads are still enabled and still have half of the core's resources; Linux is just choosing not to use them.

Make sense? Hence, if you're testing whether your application works well with HT or not, you need to enable/disable HT in the BIOS to get a proper test.

Spoiler alert: many people have looked at this. In *most* (but not all) cases, using HT is not a performance win for MPI/HPC codes that are designed to run processors at 100%.
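For example -- this is just a sketch, the binary name ./my_mpi_app and the count of 28 processes are placeholders, and you should double-check the exact option spellings in the mpirun(1) and hwloc man pages for your installed versions -- the N-cores vs. N-hyperthreads comparison and the hwloc tools mentioned above would look something like this:

  # 28 processes, one per core: each rank gets a full core to itself
  mpirun -np 28 --map-by core --bind-to core --report-bindings ./my_mpi_app

  # 28 processes packed onto 28 hyperthreads (i.e., 14 cores):
  # each rank gets roughly half a core
  mpirun -np 28 --map-by hwthread --bind-to hwthread --report-bindings ./my_mpi_app

  # hwloc: draw a picture of the machine topology, and bind using
  # hwloc's logical core numbering instead of raw virtual processor IDs
  lstopo
  hwloc-bind core:0-27 -- ./my_mpi_app

--report-bindings makes each rank print where it was bound, so you can confirm which of the two layouts you actually got.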
> On Mar 24, 2017, at 6:45 AM, Jordi Guitart <jordi.guit...@bsc.es> wrote:
>
> Hello,
>
> I'm running experiments with the BT NAS benchmark on OpenMPI. I've identified a very weird performance degradation of OpenMPI v1.10.2 (and later versions) when the system is oversubscribed. In particular, note the performance difference between 1.10.2 and 1.10.1 when running 36 MPI processes over 28 CPUs:
>
> $HOME/openmpi-bin-1.10.1/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
> $HOME/openmpi-bin-1.10.2/bin/mpirun -np 36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 111.71
>
> The performance when the system is undersubscribed (i.e., 16 MPI processes over 28 CPUs) seems pretty similar in both versions:
>
> $HOME/openmpi-bin-1.10.1/bin/mpirun -np 16 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 96.78
> $HOME/openmpi-bin-1.10.2/bin/mpirun -np 16 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.16 -> Time in seconds = 99.35
>
> Any idea of what is happening?
>
> Thanks
>
> PS. As the system has 28 cores with hyperthreading enabled, I use taskset to ensure that only one thread per core is used.
> PS2. I have also tested versions 1.10.6, 2.0.1, and 2.0.2, and the degradation also occurs.

--
Jeff Squyres
jsquy...@cisco.com