On 12/11/2011 12:16 PM, Andreas Schäfer wrote:
Hey,

on an SMP box threaded codes CAN always be faster than their MPI
equivalents. One reason why MPI sometimes turns out to be faster is
that with MPI every process actually initializes its own data, so the
data ends up in the NUMA domain to which the core running that
process belongs. A lot of threaded codes are not NUMA aware: for
instance, the initialization is done sequentially (because it may not
take a lot of time), and Linux's first-touch policy then makes all
memory pages belong to a single domain. In essence, those codes will
use just a single memory controller (and its bandwidth).
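[A minimal sketch of what NUMA-aware first touch can look like in an
OpenMP code; the array size and the compute loop are arbitrary
illustrations, assuming a Linux first-touch page placement policy:

/* Each thread initializes the slice it will later compute on, so the
 * pages it faults in land in that thread's NUMA domain. A sequential
 * init loop would instead put every page in one domain. */
#include <stdlib.h>
#include <omp.h>

#define N (1 << 26)   /* illustrative problem size */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* Parallel first touch: pages spread across memory controllers. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Compute phase: the same static schedule keeps each thread on
     * the pages it touched first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    free(a);
    free(b);
    return 0;
}
]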


Many applications require significant additional RAM and message-passing communication per MPI rank. Where those are not adverse issues, MPI is likely to out-perform pure OpenMP (Andreas just gave some of the reasons), and OpenMP is likely to be favored only where it is an easier development model. The OpenMP runtime should also implement a first-touch policy, but it's very difficult to carry that out fully in legacy applications.

OpenMPI has had effective shared-memory message passing from the beginning, as did its predecessor (LAM) and all current commercial MPI implementations I have seen, so you shouldn't have to beat on an issue which was dealt with 10 years ago. If you haven't been watching this mailing list, you've missed some impressive reporting of new support features for effective pinning by CPU, cache, etc.

When you get to hundreds of nodes, depending on your application and interconnect performance, you may need to consider "hybrid" (OpenMP as the threading model for MPI_THREAD_FUNNELED mode) if you are running a single application across the entire cluster.

The biggest cluster in my neighborhood, which ranked #54 on the recent Top500, gave best performance in pure MPI mode for that ranking. It uses FDR InfiniBand and ran 16 ranks per node on 646 nodes, with DGEMM running in 4-wide vector parallel. Hybrid was tested as well, with each multi-threaded rank pinned to a single L3 cache. All 3 MPI implementations which were tested (OpenMPI and 2 commercial MPIs) have full shared-memory message passing and pinning to local cache within each node.
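[For reference, a minimal sketch of that MPI_THREAD_FUNNELED hybrid pattern; the loop and the reduction are arbitrary illustrations, not anything from the benchmark described above:

/* OpenMP threads do the compute; only the thread that called
 * MPI_Init_thread ever makes MPI calls (FUNNELED). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* Threads compute a partial result in parallel ... */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    /* ... and only the serial master region talks to MPI. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
]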


--
Tim Prince