On 12/11/2011 12:16 PM, Andreas Schäfer wrote:
Hey,

on an SMP box threaded codes CAN always be faster than their MPI
equivalents. One reason why MPI sometimes turns out to be faster is
that with MPI every process actually initializes its own data, so the
data ends up in the NUMA domain to which the core running that
process belongs. A lot of threaded codes are not NUMA aware: for
instance, the initialization is done sequentially (because it may not
take a lot of time), and Linux's first-touch policy then makes all
memory pages belong to a single domain. In essence, those codes will
use just a single memory controller (and its bandwidth).
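[A minimal sketch of what NUMA-aware first touch can look like in an
OpenMP code; the array size and the compute loop are arbitrary
illustrations, assuming a Linux first-touch page placement policy:

/* Each thread initializes the slice it will later compute on, so the
 * pages it faults in land in that thread's NUMA domain. A sequential
 * init loop would instead put every page in one domain. */
#include <stdlib.h>
#include <omp.h>

#define N (1 << 26)   /* illustrative problem size */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* Parallel first touch: pages spread across memory controllers. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Compute phase: the same static schedule keeps each thread on
     * the pages it touched first. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    free(a);
    free(b);
    return 0;
}
]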


Many applications require significant additional RAM and message-passing communication per MPI rank. Where those are not adverse issues, MPI is likely to out-perform pure OpenMP (Andreas just gave some of the reasons), and OpenMP is likely to be favored only where it is an easier development model. The OpenMP runtime should also implement a first-touch policy, but it's very difficult to carry that out fully in legacy applications.

OpenMPI has had effective shared-memory message passing from the beginning, as did its predecessor (LAM) and all current commercial MPI implementations I have seen, so you shouldn't have to beat on an issue which was dealt with 10 years ago. If you haven't been watching this mailing list, you've missed some impressive reporting of new support features for effective pinning by CPU, cache, etc.

When you get to hundreds of nodes, depending on your application and interconnect performance, you may need to consider "hybrid" (OpenMP as the threading model for MPI_THREAD_FUNNELED mode) if you are running a single application across the entire cluster.

The biggest cluster in my neighborhood, which ranked #54 on the recent Top500, gave best performance in pure MPI mode for that ranking. It uses FDR InfiniBand and ran 16 ranks per node on 646 nodes, with DGEMM running in 4-wide vector parallel. Hybrid was tested as well, with each multi-threaded rank pinned to a single L3 cache. All 3 MPI implementations which were tested (OpenMPI and 2 commercial MPIs) have full shared-memory message passing and pinning to local cache within each node.
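[For reference, a minimal sketch of that MPI_THREAD_FUNNELED hybrid pattern; the loop and the reduction are arbitrary illustrations, not anything from the benchmark described above:

/* OpenMP threads do the compute; only the thread that called
 * MPI_Init_thread ever makes MPI calls (FUNNELED). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0, global = 0.0;

    /* Threads compute a partial result in parallel ... */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    /* ... and only the serial master region talks to MPI. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
]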


--
Tim Prince