On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:

Just like Daniel and many others, I have seen some disappointing performance of MPI code on multicore machines, in code that scales fine in networked environments and on single-core CPUs,
particularly in memory-intensive programs.
The bad performance has been variously ascribed to memory bandwidth / contention, to setting processor and memory affinity versus letting the kernel scheduler do its thing,
to poor performance of memcpy, and so on.

I'd suspect that all of these play a role -- not necessarily any single one of them alone.

- It is my belief (contrary to several kernel developers' beliefs) that explicitly setting processor affinity is a Good Thing for MPI applications. Not only does MPI have more knowledge than the OS about a parallel job spanning multiple processes, but each MPI process also allocates resources that may be spatially / temporally relevant. For example, say that an MPI process allocates some memory during MPI_INIT on a NUMA system. This memory will likely be "near" in a NUMA sense. If the OS later decides to move that process, then the memory would be "far" in a NUMA sense. Similarly, OMPI decides what I/O resources to use during MPI_INIT -- and may specifically choose some "near" resources (and exclude "far" resources). If the OS moves the process after MPI_INIT, these "near" and "far" determinations could become stale/incorrect, and performance would go down the tubes. (There's a rough binding sketch after this list.)

- Unoptimized memcpy implementations are definitely a factor, mainly for large message transfers through shared memory. Since most (all?) MPI implementations use some form of shared memory for on-host communication, memcpy can play a big part in performance for large messages. Using hardware (such as IB HCAs) for on-host communication can effectively avoid unoptimized memcpy's, but then you're just shifting the problem to the hardware -- you're now dependent upon the hardware's DMA engine (which is *usually* pretty good). But then other issues can arise, such as the asynchronicity of the transfer, potentially causing collisions and/or extra memory bus traversals that could be avoided with memcpy (it depends on the topology inside your server -- e.g., if 2 processes are "far" from the IB HCA, then the transfer will have to traverse QPI/HT/whatever twice, whereas a memcpy would presumably stay local). As Ron pointed out in this thread, non-temporal memcpy's can be quite helpful for benchmarks that don't touch the resulting message at the receiver (because the non-temporal memcpy doesn't bother to take the time to load the cache). (There's a small streaming-copy sketch after this list.)

- Using different compilers is a highly religious topic, and IMHO tends to be application-specific. Compilers are large, complex software systems (just like MPI); different compiler authors have chosen to implement different optimizations that work well in different applications. So yes, you may well see different run-time performance with different compilers depending on your application and/or MPI implementation. Some compilers may also ship better memcpy's than others.
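
To make the affinity point a bit more concrete, here's a rough, Linux-only sketch of "bind early, before resources get allocated". The local-rank environment variable is just a placeholder (launchers differ in what they export), and in practice you'd normally let the MPI launcher / MPI implementation do the binding for you rather than hand-rolling it:

#define _GNU_SOURCE
#include <sched.h>      /* sched_setaffinity, CPU_SET (Linux-specific) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Placeholder: assume the launcher exports a local-rank variable;
       the name varies across MPI implementations and versions. */
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lr ? atoi(lr) : 0;
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1)
        ncores = 1;

    /* Bind to one core *before* MPI_Init, so memory and I/O resources
       allocated during MPI_Init end up "near" the core we stay on. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(local_rank % ncores, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    MPI_Init(&argc, &argv);
    /* ... application ... */
    MPI_Finalize();
    return 0;
}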
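
And for the non-temporal memcpy point, here's a minimal sketch of a streaming-store copy using SSE2 intrinsics. It assumes an x86 CPU with SSE2, 16-byte-aligned buffers, and a length that's a multiple of 16 bytes; it's only meant to illustrate the idea (the stores bypass the cache instead of evicting whatever the receiver is working on), not to show what any particular MPI implementation actually does:

#include <emmintrin.h>   /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>

/* Copy len bytes without pulling the destination into the cache.
   Assumes 16-byte-aligned dst/src and len a multiple of 16. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *) dst;
    const __m128i *s = (const __m128i *) src;
    size_t i;

    for (i = 0; i < len / 16; i++) {
        __m128i v = _mm_load_si128(&s[i]);  /* normal 16-byte load */
        _mm_stream_si128(&d[i], v);         /* non-temporal store: no cache fill */
    }

    _mm_sfence();  /* make the streaming stores visible before dst is read */
}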

My $0.02: I think there are a *lot* of factors involved here.

--
Jeff Squyres
Cisco Systems
