On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:
Just like Daniel and many others, I have seen some disappointing
performance of MPI code on multicore machines, in code that scales
fine in networked environments and on single-core CPUs, particularly
in memory-intensive programs.  The bad performance has been variously
ascribed to memory bandwidth / contention, to setting processor and
memory affinity versus letting the kernel scheduler do its thing, to
poor performance of memcpy, and so on.
I'd suspect that all of these play a role -- not necessarily any
single one of them alone.
- It is my belief (contrary to several kernel developers' beliefs)
that explicitly setting processor affinity is a Good Thing for MPI
applications.  Not only does MPI have more knowledge than the OS
about a parallel job spanning multiple processes; each MPI process
also allocates resources that may be spatially / temporally relevant.
For example, say that an MPI process allocates some memory during
MPI_INIT on a NUMA system.  This memory will likely be "near" in a
NUMA sense.
If the OS later decides to move that process, then the memory would be
"far" in a NUMA sense. Similarly, OMPI decides what I/O resources to
use during MPI_INIT -- and may specifically choose some "near"
resources (and exclude "far" resources). If the OS moves the process
after MPI_INIT, these "near" and "far" determinations could become
stale/incorrect, and performance would go down the tubes.
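To make the "near" / "far" point concrete, here is a minimal sketch
of the first-touch / affinity interaction (my illustration, assuming
Linux and the GNU sched_setaffinity() call -- it is not Open MPI's
actual affinity code).  Memory first touched after the bind lands on
the local NUMA node, which is exactly the placement that goes stale
if the scheduler later migrates the process:

    /* Compile with something like: gcc -std=gnu99 -O2 affinity_sketch.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the calling process to one core.  In a real MPI job the
       core number would come from the process's local rank. */
    static int bind_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        if (bind_to_core(0) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Pages are physically allocated on first touch, so touching
           the buffer *after* binding places it on this core's NUMA
           node. */
        size_t n = 1 << 20;
        double *buf = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; ++i) {
            buf[i] = 0.0;
        }

        /* ... MPI_Init() and the rest of the job would follow ... */
        free(buf);
        return 0;
    }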
- Unoptimized memcpy implementations are definitely a factor, mainly
for large message transfers through shared memory.  Since most (all?)
MPI implementations use some form of shared memory for on-host
communication, memcpy can play a big part in their performance for large
messages. Using hardware (such as IB HCAs) for on-host communication
can effectively avoid unoptimized memcpy's, but then you're just
shifting the problem to the hardware -- you're now dependent upon the
hardware's DMA engine (which is *usually* pretty good). But then
other issues can arise, such as the asynchronicity of the transfer,
potentially causing collisions and/or extra memory bus traversals
that might be avoided with a plain memcpy (it depends on the topology
inside your server -- e.g., if 2 processes are "far" from the IB HCA,
then the transfer will have to traverse QPI/HT/whatever twice, whereas
a memcpy would presumably stay local).  As Ron pointed out in this
thread, non-temporal memcpy's can be quite helpful for benchmarks
that don't touch the resulting message at the receiver (because the
non-temporal memcpy doesn't take the time to pull the copied data
into the cache).
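A non-temporal copy isn't exotic, by the way; a rough sketch with
SSE2 intrinsics looks like the following (my illustration, not code
from any particular MPI implementation -- it assumes 16-byte-aligned
buffers and a length that is a multiple of 16; a real version has to
handle the unaligned head and tail):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Copy len bytes with streaming (non-temporal) stores so the
       destination goes straight to memory instead of being pulled
       into the cache. */
    static void memcpy_nontemporal(void *dst, const void *src,
                                   size_t len)
    {
        __m128i       *d = (__m128i *) dst;
        const __m128i *s = (const __m128i *) src;

        for (size_t i = 0; i < len / 16; ++i) {
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }

The flip side is exactly the caveat above: if the receiver reads the
message right away, it now has to fetch it from main memory, so the
streaming copy only wins when the data isn't touched (or is touched
much later).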
- Using different compilers is a highly religious topic and, IMHO,
tends to be application-specific.  Compilers are large, complex software
systems (just like MPI); different compiler authors have chosen to
implement different optimizations that work well in different
applications. So yes, you may well see different run-time performance
with different compilers depending on your application and/or MPI
implementations. Some compilers may have better memcpy's.
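If you want to test the memcpy point on your own system, a crude
microbenchmark like this one (my sketch; the buffer size and
iteration count are arbitrary) built once per compiler will show
whether the compiler / libc memcpy's differ for large copies:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    /* Crude large-copy bandwidth test: build the same file with each
       compiler and compare the reported rates. */
    int main(void)
    {
        size_t len   = 64 * 1024 * 1024;   /* 64 MB per copy */
        int    iters = 20;
        char  *src   = malloc(len);
        char  *dst   = malloc(len);
        struct timeval t0, t1;

        memset(src, 1, len);               /* touch the pages first */
        memset(dst, 0, len);

        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; ++i) {
            memcpy(dst, src, len);
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_usec - t0.tv_usec) / 1.0e6;
        printf("memcpy: %.1f MB/s\n",
               (double) len * iters / secs / 1.0e6);

        free(src);
        free(dst);
        return 0;
    }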
My $0.02: I think there are a *lot* of factors involved here.
--
Jeff Squyres
Cisco Systems