On Aug 15, 2008, at 3:32 PM, Gus Correa wrote:
Just like Daniel and many others, I have seen some disappointing
performance of MPI code on multicore machines, in code that scales
fine in networked environments and on single-core CPUs, particularly
in memory-intensive programs.  The bad performance has been variously
ascribed to memory bandwidth / contention, to setting processor and
memory affinity versus letting the kernel scheduler do its thing, to
poor performance of memcpy, and so on.
I'd suspect that all of these play a role -- not necessarily any
single one of them alone.
- It is my belief (contrary to several kernel developers' beliefs)
that explicitly setting processor affinity is a Good Thing for MPI
applications.  Not only does MPI have more knowledge than the OS
about a parallel job spanning multiple processes; each MPI process
also allocates resources that may be spatially / temporally relevant.
For example, say that an MPI process allocates some memory during
MPI_INIT on a NUMA system.  This memory will likely be "near" in a
NUMA sense.
If the OS later decides to move that process, then the memory would be
"far" in a NUMA sense. Similarly, OMPI decides what I/O resources to
use during MPI_INIT -- and may specifically choose some "near"
resources (and exclude "far" resources). If the OS moves the process
after MPI_INIT, these "near" and "far" determinations could become
stale/incorrect, and performance would go down the tubes.
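To make the "near" / "far" point concrete, here is a minimal sketch
of the first-touch / affinity interaction (my illustration, assuming
Linux and the GNU sched_setaffinity() call -- it is not Open MPI's
actual affinity code).  Memory first touched after the bind lands on
the local NUMA node, which is exactly the placement that goes stale
if the scheduler later migrates the process:

    /* Compile with something like: gcc -std=gnu99 -O2 affinity_sketch.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pin the calling process to one core.  In a real MPI job the
       core number would come from the process's local rank. */
    static int bind_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        if (bind_to_core(0) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Pages are physically allocated on first touch, so touching
           the buffer *after* binding places it on this core's NUMA
           node. */
        size_t n = 1 << 20;
        double *buf = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; ++i) {
            buf[i] = 0.0;
        }

        /* ... MPI_Init() and the rest of the job would follow ... */
        free(buf);
        return 0;
    }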
- Unoptimized memcpy implementations are definitely a factor, mainly
for large message transfers through shared memory.  Since most (all?)
MPI implementations use some form of shared memory for on-host
communication, memcpy can play a big part in their performance for large
messages. Using hardware (such as IB HCAs) for on-host communication
can effectively avoid unoptimized memcpy's, but then you're just
shifting the problem to the hardware -- you're now dependent upon the
hardware's DMA engine (which is *usually* pretty good). But then
other issues can arise, such as the asynchronicity of the transfer,
potentially causing collisions and/or extra memory bus traversals
that might be avoided with a plain memcpy (it depends on the topology
inside your server -- e.g., if 2 processes are "far" from the IB HCA,
then the transfer will have to traverse QPI/HT/whatever twice, whereas
a memcpy would presumably stay local).  As Ron pointed out in this
thread, non-temporal memcpy's can be quite helpful for benchmarks
that don't touch the resulting message at the receiver (because the
non-temporal memcpy doesn't take the time to pull the copied data
into the cache).
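A non-temporal copy isn't exotic, by the way; a rough sketch with
SSE2 intrinsics looks like the following (my illustration, not code
from any particular MPI implementation -- it assumes 16-byte-aligned
buffers and a length that is a multiple of 16; a real version has to
handle the unaligned head and tail):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Copy len bytes with streaming (non-temporal) stores so the
       destination goes straight to memory instead of being pulled
       into the cache. */
    static void memcpy_nontemporal(void *dst, const void *src,
                                   size_t len)
    {
        __m128i       *d = (__m128i *) dst;
        const __m128i *s = (const __m128i *) src;

        for (size_t i = 0; i < len / 16; ++i) {
            _mm_stream_si128(d + i, _mm_load_si128(s + i));
        }
        _mm_sfence();   /* make the streaming stores globally visible */
    }

The flip side is exactly the caveat above: if the receiver reads the
message right away, it now has to fetch it from main memory, so the
streaming copy only wins when the data isn't touched (or is touched
much later).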
- Using different compilers is a highly religious topic and, IMHO,
tends to be application-specific.  Compilers are large, complex software
systems (just like MPI); different compiler authors have chosen to
implement different optimizations that work well in different
applications. So yes, you may well see different run-time performance
with different compilers depending on your application and/or MPI
implementations. Some compilers may have better memcpy's.
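If you want to test the memcpy point on your own system, a crude
microbenchmark like this one (my sketch; the buffer size and
iteration count are arbitrary) built once per compiler will show
whether the compiler / libc memcpy's differ for large copies:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    /* Crude large-copy bandwidth test: build the same file with each
       compiler and compare the reported rates. */
    int main(void)
    {
        size_t len   = 64 * 1024 * 1024;   /* 64 MB per copy */
        int    iters = 20;
        char  *src   = malloc(len);
        char  *dst   = malloc(len);
        struct timeval t0, t1;

        memset(src, 1, len);               /* touch the pages first */
        memset(dst, 0, len);

        gettimeofday(&t0, NULL);
        for (int i = 0; i < iters; ++i) {
            memcpy(dst, src, len);
        }
        gettimeofday(&t1, NULL);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_usec - t0.tv_usec) / 1.0e6;
        printf("memcpy: %.1f MB/s\n",
               (double) len * iters / secs / 1.0e6);

        free(src);
        free(dst);
        return 0;
    }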
My $0.02: I think there are a *lot* of factors involved here.
--
Jeff Squyres
Cisco Systems