On Sep 27, 2015, at 1:38 PM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> 
wrote:
> 
> Hello, everyone
> 
> I am struggling a bit with IB performance when sending data from a POSIX 
> shared memory region (/dev/shm). The memory is shared among many MPI 
> processes within the same compute node. Essentially, I see somewhat erratic 
> performance, and it seems that my code is roughly twice as slow as when 
> using a regular, malloced send buffer.

It may have to do with NUMA effects and the way you're allocating/touching your 
shared memory vs. your private (malloced) memory.  If you have a 
multi-NUMA-domain system (i.e., any 2+ socket server, and even some 
single-socket servers) then you are likely to run into this sort of issue.  The 
PCI bus on which your IB HCA communicates is almost certainly closer to one 
NUMA domain than the others, and performance will usually be worse if you are 
sending/receiving from/to a "remote" NUMA domain.
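
One thing that often matters here is which process first touches the shared-memory 
pages.  Below is a minimal sketch of the first-touch idea (not your code; the region 
name and size are made up, and error handling is minimal): with the default "local" 
allocation policy, pages land on the NUMA node of the CPU that first writes them, so 
touching the buffer from the rank that will use it as a send buffer keeps it local.

    /* First-touch placement of a POSIX shared-memory region.
       Compile with -lrt on older glibc. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_NAME "/my_shm_region"      /* hypothetical name */
    #define REGION_SIZE (64UL * 1024 * 1024)  /* hypothetical size */

    int main(void)
    {
        int fd = shm_open(REGION_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        if (ftruncate(fd, REGION_SIZE) != 0) return 1;

        char *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) return 1;

        /* First touch: writing the pages here physically allocates them
           on this process's current NUMA node (default local policy). */
        memset(buf, 0, REGION_SIZE);

        /* ... use buf as the send buffer / share it with other ranks ... */

        munmap(buf, REGION_SIZE);
        close(fd);
        shm_unlink(REGION_NAME);
        return 0;
    }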

"lstopo" and other tools can sometimes help you get a handle on the situation, 
though I don't know if it knows how to show memory affinity.  I think you can 
find memory affinity for a process via "/proc/<pid>/numa_maps".  There's lots 
of info about NUMA affinity here: https://queue.acm.org/detail.cfm?id=2513149
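
As an alternative to reading numa_maps by hand, here's a minimal sketch that asks 
the kernel directly which NUMA node each page lives on, via the Linux move_pages(2) 
call from libnuma (compile with -lnuma).  The plain malloc'd buffer is just a 
stand-in for your /dev/shm region:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        enum { NPAGES = 4 };
        char *buf = malloc(NPAGES * page);
        memset(buf, 0, NPAGES * page);   /* touch so the pages exist */

        void *pages[NPAGES];
        int   status[NPAGES];
        for (int i = 0; i < NPAGES; i++)
            pages[i] = buf + i * page;

        /* nodes == NULL means "don't move anything, just report the node
           each page currently lives on" into status[]. */
        if (move_pages(0, NPAGES, pages, NULL, status, 0) != 0) {
            perror("move_pages");
            return 1;
        }
        for (int i = 0; i < NPAGES; i++)
            printf("page %d is on NUMA node %d\n", i, status[i]);

        free(buf);
        return 0;
    }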

-Dave
