What Open MPI version are you using? Does this happen when you run on a single 
node or multiple nodes?

-Nathan

> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> All,
> 
> We are observing some strange/interesting performance issues in accessing 
> memory that has been allocated through MPI_Win_allocate. I am attaching our 
> test case, which allocates memory for 100M integer values on each process 
> both through malloc and MPI_Win_allocate and writes to the local ranges 
> sequentially.
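> 
> In essence, the test boils down to something like the following (simplified 
> sketch; the actual attached code differs in details such as repetitions and 
> output):
> 
>   #include <mpi.h>
>   #include <stdio.h>
>   #include <stdlib.h>
> 
>   #define N (100 * 1000 * 1000)   /* 100M ints per process */
> 
>   int main(int argc, char **argv)
>   {
>     MPI_Init(&argc, &argv);
> 
>     /* plain local allocation */
>     int *heap_buf = malloc(N * sizeof(int));
> 
>     /* window allocation: N ints of local window memory per process */
>     int *win_buf;
>     MPI_Win win;
>     MPI_Win_allocate((MPI_Aint)N * sizeof(int), sizeof(int), MPI_INFO_NULL,
>                      MPI_COMM_WORLD, &win_buf, &win);
> 
>     /* sequential writes to the local ranges, timed separately */
>     double t = MPI_Wtime();
>     for (int i = 0; i < N; i++) heap_buf[i] = i;
>     double t_malloc = MPI_Wtime() - t;
> 
>     t = MPI_Wtime();
>     for (int i = 0; i < N; i++) win_buf[i] = i;
>     double t_win = MPI_Wtime() - t;
> 
>     printf("malloc: %.3fs  MPI_Win_allocate: %.3fs\n", t_malloc, t_win);
> 
>     MPI_Win_free(&win);
>     free(heap_buf);
>     MPI_Finalize();
>     return 0;
>   }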
> 
> On different systems (incl. SuperMUC and a Bull Cluster), we see that 
> accessing the memory allocated through MPI is significantly slower than 
> accessing the malloc'ed memory if multiple processes run on a single node, 
> and the effect grows with the number of processes per node. As an example, 
> running 24 processes per node with the attached example, we see the 
> operations on the malloc'ed memory take ~0.4s while the MPI-allocated 
> memory takes up to 10s.
> 
> After some experiments, I think there are two factors involved:
> 
> 1) Initialization: it appears that the first iteration is significantly 
> slower than any subsequent iteration (1.1s vs 0.4s with 12 processes on a 
> single socket). Excluding the first iteration from the timing or memsetting 
> the range beforehand leads to comparable performance. I assume this is due 
> to page faults that stem from first touching the mmap'ed memory that backs 
> the shared memory used in the window. The effect of presetting the 
> malloc'ed memory seems smaller (0.4s vs 0.6s).
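> 
> Concretely, presetting/memsetting the range means something like this 
> (sketch only), done once before the timed loop so that the page faults 
> happen outside the measurement; with first-touch placement this should also 
> place the pages on the calling process' NUMA node:
> 
>   #include <string.h>
> 
>   /* touch every page of the local window memory once before timing */
>   memset(win_buf, 0, (size_t)N * sizeof(int));
>   /* make sure all processes are done initializing before measuring */
>   MPI_Barrier(MPI_COMM_WORLD);
> 
>   /* ... timed write loops follow as before ... */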
> 
> 2) NUMA effects: Given proper initialization, running on two sockets still 
> leads to fluctuating performance degradation for the MPI window memory, 
> which ranges up to 20x in extreme cases. The performance of accessing the 
> malloc'ed memory is rather stable. The difference seems to get smaller (but 
> does not disappear) with an increasing number of repetitions. I am not sure 
> what causes these effects, as each process should first-touch its local 
> memory.
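> 
> One way to check where the window pages actually land after first touch 
> would be the Linux move_pages() interface (diagnostic sketch only; link 
> with -lnuma):
> 
>   #include <numaif.h>   /* move_pages(); link with -lnuma */
>   #include <stdio.h>
> 
>   /* with nodes == NULL, move_pages() only reports the NUMA node that
>      currently holds each page instead of migrating anything */
>   void *page   = win_buf;
>   int   status = -1;
>   if (move_pages(0 /* this process */, 1, &page, NULL, &status, 0) == 0)
>     printf("first window page resides on NUMA node %d\n", status);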
> 
> Are these known issues? Does anyone have any thoughts on my analysis?
> 
> It is problematic for us that replacing local memory allocation with MPI 
> memory allocation leads to performance degradation, as we rely on this 
> mechanism in our distributed data structures. While we can ensure proper 
> initialization of the memory to mitigate 1) for performance measurements, I 
> don't see a way to control the NUMA effects. If there is one, I'd 
> appreciate any hints :)
> 
> I should note that we also tested MPICH-based implementations, which showed 
> similar effects (as they also mmap their window memory). Not surprisingly, 
> using MPI_Alloc_mem and attaching that memory to a dynamic window does not 
> cause these effects, while using shared memory windows does (a sketch of 
> the dynamic-window variant follows after the command lines below). I ran my 
> experiments using Open MPI 3.1.0 with the following command lines:
> 
> - 12 cores / 1 socket:
> mpirun -n 12 --bind-to socket --map-by ppr:12:socket
> - 24 cores / 2 sockets:
> mpirun -n 24 --bind-to socket
> 
> and verified the binding using --report-bindings.
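> 
> For completeness, the dynamic-window variant mentioned above, which does 
> not show the effect, looks roughly like this (sketch):
> 
>   /* allocate local memory via MPI and attach it to a dynamic window;
>      this memory is not the shared-memory mapping used for allocated
>      windows, so local accesses behave like accesses to malloc'ed memory */
>   int *buf;
>   MPI_Win dyn_win;
>   MPI_Alloc_mem((MPI_Aint)N * sizeof(int), MPI_INFO_NULL, &buf);
>   MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dyn_win);
>   MPI_Win_attach(dyn_win, buf, (MPI_Aint)N * sizeof(int));
> 
>   /* ... local writes to buf perform like the malloc'ed case ... */
> 
>   MPI_Win_detach(dyn_win, buf);
>   MPI_Win_free(&dyn_win);
>   MPI_Free_mem(buf);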
> 
> Any help or comment would be much appreciated.
> 
> Cheers
> Joseph
> 
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
> 
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
> <mpiwin_vs_malloc.c>
