What Open MPI version are you using? Does this happen when you run on a single node or multiple nodes?
-Nathan

> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>
> All,
>
> We are observing some strange/interesting performance issues in accessing
> memory that has been allocated through MPI_Win_allocate. I am attaching our
> test case, which allocates memory for 100M integer values on each process,
> both through malloc and MPI_Win_allocate, and writes to the local ranges
> sequentially.
>
> On different systems (incl. SuperMUC and a Bull cluster), we see that
> accessing the memory allocated through MPI is significantly slower than
> accessing the malloc'ed memory if multiple processes run on a single node,
> and the effect grows with the number of processes per node. As an
> example, running 24 processes per node with the attached test case, the
> operations on the malloc'ed memory take ~0.4s while the MPI-allocated
> memory takes up to 10s.
>
> After some experiments, I think there are two factors involved:
>
> 1) Initialization: it appears that the first iteration is significantly
> slower than any subsequent accesses (1.1s vs. 0.4s with 12 processes on a
> single socket). Excluding the first iteration from the timing or memsetting
> the range leads to comparable performance. I assume that this is due to page
> faults that stem from first accessing the mmap'ed memory that backs the
> shared memory used in the window. The effect of presetting the malloc'ed
> memory seems smaller (0.4s vs. 0.6s).
>
> 2) NUMA effects: Given proper initialization, running on two sockets still
> leads to fluctuating performance degradation with the MPI window memory,
> which ranges up to 20x (in extreme cases). The performance of accessing the
> malloc'ed memory is rather stable. The difference seems to get smaller (but
> does not disappear) with an increasing number of repetitions. I am not sure
> what causes these effects, as each process should first-touch its local
> memory.
>
> Are these known issues? Does anyone have any thoughts on my analysis?
>
> It is problematic for us that replacing local memory allocation with MPI
> memory allocation leads to performance degradation, as we rely on this
> mechanism in our distributed data structures. While we can ensure proper
> initialization of the memory to mitigate 1) for performance measurements, I
> don't see a way to control the NUMA effects. If there is one, I'd be happy
> about any hints :)
>
> I should note that we also tested MPICH-based implementations, which showed
> similar effects (as they also mmap their window memory). Not surprisingly,
> using MPI_Alloc_mem and attaching that memory to a dynamic window does not
> cause these effects, while using shared-memory windows does. I ran my
> experiments using Open MPI 3.1.0 with the following command lines:
>
> - 12 cores / 1 socket:
>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
> - 24 cores / 2 sockets:
>   mpirun -n 24 --bind-to socket
>
> and verified the binding using --report-bindings.
>
> Any help or comment would be much appreciated.
>
> Cheers
> Joseph
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
>
> <mpiwin_vs_malloc.c>
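For readers without the attachment: below is a minimal sketch approximating the kind of benchmark Joseph describes (the actual mpiwin_vs_malloc.c is not reproduced here, so the repetition count, timing loop structure, and output format are assumptions; only the 100M-integer allocation size comes from the mail).

```c
/* Sketch of a malloc vs. MPI_Win_allocate write benchmark.
 * NELEMS (100M ints) is taken from the mail; REPS and the loop
 * structure are assumptions, not the original test case. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NELEMS 100000000UL  /* 100M ints per process */
#define REPS   10           /* assumed number of timed repetitions */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Baseline: plain heap allocation. */
    int *heap_buf = malloc(NELEMS * sizeof(int));

    /* Window memory: typically backed by mmap'ed shared memory when
     * all processes in the communicator share a node. */
    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)(NELEMS * sizeof(int)), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_buf, &win);

    for (int rep = 0; rep < REPS; ++rep) {
        /* Sequential writes to the local malloc'ed range. */
        double t = MPI_Wtime();
        for (size_t i = 0; i < NELEMS; ++i) heap_buf[i] = (int)i;
        double t_heap = MPI_Wtime() - t;

        /* Sequential writes to the local window range. */
        t = MPI_Wtime();
        for (size_t i = 0; i < NELEMS; ++i) win_buf[i] = (int)i;
        double t_win = MPI_Wtime() - t;

        if (rank == 0)
            printf("rep %d: malloc %.3fs, MPI_Win_allocate %.3fs\n",
                   rep, t_heap, t_win);
    }

    MPI_Win_free(&win);
    free(heap_buf);
    MPI_Finalize();
    return 0;
}
```

With a sketch like this, the two effects described above can be separated: memsetting win_buf (and heap_buf) before the timed loops isolates the first-touch/page-fault cost, while the remaining gap under --bind-to socket across two sockets corresponds to the NUMA-related fluctuation Joseph reports.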