This is very interesting. Thanks for providing the test code. I have two suggestions for understanding this better.
1) Use MPI_Win_allocate_shared instead and measure the difference with and without the alloc_shared_noncontig info key. I think this info is not available for MPI_Win_allocate because MPI_Win_shared_query is not permitted on MPI_Win_allocate windows. This is a flaw in MPI-3 that I would like to see fixed (https://github.com/mpi-forum/mpi-issues/issues/23).

2) Extend your test to allocate with mmap and measure with various sets of map flags (http://man7.org/linux/man-pages/man2/mmap.2.html). MAP_SHARED and MAP_PRIVATE are the right place to start. This experiment should make the cause unambiguous.

Most likely this is due to shared vs. private mapping, but there is likely a tradeoff w.r.t. RMA performance. It depends on your network and on how the MPI implementation uses it, but MPI_Win_create_dynamic likely leads to much worse RMA performance than MPI_Win_allocate. MPI_Win_create with a malloc'd buffer may perform worse than MPI_Win_allocate for internode RMA if the MPI implementation is lazy and doesn't cache page registration in MPI_Win_create.
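To make the two experiments concrete, here is a minimal sketch (not the attached test case; the sweep helper, element count, and output format are placeholder choices) that times a sequential write over memory from MPI_Win_allocate_shared with and without the alloc_shared_noncontig info key, and over mmap'ed buffers obtained with MAP_PRIVATE vs. MAP_SHARED and exposed through MPI_Win_create:

/* sketch_win_mem.c -- times a sequential write sweep over window memory
 * obtained in different ways (experiments 1 and 2 above). Element count,
 * helper names, and output format are arbitrary placeholder choices. */
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on some libcs */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define N ((size_t)100 * 1000 * 1000)   /* 100M ints, as in the original test */

/* Time a sequential write over n ints starting at p. */
static double sweep(int *p, size_t n)
{
    double t = MPI_Wtime();
    for (size_t i = 0; i < n; ++i) p[i] = (int)i;
    return MPI_Wtime() - t;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Shared windows must be allocated on a shared-memory communicator. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    /* Experiment 1: MPI_Win_allocate_shared with and without
     * the alloc_shared_noncontig info key. */
    for (int noncontig = 0; noncontig <= 1; ++noncontig) {
        MPI_Info info;
        MPI_Info_create(&info);
        if (noncontig)
            MPI_Info_set(info, "alloc_shared_noncontig", "true");
        int *base;
        MPI_Win win;
        MPI_Win_allocate_shared((MPI_Aint)(N * sizeof(int)), sizeof(int),
                                info, shmcomm, &base, &win);
        /* Each rank writes only its own segment, as in the original test. */
        printf("shared window (noncontig=%d): %.3f s\n",
               noncontig, sweep(base, N));
        MPI_Win_free(&win);
        MPI_Info_free(&info);
    }

    /* Experiment 2: mmap'ed memory (MAP_PRIVATE vs. MAP_SHARED),
     * exposed for RMA through MPI_Win_create. */
    const int flags[2]       = { MAP_PRIVATE | MAP_ANONYMOUS,
                                 MAP_SHARED  | MAP_ANONYMOUS };
    const char *flag_name[2] = { "MAP_PRIVATE", "MAP_SHARED" };
    for (int f = 0; f < 2; ++f) {
        int *buf = mmap(NULL, N * sizeof(int), PROT_READ | PROT_WRITE,
                        flags[f], -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); MPI_Abort(MPI_COMM_WORLD, 1); }
        MPI_Win win;
        MPI_Win_create(buf, (MPI_Aint)(N * sizeof(int)), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        printf("mmap %s: %.3f s\n", flag_name[f], sweep(buf, N));
        MPI_Win_free(&win);
        munmap(buf, N * sizeof(int));
    }

    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}

If the slowdown tracks the MAP_SHARED cases, that points at the shared mapping rather than at anything MPI-specific.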
Jeff

On Wed, May 23, 2018 at 3:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> All,
>
> We are observing some strange/interesting performance issues in accessing
> memory that has been allocated through MPI_Win_allocate. I am attaching our
> test case, which allocates memory for 100M integer values on each process,
> both through malloc and through MPI_Win_allocate, and writes to the local
> ranges sequentially.
>
> On different systems (incl. SuperMUC and a Bull cluster), we see that
> accessing the memory allocated through MPI is significantly slower than
> accessing the malloc'ed memory if multiple processes run on a single node,
> and the effect grows with the number of processes per node. As an example,
> running 24 processes per node with the attached example, we see the
> operations on the malloc'ed memory take ~0.4s while the MPI-allocated
> memory takes up to 10s.
>
> After some experiments, I think there are two factors involved:
>
> 1) Initialization: it appears that the first iteration is significantly
> slower than any subsequent access (1.1s vs. 0.4s with 12 processes on a
> single socket). Excluding the first iteration from the timing, or
> memsetting the range first, leads to comparable performance. I assume this
> is due to page faults that stem from first accessing the mmap'ed memory
> that backs the shared memory used in the window. The effect of presetting
> the malloc'ed memory seems smaller (0.4s vs. 0.6s).
>
> 2) NUMA effects: Given proper initialization, running on two sockets still
> leads to fluctuating performance degradation for the MPI window memory,
> which ranges up to 20x in extreme cases. The performance of accessing the
> malloc'ed memory is rather stable. The difference seems to get smaller (but
> does not disappear) with an increasing number of repetitions. I am not sure
> what causes these effects, as each process should first-touch its local
> memory.
>
> Are these known issues? Does anyone have any thoughts on my analysis?
>
> It is problematic for us that replacing local memory allocation with MPI
> memory allocation leads to performance degradation, as we rely on this
> mechanism in our distributed data structures. While we can ensure proper
> initialization of the memory to mitigate 1) for performance measurements, I
> don't see a way to control the NUMA effects. If there is one, I'd be happy
> about any hints :)
>
> I should note that we also tested MPICH-based implementations, which showed
> similar effects (as they also mmap their window memory). Not surprisingly,
> using MPI_Alloc_mem and attaching that memory to a dynamic window does not
> cause these effects, while using shared-memory windows does. I ran my
> experiments using Open MPI 3.1.0 with the following command lines:
>
> - 12 cores / 1 socket:
>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
> - 24 cores / 2 sockets:
>   mpirun -n 24 --bind-to socket
>
> and verified the binding using --report-bindings.
>
> Any help or comment would be much appreciated.
>
> Cheers
> Joseph
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
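For comparison, here is a minimal sketch of the dynamic-window variant Joseph mentions (MPI_Alloc_mem attached to a window created with MPI_Win_create_dynamic). The element count and timing output are placeholder choices, not taken from his attached test:

/* Sketch of the dynamic-window variant: memory from MPI_Alloc_mem is
 * attached to a window created with MPI_Win_create_dynamic, so the MPI
 * library does not back it with a shared-memory mapping. Element count
 * and output format are placeholder choices. */
#include <mpi.h>
#include <stdio.h>

#define N ((size_t)100 * 1000 * 1000)   /* 100M ints, as in the original test */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int *buf;
    MPI_Alloc_mem((MPI_Aint)(N * sizeof(int)), MPI_INFO_NULL, &buf);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, (MPI_Aint)(N * sizeof(int)));

    /* Sequential local writes, timed as in the other experiments. */
    double t = MPI_Wtime();
    for (size_t i = 0; i < N; ++i) buf[i] = (int)i;
    printf("dynamic window / MPI_Alloc_mem: %.3f s\n", MPI_Wtime() - t);

    /* Remote RMA on a dynamic window would additionally require exchanging
     * the attached address (MPI_Get_address plus a broadcast), not shown. */

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}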
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users