We had a similar issue a few months back. After investigation, it turned out to
be related to NUMA balancing [1] being enabled by default on recent releases of
Linux-based OSes.

In our case, turning off NUMA balancing fixed most of the performance
inconsistencies we had. You can check its status in
/proc/sys/kernel/numa_balancing.

George.
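For reference, a minimal C sketch of checking that setting (it assumes a Linux
/proc filesystem; writing 0 to the same file as root, or running
"sysctl -w kernel.numa_balancing=0", disables it):

#include <stdio.h>

/* Minimal sketch: print the current kernel.numa_balancing setting
 * (0 = disabled, 1 = enabled). Assumes a Linux /proc filesystem. */
int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
    if (f == NULL) {
        perror("fopen(/proc/sys/kernel/numa_balancing)");
        return 1;
    }
    int enabled = -1;
    if (fscanf(f, "%d", &enabled) == 1)
        printf("kernel.numa_balancing = %d\n", enabled);
    fclose(f);
    return 0;
}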
On Wed, May 23, 2018 at 11:16 AM, Jeff Hammond <jeff.scie...@gmail.com> wrote:

> This is very interesting. Thanks for providing a test code. I have two
> suggestions for understanding this better.
>
> 1) Use MPI_Win_allocate_shared instead and measure the difference with
> and without alloc_shared_noncontig. I think this info is not available
> for MPI_Win_allocate because MPI_Win_shared_query is not permitted on
> MPI_Win_allocate windows. This is a flaw in MPI-3 that I would like to
> see fixed (https://github.com/mpi-forum/mpi-issues/issues/23).
>
> 2) Extend your test to allocate with mmap and measure with various sets
> of map flags (http://man7.org/linux/man-pages/man2/mmap.2.html). Starting
> with MAP_SHARED and MAP_PRIVATE is the right place to start. This
> experiment should make the cause unambiguous.
>
> Most likely, this is due to shared vs. private mapping, but there is
> likely a trade-off w.r.t. RMA performance. It depends on your network and
> how the MPI implementation uses it, but MPI_Win_create_dynamic likely
> leads to much worse RMA performance than MPI_Win_allocate. MPI_Win_create
> with a malloc'd buffer may perform worse than MPI_Win_allocate for
> internode RMA if the MPI implementation is lazy and doesn't cache page
> registration in MPI_Win_create.
>
> Jeff
>
> On Wed, May 23, 2018 at 3:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>
>> All,
>>
>> We are observing some strange/interesting performance issues in
>> accessing memory that has been allocated through MPI_Win_allocate. I am
>> attaching our test case, which allocates memory for 100M integer values
>> on each process, both through malloc and MPI_Win_allocate, and writes to
>> the local ranges sequentially.
>>
>> On different systems (incl. SuperMUC and a Bull cluster), we see that
>> accessing the memory allocated through MPI is significantly slower than
>> accessing the malloc'ed memory if multiple processes run on a single
>> node, with the effect increasing with the number of processes per node.
>> As an example, running 24 processes per node with the attached example,
>> we see the operations on the malloc'ed memory take ~0.4 s while the
>> MPI-allocated memory takes up to 10 s.
>>
>> After some experiments, I think there are two factors involved:
>>
>> 1) Initialization: it appears that the first iteration is significantly
>> slower than any subsequent accesses (1.1 s vs. 0.4 s with 12 processes
>> on a single socket). Excluding the first iteration from the timing or
>> memsetting the range leads to comparable performance. I assume that this
>> is due to page faults that stem from first accessing the mmap'ed memory
>> that backs the shared memory used in the window. The effect of
>> presetting the malloc'ed memory seems smaller (0.4 s vs. 0.6 s).
>>
>> 2) NUMA effects: given proper initialization, running on two sockets
>> still leads to fluctuating performance degradation on the MPI window
>> memory, which ranges up to 20x in extreme cases. The performance of
>> accessing the malloc'ed memory is rather stable. The difference seems to
>> get smaller (but does not disappear) with an increasing number of
>> repetitions. I am not sure what causes these effects, as each process
>> should first-touch its local memory.
>>
>> Are these known issues? Does anyone have any thoughts on my analysis?
>>
>> It is problematic for us that replacing local memory allocation with MPI
>> memory allocation leads to performance degradation, as we rely on this
>> mechanism in our distributed data structures. While we can ensure proper
>> initialization of the memory to mitigate 1) for performance measurements,
>> I don't see a way to control the NUMA effects. If there is one, I'd be
>> happy about any hints :)
>>
>> I should note that we also tested MPICH-based implementations, which
>> showed similar effects (as they also mmap their window memory). Not
>> surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic
>> window does not cause these effects, while using shared memory windows
>> does. I ran my experiments using Open MPI 3.1.0 with the following
>> command lines:
>>
>> - 12 cores / 1 socket:
>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>> - 24 cores / 2 sockets:
>>   mpirun -n 24 --bind-to socket
>>
>> and verified the binding using --report-bindings.
>>
>> Any help or comment would be much appreciated.
>>
>> Cheers
>> Joseph
>>
>> --
>> Dipl.-Inf. Joseph Schuchart
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstr. 19
>> D-70569 Stuttgart
>>
>> Tel.: +49(0)711-68565890
>> Fax: +49(0)711-6856832
>> E-Mail: schuch...@hlrs.de
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
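To make Jeff's first suggestion concrete, here is a minimal sketch of the
MPI_Win_allocate_shared comparison, with the standard alloc_shared_noncontig
info key toggled; the access loop and timing are simplified and not taken
from the attached test case:

#include <mpi.h>
#include <stdio.h>

#define N 100000000  /* 100M integers per process, as in the test case */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Toggle the hint to compare contiguous vs. non-contiguous backing. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "alloc_shared_noncontig", "true");

    int *baseptr;
    MPI_Win win;
    /* Requires a communicator whose ranks share memory; on a single node
     * MPI_COMM_WORLD is fine, otherwise split it with
     * MPI_Comm_split_type(..., MPI_COMM_TYPE_SHARED, ...). */
    MPI_Win_allocate_shared((MPI_Aint)N * sizeof(int), sizeof(int),
                            info, MPI_COMM_WORLD, &baseptr, &win);

    double start = MPI_Wtime();
    for (size_t i = 0; i < N; ++i)
        baseptr[i] = (int)i;       /* sequential writes to the local range */
    printf("write took %f s\n", MPI_Wtime() - start);

    MPI_Win_free(&win);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}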
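A similar sketch for the second suggestion, comparing the mapping flags
directly with mmap and no MPI involved; MAP_ANONYMOUS keeps the mapping
file-free, and the first repetition pays the page-fault cost Joseph
describes under point 1:

#define _DEFAULT_SOURCE          /* MAP_ANONYMOUS, clock_gettime on glibc */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define N 100000000UL            /* 100M integers, as in the test case */

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void bench(int flags, const char *name)
{
    int *buf = mmap(NULL, N * sizeof(int), PROT_READ | PROT_WRITE,
                    flags | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror(name); return; }

    for (int rep = 0; rep < 3; ++rep) {  /* rep 0 pays the page faults */
        double start = now();
        for (size_t i = 0; i < N; ++i)
            buf[i] = (int)i;
        printf("%-11s rep %d: %.3f s\n", name, rep, now() - start);
    }
    munmap(buf, N * sizeof(int));
}

int main(void)
{
    bench(MAP_PRIVATE, "MAP_PRIVATE");  /* what malloc'd memory resembles */
    bench(MAP_SHARED,  "MAP_SHARED");   /* what shared-memory windows use */
    return 0;
}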
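And, for completeness, a sketch of the MPI_Alloc_mem plus dynamic-window
variant that Joseph reports as not showing the effect (local writes only;
RMA targets would additionally need the attached address published via
MPI_Get_address):

#include <mpi.h>
#include <stdio.h>

#define N 100000000   /* 100M integers per process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Memory comes from MPI_Alloc_mem (not from a shared-memory segment)
     * and is attached to a dynamic window afterwards. */
    int *buf;
    MPI_Alloc_mem((MPI_Aint)N * sizeof(int), MPI_INFO_NULL, &buf);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, (MPI_Aint)N * sizeof(int));

    double start = MPI_Wtime();
    for (size_t i = 0; i < N; ++i)
        buf[i] = (int)i;          /* local sequential writes */
    printf("write took %f s\n", MPI_Wtime() - start);

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}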