We had a similar issue a few months back. After some investigation it
turned out to be related to NUMA balancing [1], which is enabled by
default on recent releases of Linux-based OSes.

In our case, turning off NUMA balancing fixed most of the performance
inconsistencies we had seen. You can check its status in
/proc/sys/kernel/numa_balancing.
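
For example, on most recent Linux distributions you can query the current
setting and turn it off (as root) via sysctl; the exact mechanism may
differ on your system:

  cat /proc/sys/kernel/numa_balancing      # 1 = enabled, 0 = disabled
  sysctl -w kernel.numa_balancing=0        # disable (until next reboot)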

  George.




On Wed, May 23, 2018 at 11:16 AM, Jeff Hammond <jeff.scie...@gmail.com>
wrote:

> This is very interesting.  Thanks for providing a test code.  I have two
> suggestions for understanding this better.
>
> 1) Use MPI_Win_allocate_shared instead and measure the difference with and
> without alloc_shared_noncontig.  I think this info is not available for
> MPI_Win_allocate because MPI_Win_shared_query is not permitted on
> MPI_Win_allocate windows.  This is a flaw in MPI-3 that I would like to see
> fixed (https://github.com/mpi-forum/mpi-issues/issues/23).
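>
> Something like the following (untested sketch; the write loop and the
> timing are assumed to mirror the attached test case, they are not taken
> from it):
>
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   #define N 100000000  /* 100M ints, as in the attached test */
>
>   int main(int argc, char **argv) {
>       MPI_Comm nodecomm;
>       MPI_Info info;
>       MPI_Win win;
>       int *base;
>
>       MPI_Init(&argc, &argv);
>       /* shared-memory windows require a communicator whose processes
>          share a node */
>       MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>                           MPI_INFO_NULL, &nodecomm);
>
>       MPI_Info_create(&info);
>       /* set to "false" (or pass MPI_INFO_NULL) for the contiguous case */
>       MPI_Info_set(info, "alloc_shared_noncontig", "true");
>
>       MPI_Win_allocate_shared((MPI_Aint)N * sizeof(int), sizeof(int),
>                               info, nodecomm, &base, &win);
>
>       for (size_t i = 0; i < N; ++i)   /* time this loop */
>           base[i] = (int)i;
>
>       MPI_Win_free(&win);
>       MPI_Info_free(&info);
>       MPI_Comm_free(&nodecomm);
>       MPI_Finalize();
>       return 0;
>   }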
>
> 2) Extend your test to allocate with mmap and measure with various sets
> of map flags (http://man7.org/linux/man-pages/man2/mmap.2.html).
> Comparing MAP_SHARED and MAP_PRIVATE is the right place to start.  This
> experiment should make the cause unambiguous.
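>
> A bare-bones version of that experiment could look like this (untested
> sketch, with the same write loop assumed as above):
>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <sys/mman.h>
>
>   #define N 100000000UL   /* 100M ints, matching the test case */
>
>   int main(void) {
>       size_t bytes = N * sizeof(int);
>       /* swap MAP_SHARED for MAP_PRIVATE (and try adding MAP_POPULATE)
>          to compare the variants */
>       int *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
>                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>       if (buf == MAP_FAILED) { perror("mmap"); return 1; }
>
>       for (size_t i = 0; i < N; ++i)   /* time this loop */
>           buf[i] = (int)i;
>
>       munmap(buf, bytes);
>       return 0;
>   }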
>
> Most likely this is due to shared vs. private mapping, but there may be
> a tradeoff w.r.t. RMA performance.  It depends on your network and how
> the MPI implementation uses it, but MPI_Win_create_dynamic likely leads
> to much worse RMA performance than MPI_Win_allocate.  MPI_Win_create
> with a malloc'd buffer may also perform worse than MPI_Win_allocate for
> internode RMA if the MPI implementation is lazy and doesn't cache page
> registration in MPI_Win_create.
>
> Jeff
>
> On Wed, May 23, 2018 at 3:45 AM, Joseph Schuchart <schuch...@hlrs.de>
> wrote:
>
>> All,
>>
>> We are observing some strange/interesting performance issues in accessing
>> memory that has been allocated through MPI_Win_allocate. I am attaching our
>> test case, which allocates memory for 100M integer values on each process
>> both through malloc and MPI_Win_allocate and writes to the local ranges
>> sequentially.
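>>
>> In essence, the pattern is the following (simplified sketch of what is
>> described above, not the attached code itself; timing, repetitions, and
>> error checks are left out):
>>
>>   #include <mpi.h>
>>   #include <stdlib.h>
>>
>>   #define N 100000000
>>
>>   int main(int argc, char **argv) {
>>       MPI_Init(&argc, &argv);
>>
>>       int *heap = malloc((size_t)N * sizeof(int));
>>       for (size_t i = 0; i < N; ++i)      /* timed */
>>           heap[i] = (int)i;
>>
>>       int *wbase;
>>       MPI_Win win;
>>       MPI_Win_allocate((MPI_Aint)N * sizeof(int), sizeof(int),
>>                        MPI_INFO_NULL, MPI_COMM_WORLD, &wbase, &win);
>>       for (size_t i = 0; i < N; ++i)      /* timed */
>>           wbase[i] = (int)i;
>>
>>       MPI_Win_free(&win);
>>       free(heap);
>>       MPI_Finalize();
>>       return 0;
>>   }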
>>
>> On different systems (incl. SuperMUC and a Bull Cluster), we see that
>> accessing the memory allocated through MPI is significantly slower than
>> accessing the malloc'ed memory if multiple processes run on a single
>> node, and the effect grows with the number of processes per node. As an
>> example, running 24 processes per node with the attached example, the
>> operations on the malloc'ed memory take ~0.4s while the MPI-allocated
>> memory takes up to 10s.
>>
>> After some experiments, I think there are two factors involved:
>>
>> 1) Initialization: it appears that the first iteration is significantly
>> slower than any subsequent accesses (1.1s vs 0.4s with 12 processes on a
>> single socket). Excluding the first iteration from the timing or memsetting
>> the range leads to comparable performance. I assume that this is due to
>> page faults that stem from first accessing the mmap'ed memory that backs
>> the shared memory used in the window. The effect of presetting the
>> malloc'ed memory seems smaller (0.4s vs 0.6s).
>>
>> 2) NUMA effects: Given proper initialization, running on two sockets
>> still leads to fluctuating performance degradation with the MPI window
>> memory, up to 20x in extreme cases. The performance of accessing the
>> malloc'ed memory is rather stable. The difference seems to get smaller
>> (but does not disappear) with an increasing number of repetitions. I am
>> not sure what causes these effects, as each process should first-touch
>> its local memory.
>>
>> Are these known issues? Does anyone have any thoughts on my analysis?
>>
>> It is problematic for us that replacing local memory allocation with
>> MPI memory allocation leads to performance degradation, as we rely on
>> this mechanism in our distributed data structures. While we can ensure
>> proper initialization of the memory to mitigate 1) for performance
>> measurements, I don't see a way to control the NUMA effects. If there
>> is one, I'd appreciate any hints :)
>>
>> I should note that we also tested MPICH-based implementations, which
>> showed similar effects (as they also mmap their window memory). Not
>> surprisingly, using MPI_Alloc_mem and attaching that memory to a
>> dynamic window does not cause these effects, while using shared-memory
>> windows does. I ran my experiments using Open MPI 3.1.0 with the
>> following command lines:
>>
>> - 12 cores / 1 socket:
>> mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>> - 24 cores / 2 sockets:
>> mpirun -n 24 --bind-to socket
>>
>> and verified the binding using --report-bindings.
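>>
>> For reference, the dynamic-window variant mentioned above
>> (MPI_Alloc_mem plus MPI_Win_attach) follows roughly this pattern
>> (untested sketch, not the code we actually ran):
>>
>>   #include <stddef.h>
>>   #include <mpi.h>
>>
>>   #define N 100000000
>>
>>   int main(int argc, char **argv) {
>>       MPI_Init(&argc, &argv);
>>
>>       MPI_Win win;
>>       MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>>
>>       int *buf;
>>       MPI_Alloc_mem((MPI_Aint)N * sizeof(int), MPI_INFO_NULL, &buf);
>>       MPI_Win_attach(win, buf, (MPI_Aint)N * sizeof(int));
>>
>>       for (size_t i = 0; i < N; ++i)      /* timed as before */
>>           buf[i] = (int)i;
>>
>>       MPI_Win_detach(win, buf);
>>       MPI_Free_mem(buf);
>>       MPI_Win_free(&win);
>>       MPI_Finalize();
>>       return 0;
>>   }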
>>
>> Any help or comment would be much appreciated.
>>
>> Cheers
>> Joseph
>>
>> --
>> Dipl.-Inf. Joseph Schuchart
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstr. 19
>> D-70569 Stuttgart
>>
>> Tel.: +49(0)711-68565890
>> Fax: +49(0)711-6856832
>> E-Mail: schuch...@hlrs.de
>>
>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>