This is very interesting. Thanks for providing the test code. I have two suggestions for understanding this better.
1) Use MPI_Win_allocate_shared instead and measure the difference with and without the alloc_shared_noncontig info key. I think this info is not available for MPI_Win_allocate because MPI_Win_shared_query is not permitted on MPI_Win_allocate windows. This is a flaw in MPI-3 that I would like to see fixed (https://github.com/mpi-forum/mpi-issues/issues/23).

2) Extend your test to allocate with mmap and measure with various sets of map flags (http://man7.org/linux/man-pages/man2/mmap.2.html). MAP_SHARED and MAP_PRIVATE are the right place to start. This experiment should make the cause unambiguous.

Most likely this is due to shared vs. private mapping, but there is likely a tradeoff w.r.t. RMA performance. It depends on your network and on how the MPI implementation uses it, but MPI_Win_create_dynamic likely leads to much worse RMA performance than MPI_Win_allocate. MPI_Win_create with a malloc'd buffer may perform worse than MPI_Win_allocate for internode RMA if the MPI implementation is lazy and doesn't cache page registration in MPI_Win_create.
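To make the two experiments concrete, here is a minimal sketch (not the attached test case; the sweep helper, element count, and output format are placeholder choices) that times a sequential write over memory from MPI_Win_allocate_shared with and without the alloc_shared_noncontig info key, and over mmap'ed buffers obtained with MAP_PRIVATE vs. MAP_SHARED and exposed through MPI_Win_create:

/* sketch_win_mem.c -- times a sequential write sweep over window memory
 * obtained in different ways (experiments 1 and 2 above). Element count,
 * helper names, and output format are arbitrary placeholder choices. */
#define _GNU_SOURCE            /* for MAP_ANONYMOUS on some libcs */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define N ((size_t)100 * 1000 * 1000)   /* 100M ints, as in the original test */

/* Time a sequential write over n ints starting at p. */
static double sweep(int *p, size_t n)
{
    double t = MPI_Wtime();
    for (size_t i = 0; i < n; ++i) p[i] = (int)i;
    return MPI_Wtime() - t;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Shared windows must be allocated on a shared-memory communicator. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    /* Experiment 1: MPI_Win_allocate_shared with and without
     * the alloc_shared_noncontig info key. */
    for (int noncontig = 0; noncontig <= 1; ++noncontig) {
        MPI_Info info;
        MPI_Info_create(&info);
        if (noncontig)
            MPI_Info_set(info, "alloc_shared_noncontig", "true");
        int *base;
        MPI_Win win;
        MPI_Win_allocate_shared((MPI_Aint)(N * sizeof(int)), sizeof(int),
                                info, shmcomm, &base, &win);
        /* Each rank writes only its own segment, as in the original test. */
        printf("shared window (noncontig=%d): %.3f s\n",
               noncontig, sweep(base, N));
        MPI_Win_free(&win);
        MPI_Info_free(&info);
    }

    /* Experiment 2: mmap'ed memory (MAP_PRIVATE vs. MAP_SHARED),
     * exposed for RMA through MPI_Win_create. */
    const int flags[2]       = { MAP_PRIVATE | MAP_ANONYMOUS,
                                 MAP_SHARED  | MAP_ANONYMOUS };
    const char *flag_name[2] = { "MAP_PRIVATE", "MAP_SHARED" };
    for (int f = 0; f < 2; ++f) {
        int *buf = mmap(NULL, N * sizeof(int), PROT_READ | PROT_WRITE,
                        flags[f], -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); MPI_Abort(MPI_COMM_WORLD, 1); }
        MPI_Win win;
        MPI_Win_create(buf, (MPI_Aint)(N * sizeof(int)), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        printf("mmap %s: %.3f s\n", flag_name[f], sweep(buf, N));
        MPI_Win_free(&win);
        munmap(buf, N * sizeof(int));
    }

    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}

If the slowdown tracks the MAP_SHARED cases, that points at the shared mapping rather than at anything MPI-specific.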
Jeff

On Wed, May 23, 2018 at 3:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> All,
>
> We are observing some strange/interesting performance issues in accessing
> memory that has been allocated through MPI_Win_allocate. I am attaching our
> test case, which allocates memory for 100M integer values on each process,
> both through malloc and through MPI_Win_allocate, and writes to the local
> ranges sequentially.
>
> On different systems (incl. SuperMUC and a Bull cluster), we see that
> accessing the memory allocated through MPI is significantly slower than
> accessing the malloc'ed memory if multiple processes run on a single node,
> and the effect grows with the number of processes per node. As an example,
> running 24 processes per node with the attached example, we see the
> operations on the malloc'ed memory take ~0.4s while the MPI-allocated
> memory takes up to 10s.
>
> After some experiments, I think there are two factors involved:
>
> 1) Initialization: it appears that the first iteration is significantly
> slower than any subsequent access (1.1s vs. 0.4s with 12 processes on a
> single socket). Excluding the first iteration from the timing, or
> memsetting the range first, leads to comparable performance. I assume this
> is due to page faults that stem from first accessing the mmap'ed memory
> that backs the shared memory used in the window. The effect of presetting
> the malloc'ed memory seems smaller (0.4s vs. 0.6s).
>
> 2) NUMA effects: Given proper initialization, running on two sockets still
> leads to fluctuating performance degradation for the MPI window memory,
> which ranges up to 20x in extreme cases. The performance of accessing the
> malloc'ed memory is rather stable. The difference seems to get smaller (but
> does not disappear) with an increasing number of repetitions. I am not sure
> what causes these effects, as each process should first-touch its local
> memory.
>
> Are these known issues? Does anyone have any thoughts on my analysis?
>
> It is problematic for us that replacing local memory allocation with MPI
> memory allocation leads to performance degradation, as we rely on this
> mechanism in our distributed data structures. While we can ensure proper
> initialization of the memory to mitigate 1) for performance measurements, I
> don't see a way to control the NUMA effects. If there is one, I'd be happy
> about any hints :)
>
> I should note that we also tested MPICH-based implementations, which showed
> similar effects (as they also mmap their window memory). Not surprisingly,
> using MPI_Alloc_mem and attaching that memory to a dynamic window does not
> cause these effects, while using shared-memory windows does. I ran my
> experiments using Open MPI 3.1.0 with the following command lines:
>
> - 12 cores / 1 socket:
>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
> - 24 cores / 2 sockets:
>   mpirun -n 24 --bind-to socket
>
> and verified the binding using --report-bindings.
>
> Any help or comment would be much appreciated.
>
> Cheers
> Joseph
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
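For comparison, here is a minimal sketch of the dynamic-window variant Joseph mentions (MPI_Alloc_mem attached to a window created with MPI_Win_create_dynamic). The element count and timing output are placeholder choices, not taken from his attached test:

/* Sketch of the dynamic-window variant: memory from MPI_Alloc_mem is
 * attached to a window created with MPI_Win_create_dynamic, so the MPI
 * library does not back it with a shared-memory mapping. Element count
 * and output format are placeholder choices. */
#include <mpi.h>
#include <stdio.h>

#define N ((size_t)100 * 1000 * 1000)   /* 100M ints, as in the original test */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int *buf;
    MPI_Alloc_mem((MPI_Aint)(N * sizeof(int)), MPI_INFO_NULL, &buf);

    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_attach(win, buf, (MPI_Aint)(N * sizeof(int)));

    /* Sequential local writes, timed as in the other experiments. */
    double t = MPI_Wtime();
    for (size_t i = 0; i < N; ++i) buf[i] = (int)i;
    printf("dynamic window / MPI_Alloc_mem: %.3f s\n", MPI_Wtime() - t);

    /* Remote RMA on a dynamic window would additionally require exchanging
     * the attached address (MPI_Get_address plus a broadcast), not shown. */

    MPI_Win_detach(win, buf);
    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}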
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users