I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC
7.1.0, on the Bull Cluster. So far I have only run on a single node and
haven't tested what happens when more than one node is involved.
Joseph
On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
What Open MPI version are you using? Does this happen when you run on a single
node or multiple nodes?
-Nathan
On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,
We are observing some strange/interesting performance issues in accessing
memory that has been allocated through MPI_Win_allocate. I am attaching our
test case, which allocates memory for 100M integer values on each process both
through malloc and MPI_Win_allocate and writes to the local ranges sequentially.
On different systems (incl. SuperMUC and a Bull Cluster), we see that accessing
the memory allocated through MPI is significantly slower than accessing the
malloc'ed memory if multiple processes run on a single node, and the effect
grows with the number of processes per node. As an example, with 24 processes
per node and the attached example, the operations on the malloc'ed memory take
~0.4s while the same operations on the MPI-allocated memory take up to 10s.
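For reference, here is a stripped-down sketch of what the test case does; the attached mpiwin_vs_malloc.c is the authoritative version, and the repetition count and output format below are simplifications:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NELEM 100000000UL   /* 100M integers per process */
    #define NREP  10            /* repetition count is a placeholder */

    /* Write sequentially to the local range and print the time per pass. */
    static void time_writes(int *buf, const char *label)
    {
        for (int rep = 0; rep < NREP; ++rep) {
            double start = MPI_Wtime();
            for (size_t i = 0; i < NELEM; ++i)
                buf[i] = (int)i;
            printf("%s rep %d: %.3fs\n", label, rep, MPI_Wtime() - start);
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Local allocation through malloc. */
        int *heap_buf = malloc(NELEM * sizeof(int));

        /* Local allocation through MPI_Win_allocate (backed by mmap'ed
         * shared memory when all processes are on one node). */
        int *win_buf;
        MPI_Win win;
        MPI_Win_allocate(NELEM * sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_buf, &win);

        time_writes(heap_buf, "malloc");
        time_writes(win_buf, "MPI_Win_allocate");

        MPI_Win_free(&win);
        free(heap_buf);
        MPI_Finalize();
        return 0;
    }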
After some experiments, I think there are two factors involved:
1) Initialization: it appears that the first iteration is significantly slower
than any subsequent access (1.1s vs 0.4s with 12 processes on a single socket).
Excluding the first iteration from the timing, or memsetting the range
beforehand, leads to comparable performance (a sketch of this pre-touching
follows below the two points). I assume this is due to page faults on first
access to the mmap'ed memory that backs the shared memory used in the window.
The effect of presetting the malloc'ed memory seems smaller (0.4s vs 0.6s).
2) NUMA effects: even with proper initialization, running on two sockets still
leads to fluctuating performance degradation with the MPI window memory, in
extreme cases up to 20x. The performance of accessing the malloc'ed memory is
rather stable. The difference seems to get smaller (but does not disappear)
with an increasing number of repetitions. I am not sure what causes these
effects, as each process should first-touch its local memory.
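To make point 1) concrete, the pre-initialization we use to take the page-fault cost out of the timed loop looks roughly like this (names follow the sketch above, not the exact attached code):

    #include <string.h>

    /* Touch every page of a local range once, right after allocation and
     * before the timed iterations, so that the page faults on the mmap'ed
     * window memory (and the first-touch placement of its pages) are not
     * part of the measurement. */
    static void pretouch(int *buf, size_t nelem)
    {
        memset(buf, 0, nelem * sizeof(int));
    }

    /* Usage, between allocation and the timing loops:
     *   pretouch(heap_buf, NELEM);
     *   pretouch(win_buf,  NELEM);
     *   MPI_Barrier(MPI_COMM_WORLD);
     */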
Are these known issues? Does anyone have any thoughts on my analysis?
It is problematic for us that replacing local memory allocation with MPI memory
allocation leads to performance degradation as we rely on this mechanism in our
distributed data structures. While we can ensure proper initialization of the
memory to mitigate 1) for performance measurements, I don't see a way to
control the NUMA effects. If there is one I'd be happy about any hints :)
I should note that we also tested MPICH-based implementations, which showed
similar effects (as they also mmap their window memory). Not surprisingly,
using MPI_Alloc_mem and attaching that memory to a dynamic window does not
cause these effects, while using shared memory windows does (a minimal sketch
of the dynamic-window variant follows after the command lines below). I ran my
experiments using Open MPI 3.1.0 with the following command lines:
- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket
and verified the binding using --report-bindings.
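For completeness, the dynamic-window variant that does not show the slowdown looks roughly like this (simplified, without error handling; not the attached code):

    #include <mpi.h>

    /* Allocate process-local memory with MPI_Alloc_mem and expose it through
     * a dynamic window instead of letting MPI_Win_allocate back it with
     * shared memory. Local writes to the returned buffer behaved like
     * writes to malloc'ed memory in our tests. */
    static int *alloc_dynamic_window(MPI_Aint nbytes, MPI_Comm comm, MPI_Win *win)
    {
        int *buf;
        MPI_Win_create_dynamic(MPI_INFO_NULL, comm, win);
        MPI_Alloc_mem(nbytes, MPI_INFO_NULL, &buf);
        MPI_Win_attach(*win, buf, nbytes);
        return buf;
    }

    /* Cleanup: MPI_Win_detach(win, buf); MPI_Free_mem(buf); MPI_Win_free(&win); */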
Any help or comment would be much appreciated.
Cheers
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
<mpiwin_vs_malloc.c>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users