Odd. I wonder if it is something affected by your session directory. It might be worth moving the segment to /dev/shm. I don’t expect it will have an impact, but you could try the following patch:
diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
     posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
     if (0 == ompi_comm_rank (module->comm)) {
         char *data_file;
-        if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
-                     ompi_process_info.proc_session_dir,
+        if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+                     ompi_process_info.my_name.jobid,
                      ompi_comm_get_cid(module->comm),
                      ompi_process_info.nodename) < 0) {
             return OMPI_ERR_OUT_OF_RESOURCE;

> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 on the Bull Cluster. I only ran on a single node but haven't tested what happens if more than one node is involved.
> 
> Joseph
> 
> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>> What Open MPI version are you using? Does this happen when you run on a single node or multiple nodes?
>> 
>> -Nathan
>> 
>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> 
>>> All,
>>> 
>>> We are observing some strange/interesting performance issues when accessing memory that has been allocated through MPI_Win_allocate. I am attaching our test case, which allocates memory for 100M integer values on each process, both through malloc and through MPI_Win_allocate, and writes to the local ranges sequentially.
>>> 
>>> On different systems (incl. SuperMUC and a Bull Cluster) we see that accessing the memory allocated through MPI is significantly slower than accessing the malloc'ed memory if multiple processes run on a single node, with the effect growing as the number of processes per node increases. As an example, with 24 processes per node and the attached example, the operations on the malloc'ed memory take ~0.4 s while the MPI-allocated memory takes up to 10 s.
>>> 
>>> After some experiments, I think there are two factors involved:
>>> 
>>> 1) Initialization: it appears that the first iteration is significantly slower than any subsequent accesses (1.1 s vs 0.4 s with 12 processes on a single socket). Excluding the first iteration from the timing or memsetting the range leads to comparable performance. I assume that this is due to page faults that stem from first accessing the mmap'ed memory that backs the shared memory used in the window. The effect of presetting the malloc'ed memory seems smaller (0.4 s vs 0.6 s).
>>> 
>>> 2) NUMA effects: even with proper initialization, running on two sockets still leads to fluctuating performance degradation for the MPI window memory, up to 20x in extreme cases. The performance of accessing the malloc'ed memory is rather stable. The difference seems to get smaller (but does not disappear) with an increasing number of repetitions. I am not sure what causes these effects, as each process should first-touch its local memory.
>>> 
>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>> 
>>> It is problematic for us that replacing local memory allocation with MPI memory allocation leads to performance degradation, as we rely on this mechanism in our distributed data structures.
>>> While we can ensure proper initialization of the memory to mitigate 1) for performance measurements, I don't see a way to control the NUMA effects. If there is one, I'd be happy about any hints :)
>>> 
>>> I should note that we also tested MPICH-based implementations, which showed similar effects (as they also mmap their window memory). Not surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic window does not cause these effects, while using shared memory windows does. I ran my experiments using Open MPI 3.1.0 with the following command lines:
>>> 
>>> - 12 cores / 1 socket:
>>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>> - 24 cores / 2 sockets:
>>>   mpirun -n 24 --bind-to socket
>>> 
>>> and verified the binding using --report-bindings.
>>> 
>>> Any help or comment would be much appreciated.
>>> 
>>> Cheers
>>> Joseph
>>> 
>>> -- 
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>> 
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
>>> 
>>> <mpiwin_vs_malloc.c>
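For readers without the original attachment, here is a minimal sketch of the kind of benchmark described above. It is not the attached mpiwin_vs_malloc.c; the repetition count, the rank-0-only output, and the commented-out memset pre-fault are illustrative assumptions. Each rank writes sequentially to 100M integers allocated with malloc and to the same amount allocated with MPI_Win_allocate, timing each pass.

/* malloc vs MPI_Win_allocate sequential-write sketch (illustrative, not the
 * original attachment). Build with: mpicc -O2 win_vs_malloc_sketch.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NELEMS (100 * 1000 * 1000)  /* 100M ints per process, as in the report */
#define NREPS  10                   /* assumed repetition count */

/* Write the buffer sequentially and return the elapsed time in seconds. */
static double time_fill(int *buf, size_t n)
{
    double start = MPI_Wtime();
    for (size_t i = 0; i < n; ++i) {
        buf[i] = (int)i;
    }
    return MPI_Wtime() - start;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Plain local allocation. */
    int *local = malloc((size_t)NELEMS * sizeof(int));
    if (NULL == local) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Window allocation of the same size per process; backed by shared
     * memory when all ranks run on one node. */
    int *winbuf;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)NELEMS * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);

    /* Optional mitigation for point 1) above: pre-fault the window pages so
     * the first timed iteration does not include page-fault cost. */
    /* memset(winbuf, 0, (size_t)NELEMS * sizeof(int)); */

    for (int rep = 0; rep < NREPS; ++rep) {
        double t_malloc = time_fill(local, NELEMS);
        double t_win    = time_fill(winbuf, NELEMS);
        if (0 == rank) {   /* rank 0 only, to keep the output short */
            printf("rep %d: malloc %.3f s, MPI_Win_allocate %.3f s\n",
                   rep, t_malloc, t_win);
        }
    }

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

Run it with the bindings from the quoted message, e.g. mpirun -n 24 --bind-to socket --report-bindings ./a.out, and compare the per-repetition times for the two allocations.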