Ok, thanks for testing that. I will open a PR against master changing the default 
backing location to /dev/shm on Linux. It will also be PR’d to v3.0.x and v3.1.x.

-Nathan

> On May 24, 2018, at 6:46 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> 
> Thank you all for your input!
> 
> Nathan: thanks for that hint, this seems to be the culprit: With your patch, 
> I do not observe a difference in performance between the two memory 
> allocations. I remembered that Open MPI allows changing the shmem allocator 
> on the command line. Using vanilla Open MPI 3.1.0 and increasing the priority 
> of the POSIX shmem implementation using `--mca shmem_posix_priority 100` 
> leads to good performance, too. The reason could be that on the Bull machine 
> /tmp is mounted on a disk partition (SSD, iirc). Maybe there is actual I/O 
> involved that hurts performance if the shm backing file is located on a disk 
> (even though the file is unlinked before the memory is accessed)?
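> 
> For illustration, the corresponding command line looks something like this 
> (the binary name is just a placeholder here):
> 
>   mpirun --mca shmem_posix_priority 100 -n 24 --bind-to socket ./mpiwin_vs_malloc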
> 
> Regarding the other hints: I tried using MPI_Win_allocate_shared with the 
> noncontig hint. Using POSIX shmem, I do not observe a difference in 
> performance compared to the other two options. With the disk-backed shmem 
> file, the performance fluctuations are similar to those with MPI_Win_allocate.
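> 
> For reference, the noncontig variant looks roughly like this (a sketch; 
> NELEM stands for the element count, error checking omitted):
> 
>   #include <mpi.h>
> 
>   /* node-local communicator: MPI_Win_allocate_shared requires a communicator
>      whose processes can all share memory */
>   MPI_Comm shmcomm;
>   MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>                       MPI_INFO_NULL, &shmcomm);
> 
>   /* pass the standard alloc_shared_noncontig info key */
>   MPI_Info info;
>   MPI_Info_create(&info);
>   MPI_Info_set(info, "alloc_shared_noncontig", "true");
> 
>   int *baseptr;
>   MPI_Win win;
>   MPI_Win_allocate_shared((MPI_Aint)NELEM * sizeof(int), sizeof(int),
>                           info, shmcomm, &baseptr, &win);
>   MPI_Info_free(&info);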
> 
> On this machine /proc/sys/kernel/numa_balancing is not available, so I assume 
> that this is not the cause in this case. It's good to know for the future 
> that this might become an issue on other systems.
> 
> Cheers
> Joseph
> 
> On 05/23/2018 02:26 PM, Nathan Hjelm wrote:
>> Odd. I wonder if it is something affected by your session directory. It 
>> might be worth moving the segment to /dev/shm. I don’t expect it will have 
>> an impact but you could try the following patch:
>> diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
>> index f7211cd93c..bfc26b39f2 100644
>> --- a/ompi/mca/osc/sm/osc_sm_component.c
>> +++ b/ompi/mca/osc/sm/osc_sm_component.c
>> @@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
>>          posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
>>          if (0 == ompi_comm_rank (module->comm)) {
>>              char *data_file;
>> -            if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
>> -                         ompi_process_info.proc_session_dir,
>> +            if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
>> +                         ompi_process_info.my_name.jobid,
>>                           ompi_comm_get_cid(module->comm),
>>                           ompi_process_info.nodename) < 0) {
>>                  return OMPI_ERR_OUT_OF_RESOURCE;
>>> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>> 
>>> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 
>>> 7.1.0 on the Bull Cluster. I only ran on a single node and have not tested 
>>> what happens when more than one node is involved.
>>> 
>>> Joseph
>>> 
>>> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>>>> What Open MPI version are you using? Does this happen when you run on a 
>>>> single node or multiple nodes?
>>>> -Nathan
>>>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>> 
>>>>> All,
>>>>> 
>>>>> We are observing some strange/interesting performance issues in accessing 
>>>>> memory that has been allocated through MPI_Win_allocate. I am attaching 
>>>>> our test case, which allocates memory for 100M integer values on each 
>>>>> process both through malloc and MPI_Win_allocate and writes to the local 
>>>>> ranges sequentially.
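>>>>> 
>>>>> In essence, the test does the following (a simplified sketch, not the 
>>>>> attached mpiwin_vs_malloc.c; sizes and names are illustrative):
>>>>> 
>>>>>   #include <mpi.h>
>>>>>   #include <stdio.h>
>>>>>   #include <stdlib.h>
>>>>> 
>>>>>   #define NELEM 100000000UL   /* 100M ints per process */
>>>>> 
>>>>>   int main(int argc, char **argv) {
>>>>>       MPI_Init(&argc, &argv);
>>>>>       int rank;
>>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>> 
>>>>>       /* local allocation through malloc */
>>>>>       int *heap = malloc(NELEM * sizeof(int));
>>>>> 
>>>>>       /* local allocation through MPI_Win_allocate */
>>>>>       int *wbase;
>>>>>       MPI_Win win;
>>>>>       MPI_Win_allocate(NELEM * sizeof(int), sizeof(int),
>>>>>                        MPI_INFO_NULL, MPI_COMM_WORLD, &wbase, &win);
>>>>> 
>>>>>       /* write the local ranges sequentially and time both variants */
>>>>>       double t = MPI_Wtime();
>>>>>       for (size_t i = 0; i < NELEM; i++) heap[i] = (int)i;
>>>>>       double t_malloc = MPI_Wtime() - t;
>>>>> 
>>>>>       t = MPI_Wtime();
>>>>>       for (size_t i = 0; i < NELEM; i++) wbase[i] = (int)i;
>>>>>       double t_win = MPI_Wtime() - t;
>>>>> 
>>>>>       printf("rank %d: malloc %.3fs, MPI_Win_allocate %.3fs\n",
>>>>>              rank, t_malloc, t_win);
>>>>> 
>>>>>       MPI_Win_free(&win);
>>>>>       free(heap);
>>>>>       MPI_Finalize();
>>>>>       return 0;
>>>>>   }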
>>>>> 
>>>>> On different systems (incl. SuperMUC and a Bull Cluster), we see that 
>>>>> accessing the memory allocated through MPI is significantly slower than 
>>>>> accessing the malloc'ed memory if multiple processes run on a single node, 
>>>>> with the effect growing as the number of processes per node increases. 
>>>>> As an example, running 24 processes per node with the attached example, 
>>>>> the operations on the malloc'ed memory take ~0.4s while those on the 
>>>>> MPI-allocated memory take up to 10s.
>>>>> 
>>>>> After some experiments, I think there are two factors involved:
>>>>> 
>>>>> 1) Initialization: it appears that the first iteration is significantly 
>>>>> slower than any subsequent accesses (1.1s vs 0.4s with 12 processes on a 
>>>>> single socket). Excluding the first iteration from the timing or memsetting 
>>>>> the range beforehand leads to comparable performance. I assume this is due 
>>>>> to page faults that stem from first accessing the mmap'ed memory that backs 
>>>>> the shared memory used in the window. The effect of presetting the malloc'ed 
>>>>> memory seems smaller (0.4s vs 0.6s).
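>>>>> 
>>>>> By memsetting the range I mean roughly the following (reusing the names 
>>>>> from the sketch above; needs <string.h>):
>>>>> 
>>>>>   /* fault in all pages of both allocations before the timed loops */
>>>>>   memset(heap,  0, NELEM * sizeof(int));
>>>>>   memset(wbase, 0, NELEM * sizeof(int));
>>>>>   MPI_Barrier(MPI_COMM_WORLD);   /* everyone has touched its pages */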
>>>>> 
>>>>> 2) NUMA effects: Given proper initialization, running on two sockets 
>>>>> still leads to fluctuating performance degradation for the MPI window 
>>>>> memory, up to 20x in extreme cases. The performance of accessing the 
>>>>> malloc'ed memory is rather stable. The difference seems to get smaller 
>>>>> (but does not disappear) with an increasing number of repetitions. I am 
>>>>> not sure what causes these effects, as each process should first-touch 
>>>>> its local memory.
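>>>>> 
>>>>> One way to verify where the pages actually end up would be something like 
>>>>> this (a diagnostic sketch I have not run here; wbase and rank are as in 
>>>>> the sketch above, link with -lnuma):
>>>>> 
>>>>>   #include <numaif.h>    /* move_pages() */
>>>>>   #include <unistd.h>
>>>>>   #include <stdint.h>
>>>>> 
>>>>>   /* ask the kernel which NUMA node the first window page resides on */
>>>>>   long pagesz = sysconf(_SC_PAGESIZE);
>>>>>   void *page  = (void *)((uintptr_t)wbase & ~(uintptr_t)(pagesz - 1));
>>>>>   int  node   = -1;
>>>>>   if (move_pages(0 /* this process */, 1, &page, NULL, &node, 0) == 0)
>>>>>       printf("rank %d: first window page on NUMA node %d\n", rank, node);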
>>>>> 
>>>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>>>> 
>>>>> It is problematic for us that replacing local memory allocation with MPI 
>>>>> memory allocation leads to performance degradation as we rely on this 
>>>>> mechanism in our distributed data structures. While we can ensure proper 
>>>>> initialization of the memory to mitigate 1) for performance measurements, 
>>>>> I don't see a way to control the NUMA effects. If there is one, I'd be 
>>>>> grateful for any hints :)
>>>>> 
>>>>> I should note that we also tested MPICH-based implementations, which 
>>>>> showed similar effects (as they also mmap their window memory). Not 
>>>>> surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic 
>>>>> window (see the sketch below) does not cause these effects, while using 
>>>>> shared memory windows does. I ran my experiments using Open MPI 3.1.0 with the following 
>>>>> command lines:
>>>>> 
>>>>> - 12 cores / 1 socket:
>>>>> mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>>>> - 24 cores / 2 sockets:
>>>>> mpirun -n 24 --bind-to socket
>>>>> 
>>>>> and verified the binding using --report-bindings.
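>>>>> 
>>>>> For reference, the dynamic-window variant mentioned above looks roughly 
>>>>> like this (a sketch; NELEM as in the earlier snippet, error checking omitted):
>>>>> 
>>>>>   int *buf;
>>>>>   MPI_Win dwin;
>>>>>   MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dwin);
>>>>>   MPI_Alloc_mem((MPI_Aint)(NELEM * sizeof(int)), MPI_INFO_NULL, &buf);
>>>>>   MPI_Win_attach(dwin, buf, (MPI_Aint)(NELEM * sizeof(int)));
>>>>> 
>>>>>   /* ... sequential writes to buf behave like the malloc'ed memory ... */
>>>>> 
>>>>>   MPI_Win_detach(dwin, buf);
>>>>>   MPI_Free_mem(buf);
>>>>>   MPI_Win_free(&dwin);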
>>>>> 
>>>>> Any help or comment would be much appreciated.
>>>>> 
>>>>> Cheers
>>>>> Joseph
>>>>> 
>>>>> --
>>>>> Dipl.-Inf. Joseph Schuchart
>>>>> High Performance Computing Center Stuttgart (HLRS)
>>>>> Nobelstr. 19
>>>>> D-70569 Stuttgart
>>>>> 
>>>>> Tel.: +49(0)711-68565890
>>>>> Fax: +49(0)711-6856832
>>>>> E-Mail: schuch...@hlrs.de
>>>>> <mpiwin_vs_malloc.c>