PR is up: https://github.com/open-mpi/ompi/pull/5193

-Nathan
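For context, here is a minimal, self-contained sketch of the two backing
strategies discussed in this thread: a shared mapping backed by a regular file
in a session/tmp directory (which may live on a disk partition) versus a POSIX
shared-memory segment, which on Linux lives on tmpfs under /dev/shm. This is
not the code from the PR or from Open MPI; the paths, names, and sizes are
illustrative only. On older glibc, link with -lrt for shm_open.

/* Sketch: file-backed shared mapping vs. POSIX shm (tmpfs) backing.
 * Illustrative only; not the Open MPI osc/sm implementation. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Shared mapping backed by a regular file (e.g. under /tmp on an SSD). */
static void *map_file_backed(const char *path, size_t size) {
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("file backing"); exit(1); }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);
    unlink(path);   /* unlinked right away, as the osc/sm code also does */
    return p;
}

/* POSIX shared-memory segment; glibc places it under /dev/shm (tmpfs). */
static void *map_posix_shm(const char *name, size_t size) {
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror("shm backing"); exit(1); }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);
    shm_unlink(name);
    return p;
}

int main(void) {
    size_t size = 100UL * 1000 * 1000 * sizeof(int);   /* 100M ints, as in the report */
    int *a = map_file_backed("/tmp/shared_window_demo", size);
    int *b = map_posix_shm("/shared_window_demo", size);
    /* First-touch both mappings. Dirty pages of the file-backed mapping may be
     * written back to the underlying filesystem; the tmpfs pages stay in memory. */
    for (size_t i = 0; i < size / sizeof(int); ++i) { a[i] = (int)i; b[i] = (int)i; }
    munmap(a, size);
    munmap(b, size);
    return 0;
}

Nathan's patch and the PR move the osc/sm backing file from the session
directory to /dev/shm, i.e. from the first kind of location to the second,
which takes disk I/O out of the picture for the window pages.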
> On May 24, 2018, at 7:09 AM, Nathan Hjelm <hje...@me.com> wrote:
> 
> Ok, thanks for testing that. I will open a PR for master changing the
> default backing location to /dev/shm on Linux. It will be PR'd to v3.0.x
> and v3.1.x.
> 
> -Nathan
> 
>> On May 24, 2018, at 6:46 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>> 
>> Thank you all for your input!
>> 
>> Nathan: thanks for that hint, this seems to be the culprit: with your
>> patch, I do not observe a difference in performance between the two
>> memory allocations. I remembered that Open MPI allows changing the shmem
>> allocator on the command line. Using vanilla Open MPI 3.1.0 and raising
>> the priority of the POSIX shmem implementation with
>> `--mca shmem_posix_priority 100` leads to good performance, too. The
>> reason could be that on the Bull machine /tmp is mounted on a disk
>> partition (SSD, iirc). Maybe there is actual I/O involved that hurts
>> performance if the shm backing file is located on a disk (even though
>> the file is unlinked before the memory is accessed)?
>> 
>> Regarding the other hints: I tried using MPI_Win_allocate_shared with
>> the noncontig hint. Using POSIX shmem, I do not observe a difference in
>> performance compared to the other two options. With the disk-backed
>> shmem file, performance fluctuations are similar to MPI_Win_allocate.
>> 
>> On this machine /proc/sys/kernel/numa_balancing is not available, so I
>> assume that it is not the cause in this case. It's good to know for the
>> future that this might become an issue on other systems.
>> 
>> Cheers
>> Joseph
>> 
>> On 05/23/2018 02:26 PM, Nathan Hjelm wrote:
>>> Odd. I wonder if it is something affected by your session directory. It
>>> might be worth moving the segment to /dev/shm. I don't expect it will
>>> have an impact, but you could try the following patch:
>>> 
>>> diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
>>> index f7211cd93c..bfc26b39f2 100644
>>> --- a/ompi/mca/osc/sm/osc_sm_component.c
>>> +++ b/ompi/mca/osc/sm/osc_sm_component.c
>>> @@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
>>>          posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
>>>          if (0 == ompi_comm_rank (module->comm)) {
>>>              char *data_file;
>>> -            if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
>>> -                         ompi_process_info.proc_session_dir,
>>> +            if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
>>> +                         ompi_process_info.my_name.jobid,
>>>                           ompi_comm_get_cid(module->comm),
>>>                           ompi_process_info.nodename) < 0) {
>>>                  return OMPI_ERR_OUT_OF_RESOURCE;
>>> 
>>>> On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>> 
>>>> I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC
>>>> 7.1.0 on the Bull cluster. I only ran on a single node but haven't
>>>> tested what happens if more than one node is involved.
>>>> 
>>>> Joseph
>>>> 
>>>> On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
>>>>> What Open MPI version are you using? Does this happen when you run on
>>>>> a single node or on multiple nodes?
>>>>> 
>>>>> -Nathan
>>>>> 
>>>>>> On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>>>> 
>>>>>> All,
>>>>>> 
>>>>>> We are observing some strange/interesting performance issues when
>>>>>> accessing memory that has been allocated through MPI_Win_allocate.
>>>>>> I am attaching our test case, which allocates memory for 100M integer
>>>>>> values on each process, both through malloc and MPI_Win_allocate, and
>>>>>> writes to the local ranges sequentially.
>>>>>> 
>>>>>> On different systems (incl. SuperMUC and a Bull cluster), we see that
>>>>>> accessing the memory allocated through MPI is significantly slower
>>>>>> than accessing the malloc'ed memory if multiple processes run on a
>>>>>> single node, with the effect growing as the number of processes per
>>>>>> node increases. As an example, running 24 processes per node with the
>>>>>> attached example, we see the operations on the malloc'ed memory take
>>>>>> ~0.4s while the MPI-allocated memory takes up to 10s.
>>>>>> 
>>>>>> After some experiments, I think there are two factors involved:
>>>>>> 
>>>>>> 1) Initialization: it appears that the first iteration is
>>>>>> significantly slower than any subsequent accesses (1.1s vs 0.4s with
>>>>>> 12 processes on a single socket). Excluding the first iteration from
>>>>>> the timing or memsetting the range beforehand leads to comparable
>>>>>> performance. I assume this is due to page faults that stem from first
>>>>>> accessing the mmap'ed memory that backs the shared memory used in the
>>>>>> window. The effect of presetting the malloc'ed memory seems smaller
>>>>>> (0.4s vs 0.6s).
>>>>>> 
>>>>>> 2) NUMA effects: given proper initialization, running on two sockets
>>>>>> still leads to fluctuating performance degradation for the MPI window
>>>>>> memory, ranging up to 20x in extreme cases. The performance of
>>>>>> accessing the malloc'ed memory is rather stable. The difference seems
>>>>>> to get smaller (but does not disappear) with an increasing number of
>>>>>> repetitions. I am not sure what causes these effects, as each process
>>>>>> should first-touch its local memory.
>>>>>> 
>>>>>> Are these known issues? Does anyone have any thoughts on my analysis?
>>>>>> 
>>>>>> It is problematic for us that replacing local memory allocation with
>>>>>> MPI memory allocation leads to performance degradation, as we rely on
>>>>>> this mechanism in our distributed data structures. While we can ensure
>>>>>> proper initialization of the memory to mitigate 1) for performance
>>>>>> measurements, I don't see a way to control the NUMA effects. If there
>>>>>> is one, I'd be happy about any hints :)
>>>>>> 
>>>>>> I should note that we also tested MPICH-based implementations, which
>>>>>> showed similar effects (as they also mmap their window memory). Not
>>>>>> surprisingly, using MPI_Alloc_mem and attaching that memory to a
>>>>>> dynamic window does not cause these effects, while using shared memory
>>>>>> windows does. I ran my experiments using Open MPI 3.1.0 with the
>>>>>> following command lines:
>>>>>> 
>>>>>> - 12 cores / 1 socket:
>>>>>>   mpirun -n 12 --bind-to socket --map-by ppr:12:socket
>>>>>> - 24 cores / 2 sockets:
>>>>>>   mpirun -n 24 --bind-to socket
>>>>>> 
>>>>>> and verified the binding using --report-bindings.
>>>>>> 
>>>>>> Any help or comment would be much appreciated.
>>>>>> 
>>>>>> Cheers
>>>>>> Joseph
>>>>>> 
>>>>>> --
>>>>>> Dipl.-Inf. Joseph Schuchart
>>>>>> High Performance Computing Center Stuttgart (HLRS)
>>>>>> Nobelstr. 19
>>>>>> D-70569 Stuttgart
>>>>>> 
>>>>>> Tel.: +49(0)711-68565890
>>>>>> Fax: +49(0)711-6856832
>>>>>> E-Mail: schuch...@hlrs.de
>>>>>> 
>>>>>> <mpiwin_vs_malloc.c>
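The attached test case (<mpiwin_vs_malloc.c>) is not reproduced in the archive
text, so the following is only a rough sketch reconstructed from the
description above: 100M ints per process, allocated once with malloc and once
with MPI_Win_allocate, written sequentially, and timed per repetition. The
repetition count, output format, and program name are assumptions, not taken
from the original program.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N    (100 * 1000 * 1000)  /* 100M ints per process, as in the report */
#define REPS 10                   /* repetition count is an assumption */

/* Sequentially write the local range and return the elapsed time. */
static double touch(int *buf) {
    double t = MPI_Wtime();
    for (long i = 0; i < N; ++i) buf[i] = (int)i;
    return MPI_Wtime() - t;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Plain local allocation. */
    int *local = malloc(N * sizeof(int));
    if (local == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    /* Window allocation: backed by shared memory when ranks share a node. */
    int *winbuf;
    MPI_Win win;
    MPI_Win_allocate(N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &winbuf, &win);

    for (int r = 0; r < REPS; ++r) {
        double t_malloc = touch(local);
        double t_win    = touch(winbuf);
        if (rank == 0)
            printf("rep %d: malloc %.3fs  win_allocate %.3fs\n",
                   r, t_malloc, t_win);
    }

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}

Run it as in the thread (e.g. mpirun -n 24 --bind-to socket ./a.out) and
compare a second run with --mca shmem_posix_priority 100, or with a build
containing the patch above, to check whether the backing location accounts for
the gap.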
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users