There's no reason to do anything special for shared memory with a
single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF) ~=
MPI_Alloc_mem().  However, it would help debugging if MPI implementers at
least had an option to take the code path that allocates shared memory even
when np=1.
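
For illustration only (this sketch is not from the original mails): with a
single-process communicator, memory obtained from MPI_Win_allocate_shared can
be used just like an MPI_Alloc_mem buffer, which is the near-equivalence
referred to above.

```
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1 << 20;  /* 1 MiB, purely illustrative */

    /* Path 1: plain allocation, no window involved. */
    void *buf;
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);

    /* Path 2: a shared-memory window on a single-process communicator.
     * With only one rank in the communicator this behaves essentially
     * like the MPI_Alloc_mem call above, although an implementation may
     * or may not take its shared-memory code path internally. */
    void *shm_buf;
    MPI_Win win;
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &shm_buf, &win);

    /* Both pointers are usable as ordinary local memory. */
    ((char *)buf)[0]     = 1;
    ((char *)shm_buf)[0] = 1;

    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}
```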

Jeff

On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

> Gilles,
>
> Thanks for your swift response. On this system, /dev/shm only has 256M
> available, so that is unfortunately not an option. I tried disabling both
> the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still seems
> to allocate the shmem backing file under /tmp. From my point of view,
> missing the performance benefits of file-backed shared memory would be
> acceptable as long as large allocations work, but I don't know the
> implementation details and whether that is possible. It seems that the
> mmap does not happen if there is only one process per node.
>
> Cheers,
> Joseph
>
>
> On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>
>> Joseph,
>>
>> The error message suggests that allocating memory with
>> MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing it.
>> How much space do you have in /dev/shm? (This is a tmpfs, i.e. a
>> RAM-backed file system.) There is likely quite a bit of space there, so as
>> a workaround, I suggest you use it as the shared-memory backing directory.
>>
>> /* I am afk and do not remember the syntax; `ompi_info --all | grep
>> backing` is likely to help. */
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de>
>> wrote:
>>
>>> All,
>>>
>>> I have been experimenting with large window allocations recently and have
>>> made some interesting observations that I would like to share.
>>>
>>> The system under test:
>>>    - Linux cluster equipped with IB
>>>    - Open MPI 2.1.1
>>>    - 128 GB main memory per node
>>>    - 6 GB /tmp filesystem per node
>>>
>>> My observations:
>>> 1) Running with 1 process on a single node, I can allocate and write to
>>> memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
>>> MPI_Win_allocate_shared.
>>>
>>> 2) If running with 1 process per node on 2 nodes, single large allocations
>>> succeed, but with the repeated allocate/free cycle in the attached code
>>> (sketched further below) the application is reproducibly killed by the
>>> OOM killer at a 25 GB allocation with MPI_Win_allocate_shared. When I try
>>> to run it under Valgrind, I get an error from MPI_Win_allocate at ~50 GB
>>> that I cannot make sense of:
>>>
>>> ```
>>> MPI_Alloc_mem:  53687091200 B
>>> [n131302:11989] *** An error occurred in MPI_Alloc_mem
>>> [n131302:11989] *** reported by process [1567293441,1]
>>> [n131302:11989] *** on communicator MPI_COMM_WORLD
>>> [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>> [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>> will now abort,
>>> [n131302:11989] ***    and potentially your MPI job)
>>> ```
>>>
>>> 3) If running with 2 processes on a node, I get the following error from
>>> both MPI_Win_allocate and MPI_Win_allocate_shared:
>>> ```
>>> --------------------------------------------------------------------------
>>> It appears as if there is not enough space for
>>> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>> (the shared-memory backing file). It is likely that your MPI job will now
>>> either abort or experience performance degradation.
>>>
>>>    Local host:  n131702
>>>    Space Requested: 6710890760 B
>>>    Space Available: 6433673216 B
>>> ```
>>> This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as
>>> expected, i.e., I can allocate ~50 GB per process. I understand that I can
>>> set $TMP to a bigger filesystem (such as Lustre), but then I am greeted
>>> with a warning on each allocation and performance seems to drop. Is there
>>> a way to fall back to the allocation strategy used in case 2)?
>>>
>>> 4) It is also worth noting the time it takes to allocate the memory: while
>>> the allocations are in the sub-millisecond range for both MPI_Alloc_mem
>>> and MPI_Win_allocate_shared, it takes >24 s to allocate 100 GB using
>>> MPI_Win_allocate, and the time increases linearly with the allocation
>>> size (a sketch of such a timing loop appears below).
>>>
>>> Are these issues known? Is there perhaps documentation describing
>>> workarounds, especially for 3) and 4)?
>>>
>>> I am attaching a small benchmark. Please make sure to adjust the
>>> MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
>>> provide additional details if needed.
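
The attachment itself is not reproduced in the list archive. As a rough,
hypothetical sketch of the kind of allocate/free and timing loop described in
observations 2) and 4) above: the MEM_PER_NODE macro name is taken from the
mail, while the step sizes, the node-local communicator, and the output
format are assumptions, not the actual benchmark.

```
#include <mpi.h>
#include <stdio.h>

/* Adjust to the memory available per node (name taken from the mail above;
 * the value here is only a placeholder). */
#define MEM_PER_NODE (100L * 1024 * 1024 * 1024) /* 100 GB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Node-local communicator for the shared-memory window. */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    /* Allocate/free cycle with growing window sizes, timing each
     * allocation (cf. observations 2 and 4). */
    for (MPI_Aint bytes = 1L << 30; bytes <= MEM_PER_NODE; bytes <<= 1) {
        void *base;
        MPI_Win win;
        double t0, t1;

        /* MPI_Win_allocate on the world communicator. */
        t0 = MPI_Wtime();
        MPI_Win_allocate(bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                         &base, &win);
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("MPI_Win_allocate        %12ld B: %8.3f s\n",
                   (long)bytes, t1 - t0);
        MPI_Win_free(&win);

        /* MPI_Win_allocate_shared on the node-local communicator. */
        t0 = MPI_Wtime();
        MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm,
                                &base, &win);
        t1 = MPI_Wtime();

        /* Touch the memory so the pages are actually instantiated. */
        for (MPI_Aint i = 0; i < bytes; i += 4096)
            ((char *)base)[i] = 1;

        if (rank == 0)
            printf("MPI_Win_allocate_shared %12ld B: %8.3f s\n",
                   (long)bytes, t1 - t0);
        MPI_Win_free(&win);
    }

    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
```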
>>>
>>> Best
>>> Joseph
>>> --
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>>
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
>>>
>>
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/