Joseph,

Please open a GitHub issue regarding the SIGBUS error.

As far as I understand, MAP_ANONYMOUS+MAP_SHARED mappings can only be shared
between related processes (e.g. a parent and the children it forks after
creating the mapping). In the case of Open MPI, the MPI tasks on a node are
siblings of each other, so this is not an option.
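
For illustration, a minimal standalone sketch (not Open MPI code) of why this
only works across fork(): the anonymous shared mapping has no name or file
descriptor that an unrelated, already-running sibling process could attach to,
so it is only visible to descendants forked after the mmap() call.

```
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Anonymous shared mapping: shared only with children forked after
     * this call -- there is nothing a sibling process could attach to. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {      /* the child inherits the mapping */
        *shared = 42;
        return 0;
    }
    wait(NULL);
    printf("parent sees %d\n", *shared);   /* prints 42 */
    munmap(shared, sizeof(int));
    return 0;
}
```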

Cheers,

Gilles


On Mon, Sep 4, 2017 at 10:13 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> Jeff, all,
>
> Unfortunately, I (as a user) have no control over the page size on our
> cluster. My interest in this is more of a general nature because I am
> concerned that users who run Open MPI underneath our code may run into this
> issue on their machines.
>
> I took a look at the code for the various window creation methods and now
> have a better picture of the allocation process in Open MPI. I realized that
> memory in windows allocated through MPI_Win_allocate or created through
> MPI_Win_create is registered with the IB device using ibv_reg_mr, which
> takes significant time for large allocations (I assume this is where
> hugepages would help?). In contrast, memory attached through MPI_Win_attach
> does not seem to be registered, which explains the lower allocation latency
> I am observing for these windows (though I seem to remember observing higher
> communication latencies for them as well).
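>
> For illustration, a standalone sketch (not the Open MPI code path) of the
> registration step in question; it assumes the first verbs device is the one
> of interest and strips error handling to a minimum. Link with -libverbs.
>
> ```
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> #include <infiniband/verbs.h>
>
> int main(void)
> {
>     size_t len = 1UL << 30;                /* 1 GiB test buffer */
>     void *buf = malloc(len);
>
>     struct ibv_device **devs = ibv_get_device_list(NULL);
>     if (!buf || !devs || !devs[0]) { fprintf(stderr, "setup failed\n"); return 1; }
>     struct ibv_context *ctx = ibv_open_device(devs[0]);
>     struct ibv_pd *pd = ibv_alloc_pd(ctx);
>
>     struct timespec t0, t1;
>     clock_gettime(CLOCK_MONOTONIC, &t0);
>     /* Pins the pages and registers them with the HCA -- the step whose
>      * cost grows with the allocation size. */
>     struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
>                                    IBV_ACCESS_LOCAL_WRITE |
>                                    IBV_ACCESS_REMOTE_READ |
>                                    IBV_ACCESS_REMOTE_WRITE);
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>     printf("ibv_reg_mr(%zu bytes): %.3f s\n", len,
>            (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
>
>     ibv_dereg_mr(mr);
>     ibv_dealloc_pd(pd);
>     ibv_close_device(ctx);
>     ibv_free_device_list(devs);
>     free(buf);
>     return 0;
> }
> ```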
>
> Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix
> component that uses shm_open to create a POSIX shared memory object
> instead of a file on disk, which is then mmap'ed. Unfortunately, if I raise
> the priority of this component above that of the default mmap component I
> end up with a SIGBUS during MPI_Init. No other errors are reported by MPI.
> Should I open a ticket on GitHub for this?
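>
> For reference, a minimal sketch of the shm_open + mmap pattern (the object
> name and size below are made up for the demo; the object lives in tmpfs
> under /dev/shm rather than on the /tmp filesystem):
>
> ```
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <unistd.h>
>
> int main(void)
> {
>     const char *name = "/my_shm_demo";   /* hypothetical object name */
>     size_t len = 1UL << 20;              /* 1 MiB for the demo */
>
>     /* Creates an object under /dev/shm (tmpfs), not under /tmp.
>      * (Older glibc may require linking with -lrt.) */
>     int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
>     if (fd < 0 || ftruncate(fd, len) != 0) { perror("shm_open/ftruncate"); return 1; }
>
>     char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>     if (p == MAP_FAILED) { perror("mmap"); return 1; }
>
>     strcpy(p, "hello");   /* other processes that shm_open() the same
>                              name and mmap it see this */
>     printf("%s\n", p);
>
>     munmap(p, len);
>     close(fd);
>     shm_unlink(name);     /* remove the name when done */
>     return 0;
> }
> ```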
>
> As an alternative, would it be possible to use anonymous shared memory
> mappings to avoid the backing file for large allocations (maybe above a
> certain threshold) on systems that support MAP_ANONYMOUS and distribute the
> result of the mmap call among the processes on the node?
>
> Thanks,
> Joseph
>
> On 08/29/2017 06:12 PM, Jeff Hammond wrote:
>>
>> I don't know any reason why you shouldn't be able to use IB for intra-node
>> transfers.  There are, of course, arguments against doing it in general
>> (e.g. IB/PCI bandwidth is lower than DDR4 bandwidth), but it likely behaves
>> less synchronously than shared memory, since I'm not aware of any MPI RMA
>> library that dispatches intra-node RMA operations to an asynchronous agent
>> (e.g. a communication helper thread).
>>
>> Regarding 4, faulting 100GB in 24s corresponds to about 1us per 4K page,
>> which doesn't sound unreasonable to me.  You might investigate if/how you
>> can use 2M or 1G pages instead.  It's possible Open MPI already supports
>> this, if the underlying system does.  You may need to twiddle your OS
>> settings to get hugetlbfs working.
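>>
>> (100GB / 4KB is on the order of 25 million pages, so 24s works out to
>> roughly 1us per fault.) A rough standalone sketch for measuring the
>> first-touch cost (not Open MPI code; the huge-page variant mentioned in the
>> comment assumes huge pages have been reserved on the system):
>>
>> ```
>> #include <stdio.h>
>> #include <string.h>
>> #include <sys/mman.h>
>> #include <time.h>
>>
>> int main(void)
>> {
>>     size_t len = 4UL << 30;   /* 4 GiB -- scale as the machine allows */
>>
>>     /* Add MAP_HUGETLB to compare against huge pages (requires hugetlbfs
>>      * pages to be reserved, e.g. via /proc/sys/vm/nr_hugepages). */
>>     char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>     if (p == MAP_FAILED) { perror("mmap"); return 1; }
>>
>>     struct timespec t0, t1;
>>     clock_gettime(CLOCK_MONOTONIC, &t0);
>>     memset(p, 0, len);        /* first touch faults every page */
>>     clock_gettime(CLOCK_MONOTONIC, &t1);
>>
>>     double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
>>     printf("faulted %zu bytes in %.2f s (%.2f us per 4K page)\n",
>>            len, s, s / (len / 4096.0) * 1e6);
>>
>>     munmap(p, len);
>>     return 0;
>> }
>> ```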
>>
>> Jeff
>>
>> On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>
>>     Jeff, all,
>>
>>     Thanks for the clarification. My measurements show that global
>>     memory allocations do not require the backing file if there is only
>>     one process per node, for an arbitrary number of processes. So I was
>>     wondering if it was possible to use the same allocation process even
>>     with multiple processes per node if there is not enough space
>>     available in /tmp. However, I am not sure whether the IB devices can
>>     be used to perform intra-node RMA. At least that would retain the
>>     functionality on this kind of system (which arguably might be a rare
>>     case).
>>
>>     On a different note, I found over the weekend that Valgrind only
>>     supports allocations up to 60GB, so my second point reported below
>>     may be invalid. Number 4 still seems curious to me, though.
>>
>>     Best
>>     Joseph
>>
>>     On 08/25/2017 09:17 PM, Jeff Hammond wrote:
>>
>>         There's no reason to do anything special for shared memory with
>>         a single-process job because
>>         MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem().
>>         However, it would help debugging if MPI implementers at least
>>         had an option to take the code path that allocates shared memory
>>         even when np=1.
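>>
>>         A minimal sketch of that equivalence, assuming a working MPI
>>         installation and a single-process run (np=1):
>>
>>         ```
>>         #include <mpi.h>
>>         #include <stdio.h>
>>
>>         int main(int argc, char **argv)
>>         {
>>             MPI_Init(&argc, &argv);
>>
>>             /* Shared-memory window on a single-process communicator:
>>              * with np=1 this behaves much like a plain allocation. */
>>             double *buf;
>>             MPI_Win win;
>>             MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
>>                                     MPI_INFO_NULL, MPI_COMM_SELF, &buf, &win);
>>             buf[0] = 3.14;
>>             printf("buf[0] = %f\n", buf[0]);
>>             MPI_Win_free(&win);
>>
>>             /* The plain allocation it is being compared to. */
>>             double *mem;
>>             MPI_Alloc_mem(1024 * sizeof(double), MPI_INFO_NULL, &mem);
>>             MPI_Free_mem(mem);
>>
>>             MPI_Finalize();
>>             return 0;
>>         }
>>         ```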
>>
>>         Jeff
>>
>>         On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart
>>         <schuch...@hlrs.de> wrote:
>>
>>              Gilles,
>>
>>              Thanks for your swift response. On this system, /dev/shm only
>>              has 256M available, so that is unfortunately not an option. I
>>              tried disabling both the vader and sm btl via `--mca btl
>>              ^vader,sm`, but Open MPI still seems to allocate the shmem
>>              backing file under /tmp. From my point of view, missing the
>>              performance benefits of file-backed shared memory would be
>>              acceptable as long as large allocations work, but I don't know
>>              the implementation details and whether that is possible. It
>>              seems that the mmap does not happen if there is only one
>>              process per node.
>>
>>              Cheers,
>>              Joseph
>>
>>
>>              On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>>
>>                  Joseph,
>>
>>                  The error message suggests that allocating memory with
>>                  MPI_Win_allocate[_shared] is done by creating a file and
>>                  then mmap'ing it.
>>                  How much space do you have in /dev/shm? (This is a tmpfs,
>>                  i.e. a RAM-backed file system.) There is likely quite some
>>                  space there, so as a workaround I suggest you use it as
>>                  the shared-memory backing directory.
>>
>>                  /* I am afk and do not remember the syntax; `ompi_info
>>                  --all | grep backing` is likely to help. */
>>
>>                  Cheers,
>>
>>                  Gilles
>>
>>                  On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart
>>                  <schuch...@hlrs.de> wrote:
>>
>>                      All,
>>
>>                      I have been experimenting with large window allocations
>>                      recently and have made some interesting observations
>>                      that I would like to share.
>>
>>                      The system under test:
>>                          - Linux cluster equipped with IB
>>                          - Open MPI 2.1.1
>>                          - 128GB main memory per node
>>                          - 6GB /tmp filesystem per node
>>
>>                      My observations:
>>                      1) Running with 1 process on a single node, I can
>>                      allocate and write to memory up to ~110 GB through
>>                      MPI_Alloc_mem, MPI_Win_allocate, and
>>                      MPI_Win_allocate_shared.
>>
>>                      2) If running with 1 process per node on 2 nodes,
>>                      single large allocations succeed, but with the
>>                      repeating allocate/free cycle in the attached code I
>>                      see the application reproducibly being killed by the
>>                      OOM killer at the 25GB allocation with
>>                      MPI_Win_allocate_shared. When I try to run it under
>>                      Valgrind I get an error from MPI_Win_allocate at ~50GB
>>                      that I cannot make sense of:
>>
>>                      ```
>>                      MPI_Alloc_mem:  53687091200 B
>>                      [n131302:11989] *** An error occurred in MPI_Alloc_mem
>>                      [n131302:11989] *** reported by process [1567293441,1]
>>                      [n131302:11989] *** on communicator MPI_COMM_WORLD
>>                      [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>                      [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>                      [n131302:11989] ***    and potentially your MPI job)
>>                      ```
>>
>>                      3) If running with 2 processes on a node, I get the
>>                      following error from both MPI_Win_allocate and
>>                      MPI_Win_allocate_shared:
>>                      ```
>>                      --------------------------------------------------------------------------
>>                      It appears as if there is not enough space for
>>                      /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>                      (the shared-memory backing file). It is likely that your MPI job will now
>>                      either abort or experience performance degradation.
>>
>>                          Local host:  n131702
>>                          Space Requested: 6710890760 B
>>                          Space Available: 6433673216 B
>>                      ```
>>                      This seems to be related to the size limit of /tmp.
>>                      MPI_Alloc_mem works as expected, i.e., I can allocate
>>                      ~50GB per process. I understand that I can set $TMP to
>>                      a bigger filesystem (such as Lustre), but then I am
>>                      greeted with a warning on each allocation and
>>                      performance seems to drop. Is there a way to fall back
>>                      to the allocation strategy used in case 2)?
>>
>>                      4) It is also worth noting the time it takes to
>>                      allocate the memory: while the allocations are in the
>>                      sub-millisecond range for both MPI_Alloc_mem and
>>                      MPI_Win_allocate_shared, it takes >24s to allocate
>>                      100GB using MPI_Win_allocate, and the time increases
>>                      linearly with the allocation size.
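>>
>>                      (Not the attached benchmark, just a rough sketch of
>>                      such an allocate/free timing cycle, with made-up
>>                      sizes:)
>>                      ```
>>                      #include <mpi.h>
>>                      #include <stdio.h>
>>                      #include <time.h>
>>
>>                      int main(int argc, char **argv)
>>                      {
>>                          MPI_Init(&argc, &argv);
>>                          int rank;
>>                          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>                          /* double the window size up to a made-up limit */
>>                          for (size_t size = 1UL << 30; size <= 64UL << 30; size *= 2) {
>>                              double *base;
>>                              MPI_Win win;
>>                              struct timespec t0, t1;
>>                              clock_gettime(CLOCK_MONOTONIC, &t0);
>>                              MPI_Win_allocate((MPI_Aint)size, 1, MPI_INFO_NULL,
>>                                               MPI_COMM_WORLD, &base, &win);
>>                              clock_gettime(CLOCK_MONOTONIC, &t1);
>>                              if (rank == 0)
>>                                  printf("MPI_Win_allocate(%zu B): %.2f s\n", size,
>>                                         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
>>                              MPI_Win_free(&win);
>>                          }
>>                          MPI_Finalize();
>>                          return 0;
>>                      }
>>                      ```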
>>
>>                      Are these issues known? Maybe there is documentation
>>                      describing work-arounds (esp. for 3) and 4))?
>>
>>                      I am attaching a small benchmark. Please make sure to
>>                      adjust the MEM_PER_NODE macro to suit your system
>>                      before you run it :) I'm happy to provide additional
>>                      details if needed.
>>
>>                      Best
>>                      Joseph
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> http://jeffhammond.github.io/
>
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
