I don't know any reason why you shouldn't be able to use IB for intra-node transfers. There are, of course, arguments against doing it in general (e.g., IB/PCI bandwidth is lower than DDR4 bandwidth), but it likely behaves less synchronously than shared memory, since I'm not aware of any MPI RMA library that dispatches intra-node RMA operations to an asynchronous agent (e.g., a communication helper thread).
Regarding 4, faulting 100GB in 24s corresponds to about 1us per 4K page, which doesn't sound unreasonable to me. You might investigate if/how you can use 2M or 1G pages instead. It's possible Open MPI already supports this if the underlying system does. You may need to twiddle your OS settings to get hugetlbfs working.
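For reference, here is a minimal C sketch (mine, not part of the original exchange) that checks whether 2M huge pages can be obtained via mmap(MAP_HUGETLB) on a node, independent of MPI; it assumes Linux and that a huge-page pool has already been configured (e.g. via vm.nr_hugepages):
```
/* Sketch (not from this thread): check whether 2 MB huge pages are available
 * via mmap(MAP_HUGETLB), independent of MPI. Assumes Linux and that a
 * huge-page pool has been configured (e.g. sysctl vm.nr_hugepages=1024). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL << 21;  /* 64 x 2 MB = 128 MB */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* ENOMEM usually means no pool configured */
        return 1;
    }
    memset(p, 0, len);  /* touch the mapping to actually fault the pages in */
    printf("huge-page mapping of %zu bytes succeeded\n", len);
    munmap(p, len);
    return 0;
}
```
If this fails with ENOMEM, the huge-page pool is likely empty or absent, so there would be nothing for MPI to use either.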
Jeff
On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Jeff, all,
Thanks for the clarification. My measurements show that global memory allocations do not require the backing file if there is only one process per node, for an arbitrary number of processes. So I was wondering whether it is possible to use the same allocation scheme even with multiple processes per node if there is not enough space available in /tmp. However, I am not sure whether the IB devices can be used to perform intra-node RMA. At least that would retain the functionality on this kind of system (which, arguably, might be a rare case).
On a different note, I found over the weekend that Valgrind only supports allocations of up to 60GB, so my second point reported below may be invalid. Number 4 still seems curious to me, though.
Best
Joseph
On 08/25/2017 09:17 PM, Jeff Hammond wrote:
There's no reason to do anything special for shared memory with a single-process job, because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). However, it would help debugging if MPI implementers at least had an option to take the code path that allocates shared memory even when np=1.
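To make that equivalence concrete, here is a minimal sketch (mine, not from the thread) of the two single-process allocation paths:
```
/* A sketch (not from the thread) of the single-process equivalence described
 * above: with only one process, a "shared" window over MPI_COMM_SELF has
 * nobody to share with, so it is functionally the same as MPI_Alloc_mem and
 * need not be backed by a file. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Aint size = 1 << 20;  /* 1 MB, for illustration only */
    void *buf_plain, *buf_shared;
    MPI_Win win;

    /* Plain allocation: no inter-process visibility required. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf_plain);

    /* "Shared" allocation over a single-process communicator. */
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &buf_shared, &win);

    printf("MPI_Alloc_mem: %p, MPI_Win_allocate_shared(SELF): %p\n",
           buf_plain, buf_shared);

    MPI_Win_free(&win);
    MPI_Free_mem(buf_plain);
    MPI_Finalize();
    return 0;
}
```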
Jeff
On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Gilles,
Thanks for your swift response. On this system, /dev/shm only has 256M available, so that is no option, unfortunately. I tried disabling both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still seems to allocate the shmem backing file under /tmp. From my point of view, missing out on the performance benefits of file-backed shared memory would be acceptable as long as large allocations work, but I don't know the implementation details and whether that is possible. It seems that the mmap does not happen if there is only one process per node.
Cheers,
Joseph
On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
Joseph,
the error message suggests that allocating memory with MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing it.
how much space do you have in /dev/shm? (this is a tmpfs, i.e. a RAM file system)
there is likely quite some space there, so as a workaround, i suggest you use this as the shared-memory backing directory
/* i am afk and do not remember the syntax, ompi_info --all | grep backing is likely to help */
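For concreteness, that lookup and workaround might look like the following; the parameter name below is a placeholder, since the exact MCA parameter depends on the Open MPI version and is not spelled out here:
```
# find the MCA parameter that controls the shared-memory backing directory
ompi_info --all | grep -i backing

# then point it at a tmpfs with enough free space, replacing
# <backing_dir_param> with the name reported above
mpirun -np 2 --mca <backing_dir_param> /dev/shm ./your_app
```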
Cheers,
Gilles
On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,
I have been experimenting with large window allocations recently and have made some interesting observations that I would like to share.
The system under test:
- Linux cluster equipped with IB
- Open MPI 2.1.1
- 128GB main memory per node
- 6GB /tmp filesystem per node
My observations:
1) Running with 1 process on a single node, I can allocate and write to memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared.
2) If running with 1 process per node on 2 nodes, single large allocations succeed, but with the repeating allocate/free cycle in the attached code I see the application reproducibly being killed by the OOM killer at a 25GB allocation with MPI_Win_allocate_shared. When I try to run it under Valgrind, I get an error from MPI_Win_allocate at ~50GB that I cannot make sense of:
```
MPI_Alloc_mem: 53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n131302:11989] *** and potentially your MPI job)
```
3) If running with 2 processes on a node, I get the following error from both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
(the shared-memory backing file). It is likely that your MPI job will
now either abort or experience performance degradation.

  Local host:      n131702
  Space Requested: 6710890760 B
  Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as expected, i.e., I can allocate ~50GB per process. I understand that I can set $TMP to a bigger filesystem (such as Lustre), but then I am greeted with a warning on each allocation and performance seems to drop. Is there a way to fall back to the allocation strategy used in case 2)?
4) It is also worth noting the time it takes to allocate the memory: while the allocations are in the sub-millisecond range for both MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate 100GB using MPI_Win_allocate, and the time increases linearly with the allocation size.
Are these issues known? Is there perhaps documentation describing workarounds (esp. for 3 and 4)?
I am attaching a small benchmark. Please make sure to adjust the MEM_PER_NODE macro to suit your system before you run it :) I'm happy to provide additional details if needed.
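A minimal sketch of the timed allocate/touch/free cycle described above (this is not the attached benchmark itself; MEM_PER_NODE here merely stands in for the macro mentioned above):
```
/* Sketch of the allocate/touch/free cycle described in the mail above.
 * Not the attached benchmark; adjust MEM_PER_NODE to your system. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MEM_PER_NODE (100UL << 30)  /* placeholder: bytes to allocate, per process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    for (MPI_Aint size = 1UL << 30; size <= (MPI_Aint)MEM_PER_NODE; size *= 2) {
        void   *base;
        MPI_Win win;
        double  t0 = MPI_Wtime();

        /* MPI_Win_allocate works across nodes; the actual benchmark also
         * exercises MPI_Alloc_mem and MPI_Win_allocate_shared. */
        MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        memset(base, 0, (size_t)size);  /* touch every page */

        printf("allocated and touched %lld B in %.3f s\n",
               (long long)size, MPI_Wtime() - t0);
        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}
```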
Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users