Jeff, all,

Thanks for the clarification. My measurements show that global memory allocations do not require the backing file if there is only one process per node, for an arbitrary number of processes. So I was wondering whether the same allocation path could be used even with multiple processes per node if there is not enough space available in /tmp. However, I am not sure whether the IB devices can be used to perform intra-node RMA. At least that would retain the functionality on this kind of system (which arguably might be a rare case).

On a different note, I found over the weekend that Valgrind only supports allocations up to 60GB, so my second point reported below may be invalid. Number 4 still seems curious to me, though.

Best
Joseph

On 08/25/2017 09:17 PM, Jeff Hammond wrote:
There's no reason to do anything special for shared memory with a single-process job, because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). However, it would help debugging if MPI implementations at least offered an option to take the code path that allocates shared memory even when np=1.
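To make the comparison concrete, here is a minimal sketch of the two single-process allocation paths; the size and variable names are illustrative only and not taken from the attached benchmark:

```
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1UL << 30; /* 1 GiB, for illustration only */

    /* Path 1: plain allocation, no window semantics */
    void *buf1;
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf1);
    memset(buf1, 0, size);
    MPI_Free_mem(buf1);

    /* Path 2: shared-memory window on a single-process communicator,
     * which for np=1 should behave much like MPI_Alloc_mem */
    void *buf2;
    MPI_Win win;
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF, &buf2, &win);
    memset(buf2, 0, size);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```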

Jeff

On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

    Gilles,

    Thanks for your swift response. On this system, /dev/shm only has
    256M available, so that is unfortunately not an option. I tried
    disabling both the vader and sm btl via `--mca btl ^vader,sm`, but
    Open MPI still seems to allocate the shmem backing file under /tmp.
    From my point of view, missing out on the performance benefits of
    file-backed shared memory would be acceptable as long as large
    allocations work, but I don't know the implementation details and
    whether that is possible. It seems that the mmap does not happen if
    there is only one process per node.

    Cheers,
    Joseph


    On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

        Joseph,

        The error message suggests that allocating memory with
        MPI_Win_allocate[_shared] is done by creating a file and then
        mmap'ing it.
        How much space do you have in /dev/shm? (This is a tmpfs, i.e. a
        RAM-backed file system.) There is likely quite some space there,
        so as a workaround I suggest you use it as the shared-memory
        backing directory.

        /* I am AFK and do not remember the exact syntax;
        ompi_info --all | grep backing is likely to help */

        Cheers,

        Gilles

        On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart
        <schuch...@hlrs.de> wrote:

            All,

            I have been experimenting with large window allocations
            recently and have
            made some interesting observations that I would like to share.

            The system under test:
                - Linux cluster equipped with IB
                - Open MPI 2.1.1
                - 128 GB main memory per node
                - 6 GB /tmp filesystem per node

            My observations:
            1) Running with 1 process on a single node, I can allocate
            and write to memory up to ~110 GB through MPI_Alloc_mem,
            MPI_Win_allocate, and MPI_Win_allocate_shared.

            2) Running with 1 process per node on 2 nodes, single large
            allocations succeed, but with the repeated allocate/free cycle
            in the attached code the application is reproducibly killed by
            the OOM killer at a 25 GB allocation with
            MPI_Win_allocate_shared. When I run it under Valgrind I get an
            error from MPI_Win_allocate at ~50 GB that I cannot make sense
            of:

            ```
            MPI_Alloc_mem:  53687091200 B
            [n131302:11989] *** An error occurred in MPI_Alloc_mem
            [n131302:11989] *** reported by process [1567293441,1]
            [n131302:11989] *** on communicator MPI_COMM_WORLD
            [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
            [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
            [n131302:11989] ***    and potentially your MPI job)
            ```

            3) Running with 2 processes on a node, I get the following
            error from both MPI_Win_allocate and MPI_Win_allocate_shared:
            ```
            --------------------------------------------------------------------------
            It appears as if there is not enough space for
            /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
            (the shared-memory backing file). It is likely that your MPI job
            will now either abort or experience performance degradation.

                Local host:  n131702
                Space Requested: 6710890760 B
                Space Available: 6433673216 B
            ```
            This seems to be related to the size limit of /tmp.
            MPI_Alloc_mem works as expected, i.e., I can allocate ~50 GB
            per process. I understand that I can set $TMP to a bigger
            filesystem (such as Lustre), but then I am greeted with a
            warning on each allocation and performance seems to drop. Is
            there a way to fall back to the allocation strategy used in
            case 2)?

            4) It is also worth noting the time it takes to allocate the
            memory: while the allocations are in the sub-millisecond range
            for both MPI_Alloc_mem and MPI_Win_allocate_shared, it takes
            >24 s to allocate 100 GB using MPI_Win_allocate, and the time
            increases linearly with the allocation size.

            Are these issues known? Is there perhaps documentation
            describing workarounds, especially for 3) and 4)?

            I am attaching a small benchmark (a rough sketch follows
            below). Please make sure to adjust the MEM_PER_NODE macro to
            suit your system before you run it :) I'm happy to provide
            additional details if needed.
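
            For reference, here is a minimal sketch of the kind of
            allocate/write/free cycle described above. It is not the
            actual attachment; MEM_PER_NODE, the 5 GB step size, and the
            loop structure are illustrative assumptions only.
            ```
            #include <mpi.h>
            #include <stdio.h>
            #include <string.h>

            /* Adjust to your system; assumes one process per node,
             * otherwise divide by the number of processes per node. */
            #define MEM_PER_NODE (100UL << 30)
            #define STEP         (5UL << 30)

            int main(int argc, char **argv)
            {
                MPI_Init(&argc, &argv);

                for (MPI_Aint size = STEP; size <= (MPI_Aint)MEM_PER_NODE;
                     size += STEP) {
                    double t = MPI_Wtime();

                    void *base;
                    MPI_Win win;
                    MPI_Win_allocate(size, 1, MPI_INFO_NULL,
                                     MPI_COMM_WORLD, &base, &win);
                    memset(base, 0, size);  /* touch the whole allocation */
                    MPI_Win_free(&win);

                    printf("MPI_Win_allocate: %lld B in %f s\n",
                           (long long)size, MPI_Wtime() - t);
                    /* repeat analogously for MPI_Alloc_mem and
                     * MPI_Win_allocate_shared */
                }

                MPI_Finalize();
                return 0;
            }
            ```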

            Best
            Joseph
            --
            Dipl.-Inf. Joseph Schuchart
            High Performance Computing Center Stuttgart (HLRS)
            Nobelstr. 19
            D-70569 Stuttgart

            Tel.: +49(0)711-68565890
            Fax: +49(0)711-6856832
            E-Mail: schuch...@hlrs.de


    --
    Dipl.-Inf. Joseph Schuchart
    High Performance Computing Center Stuttgart (HLRS)
    Nobelstr. 19
    D-70569 Stuttgart

    Tel.: +49(0)711-68565890
    Fax: +49(0)711-6856832
    E-Mail: schuch...@hlrs.de

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de