There's no reason to do anything special for shared memory with a single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). However, it would help debugging if MPI implementers at least had an option to take the code path that allocates shared memory even when np=1.

Jeff
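A minimal sketch of the single-process case, not taken from the thread (buffer size and types are illustrative): on MPI_COMM_SELF, a shared window and MPI_Alloc_mem plus MPI_Win_create hand back essentially the same buffer.

```
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Aint size = 1 << 20;   /* 1 MiB; purely illustrative */
    double  *buf;
    MPI_Win  win;

    /* Path 1: a shared window on MPI_COMM_SELF. */
    MPI_Win_allocate_shared(size, sizeof(double), MPI_INFO_NULL,
                            MPI_COMM_SELF, &buf, &win);
    MPI_Win_free(&win);

    /* Path 2: MPI_Alloc_mem plus an ordinary window over the buffer. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
    MPI_Win_create(buf, size, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_SELF, &win);
    MPI_Win_free(&win);
    MPI_Free_mem(buf);

    MPI_Finalize();
    return 0;
}
```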
On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> Gilles,
>
> Thanks for your swift response. On this system, /dev/shm only has 256M
> available, so that is unfortunately not an option. I tried disabling both
> the vader and sm BTLs via `--mca btl ^vader,sm`, but Open MPI still seems
> to allocate the shmem backing file under /tmp. From my point of view,
> missing the performance benefits of file-backed shared memory would be
> acceptable as long as large allocations work, but I don't know the
> implementation details and whether that is possible. It seems that the
> mmap does not happen if there is only one process per node.
>
> Cheers,
> Joseph
>
> On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>> Joseph,
>>
>> The error message suggests that allocating memory with
>> MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing it.
>> How much space do you have in /dev/shm? (This is a tmpfs, i.e. a RAM
>> file system.) There is likely quite some space there, so as a workaround
>> I suggest you use it as the shared-memory backing directory.
>>
>> /* I am AFK and do not remember the syntax; ompi_info --all | grep
>> backing is likely to help. */
>>
>> Cheers,
>>
>> Gilles
>>
>> On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de>
>> wrote:
>>> All,
>>>
>>> I have been experimenting with large window allocations recently and
>>> have made some interesting observations that I would like to share.
>>>
>>> The system under test:
>>> - Linux cluster equipped with IB,
>>> - Open MPI 2.1.1,
>>> - 128GB main memory per node,
>>> - 6GB /tmp filesystem per node.
>>>
>>> My observations:
>>> 1) Running with 1 process on a single node, I can allocate and write to
>>> memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
>>> MPI_Win_allocate_shared.
>>>
>>> 2) Running with 1 process per node on 2 nodes, single large allocations
>>> succeed, but with the repeated allocate/free cycle in the attached code
>>> I see the application reproducibly being killed by the OOM killer at the
>>> 25GB allocation with MPI_Win_allocate_shared. When I try to run it under
>>> Valgrind, I get an error from MPI_Win_allocate at ~50GB that I cannot
>>> make sense of:
>>>
>>> ```
>>> MPI_Alloc_mem: 53687091200 B
>>> [n131302:11989] *** An error occurred in MPI_Alloc_mem
>>> [n131302:11989] *** reported by process [1567293441,1]
>>> [n131302:11989] *** on communicator MPI_COMM_WORLD
>>> [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>> [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>> will now abort,
>>> [n131302:11989] *** and potentially your MPI job)
>>> ```
>>>
>>> 3) Running with 2 processes on a node, I get the following error from
>>> both MPI_Win_allocate and MPI_Win_allocate_shared:
>>> ```
>>> --------------------------------------------------------------------------
>>> It appears as if there is not enough space for
>>> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>> (the shared-memory backing file). It is likely that your MPI job will
>>> now either abort or experience performance degradation.
>>>
>>>   Local host:      n131702
>>>   Space Requested: 6710890760 B
>>>   Space Available: 6433673216 B
>>> ```
>>> This seems to be related to the size limit of /tmp. MPI_Alloc_mem works
>>> as expected, i.e., I can allocate ~50GB per process.
>>> I understand that I can set $TMP to a bigger filesystem (such as Lustre)
>>> but then I am greeted with a warning on each allocation and performance
>>> seems to drop. Is there a way to fall back to the allocation strategy
>>> used in case 2)?
>>>
>>> 4) It is also worth noting the time it takes to allocate the memory:
>>> while the allocations are in the sub-millisecond range for both
>>> MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate
>>> 100GB using MPI_Win_allocate, and the time increases linearly with the
>>> allocation size.
>>>
>>> Are these issues known? Maybe there is documentation describing
>>> work-arounds? (esp. for 3) and 4))
>>>
>>> I am attaching a small benchmark. Please make sure to adjust the
>>> MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
>>> provide additional details if needed.
>>>
>>> Best
>>> Joseph
>>>
>>> --
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>>
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
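Gilles's point about the backing file can be illustrated with the general file-backed shared-memory pattern below. This is a rough sketch, not Open MPI's actual code, and the path and size are made up: the window memory is a file created in the backing directory, grown to the requested size, and mmap'ed, so the filesystem holding that directory bounds how much can be allocated.

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical backing file; Open MPI places its own file under the
     * session directory (e.g. in /tmp) or the configured backing directory. */
    const char  *path = "/tmp/shared_window_backing";
    const size_t size = (size_t)6 << 30;   /* ~6 GB, illustrative */

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* The file has to grow to the full window size, so the free space of
     * the filesystem holding it limits how much can be allocated. */
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... every process on the node would map this same file ... */

    munmap(base, size);
    close(fd);
    unlink(path);
    return 0;
}
```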
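Joseph's benchmark attachment is not reproduced here; the following is a rough sketch of the allocate/free cycle he describes, assuming 1 GB steps, per-page touching, and a per-node shared-memory communicator. MEM_PER_NODE is the only name taken from the thread, and its value here is arbitrary; the MPI_Wtime calls are added to mirror the timing observation in 4).

```
#include <mpi.h>
#include <stdio.h>

/* Adjust to the memory available per node, as noted in the thread. */
#define MEM_PER_NODE (100UL << 30)   /* 100 GB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the processes that share a node so the shared window is valid
     * even when the job spans several nodes. */
    MPI_Comm shmcomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);

    int local_rank, local_procs;
    MPI_Comm_rank(shmcomm, &local_rank);
    MPI_Comm_size(shmcomm, &local_procs);

    MPI_Aint step = 1L << 30;                      /* 1 GB increments */
    MPI_Aint max  = MEM_PER_NODE / local_procs;    /* per-process budget */

    for (MPI_Aint size = step; size <= max; size += step) {
        char   *base;
        MPI_Win  win;

        double t0 = MPI_Wtime();
        MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, shmcomm,
                                &base, &win);
        double t1 = MPI_Wtime();

        /* Touch every page so the allocation is actually backed. */
        for (MPI_Aint i = 0; i < size; i += 4096)
            base[i] = 1;

        MPI_Win_free(&win);

        if (local_rank == 0)
            printf("allocated %ld B in %.3f s\n", (long)size, t1 - t0);
    }

    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}
```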
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users