Joseph,

The error message suggests that memory allocated with MPI_Win_allocate[_shared] is backed by creating a file and then mmap'ing it. How much space do you have in /dev/shm? (This is a tmpfs, i.e. a RAM-backed file system.) There is likely quite some space there, so as a workaround I suggest you use it as the shared-memory backing directory.
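Something along these lines should do it. This is untested and the parameter names below are from memory, so please confirm them against the ompi_info output first; ./your_app stands for your benchmark binary:

```
# How much space does the RAM-backed filesystem offer on this node?
df -h /dev/shm

# Which MCA parameters control the shared-memory backing file?
ompi_info --all | grep backing

# Then relocate the backing directory to /dev/shm, e.g.
# (parameter names assumed -- check them against the output above):
mpirun -n 2 --mca shmem_mmap_backing_file_base_dir /dev/shm ./your_app
# or relocate the whole session directory instead:
mpirun -n 2 --mca orte_tmpdir_base /dev/shm ./your_app
```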
/* I am AFK and do not remember the exact syntax; `ompi_info --all | grep backing` is likely to help. */

Cheers,

Gilles

On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
> All,
>
> I have been experimenting with large window allocations recently and have
> made some interesting observations that I would like to share.
>
> The system under test:
> - Linux cluster equipped with IB,
> - Open MPI 2.1.1,
> - 128 GB main memory per node,
> - 6 GB /tmp filesystem per node.
>
> My observations:
> 1) Running with 1 process on a single node, I can allocate and write to
> memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
> MPI_Win_allocate_shared.
>
> 2) If running with 1 process per node on 2 nodes, single large allocations
> succeed, but with the repeated allocate/free cycle in the attached code I
> see the application reproducibly being killed by the OOM killer at the
> 25 GB allocation with MPI_Win_allocate_shared. When I try to run it under
> Valgrind, I get an error from MPI_Win_allocate at ~50 GB that I cannot
> make sense of:
>
> ```
> MPI_Alloc_mem: 53687091200 B
> [n131302:11989] *** An error occurred in MPI_Alloc_mem
> [n131302:11989] *** reported by process [1567293441,1]
> [n131302:11989] *** on communicator MPI_COMM_WORLD
> [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
> [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [n131302:11989] *** and potentially your MPI job)
> ```
>
> 3) If running with 2 processes on a node, I get the following error from
> both MPI_Win_allocate and MPI_Win_allocate_shared:
> ```
> --------------------------------------------------------------------------
> It appears as if there is not enough space for
> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
> (the shared-memory backing file). It is likely that your MPI job will
> now either abort or experience performance degradation.
>
> Local host:      n131702
> Space Requested: 6710890760 B
> Space Available: 6433673216 B
> ```
> This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as
> expected, i.e., I can allocate ~50 GB per process. I understand that I can
> set $TMP to a bigger filesystem (such as Lustre), but then I am greeted
> with a warning on each allocation and performance seems to drop. Is there
> a way to fall back to the allocation strategy used in case 2)?
>
> 4) It is also worth noting the time it takes to allocate the memory: while
> the allocations are in the sub-millisecond range for both MPI_Alloc_mem and
> MPI_Win_allocate_shared, it takes >24 s to allocate 100 GB using
> MPI_Win_allocate, and the time increases linearly with the allocation size.
>
> Are these issues known? Maybe there is documentation describing
> work-arounds? (esp. for 3) and 4))
>
> I am attaching a small benchmark. Please make sure to adjust the
> MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
> provide additional details if needed.
>
> Best
> Joseph
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
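P.S. For readers without the attachment, here is a rough sketch of the allocate/free cycle I understand case 2) to be exercising. This is my own reconstruction, not Joseph's benchmark; the 5 GB step and the 100 GB upper bound are placeholders for his MEM_PER_NODE setting:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Placeholder sizes: grow the per-process window by 5 GiB per iteration. */
    const MPI_Aint step = (MPI_Aint)5 << 30;
    const MPI_Aint max  = (MPI_Aint)100 << 30;

    for (MPI_Aint size = step; size <= max; size += step) {
        void   *base = NULL;
        MPI_Win win;

        if (rank == 0)
            printf("allocating %lld B per process\n", (long long)size);

        /* Note: the standard expects every rank in the communicator to be
         * able to share memory with the others. */
        MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                                &base, &win);

        /* Touch every page so the memory is actually backed. */
        for (MPI_Aint i = 0; i < size; i += 4096)
            ((char *)base)[i] = 1;

        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}
```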