Gilles,

Thanks for your swift response. On this system, /dev/shm only has 256M available, so that is unfortunately not an option. I tried disabling both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still seems to allocate the shmem backing file under /tmp. From my point of view, missing out on the performance benefits of file-backed shared memory would be acceptable as long as large allocations work, but I don't know the implementation details or whether that is possible. It seems that the mmap does not happen if there is only one process per node.

Cheers,
Joseph

On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
Joseph,

The error message suggests that allocating memory with
MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing
it.
How much space do you have in /dev/shm? (This is a tmpfs, i.e. a
RAM-backed file system.)
There is likely quite some space there, so as a workaround, I suggest
you use it as the shared-memory backing directory.

/* I am afk and do not remember the syntax; `ompi_info --all | grep
backing` is likely to help */
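[Editor's note: for reference, the query and override might look like the following. The parameter name `shmem_mmap_backing_file_base_dir` is an assumption based on Open MPI 2.x's shmem mmap component and may differ between versions; `ompi_info` shows the exact name on your installation.]

```
# List all MCA parameters mentioning "backing" to find the exact name:
ompi_info --all | grep backing

# Then point the mmap shmem component at /dev/shm (parameter name assumed):
mpirun --mca shmem_mmap_backing_file_base_dir /dev/shm -np 2 ./benchmark
```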

Cheers,

Gilles

On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,

I have been experimenting with large window allocations recently and have
made some interesting observations that I would like to share.

The system under test:
   - Linux cluster equipped with InfiniBand
   - Open MPI 2.1.1
   - 128 GB main memory per node
   - 6 GB /tmp filesystem per node

My observations:
1) Running with 1 process on a single node, I can allocate and write to
memory of up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
MPI_Win_allocate_shared.

2) When running with 1 process per node on 2 nodes, single large allocations
succeed, but with the repeated allocate/free cycle in the attached code I
see the application reproducibly being killed by the OOM killer at a 25 GB
allocation with MPI_Win_allocate_shared. When I try to run it under Valgrind,
I get an error from MPI_Win_allocate at ~50 GB that I cannot make sense of:

```
MPI_Alloc_mem:  53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[n131302:11989] ***    and potentially your MPI job)
```

3) If running with 2 processes on a node, I get the following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702 (the
shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

   Local host:  n131702
   Space Requested: 6710890760 B
   Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as
expected, i.e., I can allocate ~50 GB per process. I understand that I can
set $TMP to a bigger filesystem (such as Lustre), but then I am greeted with
a warning on each allocation and performance seems to drop. Is there a way
to fall back to the allocation strategy used in case 2)?

4) It is also worth noting the time it takes to allocate the memory: while
the allocations are in the sub-millisecond range for both MPI_Alloc_mem and
MPI_Win_allocate_shared, it takes >24 s to allocate 100 GB using
MPI_Win_allocate, and the time increases linearly with the allocation size.

Are these issues known? Is there perhaps documentation describing
workarounds, especially for 3) and 4)?

I am attaching a small benchmark. Please make sure to adjust the
MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
provide additional details if needed.

Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users