Jeff, all,
Thanks for the clarification. My measurements show that global memory
allocations do not require the backing file if there is only one
process per node, for an arbitrary number of processes. So I was
wondering whether the same allocation scheme could be used even with
multiple processes per node if there is not enough space available in
/tmp. However, I am not sure whether the IB devices can be used to
perform intra-node RMA. At least that would retain the functionality
on this kind of system (which arguably might be a rare case).
On a different note, I found over the weekend that Valgrind only
supports allocations up to 60GB, so my second point reported below may
be invalid. Number 4 still seems curious to me, though.
Best
Joseph
On 08/25/2017 09:17 PM, Jeff Hammond wrote:
There's no reason to do anything special for shared memory with a
single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF) ~=
MPI_Alloc_mem(). However, it would help debugging if MPI implementers
at least had an option to take the code path that allocates shared
memory even when np=1.
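For illustration, a minimal single-process sketch of the two paths
(the size is arbitrary; this is only a sketch, not the benchmark code):
```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1 << 20;  /* 1 MiB, arbitrary */

    /* Plain allocation, no window attached. */
    void *buf_alloc = NULL;
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf_alloc);

    /* Shared-memory window on a single-process communicator; with np=1
     * this can be served from plain local memory, so the shared-memory
     * backing-file path need not be taken. */
    void *buf_shared = NULL;
    MPI_Win win;
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &buf_shared, &win);

    printf("MPI_Alloc_mem: %p, MPI_Win_allocate_shared(SELF): %p\n",
           buf_alloc, buf_shared);

    MPI_Win_free(&win);
    MPI_Free_mem(buf_alloc);
    MPI_Finalize();
    return 0;
}
```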
Jeff
On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Gilles,
Thanks for your swift response. On this system, /dev/shm only has 256M
available, so that is unfortunately not an option. I tried disabling
both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI
still seems to allocate the shmem backing file under /tmp. From my
point of view, missing the performance benefits of file-backed shared
memory would be acceptable as long as large allocations work, but I
don't know the implementation details and whether that is possible. It
seems that the mmap does not happen if there is only one process per
node.
Cheers,
Joseph
On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
Joseph,
The error message suggests that allocating memory with
MPI_Win_allocate[_shared] is done by creating a file and then
mmap'ing it.
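Roughly, the mechanism looks like the sketch below (a simplified
illustration of the general technique, not Open MPI's actual code;
alloc_file_backed is a made-up name):
```
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a file in the backing directory (e.g. /tmp), grow it to the
 * requested size, and map it MAP_SHARED so that other processes on the
 * node can attach to the same file. The allocation is therefore bounded
 * by the space available on that filesystem. */
void *alloc_file_backed(const char *path, size_t size)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after closing the descriptor */
    return p == MAP_FAILED ? NULL : p;
}
```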
How much space do you have in /dev/shm? (This is a tmpfs, i.e. a
RAM-backed file system.) There is likely quite some space there, so as
a workaround I suggest you use it as the shared-memory backing
directory.
/* I am AFK and do not remember the syntax; `ompi_info --all | grep
backing` is likely to help. */
Cheers,
Gilles
On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,
I have been experimenting with large window allocations recently and
have made some interesting observations that I would like to share.
The system under test:
- Linux cluster equipped with IB
- Open MPI 2.1.1
- 128GB main memory per node
- 6GB /tmp filesystem per node
My observations:
1) Running with 1 process on a single node, I can allocate and write
to memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and
MPI_Win_allocate_shared.
2) Running with 1 process per node on 2 nodes, single large
allocations succeed, but with the repeated allocate/free cycle in the
attached code I see the application reproducibly being killed by the
OOM killer at the 25GB allocation with MPI_Win_allocate_shared. When I
try to run it under Valgrind, I get an error from MPI_Win_allocate at
~50GB that I cannot make sense of:
```
MPI_Alloc_mem: 53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n131302:11989] ***    and potentially your MPI job)
```
3) Running with 2 processes on a node, I get the following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
(the shared-memory backing file). It is likely that your MPI job will
now either abort or experience performance degradation.
Local host: n131702
Space Requested: 6710890760 B
Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem
works as expected, i.e., I can allocate ~50GB per process. I
understand that I can set $TMP to a bigger filesystem (such as
Lustre), but then I am greeted with a warning on each allocation and
performance seems to drop. Is there a way to fall back to the
allocation strategy used in case 2)?
4) It is also worth noting the time it takes to allocate the memory:
while the allocations are in the sub-millisecond range for both
MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate
100GB using MPI_Win_allocate, and the time increases linearly with the
allocation size.
Are these issues known? Maybe there is documentation describing
workarounds? (esp. for 3) and 4))
I am attaching a small benchmark. Please make sure to adjust the
MEM_PER_NODE macro to suit your system before you run it :) I'm happy
to provide additional details if needed.
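For reference, here is a minimal sketch of the kind of allocate/free
cycle I mean (this is not the actual attachment; MEM_PER_NODE and STEP
are placeholder values, and only the MPI_Win_allocate variant is
shown):
```
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MEM_PER_NODE (100UL * 1024 * 1024 * 1024)  /* adjust to your system */
#define STEP          (5UL * 1024 * 1024 * 1024)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (size_t size = STEP; size <= MEM_PER_NODE; size += STEP) {
        void *base = NULL;
        MPI_Win win;
        double t = MPI_Wtime();

        /* The shared-memory variant would call MPI_Win_allocate_shared
         * on a node-local communicator instead, e.g. one obtained via
         * MPI_Comm_split_type with MPI_COMM_TYPE_SHARED. */
        MPI_Win_allocate((MPI_Aint)size, 1, MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        memset(base, 0, size);  /* touch the allocation */

        if (rank == 0)
            printf("allocated %zu B in %.3f s\n", size, MPI_Wtime() - t);

        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}
```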
Best
Joseph
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users