I don't know any reason why you shouldn't be able to use IB for intra-node transfers. There are, of course, arguments against doing it in general (e.g., IB/PCI bandwidth is lower than DDR4 bandwidth), but it likely behaves less synchronously than shared memory, since I'm not aware of any MPI RMA library that dispatches intra-node RMA operations to an asynchronous agent (e.g., a communication helper thread).
Regarding 4, faulting 100GB in 24s corresponds to about 1us per 4K page, which doesn't sound unreasonable to me. You might investigate if/how you can use 2M or 1G pages instead. It's possible Open MPI already supports this if the underlying system does. You may need to twiddle your OS settings to get hugetlbfs working.
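For reference, here is a minimal C sketch (mine, not part of the original exchange) that checks whether 2M huge pages can be obtained via mmap(MAP_HUGETLB) on a node, independent of MPI; it assumes Linux and that a huge-page pool has already been configured (e.g. via vm.nr_hugepages):
```
/* Sketch (not from this thread): check whether 2 MB huge pages are available
 * via mmap(MAP_HUGETLB), independent of MPI. Assumes Linux and that a
 * huge-page pool has been configured (e.g. sysctl vm.nr_hugepages=1024). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64UL << 21;  /* 64 x 2 MB = 128 MB */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");  /* ENOMEM usually means no pool configured */
        return 1;
    }
    memset(p, 0, len);  /* touch the mapping to actually fault the pages in */
    printf("huge-page mapping of %zu bytes succeeded\n", len);
    munmap(p, len);
    return 0;
}
```
If this fails with ENOMEM, the huge-page pool is likely empty or absent, so there would be nothing for MPI to use either.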
Jeff
On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Jeff, all,
Thanks for the clarification. My measurements show that global memory allocations do not require the backing file if there is only one process per node, for an arbitrary number of processes. So I was wondering whether it is possible to use the same allocation scheme even with multiple processes per node if there is not enough space available in /tmp. However, I am not sure whether the IB devices can be used to perform intra-node RMA. At least that would retain the functionality on this kind of system (which, arguably, might be a rare case).
On a different note, I found over the weekend that Valgrind only supports allocations of up to 60GB, so my second point reported below may be invalid. Number 4 still seems curious to me, though.
Best
Joseph
On 08/25/2017 09:17 PM, Jeff Hammond wrote:
There's no reason to do anything special for shared memory with a single-process job, because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). However, it would help debugging if MPI implementers at least had an option to take the code path that allocates shared memory even when np=1.
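To make that equivalence concrete, here is a minimal sketch (mine, not from the thread) of the two single-process allocation paths:
```
/* A sketch (not from the thread) of the single-process equivalence described
 * above: with only one process, a "shared" window over MPI_COMM_SELF has
 * nobody to share with, so it is functionally the same as MPI_Alloc_mem and
 * need not be backed by a file. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Aint size = 1 << 20;  /* 1 MB, for illustration only */
    void *buf_plain, *buf_shared;
    MPI_Win win;

    /* Plain allocation: no inter-process visibility required. */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf_plain);

    /* "Shared" allocation over a single-process communicator. */
    MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, MPI_COMM_SELF,
                            &buf_shared, &win);

    printf("MPI_Alloc_mem: %p, MPI_Win_allocate_shared(SELF): %p\n",
           buf_plain, buf_shared);

    MPI_Win_free(&win);
    MPI_Free_mem(buf_plain);
    MPI_Finalize();
    return 0;
}
```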
Jeff
On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
Gilles,
Thanks for your swift response. On this system, /dev/shm only has 256M available, so that is no option, unfortunately. I tried disabling both the vader and sm btl via `--mca btl ^vader,sm`, but Open MPI still seems to allocate the shmem backing file under /tmp. From my point of view, missing out on the performance benefits of file-backed shared memory would be acceptable as long as large allocations work, but I don't know the implementation details and whether that is possible. It seems that the mmap does not happen if there is only one process per node.
Cheers,
Joseph
On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
Joseph,
the error message suggests that allocating memory with MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing it.
how much space do you have in /dev/shm? (this is a tmpfs, i.e. a RAM file system)
there is likely quite some space there, so as a workaround, i suggest you use this as the shared-memory backing directory
/* i am afk and do not remember the syntax, ompi_info --all | grep backing is likely to help */
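For concreteness, that lookup and workaround might look like the following; the parameter name below is a placeholder, since the exact MCA parameter depends on the Open MPI version and is not spelled out here:
```
# find the MCA parameter that controls the shared-memory backing directory
ompi_info --all | grep -i backing

# then point it at a tmpfs with enough free space, replacing
# <backing_dir_param> with the name reported above
mpirun -np 2 --mca <backing_dir_param> /dev/shm ./your_app
```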
Cheers,
Gilles
On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
All,
I have been experimenting with large window allocations recently and have made some interesting observations that I would like to share.
The system under test:
- Linux cluster equipped with IB
- Open MPI 2.1.1
- 128GB main memory per node
- 6GB /tmp filesystem per node
My observations:
1) Running with 1 process on a single node, I can allocate and write to memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared.
2) If running with 1 process per node on 2 nodes, single large allocations succeed, but with the repeating allocate/free cycle in the attached code I see the application reproducibly being killed by the OOM killer at a 25GB allocation with MPI_Win_allocate_shared. When I try to run it under Valgrind, I get an error from MPI_Win_allocate at ~50GB that I cannot make sense of:
```
MPI_Alloc_mem: 53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n131302:11989] *** and potentially your MPI job)
```
3) If running with 2 processes on a node, I get the following error from both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--------------------------------------------------------------------------
It appears as if there is not enough space for
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
(the shared-memory backing file). It is likely that your MPI job will
now either abort or experience performance degradation.

  Local host:      n131702
  Space Requested: 6710890760 B
  Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works as expected, i.e., I can allocate ~50GB per process. I understand that I can set $TMP to a bigger filesystem (such as Lustre), but then I am greeted with a warning on each allocation and performance seems to drop. Is there a way to fall back to the allocation strategy used in case 2)?
4) It is also worth noting the time it takes to allocate the memory: while the allocations are in the sub-millisecond range for both MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate 100GB using MPI_Win_allocate, and the time increases linearly with the allocation size.
Are these issues known? Is there perhaps documentation describing workarounds (esp. for 3 and 4)?
I am attaching a small benchmark. Please make sure to adjust the MEM_PER_NODE macro to suit your system before you run it :) I'm happy to provide additional details if needed.
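A minimal sketch of the timed allocate/touch/free cycle described above (this is not the attached benchmark itself; MEM_PER_NODE here merely stands in for the macro mentioned above):
```
/* Sketch of the allocate/touch/free cycle described in the mail above.
 * Not the attached benchmark; adjust MEM_PER_NODE to your system. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MEM_PER_NODE (100UL << 30)  /* placeholder: bytes to allocate, per process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    for (MPI_Aint size = 1UL << 30; size <= (MPI_Aint)MEM_PER_NODE; size *= 2) {
        void   *base;
        MPI_Win win;
        double  t0 = MPI_Wtime();

        /* MPI_Win_allocate works across nodes; the actual benchmark also
         * exercises MPI_Alloc_mem and MPI_Win_allocate_shared. */
        MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
        memset(base, 0, (size_t)size);  /* touch every page */

        printf("allocated and touched %lld B in %.3f s\n",
               (long long)size, MPI_Wtime() - t0);
        MPI_Win_free(&win);
    }

    MPI_Finalize();
    return 0;
}
```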
Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users