In my experience, POSIX is much more reliable than Sys5. Sys5 depends
on the value of shmmax, which is often set to a small fraction of node
memory. I've probably seen the error described on
http://verahill.blogspot.com/2012/04/solution-to-nwchem-shmmax-too-small.html
with NWChem a thousand times because of this. POSIX, on the other hand,
isn't limited by SHMMAX (https://community.oracle.com/thread/3828422).
POSIX is newer than Sys5, and while Sys5 is supported by Linux and thus
almost ubiquitous, it wasn't supported by Blue Gene, so in an HPC
context, one can argue that POSIX is more portable.
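
To see which limit would bite on a given node, a check along these lines may help. It is only a sketch and assumes Linux, where the System V per-segment limit is exposed in /proc/sys/kernel/shmmax and POSIX shared memory objects live on the /dev/shm tmpfs. The helper names and the 25 GiB request are made up for illustration.

```python
import os


def sysv_shmmax(path="/proc/sys/kernel/shmmax"):
    """Return the System V per-segment size limit in bytes, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def fs_free_bytes(path="/dev/shm"):
    """Return the bytes available to unprivileged users on the filesystem holding path."""
    try:
        st = os.statvfs(path)
    except OSError:
        return None
    return st.f_bavail * st.f_frsize


def fits(request_bytes, limit_bytes):
    """True only if the limit is known and the request does not exceed it."""
    return limit_bytes is not None and request_bytes <= limit_bytes


if __name__ == "__main__":
    request = 25 * 2**30  # hypothetical 25 GiB segment
    for name, limit in (("System V (shmmax)", sysv_shmmax()),
                        ("POSIX (/dev/shm)", fs_free_bytes("/dev/shm"))):
        print(f"{name}: limit={limit} fits={fits(request, limit)}")
```

An unreadable limit is reported as "does not fit" on purpose, so a missing /proc entry is never mistaken for unlimited space.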
Jeff
On Fri, Sep 8, 2017 at 9:16 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
Joseph,
Thanks for sharing this!
sysv is imho the worst option because if something goes really
wrong, Open MPI might leave some shared memory segments behind when
a job crashes. From that perspective, leaving a big file in /tmp can
be seen as the lesser evil.
That being said, there might be other reasons that drove this design.
Cheers,
Gilles
Joseph Schuchart <schuch...@hlrs.de> wrote:
>We are currently discussing internally how to proceed with this issue on
>our machine. We did a little survey to see the setup of some of the
>machines we have access to, which includes an IBM, a Bull machine, and
>two Cray XC40 machines. To summarize our findings:
>
>1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and
>each limited to half of the main memory size per node.
>2) On the IBM system, nodes have 64GB and /tmp is limited to 20GB and
>mounted from a disk partition. /dev/shm, on the other hand, is sized at
>63GB.
>3) On the above systems, /proc/sys/kernel/shm* is set up to allow the
>full memory of the node to be used as System V shared memory.
>4) On the Bull machine, /tmp is mounted from a disk and fixed to ~100GB
>while /dev/shm is limited to half the node's memory (there are nodes
>with 2TB memory, huge page support is available). System V shmem on the
>other hand is limited to 4GB.
>
>Overall, it seems that there is no globally optimal allocation strategy
>as the best matching source of shared memory is machine dependent.
>
>Open MPI treats System V shared memory as the least favorable option,
>even giving it a lower priority than POSIX shared memory, where
>conflicting names might occur. What's the reason for preferring /tmp and
>POSIX shared memory over System V? It seems to me that the latter is a
>cleaner and safer way (provided that shared memory is not constrained by
>/proc, which could easily be detected) while mmap'ing large files feels
>somewhat hacky. Maybe I am missing an important aspect here though.
>
>The reason I am interested in this issue is that our PGAS library is
>built on top of MPI and allocates pretty much all memory exposed to the
>user through MPI windows. Thus, any limitation from the underlying MPI
>implementation (or system for that matter) limits the amount of usable
>memory for our users.
>
>Given our observations above, I would like to propose a change to the
>shared memory allocator: the priorities would be derived from the
>percentage of main memory each component can cover, i.e.,
>
>Priority = 99*(min(Memory, SpaceAvail) / Memory)
>
>At startup, each shm component would determine the available size (by
>looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*, respectively) and
>set its priority between 0 and 99. A user could force Open MPI to use a
>specific component by manually setting its priority to 100 (which of
>course has to be documented). The priority could factor in other aspects
>as well, such as whether /tmp is actually tmpfs or disk-based if that
>makes a difference in performance.
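
For illustration, the proposed rule can be sketched as below, using the numbers from the survey above. This is not Open MPI code: the function names are hypothetical, the component labels are plain dictionary keys, and truncating the result to an integer is an assumption, since the proposal does not say how to round.

```python
GiB = 2**30


def shm_priority(memory_bytes, space_avail_bytes):
    """Priority = 99 * min(Memory, SpaceAvail) / Memory, truncated to an integer."""
    if memory_bytes <= 0:
        raise ValueError("memory_bytes must be positive")
    covered = min(memory_bytes, max(space_avail_bytes, 0))
    return (99 * covered) // memory_bytes


def pick_component(memory_bytes, space_by_component):
    """Return (best component name, {name: priority}) for one node."""
    priorities = {name: shm_priority(memory_bytes, space)
                  for name, space in space_by_component.items()}
    best = max(priorities, key=priorities.get)
    return best, priorities


if __name__ == "__main__":
    # IBM system from the survey: 64GB nodes, 20GB /tmp, 63GB /dev/shm,
    # System V allowed to use the full node memory.
    ibm = {"mmap (/tmp)": 20 * GiB, "posix (/dev/shm)": 63 * GiB, "sysv": 64 * GiB}
    print(pick_component(64 * GiB, ibm))
    # Bull system: 2TB node, ~100GB /tmp, half the memory in /dev/shm, 4GB System V.
    bull = {"mmap (/tmp)": 100 * GiB, "posix (/dev/shm)": 1024 * GiB, "sysv": 4 * GiB}
    print(pick_component(2048 * GiB, bull))
```

With these inputs System V would win on the IBM system and POSIX on the Bull machine, which matches the observation that the best source of shared memory is machine dependent.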
>
>This proposal of course assumes that shared memory size is the sole
>optimization goal. Maybe there are other aspects to consider? I'd be
>happy to work on a patch but would like to get some feedback before
>getting my hands dirty. IMO, the current situation is less than ideal
>and prone to cause pain to the average user. In my recent experience,
>debugging this has been tedious and the user in general shouldn't have
>to care about how shared memory is allocated (and administrators don't
>always seem to care, see above).
>
>Any feedback is highly appreciated.
>
>Joseph
>
>
>On 09/04/2017 03:13 PM, Joseph Schuchart wrote:
>> Jeff, all,
>>
>> Unfortunately, I (as a user) have no control over the page size on our
>> cluster. My interest in this is more of a general nature because I am
>> concerned that our users who use Open MPI underneath our code run into
>> this issue on their machine.
>>
>> I took a look at the code for the various window creation methods and
>> now have a better picture of the allocation process in Open MPI. I
>> realized that memory in windows allocated through MPI_Win_allocate or
>> created through MPI_Win_create is registered with the IB device using
>> ibv_reg_mr, which takes significant time for large allocations (I assume
>> this is where hugepages would help?). In contrast to this, it seems that
>> memory attached through MPI_Win_attach is not registered, which explains
>> the lower latency for these allocations I am observing (I seem to
>> remember having observed higher communication latencies as well).
>>
>> Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix
>> component that uses shm_open to create a POSIX shared memory object
>> instead of a file on disk, which is then mmap'ed. Unfortunately, if I
>> raise the priority of this component above that of the default mmap
>> component I end up with a SIGBUS during MPI_Init. No other errors are
>> reported by MPI. Should I open a ticket on GitHub for this?
>>
>> As an alternative, would it be possible to use anonymous shared memory
>> mappings to avoid the backing file for large allocations (maybe above a
>> certain threshold) on systems that support MAP_ANONYMOUS and distribute
>> the result of the mmap call among the processes on the node?
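
As a point of reference, here is what an anonymous shared mapping looks like in a small, self-contained sketch. It assumes a Unix system and only covers processes created by fork() after the mapping exists: an anonymous mapping has no name that an unrelated, already running process could open, which is the hard part of the question above. The function name and the stored values are illustrative.

```python
import mmap
import os
import struct


def shared_values_demo(n_children=2):
    """Fork n_children processes that each write one integer into an
    anonymous MAP_SHARED mapping, then read the values back in the parent."""
    # fileno -1 requests anonymous memory; no backing file is created.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pids = []
    for rank in range(n_children):
        pid = os.fork()
        if pid == 0:
            # Child: write into its own 8-byte slot, then exit immediately.
            try:
                struct.pack_into("q", buf, 8 * rank, 100 + rank)
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    # Parent: the children's writes are visible through the same mapping.
    values = [struct.unpack_from("q", buf, 8 * rank)[0]
              for rank in range(n_children)]
    buf.close()
    return values


if __name__ == "__main__":
    print(shared_values_demo(2))
```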
>>
>> Thanks,
>> Joseph
>>
>> On 08/29/2017 06:12 PM, Jeff Hammond wrote:
>>> I don't know any reason why you shouldn't be able to use IB for
>>> intra-node transfers. There are, of course, arguments against doing
>>> it in general (e.g. IB/PCI bandwidth less than DDR4 bandwidth), but it
>>> likely behaves less synchronously than shared-memory, since I'm not
>>> aware of any MPI RMA library that dispatches the intranode RMA
>>> operations to an asynchronous agent (e.g. communication helper thread).
>>>
>>> Regarding 4, faulting 100GB in 24s corresponds to about 1us per 4K
>>> page, which doesn't sound unreasonable to me. You might investigate
>>> if/how you can use 2M or 1G pages instead. It's possible Open MPI
>>> already supports this, if the underlying system does. You may need to
>>> twiddle your OS settings to get hugetlbfs working.
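
The arithmetic behind that estimate, as a small sketch. It assumes "100GB" means 100 GiB and that the 24s is spent entirely on first-touch page faults; it makes no claim about how long the allocation would take with larger pages, only how many pages would have to be touched.

```python
KiB, MiB, GiB = 2**10, 2**20, 2**30


def page_count(total_bytes, page_bytes):
    """Number of pages needed to cover total_bytes (rounded up)."""
    return -(-total_bytes // page_bytes)


def per_page_us(total_bytes, seconds, page_bytes):
    """Average time per page in microseconds."""
    return seconds / page_count(total_bytes, page_bytes) * 1e6


if __name__ == "__main__":
    total = 100 * GiB
    print(f"4K pages: {page_count(total, 4 * KiB)} "
          f"({per_page_us(total, 24.0, 4 * KiB):.2f} us each at 24 s total)")
    print(f"2M pages: {page_count(total, 2 * MiB)}")
    print(f"1G pages: {page_count(total, GiB)}")
```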
>>>
>>> Jeff
>>>
>>> On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> Jeff, all,
>>>
>>> Thanks for the clarification. My measurements show that global
>>> memory allocations do not require the backing file if there is only
>>> one process per node, for an arbitrary number of processes. So I was
>>> wondering if it was possible to use the same allocation process even
>>> with multiple processes per node if there is not enough space
>>> available in /tmp. However, I am not sure whether the IB devices can
>>> be used to perform intra-node RMA. At least that would retain the
>>> functionality on this kind of system (which arguably might be a rare
>>> case).
>>>
>>> On a different note, I found during the weekend that Valgrind only
>>> supports allocations up to 60GB, so my second point reported below
>>> may be invalid. Number 4 still seems curious to me, though.
>>>
>>> Best
>>> Joseph
>>>
>>> On 08/25/2017 09:17 PM, Jeff Hammond wrote:
>>>
>>> There's no reason to do anything special for shared memory with
>>> a single-process job because
>>> MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem().
>>> However, it would help debugging if MPI implementers at least
>>> had an option to take the code path that allocates shared memory
>>> even when np=1.
>>>
>>> Jeff
>>>
>>> On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> Gilles,
>>>
>>> Thanks for your swift response. On this system, /dev/shm only has
>>> 256M available so that is no option unfortunately. I tried disabling
>>> both vader and sm btl via `--mca btl ^vader,sm` but Open MPI still
>>> seems to allocate the shmem backing file under /tmp. From my point
>>> of view, missing the performance benefits of file-backed shared
>>> memory would be acceptable as long as large allocations work, but I
>>> don't know the implementation details and whether that is possible.
>>> It seems that the mmap does not happen if there is only one process
>>> per node.
>>>
>>> Cheers,
>>> Joseph
>>>
>>>
>>> On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:
>>>
>>> Joseph,
>>>
>>> the error message suggests that allocating memory with
>>> MPI_Win_allocate[_shared] is done by creating a file and then
>>> mmap'ing it.
>>> how much space do you have in /dev/shm? (this is a tmpfs, i.e. a RAM
>>> file system)
>>> there is likely quite some space here, so as a workaround, i suggest
>>> you use this as the shared-memory backing directory
>>>
>>> /* i am afk and do not remember the syntax, ompi_info --all | grep
>>> backing is likely to help */
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:
>>>
>>> All,
>>>
>>> I have been experimenting with large window allocations recently and
>>> have made some interesting observations that I would like to share.
>>>
>>> The system under test:
>>> - Linux cluster equipped with IB,
>>> - Open MPI 2.1.1,
>>> - 128GB main memory per node
>>> - 6GB /tmp filesystem per node
>>>
>>> My observations:
>>> 1) Running with 1 process on a single node, I can allocate and write
>>> to memory up to ~110GB through MPI_Alloc_mem, MPI_Win_allocate, and
>>> MPI_Win_allocate_shared.
>>>
>>> 2) If running with 1 process per node on 2 nodes, single large
>>> allocations succeed, but with the repeating allocate/free cycle in the
>>> attached code I see the application reproducibly being killed by the
>>> OOM killer at 25GB allocation with MPI_Win_allocate_shared. When I try
>>> to run it under Valgrind I get an error from MPI_Win_allocate at ~50GB
>>> that I cannot make sense of:
>>>
>>> ```
>>> MPI_Alloc_mem: 53687091200 B
>>> [n131302:11989] *** An error occurred in MPI_Alloc_mem
>>> [n131302:11989] *** reported by process [1567293441,1]
>>> [n131302:11989] *** on communicator MPI_COMM_WORLD
>>> [n131302:11989] *** MPI_ERR_NO_MEM: out of memory
>>> [n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> [n131302:11989] *** and potentially your MPI job)
>>> ```
>>>
>>> 3) If running with 2 processes on a node, I get the following error
>>> from both MPI_Win_allocate and MPI_Win_allocate_shared:
>>> ```
>>> --------------------------------------------------------------------------
>>> It appears as if there is not enough space for
>>> /tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
>>> (the shared-memory backing file). It is likely that your MPI job will
>>> now either abort or experience performance degradation.
>>>
>>> Local host: n131702
>>> Space Requested: 6710890760 B
>>> Space Available: 6433673216 B
>>> ```
>>> This seems to be related to the size limit of /tmp. MPI_Alloc_mem
>>> works as expected, i.e., I can allocate ~50GB per process. I
>>> understand that I can set $TMP to a bigger filesystem (such as Lustre)
>>> but then I am greeted with a warning on each allocation and
>>> performance seems to drop. Is there a way to fall back to the
>>> allocation strategy used in case 2)?
>>>
>>> 4) It is also worth noting the time it takes to allocate the memory:
>>> while the allocations are in the sub-millisecond range for both
>>> MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate
>>> 100GB using MPI_Win_allocate and the time increases linearly with the
>>> allocation size.
>>>
>>> Are these issues known? Maybe there is documentation describing
>>> work-arounds? (esp. for 3) and 4))
>>>
>>> I am attaching a small benchmark. Please make sure to adjust the
>>> MEM_PER_NODE macro to suit your system before you run it :)
>>> I'm happy to provide additional details if needed.
>>>
>>> Best
>>> Joseph
>>> --
>>> Dipl.-Inf. Joseph Schuchart
>>> High Performance Computing Center Stuttgart (HLRS)
>>> Nobelstr. 19
>>> D-70569 Stuttgart
>>>
>>> Tel.: +49(0)711-68565890
>>> Fax: +49(0)711-6856832
>>> E-Mail: schuch...@hlrs.de
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://lists.open-mpi.org/mailman/listinfo/users
>>> --
>>> Jeff Hammond
>>> jeff.scie...@gmail.com
>>> http://jeffhammond.github.io/
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users