Yes, I confirm. Thanks for clarifying that interception is the intended behaviour. In the binary, the call goes through munmap@plt, which resolves to libc's munmap, not to the one in libopen-pal.so.
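For reference, a minimal way to check which shared object actually wins the interposition is to ask the dynamic linker directly. The little program below is not part of the original test case and only assumes glibc's dlfcn interface; it prints the object that provides the munmap definition found first in the global search order. Run inside (or linked like) the MPI application, it should name libopen-pal.so when the hook is in effect, and libc otherwise.

-----
/* which_munmap.c: report which shared object provides the munmap
 * definition that symbol interposition selects.
 * Build (illustrative): mpicc which_munmap.c -ldl -o which_munmap */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_DEFAULT searches the global scope in the same order the
     * PLT relocation for munmap is resolved in. */
    void *sym = dlsym(RTLD_DEFAULT, "munmap");
    Dl_info info;

    if (sym != NULL && dladdr(sym, &info) != 0 && info.dli_fname != NULL)
        printf("munmap resolves to %s\n", info.dli_fname);
    else
        printf("could not resolve munmap\n");
    return 0;
}
-----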
libc is 2.13-38+deb7u1. I'm a total noob at GOT/PLT relocations. What is the mechanism that should make the opal relocation win over the libc one?

E.

On Wed, Nov 12, 2014 at 7:40 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> FWIW, munmap is *supposed* to be intercepted. Can you confirm that when your application calls munmap, it doesn't make a call to libopen-pal.so?
>
> It should be calling this (1-line) function:
>
> -----
> /* intercept munmap, as the user can give back memory that way as well. */
> OPAL_DECLSPEC int munmap(void* addr, size_t len)
> {
>     return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
> }
> -----
>
>
> On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>
>> As far as I have been able to understand while looking at the code, it very much seems that Joshua pointed out the exact cause of the issue.
>>
>> munmap'ing a virtual address space region does not evict it from mpool_grdma->pool->lru_list. If a later mmap happens to return the same address (a priori tied to a different physical location), userspace believes this segment is already registered, and eventually the transfer is directed to a bogus location.
>>
>> This also seems to match this old discussion:
>>
>> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>>
>> Although I didn't read the whole discussion there, it very much seems that the proposal for moving the pinning/caching logic to the kernel did not make it, unfortunately.
>>
>> So are we here in the situation where this "munmap should be intercepted" logic actually proves too fragile? (In that it's not intercepted in my case.) The memory MCA in my configuration is:
>>     MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>>
>> I traced a bit what happens at the mmap call; it seems to go straight to the libc, not via Open MPI first.
>>
>> For the time being, I think I'll have to consider any mmap()/munmap() rather unsafe to play with in an Open MPI application.
>>
>> E.
>>
>> P.S.: a last version of the test case is attached.
>>
>> On 11 Nov 2014 at 19:48, "Emmanuel Thomé" <emmanuel.th...@gmail.com> wrote:
>>>
>>> Thanks a lot for your analysis. This seems consistent with what I can obtain by playing around with my different test cases.
>>>
>>> It seems that munmap() does *not* unregister the memory chunk from the cache. I suppose this is the reason for the bug.
>>>
>>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as substitutes for malloc()/free() triggers the same problem.
>>>
>>> It looks to me that there is an oversight in the OPAL hooks around the memory functions, then. Do you agree?
>>>
>>> E.
>>>
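To make the address aliasing concrete, here is a minimal stand-alone sketch (it is not the attached test case, and only assumes ordinary Linux mmap behaviour) of the mmap/munmap substitute for malloc/free described above. On a typical run the second mmap hands back exactly the address that was just unmapped, which is the situation a registration cache keyed on virtual addresses cannot detect unless munmap is intercepted.

-----
/* mmap_reuse.c: use mmap/munmap as a malloc/free substitute and show
 * that the freed virtual address range is commonly handed out again
 * by the very next mmap of the same size.
 * Build (illustrative): gcc mmap_reuse.c -o mmap_reuse */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (1 << 20)                   /* a 1 MiB "allocation" */

int main(void)
{
    void *a = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (a == MAP_FAILED)
        return 1;
    memset(a, 0x41, LEN);               /* pretend this buffer was sent (and pinned) */
    munmap(a, LEN);                     /* "free": no eviction happens if libc's munmap runs */

    void *b = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (b == MAP_FAILED)
        return 1;
    printf("first buffer %p, second buffer %p (%s)\n", a, b,
           a == b ? "same virtual address reused" : "different address");
    munmap(b, LEN);
    return 0;
}
-----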
>>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>
>>>> I was able to reproduce your issue and I think I understand the problem a bit better, at least. This demonstrates exactly what I was pointing to:
>>>>
>>>> It looks like when the test switches over from eager RDMA (I'll explain in a second) to a rendezvous protocol working entirely in user buffer space, things go bad.
>>>>
>>>> If your input is smaller than some threshold, the eager RDMA limit, then the contents of your user buffer are copied into OMPI/OpenIB BTL scratch buffers called "eager fragments". This pool of resources is preregistered and pinned, and its rkeys have already been exchanged. So, in the eager protocol, your data is copied into these "locked and loaded" RDMA frags and the put/get is handled internally. When the data is received, it's copied back out into your buffer. In your setup, this always works.
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>
>>>> When you exceed the eager threshold, this always fails on the second iteration. To understand this, you need to know that there is a protocol switch after which your user buffer is used for the transfer. Hence, the user buffer is registered with the HCA. This is an inherently high-latency operation and is one of the primary motives for doing copy-in/copy-out into preregistered buffers for small, latency-sensitive ops. For bandwidth-bound transfers, the cost to register can be amortized over the whole transfer, but it still affects the total bandwidth. In the case of a rendezvous protocol where the user buffer is registered, there is an optimization, mostly used to help improve the numbers in a bandwidth benchmark, called a registration cache. With registration caching, the user buffer is registered once, the mkey is put into a cache, and the memory is kept pinned until the system provides some notification, via either the memory hooks in ptmalloc2 or ummunotify, that the buffer has been freed; this signals that the mkey can be evicted from the cache. On subsequent send/recv operations from the same user buffer address, the OpenIB BTL will find the address in the registration cache, take the cached mkey, avoid paying the cost of the memory registration, and start the data transfer.
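The mechanism Josh describes can be pictured with a deliberately simplified, address-keyed cache. The names below (reg_entry, reg_cache_find, reg_cache_evict, mkey) are invented for the sketch and are not Open MPI's actual mpool/grdma code; the point is only that the hot path trusts the virtual address, so a munmap that bypasses the hook leaves a stale entry behind for any later buffer that happens to land at the same address.

-----
/* Illustrative, simplified registration cache keyed on a virtual
 * address range.  All names are made up for this sketch; they are not
 * the real Open MPI data structures. */
#include <stddef.h>
#include <stdint.h>

struct reg_entry {
    char    *addr;                /* start of the registered user range */
    size_t   len;
    uint32_t mkey;                /* HCA memory key from registration   */
    struct reg_entry *next;
};

static struct reg_entry *cache;   /* an LRU list in the real code */

/* Hot path of a large send/recv: reuse a cached registration if the
 * virtual address range is already known, skipping registration. */
struct reg_entry *reg_cache_find(char *addr, size_t len)
{
    for (struct reg_entry *e = cache; e != NULL; e = e->next)
        if (addr >= e->addr && addr + len <= e->addr + e->len)
            return e;             /* cache hit: the cached mkey is used */
    return NULL;
}

/* What the munmap hook is supposed to guarantee: evict (and
 * deregister) any entry overlapping the unmapped range.  If libc's
 * munmap runs instead of the hook, this never happens, and a later
 * mmap() that returns the same virtual address inherits a stale mkey
 * that still refers to the old physical pages. */
void reg_cache_evict(char *addr, size_t len)
{
    for (struct reg_entry **pe = &cache; *pe != NULL; ) {
        struct reg_entry *e = *pe;
        if (e->addr < addr + len && addr < e->addr + e->len)
            *pe = e->next;        /* unlink; real code would also deregister */
        else
            pe = &e->next;
    }
}
-----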
>>>> What I noticed is that when the rendezvous protocol kicks in, it always fails on the second iteration.
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>>> --------------------------------------------------------------------------
>>>>
>>>> So, I suspected it has something to do with the way the virtual address is being handled in this case. To test that theory, I completely disabled the registration cache by setting -mca mpi_leave_pinned 0, and things start to work:
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>
>>>> I don't know enough about memory hooks or the registration cache implementation to speak with any authority, but it looks like this is where the issue resides. As a workaround, can you try your original experiment with -mca mpi_leave_pinned 0 and see if you get consistent results?
>>>>
>>>> Josh
>>>>
>>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>
>>>>> Hi again,
>>>>>
>>>>> I've been able to simplify my test case significantly. It now runs with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>>
>>>>> The pattern is as follows.
>>>>>
>>>>> * - ranks 0 and 1 both own a local buffer.
>>>>> * - each fills it with (deterministically known) data.
>>>>> * - rank 0 collects the data from rank 1's local buffer
>>>>> *   (whose contents should be no mystery), and writes this to a
>>>>> *   file-backed mmaped area.
>>>>> * - rank 0 compares what it receives with what it knows it *should
>>>>> *   have* received.
>>>>>
>>>>> The test fails if:
>>>>>
>>>>> * - the openib btl is used between the 2 nodes
>>>>> * - a file-backed mmaped area is used for receiving the data.
>>>>> * - the write is done to a newly created file.
>>>>> * - the per-node buffer is large enough.
>>>>>
>>>>> For a per-node buffer size above 12kb (12240 bytes, to be exact), my program fails, since the MPI_Recv does not receive the correct data chunk (it just gets zeroes).
>>>>>
>>>>> I attach the simplified test case. I hope someone will be able to reproduce the problem.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> E.
>>>>>
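A compressed illustration of that pattern is sketched below. This is not the attached test case (the file name, sizes, and the 0x42 fill value are arbitrary choices for the sketch, and error checks are omitted for brevity); it only reproduces the shape of the exchange: rank 1 sends a plain buffer, rank 0 receives straight into a file-backed mmaped area of a newly created file and checks what arrived.

-----
/* send_into_mmap.c: sketch of the failing pattern, to be run with
 * exactly 2 ranks over the openib BTL.
 * Build (illustrative): mpicc send_into_mmap.c -o send_into_mmap */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (16 * 1024)          /* above the ~12 kB threshold reported */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* rank 1: a local buffer filled with deterministically known data */
        unsigned char *buf = malloc(SIZE);
        memset(buf, 0x42, SIZE);
        MPI_Send(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        free(buf);
    } else if (rank == 0) {
        /* rank 0: receive directly into a file-backed mmaped area of a
         * newly created file */
        int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, SIZE);
        unsigned char *area = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        MPI_Recv(area, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("lead byte received from peer is 0x%02x [%s]\n",
               area[0], area[0] == 0x42 ? "ok" : "NOK");
        munmap(area, SIZE);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}
-----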
>>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>> Thanks for your answer.
>>>>>>
>>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>>> Just really quick, off the top of my head: mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable).
>>>>>>
>>>>>> Yes. Does that mean that the result of computations should be undefined if I happen to give a user buffer which corresponds to a file? That would be surprising.
>>>>>>
>>>>>>> For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.
>>>>>>
>>>>>> Yes. But what are the semantics of pinning the VM area pointed to by ptr if ptr happens to be mmaped from a file?
>>>>>>
>>>>>>> If the data being transferred is small, you'll copy-in/out to internal bounce buffers and you shouldn't have issues.
>>>>>>
>>>>>> Are you saying that the openib layer does have provision in this case for letting the RDMA happen with a pinned physical memory range, and later performing the copy to the file-backed mmaped range? That would make perfect sense indeed, although I don't have enough familiarity with the OMPI code to see where it happens, and more importantly whether the completion properly waits for this post-RDMA copy to complete.
>>>>>>
>>>>>>> 1. If you try to just bcast a few kilobytes of data using this technique, do you run into issues?
>>>>>>
>>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be a little bit more precise about what scenario you imagine? The setting "all ranks mmap a local file, and rank 0 broadcasts there" is successful.
>>>>>>
>>>>>>> 2. How large is the data in the collective (input and output), and is in_place used? I'm guessing it's large enough that the BTL tries to work with the user buffer.
>>>>>>
>>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code. Collectives are with communicators of 2 nodes, and we're talking (for the smallest failing run) 8kb per node (i.e. 16kb total for an allgather).
>>>>>>
>>>>>> E.
>>>>>>
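For readers less familiar with the in-place variants mentioned here, a minimal sketch of the collective shape being discussed (2 ranks, 8 kB per node, MPI_IN_PLACE; the fill values are arbitrary): each rank's own 8 kB block already sits at offset rank * 8192 of the 16 kB result buffer, and the allgather fills in the peer's block.

-----
/* inplace_allgather.c: sketch of an in-place MPI_Allgather with the
 * sizes mentioned above (2 ranks, 8 kB contribution per rank).
 * Build (illustrative): mpicc inplace_allgather.c -o inplace_allgather */
#include <mpi.h>
#include <string.h>

#define PER_NODE 8192

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unsigned char result[2 * PER_NODE];
    /* fill only my own block; the collective provides the other one */
    memset(result + rank * PER_NODE, 0x10 * (rank + 1), PER_NODE);

    /* With MPI_IN_PLACE the send arguments are ignored and each rank's
     * contribution is taken from its slot of the receive buffer. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  result, PER_NODE, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
-----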
>>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm stumbling on a problem related to the openib btl in openmpi-1.[78].*, and the (I think legitimate) use of file-backed mmaped areas for receiving data through MPI collective calls.
>>>>>>>>
>>>>>>>> A test case is attached. I've tried to make it reasonably small, although I recognize that it's not extra thin. The test case is a trimmed-down version of what I witness in the context of a rather large program, so there is no claim of relevance of the test case itself. It's here just to trigger the desired misbehaviour. The test case contains some detailed information on what is done, and the experiments I did.
>>>>>>>>
>>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>>
>>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
>>>>>>>> - I save the result to a file (collective operation).
>>>>>>>>
>>>>>>>> *If* I save the file using something such as:
>>>>>>>>     fd = open("blah", ...
>>>>>>>>     area = mmap(..., fd, )
>>>>>>>>     MPI_Gather(..., area, ...)
>>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative implementation (which I believe is correct)
>>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>>
>>>>>>>> then the file which gets saved is inconsistent with what is obtained with the normal MPI_Reduce_scatter (although memory areas do coincide before the save).
>>>>>>>>
>>>>>>>> I tried to dig a bit in the openib internals, but all I've been able to witness was beyond my expertise (an RDMA read not transferring the expected data, but I'm too uncomfortable with this layer to say anything I'm sure about).
>>>>>>>>
>>>>>>>> Tests have been done with several openmpi versions including 1.8.3, on a Debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>>
>>>>>>>> It would be great if someone could tell me whether they are able to reproduce the bug, or whether something which is done in this test case is illegal in any respect. I'd be glad to provide further information which could be of any help.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> E. Thomé.
>>
>> <prog6.c>