Hi,

It turns out that the DT_NEEDED libs for my a.out are:
Dynamic Section:
  NEEDED               libmpi.so.1
  NEEDED               libpthread.so.0
  NEEDED               libc.so.6
which is absolutely consistent with the link command line:
catrel-44 ~ $ mpicc -W -Wall -std=c99 -O0 -g prog6.c -show
gcc -W -Wall -std=c99 -O0 -g prog6.c -pthread -Wl,-rpath -Wl,/usr/lib
-Wl,--enable-new-dtags -lmpi


As a consequence, libc wins over libopen-pal for the munmap symbol,
since libopen-pal appears further down in the DSO resolution order:
catrel-44 ~ $ ldd ./a.out
        linux-vdso.so.1 =>  (0x00007fffc5811000)
        libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007fa5fd904000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
(0x00007fa5fd6d5000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa5fd349000)
        libopen-rte.so.7 => /usr/lib/libopen-rte.so.7 (0x00007fa5fd0cd000)
        libopen-pal.so.6 => /usr/lib/libopen-pal.so.6 (0x00007fa5fcdf9000)
        libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00007fa5fcbed000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa5fc9e9000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa5fc7e1000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa5fc55e000)
        libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa5fc35b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa5fdbe0000)

If I explicitly add -lopen-pal to the link command line, or if I pass
--openmpi:linkall to the mpicc wrapper, then libopen-pal appears
before libc and wins the contest for the munmap relocation, which
makes my test pass.
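
(For reference, a quick way to check which DSO ends up providing munmap
under the default lookup order is something along these lines -- a
minimal sketch, the file name is made up; compile with
gcc check_munmap.c -ldl:)

-----
/* check_munmap.c (made-up name): print which shared object the munmap
 * symbol resolves to under the default search order, i.e. which DSO
 * wins the interposition contest. */
#define _GNU_SOURCE
#include <stdio.h>
#include <dlfcn.h>

int main(void)
{
    void *sym = dlsym(RTLD_DEFAULT, "munmap");
    Dl_info info;
    if (sym != NULL && dladdr(sym, &info) && info.dli_fname != NULL)
        printf("munmap is %p, provided by %s\n", sym, info.dli_fname);
    else
        printf("could not resolve munmap\n");
    return 0;
}
-----

dlsym(RTLD_DEFAULT, ...) follows the same default search order that the
dynamic linker uses for the relocation, so it should agree with what
munmap@plt ends up calling.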

Is there supposed to be a smarter mechanism for making the
libopen-pal relocation win, rather than just DSO precedence?

Best regards,

E.

On Wed, Nov 12, 2014 at 7:51 PM, Emmanuel Thomé
<emmanuel.th...@gmail.com> wrote:
> Yes, I confirm. Thanks for pointing out that this is the intended behaviour.
>
> In the binary, the code goes through munmap@plt, which ends up in libc,
> not in libopen-pal.so.
>
> libc is 2.13-38+deb7u1
>
> I'm a total noob at GOT/PLT relocations. What is the mechanism that
> should make the OPAL relocation win over the libc one?
>
> E.
>
>
> On Wed, Nov 12, 2014 at 7:40 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> FWIW, munmap is *supposed* to be intercepted.  Can you confirm that when 
>> your application calls munmap, it doesn't make a call to libopen-pal.so?
>>
>> It should be calling this (1-line) function:
>>
>> -----
>> /* intercept munmap, as the user can give back memory that way as well. */
>> OPAL_DECLSPEC int munmap(void* addr, size_t len)
>> {
>>     return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
>> }
>> -----
>>
>>
>>
>> On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> 
>> wrote:
>>
>>> As far as I have been able to understand while looking at the code, it
>>> very much seems that Joshua pointed out the exact cause of the issue.
>>>
>>> munmap'ing a virtual address space region does not evict it from
>>> mpool_grdma->pool->lru_list. If a later mmap happens to return the
>>> same address (a priori tied to a different physical location), user
>>> space believes this segment is already registered, and eventually
>>> the transfer is directed to a bogus location.
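>>>
>>> (To make the address-reuse scenario concrete, here is a minimal sketch.
>>> The kernel is of course free to place the second mapping anywhere, but
>>> in practice it frequently reuses the hole that was just freed:)
>>>
>>> -----
>>> /* sketch: a fresh mmap may land at the very address of a region that
>>>  * was just munmap'ed -- which is what confuses a cache keyed on
>>>  * virtual addresses. */
>>> #define _GNU_SOURCE
>>> #include <stdio.h>
>>> #include <sys/mman.h>
>>>
>>> int main(void)
>>> {
>>>     size_t len = 1 << 20;
>>>     int prot = PROT_READ | PROT_WRITE;
>>>     int flags = MAP_PRIVATE | MAP_ANONYMOUS;
>>>     void *a = mmap(NULL, len, prot, flags, -1, 0);
>>>     munmap(a, len);
>>>     void *b = mmap(NULL, len, prot, flags, -1, 0);
>>>     printf("first %p, second %p -> %s\n", a, b,
>>>            a == b ? "same address reused" : "different address");
>>>     munmap(b, len);
>>>     return 0;
>>> }
>>> -----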
>>>
>>> This also seems to match this old discussion:
>>>
>>> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>>>
>>> although I didn't read the whole discussion there, it very much seems
>>> that the proposal for moving the pinning/caching logic to the kernel
>>> did not make it, unfortunately.
>>>
>>> So are we here in the situation where this "munmap should be
>>> intercepted" logic actually proves too fragile, in that it's not
>>> intercepted in my case? The memory MCA in my configuration is:
>>>              MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>>>
>>> I traced a bit what happens at the mmap call; it seems to go straight
>>> to libc, not through Open MPI first.
>>>
>>> For the time being, I think I'll have to consider any use of
>>> mmap()/munmap() rather unsafe in an Open MPI application.
>>>
>>> E.
>>>
>>> P.S: a last version of the test case is attached.
>>>
>>> On Nov 11, 2014 at 7:48 PM, "Emmanuel Thomé" <emmanuel.th...@gmail.com> wrote:
>>>>
>>>> Thanks a lot for your analysis. This seems consistent with what I can
>>>> obtain by playing around with my different test cases.
>>>>
>>>> It seems that munmap() does *not* unregister the memory chunk from the
>>>> cache. I suppose this is the reason for the bug.
>>>>
>>>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as
>>>> substitutes for malloc()/free() triggers the same problem.
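>>>>
>>>> (By "substitutes" I mean wrappers of roughly this shape -- a sketch,
>>>> not the exact code from my test:)
>>>>
>>>> -----
>>>> /* sketch of such substitutes: allocate with an anonymous private
>>>>  * mapping, release with munmap (the caller keeps track of the size).
>>>>  * Every "free" hands the pages straight back to the kernel. */
>>>> #include <stddef.h>
>>>> #include <sys/mman.h>
>>>>
>>>> static void *mmap_alloc(size_t len)
>>>> {
>>>>     void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>>>                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>     return p == MAP_FAILED ? NULL : p;
>>>> }
>>>>
>>>> static void mmap_free(void *p, size_t len)
>>>> {
>>>>     if (p != NULL)
>>>>         munmap(p, len);
>>>> }
>>>> -----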
>>>>
>>>> It looks to me like there is an oversight in the OPAL hooks around the
>>>> memory functions, then. Do you agree?
>>>>
>>>> E.
>>>>
>>>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>> I was able to reproduce your issue and I think I understand the problem a
>>>>> bit better at least. This demonstrates exactly what I was pointing to:
>>>>>
>>>>> It looks like things go bad when the test switches over from eager RDMA
>>>>> (I'll explain in a second) to a rendezvous protocol that works entirely
>>>>> in user buffer space.
>>>>>
>>>>> If your input is smaller than some threshold, the eager RDMA limit, then
>>>>> the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
>>>>> buffers called "eager fragments". This pool of resources is preregistered
>>>>> and pinned, and the rkeys have already been exchanged. So, in the eager
>>>>> protocol, your data is copied into these "locked and loaded" RDMA frags and
>>>>> the put/get is handled internally. When the data is received, it's copied
>>>>> back out into your buffer. In your setup, this always works.
>>>>>
>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>> btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>>>> per-node buffer has size 448 bytes
>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>
>>>>> When you exceed the eager threshold, this always fails on the second
>>>>> iteration. To understand this, you need to know that there is a protocol
>>>>> switch, after which your user buffer is used for the transfer. Hence, the
>>>>> user buffer is registered with the HCA. Registration is an inherently
>>>>> high-latency operation, and it is one of the primary motives for doing
>>>>> copy-in/copy-out into preregistered buffers for small, latency-sensitive
>>>>> ops. For bandwidth-bound transfers, the cost to register can be amortized
>>>>> over the whole transfer, but it still affects the total bandwidth. In the
>>>>> case of a rendezvous protocol where the user buffer is registered, there is
>>>>> an optimization, mostly used to help improve the numbers in bandwidth
>>>>> benchmarks, called a registration cache. With registration caching, the
>>>>> user buffer is registered once, the mkey is put into a cache, and the
>>>>> memory is kept pinned until the system provides some notification, via
>>>>> either memory hooks in ptmalloc2 or ummunotify, that the buffer has been
>>>>> freed; this signals that the mkey can be evicted from the cache. On
>>>>> subsequent send/recv operations from the same user buffer address, the
>>>>> OpenIB BTL will find the address in the registration cache, take the
>>>>> cached mkey, and start the data transfer without paying the cost of the
>>>>> memory registration again.
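>>>>>
>>>>> (Not the actual Open MPI code, but schematically a registration cache
>>>>> keyed on virtual address ranges looks something like this, which is why
>>>>> a reused address can hand back a stale mkey:)
>>>>>
>>>>> -----
>>>>> /* hypothetical sketch of the idea, not the real implementation */
>>>>> #include <stddef.h>
>>>>> #include <stdint.h>
>>>>>
>>>>> struct reg_entry {
>>>>>     uintptr_t base;            /* start of the registered range   */
>>>>>     size_t    len;             /* registered length               */
>>>>>     uint32_t  mkey;            /* key handed to the HCA           */
>>>>>     struct reg_entry *next;
>>>>> };
>>>>>
>>>>> /* return a cached registration covering [addr, addr+len), if any */
>>>>> static struct reg_entry *cache_lookup(struct reg_entry *head,
>>>>>                                       void *addr, size_t len)
>>>>> {
>>>>>     uintptr_t a = (uintptr_t) addr;
>>>>>     for (struct reg_entry *e = head; e != NULL; e = e->next)
>>>>>         if (a >= e->base && a + len <= e->base + e->len)
>>>>>             return e;   /* hit -- stale if the mapping changed     */
>>>>>     return NULL;
>>>>> }
>>>>> -----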
>>>>>
>>>>> What I noticed is that when the rendezvous protocol kicks in, it always
>>>>> fails on the second iteration.
>>>>>
>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>> btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>>>> per-node buffer has size 448 bytes
>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> So, I suspected it had something to do with the way the virtual address is
>>>>> being handled in this case. To test that theory, I just completely disabled
>>>>> the registration cache by setting -mca mpi_leave_pinned 0, and things
>>>>> started to work:
>>>>>
>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>> btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
>>>>> ./ibtest -s 56
>>>>> per-node buffer has size 448 bytes
>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>
>>>>> I don't know enough about memory hooks or the registration cache
>>>>> implementation to speak with any authority, but it looks like this is where
>>>>> the issue resides. As a workaround, can you try your original experiment
>>>>> with -mca mpi_leave_pinned 0 and see if you get consistent results?
>>>>>
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi again,
>>>>>>
>>>>>> I've been able to simplify my test case significantly. It now runs
>>>>>> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>>>
>>>>>> The pattern is as follows.
>>>>>>
>>>>>> *  - ranks 0 and 1 both own a local buffer.
>>>>>> *  - each fills it with (deterministically known) data.
>>>>>> *  - rank 0 collects the data from rank 1's local buffer
>>>>>> *    (whose contents should be no mystery), and writes this to a
>>>>>> *    file-backed mmaped area.
>>>>>> *  - rank 0 compares what it receives with what it knows it *should
>>>>>> *    have* received.
>>>>>>
>>>>>> The test fails if:
>>>>>>
>>>>>> *  - the openib btl is used between the 2 nodes,
>>>>>> *  - a file-backed mmaped area is used for receiving the data,
>>>>>> *  - the write is done to a newly created file,
>>>>>> *  - the per-node buffer is large enough.
>>>>>>
>>>>>> For a per-node buffer size above 12 kB (12240 bytes to be exact), my
>>>>>> program fails, since the MPI_Recv does not receive the correct data
>>>>>> chunk (it just gets zeroes).
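>>>>>>
>>>>>> (The core of the pattern looks roughly like this -- a condensed sketch,
>>>>>> not the attached file itself, with all error checking omitted; whether
>>>>>> this condensed form still triggers the bug, I can't promise:)
>>>>>>
>>>>>> -----
>>>>>> /* condensed sketch (not the attached prog6.c): rank 1 sends a
>>>>>>  * deterministic buffer, rank 0 receives it directly into a mapping
>>>>>>  * backed by a newly created file and checks the first word. */
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <fcntl.h>
>>>>>> #include <unistd.h>
>>>>>> #include <sys/mman.h>
>>>>>> #include <mpi.h>
>>>>>>
>>>>>> int main(int argc, char **argv)
>>>>>> {
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     int rank;
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     int n = 4096;                      /* 16 kB, above the threshold */
>>>>>>
>>>>>>     if (rank == 1) {
>>>>>>         int *buf = malloc(n * sizeof(int));
>>>>>>         for (int i = 0; i < n; i++) buf[i] = i + 1;
>>>>>>         MPI_Send(buf, n, MPI_INT, 0, 0, MPI_COMM_WORLD);
>>>>>>         free(buf);
>>>>>>     } else if (rank == 0) {
>>>>>>         int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
>>>>>>         ftruncate(fd, n * sizeof(int));   /* newly created file */
>>>>>>         int *area = mmap(NULL, n * sizeof(int), PROT_READ | PROT_WRITE,
>>>>>>                          MAP_SHARED, fd, 0);
>>>>>>         MPI_Recv(area, n, MPI_INT, 1, 0, MPI_COMM_WORLD,
>>>>>>                  MPI_STATUS_IGNORE);
>>>>>>         printf("lead word received is 0x%08x [%s]\n", area[0],
>>>>>>                area[0] == 1 ? "ok" : "NOK");
>>>>>>         munmap(area, n * sizeof(int));
>>>>>>         close(fd);
>>>>>>     }
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }
>>>>>> -----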
>>>>>>
>>>>>> I attach the simplified test case. I hope someone will be able to
>>>>>> reproduce the problem.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> E.
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>> Thanks for your answer.
>>>>>>>
>>>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>>>> wrote:
>>>>>>>> Just really quick, off the top of my head: mmaping relies on the virtual
>>>>>>>> memory subsystem, whereas IB RDMA operations rely on physical memory
>>>>>>>> being pinned (unswappable).
>>>>>>>
>>>>>>> Yes. Does that mean that the result of computations should be
>>>>>>> undefined if I happen to give a user buffer which corresponds to a
>>>>>>> file? That would be surprising.
>>>>>>>
>>>>>>>> For a large message transfer, the OpenIB BTL will
>>>>>>>> register the user buffer, which will pin the pages and make them
>>>>>>>> unswappable.
>>>>>>>
>>>>>>> Yes. But what are the semantics of pinning the VM area pointed to by
>>>>>>> ptr if ptr happens to be mmaped from a file?
>>>>>>>
>>>>>>>> If the data being transfered is small, you'll copy-in/out to
>>>>>>>> internal bounce buffers and you shouldn't have issues.
>>>>>>>
>>>>>>> Are you saying that the openib layer does have a provision in this case
>>>>>>> for letting the RDMA happen with a pinned physical memory range, and
>>>>>>> later performing the copy to the file-backed mmaped range? That would
>>>>>>> make perfect sense indeed, although I don't have enough familiarity
>>>>>>> with the OMPI code to see where it happens, and more importantly
>>>>>>> whether the completion properly waits for this post-RDMA copy to
>>>>>>> complete.
>>>>>>>
>>>>>>>
>>>>>>>> 1. If you try to just bcast a few kilobytes of data using this
>>>>>>>> technique, do
>>>>>>>> you run into issues?
>>>>>>>
>>>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be
>>>>>>> a little bit more precise about what scenario you imagine? The
>>>>>>> setting "all ranks mmap a local file, and rank 0 broadcasts there" is
>>>>>>> successful.
>>>>>>>
>>>>>>>> 2. How large is the data in the collective (input and output), and is
>>>>>>>> in_place used? I'm guessing it's large enough that the BTL tries to
>>>>>>>> work with the user buffer.
>>>>>>>
>>>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
>>>>>>> Collectives are with communicators of 2 nodes, and we're talking (for
>>>>>>> the smallest failing run) 8 kB per node (i.e. 16 kB total for an
>>>>>>> allgather).
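>>>>>>>
>>>>>>> (Schematically, and with made-up sizes and reduction op -- not the
>>>>>>> actual code -- the two calls look like this:)
>>>>>>>
>>>>>>> -----
>>>>>>> #include <mpi.h>
>>>>>>>
>>>>>>> /* sketch: in-place reduce_scatter followed by in-place allgather on
>>>>>>>  * a 2-rank communicator; buf holds nwords unsigned words in total,
>>>>>>>  * and the reduction op (MPI_BXOR) is made up for the example. */
>>>>>>> static void reduce_then_gather(unsigned *buf, int nwords, MPI_Comm comm)
>>>>>>> {
>>>>>>>     int counts[2] = { nwords / 2, nwords / 2 };
>>>>>>>     MPI_Reduce_scatter(MPI_IN_PLACE, buf, counts, MPI_UNSIGNED,
>>>>>>>                        MPI_BXOR, comm);
>>>>>>>     MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
>>>>>>>                   buf, nwords / 2, MPI_UNSIGNED, comm);
>>>>>>> }
>>>>>>> -----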
>>>>>>>
>>>>>>> E.
>>>>>>>
>>>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé
>>>>>>>> <emmanuel.th...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm stumbling on a problem related to the openib btl in
>>>>>>>>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>>>>>>>>> mmaped areas for receiving data through MPI collective calls.
>>>>>>>>>
>>>>>>>>> A test case is attached. I've tried to make it reasonably small,
>>>>>>>>> although I recognize that it's not extra thin. The test case is a
>>>>>>>>> trimmed down version of what I witness in the context of a rather
>>>>>>>>> large program, so there is no claim of relevance of the test case
>>>>>>>>> itself. It's here just to trigger the desired misbehaviour. The test
>>>>>>>>> case contains some detailed information on what is done, and the
>>>>>>>>> experiments I did.
>>>>>>>>>
>>>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>>>
>>>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and
>>>>>>>>> MPI_Allgather.
>>>>>>>>> - I save the result to a file (collective operation).
>>>>>>>>>
>>>>>>>>> *If* I save the file using something such as:
>>>>>>>>> fd = open("blah", ...
>>>>>>>>> area = mmap(..., fd, )
>>>>>>>>> MPI_Gather(..., area, ...)
>>>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative
>>>>>>>>> implementation (which I believe is correct)
>>>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>>>
>>>>>>>>> then the file which gets saved is inconsistent with what is obtained
>>>>>>>>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>>>>>>>>> before the save).
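>>>>>>>>>
>>>>>>>>> (The open/mmap/MPI_Gather part, spelled out a little more -- still
>>>>>>>>> just a sketch with error checking omitted, not the code of the
>>>>>>>>> attached test case:)
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> #include <fcntl.h>
>>>>>>>>> #include <unistd.h>
>>>>>>>>> #include <sys/mman.h>
>>>>>>>>> #include <mpi.h>
>>>>>>>>>
>>>>>>>>> /* sketch of the save step: every rank contributes `chunk` unsigned
>>>>>>>>>  * words, rank 0 gathers them straight into a mapping backed by a
>>>>>>>>>  * freshly created file. */
>>>>>>>>> static void save_to_file(const char *name, unsigned *mybuf, int chunk,
>>>>>>>>>                          int nranks, int rank, MPI_Comm comm)
>>>>>>>>> {
>>>>>>>>>     unsigned *area = NULL;
>>>>>>>>>     int fd = -1;
>>>>>>>>>     size_t bytes = (size_t) nranks * chunk * sizeof(unsigned);
>>>>>>>>>     if (rank == 0) {
>>>>>>>>>         fd = open(name, O_RDWR | O_CREAT | O_TRUNC, 0600);
>>>>>>>>>         ftruncate(fd, (off_t) bytes);
>>>>>>>>>         area = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
>>>>>>>>>                     MAP_SHARED, fd, 0);
>>>>>>>>>     }
>>>>>>>>>     MPI_Gather(mybuf, chunk, MPI_UNSIGNED,
>>>>>>>>>                area, chunk, MPI_UNSIGNED, 0, comm);
>>>>>>>>>     if (rank == 0) {
>>>>>>>>>         munmap(area, bytes);
>>>>>>>>>         close(fd);
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>> -----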
>>>>>>>>>
>>>>>>>>> I tried to dig a bit in the openib internals, but all I've been able
>>>>>>>>> to witness was beyond my expertise (an RDMA read not transferring the
>>>>>>>>> expected data, but I'm too uncomfortable with this layer to say
>>>>>>>>> anything I'm sure about).
>>>>>>>>>
>>>>>>>>> Tests have been done with several Open MPI versions including 1.8.3, on
>>>>>>>>> a Debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>>>
>>>>>>>>> It would be great if someone could tell me whether they are able to
>>>>>>>>> reproduce the bug, or whether something done in this test case is
>>>>>>>>> illegal in any respect. I'd be glad to provide any further information
>>>>>>>>> that could be of help.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> E. Thomé.
>>>>>>>>>
>>> <prog6.c>
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
