Hi,

I am still affected by the bug I reported in the thread below (a munmap'ed area lingers in the registered memory cache). I'd just like to know whether this is recognized as a defect and whether a fix could be considered, or whether I should instead treat the failure I observe as "normal behavior", or whether there is something odd in the tests I'm running.
The explanation I have is that since the compilation command line created by the mpicc wrapper only contains -lmpi and not -lopen-pal, the functions in OPAL which are supposed to wrap some libc functions (munmap in my case) are not activated: libc comes first in the list of dynamically loaded objects searched for relocations, while libopen-pal only appears at the second level, since it is pulled in indirectly through -lmpi. Does this situation make any sense?

Regards,

E.

On Thu, Nov 13, 2014 at 7:09 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
> Hi,
>
> It turns out that the DT_NEEDED libs for my a.out are:
> Dynamic Section:
>   NEEDED  libmpi.so.1
>   NEEDED  libpthread.so.0
>   NEEDED  libc.so.6
> which is absolutely consistent with the link command line:
> catrel-44 ~ $ mpicc -W -Wall -std=c99 -O0 -g prog6.c -show
> gcc -W -Wall -std=c99 -O0 -g prog6.c -pthread -Wl,-rpath -Wl,/usr/lib
> -Wl,--enable-new-dtags -lmpi
>
> As a consequence, libc wins over libopen-pal, since the latter appears
> deeper in the DSO resolution:
> catrel-44 ~ $ ldd ./a.out
>   linux-vdso.so.1 => (0x00007fffc5811000)
>   libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007fa5fd904000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa5fd6d5000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa5fd349000)
>   libopen-rte.so.7 => /usr/lib/libopen-rte.so.7 (0x00007fa5fd0cd000)
>   libopen-pal.so.6 => /usr/lib/libopen-pal.so.6 (0x00007fa5fcdf9000)
>   libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00007fa5fcbed000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa5fc9e9000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa5fc7e1000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa5fc55e000)
>   libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa5fc35b000)
>   /lib64/ld-linux-x86-64.so.2 (0x00007fa5fdbe0000)
>
> If I explicitly add -lopen-pal to the link command line, or if I pass
> --openmpi:linkall to the mpicc wrapper, then libopen-pal appears
> before libc, and wins the contest for the munmap relocation, which
> makes my test pass successfully.
>
> Is there supposed to be any smarter mechanism for having the
> libopen-pal relocation win, rather than just DSO precedence?
>
> Best regards,
>
> E.
>
> On Wed, Nov 12, 2014 at 7:51 PM, Emmanuel Thomé
> <emmanuel.th...@gmail.com> wrote:
>> Yes, I confirm. Thanks for saying that this is the supposed behaviour.
>>
>> In the binary, the code goes to munmap@plt, which goes to the libc,
>> not to libopen-pal.so.
>>
>> libc is 2.13-38+deb7u1.
>>
>> I'm a total noob at GOT/PLT relocations. What is the mechanism which
>> should make the opal relocation win over the libc one?
>>
>> E.
>>
>> On Wed, Nov 12, 2014 at 7:40 PM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>> FWIW, munmap is *supposed* to be intercepted. Can you confirm that when
>>> your application calls munmap, it doesn't make a call to libopen-pal.so?
>>>
>>> It should be calling this (1-line) function:
>>>
>>> -----
>>> /* intercept munmap, as the user can give back memory that way as well. */
>>> OPAL_DECLSPEC int munmap(void* addr, size_t len)
>>> {
>>>     return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
>>> }
>>> -----
>>>
>>> On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
>>> wrote:
>>>
>>>> As far as I have been able to understand while looking at the code, it
>>>> very much seems that Joshua pointed out the exact cause for the issue.
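A quick way to see which DSO actually wins the munmap relocation, without tracing through the PLT by hand, is a small dladdr()-based probe along the following lines (a minimal sketch, assuming glibc; it has to be linked exactly like the failing program, e.g. with the same mpicc command line, and may need -ldl on older glibc):

-----
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Report which shared object the dynamic linker picks for "munmap"
 * when it is looked up through the global search scope; this should
 * normally match what the PLT entry in the executable binds to. */
int main(void)
{
    void *sym = dlsym(RTLD_DEFAULT, "munmap");
    Dl_info info;

    if (sym != NULL && dladdr(sym, &info) != 0 && info.dli_fname != NULL)
        printf("munmap resolves into %s\n", info.dli_fname);
    else
        printf("could not resolve munmap\n");
    return 0;
}
-----

Running the real binary with LD_DEBUG=bindings (a glibc loader feature) and searching the output for munmap should give the same information without recompiling anything.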
>>>>
>>>> munmap'ing a virtual address space region does not evict it from
>>>> mpool_grdma->pool->lru_list. If a later mmap happens to return the
>>>> same address (a priori tied to a different physical location), the
>>>> userspace believes this segment is already registered, and eventually
>>>> the transfer is directed to a bogus location.
>>>>
>>>> This also seems to match this old discussion:
>>>>
>>>> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>>>>
>>>> Although I didn't read the whole discussion there, it very much seems
>>>> that the proposal for moving the pinning/caching logic to the kernel
>>>> did not make it, unfortunately.
>>>>
>>>> So are we here in the situation where this "munmap should be
>>>> intercepted" logic actually proves too fragile (in that it's not
>>>> intercepted in my case)? The memory MCA in my configuration is:
>>>>   MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>>>>
>>>> I traced a bit what happens at the mmap call; it seems to go straight
>>>> to the libc, not via openmpi first.
>>>>
>>>> For the time being, I think I'll have to consider any mmap()/munmap()
>>>> rather unsafe to play with in an openmpi application.
>>>>
>>>> E.
>>>>
>>>> P.S.: the latest version of the test case is attached.
>>>>
>>>> On Nov 11, 2014, at 7:48 PM, "Emmanuel Thomé" <emmanuel.th...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Thanks a lot for your analysis. This seems consistent with what I can
>>>>> obtain by playing around with my different test cases.
>>>>>
>>>>> It seems that munmap() does *not* unregister the memory chunk from the
>>>>> cache. I suppose this is the reason for the bug.
>>>>>
>>>>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as
>>>>> substitutes for malloc()/free() triggers the same problem.
>>>>>
>>>>> It looks to me like there is an oversight in the OPAL hooks around the
>>>>> memory functions, then. Do you agree?
>>>>>
>>>>> E.
>>>>>
>>>>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>> I was able to reproduce your issue and I think I understand the problem a
>>>>>> bit better, at least. This demonstrates exactly what I was pointing to:
>>>>>>
>>>>>> It looks like when the test switches over from eager RDMA (I'll explain
>>>>>> in a second) to a rendezvous protocol working entirely in user buffer
>>>>>> space, things go bad.
>>>>>>
>>>>>> If your input is smaller than some threshold, the eager RDMA limit, then
>>>>>> the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
>>>>>> buffers called "eager fragments". This pool of resources is
>>>>>> preregistered, pinned, and has had its rkeys exchanged. So, in the eager
>>>>>> protocol, your data is copied into these "locked and loaded" RDMA frags
>>>>>> and the put/get is handled internally. When the data is received, it's
>>>>>> copied back out into your buffer. In your setup, this always works.
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>>
>>>>>> When you exceed the eager threshold, this always fails on the second
>>>>>> iteration. To understand this, you need to understand that there is a
>>>>>> protocol switch where now your user buffer is used for the transfer.
>>>>>> Hence, the user buffer is registered with the HCA. This is an inherently
>>>>>> high-latency operation, and is one of the primary motives for doing
>>>>>> copy-in/copy-out into preregistered buffers for small, latency-sensitive
>>>>>> ops. For bandwidth-bound transfers, the cost to register can be amortized
>>>>>> over the whole transfer, but it still affects the total bandwidth. In the
>>>>>> case of a rendezvous protocol where the user buffer is registered, there
>>>>>> is an optimization, mostly used to help improve the numbers in a
>>>>>> bandwidth benchmark, called a registration cache. With registration
>>>>>> caching, the user buffer is registered once, the mkey is put into a
>>>>>> cache, and the memory is kept pinned until the system provides some
>>>>>> notification, via either memory hooks in p2p malloc or ummunotify, that
>>>>>> the buffer has been freed; this signals that the mkey can be evicted
>>>>>> from the cache. On subsequent send/recv operations from the same user
>>>>>> buffer address, the OpenIB BTL will find the address in the registration
>>>>>> cache, take the cached mkey, avoid paying the cost of the memory
>>>>>> registration, and start the data transfer.
>>>>>>
>>>>>> What I noticed is that when the rendezvous protocol kicks in, it always
>>>>>> fails on the second iteration.
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> So, I suspected it has something to do with the way the virtual address
>>>>>> is being handled in this case.
>>>>>> To test that theory, I just completely disabled the registration
>>>>>> cache by setting -mca mpi_leave_pinned 0, and things start to work:
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
>>>>>> ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>>
>>>>>> I don't know enough about memory hooks or the registration cache
>>>>>> implementation to speak with any authority, but it looks like this is
>>>>>> where the issue resides. As a workaround, can you try your original
>>>>>> experiment with -mca mpi_leave_pinned 0 and see if you get consistent
>>>>>> results?
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé
>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi again,
>>>>>>>
>>>>>>> I've been able to simplify my test case significantly. It now runs
>>>>>>> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>>>>
>>>>>>> The pattern is as follows.
>>>>>>>
>>>>>>> * - ranks 0 and 1 both own a local buffer.
>>>>>>> * - each fills it with (deterministically known) data.
>>>>>>> * - rank 0 collects the data from rank 1's local buffer
>>>>>>> *   (whose contents should be no mystery), and writes this to a
>>>>>>> *   file-backed mmaped area.
>>>>>>> * - rank 0 compares what it receives with what it knows it *should
>>>>>>> *   have* received.
>>>>>>>
>>>>>>> The test fails if:
>>>>>>>
>>>>>>> * - the openib btl is used between the 2 nodes,
>>>>>>> * - a file-backed mmaped area is used for receiving the data,
>>>>>>> * - the write is done to a newly created file,
>>>>>>> * - the per-node buffer is large enough.
>>>>>>>
>>>>>>> For a per-node buffer size above 12 kB (12240 bytes to be exact), my
>>>>>>> program fails, since the MPI_Recv does not receive the correct data
>>>>>>> chunk (it just gets zeroes).
>>>>>>>
>>>>>>> I attach the simplified test case. I hope someone will be able to
>>>>>>> reproduce the problem.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> E.
>>>>>>>
>>>>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
>>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>> Thanks for your answer.
>>>>>>>>
>>>>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Just really quick off the top of my head, mmaping relies on the
>>>>>>>>> virtual memory subsystem, whereas IB RDMA operations rely on
>>>>>>>>> physical memory being pinned (unswappable).
>>>>>>>>
>>>>>>>> Yes. Does that mean that the result of computations should be
>>>>>>>> undefined if I happen to give a user buffer which corresponds to a
>>>>>>>> file? That would be surprising.
>>>>>>>>
>>>>>>>>> For a large message transfer, the OpenIB BTL will
>>>>>>>>> register the user buffer, which will pin the pages and make them
>>>>>>>>> unswappable.
>>>>>>>>
>>>>>>>> Yes. But what are the semantics of pinning the VM area pointed to by
>>>>>>>> ptr if ptr happens to be mmaped from a file?
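For concreteness, the mmap()/munmap() reuse hazard that the registration cache discussion above revolves around can be written down as a small self-contained pattern. This is only an illustrative sketch, not the attached test case; buffer size, iteration count and the send/recv pairing are arbitrary:

-----
#include <stdio.h>
#include <sys/mman.h>
#include <mpi.h>

/* Illustration of the suspected hazard: a buffer large enough for the
 * rendezvous path is mmap'ed, used for a transfer (hence registered and
 * cached), munmap'ed, and then mmap'ed again.  If the new mapping comes
 * back at the same virtual address and the munmap was not seen by the
 * memory hooks, a stale cache entry can be reused for the next transfer. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t len = 1 << 20;       /* arbitrary, well above the eager limit */

    for (int iter = 0; iter < 4; iter++) {
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (buf == MAP_FAILED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        if (rank == 1) {
            for (size_t i = 0; i < len; i++)
                buf[i] = (char) (iter + 1);
            MPI_Send(buf, (int) len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(buf, (int) len, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("iteration %d, lead byte is %d\n", iter, buf[0]);
        }

        /* This munmap is what the OPAL hooks are supposed to notice, so
         * that the registration for buf gets evicted from the cache. */
        munmap(buf, len);
    }

    MPI_Finalize();
    return 0;
}
-----

When run with 2 ranks across two nodes over the openib BTL with a buffer above the eager limit, the later iterations are the ones that can hit a stale cache entry if the munmap is not intercepted.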
>>>>>>>>
>>>>>>>>> If the data being transferred is small, you'll copy-in/out to
>>>>>>>>> internal bounce buffers and you shouldn't have issues.
>>>>>>>>
>>>>>>>> Are you saying that the openib layer does have provision in this case
>>>>>>>> for letting the RDMA happen with a pinned physical memory range, and
>>>>>>>> later performing the copy to the file-backed mmaped range? That would
>>>>>>>> make perfect sense indeed, although I don't have enough familiarity
>>>>>>>> with the OMPI code to see where it happens, and more importantly
>>>>>>>> whether the completion properly waits for this post-RDMA copy to
>>>>>>>> complete.
>>>>>>>>
>>>>>>>>> 1. If you try to just bcast a few kilobytes of data using this
>>>>>>>>> technique, do you run into issues?
>>>>>>>>
>>>>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be
>>>>>>>> a little bit more precise about what scenario you imagine? The
>>>>>>>> setting "all ranks mmap a local file, and rank 0 broadcasts there" is
>>>>>>>> successful.
>>>>>>>>
>>>>>>>>> 2. How large is the data in the collective (input and output); is
>>>>>>>>> in_place used? I'm guessing it's large enough that the BTL tries to
>>>>>>>>> work with the user buffer.
>>>>>>>>
>>>>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
>>>>>>>> Collectives are with communicators of 2 nodes, and we're talking (for
>>>>>>>> the smallest failing run) 8 kB per node (i.e. 16 kB total for an
>>>>>>>> allgather).
>>>>>>>>
>>>>>>>> E.
>>>>>>>>
>>>>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé
>>>>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm stumbling on a problem related to the openib btl in
>>>>>>>>>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>>>>>>>>>> mmaped areas for receiving data through MPI collective calls.
>>>>>>>>>>
>>>>>>>>>> A test case is attached. I've tried to make it reasonably small,
>>>>>>>>>> although I recognize that it's not extra thin. The test case is a
>>>>>>>>>> trimmed-down version of what I witness in the context of a rather
>>>>>>>>>> large program, so there is no claim of relevance of the test case
>>>>>>>>>> itself. It's here just to trigger the desired misbehaviour. The test
>>>>>>>>>> case contains some detailed information on what is done, and the
>>>>>>>>>> experiments I did.
>>>>>>>>>>
>>>>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>>>>
>>>>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and
>>>>>>>>>>   MPI_Allgather.
>>>>>>>>>> - I save the result to a file (collective operation).
>>>>>>>>>>
>>>>>>>>>> *If* I save the file using something such as:
>>>>>>>>>>     fd = open("blah", ...
>>>>>>>>>>     area = mmap(..., fd, )
>>>>>>>>>>     MPI_Gather(..., area, ...)
>>>>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative
>>>>>>>>>> implementation (which I believe is correct)
>>>>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>>>>
>>>>>>>>>> then the file which gets saved is inconsistent with what is obtained
>>>>>>>>>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>>>>>>>>>> before the save).
>>>>>>>>>>
>>>>>>>>>> I tried to dig a bit in the openib internals, but all I've been able
>>>>>>>>>> to witness was beyond my expertise (an RDMA read not transferring the
>>>>>>>>>> expected data, but I'm too uncomfortable with this layer to say
>>>>>>>>>> anything I'm sure about).
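Spelled out, the save pattern sketched just above (create a new file, mmap it, and let rank 0 gather straight into the mapping) looks roughly like the following sketch. The file name and sizes are placeholders, and error handling is kept minimal:

-----
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

/* Illustration of the reported pattern: rank 0 creates a new file, mmaps
 * it, and the collective delivers every rank's contribution directly into
 * the file-backed mapping. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 16384;          /* per-rank contribution, in bytes */
    char *area = NULL;
    int fd = -1;

    if (rank == 0) {
        fd = open("result.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, (off_t) chunk * nprocs) < 0)
            MPI_Abort(MPI_COMM_WORLD, 1);
        area = mmap(NULL, (size_t) chunk * nprocs, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        if (area == MAP_FAILED)
            MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *local = malloc(chunk);
    for (int i = 0; i < chunk; i++)
        local[i] = (char) (rank + 1);

    /* The receive buffer of the collective is the file-backed mapping. */
    MPI_Gather(local, chunk, MPI_BYTE, area, chunk, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        msync(area, (size_t) chunk * nprocs, MS_SYNC);
        munmap(area, (size_t) chunk * nprocs);
        close(fd);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
-----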
>>>>>>>>>>
>>>>>>>>>> Tests have been done with several openmpi versions including 1.8.3,
>>>>>>>>>> on a debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>>>>
>>>>>>>>>> It would be great if someone could tell me if he is able to reproduce
>>>>>>>>>> the bug, or tell me whether something which is done in this test case
>>>>>>>>>> is illegal in any respect. I'd be glad to provide further information
>>>>>>>>>> which could be of any help.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> E. Thomé.
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post:
>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25730.php
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25732.php
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25740.php
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25743.php
>>>>
>>>> <prog6.c>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2014/11/25775.php
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/11/25779.php
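As a closing footnote on the interception mechanism itself: the ordering effect that libopen-pal's munmap wrapper depends on can be reproduced with a hand-rolled interposer, which is also a convenient way to check whether an application's munmap calls are observable at all. This is only a sketch, not Open MPI code:

-----
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>

/* Generic munmap interposer (not Open MPI code).  Built as a shared
 * object and loaded with LD_PRELOAD, it is searched before libc, which
 * is the same ordering effect that lets libopen-pal's munmap wrapper
 * win when it precedes libc in the DSO search order. */
int munmap(void *addr, size_t len)
{
    static int (*real_munmap)(void *, size_t);

    if (real_munmap == NULL)
        real_munmap = (int (*)(void *, size_t)) dlsym(RTLD_NEXT, "munmap");

    /* A real hook would notify the registration cache here; this one
     * only reports that the call was observed. */
    fprintf(stderr, "intercepted munmap(%p, %zu)\n", addr, len);
    return real_munmap(addr, len);
}
-----

Built with something like gcc -shared -fPIC -o munmap_hook.so munmap_hook.c -ldl and loaded via LD_PRELOAD (exported to the remote ranks with mpirun -x LD_PRELOAD=... if needed), it prints a line for every munmap the application performs.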