Hi,

I am still affected by the bug I reported in the thread below (a munmap'ed area lingers in the registered memory cache). I'd just like to know whether this is recognized as a defect and whether a fix could be considered, or whether I should instead treat the failure I observe as "normal behavior", or whether there is something odd in the tests I'm running.
The explanation I have is that since the compilation command line created by the mpicc wrapper only contains -lmpi and not -lopen-pal, the functions in OPAL which are supposed to wrap some libc functions (munmap in my case) are not activated: libc comes first in the list of dynamically loaded objects searched for relocations, while libopen-pal only appears at the second level, since it is pulled in indirectly through -lmpi. Does this situation make any sense?

Regards,

E.

On Thu, Nov 13, 2014 at 7:09 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
> Hi,
>
> It turns out that the DT_NEEDED libs for my a.out are:
> Dynamic Section:
>   NEEDED  libmpi.so.1
>   NEEDED  libpthread.so.0
>   NEEDED  libc.so.6
> which is absolutely consistent with the link command line:
> catrel-44 ~ $ mpicc -W -Wall -std=c99 -O0 -g prog6.c -show
> gcc -W -Wall -std=c99 -O0 -g prog6.c -pthread -Wl,-rpath -Wl,/usr/lib
> -Wl,--enable-new-dtags -lmpi
>
> As a consequence, libc wins over libopen-pal, since the latter appears
> deeper in the DSO resolution:
> catrel-44 ~ $ ldd ./a.out
>   linux-vdso.so.1 => (0x00007fffc5811000)
>   libmpi.so.1 => /usr/lib/libmpi.so.1 (0x00007fa5fd904000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa5fd6d5000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa5fd349000)
>   libopen-rte.so.7 => /usr/lib/libopen-rte.so.7 (0x00007fa5fd0cd000)
>   libopen-pal.so.6 => /usr/lib/libopen-pal.so.6 (0x00007fa5fcdf9000)
>   libnuma.so.1 => /usr/lib/libnuma.so.1 (0x00007fa5fcbed000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa5fc9e9000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa5fc7e1000)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa5fc55e000)
>   libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fa5fc35b000)
>   /lib64/ld-linux-x86-64.so.2 (0x00007fa5fdbe0000)
>
> If I explicitly add -lopen-pal to the link command line, or if I pass
> --openmpi:linkall to the mpicc wrapper, then libopen-pal appears
> before libc, and wins the contest for the munmap relocation, which
> makes my test pass successfully.
>
> Is there supposed to be any smarter mechanism for having the
> libopen-pal relocation win, rather than just DSO precedence?
>
> Best regards,
>
> E.
>
> On Wed, Nov 12, 2014 at 7:51 PM, Emmanuel Thomé
> <emmanuel.th...@gmail.com> wrote:
>> Yes, I confirm. Thanks for saying that this is the supposed behaviour.
>>
>> In the binary, the code goes to munmap@plt, which goes to the libc,
>> not to libopen-pal.so.
>>
>> libc is 2.13-38+deb7u1.
>>
>> I'm a total noob at GOT/PLT relocations. What is the mechanism which
>> should make the opal relocation win over the libc one?
>>
>> E.
>>
>> On Wed, Nov 12, 2014 at 7:40 PM, Jeff Squyres (jsquyres)
>> <jsquy...@cisco.com> wrote:
>>> FWIW, munmap is *supposed* to be intercepted. Can you confirm that when
>>> your application calls munmap, it doesn't make a call to libopen-pal.so?
>>>
>>> It should be calling this (1-line) function:
>>>
>>> -----
>>> /* intercept munmap, as the user can give back memory that way as well. */
>>> OPAL_DECLSPEC int munmap(void* addr, size_t len)
>>> {
>>>     return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
>>> }
>>> -----
>>>
>>> On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com>
>>> wrote:
>>>
>>>> As far as I have been able to understand while looking at the code, it
>>>> very much seems that Joshua pointed out the exact cause for the issue.
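A quick way to see which DSO actually wins the munmap relocation, without tracing through the PLT by hand, is a small dladdr()-based probe along the following lines (a minimal sketch, assuming glibc; it has to be linked exactly like the failing program, e.g. with the same mpicc command line, and may need -ldl on older glibc):

-----
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Report which shared object the dynamic linker picks for "munmap"
 * when it is looked up through the global search scope; this should
 * normally match what the PLT entry in the executable binds to. */
int main(void)
{
    void *sym = dlsym(RTLD_DEFAULT, "munmap");
    Dl_info info;

    if (sym != NULL && dladdr(sym, &info) != 0 && info.dli_fname != NULL)
        printf("munmap resolves into %s\n", info.dli_fname);
    else
        printf("could not resolve munmap\n");
    return 0;
}
-----

Running the real binary with LD_DEBUG=bindings (a glibc loader feature) and searching the output for munmap should give the same information without recompiling anything.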
>>>>
>>>> munmap'ing a virtual address space region does not evict it from
>>>> mpool_grdma->pool->lru_list. If a later mmap happens to return the
>>>> same address (a priori tied to a different physical location), the
>>>> userspace believes this segment is already registered, and eventually
>>>> the transfer is directed to a bogus location.
>>>>
>>>> This also seems to match this old discussion:
>>>>
>>>> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>>>>
>>>> Although I didn't read the whole discussion there, it very much seems
>>>> that the proposal for moving the pinning/caching logic to the kernel
>>>> did not make it, unfortunately.
>>>>
>>>> So are we here in the situation where this "munmap should be
>>>> intercepted" logic actually proves too fragile (in that it's not
>>>> intercepted in my case)? The memory MCA in my configuration is:
>>>>   MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>>>>
>>>> I traced a bit what happens at the mmap call; it seems to go straight
>>>> to the libc, not via openmpi first.
>>>>
>>>> For the time being, I think I'll have to consider any mmap()/munmap()
>>>> rather unsafe to play with in an openmpi application.
>>>>
>>>> E.
>>>>
>>>> P.S.: the latest version of the test case is attached.
>>>>
>>>> On Nov 11, 2014, at 7:48 PM, "Emmanuel Thomé" <emmanuel.th...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Thanks a lot for your analysis. This seems consistent with what I can
>>>>> obtain by playing around with my different test cases.
>>>>>
>>>>> It seems that munmap() does *not* unregister the memory chunk from the
>>>>> cache. I suppose this is the reason for the bug.
>>>>>
>>>>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as
>>>>> substitutes for malloc()/free() triggers the same problem.
>>>>>
>>>>> It looks to me like there is an oversight in the OPAL hooks around the
>>>>> memory functions, then. Do you agree?
>>>>>
>>>>> E.
>>>>>
>>>>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>> I was able to reproduce your issue and I think I understand the problem a
>>>>>> bit better, at least. This demonstrates exactly what I was pointing to:
>>>>>>
>>>>>> It looks like when the test switches over from eager RDMA (I'll explain
>>>>>> in a second) to a rendezvous protocol working entirely in user buffer
>>>>>> space, things go bad.
>>>>>>
>>>>>> If your input is smaller than some threshold, the eager RDMA limit, then
>>>>>> the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
>>>>>> buffers called "eager fragments". This pool of resources is
>>>>>> preregistered, pinned, and has had its rkeys exchanged. So, in the eager
>>>>>> protocol, your data is copied into these "locked and loaded" RDMA frags
>>>>>> and the put/get is handled internally. When the data is received, it's
>>>>>> copied back out into your buffer. In your setup, this always works.
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>>
>>>>>> When you exceed the eager threshold, this always fails on the second
>>>>>> iteration. To understand this, you need to understand that there is a
>>>>>> protocol switch where now your user buffer is used for the transfer.
>>>>>> Hence, the user buffer is registered with the HCA. This is an inherently
>>>>>> high-latency operation, and is one of the primary motives for doing
>>>>>> copy-in/copy-out into preregistered buffers for small, latency-sensitive
>>>>>> ops. For bandwidth-bound transfers, the cost to register can be amortized
>>>>>> over the whole transfer, but it still affects the total bandwidth. In the
>>>>>> case of a rendezvous protocol where the user buffer is registered, there
>>>>>> is an optimization, mostly used to help improve the numbers in a
>>>>>> bandwidth benchmark, called a registration cache. With registration
>>>>>> caching, the user buffer is registered once, the mkey is put into a
>>>>>> cache, and the memory is kept pinned until the system provides some
>>>>>> notification, via either memory hooks in p2p malloc or ummunotify, that
>>>>>> the buffer has been freed; this signals that the mkey can be evicted
>>>>>> from the cache. On subsequent send/recv operations from the same user
>>>>>> buffer address, the OpenIB BTL will find the address in the registration
>>>>>> cache, take the cached mkey, avoid paying the cost of the memory
>>>>>> registration, and start the data transfer.
>>>>>>
>>>>>> What I noticed is that when the rendezvous protocol kicks in, it always
>>>>>> fails on the second iteration.
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> So, I suspected it has something to do with the way the virtual address
>>>>>> is being handled in this case.
>>>>>> To test that theory, I just completely disabled the registration
>>>>>> cache by setting -mca mpi_leave_pinned 0, and things start to work:
>>>>>>
>>>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
>>>>>> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
>>>>>> btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
>>>>>> ./ibtest -s 56
>>>>>> per-node buffer has size 448 bytes
>>>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>>>
>>>>>> I don't know enough about memory hooks or the registration cache
>>>>>> implementation to speak with any authority, but it looks like this is
>>>>>> where the issue resides. As a workaround, can you try your original
>>>>>> experiment with -mca mpi_leave_pinned 0 and see if you get consistent
>>>>>> results?
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé
>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi again,
>>>>>>>
>>>>>>> I've been able to simplify my test case significantly. It now runs
>>>>>>> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>>>>
>>>>>>> The pattern is as follows.
>>>>>>>
>>>>>>> * - ranks 0 and 1 both own a local buffer.
>>>>>>> * - each fills it with (deterministically known) data.
>>>>>>> * - rank 0 collects the data from rank 1's local buffer
>>>>>>> *   (whose contents should be no mystery), and writes this to a
>>>>>>> *   file-backed mmaped area.
>>>>>>> * - rank 0 compares what it receives with what it knows it *should
>>>>>>> *   have* received.
>>>>>>>
>>>>>>> The test fails if:
>>>>>>>
>>>>>>> * - the openib btl is used between the 2 nodes,
>>>>>>> * - a file-backed mmaped area is used for receiving the data,
>>>>>>> * - the write is done to a newly created file,
>>>>>>> * - the per-node buffer is large enough.
>>>>>>>
>>>>>>> For a per-node buffer size above 12 kB (12240 bytes to be exact), my
>>>>>>> program fails, since the MPI_Recv does not receive the correct data
>>>>>>> chunk (it just gets zeroes).
>>>>>>>
>>>>>>> I attach the simplified test case. I hope someone will be able to
>>>>>>> reproduce the problem.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> E.
>>>>>>>
>>>>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé
>>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>> Thanks for your answer.
>>>>>>>>
>>>>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Just really quick off the top of my head, mmaping relies on the
>>>>>>>>> virtual memory subsystem, whereas IB RDMA operations rely on
>>>>>>>>> physical memory being pinned (unswappable).
>>>>>>>>
>>>>>>>> Yes. Does that mean that the result of computations should be
>>>>>>>> undefined if I happen to give a user buffer which corresponds to a
>>>>>>>> file? That would be surprising.
>>>>>>>>
>>>>>>>>> For a large message transfer, the OpenIB BTL will
>>>>>>>>> register the user buffer, which will pin the pages and make them
>>>>>>>>> unswappable.
>>>>>>>>
>>>>>>>> Yes. But what are the semantics of pinning the VM area pointed to by
>>>>>>>> ptr if ptr happens to be mmaped from a file?
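For concreteness, the mmap()/munmap() reuse hazard that the registration cache discussion above revolves around can be written down as a small self-contained pattern. This is only an illustrative sketch, not the attached test case; buffer size, iteration count and the send/recv pairing are arbitrary:

-----
#include <stdio.h>
#include <sys/mman.h>
#include <mpi.h>

/* Illustration of the suspected hazard: a buffer large enough for the
 * rendezvous path is mmap'ed, used for a transfer (hence registered and
 * cached), munmap'ed, and then mmap'ed again.  If the new mapping comes
 * back at the same virtual address and the munmap was not seen by the
 * memory hooks, a stale cache entry can be reused for the next transfer. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t len = 1 << 20;       /* arbitrary, well above the eager limit */

    for (int iter = 0; iter < 4; iter++) {
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (buf == MAP_FAILED)
            MPI_Abort(MPI_COMM_WORLD, 1);

        if (rank == 1) {
            for (size_t i = 0; i < len; i++)
                buf[i] = (char) (iter + 1);
            MPI_Send(buf, (int) len, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(buf, (int) len, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("iteration %d, lead byte is %d\n", iter, buf[0]);
        }

        /* This munmap is what the OPAL hooks are supposed to notice, so
         * that the registration for buf gets evicted from the cache. */
        munmap(buf, len);
    }

    MPI_Finalize();
    return 0;
}
-----

When run with 2 ranks across two nodes over the openib BTL with a buffer above the eager limit, the later iterations are the ones that can hit a stale cache entry if the munmap is not intercepted.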
>>>>>>>>
>>>>>>>>> If the data being transferred is small, you'll copy-in/out to
>>>>>>>>> internal bounce buffers and you shouldn't have issues.
>>>>>>>>
>>>>>>>> Are you saying that the openib layer does have provision in this case
>>>>>>>> for letting the RDMA happen with a pinned physical memory range, and
>>>>>>>> later performing the copy to the file-backed mmaped range? That would
>>>>>>>> make perfect sense indeed, although I don't have enough familiarity
>>>>>>>> with the OMPI code to see where it happens, and more importantly
>>>>>>>> whether the completion properly waits for this post-RDMA copy to
>>>>>>>> complete.
>>>>>>>>
>>>>>>>>> 1. If you try to just bcast a few kilobytes of data using this
>>>>>>>>> technique, do you run into issues?
>>>>>>>>
>>>>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be
>>>>>>>> a little bit more precise about what scenario you imagine? The
>>>>>>>> setting "all ranks mmap a local file, and rank 0 broadcasts there" is
>>>>>>>> successful.
>>>>>>>>
>>>>>>>>> 2. How large is the data in the collective (input and output); is
>>>>>>>>> in_place used? I'm guessing it's large enough that the BTL tries to
>>>>>>>>> work with the user buffer.
>>>>>>>>
>>>>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
>>>>>>>> Collectives are with communicators of 2 nodes, and we're talking (for
>>>>>>>> the smallest failing run) 8 kB per node (i.e. 16 kB total for an
>>>>>>>> allgather).
>>>>>>>>
>>>>>>>> E.
>>>>>>>>
>>>>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé
>>>>>>>>> <emmanuel.th...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm stumbling on a problem related to the openib btl in
>>>>>>>>>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>>>>>>>>>> mmaped areas for receiving data through MPI collective calls.
>>>>>>>>>>
>>>>>>>>>> A test case is attached. I've tried to make it reasonably small,
>>>>>>>>>> although I recognize that it's not extra thin. The test case is a
>>>>>>>>>> trimmed-down version of what I witness in the context of a rather
>>>>>>>>>> large program, so there is no claim of relevance of the test case
>>>>>>>>>> itself. It's here just to trigger the desired misbehaviour. The test
>>>>>>>>>> case contains some detailed information on what is done, and the
>>>>>>>>>> experiments I did.
>>>>>>>>>>
>>>>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>>>>
>>>>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and
>>>>>>>>>>   MPI_Allgather.
>>>>>>>>>> - I save the result to a file (collective operation).
>>>>>>>>>>
>>>>>>>>>> *If* I save the file using something such as:
>>>>>>>>>>     fd = open("blah", ...
>>>>>>>>>>     area = mmap(..., fd, )
>>>>>>>>>>     MPI_Gather(..., area, ...)
>>>>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative
>>>>>>>>>> implementation (which I believe is correct)
>>>>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>>>>
>>>>>>>>>> then the file which gets saved is inconsistent with what is obtained
>>>>>>>>>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>>>>>>>>>> before the save).
>>>>>>>>>>
>>>>>>>>>> I tried to dig a bit in the openib internals, but all I've been able
>>>>>>>>>> to witness was beyond my expertise (an RDMA read not transferring the
>>>>>>>>>> expected data, but I'm too uncomfortable with this layer to say
>>>>>>>>>> anything I'm sure about).
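Spelled out, the save pattern sketched just above (create a new file, mmap it, and let rank 0 gather straight into the mapping) looks roughly like the following sketch. The file name and sizes are placeholders, and error handling is kept minimal:

-----
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

/* Illustration of the reported pattern: rank 0 creates a new file, mmaps
 * it, and the collective delivers every rank's contribution directly into
 * the file-backed mapping. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 16384;          /* per-rank contribution, in bytes */
    char *area = NULL;
    int fd = -1;

    if (rank == 0) {
        fd = open("result.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, (off_t) chunk * nprocs) < 0)
            MPI_Abort(MPI_COMM_WORLD, 1);
        area = mmap(NULL, (size_t) chunk * nprocs, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        if (area == MAP_FAILED)
            MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *local = malloc(chunk);
    for (int i = 0; i < chunk; i++)
        local[i] = (char) (rank + 1);

    /* The receive buffer of the collective is the file-backed mapping. */
    MPI_Gather(local, chunk, MPI_BYTE, area, chunk, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        msync(area, (size_t) chunk * nprocs, MS_SYNC);
        munmap(area, (size_t) chunk * nprocs);
        close(fd);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
-----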
>>>>>>>>>>
>>>>>>>>>> Tests have been done with several openmpi versions including 1.8.3,
>>>>>>>>>> on a debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>>>>
>>>>>>>>>> It would be great if someone could tell me if he is able to reproduce
>>>>>>>>>> the bug, or tell me whether something which is done in this test case
>>>>>>>>>> is illegal in any respect. I'd be glad to provide further information
>>>>>>>>>> which could be of any help.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>>
>>>>>>>>>> E. Thomé.
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>>> Link to this post:
>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25730.php
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> users mailing list
>>>>>>>>> us...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25732.php
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25740.php
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> Link to this post:
>>>>>> http://www.open-mpi.org/community/lists/users/2014/11/25743.php
>>>>
>>>> <prog6.c>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2014/11/25775.php
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2014/11/25779.php
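As a closing footnote on the interception mechanism itself: the ordering effect that libopen-pal's munmap wrapper depends on can be reproduced with a hand-rolled interposer, which is also a convenient way to check whether an application's munmap calls are observable at all. This is only a sketch, not Open MPI code:

-----
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/mman.h>

/* Generic munmap interposer (not Open MPI code).  Built as a shared
 * object and loaded with LD_PRELOAD, it is searched before libc, which
 * is the same ordering effect that lets libopen-pal's munmap wrapper
 * win when it precedes libc in the DSO search order. */
int munmap(void *addr, size_t len)
{
    static int (*real_munmap)(void *, size_t);

    if (real_munmap == NULL)
        real_munmap = (int (*)(void *, size_t)) dlsym(RTLD_NEXT, "munmap");

    /* A real hook would notify the registration cache here; this one
     * only reports that the call was observed. */
    fprintf(stderr, "intercepted munmap(%p, %zu)\n", addr, len);
    return real_munmap(addr, len);
}
-----

Built with something like gcc -shared -fPIC -o munmap_hook.so munmap_hook.c -ldl and loaded via LD_PRELOAD (exported to the remote ranks with mpirun -x LD_PRELOAD=... if needed), it prints a line for every munmap the application performs.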