Yes, I confirm. Thanks for clarifying that interception is the intended behaviour. In the binary, the call goes through munmap@plt, which resolves to libc's munmap, not to the one in libopen-pal.so.
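For reference, a minimal way to check which shared object actually wins the interposition is to ask the dynamic linker directly. The little program below is not part of the original test case and only assumes glibc's dlfcn interface; it prints the object that provides the munmap definition found first in the global search order. Run inside (or linked like) the MPI application, it should name libopen-pal.so when the hook is in effect, and libc otherwise.

-----
/* which_munmap.c: report which shared object provides the munmap
 * definition that symbol interposition selects.
 * Build (illustrative): mpicc which_munmap.c -ldl -o which_munmap */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* RTLD_DEFAULT searches the global scope in the same order the
     * PLT relocation for munmap is resolved in. */
    void *sym = dlsym(RTLD_DEFAULT, "munmap");
    Dl_info info;

    if (sym != NULL && dladdr(sym, &info) != 0 && info.dli_fname != NULL)
        printf("munmap resolves to %s\n", info.dli_fname);
    else
        printf("could not resolve munmap\n");
    return 0;
}
-----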
libc is 2.13-38+deb7u1. I'm a total noob at GOT/PLT relocations. What is the mechanism that should make the opal relocation win over the libc one?

E.

On Wed, Nov 12, 2014 at 7:40 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> FWIW, munmap is *supposed* to be intercepted. Can you confirm that when your application calls munmap, it doesn't make a call to libopen-pal.so?
>
> It should be calling this (1-line) function:
>
> -----
> /* intercept munmap, as the user can give back memory that way as well. */
> OPAL_DECLSPEC int munmap(void* addr, size_t len)
> {
>     return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
> }
> -----
>
>
> On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>
>> As far as I have been able to understand while looking at the code, it very much seems that Joshua pointed out the exact cause of the issue.
>>
>> munmap'ing a virtual address space region does not evict it from mpool_grdma->pool->lru_list. If a later mmap happens to return the same address (a priori tied to a different physical location), userspace believes this segment is already registered, and eventually the transfer is directed to a bogus location.
>>
>> This also seems to match this old discussion:
>>
>> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>>
>> Although I didn't read the whole discussion there, it very much seems that the proposal for moving the pinning/caching logic to the kernel did not make it, unfortunately.
>>
>> So are we here in the situation where this "munmap should be intercepted" logic actually proves too fragile? (In that it's not intercepted in my case.) The memory MCA in my configuration is:
>>     MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>>
>> I traced a bit what happens at the mmap call; it seems to go straight to the libc, not via Open MPI first.
>>
>> For the time being, I think I'll have to consider any mmap()/munmap() rather unsafe to play with in an Open MPI application.
>>
>> E.
>>
>> P.S.: a last version of the test case is attached.
>>
>> On 11 Nov 2014 at 19:48, "Emmanuel Thomé" <emmanuel.th...@gmail.com> wrote:
>>>
>>> Thanks a lot for your analysis. This seems consistent with what I can obtain by playing around with my different test cases.
>>>
>>> It seems that munmap() does *not* unregister the memory chunk from the cache. I suppose this is the reason for the bug.
>>>
>>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as substitutes for malloc()/free() triggers the same problem.
>>>
>>> It looks to me that there is an oversight in the OPAL hooks around the memory functions, then. Do you agree?
>>>
>>> E.
>>>
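To make the address aliasing concrete, here is a minimal stand-alone sketch (it is not the attached test case, and only assumes ordinary Linux mmap behaviour) of the mmap/munmap substitute for malloc/free described above. On a typical run the second mmap hands back exactly the address that was just unmapped, which is the situation a registration cache keyed on virtual addresses cannot detect unless munmap is intercepted.

-----
/* mmap_reuse.c: use mmap/munmap as a malloc/free substitute and show
 * that the freed virtual address range is commonly handed out again
 * by the very next mmap of the same size.
 * Build (illustrative): gcc mmap_reuse.c -o mmap_reuse */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (1 << 20)                   /* a 1 MiB "allocation" */

int main(void)
{
    void *a = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (a == MAP_FAILED)
        return 1;
    memset(a, 0x41, LEN);               /* pretend this buffer was sent (and pinned) */
    munmap(a, LEN);                     /* "free": no eviction happens if libc's munmap runs */

    void *b = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (b == MAP_FAILED)
        return 1;
    printf("first buffer %p, second buffer %p (%s)\n", a, b,
           a == b ? "same virtual address reused" : "different address");
    munmap(b, LEN);
    return 0;
}
-----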
>>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>
>>>> I was able to reproduce your issue and I think I understand the problem a bit better, at least. This demonstrates exactly what I was pointing to:
>>>>
>>>> It looks like when the test switches over from eager RDMA (I'll explain in a second) to a rendezvous protocol working entirely in user buffer space, things go bad.
>>>>
>>>> If your input is smaller than some threshold, the eager RDMA limit, then the contents of your user buffer are copied into OMPI/OpenIB BTL scratch buffers called "eager fragments". This pool of resources is preregistered and pinned, and its rkeys have already been exchanged. So, in the eager protocol, your data is copied into these "locked and loaded" RDMA frags and the put/get is handled internally. When the data is received, it's copied back out into your buffer. In your setup, this always works.
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>
>>>> When you exceed the eager threshold, this always fails on the second iteration. To understand this, you need to know that there is a protocol switch after which your user buffer is used for the transfer. Hence, the user buffer is registered with the HCA. This is an inherently high-latency operation and is one of the primary motives for doing copy-in/copy-out into preregistered buffers for small, latency-sensitive ops. For bandwidth-bound transfers, the cost to register can be amortized over the whole transfer, but it still affects the total bandwidth. In the case of a rendezvous protocol where the user buffer is registered, there is an optimization, mostly used to help improve the numbers in a bandwidth benchmark, called a registration cache. With registration caching, the user buffer is registered once, the mkey is put into a cache, and the memory is kept pinned until the system provides some notification, via either the memory hooks in ptmalloc2 or ummunotify, that the buffer has been freed; this signals that the mkey can be evicted from the cache. On subsequent send/recv operations from the same user buffer address, the OpenIB BTL will find the address in the registration cache, take the cached mkey, avoid paying the cost of the memory registration, and start the data transfer.
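The mechanism Josh describes can be pictured with a deliberately simplified, address-keyed cache. The names below (reg_entry, reg_cache_find, reg_cache_evict, mkey) are invented for the sketch and are not Open MPI's actual mpool/grdma code; the point is only that the hot path trusts the virtual address, so a munmap that bypasses the hook leaves a stale entry behind for any later buffer that happens to land at the same address.

-----
/* Illustrative, simplified registration cache keyed on a virtual
 * address range.  All names are made up for this sketch; they are not
 * the real Open MPI data structures. */
#include <stddef.h>
#include <stdint.h>

struct reg_entry {
    char    *addr;                /* start of the registered user range */
    size_t   len;
    uint32_t mkey;                /* HCA memory key from registration   */
    struct reg_entry *next;
};

static struct reg_entry *cache;   /* an LRU list in the real code */

/* Hot path of a large send/recv: reuse a cached registration if the
 * virtual address range is already known, skipping registration. */
struct reg_entry *reg_cache_find(char *addr, size_t len)
{
    for (struct reg_entry *e = cache; e != NULL; e = e->next)
        if (addr >= e->addr && addr + len <= e->addr + e->len)
            return e;             /* cache hit: the cached mkey is used */
    return NULL;
}

/* What the munmap hook is supposed to guarantee: evict (and
 * deregister) any entry overlapping the unmapped range.  If libc's
 * munmap runs instead of the hook, this never happens, and a later
 * mmap() that returns the same virtual address inherits a stale mkey
 * that still refers to the old physical pages. */
void reg_cache_evict(char *addr, size_t len)
{
    for (struct reg_entry **pe = &cache; *pe != NULL; ) {
        struct reg_entry *e = *pe;
        if (e->addr < addr + len && addr < e->addr + e->len)
            *pe = e->next;        /* unlink; real code would also deregister */
        else
            pe = &e->next;
    }
}
-----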
>>>> What I noticed is that when the rendezvous protocol kicks in, it always fails on the second iteration.
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>>> --------------------------------------------------------------------------
>>>>
>>>> So, I suspected it has something to do with the way the virtual address is being handled in this case. To test that theory, I completely disabled the registration cache by setting -mca mpi_leave_pinned 0, and things start to work:
>>>>
>>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self ./ibtest -s 56
>>>> per-node buffer has size 448 bytes
>>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>>
>>>> I don't know enough about memory hooks or the registration cache implementation to speak with any authority, but it looks like this is where the issue resides. As a workaround, can you try your original experiment with -mca mpi_leave_pinned 0 and see if you get consistent results?
>>>>
>>>> Josh
>>>>
>>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>
>>>>> Hi again,
>>>>>
>>>>> I've been able to simplify my test case significantly. It now runs with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>>
>>>>> The pattern is as follows.
>>>>>
>>>>> * - ranks 0 and 1 both own a local buffer.
>>>>> * - each fills it with (deterministically known) data.
>>>>> * - rank 0 collects the data from rank 1's local buffer
>>>>> *   (whose contents should be no mystery), and writes this to a
>>>>> *   file-backed mmaped area.
>>>>> * - rank 0 compares what it receives with what it knows it *should
>>>>> *   have* received.
>>>>>
>>>>> The test fails if:
>>>>>
>>>>> * - the openib btl is used between the 2 nodes
>>>>> * - a file-backed mmaped area is used for receiving the data.
>>>>> * - the write is done to a newly created file.
>>>>> * - the per-node buffer is large enough.
>>>>>
>>>>> For a per-node buffer size above 12kb (12240 bytes, to be exact), my program fails, since the MPI_Recv does not receive the correct data chunk (it just gets zeroes).
>>>>>
>>>>> I attach the simplified test case. I hope someone will be able to reproduce the problem.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> E.
>>>>>
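A compressed illustration of that pattern is sketched below. This is not the attached test case (the file name, sizes, and the 0x42 fill value are arbitrary choices for the sketch, and error checks are omitted for brevity); it only reproduces the shape of the exchange: rank 1 sends a plain buffer, rank 0 receives straight into a file-backed mmaped area of a newly created file and checks what arrived.

-----
/* send_into_mmap.c: sketch of the failing pattern, to be run with
 * exactly 2 ranks over the openib BTL.
 * Build (illustrative): mpicc send_into_mmap.c -o send_into_mmap */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE (16 * 1024)          /* above the ~12 kB threshold reported */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* rank 1: a local buffer filled with deterministically known data */
        unsigned char *buf = malloc(SIZE);
        memset(buf, 0x42, SIZE);
        MPI_Send(buf, SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        free(buf);
    } else if (rank == 0) {
        /* rank 0: receive directly into a file-backed mmaped area of a
         * newly created file */
        int fd = open("out.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, SIZE);
        unsigned char *area = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        MPI_Recv(area, SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("lead byte received from peer is 0x%02x [%s]\n",
               area[0], area[0] == 0x42 ? "ok" : "NOK");
        munmap(area, SIZE);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}
-----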
>>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>> Thanks for your answer.
>>>>>>
>>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>>> Just really quick, off the top of my head: mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable).
>>>>>>
>>>>>> Yes. Does that mean that the result of computations should be undefined if I happen to give a user buffer which corresponds to a file? That would be surprising.
>>>>>>
>>>>>>> For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.
>>>>>>
>>>>>> Yes. But what are the semantics of pinning the VM area pointed to by ptr if ptr happens to be mmaped from a file?
>>>>>>
>>>>>>> If the data being transferred is small, you'll copy-in/out to internal bounce buffers and you shouldn't have issues.
>>>>>>
>>>>>> Are you saying that the openib layer does have provision in this case for letting the RDMA happen with a pinned physical memory range, and later performing the copy to the file-backed mmaped range? That would make perfect sense indeed, although I don't have enough familiarity with the OMPI code to see where it happens, and more importantly whether the completion properly waits for this post-RDMA copy to complete.
>>>>>>
>>>>>>> 1. If you try to just bcast a few kilobytes of data using this technique, do you run into issues?
>>>>>>
>>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be a little bit more precise about what scenario you imagine? The setting "all ranks mmap a local file, and rank 0 broadcasts there" is successful.
>>>>>>
>>>>>>> 2. How large is the data in the collective (input and output), and is in_place used? I'm guessing it's large enough that the BTL tries to work with the user buffer.
>>>>>>
>>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code. Collectives are with communicators of 2 nodes, and we're talking (for the smallest failing run) 8kb per node (i.e. 16kb total for an allgather).
>>>>>>
>>>>>> E.
>>>>>>
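For readers less familiar with the in-place variants mentioned here, a minimal sketch of the collective shape being discussed (2 ranks, 8 kB per node, MPI_IN_PLACE; the fill values are arbitrary): each rank's own 8 kB block already sits at offset rank * 8192 of the 16 kB result buffer, and the allgather fills in the peer's block.

-----
/* inplace_allgather.c: sketch of an in-place MPI_Allgather with the
 * sizes mentioned above (2 ranks, 8 kB contribution per rank).
 * Build (illustrative): mpicc inplace_allgather.c -o inplace_allgather */
#include <mpi.h>
#include <string.h>

#define PER_NODE 8192

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unsigned char result[2 * PER_NODE];
    /* fill only my own block; the collective provides the other one */
    memset(result + rank * PER_NODE, 0x10 * (rank + 1), PER_NODE);

    /* With MPI_IN_PLACE the send arguments are ignored and each rank's
     * contribution is taken from its slot of the receive buffer. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  result, PER_NODE, MPI_BYTE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
-----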
>>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm stumbling on a problem related to the openib btl in openmpi-1.[78].*, and the (I think legitimate) use of file-backed mmaped areas for receiving data through MPI collective calls.
>>>>>>>>
>>>>>>>> A test case is attached. I've tried to make it reasonably small, although I recognize that it's not extra thin. The test case is a trimmed-down version of what I witness in the context of a rather large program, so there is no claim of relevance of the test case itself. It's here just to trigger the desired misbehaviour. The test case contains some detailed information on what is done, and the experiments I did.
>>>>>>>>
>>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>>
>>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
>>>>>>>> - I save the result to a file (collective operation).
>>>>>>>>
>>>>>>>> *If* I save the file using something such as:
>>>>>>>>     fd = open("blah", ...
>>>>>>>>     area = mmap(..., fd, )
>>>>>>>>     MPI_Gather(..., area, ...)
>>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative implementation (which I believe is correct)
>>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>>
>>>>>>>> then the file which gets saved is inconsistent with what is obtained with the normal MPI_Reduce_scatter (although memory areas do coincide before the save).
>>>>>>>>
>>>>>>>> I tried to dig a bit in the openib internals, but all I've been able to witness was beyond my expertise (an RDMA read not transferring the expected data, but I'm too uncomfortable with this layer to say anything I'm sure about).
>>>>>>>>
>>>>>>>> Tests have been done with several openmpi versions including 1.8.3, on a Debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>>
>>>>>>>> It would be great if someone could tell me whether they are able to reproduce the bug, or whether something which is done in this test case is illegal in any respect. I'd be glad to provide further information which could be of any help.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> E. Thomé.
>>
>> <prog6.c>