FWIW, munmap is *supposed* to be intercepted. Can you confirm that when your application calls munmap, it doesn't make a call to libopen-pal.so?
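(A hedged way to check, sketched here for illustration and not taken from the original mail: ask the dynamic linker which shared object provides the munmap that the process resolves. The sketch assumes glibc and linking with -ldl; when the OPAL intercept is active, the answer should be libopen-pal.so rather than libc.)

-----
/* Illustration only: report which shared object provides the munmap
 * symbol this process resolves.  Assumes glibc; link with -ldl. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    Dl_info info;
    void *sym = dlsym(RTLD_DEFAULT, "munmap");   /* default lookup order */
    if (sym != NULL && dladdr(sym, &info) != 0 && info.dli_fname != NULL)
        printf("munmap resolves into %s\n", info.dli_fname);
    else
        printf("could not resolve munmap\n");

    MPI_Finalize();
    return 0;
}
-----

Compiled with mpicc and -ldl and launched the same way as the application, it prints the object the call binds to; a breakpoint on munmap under gdb answers the same question.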
It should be calling this (1-line) function:

-----
/* intercept munmap, as the user can give back memory that way as well. */
OPAL_DECLSPEC int munmap(void* addr, size_t len)
{
    return opal_memory_linux_free_ptmalloc2_munmap(addr, len, 0);
}
-----

On Nov 12, 2014, at 11:08 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:

> As far as I have been able to understand while looking at the code, it very much seems that Joshua pointed out the exact cause of the issue.
>
> munmap'ing a virtual address space region does not evict it from mpool_grdma->pool->lru_list. If a later mmap happens to return the same address (a priori tied to a different physical location), user space believes this segment is already registered, and eventually the transfer is directed to a bogus location.
>
> This also seems to match this old discussion:
>
> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>
> Although I didn't read the whole discussion there, it very much seems that the proposal for moving the pinning/caching logic to the kernel did not make it, unfortunately.
>
> So are we in the situation where this "munmap should be intercepted" logic actually proves too fragile (in that it's not intercepted in my case)? The memory MCA in my configuration is:
>
>     MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>
> I traced a bit what happens at the mmap call; it seems to go straight to the libc, not via Open MPI first.
>
> For the time being, I think I'll have to consider any mmap()/munmap() rather unsafe to play with in an Open MPI application.
>
> E.
>
> P.S.: a last version of the test case is attached.
>
> On Nov 11, 2014, at 19:48, "Emmanuel Thomé" <emmanuel.th...@gmail.com> wrote:
>>
>> Thanks a lot for your analysis. This seems consistent with what I can obtain by playing around with my different test cases.
>>
>> It seems that munmap() does *not* unregister the memory chunk from the cache. I suppose this is the reason for the bug.
>>
>> In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as substitutes for malloc()/free() triggers the same problem.
>>
>> It looks to me as though there is an oversight in the OPAL hooks around the memory functions, then. Do you agree?
>>
>> E.
>>
>> On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>> I was able to reproduce your issue, and I think I understand the problem a bit better at least. This demonstrates exactly what I was pointing to:
>>>
>>> It looks like things go bad when the test switches over from eager RDMA (I'll explain in a second) to a rendezvous protocol working entirely in user buffer space.
>>>
>>> If your input is smaller than some threshold (the eager RDMA limit), then the contents of your user buffer are copied into OMPI/OpenIB BTL scratch buffers called "eager fragments". This pool of resources is preregistered and pinned, and the rkeys have already been exchanged. So, in the eager protocol, your data is copied into these "locked and loaded" RDMA frags and the put/get is handled internally. When the data is received, it's copied back out into your buffer. In your setup, this always works.
>>>
>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
>>> per-node buffer has size 448 bytes
>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>
>>> When you exceed the eager threshold, this always fails on the second iteration. To understand this, you need to understand that there is a protocol switch where now your user buffer is used for the transfer. Hence, the user buffer is registered with the HCA. This is an inherently high-latency operation and is one of the primary motives for doing copy-in/copy-out into preregistered buffers for small, latency-sensitive ops. For bandwidth-bound transfers, the cost to register can be amortized over the whole transfer, but it still affects the total bandwidth. In the case of a rendezvous protocol where the user buffer is registered, there is an optimization, mostly used to help improve the numbers in a bandwidth benchmark, called a registration cache. With registration caching, the user buffer is registered once, the mkey is put into a cache, and the memory is kept pinned until the system provides some notification (via either the memory hooks in ptmalloc2 or ummunotify) that the buffer has been freed; this signals that the mkey can be evicted from the cache. On subsequent send/recv operations from the same user buffer address, the OpenIB BTL will find the address in the registration cache, take the cached mkey, avoid paying the cost of the memory registration, and start the data transfer.
>>>
>>> What I noticed is that when the rendezvous protocol kicks in, it always fails on the second iteration.
>>>
>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
>>> per-node buffer has size 448 bytes
>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>>> --------------------------------------------------------------------------
>>>
>>> So, I suspected it has something to do with the way the virtual address is being handled in this case. To test that theory, I completely disabled the registration cache by setting -mca mpi_leave_pinned 0, and things start to work:
>>>
>>> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self ./ibtest -s 56
>>> per-node buffer has size 448 bytes
>>> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
>>> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
>>> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
>>> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>>>
>>> I don't know enough about the memory hooks or the registration cache implementation to speak with any authority, but it looks like this is where the issue resides.
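(A side note, not part of the thread: the reason an address-keyed registration cache is fragile here is that munmap followed by a fresh mmap very often hands back the same virtual address, now backed by different pages. A minimal stand-alone sketch:)

-----
/* Illustration only: mmap/munmap/mmap frequently reuses the same virtual
 * address.  If the registration cache keyed on that address is never told
 * about the munmap, a later transfer from the "new" mapping can hit the
 * stale pinned registration. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = 1 << 20;   /* 1 MiB */

    void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    munmap(a, len);               /* the cache should evict [a, a+len) here */

    void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    printf("first mapping at %p, second at %p%s\n", a, b,
           a == b ? " (same virtual address)" : "");

    munmap(b, len);
    return 0;
}
-----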
>>> As a workaround, can you try your original experiment with -mca mpi_leave_pinned 0 and see if you get consistent results?
>>>
>>> Josh
>>>
>>> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>
>>>> Hi again,
>>>>
>>>> I've been able to simplify my test case significantly. It now runs with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>>>>
>>>> The pattern is as follows.
>>>>
>>>>  * - ranks 0 and 1 both own a local buffer.
>>>>  * - each fills it with (deterministically known) data.
>>>>  * - rank 0 collects the data from rank 1's local buffer
>>>>  *   (whose contents should be no mystery), and writes this to a
>>>>  *   file-backed mmaped area.
>>>>  * - rank 0 compares what it receives with what it knows it *should
>>>>  *   have* received.
>>>>
>>>> The test fails if:
>>>>
>>>>  * - the openib btl is used among the 2 nodes.
>>>>  * - a file-backed mmaped area is used for receiving the data.
>>>>  * - the write is done to a newly created file.
>>>>  * - the per-node buffer is large enough.
>>>>
>>>> For a per-node buffer size above 12 kB (12240 bytes to be exact), my program fails, since the MPI_Recv does not receive the correct data chunk (it just gets zeroes).
>>>>
>>>> I attach the simplified test case. I hope someone will be able to reproduce the problem.
>>>>
>>>> Best regards,
>>>>
>>>> E.
>>>>
>>>> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>> Thanks for your answer.
>>>>>
>>>>> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>> Just really quick off the top of my head, mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable).
>>>>>
>>>>> Yes. Does that mean that the result of computations should be undefined if I happen to give a user buffer which corresponds to a file? That would be surprising.
>>>>>
>>>>>> For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.
>>>>>
>>>>> Yes. But what are the semantics of pinning the VM area pointed to by ptr if ptr happens to be mmaped from a file?
>>>>>
>>>>>> If the data being transferred is small, you'll copy-in/out to internal bounce buffers and you shouldn't have issues.
>>>>>
>>>>> Are you saying that the openib layer does have provision in this case for letting the RDMA happen with a pinned physical memory range, and later performing the copy to the file-backed mmaped range? That would make perfect sense indeed, although I don't have enough familiarity with the OMPI code to see where it happens, and more importantly whether the completion properly waits for this post-RDMA copy to complete.
>>>>>
>>>>>> 1. If you try to just bcast a few kilobytes of data using this technique, do you run into issues?
>>>>>
>>>>> No. All "simpler" attempts were successful, unfortunately. Can you be a little bit more precise about what scenario you imagine? The setting "all ranks mmap a local file, and rank 0 broadcasts there" is successful.
>>>>>
>>>>>> 2. How large is the data in the collective (input and output), and is in_place used? I'm guessing it's large enough that the BTL tries to work with the user buffer.
>>>>>
>>>>> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
>>>>> Collectives are with communicators of 2 nodes, and we're talking (for the smallest failing run) 8 kB per node (i.e. 16 kB total for an allgather).
>>>>>
>>>>> E.
>>>>>
>>>>>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm stumbling on a problem related to the openib btl in openmpi-1.[78].*, and the (I think legitimate) use of file-backed mmaped areas for receiving data through MPI collective calls.
>>>>>>>
>>>>>>> A test case is attached. I've tried to make it reasonably small, although I recognize that it's not extra thin. The test case is a trimmed-down version of what I witness in the context of a rather large program, so there is no claim of relevance of the test case itself. It's here just to trigger the desired misbehaviour. The test case contains some detailed information on what is done, and the experiments I did.
>>>>>>>
>>>>>>> In a nutshell, the problem is as follows.
>>>>>>>
>>>>>>> - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
>>>>>>> - I save the result to a file (collective operation).
>>>>>>>
>>>>>>> *If* I save the file using something such as:
>>>>>>>     fd = open("blah", ...
>>>>>>>     area = mmap(..., fd, )
>>>>>>>     MPI_Gather(..., area, ...)
>>>>>>> *AND* the MPI_Reduce_scatter is done with an alternative implementation (which I believe is correct),
>>>>>>> *AND* communication is done through the openib btl,
>>>>>>>
>>>>>>> then the file which gets saved is inconsistent with what is obtained with the normal MPI_Reduce_scatter (although memory areas do coincide before the save).
>>>>>>>
>>>>>>> I tried to dig a bit in the openib internals, but all I've been able to witness was beyond my expertise (an RDMA read not transferring the expected data, but I'm too uncomfortable with this layer to say anything I'm sure about).
>>>>>>>
>>>>>>> Tests have been done with several Open MPI versions including 1.8.3, on a Debian wheezy (7.5) + OFED 2.3 cluster.
>>>>>>>
>>>>>>> It would be great if someone could tell me whether they are able to reproduce the bug, or whether something done in this test case is illegal in any respect. I'd be glad to provide further information which could be of any help.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> E. Thomé.
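(A minimal sketch of the failing pattern described above, for illustration only: it is not the attached test case, and the file name, message size and tag are made up. Rank 1 sends from an ordinary buffer; rank 0 receives directly into a freshly created, file-backed mmap'ed region.)

-----
/* Illustration only, not the attached prog6.c.  With the openib btl and
 * the registration cache enabled, the data received into the file-backed
 * mapping may be wrong once the transfer is large enough to leave the
 * eager path. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int len = 1 << 16;                 /* well above the eager limit */

    if (rank == 1) {
        char *buf = malloc(len);
        memset(buf, 0x42, len);              /* deterministic contents */
        MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        free(buf);
    } else if (rank == 0) {
        int fd = open("blah", O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, len);                  /* newly created file, as in the failing case */
        char *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        MPI_Recv(area, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("lead byte received: 0x%02x (expected 0x42)\n",
               (unsigned char) area[0]);
        munmap(area, len);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}
-----

Running this on two nodes over the openib btl, with and without -mca mpi_leave_pinned 0, separates the registration-cache effect from everything else.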
> <prog6.c>

--
Jeff Squyres
jsquy...@cisco.com