You could just disable leave pinned: -mca mpi_leave_pinned 0 -mca mpi_leave_pinned_pipeline 0
This will fix the issue but may reduce performance. Not sure why the munmap wrapper is failing to execute, but this will get you running.

-Nathan Hjelm
HPC-5, LANL

On Wed, Nov 12, 2014 at 05:08:06PM +0100, Emmanuel Thomé wrote:
> As far as I have been able to understand while looking at the code, it very much seems that Joshua pointed out the exact cause of the issue.
>
> munmap'ing a virtual address space region does not evict it from mpool_grdma->pool->lru_list. If a later mmap happens to return the same address (a priori tied to a different physical location), userspace believes this segment is already registered, and eventually the transfer is directed to a bogus location.
>
> This also seems to match this old discussion:
>
> http://lists.openfabrics.org/pipermail/general/2009-April/058786.html
>
> Although I didn't read the whole discussion there, it very much seems that the proposal for moving the pinning/caching logic to the kernel did not make it, unfortunately.
>
> So are we here in the situation where this "munmap should be intercepted" logic actually proves too fragile? (in that it's not intercepted in my case). The memory MCA in my configuration is:
> MCA memory: linux (MCA v2.0, API v2.0, Component v1.8.3)
>
> I traced a bit what happens at the mmap call; it seems to go straight to the libc, not via openmpi first.
>
> For the time being, I think I'll have to consider any mmap()/munmap() rather unsafe to play with in an openmpi application.
>
> E.
>
> P.S.: the latest version of the test case is attached.
>
> On 11 Nov 2014, at 19:48, "Emmanuel Thomé" <emmanuel.th...@gmail.com> wrote:
> >
> > Thanks a lot for your analysis. This seems consistent with what I can obtain by playing around with my different test cases.
> >
> > It seems that munmap() does *not* unregister the memory chunk from the cache. I suppose this is the reason for the bug.
> >
> > In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as substitutes for malloc()/free() triggers the same problem.
> >
> > It looks to me that there is an oversight in the OPAL hooks around the memory functions, then. Do you agree?
> >
> > E.
> >
> > On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> > > I was able to reproduce your issue and I think I understand the problem a bit better at least. This demonstrates exactly what I was pointing to:
> > >
> > > It looks like when the test switches over from eager RDMA (I'll explain in a second) to a rendezvous protocol working entirely in user buffer space, things go bad.
> > >
> > > If your input is smaller than some threshold, the eager RDMA limit, then the contents of your user buffer are copied into OMPI/OpenIB BTL scratch buffers called "eager fragments". This pool of resources is preregistered and pinned, and its rkeys have been exchanged. So, in the eager protocol, your data is copied into these "locked and loaded" RDMA frags and the put/get is handled internally. When the data is received, it's copied back out into your buffer. In your setup, this always works.
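What Josh describes here, small messages being staged through preregistered eager fragments rather than pinning the user buffer, can be pictured with a deliberately simplified sketch. Everything below (the send_message function, the EAGER_LIMIT constant, the plain memcpy and printf) is invented for illustration and is not the actual OpenIB BTL code path:

#include <stdio.h>
#include <string.h>

#define EAGER_LIMIT 4096                 /* stand-in for btl_openib_eager_limit; value is arbitrary */

static char eager_frag[EAGER_LIMIT];     /* think: preregistered, pinned, rkey already exchanged */

static void send_message(const void *user_buf, size_t len)
{
    if (len <= EAGER_LIMIT) {
        /* Eager path: stage through the already-registered fragment; the
         * user buffer itself is never pinned, so how it was allocated
         * (malloc, anonymous mmap, file-backed mmap) does not matter. */
        memcpy(eager_frag, user_buf, len);
        printf("eager send, %zu bytes copied through a preregistered fragment\n", len);
    } else {
        /* Rendezvous path: the user buffer itself gets registered with the
         * HCA, and with leave_pinned that registration is cached by address. */
        printf("rendezvous send, %zu bytes directly from the user buffer\n", len);
    }
}

int main(void)
{
    char small[256] = {0};
    char large[16384] = {0};
    send_message(small, sizeof small);   /* below the limit: copy-in/copy-out */
    send_message(large, sizeof large);   /* above the limit: user buffer registered */
    return 0;
}

The runs Josh shows next exercise exactly these two paths by moving the eager limit around the message size.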
> > >
> > > $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
> > > per-node buffer has size 448 bytes
> > > node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> > > node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> > > node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> > > node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
> > >
> > > When you exceed the eager threshold, this always fails on the second iteration. To understand this, you need to understand that there is a protocol switch where now your user buffer is used for the transfer. Hence, the user buffer is registered with the HCA. This is an inherently high-latency operation and is one of the primary motives for doing copy-in/copy-out into preregistered buffers for small, latency-sensitive ops. For bandwidth-bound transfers, the cost to register can be amortized over the whole transfer, but it still affects the total bandwidth. In the case of a rendezvous protocol where the user buffer is registered, there is an optimization, mostly used to help improve the numbers in a bandwidth benchmark, called a registration cache. With registration caching, the user buffer is registered once, the mkey is put into a cache, and the memory is kept pinned until the system provides some notification (via either memory hooks in p2p malloc, or ummunotify) that the buffer has been freed; this signals that the mkey can be evicted from the cache. On subsequent send/recv operations from the same user buffer address, the OpenIB BTL will find the address in the registration cache, take the cached mkey, avoid paying the cost of the memory registration, and start the data transfer.
> > >
> > > What I noticed is that when the rendezvous protocol kicks in, it always fails on the second iteration.
> > >
> > > $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
> > > per-node buffer has size 448 bytes
> > > node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> > > node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
> > > --------------------------------------------------------------------------
> > >
> > > So, I suspected it has something to do with the way the virtual address is being handled in this case.
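The stale-lookup hazard described above, an address-keyed cache that is never told about munmap, can be modelled with a toy sketch. All names below (reg_entry, register_or_reuse, the fake mkey values) are invented for illustration; none of this is the actual mpool_grdma/rcache code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct reg_entry {
    uintptr_t base;      /* virtual address the buffer was registered at */
    size_t    len;
    unsigned  mkey;      /* key returned by the (simulated) registration */
    int       valid;
};

static struct reg_entry cache[16];
static unsigned next_mkey = 0x100;

/* Return a cached mkey for (base, len) if present, otherwise "register". */
static unsigned register_or_reuse(uintptr_t base, size_t len)
{
    for (size_t i = 0; i < sizeof cache / sizeof cache[0]; i++)
        if (cache[i].valid && cache[i].base == base && cache[i].len == len)
            return cache[i].mkey;                    /* cache hit: registration skipped */
    for (size_t i = 0; i < sizeof cache / sizeof cache[0]; i++)
        if (!cache[i].valid) {
            cache[i] = (struct reg_entry){ base, len, next_mkey++, 1 };
            return cache[i].mkey;                    /* cache miss: register and remember */
        }
    return 0;                                        /* cache full; ignored in this sketch */
}

int main(void)
{
    uintptr_t buf = 0x7f0000000000UL;                /* pretend mmap() returned this address */

    printf("first transfer:  mkey 0x%x\n", register_or_reuse(buf, 1 << 16));

    /* Imagine munmap(buf) followed by a new mmap() that happens to return
     * the same address.  Nothing evicted the entry, so the lookup below
     * still hits and hands back a key describing the *old* pages. */
    printf("second transfer: mkey 0x%x (stale)\n", register_or_reuse(buf, 1 << 16));
    return 0;
}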
> > > To test that theory, I just completely disabled the registration cache by setting -mca mpi_leave_pinned 0 and things start to work:
> > >
> > > $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self ./ibtest -s 56
> > > per-node buffer has size 448 bytes
> > > node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> > > node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> > > node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> > > node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
> > >
> > > I don't know enough about memory hooks or the registration cache implementation to speak with any authority, but it looks like this is where the issue resides. As a workaround, can you try your original experiment with -mca mpi_leave_pinned 0 and see if you get consistent results?
> > >
> > > Josh
> > >
> > > On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
> > >>
> > >> Hi again,
> > >>
> > >> I've been able to simplify my test case significantly. It now runs with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
> > >>
> > >> The pattern is as follows.
> > >>
> > >> * - ranks 0 and 1 both own a local buffer.
> > >> * - each fills it with (deterministically known) data.
> > >> * - rank 0 collects the data from rank 1's local buffer (whose contents should be no mystery), and writes this to a file-backed mmaped area.
> > >> * - rank 0 compares what it receives with what it knows it *should have* received.
> > >>
> > >> The test fails if:
> > >>
> > >> * - the openib btl is used among the 2 nodes
> > >> * - a file-backed mmaped area is used for receiving the data.
> > >> * - the write is done to a newly created file.
> > >> * - the per-node buffer is large enough.
> > >>
> > >> For a per-node buffer size above 12kb (12240 bytes to be exact), my program fails, since the MPI_Recv does not receive the correct data chunk (it just gets zeroes).
> > >>
> > >> I attach the simplified test case. I hope someone will be able to reproduce the problem.
> > >>
> > >> Best regards,
> > >>
> > >> E.
> > >>
> > >> On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
> > >> > Thanks for your answer.
> > >> >
> > >> > On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> > >> >> Just really quick off the top of my head, mmaping relies on the virtual memory subsystem, whereas IB RDMA operations rely on physical memory being pinned (unswappable).
> > >> >
> > >> > Yes. Does that mean that the result of computations should be undefined if I happen to give a user buffer which corresponds to a file? That would be surprising.
> > >> >
> > >> >> For a large message transfer, the OpenIB BTL will register the user buffer, which will pin the pages and make them unswappable.
> > >> >
> > >> > Yes. But what are the semantics of pinning the VM area pointed to by ptr if ptr happens to be mmaped from a file?
> > >> >
> > >> >> If the data being transferred is small, you'll copy-in/out to internal bounce buffers and you shouldn't have issues.
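The precondition for the stale-cache scenario discussed in this thread, namely munmap followed by a new mmap handing back the same virtual address, is easy to observe in isolation. The standalone program below involves no MPI; the mapping size and loop count are arbitrary choices, and whether the same address comes back is of course not guaranteed:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    const size_t len = 16 * 4096;        /* arbitrary size, a few pages */
    void *prev = NULL;

    for (int i = 0; i < 4; i++) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }
        printf("iteration %d: mmap returned %p%s\n", i, p,
               p == prev ? "  <- same address as the region just unmapped" : "");
        prev = p;
        munmap(p, len);                  /* the kernel is free to hand this range out again */
    }
    return 0;
}

When the address does repeat, any cache keyed purely on the virtual address, like the registration cache above, can serve an entry for what is now different physical memory.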
> > >> >
> > >> > Are you saying that the openib layer does make provision in this case for letting the RDMA happen with a pinned physical memory range, and for later performing the copy to the file-backed mmaped range? That would make perfect sense indeed, although I don't have enough familiarity with the OMPI code to see where it happens, and more importantly whether the completion properly waits for this post-RDMA copy to complete.
> > >> >
> > >> >> 1. If you try to just bcast a few kilobytes of data using this technique, do you run into issues?
> > >> >
> > >> > No. All "simpler" attempts were successful, unfortunately. Can you be a little bit more precise about what scenario you imagine? The setting "all ranks mmap a local file, and rank 0 broadcasts there" is successful.
> > >> >
> > >> >> 2. How large is the data in the collective (input and output), is in_place used? I'm guessing it's large enough that the BTL tries to work with the user buffer.
> > >> >
> > >> > MPI_IN_PLACE is used in reduce_scatter and allgather in the code. Collectives are with communicators of 2 nodes, and we're talking (for the smallest failing run) 8kb per node (i.e. 16kb total for an allgather).
> > >> >
> > >> > E.
> > >> >
> > >> >> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé <emmanuel.th...@gmail.com> wrote:
> > >> >>>
> > >> >>> Hi,
> > >> >>>
> > >> >>> I'm stumbling on a problem related to the openib btl in openmpi-1.[78].*, and the (I think legitimate) use of file-backed mmaped areas for receiving data through MPI collective calls.
> > >> >>>
> > >> >>> A test case is attached. I've tried to make it reasonably small, although I recognize that it's not extra thin. The test case is a trimmed-down version of what I witness in the context of a rather large program, so there is no claim of relevance of the test case itself. It's here just to trigger the desired misbehaviour. The test case contains some detailed information on what is done, and the experiments I did.
> > >> >>>
> > >> >>> In a nutshell, the problem is as follows.
> > >> >>>
> > >> >>> - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
> > >> >>> - I save the result to a file (collective operation).
> > >> >>>
> > >> >>> *If* I save the file using something such as:
> > >> >>> fd = open("blah", ...
> > >> >>> area = mmap(..., fd, )
> > >> >>> MPI_Gather(..., area, ...)
> > >> >>> *AND* the MPI_Reduce_scatter is done with an alternative implementation (which I believe is correct)
> > >> >>> *AND* communication is done through the openib btl,
> > >> >>>
> > >> >>> then the file which gets saved is inconsistent with what is obtained with the normal MPI_Reduce_scatter (although memory areas do coincide before the save).
> > >> >>>
> > >> >>> I tried to dig a bit in the openib internals, but all I've been able to witness was beyond my expertise (an RDMA read not transferring the expected data, but I'm too uncomfortable with this layer to say anything I'm sure about).
> > >> >>>
> > >> >>> Tests have been done with several openmpi versions including 1.8.3, on a debian wheezy (7.5) + OFED 2.3 cluster.
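For reference, a fleshed-out version of the save pattern sketched a few paragraphs above might look as follows. The file name, per-rank size, error handling and use of MPI_BYTE are illustrative assumptions, not taken from the original program; per the rest of this thread, this pattern can misbehave over the openib btl unless mpi_leave_pinned is disabled:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 8192;                          /* per-rank contribution, arbitrary */
    char *local = malloc(chunk);
    memset(local, rank + 1, chunk);                  /* deterministic, rank-dependent content */

    char *area = NULL;
    int fd = -1;
    if (rank == 0) {
        /* Newly created, file-backed destination for the gathered data. */
        fd = open("result.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)chunk * size) != 0)
            MPI_Abort(MPI_COMM_WORLD, 1);
        area = mmap(NULL, (size_t)chunk * size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (area == MAP_FAILED)
            MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Rank 0 receives every rank's chunk directly into the mapped file. */
    MPI_Gather(local, chunk, MPI_BYTE, area, chunk, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        msync(area, (size_t)chunk * size, MS_SYNC);  /* push the mapping out to the file */
        munmap(area, (size_t)chunk * size);
        close(fd);
    }
    free(local);
    MPI_Finalize();
    return 0;
}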
> > >> >>>
> > >> >>> It would be great if someone could tell me if they are able to reproduce the bug, or tell me whether something which is done in this test case is illegal in any respect. I'd be glad to provide further information which could be of any help.
> > >> >>>
> > >> >>> Best regards,
> > >> >>>
> > >> >>> E. Thomé.
>
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <assert.h>
> #include <mpi.h>
> #include <sys/mman.h>
>
> /* This test file illustrates how in certain circumstances, an mmap area
>  * cannot correctly receive data sent from an MPI_Send call.
>  *
>  * This program is meant to run on 2 distinct nodes connected with
>  * infiniband.
>  *
>  * Normal behaviour of the program consists in printing output similar
>  * to:
> node 0 iteration 0, lead word received from peer is 0x00000401 [ok]
> node 0 iteration 1, lead word received from peer is 0x00000801 [ok]
> node 0 iteration 2, lead word received from peer is 0x00000c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x00001001 [ok]
>  *
>  * Abnormal behaviour is when the job ends with MPI_Abort after printing
>  * a line such as:
> node 0 iteration 1, lead word received from peer is 0x00000000 [NOK]
>  *
>  * Each iteration of the main loop does the same thing.
>  * - rank 0 allocates a buffer with mmap
>  * - rank 1 sends data there with MPI_Send
>  * - rank 0 verifies that the data has been correctly received.
>  * - rank 0 frees the buffer with munmap
>  *
>  * The final check performed by rank 0 fails if the following conditions
>  * are met:
>  *
>  * - the openib btl is used among the 2 nodes
>  * - allocation is done via mmap/munmap (not via malloc/free)
>  * - the send is large enough.
>  *
>  * The first condition is controlled by the btl mca.
>  * The size of the transfer is controlled by the -s command line
>  * argument */
>
> /* For compiling, one may do:
>
> MPI=$HOME/Packages/openmpi-1.8.3
> $MPI/bin/mpicc -W -Wall -std=c99 -O0 -g prog5.c
>
>  * For running, assuming /tmp/hosts contains the list of 2 nodes, and
>  * $SSH is used to connect to these:
>
> SSH_AUTH_SOCK= DISPLAY= $MPI/bin/mpiexec -machinefile /tmp/hosts --mca plm_rsh_agent $SSH --mca rmaps_base_mapping_policy node -n 2 ./a.out -s 2048
>
>  */
>
> /*
>  * Tested (FAIL means that setting USE_MMAP_FOR_FILE_IO above leads to a
>  * program failure, while we succeed if it is unset).
>  *
>  * IB boards MCX353A-FCBT, fw rev 2.32.5100, MLNX_OFED_LINUX-2.3-1.0.1-debian7.5-x86_64
>  * openmpi-1.8.4rc1 FAIL (ok with --mca btl ^openib)
>  * openmpi-1.8.3    FAIL (ok with --mca btl ^openib)
>  *
>  * A previous, longer test case also failed with IB boards MHGH29-XTC.
>  */
>
> /* Passing --mca mpi_leave_pinned 0 eliminates the bug */
>
> int main(int argc, char * argv[])
> {
>     MPI_Init(&argc, &argv);
>     int size;
>     int rank;
>     int eitems = 1530;    /* eitems >= 1530 seem to fail on my cluster */
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>     if (size != 2) abort();
>
>     int use_mmap = 1;
>
>     for(argc--, argv++; argc ; ) {
>         if (argc >= 2 && strcmp(argv[0], "-s") == 0) {
>             eitems = atoi(argv[1]);
>             argc -= 2;
>             argv += 2;
>             continue;
>         }
>         if (strcmp(argv[0], "-malloc") == 0) {
>             use_mmap = 0;
>             argc--, argv++;
>             continue;
>         }
>         fprintf(stderr, "Unexpected: %s\n", argv[0]);
>         exit(EXIT_FAILURE);
>     }
>
>     size_t chunksize = eitems * sizeof(unsigned long);
>     size_t wsiz = ((chunksize - 1) | (sysconf(_SC_PAGESIZE)-1)) + 1;
>
>     unsigned long * localbuf = malloc(chunksize);
>
>     for(int iter = 0 ; iter < 4 ; iter++) {
>         unsigned long magic = (1 + iter) << 10;
>
>         int ok = 1;
>
>         if (rank == 1) {
>             for(int item = 0 ; item < eitems ; item++) {
>                 localbuf[item] = magic + rank;
>             }
>             MPI_Send(localbuf, eitems, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
>         } else {
>             unsigned long * recvbuf;
>             if (use_mmap) {
>                 recvbuf = mmap(NULL, wsiz, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>             } else {
>                 recvbuf = malloc(wsiz);
>             }
>             MPI_Recv(recvbuf, eitems, MPI_UNSIGNED_LONG, !rank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             ok = (*recvbuf == magic + !rank);
>             fprintf(stderr, "node %d iteration %d, lead word received from peer is 0x%08lx [%s]\n", rank, iter, *recvbuf, ok?"ok":"NOK");
>             if (use_mmap) {
>                 munmap(recvbuf, wsiz);
>             } else {
>                 free(recvbuf);
>             }
>         }
>
>         /* only rank 0 has performed a new check */
>         MPI_Bcast(&ok, 1, MPI_INT, 0, MPI_COMM_WORLD);
>
>         if (!ok) MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>     free(localbuf);
>
>     MPI_Finalize();
> }
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/11/25775.php