[OMPI users] potential bug with MPI_Win_fence() in openmpi-1.8.4
OpenMPI developers,

We've had issues (memory errors) with OpenMPI - and code in the PETSc
library that uses MPI_Win_fence().

Valgrind shows memory corruption deep inside the OpenMPI function stack.

I'm attaching a potential patch that appears to fix this issue for us.
[the corresponding valgrind trace is listed in the patch header]

Perhaps there is a more appropriate fix for this memory corruption. Could
you check on this?

[Sorry I don't have a pure MPI test code to demonstrate this error -
but a PETSc test example code consistently reproduces this issue]

Thanks,
Satish

commit ffdd25d6f4beef42a50d34f70bfe75bde077370d
Author: Satish Balay
Date:   Wed Apr 29 22:33:06 2015 -0500

    openmpi: potential bugfix for PETSc sf example

balay@asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
$ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2 -sf_type window

PetscSF Object: 2 MPI processes
  type: window
  synchronization=FENCE sort=rank-order
  [0] Number of roots=1, leaves=2, remote ranks=2
  [0] 0 <- (0,0)
  [0] 1 <- (1,0)
  [1] Number of roots=1, leaves=2, remote ranks=2
  [1] 0 <- (1,0)
  [1] 1 <- (0,0)

==14815== Invalid write of size 2
==14815==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
==14815==    by 0x401DD3: main (ex2.c:81)
==14815==  Address 0x101c3b98 is 0 bytes after a block of size 72 alloc'd
==14815==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:69
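
For context on the code path in the trace: as the frames suggest, PetscSF's
window implementation uses active-target one-sided communication, i.e. MPI_Get
with derived datatypes inside an MPI_Win_fence epoch, which is what makes the
target process rebuild a datatype from its packed description
(ompi_datatype_create_from_packed_description above). The stand-alone sketch
below only illustrates that general communication pattern; it is not the PETSc
example and is not claimed to reproduce the reported corruption. All names in
it are illustrative.

    /* Minimal sketch of a fence epoch with an MPI_Get that uses a derived
     * datatype on the target side, so the target must reconstruct the
     * datatype from its packed description.  Illustration only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank exposes a small buffer of doubles through a window. */
        double local[4] = { rank, rank + 0.1, rank + 0.2, rank + 0.3 };
        MPI_Win win;
        MPI_Win_create(local, 4 * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* A derived datatype (every other element) forces the target-side
         * datatype description to be shipped and rebuilt. */
        MPI_Datatype strided;
        MPI_Type_vector(2, 1, 2, MPI_DOUBLE, &strided);
        MPI_Type_commit(&strided);

        double recv[2] = { 0.0, 0.0 };
        int target = (rank + 1) % size;

        MPI_Win_fence(0, win);               /* open the access epoch    */
        MPI_Get(recv, 2, MPI_DOUBLE, target, 0, 1, strided, win);
        MPI_Win_fence(0, win);               /* close it; data is usable */

        printf("[%d] got %g %g from rank %d\n", rank, recv[0], recv[1], target);

        MPI_Type_free(&strided);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }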
Re: [OMPI users] [petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4
Thanks for checking and getting a more appropriate fix in.

I've just tried this out - and the PETSc test code runs fine with it.

BTW: There is one inconsistency in ompi/datatype/ompi_datatype_args.c
[that I noticed] - that you might want to check.
Perhaps the second line should be "(DC) * sizeof(MPI_Datatype)"?

>>>>>>>>>
    int length = sizeof(ompi_datatype_args_t) + (IC) * sizeof(int) + \
                 (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(MPI_Datatype); \

    pArgs->total_pack_size = (4 + (IC)) * sizeof(int) + \
                             (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(int); \
<<<<<<<<<<<

Satish

On Thu, 30 Apr 2015, Matthew Knepley wrote:

> On Fri, May 1, 2015 at 4:55 AM, Jeff Squyres (jsquyres)
> wrote:
>
> > Thank you!
> >
> > George reviewed your patch and adjusted it a bit. We applied it to master
> > and it's pending to the release series (v1.8.x).
>
> Was this identified by IBM?
>
> https://github.com/open-mpi/ompi/commit/015d3f56cf749ee5ad9ea4428d2f5da72f9bbe08
>
>    Matt
>
> > Would you mind testing a nightly master snapshot? It should be in
> > tonight's build:
> >
> >     http://www.open-mpi.org/nightly/master/
> >
> > > On Apr 30, 2015, at 12:50 AM, Satish Balay wrote:
> > >
> > > OpenMPI developers,
> > >
> > > We've had issues (memory errors) with OpenMPI - and code in the PETSc
> > > library that uses MPI_Win_fence().
> > >
> > > Valgrind shows memory corruption deep inside the OpenMPI function stack.
> > >
> > > I'm attaching a potential patch that appears to fix this issue for us.
> > > [the corresponding valgrind trace is listed in the patch header]
> > >
> > > Perhaps there is a more appropriate fix for this memory corruption. Could
> > > you check on this?
> > >
> > > [Sorry I don't have a pure MPI test code to demonstrate this error -
> > > but a PETSc test example code consistently reproduces this issue]
> > >
> > > Thanks,
> > > Satish
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
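
For concreteness, here is a small stand-alone sketch of the two quantities the
quoted macro lines compute. It is an illustration of the arithmetic only, not
Open MPI code: the counts are made-up values, MPI_Aint stands in for
OPAL_PTRDIFF_TYPE, and the sizeof(ompi_datatype_args_t) header term is left out
of length.

    /* Contrasts "length" (size of the local args storage, which holds real
     * MPI_Datatype handles - pointer-sized in Open MPI) with
     * "total_pack_size" (the wire-format size, where each datatype is
     * represented by an int id).  Illustration only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(void)
    {
        const int ic = 3;  /* integer arguments            */
        const int ac = 2;  /* address (MPI_Aint) arguments */
        const int dc = 1;  /* datatype arguments           */

        /* Local representation: datatype arguments stored as handles. */
        size_t length = ic * sizeof(int)
                      + ac * sizeof(MPI_Aint)
                      + dc * sizeof(MPI_Datatype);

        /* Packed (wire) representation: 4 header ints plus the arguments,
         * with each datatype replaced by an int id. */
        size_t total_pack_size = (4 + ic) * sizeof(int)
                               + ac * sizeof(MPI_Aint)
                               + dc * sizeof(int);

        printf("length (minus struct header) = %zu bytes\n", length);
        printf("total_pack_size              = %zu bytes\n", total_pack_size);
        return 0;
    }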
Re: [OMPI users] [petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4
Great! Thanks for checking.

Satish

On Thu, 30 Apr 2015, George Bosilca wrote:

> I went over the code and in fact I think it is correct as is. The length is
> for the local representation, which indeed uses pointers to datatype
> structures. In contrast, the total_pack_size represents the amount of
> space we would need to store the data in a format that can be sent to
> another peer, in which case handling pointers is pointless and we fall back
> to int.
>
> However, I think we are counting the space needed for predefined
> data twice. I'll push a patch shortly.
>
>   George.
>
>
> On Thu, Apr 30, 2015 at 3:33 PM, George Bosilca wrote:
>
> > In the packed representation we store not MPI_Datatypes but a handcrafted
> > id for each one. The 2 codes should have been in sync. I'm looking at
> > another issue right now, and I'll come back to this one right after.
> >
> > Thanks for paying attention to the code.
> >
> >   George.
> >
> > On Thu, Apr 30, 2015 at 3:13 PM, Satish Balay wrote:
> >
> >> Thanks for checking and getting a more appropriate fix in.
> >>
> >> I've just tried this out - and the PETSc test code runs fine with it.
> >>
> >> BTW: There is one inconsistency in ompi/datatype/ompi_datatype_args.c
> >> [that I noticed] - that you might want to check.
> >> Perhaps the second line should be "(DC) * sizeof(MPI_Datatype)"?
> >>
> >> >>>>>>>>>
> >>     int length = sizeof(ompi_datatype_args_t) + (IC) * sizeof(int) + \
> >>                  (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(MPI_Datatype); \
> >>
> >>     pArgs->total_pack_size = (4 + (IC)) * sizeof(int) + \
> >>                              (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(int); \
> >> <<<<<<<<<<<
> >>
> >> Satish
> >>
> >>
> >> On Thu, 30 Apr 2015, Matthew Knepley wrote:
> >>
> >> > On Fri, May 1, 2015 at 4:55 AM, Jeff Squyres (jsquyres) <
> >> jsquy...@cisco.com>
> >> > wrote:
> >> >
> >> > > Thank you!
> >> > >
> >> > > George reviewed your patch and adjusted it a bit. We applied it to
> >> master
> >> > > and it's pending to the release series (v1.8.x).
> >> >
> >> > Was this identified by IBM?
> >> >
> >> > https://github.com/open-mpi/ompi/commit/015d3f56cf749ee5ad9ea4428d2f5da72f9bbe08
> >> >
> >> >    Matt
> >> >
> >> > > Would you mind testing a nightly master snapshot? It should be in
> >> > > tonight's build:
> >> > >
> >> > >     http://www.open-mpi.org/nightly/master/
> >> > >
> >> > > > On Apr 30, 2015, at 12:50 AM, Satish Balay wrote:
> >> > > >
> >> > > > OpenMPI developers,
> >> > > >
> >> > > > We've had issues (memory errors) with OpenMPI - and code in the PETSc
> >> > > > library that uses MPI_Win_fence().
> >> > > >
> >> > > > Valgrind shows memory corruption deep inside the OpenMPI function stack.
> >> > > >
> >> > > > I'm attaching a potential patch that appears to fix this issue for us.
> >> > > > [the corresponding valgrind trace is listed in the patch header]
> >> > > >
> >> > > > Perhaps there is a more appropriate fix for this memory corruption. Could
> >> > > > you check on this?
> >> > > >
> >> > > > [Sorry I don't have a pure MPI test code to demonstrate this error -
> >> > > > but a PETSc test example code consistently reproduces this issue]
> >> > > >
> >> > > > Thanks,
> >> > > > Satish
> >> > >
> >> > > --
> >> > > Jeff Squyres
> >> > > jsquy...@cisco.com
> >> > > For corporate legal information go to:
> >> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/users/2015/04/26823.php
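
To make the distinction George draws concrete: the local args structure keeps
actual MPI_Datatype handles (pointers in Open MPI), while the packed
description shipped to a peer encodes each datatype as an int id. The sketch
below is a minimal, hypothetical illustration of that idea only; the id table
and helper names are invented and are not Open MPI's internal API, whose real
packing of datatype descriptions is considerably more involved.

    /* Hypothetical sketch: pack writes an int id per datatype, unpack resolves
     * the id back to a handle.  Names and table are invented for illustration. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy id table for a few predefined datatypes. */
    static MPI_Datatype id_table[] = { MPI_INT, MPI_DOUBLE, MPI_CHAR };
    enum { N_IDS = sizeof(id_table) / sizeof(id_table[0]) };

    /* Local -> packed: write an int id instead of the (pointer-sized) handle. */
    static int pack_datatype(MPI_Datatype type, char *buf, size_t *pos)
    {
        for (int id = 0; id < N_IDS; id++) {
            if (type == id_table[id]) {
                memcpy(buf + *pos, &id, sizeof(int));
                *pos += sizeof(int);
                return 0;
            }
        }
        return -1;  /* a derived type would need its full description packed */
    }

    /* Packed -> local: resolve the int id back to a handle. */
    static MPI_Datatype unpack_datatype(const char *buf, size_t *pos)
    {
        int id;
        memcpy(&id, buf + *pos, sizeof(int));
        *pos += sizeof(int);
        return (id >= 0 && id < N_IDS) ? id_table[id] : MPI_DATATYPE_NULL;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        char buf[64];
        size_t pos = 0;
        pack_datatype(MPI_DOUBLE, buf, &pos);        /* "sender" side   */

        pos = 0;
        MPI_Datatype t = unpack_datatype(buf, &pos); /* "receiver" side */
        printf("round-trip ok: %d\n", t == MPI_DOUBLE);

        MPI_Finalize();
        return 0;
    }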