[OMPI users] potential bug with MPI_Win_fence() in openmpi-1.8.4

2015-04-30 Thread Satish Balay
OpenMPI developers,

We've had issues (memory errors) with OpenMPI - and code in the PETSc
library that uses MPI_Win_fence().

Valgrind shows memory corruption deep inside the OpenMPI function stack.

I'm attaching a potential patch that appears to fix this issue for us.
[the corresponding valgrind trace is listed in the patch header]

Perhaps there is a more appropriate fix for this memory corruption. Could
you check on this?

[Sorry I don't have a pure MPI test code to demonstrate this error -
but a PETSc test example consistently reproduces this issue]
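
[For context, the general fence-synchronized one-sided pattern that the PETSc
SF window code exercises looks roughly like the sketch below. It is not a
reproducer for this corruption - that also seems to involve the derived-datatype
path visible in the trace - just a minimal illustration using standard MPI calls:]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int root = rank;   /* one value exposed per rank */
    int leaf = -1;     /* value fetched from the next rank */

    MPI_Win_create(&root, sizeof(int), (int)sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                     /* open access epoch */
    MPI_Get(&leaf, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                     /* close epoch; the get completes here */

    printf("[%d] got %d from rank %d\n", rank, leaf, (rank + 1) % size);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}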

Thanks,
Satish

commit ffdd25d6f4beef42a50d34f70bfe75bde077370d
Author: Satish Balay 
Date:   Wed Apr 29 22:33:06 2015 -0500

openmpi: potential bugfix for PETSc sf example

balay@asterix /home/balay/petsc/src/vec/is/sf/examples/tutorials (master=)
$ /home/balay/petsc/arch-ompi/bin/mpiexec -n 2 valgrind --tool=memcheck -q --dsymutil=yes --num-callers=40 --track-origins=yes ./ex2 -sf_type window
PetscSF Object: 2 MPI processes
  type: window
synchronization=FENCE sort=rank-order
  [0] Number of roots=1, leaves=2, remote ranks=2
  [0] 0 <- (0,0)
  [0] 1 <- (1,0)
  [1] Number of roots=1, leaves=2, remote ranks=2
  [1] 0 <- (1,0)
  [1] 1 <- (0,0)
==14815== Invalid write of size 2
==14815==    at 0x4C2E36B: memcpy@@GLIBC_2.14 (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815==    by 0x8AFDABD: ompi_datatype_set_args (ompi_datatype_args.c:167)
==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:695)
==14815==    by 0x9A9E9F2: opal_progress (opal_progress.c:187)
==14815==    by 0xECCA70A: opal_condition_wait (condition.h:78)
==14815==    by 0xECCA7F4: ompi_request_wait_completion (request.h:381)
==14815==    by 0xECCAF69: mca_pml_ob1_recv (pml_ob1_irecv.c:109)
==14815==    by 0xFD8938D: ompi_coll_tuned_reduce_intra_basic_linear (coll_tuned_reduce.c:677)
==14815==    by 0xFD79C26: ompi_coll_tuned_reduce_intra_dec_fixed (coll_tuned_decision_fixed.c:386)
==14815==    by 0xF0F3B91: mca_coll_basic_reduce_scatter_block_intra (coll_basic_reduce_scatter_block.c:96)
==14815==    by 0xF72BC58: ompi_osc_rdma_fence (osc_rdma_active_target.c:140)
==14815==    by 0x8B47078: PMPI_Win_fence (pwin_fence.c:59)
==14815==    by 0x5106D8F: PetscSFRestoreWindow (sfwindow.c:348)
==14815==    by 0x51092DA: PetscSFBcastEnd_Window (sfwindow.c:510)
==14815==    by 0x51303D6: PetscSFBcastEnd (sf.c:957)
==14815==    by 0x401DD3: main (ex2.c:81)
==14815==  Address 0x101c3b98 is 0 bytes after a block of size 72 alloc'd
==14815==    at 0x4C29BCF: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==14815==    by 0x8AFD755: ompi_datatype_set_args (ompi_datatype_args.c:123)
==14815==    by 0x8AFF0F3: __ompi_datatype_create_from_args (ompi_datatype_args.c:718)
==14815==    by 0x8AFEC0E: __ompi_datatype_create_from_packed_description (ompi_datatype_args.c:649)
==14815==    by 0x8AFF5D6: ompi_datatype_create_from_packed_description (ompi_datatype_args.c:788)
==14815==    by 0xF727F0E: ompi_osc_base_datatype_create (osc_base_obj_convert.h:52)
==14815==    by 0xF728424: datatype_create (osc_rdma_data_move.c:333)
==14815==    by 0xF72887D: process_get (osc_rdma_data_move.c:536)
==14815==    by 0xF72A856: process_frag (osc_rdma_data_move.c:1593)
==14815==    by 0xF72AA35: ompi_osc_rdma_callback (osc_rdma_data_move.c:1656)
==14815==    by 0xECCF0DD: ompi_request_complete (request.h:402)
==14815==    by 0xECCF4EA: recv_request_pml_complete (pml_ob1_recvreq.h:181)
==14815==    by 0xECCFF87: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:243)
==14815==    by 0xE68F875: mca_btl_vader_check_fboxes (btl_vader_fbox.h:220)
==14815==    by 0xE690D82: mca_btl_vader_component_progress (btl_vader_component.c:69

Re: [OMPI users] [petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4

2015-04-30 Thread Satish Balay
Thanks for checking and getting a more appropriate fix in.

I've just tried this out - and the PETSc test code runs fine with it.

BTW: There is one inconsistency in ompi/datatype/ompi_datatype_args.c
[that I noticed] - that you might want to check.
Perhaps the second line should be "(DC) * sizeof(MPI_Datatype)"?

>>>>>>>>>
int length = sizeof(ompi_datatype_args_t) + (IC) * sizeof(int) + \
             (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(MPI_Datatype); \


pArgs->total_pack_size = (4 + (IC)) * sizeof(int) + \
                         (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(int);  \
<<<<<<<<<<<
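
[To make the asymmetry concrete, here is a small standalone sketch - not Open
MPI code; the counts and the header struct are made-up stand-ins - showing that
the two formulas differ only in the (DC) term, i.e. the size of an MPI_Datatype
handle vs sizeof(int):]

#include <stdio.h>
#include <stddef.h>

/* stand-ins only, not the real Open MPI definitions */
typedef struct { int placeholder; } args_header_t;  /* stands in for ompi_datatype_args_t */
typedef void *datatype_handle_t;                    /* stands in for MPI_Datatype (a pointer in Open MPI) */

int main(void)
{
    const size_t IC = 3, AC = 2, DC = 2;  /* made-up int/address/datatype counts */

    /* local size: the (DC) term counted as handles */
    size_t length = sizeof(args_header_t) + IC * sizeof(int)
                  + AC * sizeof(ptrdiff_t)           /* in place of OPAL_PTRDIFF_TYPE */
                  + DC * sizeof(datatype_handle_t);

    /* packed size: the (DC) term counted as ints */
    size_t total_pack_size = (4 + IC) * sizeof(int)
                           + AC * sizeof(ptrdiff_t)
                           + DC * sizeof(int);

    printf("length = %zu, total_pack_size = %zu\n", length, total_pack_size);
    return 0;
}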

Satish


On Thu, 30 Apr 2015, Matthew Knepley wrote:

> On Fri, May 1, 2015 at 4:55 AM, Jeff Squyres (jsquyres)  wrote:
> 
> > Thank you!
> >
> > George reviewed your patch and adjusted it a bit.  We applied it to master
> > and it's pending to the release series (v1.8.x).
> >
> 
> Was this identified by IBM?
> 
> 
> https://github.com/open-mpi/ompi/commit/015d3f56cf749ee5ad9ea4428d2f5da72f9bbe08
> 
>  Matt
> 
> 
> > Would you mind testing a nightly master snapshot?  It should be in
> > tonight's build:
> >
> > http://www.open-mpi.org/nightly/master/
> >
> >
> >
> > > On Apr 30, 2015, at 12:50 AM, Satish Balay  wrote:
> > >
> > > OpenMPI developers,
> > >
> > > We've had issues (memory errors) with OpenMPI - and code in the PETSc
> > > library that uses MPI_Win_fence().
> > >
> > > Valgrind shows memory corruption deep inside the OpenMPI function stack.
> > >
> > > I'm attaching a potential patch that appears to fix this issue for us.
> > > [the corresponding valgrind trace is listed in the patch header]
> > >
> > > Perhaps there is a more appropriate fix for this memory corruption. Could
> > > you check on this?
> > >
> > > [Sorry I don't have a pure MPI test code to demonstrate this error -
> > > but a PETSc test example consistently reproduces this issue]
> > >
> > > Thanks,
> > > Satish
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> 
> 
> 



Re: [OMPI users] [petsc-dev] potential bug with MPI_Win_fence() in openmpi-1.8.4

2015-04-30 Thread Satish Balay
Great! Thanks for checking.

Satish

On Thu, 30 Apr 2015, George Bosilca wrote:

> I went over the code and in fact I think it is correct as is. The length is
> for the local representation, which indeed uses pointers to datatype
> structures. By contrast, the total_pack_size represents the amount of
> space we would need to store the data in a format that can be sent to
> another peer, in which case handling pointers is pointless and we fall back
> to int.
> 
> However, I think we are counting the space needed for predefined data
> twice. I'll push a patch shortly.
> 
>   George.
> 
> 
> On Thu, Apr 30, 2015 at 3:33 PM, George Bosilca  wrote:
> 
> > In the packed representation we store not MPI_Datatypes but a handcrafted
> > id for each one. The two code paths should have been in sync. I'm looking at
> > another issue right now, and I'll come back to this one right after.
> >
> > Thanks for paying attention to the code.
> >   George.
> >
> > On Thu, Apr 30, 2015 at 3:13 PM, Satish Balay  wrote:
> >
> >> Thanks for checking and getting a more appropriate fix in.
> >>
> >> I've just tried this out - and the PETSc test code runs fine with it.
> >>
> >> BTW: There is one inconsistency in ompi/datatype/ompi_datatype_args.c
> >> [that I noticed] - that you might want to check.
> >> Perhaps the second line should be "(DC) * sizeof(MPI_Datatype)"?
> >>
> >> >>>>>>>>>
> >> int length = sizeof(ompi_datatype_args_t) + (IC) * sizeof(int) + \
> >>              (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(MPI_Datatype); \
> >>
> >>
> >> pArgs->total_pack_size = (4 + (IC)) * sizeof(int) + \
> >>                          (AC) * sizeof(OPAL_PTRDIFF_TYPE) + (DC) * sizeof(int);  \
> >> <<<<<<<<<<<
> >>
> >> Satish
> >>
> >>
> >> On Thu, 30 Apr 2015, Matthew Knepley wrote:
> >>
> >> > On Fri, May 1, 2015 at 4:55 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >> >
> >> > > Thank you!
> >> > >
> >> > > George reviewed your patch and adjusted it a bit.  We applied it to master
> >> > > and it's pending to the release series (v1.8.x).
> >> > >
> >> >
> >> > Was this identified by IBM?
> >> >
> >> >
> >> >
> >> > https://github.com/open-mpi/ompi/commit/015d3f56cf749ee5ad9ea4428d2f5da72f9bbe08
> >> >
> >> >  Matt
> >> >
> >> >
> >> > > Would you mind testing a nightly master snapshot?  It should be in
> >> > > tonight's build:
> >> > >
> >> > > http://www.open-mpi.org/nightly/master/
> >> > >
> >> > >
> >> > >
> >> > > > On Apr 30, 2015, at 12:50 AM, Satish Balay  wrote:
> >> > > >
> >> > > > OpenMPI developers,
> >> > > >
> >> > > > We've had issues (memory errors) with OpenMPI - and code in the PETSc
> >> > > > library that uses MPI_Win_fence().
> >> > > >
> >> > > > Valgrind shows memory corruption deep inside the OpenMPI function stack.
> >> > > >
> >> > > > I'm attaching a potential patch that appears to fix this issue for us.
> >> > > > [the corresponding valgrind trace is listed in the patch header]
> >> > > >
> >> > > > Perhaps there is a more appropriate fix for this memory corruption. Could
> >> > > > you check on this?
> >> > > >
> >> > > > [Sorry I don't have a pure MPI test code to demonstrate this error -
> >> > > > but a PETSc test example consistently reproduces this issue]
> >> > > >
> >> > > > Thanks,
> >> > > > Satish
> >> > >
> >> > >
> >> > > --
> >> > > Jeff Squyres
> >> > > jsquy...@cisco.com
> >> > > For corporate legal information go to:
> >> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> >> > >
> >> > >
> >> >
> >> >
> >> >
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/users/2015/04/26823.php
> >>
> >
> >
>