Hi,  I am attempting to debug a memory corruption in an MPI program using
Valgrind.  However, when I run under Valgrind I get semi-random segfaults
and Valgrind errors pointing into the Open MPI library.  Here is an example
of such a segfault:

==6153==
==6153== Invalid read of size 8
==6153==    at 0x19102EA0: (within
/usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153==    by 0x182ABACB: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0x182A3040: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0xB425DD3: PMPI_Isend (in
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==    by 0x7B83DA8: int Uintah::SFC<double>::MergeExchange<unsigned
char>(int, std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:2989)
==6153==    by 0x7B84A8F: void Uintah::SFC<double>::Batchers<unsigned
char>(std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3730)
==6153==    by 0x7B8857B: void Uintah::SFC<double>::Cleanup<unsigned
char>(std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&,
std::vector<Uintah::History<unsigned char>,
std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3695)
==6153==    by 0x7B88CC6: void Uintah::SFC<double>::Parallel0<3, unsigned
char>() (SFC.h:2928)
==6153==    by 0x7C00AAB: void Uintah::SFC<double>::Parallel<3, unsigned
char>() (SFC.h:1108)
==6153==    by 0x7C0EF39: void Uintah::SFC<double>::GenerateDim<3>(int)
(SFC.h:694)
==6153==    by 0x7C0F0F2: Uintah::SFC<double>::GenerateCurve(int)
(SFC.h:670)
==6153==    by 0x7B30CAC:
Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle<Uintah::Level> const&,
int*) (DynamicLoadBalancer.cc:429)
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
(segmentation violation)

Looking at the code for our Isend at SFC.h:2989, I do not see any
errors:

=============================================
  MergeInfo<BITS> myinfo, theirinfo;

  MPI_Request srequest, rrequest;
  MPI_Status status;

  // Fill in the metadata describing this rank's local buffer.
  myinfo.n = n;
  if (n != 0)
  {
    myinfo.min = sendbuf[0].bits;
    myinfo.max = sendbuf[n-1].bits;
  }
  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;

  // Non-blocking send of the metadata struct to the partner rank.
  MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
==============================================

myinfo is a struct located on the stack, to is the rank of the process the
message is being sent to, and srequest is also on the stack.  When I don't
run under Valgrind, the program runs past this point just fine.  A
stripped-down, self-contained version of the pattern is sketched below.
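
The MergeInfo stand-in here is simplified (the real struct is templated on
BITS, and the field types are my guesses), and explicit waits are added so
the sketch is complete on its own:

=============================================
#include <mpi.h>

// Simplified, hypothetical stand-in for MergeInfo<BITS>.
struct MergeInfo {
  unsigned int n;
  unsigned char min;
  unsigned char max;
};

void exchange_info(int to, int from, MPI_Comm Comm, unsigned int n)
{
  MergeInfo myinfo = {}, theirinfo = {};  // stack buffers, as in SFC.h
  MPI_Request srequest, rrequest;

  myinfo.n = n;

  MPI_Irecv(&theirinfo, sizeof(MergeInfo), MPI_BYTE, from, 0, Comm, &rrequest);
  MPI_Isend(&myinfo,    sizeof(MergeInfo), MPI_BYTE, to,   0, Comm, &srequest);

  // Complete both requests before the stack buffers go out of scope,
  // so the transfer never touches freed or reused stack memory.
  MPI_Wait(&srequest, MPI_STATUS_IGNORE);
  MPI_Wait(&rrequest, MPI_STATUS_IGNORE);
}
=============================================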

I am currently using Open MPI 1.3 from the Debian unstable branch.  I also
see the same type of Valgrind errors in a different portion of the code,
this time involving an MPI_Allgatherv, shown below (a generic sketch of
that call pattern follows the trace):

==============================================
==22736== Use of uninitialised value of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
================================================================
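
For context, the Allgatherv at Level.cc:728 follows the standard
variable-count gather pattern.  A generic sketch of that pattern (not the
actual Uintah code, with made-up names) looks like:

================================================================
#include <mpi.h>
#include <vector>

// Every rank contributes a variable-length chunk; every rank ends up
// with all chunks concatenated in rank order.
std::vector<int> gather_all(std::vector<int>& mine, MPI_Comm comm)
{
  int nprocs;
  MPI_Comm_size(comm, &nprocs);

  // First share how many elements each rank contributes...
  int mycount = (int)mine.size();
  std::vector<int> counts(nprocs);
  MPI_Allgather(&mycount, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

  // ...then compute displacements and gather the payloads.
  std::vector<int> displs(nprocs, 0);
  for (int i = 1; i < nprocs; i++)
    displs[i] = displs[i-1] + counts[i-1];

  std::vector<int> all(displs[nprocs-1] + counts[nprocs-1]);
  MPI_Allgatherv(mine.data(), mycount, MPI_INT,
                 all.data(), counts.data(), displs.data(), MPI_INT, comm);
  return all;
}
================================================================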

Are these problems with Open MPI, and are there any known workarounds?

Thanks,
Justin
