(Sorry if this is posted twice; I sent the same email yesterday but it never appeared on the list.)

Hi, I am attempting to debug a memory corruption in an MPI program using valgrind. However, when I run under valgrind I get semi-random segfaults and valgrind errors pointing into the Open MPI library. Here is an example of such a segfault:

==6153==
==6153== Invalid read of size 8
==6153==    at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153==    by 0x182ABACB: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0x182A3040: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0xB425DD3: PMPI_Isend (in /usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==    by 0x7B83DA8: int Uintah::SFC<double>::MergeExchange<unsigned char>(int, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:2989)
==6153==    by 0x7B84A8F: void Uintah::SFC<double>::Batchers<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3730)
==6153==    by 0x7B8857B: void Uintah::SFC<double>::Cleanup<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3695)
==6153==    by 0x7B88CC6: void Uintah::SFC<double>::Parallel0<3, unsigned char>() (SFC.h:2928)
==6153==    by 0x7C00AAB: void Uintah::SFC<double>::Parallel<3, unsigned char>() (SFC.h:1108)
==6153==    by 0x7C0EF39: void Uintah::SFC<double>::GenerateDim<3>(int) (SFC.h:694)
==6153==    by 0x7C0F0F2: Uintah::SFC<double>::GenerateCurve(int) (SFC.h:670)
==6153==    by 0x7B30CAC: Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle<Uintah::Level> const&, int*) (DynamicLoadBalancer.cc:429)
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil) (segmentation violation)

Looking at the code for our Isend at SFC.h:2989, I don't see any errors:
=============================================
 MergeInfo<BITS> myinfo,theirinfo;

 MPI_Request srequest, rrequest;
 MPI_Status status;

 myinfo.n=n;
 if(n!=0)
 {
   myinfo.min=sendbuf[0].bits;
   myinfo.max=sendbuf[n-1].bits;
 }
//cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;

 MPI_Isend(&myinfo,sizeof(MergeInfo<BITS>),MPI_BYTE,to,0,Comm,&srequest);
==============================================

myinfo is a struct located on the stack, to is the rank of the processor that the message is being sent to, and srequest is also on the stack. In addition, this message is waited on prior to exiting this block of code, so both the buffer and the request still exist on the stack for the full lifetime of the send. When I don't run with valgrind, my program runs past this point just fine. I am currently using Open MPI 1.3 from the Debian unstable branch. I also see the same type of segfault in a different portion of the code, involving an MPI_Allgatherv, which can be seen below:
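
To make the pattern concrete, here is a stripped-down sketch of what that block is doing (MergeInfoSketch, ExchangeMergeInfo, and the to/from ranks are simplified placeholders for illustration, not the actual code in SFC.h):

==============================================
#include <mpi.h>
#include <stdint.h>

// Simplified stand-in for MergeInfo<BITS>; the real struct is defined in SFC.h.
struct MergeInfoSketch
{
  int n;
  uint64_t min;
  uint64_t max;
};

void ExchangeMergeInfo(int to, int from, MPI_Comm Comm,
                       MergeInfoSketch &myinfo, MergeInfoSketch &theirinfo)
{
  // Buffers and requests all live on this stack frame.
  MPI_Request srequest, rrequest;
  MPI_Status status;

  MPI_Isend(&myinfo, sizeof(MergeInfoSketch), MPI_BYTE, to, 0, Comm, &srequest);
  MPI_Irecv(&theirinfo, sizeof(MergeInfoSketch), MPI_BYTE, from, 0, Comm, &rrequest);

  // Both requests are completed before this function returns, so the
  // stack buffers stay valid for the whole communication.
  MPI_Wait(&rrequest, &status);
  MPI_Wait(&srequest, &status);
}
==============================================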

==============================================
==22736== Use of uninitialised value of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
================================================================
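
For reference, the general shape of an MPI_Allgatherv call is roughly the following (this is a generic sketch; the names do not correspond to the actual variables in Level.cc):

==============================================
#include <mpi.h>
#include <vector>

// Generic variable-count gather pattern; illustrative only, not the code
// in Uintah::Level::setBCTypes().
void GatherAll(MPI_Comm comm, std::vector<int> &mydata, std::vector<int> &all)
{
  int nranks;
  MPI_Comm_size(comm, &nranks);

  // Each rank first learns how much every other rank will contribute.
  int mycount = (int)mydata.size();
  std::vector<int> recvcounts(nranks);
  MPI_Allgather(&mycount, 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);

  // Build displacements and gather the variable-length contributions.
  std::vector<int> displs(nranks, 0);
  for (int r = 1; r < nranks; r++)
    displs[r] = displs[r-1] + recvcounts[r-1];
  all.resize(displs[nranks-1] + recvcounts[nranks-1]);

  MPI_Allgatherv(mydata.data(), mycount, MPI_INT,
                 all.data(), recvcounts.data(), displs.data(), MPI_INT, comm);
}
==============================================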

Are these problems with Open MPI, and are there any known workarounds?

Thanks,
Justin
