Hi, I am attempting to debug a memory corruption problem in an MPI program using Valgrind. However, when I run under Valgrind I get semi-random segfaults and Valgrind errors pointing into the Open MPI library. Here is an example of such a segfault:
==6153==
==6153== Invalid read of size 8
==6153==    at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153==    by 0x182ABACB: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0x182A3040: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==    by 0xB425DD3: PMPI_Isend (in /usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==    by 0x7B83DA8: int Uintah::SFC<double>::MergeExchange<unsigned char>(int, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:2989)
==6153==    by 0x7B84A8F: void Uintah::SFC<double>::Batchers<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3730)
==6153==    by 0x7B8857B: void Uintah::SFC<double>::Cleanup<unsigned char>(std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&, std::vector<Uintah::History<unsigned char>, std::allocator<Uintah::History<unsigned char> > >&) (SFC.h:3695)
==6153==    by 0x7B88CC6: void Uintah::SFC<double>::Parallel0<3, unsigned char>() (SFC.h:2928)
==6153==    by 0x7C00AAB: void Uintah::SFC<double>::Parallel<3, unsigned char>() (SFC.h:1108)
==6153==    by 0x7C0EF39: void Uintah::SFC<double>::GenerateDim<3>(int) (SFC.h:694)
==6153==    by 0x7C0F0F2: Uintah::SFC<double>::GenerateCurve(int) (SFC.h:670)
==6153==    by 0x7B30CAC: Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle<Uintah::Level> const&, int*) (DynamicLoadBalancer.cc:429)
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd

Thread "main" (pid 6153) caught signal SIGSEGV at address (nil) (segmentation violation)

The code for our Isend at SFC.h:2989 does not seem to have any errors:

=============================================
MergeInfo<BITS> myinfo, theirinfo;
MPI_Request srequest, rrequest;
MPI_Status status;

myinfo.n = n;
if (n != 0)
{
  myinfo.min = sendbuf[0].bits;
  myinfo.max = sendbuf[n-1].bits;
}
//cout << rank << " n:" << n << " min:" << (int)myinfo.min << " max:" << (int)myinfo.max << endl;

MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
==============================================

myinfo is a struct located on the stack, to is the rank of the processor the message is being sent to, and srequest is also on the stack. When I don't run under Valgrind, my program runs past this point just fine. I am currently using Open MPI 1.3 from the Debian unstable branch. I also see the same type of segfault in a different portion of the code involving an MPI_Allgatherv, which can be seen below:

==============================================
==22736== Use of uninitialised value of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
==22736==    by 0x4089AE: main (sus.cc:629)
================================================================

Are these problems with Open MPI, and are there any known workarounds?

Thanks,
Justin