I was able to get rid of the segfaults/invalid reads by disabling the shared memory path. Valgrind still reported an uninitialized-memory error in the same spot, which I believe is due to the struct being padded for alignment. I added a suppression and was able to get past this part just fine.
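In case it is useful to anyone else who hits the padding warning: instead of a suppression, it should also work to zero the struct before filling it in, so that the padding bytes are defined when the whole thing goes out as raw MPI_BYTE data. A rough, untested sketch based on the snippet quoted below (it assumes MergeInfo<BITS> is a plain struct, which seems to be the case since it is sent as raw bytes; memset comes from <cstring>):

=============================================
MergeInfo<BITS> myinfo, theirinfo;

// zero everything, including any compiler-inserted padding bytes,
// so the buffer handed to MPI_Isend is fully initialized
std::memset(&myinfo, 0, sizeof(myinfo));

myinfo.n = n;
if (n != 0)
{
  myinfo.min = sendbuf[0].bits;
  myinfo.max = sendbuf[n-1].bits;
}

// the send itself is unchanged
MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
=============================================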
Thanks,
Justin

On Thu, Jul 9, 2009 at 5:16 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> On Jul 7, 2009, at 11:47 AM, Justin wrote:
>
>> (Sorry if this is posted twice; I sent the same email yesterday but it
>> never appeared on the list.)
>
> Sorry for the delay in replying. FWIW, I got your original message as
> well.
>
>> Hi, I am attempting to debug memory corruption in an MPI program
>> using valgrind. However, when I run with valgrind I get semi-random
>> segfaults and valgrind messages within the Open MPI library. Here is an
>> example of such a segfault:
>>
>> ==6153==
>> ==6153== Invalid read of size 8
>> ==6153==    at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
>> ...
>> ==6153== Address 0x10 is not stack'd, malloc'd or (recently) free'd
>> Thread "main" (pid 6153) caught signal SIGSEGV at address (nil)
>> (segmentation violation)
>>
>> Looking at the code for our Isend at SFC.h:298, I do not see any
>> errors:
>>
>> =============================================
>> MergeInfo<BITS> myinfo, theirinfo;
>>
>> MPI_Request srequest, rrequest;
>> MPI_Status status;
>>
>> myinfo.n = n;
>> if (n != 0)
>> {
>>   myinfo.min = sendbuf[0].bits;
>>   myinfo.max = sendbuf[n-1].bits;
>> }
>> //cout << rank << " n:" << n << " min:" << (int)myinfo.min << " max:" << (int)myinfo.max << endl;
>>
>> MPI_Isend(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm, &srequest);
>> =============================================
>>
>> myinfo is a struct located on the stack, to is the rank of the processor
>> that the message is being sent to, and srequest is also on the stack.
>> In addition, this message is waited on prior to exiting this block of
>> code, so they still exist on the stack. When I don't run with valgrind,
>> my program runs past this point just fine.
>
> Strange. I can't think of an immediate reason as to why this would happen
> -- does it also happen if you use a blocking send (vs. an Isend)? Is myinfo
> a complex object, or a variable-length object?
>
>> I am currently using openmpi 1.3 from the debian unstable branch. I
>> also see the same type of segfault in a different portion of the code
>> involving an MPI_Allgather, which can be seen below:
>>
>> ==============================================
>> ==22736== Use of uninitialised value of size 8
>> ==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
>> ==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>> ==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
>> ==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
>> ==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
>> ==22736==    by 0x4089AE: main (sus.cc:629)
>> ==22736==
>> ==22736== Invalid read of size 8
>> ==22736==    at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
>> ==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
>> ==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60)
>> ==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==    by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==    by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243)
>> ==22736==    by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117)
>> ==22736==    by 0x4089AE: main (sus.cc:629)
>> ================================================================
>>
>> Are these problems with Open MPI, and are there any known workarounds?
>
> These are new to me. The problem does seem to occur with OMPI's shared
> memory device; you might want to try a different point-to-point device
> (e.g., tcp?) to see if the problem goes away. But be aware that the problem
> "going away" does not really pinpoint the location of the problem -- moving
> to a slower transport (like tcp) may simply change timing such that the
> problem does not occur. I.e., the problem could still exist in either your
> code or OMPI -- this would simply be a workaround.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
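P.S. For the archives, since switching transports came up above: the shared memory BTL can be excluded (or the transports named explicitly) on the mpirun command line via the btl MCA parameter. Something along these lines, where the process count and executable name are just placeholders:

=============================================
# run without the shared memory transport
mpirun --mca btl ^sm -np 4 ./my_app

# or select the transports explicitly (tcp, plus self for loopback)
mpirun --mca btl tcp,self -np 4 ./my_app
=============================================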