I was able to get rid of the segfaults/invalid reads by disabling the
shared memory path.  Valgrind still reported an uninitialized-memory error
in the same spot, which I believe is due to the struct being padded for
alignment.  I added a suppression and was able to get past this part just
fine.
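
In case it is useful to anyone else, here is roughly what that looked like.
Treat it as a sketch: "my_app" is just a placeholder for our binary, the
BTL name may differ in your build, and the real suppression entry is best
generated with valgrind's --gen-suppressions=all rather than copied from
here.

   # run with the shared memory BTL excluded
   mpirun --mca btl ^sm -np 4 valgrind ./my_app

   # suppression for the uninitialised-padding warning; take the actual
   # error kind (Value8/Cond/etc.) and frames from the valgrind output
   {
      openmpi_struct_padding
      Memcheck:Value8
      obj:/usr/lib/openmpi/lib/openmpi/*
   }

An alternative to the suppression would presumably be to zero-fill the
struct before the send (e.g. memset(&myinfo, 0, sizeof(myinfo));) so the
padding bytes are defined, but the suppression was enough here.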

Thanks,
Justin

On Thu, Jul 9, 2009 at 5:16 AM, Jeff Squyres <jsquy...@cisco.com> wrote:

> On Jul 7, 2009, at 11:47 AM, Justin wrote:
>
>  (Sorry if this is posted twice, I sent the same email yesterday but it
>> never appeared on the list).
>>
>>
> Sorry for the delay in replying.  FWIW, I got your original message as
> well.
>
>  Hi, I am attempting to debug a memory corruption in an MPI program
>> using valgrind.  However, when I run with valgrind I get semi-random
>> segfaults and valgrind messages from within the Open MPI library.  Here
>> is an example of such a segfault:
>>
>> ==6153==
>> ==6153== Invalid read of size 8
>> ==6153==    at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
>> mca_btl_sm.so)
>>
>>  ...
>
>> ==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
>> ^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil)
>> (segmentation violation)
>>
>> The code for our isend at SFC.h:298 does not seem to have any errors:
>>
>> =============================================
>>  MergeInfo<BITS> myinfo,theirinfo;
>>
>>  MPI_Request srequest, rrequest;
>>  MPI_Status status;
>>
>>  myinfo.n=n;
>>  if(n!=0)
>>  {
>>    myinfo.min=sendbuf[0].bits;
>>    myinfo.max=sendbuf[n-1].bits;
>>  }
>>  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:"
>> << (int)myinfo.max << endl;
>>
>>  MPI_Isend(&myinfo,sizeof(MergeInfo<BITS>),MPI_BYTE,to,0,Comm,&srequest);
>> ==============================================
>>
>> myinfo is a struct located on the stack, to is the rank of the processor
>> that the message is being sent to, and srequest is also on the stack.
>> In addition, this message is waited on prior to exiting this block of
>> code, so these variables still exist on the stack.  When I don't run
>> with valgrind, my program runs past this point just fine.
>>
>>
> Strange.  I can't think of an immediate reason as to why this would happen
> -- does it also happen if you use a blocking send (vs. an Isend)?  Is myinfo
> a complex object, or a variable-length object?
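>
> E.g., as a quick test, something like this (untested sketch; the matching
> MPI_Wait on srequest would go away as well):
>
>    MPI_Send(&myinfo, sizeof(MergeInfo<BITS>), MPI_BYTE, to, 0, Comm);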
>
>
>  I am currently using Open MPI 1.3 from the Debian unstable branch.  I
>> also see the same type of segfault in a different portion of the code
>> involving an MPI_Allgather, which can be seen below:
>>
>> ==============================================
>> ==22736== Use of uninitialised value of size 8
>> ==22736==    at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==    by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==    by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736==    by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736==    by 0x4089AE: main (sus.cc:629)
>> ==22736==
>> ==22736== Invalid read of size 8
>> ==22736==    at 0x19104775: mca_btl_sm_component_progress
>> (opal_list.h:322)
>> ==22736==    by 0x1382CE09: opal_progress (opal_progress.c:207)
>> ==22736==    by 0xB404264: ompi_request_default_wait_all (condition.h:99)
>> ==22736==    by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
>> (coll_tuned_util.c:55)
>> ==22736==    by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
>> (coll_tuned_util.h:60)
>> ==22736==    by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
>> ==22736==    by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
>> ==22736==    by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
>> ==22736==    by 0x6465457:
>> Uintah::Grid::problemSetup(Uintah::Handle<Uintah::ProblemSpec> const&,
>> Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
>> ==22736==    by 0x8345759: Uintah::SimulationController::gridSetup()
>> (SimulationController.cc:243)
>> ==22736==    by 0x834F418: Uintah::AMRSimulationController::run()
>> (AMRSimulationController.cc:117)
>> ==22736==    by 0x4089AE: main (sus.cc:629)
>> ================================================================
>>
>> Are these problems with Open MPI, and are there any known workarounds?
>>
>>
>
> These are new to me.  The problem does seem to occur with OMPI's shared
> memory device; you might want to try a different point-to-point device
> (e.g., tcp?) to see if the problem goes away.  But be aware that the problem
> "going away" does not really pinpoint the location of the problem -- moving
> to a slower transport (like tcp) may simply change timing such that the
> problem does not occur.  I.e., the problem could still exist in either your
> code or OMPI -- this would simply be a workaround.
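>
> For example, something like this (syntax from memory; "ompi_info" will
> list the exact BTL component names in your build, and ./your_app just
> stands in for your executable):
>
>    mpirun --mca btl tcp,self -np 4 ./your_app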
>
> --
> Jeff Squyres
> Cisco Systems
>
>
