To Whom This May Concern:

I am having problems with an OpenMPI application I am writing on the
Solaris/Intel AMD64 platform.  I am using OpenMPI 1.3.2 which I compiled
myself using the Solaris C/C++ compiler.

My application was crashing (signal 11) inside a call to malloc() made
from my own code.  Suspecting heap corruption, I ran the application
under Purify.  Purify found several problems inside the OpenMPI library,
and I believe one of them is serious and may be the cause of the crash I
was chasing.

The serious error is an Array Bounds Write (ABW), which occurred 824
times across 312 calls to MPI_Recv().  This means that something inside
the OpenMPI library is writing past the end of a heap block.  Here are
the two stack traces recorded for this error:

Stack Trace 1 (Occurred 596 times)

memcpy rtlib.o
unpack_predefined_data [datatype_unpack.h:41]
MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
ompi_generic_simple_unpack [datatype_unpack.c:419]
ompi_convertor_unpack [convertor.c:314]
mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
mca_btl_sm_component_progress [btl_sm_component.c:427]
opal_progress [opal_progress.c:207]
opal_condition_wait [condition.h:99]
<Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
<Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664
bytes.>

Stack Trace 2 (Occurred 228 times)

memcpy rtlib.o
unpack_predefined_data [datatype_unpack.h:41]
MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
ompi_generic_simple_unpack [datatype_unpack.c:419]
ompi_convertor_unpack [convertor.c:314]
mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
mca_pml_ob1_recv_req_start [pml_ob1_recvreq.c:1008]
mca_pml_ob1_recv [pml_ob1_irecv.c:103]
MPI_Recv [precv.c:75]
<Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
<Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664
bytes.>


I'm not that familiar with the inner workings of the OpenMPI library,
but I tried to debug it anyway.  I noticed that it was copying many
extra bytes for MPI_LONG and MPI_DOUBLE types.  On my system MPI_LONG
should be 4 bytes, but 16 bytes were being copied; likewise MPI_DOUBLE
should be 8 bytes, but 64 bytes were being copied.  It appears the
_copy_blength variable is being set too high, though I'm not sure why.
The Purify report above also shows a 64-byte write, which matches the
64-byte copy my debugging shows for every MPI_DOUBLE, where I believe
only 8 bytes should be copied.  So I am fairly confident _copy_blength
is being set too high.


Interestingly enough, calls to MPI_Isend() were generating an ABR
(Array Bounds Read) error at the exact same line of code.  An ABR can
be fatal if the address being read is not valid, but an ABW is usually
worse, because it definitely overwrites memory that may be in use for
something else.  I expect that fixing the ABW will fix the ABR as well,
since it is the same line of code.

Purify also found 14 UMR (Uninitialized Memory Read) errors inside the
OpenMPI library.  These can be serious if any decisions are made based
on the uninitialized values.  For now, though, let's tackle the serious
error reported above first; I will send the UMR reports next.

Any help you can provide would be greatly appreciated.

Thanks,
Brian