To Whom This May Concern:

I've been trying to dig a little deeper into this problem and found some
additional information.

First, the stack traces for the ABR and ABW were different. The ABR problem
occurred in datatype_pack.h while the ABW problem occurred in
datatype_unpack.h.  The underlying problem still appears to be the same: both
errors occur during a call to MEMCPY_CSUM().

I also found that two different variables play into this bug:
_copy_blength and _copy_count.  At the top of the function, both of these
variables hold the same value: 2 for MPI_SHORT, 4 for MPI_LONG, and 8 for
MPI_DOUBLE.  They are then multiplied together to get the size of the
memcpy().  Unfortunately, the correct size is the value of either variable
before the multiplication, not its square.  There also appears to be an
optimization where two adjacent fields are sometimes merged into a single
MPI_BYTE element, whose size is likewise computed from these squared values.

I wrote a small test program to illustrate the problem and attached it to
this email.  First, I configured openmpi 1.3.2 with the following options:

./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local \
    --disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug \
    --enable-mem-profile

I then modified datatype_pack.h & datatype_unpack.h, located in the
openmpi-1.3.2/ompi/datatype directory, to produce additional
debugging output.  The new versions are attached to this email.

Then, I executed "make all install"

Then, I wrote the attached test.c program and ran it.  You can find the output below.
The problems appear in red.

0: sizes     '3'  '4'  '8'  '2'
0: offsets   '0'  '4'  '8'  '16'
0: addresses '134510640' '134510644' '134510648' '134510656'
0: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'  _copy_count='3'  _source='134510640'
0: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'  _copy_count='4'  _source='134510644'
0: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'  _copy_count='8'  _source='134510648'
0: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'  _copy_count='2'  _source='134510656'
0: one='22'  two='222'  three='33.300000'  four='44'
1: sizes     '3'  '4'  '8'  '2'
1: offsets   '0'  '4'  '8'  '16'
1: addresses '134510640' '134510644' '134510648' '134510656'
1: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'  _copy_count='3'  _destination='134510640'
1: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'  _copy_count='4'  _destination='134510644'
1: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'  _copy_count='8'  _destination='134510648'
1: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'  _copy_count='2'  _destination='134510656'
1: one='22'  two='222'  three='33.300000'  four='44'

You can see from the output that the MPI_Send & MPI_Recv functions are
reading or writing too much data from my structure, causing an overflow
condition to occur.  I believe this is causing my application to crash.

Any help on this problem would be appreciated.  If there is anything else
that you need from me, just let me know.

Thanks,
Brian



On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank <brianbl...@gmail.com> wrote:

> To Whom This May Concern:
>
> I am having problems with an OpenMPI application I am writing on the
> Solaris/Intel AMD64 platform.  I am using OpenMPI 1.3.2 which I compiled
> myself using the Solaris C/C++ compiler.
>
> My application was crashing (signal 11) inside a call to malloc() which my
> code was running.  I thought it might be a memory overflow error that was
> causing this, so I fired up Purify.  Purify found several problems inside
> the OpenMPI library.  I think one of the errors is serious and might be
> causing the problems I was looking for.
>
> The serious error is an Array Bounds Write (ABW) which occurred 824 times
> through 312 calls to MPI_Recv().  This error means that something inside the
> OpenMPI library is writing to memory where it shouldn't be writing to.  Here
> is the stack trace at the time of this error:
>
> Stack Trace 1 (Occurred 596 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
>  MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
> mca_btl_sm_component_progress [btl_sm_component.c:427]
> opal_progress [opal_progress.c:207]
> opal_condition_wait [condition.h:99]
> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664
> bytes.>
>
> Stack Trace 2 (Occurred 228 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
>  MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
> mca_pml_ob1_Recv_req_start [pml_ob1_recvreq.c:1008]
> mca_pml_ob1_recv [pml_ob1_irecv.c:103]
> MPI_Recv [precv.c:75]
> <Writing 64 bytes to 0x821f738 in heap (16 bytes at 0x821f768 illegal).>
> <Address 0x821f738 is 616 bytes into a malloc'd block at 0x821f4d0 of 664
> bytes.>
>
>
> I'm not that familiar with the inner workings of the openmpi library, but I
> tried to debug it anyway.  I noticed that it was copying a lot of extra
> bytes for MPI_LONG and MPI_DOUBLE types.  On my system, MPI_LONG should have
> been 4 bytes, but was copying 16 bytes.  Also, MPI_DOUBLE should have been 8
> bytes, but was copying 64 bytes.  It seems the _copy_blength variable was
> being set too high, but I'm not sure why.  The above error also shows 64
> bytes being written, where my debugging shows a 64 byte copy for all MPI_DOUBLE
> types, which I feel should have been 8 bytes.  Therefore, I really believe
> _copy_blength is being set too high.
>
>
> Interestingly enough, the call to MPI_Isend() was generating an ABR (Array
> Bounds Read) error in the exact same line of code.  The ABR error can
> sometimes be fatal if the bytes being read are not at a legal address, but the ABW
> error is usually far more serious because it is definitely writing
> into memory that is probably used for something else.  I'm sure that if we
> fix the ABW error, the ABR error should fix itself too as it's the same line
> of code.
>
> Purify also found 14 UMR (Uninitialized memory read) errors inside the
> OpenMPI library.  Sometimes this can be really bad if there are any
> decisions being made as a result of reading this memory location.  But for
> now, let's solve the serious error I reported above first, then I will send
> you the UMR errors next.
>
> Any help you can provide would be greatly appreciated.
>
> Thanks,
> Brian
>
>
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#ifndef DATATYPE_PACK_H_HAS_BEEN_INCLUDED
#define DATATYPE_PACK_H_HAS_BEEN_INCLUDED

static inline void pack_predefined_data( ompi_convertor_t* CONVERTOR,
                                         dt_elem_desc_t* ELEM,
                                         uint32_t* COUNT,
                                         unsigned char** SOURCE,
                                         unsigned char** DESTINATION,
                                         size_t* SPACE )
{
    uint32_t _copy_count = *(COUNT);
    size_t _copy_blength;
    ddt_elem_desc_t* _elem = &((ELEM)->elem);
    unsigned char* _source = (*SOURCE) + _elem->disp;
    int mpi_rank;  /* debugging only: rank shown in the trace output */

    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

    _copy_blength = ompi_ddt_basicDatatypes[_elem->common.type]->size;
    if( (_copy_count * _copy_blength) > *(SPACE) ) {
        _copy_count = (uint32_t)(*(SPACE) / _copy_blength);
        if( 0 == _copy_count ) return;  /* nothing to do */
    }

    if( (ptrdiff_t)_copy_blength == _elem->extent ) {
        _copy_blength *= _copy_count;
        /* the extent and the size of the basic datatype are equals */
        OMPI_DDT_SAFEGUARD_POINTER( _source, _copy_blength, (CONVERTOR)->pBaseBuf,
                                    (CONVERTOR)->pDesc, (CONVERTOR)->count );
        DO_DEBUG( opal_output( 0, "pack 1. memcpy( %p, %p, %lu ) => space %lu\n",
                               *(DESTINATION), _source, (unsigned long)_copy_blength, (unsigned long)(*(SPACE)) ); );
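        /* Debugging aid: print the computed copy length next to the basic
         * datatype size.  The casts match the format specifiers. */
        printf("%d: name='%s'  _copy_blength='%lu'  orig_copy_blength='%lu'  _copy_count='%u'  _source='%lu'\n",
               mpi_rank, ompi_ddt_basicDatatypes[_elem->common.type]->name,
               (unsigned long)_copy_blength,
               (unsigned long)ompi_ddt_basicDatatypes[_elem->common.type]->size,
               (unsigned)_copy_count, (unsigned long)_source);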
        MEMCPY_CSUM( *(DESTINATION), _source, _copy_blength, (CONVERTOR) );
        _source        += _copy_blength;
        *(DESTINATION) += _copy_blength;
    } else {
        uint32_t _i;
        for( _i = 0; _i < _copy_count; _i++ ) {
            OMPI_DDT_SAFEGUARD_POINTER( _source, _copy_blength, (CONVERTOR)->pBaseBuf,
                                        (CONVERTOR)->pDesc, (CONVERTOR)->count );
            DO_DEBUG( opal_output( 0, "pack 2. memcpy( %p, %p, %lu ) => space %lu\n",
                                   *(DESTINATION), _source, (unsigned long)_copy_blength, (unsigned long)(*(SPACE) - (_i * _copy_blength)) ); );
            MEMCPY_CSUM( *(DESTINATION), _source, _copy_blength, (CONVERTOR) );
            *(DESTINATION) += _copy_blength;
            _source        += _elem->extent;
        }
        _copy_blength *= _copy_count;
    }
    *(SOURCE)  = _source - _elem->disp;
    *(SPACE)  -= _copy_blength;
    *(COUNT)  -= _copy_count;
}

static inline void pack_contiguous_loop( ompi_convertor_t* CONVERTOR,
                                         dt_elem_desc_t* ELEM,
                                         uint32_t* COUNT,
                                         unsigned char** SOURCE,
                                         unsigned char** DESTINATION,
                                         size_t* SPACE )
{
    ddt_loop_desc_t *_loop = (ddt_loop_desc_t*)(ELEM);
    ddt_endloop_desc_t* _end_loop = (ddt_endloop_desc_t*)((ELEM) + _loop->items);
    unsigned char* _source = (*SOURCE) + _end_loop->first_elem_disp;
    uint32_t _copy_loops = *(COUNT);
    uint32_t _i;

    if( (_copy_loops * _end_loop->size) > *(SPACE) )
        _copy_loops = (uint32_t)(*(SPACE) / _end_loop->size);
    for( _i = 0; _i < _copy_loops; _i++ ) {
        OMPI_DDT_SAFEGUARD_POINTER( _source, _end_loop->size, (CONVERTOR)->pBaseBuf,
                                    (CONVERTOR)->pDesc, (CONVERTOR)->count );
        DO_DEBUG( opal_output( 0, "pack 3. memcpy( %p, %p, %lu ) => space %lu\n",
                               *(DESTINATION), _source, (unsigned long)_end_loop->size, (unsigned long)(*(SPACE) - _i * _end_loop->size) ); );
        MEMCPY_CSUM( *(DESTINATION), _source, _end_loop->size, (CONVERTOR) );
        *(DESTINATION) += _end_loop->size;
        _source        += _loop->extent;
    }
    *(SOURCE) = _source - _end_loop->first_elem_disp;
    *(SPACE) -= _copy_loops * _end_loop->size;
    *(COUNT) -= _copy_loops;
}

#define PACK_PREDEFINED_DATATYPE( CONVERTOR,    /* the convertor */                       \
                                  ELEM,         /* the basic element to be packed */      \
                                  COUNT,        /* the number of elements */              \
                                  SOURCE,       /* the source pointer (char*) */          \
                                  DESTINATION,  /* the destination pointer (char*) */     \
                                  SPACE )       /* the space in the destination buffer */ \
pack_predefined_data( (CONVERTOR), (ELEM), &(COUNT), &(SOURCE), &(DESTINATION), &(SPACE) )

#define PACK_CONTIGUOUS_LOOP( CONVERTOR, ELEM, COUNT, SOURCE, DESTINATION, SPACE ) \
    pack_contiguous_loop( (CONVERTOR), (ELEM), &(COUNT), &(SOURCE), &(DESTINATION), &(SPACE) )

#endif  /* DATATYPE_PACK_H_HAS_BEEN_INCLUDED */
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 * Copyright (c) 2004-2006 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#ifndef DATATYPE_UNPACK_H_HAS_BEEN_INCLUDED
#define DATATYPE_UNPACK_H_HAS_BEEN_INCLUDED

static inline void unpack_predefined_data( ompi_convertor_t* CONVERTOR, /* the convertor */
                                           dt_elem_desc_t* ELEM,         /* the element description */
                                           uint32_t* COUNT,              /* the number of elements */
                                           unsigned char** SOURCE,       /* the source pointer */
                                           unsigned char** DESTINATION,  /* the destination pointer */
                                           size_t* SPACE )               /* the space in the destination buffer */
{
    uint32_t _copy_count = *(COUNT);
    size_t _copy_blength;
    ddt_elem_desc_t* _elem = &((ELEM)->elem);
    unsigned char* _destination = (*DESTINATION) + _elem->disp;
    int mpi_rank;  /* debugging only: rank shown in the trace output */

    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);

    _copy_blength = ompi_ddt_basicDatatypes[_elem->common.type]->size;
    if( (_copy_count * _copy_blength) > *(SPACE) ) {
        _copy_count = (uint32_t)(*(SPACE) / _copy_blength);
        if( 0 == _copy_count ) return;  /* nothing to do */
    }

    if( _copy_blength == (uint32_t)_elem->extent ) {
        _copy_blength *= _copy_count;
        /* the extent and the size of the basic datatype are equals */
        OMPI_DDT_SAFEGUARD_POINTER( _destination, _copy_blength, (CONVERTOR)->pBaseBuf,
                                    (CONVERTOR)->pDesc, (CONVERTOR)->count );
        DO_DEBUG( opal_output( 0, "unpack 1. memcpy( %p, %p, %lu ) => space %lu\n",
                               _destination, *(SOURCE), (unsigned long)_copy_blength, (unsigned long)(*(SPACE)) ); );
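        /* Debugging aid: print the computed copy length next to the basic
         * datatype size.  The casts match the format specifiers. */
        printf("%d: name='%s'  _copy_blength='%lu'  orig_copy_blength='%lu'  _copy_count='%u'  _destination='%lu'\n",
               mpi_rank, ompi_ddt_basicDatatypes[_elem->common.type]->name,
               (unsigned long)_copy_blength,
               (unsigned long)ompi_ddt_basicDatatypes[_elem->common.type]->size,
               (unsigned)_copy_count, (unsigned long)_destination);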
        MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
        *(SOURCE)    += _copy_blength;
        _destination += _copy_blength;
    } else {
        uint32_t _i;
        for( _i = 0; _i < _copy_count; _i++ ) {
            OMPI_DDT_SAFEGUARD_POINTER( _destination, _copy_blength, (CONVERTOR)->pBaseBuf,
                                        (CONVERTOR)->pDesc, (CONVERTOR)->count );
            DO_DEBUG( opal_output( 0, "unpack 2. memcpy( %p, %p, %lu ) => space %lu\n",
                                   _destination, *(SOURCE), (unsigned long)_copy_blength, (unsigned long)(*(SPACE) - (_i * _copy_blength)) ); );
            MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
            *(SOURCE)    += _copy_blength;
            _destination += _elem->extent;
        }
        _copy_blength *= _copy_count;
    }
    (*DESTINATION)  = _destination - _elem->disp;
    *(SPACE)       -= _copy_blength;
    *(COUNT)       -= _copy_count;
}

static inline void unpack_contiguous_loop( ompi_convertor_t* CONVERTOR,
                                           dt_elem_desc_t* ELEM,
                                           uint32_t* COUNT,
                                           unsigned char** SOURCE,
                                           unsigned char** DESTINATION,
                                           size_t* SPACE )
{
    ddt_loop_desc_t *_loop = (ddt_loop_desc_t*)(ELEM);
    ddt_endloop_desc_t* _end_loop = (ddt_endloop_desc_t*)((ELEM) + _loop->items);
    unsigned char* _destination = (*DESTINATION) + _end_loop->first_elem_disp;
    uint32_t _copy_loops = *(COUNT);
    uint32_t _i;

    if( (_copy_loops * _end_loop->size) > *(SPACE) )
        _copy_loops = (uint32_t)(*(SPACE) / _end_loop->size);
    for( _i = 0; _i < _copy_loops; _i++ ) {
        OMPI_DDT_SAFEGUARD_POINTER( _destination, _end_loop->size, (CONVERTOR)->pBaseBuf,
                                    (CONVERTOR)->pDesc, (CONVERTOR)->count );
        DO_DEBUG( opal_output( 0, "unpack 3. memcpy( %p, %p, %lu ) => space %lu\n",
                               _destination, *(SOURCE), (unsigned long)_end_loop->size, (unsigned long)(*(SPACE) - _i * _end_loop->size) ); );
        MEMCPY_CSUM( _destination, *(SOURCE), _end_loop->size, (CONVERTOR) );
        *(SOURCE)    += _end_loop->size;
        _destination += _loop->extent;
    }
    *(DESTINATION) = _destination - _end_loop->first_elem_disp;
    *(SPACE)      -= _copy_loops * _end_loop->size;
    *(COUNT)      -= _copy_loops;
}

#define UNPACK_PREDEFINED_DATATYPE( CONVERTOR, ELEM, COUNT, SOURCE, DESTINATION, SPACE ) \
    unpack_predefined_data( (CONVERTOR), (ELEM), &(COUNT), &(SOURCE), &(DESTINATION), &(SPACE) )

#define UNPACK_CONTIGUOUS_LOOP( CONVERTOR, ELEM, COUNT, SOURCE, DESTINATION, SPACE ) \
    unpack_contiguous_loop( (CONVERTOR), (ELEM), &(COUNT), &(SOURCE), &(DESTINATION), &(SPACE) )

#endif  /* DATATYPE_UNPACK_H_HAS_BEEN_INCLUDED */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <mpi.h>

typedef struct _tstruct {
        char   one[3];
        long   two;
        double three;
        short  four;
} TSTRUCT;

int main(int argc, char *argv[]) {
        TSTRUCT tstruct;
        int mpi_size=0, mpi_rank=0;
        int sizes[4];
        MPI_Aint offsets[4];
        MPI_Datatype types[4];
        MPI_Datatype mpitype;
        int status;
        MPI_Status mpistatus;

        // Initialize MPI
        if ((status = MPI_Init(&argc, &argv)) != MPI_SUCCESS) {
                printf("MPI_Init failed - %d.\n", status);
                (void)MPI_Abort(MPI_COMM_WORLD, status);
                abort();
        }

        // Get size & rank info
        MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

        if(mpi_size != 2) {
                printf("This test program must be run with exactly '2' processes.\n");
                (void)MPI_Abort(MPI_COMM_WORLD, 1);
                abort();
        }

        // Sleep to try and get Process 1 output to appear first... just for convenience
        if(mpi_rank==1) sleep(1);

        // Initialize data that I will send from process 0 to process 1
        memset(&tstruct, '\0', sizeof(TSTRUCT));
        if(mpi_rank==0) {
                sprintf(tstruct.one, "22");
                tstruct.two= 222;
                tstruct.three = 33.3;
                tstruct.four= 44;
        }

        // Build custom type structure
        sizes[0] = (int)sizeof(tstruct.one);
        sizes[1] = (int)sizeof(tstruct.two);
        sizes[2] = (int)sizeof(tstruct.three);
        sizes[3] = (int)sizeof(tstruct.four);
        offsets[0] = 0;
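        /* Byte offsets via char* arithmetic; casting pointers to int truncates on 64-bit builds */
        offsets[1] = (MPI_Aint)((char*)&tstruct.two - (char*)&tstruct.one);
        offsets[2] = (MPI_Aint)((char*)&tstruct.three - (char*)&tstruct.one);
        offsets[3] = (MPI_Aint)((char*)&tstruct.four - (char*)&tstruct.one);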
        types[0] = MPI_CHAR;
        types[1] = MPI_LONG;
        types[2] = MPI_DOUBLE;
        types[3] = MPI_SHORT;
        printf("%d: sizes     '%d'  '%d'  '%d'  '%d'\n", mpi_rank, sizes[0], sizes[1], sizes[2], sizes[3]);
        printf("%d: offsets   '%ld'  '%ld'  '%ld'  '%ld'\n", mpi_rank, (long)offsets[0], (long)offsets[1],
                                                        (long)offsets[2], (long)offsets[3]);
        printf("%d: addresses '%lu' '%lu' '%lu' '%lu'\n", mpi_rank, (unsigned long)&tstruct.one,
                                                        (unsigned long)&tstruct.two, (unsigned long)&tstruct.three,
                                                        (unsigned long)&tstruct.four);
        if((status = MPI_Type_struct(4, &sizes[0], &offsets[0], &types[0], &mpitype)) != MPI_SUCCESS) {
                printf("MPI_Type_struct() failed - %d.\n", status);
                (void)MPI_Abort(MPI_COMM_WORLD, status);
                abort();
        }
        if((status = MPI_Type_commit(&mpitype)) != MPI_SUCCESS) {
                printf("MPI_Type_commit() failed - %d.\n", status);
                (void)MPI_Abort(MPI_COMM_WORLD, status);
                abort();
        }

        if(mpi_rank==0) {
                if((status = MPI_Send(&tstruct.one, 1/*count*/, mpitype, 
                                        1/*dest*/, 0/*tag*/, MPI_COMM_WORLD)) != MPI_SUCCESS) {
                        printf("MPI_Send() failed - %d.\n", status);
                        (void)MPI_Abort(MPI_COMM_WORLD, status);
                        abort();
                }
        } else {
                if((status = MPI_Recv(&tstruct.one, 1/*count*/, mpitype, 0/*source*/, 
                                        MPI_ANY_TAG, MPI_COMM_WORLD, &mpistatus)) != MPI_SUCCESS) {
                        printf("MPI_Recv() failed - %d.\n", status);
                        (void)MPI_Abort(MPI_COMM_WORLD, status);
                        abort();
                }
        }

        // Print data
        printf("%d: one='%s'  two='%ld'  three='%f'  four='%d'\n", mpi_rank, tstruct.one, tstruct.two, tstruct.three,
                                                                        tstruct.four);

        MPI_Finalize();

        return 0;
}
