Hi,

I'm having a problem with MPI_Gather in Open MPI 1.3.3. The code that fails here works fine with MPICH 1.2.5, MPICH2 1.1 and HP-MPI 2.2.5 (I'm not blaming anyone, I just want to understand!). My code runs locally on a dual-processor machine under 32-bit Debian, with 2 processes, and fails during an MPI_Gather call with the following message:
[sabrina:14631] *** An error occurred in MPI_Gather
[sabrina:14631] *** on communicator MPI COMMUNICATOR 37 SPLIT FROM 5
[sabrina:14631] *** MPI_ERR_TRUNCATE: message truncated
[sabrina:14631] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
When I run it with memchecker, Valgrind produces the following message about an uninitialised value (I know that Valgrind is sometimes wrong about this kind of error):
==14634==
==14634== Conditional jump or move depends on uninitialised value(s)
==14634==    at 0x42E3A4C: ompi_convertor_need_buffers (convertor.h:175)
==14634==    by 0x42E3800: mca_pml_ob1_recv_request_ack (pml_ob1_recvreq.c:264)
==14634==    by 0x42E5566: mca_pml_ob1_recv_request_progress_rndv (pml_ob1_recvreq.c:554)
==14634==    by 0x42E1316: mca_pml_ob1_recv_frag_match (pml_ob1_recvfrag.c:641)
==14634==    by 0x42DFFDD: mca_pml_ob1_recv_frag_callback_rndv (pml_ob1_recvfrag.c:259)
==14634==    by 0x42322E7: mca_btl_sm_component_progress (btl_sm_component.c:426)
==14634==    by 0x44E3CF4: opal_progress (opal_progress.c:207)
==14634==    by 0x41A6E66: opal_condition_wait (condition.h:99)
==14634==    by 0x41A73E6: ompi_request_default_wait_all (req_wait.c:262)
==14634==    by 0x424E99A: ompi_coll_tuned_gather_intra_linear_sync (coll_tuned_gather.c:328)
==14634==    by 0x423CB98: ompi_coll_tuned_gather_intra_dec_fixed (coll_tuned_decision_fixed.c:718)
==14634==    by 0x4252B9E: mca_coll_sync_gather (coll_sync_gather.c:46)
==14634==

This is the first error message, apart from those produced during MPI_Init(). If I attach the debugger, I get the following backtrace:
0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
175     ../../../../../../ompi/datatype/convertor.h: No such file or directory.
        in ../../../../../../ompi/datatype/convertor.h
(gdb) where
#0  0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
#1  0x042e3801 in mca_pml_ob1_recv_request_ack (recvreq=0x4a2bf80,
    hdr=0x95b0a90, bytes_received=4032)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:264
#2  0x042e5567 in mca_pml_ob1_recv_request_progress_rndv (recvreq=0x4a2bf80,
    btl=0x4375260, segments=0xbecc3490, num_segments=1)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:554
#3  0x042e1317 in mca_pml_ob1_recv_frag_match (btl=0x4375260, hdr=0x95b0a90,
    segments=0xbecc3490, num_segments=1, type=66)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:641
#4  0x042dffde in mca_pml_ob1_recv_frag_callback_rndv (btl=0x4375260,
    tag=66 'B', des=0xbecc3438, cbdata=0x0)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:259
#5  0x042322e8 in mca_btl_sm_component_progress ()
    at ../../../../../../ompi/mca/btl/sm/btl_sm_component.c:426
#6  0x044e3cf5 in opal_progress () at ../../../opal/runtime/opal_progress.c:207
#7  0x041a6e67 in opal_condition_wait (c=0x4382700, m=0x4382760)
    at ../../../opal/threads/condition.h:99
#8  0x041a73e7 in ompi_request_default_wait_all (count=2, requests=0x4ef5360,
    statuses=0x0) at ../../../ompi/request/req_wait.c:262
#9  0x0424e99b in ompi_coll_tuned_gather_intra_linear_sync (sbuf=0x4ebd438,
    scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348,
    root=0, comm=0x4d0d8a8, module=0x4d0e220, first_segment_size=1024)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:328
#10 0x0423cb99 in ompi_coll_tuned_gather_intra_dec_fixed (sbuf=0x4ebd438,
    scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348,
    root=0, comm=0x4d0d8a8, module=0x4d0e220)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:718
#11 0x04252b9f in mca_coll_sync_gather (sbuf=0x4ebd438, scount=3016,
    sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348, root=0,
    comm=0x4d0d8a8, module=0x4d0e098)
    at ../../../../../../ompi/mca/coll/sync/coll_sync_gather.c:46
#12 0x041db441 in PMPI_Gather (sendbuf=0x4ebd438, sendcount=3016,
    sendtype=0x4a3aa70, recvbuf=0x4ecda00, recvcount=1, recvtype=0x4f4b348,
    root=0, comm=0x4d0d8a8) at pgather.c:175
#13 0x082a47c9 in MPF_GEMV_SPARSE_INCORE (comm_row=0x4d0ce38,
    comm_col=0x4d0d8a8, transa=84 'T', M=232, N=464, P=232, Q=13,
    ALPHA=0x8d22e88, gBuffer=0x4f4aec0, bufferB=0x4f3f210, bufferC=0x4ebd438)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:160
#14 0x082a592b in MPF_GEMV_SPARSE (TRANSA=0xbecc38f7 "T", ALPHA=0x8d22e88,
    matA=0x4d0b7d0, vecB=0x4cbb7e0, BETA=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:331
#15 0x08251f2a in MPF_GEMV (transa=0x8c937ec "T", alpha=0x8d22e88,
    matA=0x4d0b7d0, vecB=0x4cbb7e0, beta=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/INTERFACE/mat_gemv.c:150
#16 0x080ab641 in main (argc=1, argv=0xbecc3aa4)
    at /home/gsylvand/ACTIPOLE/src/COUCHA/SRC/coucha.c:358
The content of pConvertor is:
(gdb)  p pConvertor[0]
$2 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x43741e0,
    obj_reference_count = 1,
    cls_init_file_name = 0x435687c "../../../../../ompi/mca/pml/base/pml_base_recvreq.c", cls_init_lineno = 42}, remoteArch = 4291428864, flags = 134873088,
  local_size = 0, remote_size = 0, pDesc = 0x0, use_desc = 0x0, count = 0,
  pBaseBuf = 0x0, pStack = 0x4a2c060, stack_size = 5, fAdvance = 0,
  master = 0x485eb60, stack_pos = 0, bConverted = 0, partial_length = 0,
  checksum = 0, csum_ui1 = 0, csum_ui2 = 0, static_stack = {{index = 0,
      type = 0, count = 0, disp = 0}, {index = 0, type = 0, count = 0,
      disp = 0}, {index = 0, type = 0, count = 0, disp = 0}, {index = 0,
      type = 0, count = 0, disp = 0}, {index = 0, type = 0, count = 0,
      disp = 0}}}

The MPI_Gather that fails is a bit complicated, since it uses MPI datatypes created with MPI_Type_vector and MPI_Type_struct. The call is:
/* here we have N=464 P=232 Q=13 */
    bufferC = calloc(P * Q, 2*sizeof(double));
    bufferE = calloc(N * Q, 2*sizeof(double));
....
    ierr = MPI_Gather( bufferC, P*Q, BasicType, bufferE, 1, NStridedType, 0, comm_col );
where BasicType is a double complex created with:
    MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
    MPI_Type_commit(&BasicType);
and NStridedType describes Q blocks of P complexes taken every N, with the extent forced to P complexes, created with:
  MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType); /* Q blocks of P BasicType every N */
  disp[0] = 0;
  type[0] = QPNStridedType;
  blocklen[0] = 1;
  MPI_Type_extent(BasicType, &(disp[1]));
  disp[1] *= P;
  type[1] = MPI_UB;
  blocklen[1] = 1;
  MPI_Type_struct(2, blocklen, disp, type, &NStridedType); /* just to set the extent to P */
  MPI_Type_commit(&NStridedType);
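
For what it's worth, I believe the same extent trick could also be written with MPI_Type_create_resized instead of the MPI_UB marker (the declarations and names below are mine, and I have not checked whether using it changes the behaviour):

  MPI_Aint lb, basic_extent;

  MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType); /* Q blocks of P BasicType every N */
  MPI_Type_get_extent(BasicType, &lb, &basic_extent);   /* extent of one double complex   */
  MPI_Type_create_resized(QPNStridedType, 0, P * basic_extent, &NStridedType);
  MPI_Type_commit(&NStridedType);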

As mentioned earlier, this works with other MPI implementations, and this kind of mechanism is widely used in this code, where it usually works fine.
Moreover, if I replace MPI_Gather with MPI_Allgather, the bug disappears and it works:
ierr = MPI_Allgather( bufferC, P*Q, BasicType, bufferE, 1, NStridedType, comm_col ); CHKERRQ(ierr);
Another strange thing is that if I write a small test.c using these same calls to try to reproduce the bug, it does not fail, it works :(
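For reference, my standalone test looks roughly like this (a simplified sketch with hard-coded sizes, no error checking, and comm_col replaced by MPI_COMM_WORLD, so it is not exactly the failing context; run with 2 processes so that N = 2*P and the interleaved blocks exactly fill the receive buffer):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int          rank, blocklen[2];
    int          N = 464, P = 232, Q = 13;   /* same sizes as in the failing call */
    MPI_Datatype BasicType, QPNStridedType, NStridedType, type[2];
    MPI_Aint     disp[2];
    double      *bufferC, *bufferE;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* double complex as 2 contiguous doubles */
    MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
    MPI_Type_commit(&BasicType);

    /* Q blocks of P BasicType every N, extent forced to P BasicType via MPI_UB */
    MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType);
    disp[0] = 0;  type[0] = QPNStridedType;  blocklen[0] = 1;
    MPI_Type_extent(BasicType, &disp[1]);
    disp[1] *= P; type[1] = MPI_UB;          blocklen[1] = 1;
    MPI_Type_struct(2, blocklen, disp, type, &NStridedType);
    MPI_Type_commit(&NStridedType);

    bufferC = calloc(P * Q, 2 * sizeof(double));
    bufferE = calloc(N * Q, 2 * sizeof(double));

    MPI_Gather(bufferC, P * Q, BasicType, bufferE, 1, NStridedType, 0, MPI_COMM_WORLD);

    printf("rank %d: gather done\n", rank);

    free(bufferC);
    free(bufferE);
    MPI_Type_free(&NStridedType);
    MPI_Type_free(&QPNStridedType);
    MPI_Type_free(&BasicType);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run with mpirun -np 2, this kind of test completes here without the truncation error.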
Any suggestions on something to test?
Thanks in advance for your help,
Best regards,

Guillaume
-- 
Guillaume SYLVAND
