Hi,
I'm having a problem with MPI_Gather in Open MPI 1.3.3. The code that
fails here works fine with MPICH 1.2.5, MPICH2 1.1 and HP-MPI 2.2.5 (I'm
not blaming anyone, I just want to understand!). My code runs locally on a
dual-processor, 32-bit Debian machine, with 2 processes, and fails during an
MPI_Gather call with the following message:
[sabrina:14631] *** An error occurred in MPI_Gather
[sabrina:14631] *** on communicator MPI COMMUNICATOR 37 SPLIT FROM 5
[sabrina:14631] *** MPI_ERR_TRUNCATE: message truncated
[sabrina:14631] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
When I run it with memchecker, valgrind produces the following message
about an uninitialised value (I know that valgrind is sometimes wrong
about this kind of error):
==14634==
==14634== Conditional jump or move depends on uninitialised value(s)
==14634==    at 0x42E3A4C: ompi_convertor_need_buffers (convertor.h:175)
==14634==    by 0x42E3800: mca_pml_ob1_recv_request_ack (pml_ob1_recvreq.c:264)
==14634==    by 0x42E5566: mca_pml_ob1_recv_request_progress_rndv (pml_ob1_recvreq.c:554)
==14634==    by 0x42E1316: mca_pml_ob1_recv_frag_match (pml_ob1_recvfrag.c:641)
==14634==    by 0x42DFFDD: mca_pml_ob1_recv_frag_callback_rndv (pml_ob1_recvfrag.c:259)
==14634==    by 0x42322E7: mca_btl_sm_component_progress (btl_sm_component.c:426)
==14634==    by 0x44E3CF4: opal_progress (opal_progress.c:207)
==14634==    by 0x41A6E66: opal_condition_wait (condition.h:99)
==14634==    by 0x41A73E6: ompi_request_default_wait_all (req_wait.c:262)
==14634==    by 0x424E99A: ompi_coll_tuned_gather_intra_linear_sync (coll_tuned_gather.c:328)
==14634==    by 0x423CB98: ompi_coll_tuned_gather_intra_dec_fixed (coll_tuned_decision_fixed.c:718)
==14634==    by 0x4252B9E: mca_coll_sync_gather (coll_sync_gather.c:46)
==14634==
This is the first error message, apart from those produced during
MPI_Init(). If I attach the debugger, I get the following backtrace:
0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
175     ../../../../../../ompi/datatype/convertor.h: No such file or directory.
        in ../../../../../../ompi/datatype/convertor.h
(gdb) where
#0  0x042e3a4c in ompi_convertor_need_buffers (pConvertor=0x4a2c000)
    at ../../../../../../ompi/datatype/convertor.h:175
#1  0x042e3801 in mca_pml_ob1_recv_request_ack (recvreq=0x4a2bf80, hdr=0x95b0a90, bytes_received=4032)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:264
#2  0x042e5567 in mca_pml_ob1_recv_request_progress_rndv (recvreq=0x4a2bf80, btl=0x4375260, segments=0xbecc3490, num_segments=1)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvreq.c:554
#3  0x042e1317 in mca_pml_ob1_recv_frag_match (btl=0x4375260, hdr=0x95b0a90, segments=0xbecc3490, num_segments=1, type=66)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:641
#4  0x042dffde in mca_pml_ob1_recv_frag_callback_rndv (btl=0x4375260, tag=66 'B', des=0xbecc3438, cbdata=0x0)
    at ../../../../../../ompi/mca/pml/ob1/pml_ob1_recvfrag.c:259
#5  0x042322e8 in mca_btl_sm_component_progress ()
    at ../../../../../../ompi/mca/btl/sm/btl_sm_component.c:426
#6  0x044e3cf5 in opal_progress () at ../../../opal/runtime/opal_progress.c:207
#7  0x041a6e67 in opal_condition_wait (c=0x4382700, m=0x4382760)
    at ../../../opal/threads/condition.h:99
#8  0x041a73e7 in ompi_request_default_wait_all (count=2, requests=0x4ef5360, statuses=0x0)
    at ../../../ompi/request/req_wait.c:262
#9  0x0424e99b in ompi_coll_tuned_gather_intra_linear_sync (sbuf=0x4ebd438, scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348, root=0, comm=0x4d0d8a8, module=0x4d0e220, first_segment_size=1024)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_gather.c:328
#10 0x0423cb99 in ompi_coll_tuned_gather_intra_dec_fixed (sbuf=0x4ebd438, scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348, root=0, comm=0x4d0d8a8, module=0x4d0e220)
    at ../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:718
#11 0x04252b9f in mca_coll_sync_gather (sbuf=0x4ebd438, scount=3016, sdtype=0x4a3aa70, rbuf=0x4ecda00, rcount=1, rdtype=0x4f4b348, root=0, comm=0x4d0d8a8, module=0x4d0e098)
    at ../../../../../../ompi/mca/coll/sync/coll_sync_gather.c:46
#12 0x041db441 in PMPI_Gather (sendbuf=0x4ebd438, sendcount=3016, sendtype=0x4a3aa70, recvbuf=0x4ecda00, recvcount=1, recvtype=0x4f4b348, root=0, comm=0x4d0d8a8) at pgather.c:175
#13 0x082a47c9 in MPF_GEMV_SPARSE_INCORE (comm_row=0x4d0ce38, comm_col=0x4d0d8a8, transa=84 'T', M=232, N=464, P=232, Q=13, ALPHA=0x8d22e88, gBuffer=0x4f4aec0, bufferB=0x4f3f210, bufferC=0x4ebd438)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:160
#14 0x082a592b in MPF_GEMV_SPARSE (TRANSA=0xbecc38f7 "T", ALPHA=0x8d22e88, matA=0x4d0b7d0, vecB=0x4cbb7e0, BETA=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/IMPLS/SPARSE/matgemv_sparse.c:331
#15 0x08251f2a in MPF_GEMV (transa=0x8c937ec "T", alpha=0x8d22e88, matA=0x4d0b7d0, vecB=0x4cbb7e0, beta=0x8d22e88, vecC=0x4f3d8f0)
    at /home/gsylvand/BE_COMMON/MPF/src/MAT/INTERFACE/mat_gemv.c:150
#16 0x080ab641 in main (argc=1, argv=0xbecc3aa4)
    at /home/gsylvand/ACTIPOLE/src/COUCHA/SRC/coucha.c:358
The content of pConvertor is:
(gdb) p pConvertor[0]
$2 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x43741e0,
    obj_reference_count = 1,
    cls_init_file_name = 0x435687c "../../../../../ompi/mca/pml/base/pml_base_recvreq.c",
    cls_init_lineno = 42}, remoteArch = 4291428864, flags = 134873088,
  local_size = 0, remote_size = 0, pDesc = 0x0, use_desc = 0x0, count = 0,
  pBaseBuf = 0x0, pStack = 0x4a2c060, stack_size = 5, fAdvance = 0,
  master = 0x485eb60, stack_pos = 0, bConverted = 0, partial_length = 0,
  checksum = 0, csum_ui1 = 0, csum_ui2 = 0,
  static_stack = {{index = 0, type = 0, count = 0, disp = 0},
    {index = 0, type = 0, count = 0, disp = 0},
    {index = 0, type = 0, count = 0, disp = 0},
    {index = 0, type = 0, count = 0, disp = 0},
    {index = 0, type = 0, count = 0, disp = 0}}}
The MPI_Gather that fails is a bit complicated, since it uses MPI types
created with MPI_Type_vector and MPI_Type_struct. The call is:
/* here we have N=464 P=232 Q=13 */
bufferC = calloc(P * Q, 2 * sizeof(double));
bufferE = calloc(N * Q, 2 * sizeof(double));
....
ierr = MPI_Gather( bufferC, P*Q, BasicType, bufferE, 1, NStridedType, 0, comm_col );
where BasicType is a double-precision complex created with:
MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
MPI_Type_commit(&BasicType);
and NStridedType describes Q blocks of P complexes spaced every N
complexes, with its extent forced to P complexes, created with:
MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType); /* Q blocks of P BasicType every N */
disp[0] = 0;
type[0] = QPNStridedType;
blocklen[0] = 1;
MPI_Type_extent(BasicType, &(disp[1]));
disp[1] *= P;
type[1] = MPI_UB;
blocklen[1] = 1;
MPI_Type_struct(2, blocklen, disp, type, &NStridedType); /* just to set the extent to P */
MPI_Type_commit(&NStridedType);
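For reference, here is a quick sanity check of the two sides of this gather
(a sketch, not part of the failing code; it assumes BasicType, NStridedType,
P and Q as defined above, and needs stdio.h): the send side contributes P*Q
BasicType, and the receive side should see one NStridedType of the same total
size per rank, but with an extent of only P complexes thanks to the MPI_UB
trick:

    int basic_size, strided_size;
    MPI_Aint basic_ext, strided_ext;

    MPI_Type_size(BasicType, &basic_size);        /* expect 16 bytes     */
    MPI_Type_size(NStridedType, &strided_size);   /* expect P*Q*16 bytes */
    MPI_Type_extent(BasicType, &basic_ext);       /* expect 16 bytes     */
    MPI_Type_extent(NStridedType, &strided_ext);  /* expect P*16 bytes   */

    printf("send side: %d x %d bytes = %d bytes\n",
           P*Q, basic_size, P*Q*basic_size);
    printf("recv side: size = %d bytes, extent = %ld bytes (BasicType extent = %ld)\n",
           strided_size, (long)strided_ext, (long)basic_ext);

These numbers come out as expected here, so the type signatures of the two
sides should match.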
As mentioned earlier, this works with other MPI implementations, and this
kind of mechanism is widely used in this code, where it (usually) works
fine. Moreover, if I replace MPI_Gather with MPI_Allgather, the bug
disappears and it works:
ierr = MPI_Allgather( bufferC, P*Q, BasicType, bufferE, 1, NStridedType, comm_col ); CHKERRQ(ierr);
Another strange thing: if I write a small test.c using these same calls to
try to reproduce the bug, there is no bug at all, it works :(
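The standalone test I tried looks roughly like the sketch below (sizes
hardcoded to the failing case, comm_col replaced by MPI_COMM_WORLD on
exactly 2 processes, buffers left zeroed); this version runs through the
gather without any error here:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int N = 464, P = 232, Q = 13;   /* sizes of the failing case */
    int rank, nproc, ierr;
    int blocklen[2];
    MPI_Aint disp[2];
    MPI_Datatype type[2];
    MPI_Datatype BasicType, QPNStridedType, NStridedType;
    double *bufferC, *bufferE = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    if (nproc != 2) {  /* the receive layout below only holds for 2 ranks */
        if (rank == 0) fprintf(stderr, "run with exactly 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* double complex */
    MPI_Type_contiguous(2, MPI_DOUBLE, &BasicType);
    MPI_Type_commit(&BasicType);

    /* Q blocks of P BasicType every N, extent forced to P BasicType */
    MPI_Type_vector(Q, P, N, BasicType, &QPNStridedType);
    disp[0] = 0;      type[0] = QPNStridedType;  blocklen[0] = 1;
    MPI_Type_extent(BasicType, &disp[1]);
    disp[1] *= P;     type[1] = MPI_UB;          blocklen[1] = 1;
    MPI_Type_struct(2, blocklen, disp, type, &NStridedType);
    MPI_Type_commit(&NStridedType);

    bufferC = calloc((size_t)P * Q, 2 * sizeof(double));      /* send buffer */
    if (rank == 0)
        bufferE = calloc((size_t)N * Q, 2 * sizeof(double));  /* recv buffer */

    ierr = MPI_Gather(bufferC, P * Q, BasicType,
                      bufferE, 1, NStridedType, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Gather returned %d\n", ierr);

    MPI_Type_free(&NStridedType);
    MPI_Type_free(&QPNStridedType);
    MPI_Type_free(&BasicType);
    free(bufferC);
    free(bufferE);
    MPI_Finalize();
    return 0;
}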
Any suggestions on something to test?
Thanks in advance for your help,
Best regards,
Guillaume
--
Guillaume SYLVAND