Yvan,
It's now corrected. Please use the trunk (nightly builds) starting from
revision 8997, or wait until Monday, when we will update the next stable
candidate. If you are in a hurry and feel like playing around with the MPI
code, you can apply the attached patch to the latest stable release.
Thanks,
george.
On Fri, 10 Feb 2006, George Bosilca wrote:
Yvan,
I'm looking into this one. So far I cannot reproduce it with the
current version from the trunk. I will look into the stable versions.
Until I figure out what's wrong, can you please use the nightly
builds to run your test. Once the problem gets fixed it will be
included in the 1.0.2 release.
BTW, which interconnect are you using? Ethernet?
Thanks,
george.
On Feb 10, 2006, at 5:06 PM, Yvan Fournier wrote:
Hello,
I seem to have encountered a bug in Open MPI 1.0 using indexed datatypes
with MPI_Recv (which seems to be of the "off by one" sort). I have
attached a test case, which is briefly explained below (as well as in the
source file). This case should run on two processes. I observed the bug
on 2 different Linux systems (single-processor Centrino under Suse 10.0
with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4)
with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.
Here is a summary of the case:
------------------
Each processor reads a file ("data_p0" or "data_p1") giving a list of
global element ids. Some elements (vertices from a partitioned mesh)
may belong to both processors, so their ids may appear on both
processors: we have 7178 global vertices, 3654 and 3688 of them being
known by ranks 0 and 1 respectively.
In this simplified version, we assign coordinates {x, y, z} to each
vertex equal to its global id number for rank 1, and the negative of
that for rank 0 (assigning the same values to x, y, and z). After
finishing the "ordered gather", rank 0 prints the global id and
coordinates of each vertex.
Lines should print (for example) as:
6456 ; 6456.00000 6456.00000 6456.00000
6457 ; -6457.00000 -6457.00000 -6457.00000
depending on whether a vertex belongs only to rank 0 (negative
coordinates) or belongs to rank 1 (positive coordinates).
With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
Debian Sarge with gcc 3.4), we have for example for the last vertices:
7176 ; 7175.00000 7175.00000 7176.00000
7177 ; 7176.00000 7176.00000 7177.00000
which seems to indicate an "off by one" type bug in datatype handling.
Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
in the gather_test.c file), the bug disappears. Using the indexed
datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug
either, so it does seem to be an Open MPI issue.
------------------
Best regards,
Yvan Fournier
<ompi_datatype_bug.tar.gz>
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
"Half of what I say is meaningless; but I say it so that the other
half may reach you"
Kahlil Gibran
"We must accept finite disappointment, but we must never lose infinite
hope."
Martin Luther King
Index: new_position.c
===================================================================
--- new_position.c (revision 8976)
+++ new_position.c (working copy)
@@ -1,12 +1,12 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
- * Copyright (c) 2004-2005 The Trustees of Indiana University.
+ * Copyright (c) 2004-2006 The Trustees of Indiana University.
* All rights reserved.
- * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
+ * Copyright (c) 2004-2006 The Trustees of the University of Tennessee.
* All rights reserved.
- * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
+ * Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
- * Copyright (c) 2004-2005 The Regents of the University of California.
+ * Copyright (c) 2004-2006 The Regents of the University of California.
* All rights reserved.
* $COPYRIGHT$
*
@@ -26,10 +26,13 @@
#endif
#include <stdlib.h>
-static int ompi_pack_debug=0;
+#if OMPI_ENABLE_DEBUG
+int32_t ompi_position_debug = 0;
+#define DO_DEBUG(INST) if( ompi_position_debug ) { INST }
+#else
+#define DO_DEBUG(INST)
+#endif /* OMPI_ENABLE_DEBUG */
-#define DO_DEBUG(INST) if( ompi_pack_debug ) { INST }
-
/* The pack/unpack functions need a cleanup. I have to create a proper interface to access
* all basic functionalities, hence using them as basic blocks for all conversion functions.
*
@@ -215,6 +218,10 @@
(*position) -= iov_len_local;
pConvertor->bConverted = *position; /* update the already converted bytes */
assert( iov_len_local >= 0 );
+ if( (pConvertor->pending_length != iov_len_local) &&
+ (pConvertor->flags & CONVERTOR_RECV) ) {
+ opal_output( 0, "Missing some data ?" );
+ }
if( !(pConvertor->flags & CONVERTOR_COMPLETED) ) {
/* I complete an element, next step I should go to the next one */
PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc, DT_BYTE, count_desc,
Index: new_unpack.c
===================================================================
--- new_unpack.c (revision 8976)
+++ new_unpack.c (working copy)
@@ -1,14 +1,12 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
- * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
- * University Research and Technology
- * Corporation. All rights reserved.
- * Copyright (c) 2004-2005 The University of Tennessee and The University
- * of Tennessee Research Foundation. All rights
- * reserved.
- * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
+ * Copyright (c) 2004-2006 The Trustees of Indiana University.
+ * All rights reserved.
+ * Copyright (c) 2004-2006 The Trustees of the University of Tennessee.
+ * All rights reserved.
+ * Copyright (c) 2004-2006 High Performance Computing Center Stuttgart,
* University of Stuttgart. All rights reserved.
- * Copyright (c) 2004-2005 The Regents of the University of California.
+ * Copyright (c) 2004-2006 The Regents of the University of California.
* All rights reserved.
* $COPYRIGHT$
*
@@ -265,19 +263,20 @@
}
}
complete_loop:
- iov[iov_count].iov_len -= iov_len_local; /* update the amount of valid data */
- total_unpacked += iov[iov_count].iov_len;
- pConvertor->bConverted += iov[iov_count].iov_len; /* update the already converted bytes */
- assert( iov_len_local >= 0 );
if( !(pConvertor->flags & CONVERTOR_COMPLETED) && (0 != iov_len_local) ) {
/* We have some partial data here. Let's copy it into the convertor
* and keep it hot until the next round.
*/
- assert( iov_len_local < 16 );
+ assert( iov_len_local < ompi_ddt_basicDatatypes[type]->size );
memcpy( pConvertor->pending, packed_buffer, iov_len_local );
DO_DEBUG( opal_output( 0, "Saving %d bytes for the next call\n", iov_len_local ); );
pConvertor->pending_length = iov_len_local;
+ iov_len_local = 0;
}
+ iov[iov_count].iov_len -= iov_len_local; /* update the amount of valid data */
+ total_unpacked += iov[iov_count].iov_len;
+ pConvertor->bConverted += iov[iov_count].iov_len; /* update the already converted bytes */
+ assert( iov_len_local >= 0 );
}
*max_data = total_unpacked;
*out_size = iov_count;
Index: dt_module.c
===================================================================
--- dt_module.c (revision 8976)
+++ dt_module.c (working copy)
@@ -26,6 +26,7 @@
extern int32_t ompi_unpack_debug;
extern int32_t ompi_pack_debug;
extern int32_t ompi_copy_debug;
+extern int32_t ompi_position_debug;
#endif /* OMPI_ENABLE_DEBUG */
extern size_t ompi_datatype_memcpy_block_size;
@@ -542,6 +543,8 @@
false, false, 0, &ompi_unpack_debug );
mca_base_param_reg_int_name( "datatype", "pack_debug", "Non zero lead to output generated by the pack functions",
false, false, 0, &ompi_pack_debug );
+ mca_base_param_reg_int_name( "datatype", "position_debug", "Non zero lead to output generated by the datatype position functions",
+ false, false, 0, &ompi_position_debug );
mca_base_param_reg_int_name( "datatype", "copy_debug", "Non zero lead to output generated by the local copy functions",
false, false, 0, &ompi_copy_debug );
#endif /* OMPI_ENABLE_DEBUG */
@@ -642,12 +645,12 @@
(int)pDesc->loop.extent );
else if( DT_END_LOOP == pDesc->elem.common.type )
index += snprintf( ptr + index, length - index, "prev %d elements first elem displacement %ld size of data %d\n",
- (int)pDesc->end_loop.items, pDesc->end_loop.first_elem_disp,
- (int)pDesc->end_loop.size );
+ (int)pDesc->end_loop.items, pDesc->end_loop.first_elem_disp,
+ (int)pDesc->end_loop.size );
else
- index += snprintf( ptr + index, length - index, "count %d disp 0x%lx (%ld) extent %d\n",
+ index += snprintf( ptr + index, length - index, "count %d disp 0x%lx (%ld) extent %d (size %ld)\n",
(int)pDesc->elem.count, pDesc->elem.disp, pDesc->elem.disp,
- (int)pDesc->elem.extent );
+ (int)pDesc->elem.extent, pDesc->elem.count * ompi_ddt_basicDatatypes[pDesc->elem.common.type]->size );
pDesc++;
if( length <= index ) break;