Yvan,

I'm looking into this one. So far I cannot reproduce it with the current version from the trunk, so I will look into the stable versions next. Until I figure out what's wrong, can you please use the nightly builds to run your test? Once the problem gets fixed, the fix will be included in the 1.0.2 release.

BTW, which interconnect are you using? Ethernet?

  Thanks,
    george.

On Feb 10, 2006, at 5:06 PM, Yvan Fournier wrote:

Hello,

I seem to have encountered a bug in Open MPI 1.0 using indexed datatypes
with MPI_Recv (which seems to be of the "off by one" sort). I have
attached a test case, which is briefly explained below (as well as in
the source file). This case should run on two processes. I observed the
bug on 2 different Linux systems (single-processor Centrino under Suse
10.0 with gcc 4.0.2, dual-processor Xeon under Debian Sarge with gcc 3.4)
with Open MPI 1.0.1, and do not observe it using LAM 7.1.1 or MPICH2.

Here is a summary of the case:

------------------

Each processor reads a file ("data_p0" or "data_p1") giving a list of
global element ids. Some elements (vertices from a partitioned mesh)
may belong to both processors, so their ids may appear on both
processors: we have 7178 global vertices, 3654 and 3688 of them being
known by ranks 0 and 1 respectively.

In this simplified version, we assign coordinates {x, y, z} to each
vertex equal to its global id number for rank 1, and the negative of
that for rank 0 (assigning the same values to x, y, and z). After
finishing the "ordered gather", rank 0 prints the global id and
coordinates of each vertex.

Lines should print (for example) as:
  6456 ;   6456.00000   6456.00000   6456.00000
  6457 ;  -6457.00000  -6457.00000  -6457.00000
depending on whether a vertex belongs only to rank 0 (negative
coordinates) or belongs to rank 1 (positive coordinates).
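
For readers without the attachment, the expected result described above can be sketched in plain C, without MPI. This is not the code from gather_test.c: the function names, the 1-based global ids, and the interleaved x, y, z layout are all assumptions made for illustration. Each rank's values are placed at slots selected by global id, which is roughly what a receive through an indexed datatype is meant to achieve, and rank 1's values overwrite rank 0's for shared vertices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for each listed vertex, write sign * global_id
 * into x, y and z of the slot selected by that id.
 * Assumes 1-based global ids and 3 doubles per vertex (x, y, z). */
static void place_by_index(double *coords, size_t n_global,
                           const int *global_ids, size_t n_local,
                           double sign)
{
    for (size_t i = 0; i < n_local; i++) {
        size_t slot = (size_t)(global_ids[i] - 1);
        assert(slot < n_global);  /* id must lie inside the global array */
        for (int k = 0; k < 3; k++)
            coords[3 * slot + k] = sign * (double)global_ids[i];
    }
}

/* Hypothetical stand-in for the "ordered gather": rank 0's negative
 * values are placed first, then rank 1's positive values overwrite
 * the slots of any vertex known to rank 1. */
static void ordered_gather_sketch(double *coords, size_t n_global,
                                  const int *ids_rank0, size_t n0,
                                  const int *ids_rank1, size_t n1)
{
    place_by_index(coords, n_global, ids_rank0, n0, -1.0);
    place_by_index(coords, n_global, ids_rank1, n1, +1.0);
}
```

With such a placement, the three coordinates of a given vertex always carry the same value, so a printed line whose x and y differ from z points at the data transfer rather than at the input files.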

With the OMPI 1.0.1 bug (observed on Suse Linux 10.0 with gcc 4.0 and on
Debian sarge with gcc 3.4), we have for example for the last vertices:
  7176 ;   7175.00000   7175.00000   7176.00000
  7177 ;   7176.00000   7176.00000   7177.00000
which seems to indicate an "off by one" type bug in datatype handling.

Not using an indexed datatype (i.e. not defining USE_INDEXED_DATATYPE
in the gather_test.c file), the bug disappears. Using the indexed
datatype with LAM MPI 7.1.1 or MPICH2, we do not reproduce the bug
either, so it does seem to be an Open MPI issue.

------------------

Best regards,

        Yvan Fournier
<ompi_datatype_bug.tar.gz>

"Half of what I say is meaningless; but I say it so that the other half may reach you"
                                  Kahlil Gibran