I read the MPICH trac ticket you pointed to, and your analysis seems pertinent. My patch for the “count = 0” case has a similar outcome to yours: it removed all references to the datatype when the count was zero, without looking for the special markers.
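To make that concrete, the kind of construct at issue is a datatype that embeds a zero-count HVECTOR (or STRUCT) component, along the lines of the sketch below. This is illustrative only; it is not taken from Richard's test case or from the datatype engine, and the displacements are arbitrary:

    #include <mpi.h>

    /* Illustrative sketch: a struct datatype whose second component is an
     * hvector with count = 0.  The zero-count member describes no data,
     * but the optimization/flattening pass still has to treat it
     * consistently (e.g. not drop the markers that carry the extent
     * information) instead of simply discarding the datatype reference. */
    static void build_type_with_empty_hvector(MPI_Datatype *newtype)
    {
        MPI_Datatype empty_hvec;

        /* hvector describing zero blocks of one double each */
        MPI_Type_create_hvector(0, 1, (MPI_Aint)sizeof(double), MPI_DOUBLE,
                                &empty_hvec);

        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { 0, (MPI_Aint)(8 * sizeof(double)) };
        MPI_Datatype types[2]     = { MPI_DOUBLE, empty_hvec };

        MPI_Type_create_struct(2, blocklens, displs, types, newtype);
        MPI_Type_commit(newtype);
        MPI_Type_free(&empty_hvec);
    }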
Let me try to come up with a fix. Thanks,
  George.

On May 8, 2014, at 17:08, Rob Latham <r...@mcs.anl.gov> wrote:

> On 05/07/2014 11:36 AM, Rob Latham wrote:
>> On 05/05/2014 09:20 PM, Richard Shaw wrote:
>>> Hello,
>>>
>>> I think I've come across a bug when using ROMIO to read in a 2D
>>> distributed array. I've attached a test case to this email.
>>
>> Thanks for the bug report and the test case.
>>
>> I've opened an MPICH bug (because this is ROMIO's fault, not OpenMPI's
>> fault... until I can prove otherwise! :>)
>
> This bug appears to be OpenMPI's fault now.
>
> I'm looking at OpenMPI's "pulled it from git an hour ago" version, and
> ROMIO's flattening code overruns arrays: the OpenMPI datatype processing
> routines return too few blocks for ranks 1 and 3.
>
> Michael Raymond told me off-list: "I tracked this down to MPT not marking
> HVECTORs / STRUCTs with 0-sized counts as contiguous. Once I changed this,
> the memory corruption and the data mismatches both went away." Could
> something similar be happening in OpenMPI?
>
> ==rob
>
>> http://trac.mpich.org/projects/mpich/ticket/2089
>>
>> ==rob
>>
>>> In the testcase I first initialise an array of 25 doubles (which will be
>>> a 5x5 grid), then I create a datatype representing a 5x5 matrix
>>> distributed in 3x3 blocks over a 2x2 process grid. As a reference I use
>>> MPI_Pack to pull out the block-cyclic array elements local to each
>>> process (which I think is correct). Then I write the original array of
>>> 25 doubles to disk, and use MPI-IO to read it back in (performing the
>>> Open, Set_view, and Read_all), and compare to the reference.
>>>
>>> Running this with OMPIO, the two match on all ranks.
>>>
>>> > mpirun -mca io ompio -np 4 ./darr_read.x
>>> === Rank 0 === (9 elements)
>>> Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>>> Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>>>
>>> === Rank 1 === (6 elements)
>>> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
>>> Read: 15.0 16.0 17.0 20.0 21.0 22.0
>>>
>>> === Rank 2 === (6 elements)
>>> Packed: 3.0 4.0 8.0 9.0 13.0 14.0
>>> Read: 3.0 4.0 8.0 9.0 13.0 14.0
>>>
>>> === Rank 3 === (4 elements)
>>> Packed: 18.0 19.0 23.0 24.0
>>> Read: 18.0 19.0 23.0 24.0
>>>
>>> However, using ROMIO the two differ on two of the ranks:
>>>
>>> > mpirun -mca io romio -np 4 ./darr_read.x
>>> === Rank 0 === (9 elements)
>>> Packed: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>>> Read: 0.0 1.0 2.0 5.0 6.0 7.0 10.0 11.0 12.0
>>>
>>> === Rank 1 === (6 elements)
>>> Packed: 15.0 16.0 17.0 20.0 21.0 22.0
>>> Read: 0.0 1.0 2.0 0.0 1.0 2.0
>>>
>>> === Rank 2 === (6 elements)
>>> Packed: 3.0 4.0 8.0 9.0 13.0 14.0
>>> Read: 3.0 4.0 8.0 9.0 13.0 14.0
>>>
>>> === Rank 3 === (4 elements)
>>> Packed: 18.0 19.0 23.0 24.0
>>> Read: 0.0 1.0 0.0 1.0
>>>
>>> My interpretation is that the behaviour with OMPIO is correct.
>>> Interestingly, everything matches up using both ROMIO and OMPIO if I set
>>> the block shape to 2x2.
>>>
>>> This was run on OS X using 1.8.2a1r31632. I have also run this on Linux
>>> with OpenMPI 1.7.4, and OMPIO is still correct, but using ROMIO I just
>>> get segfaults.
>>>
>>> Thanks,
>>> Richard
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
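For convenience, here is a rough sketch of the darr_read.x test case Richard describes above. It is a reconstruction, not his actual attachment: the exact darray parameters and ordering may differ, the file name "darr_read.dat" and the rank-0 fwrite of the input file are assumptions, and error handling is omitted.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* meant to run with -np 4 */

        /* global 5x5 array of doubles, values 0.0 .. 24.0 */
        double global[25];
        for (int i = 0; i < 25; i++) global[i] = (double)i;

        /* 5x5 matrix, block-cyclic in 3x3 blocks over a 2x2 process grid */
        int gsizes[2]   = { 5, 5 };
        int distribs[2] = { MPI_DISTRIBUTE_CYCLIC, MPI_DISTRIBUTE_CYCLIC };
        int dargs[2]    = { 3, 3 };
        int psizes[2]   = { 2, 2 };

        MPI_Datatype darray;
        MPI_Type_create_darray(size, rank, 2, gsizes, distribs, dargs, psizes,
                               MPI_ORDER_C, MPI_DOUBLE, &darray);
        MPI_Type_commit(&darray);

        /* number of doubles owned by this rank */
        int tsize;
        MPI_Type_size(darray, &tsize);
        int nlocal = tsize / (int)sizeof(double);

        /* reference: pack the locally owned elements with the darray type,
         * then unpack them as a contiguous run of doubles */
        int packsize, pos = 0, upos = 0;
        MPI_Pack_size(1, darray, MPI_COMM_WORLD, &packsize);
        void   *pbuf = malloc(packsize);
        double *ref  = malloc(nlocal * sizeof(double));
        double *rbuf = malloc(nlocal * sizeof(double));
        MPI_Pack(global, 1, darray, pbuf, packsize, &pos, MPI_COMM_WORLD);
        MPI_Unpack(pbuf, packsize, &upos, ref, nlocal, MPI_DOUBLE, MPI_COMM_WORLD);

        /* rank 0 writes the original 25 doubles to disk
         * (assumes a filesystem visible to all ranks) */
        if (rank == 0) {
            FILE *f = fopen("darr_read.dat", "wb");
            fwrite(global, sizeof(double), 25, f);
            fclose(f);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        /* read it back through MPI-IO, using the darray type as the file view */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "darr_read.dat", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, darray, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, rbuf, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* compare the MPI-IO result against the packed reference */
        printf("=== Rank %d === (%d elements)\n", rank, nlocal);
        for (int i = 0; i < nlocal; i++)
            if (ref[i] != rbuf[i])
                printf("rank %d: mismatch at %d: packed %.1f, read %.1f\n",
                       rank, i, ref[i], rbuf[i]);

        free(pbuf); free(ref); free(rbuf);
        MPI_Type_free(&darray);
        MPI_Finalize();
        return 0;
    }

Run with 4 processes, selecting the io component as in the outputs above (mpirun -mca io romio -np 4 ./darr_read.x, or -mca io ompio).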