Hi all, 

Around a year ago, I posted the attached note regarding apparently incorrect
file output when using OpenMPI >= 1.3.0.  I was asked to put together a small,
self-contained piece of code that demonstrates the issue, and I have attached
that code to this posting (mpiio.cpp).
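
In outline, the reproducer does the following.  This is only a sketch of the
pattern, not the verbatim attachment; the factor of 12 doubles per "element"
and the even split across ranks are inferences from the file sizes shown
below:

// Sketch only -- not the verbatim mpiio.cpp attachment.  Each rank owns a
// contiguous slice of doubles and writes it collectively through a file view
// displaced by its byte offset.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // "Elements" per run (default 2, as in the transcripts below).  The 12
  // doubles per element is inferred from the observed file sizes (24 doubles
  // for 2 elements, 1200 for 100); the total is assumed to divide evenly.
  const int nelem = (argc > 1) ? std::atoi(argv[1]) : 2;
  const int total = 12 * nelem;
  const int local = total / size;
  const MPI_Offset disp =
      static_cast<MPI_Offset>(rank) * local * sizeof(double);

  std::vector<double> buf(local);
  for (int i = 0; i < local; ++i) buf[i] = double(rank * local + i);

  // Remove any stale output so MPI_MODE_CREATE doesn't leave old bytes behind.
  if (rank == 0) std::remove("mpi.out");
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, const_cast<char*>("mpi.out"),
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  // Displace each rank's view by its byte offset; etype = filetype = MPI_DOUBLE.
  MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE,
                    const_cast<char*>("native"), MPI_INFO_NULL);
  MPI_File_write_all(fh, &buf[0], local, MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}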

You can build this with: 

  mpicxx mpiio.cpp -o mpiio 

I then execute it with the command:

sh-3.2$ mpiexec -n 1 ~/dgm/src/mpiio; od -e mpi.out 
0000000   0.000000000000000e+00   1.000000000000000e+00 
0000020   2.000000000000000e+00   3.000000000000000e+00 
0000040   4.000000000000000e+00   5.000000000000000e+00 
0000060   6.000000000000000e+00   7.000000000000000e+00 
0000100   8.000000000000000e+00   9.000000000000000e+00 
0000120   1.000000000000000e+01   1.100000000000000e+01 
0000140   1.200000000000000e+01   1.300000000000000e+01 
0000160   1.400000000000000e+01   1.500000000000000e+01 
0000200   1.600000000000000e+01   1.700000000000000e+01 
0000220   1.800000000000000e+01   1.900000000000000e+01 
0000240   2.000000000000000e+01   2.100000000000000e+01 
0000260   2.200000000000000e+01   2.300000000000000e+01 
0000300 

sh-3.2$ mpiexec -n 2 ~/dgm/src/mpiio; od -e mpi.out 
0000000   1.200000000000000e+01   1.300000000000000e+01 
0000020   1.400000000000000e+01   1.500000000000000e+01 
0000040   1.600000000000000e+01   1.700000000000000e+01 
0000060   1.800000000000000e+01   1.900000000000000e+01 
0000100   2.000000000000000e+01   2.100000000000000e+01 
0000120   2.200000000000000e+01   2.300000000000000e+01 
0000140   1.200000000000000e+01   1.300000000000000e+01 
0000160   1.400000000000000e+01   1.500000000000000e+01 
0000200   1.600000000000000e+01   1.700000000000000e+01 
0000220   1.800000000000000e+01   1.900000000000000e+01 
0000240   2.000000000000000e+01   2.100000000000000e+01 
0000260   2.200000000000000e+01   2.300000000000000e+01 
0000300 

Note that the program should write out doubles 0-23, and on one processor this
is true.  However, for -n 2 it incorrectly writes rank 1's data (12-23) over
rank 0's portion at the start of the file, in addition to writing it at its
own correct offset.

For larger problems it sometimes also drops information entirely -- i.e., one
rank's data never makes it into the file at all.  I suspect that these
problems are closely related.  To see this behavior, use 100 elements
(instead of the default 2):

mpiexec -n 4 ~/dgm/src/mpiio 100; ls -l mpi.out 
-rw-r----- 1 user user 2400 Apr 19 12:19 mpi.out 

mpiexec -n 1 ~/dgm/src/mpiio 100; ls -l mpi.out 
-rw-r----- 1 user user 9600 Apr 19 12:19 mpi.out 

Note how the -n 4 file is too small: 2400 bytes is only 300 doubles, exactly
one rank's share of the 1200 doubles (9600 bytes) that the -n 1 run produces.

Note that with OpenMPI 1.2.7, I have verified that we get the correct 
results: 

$ mpiexec -n 1 mpiio; od -e mpi.out 
0000000     0.000000000000000e+00    1.000000000000000e+00 
0000020     2.000000000000000e+00    3.000000000000000e+00 
0000040     4.000000000000000e+00    5.000000000000000e+00 
0000060     6.000000000000000e+00    7.000000000000000e+00 
0000100     8.000000000000000e+00    9.000000000000000e+00 
0000120     1.000000000000000e+01    1.100000000000000e+01 
0000140     1.200000000000000e+01    1.300000000000000e+01 
0000160     1.400000000000000e+01    1.500000000000000e+01 
0000200     1.600000000000000e+01    1.700000000000000e+01 
0000220     1.800000000000000e+01    1.900000000000000e+01 
0000240     2.000000000000000e+01    2.100000000000000e+01 
0000260     2.200000000000000e+01    2.300000000000000e+01 
0000300 

$ mpiexec -n 2 mpiio; od -e mpi.out 
0000000     0.000000000000000e+00    1.000000000000000e+00 
0000020     2.000000000000000e+00    3.000000000000000e+00 
0000040     4.000000000000000e+00    5.000000000000000e+00 
0000060     6.000000000000000e+00    7.000000000000000e+00 
0000100     8.000000000000000e+00    9.000000000000000e+00 
0000120     1.000000000000000e+01    1.100000000000000e+01 
0000140     1.200000000000000e+01    1.300000000000000e+01 
0000160     1.400000000000000e+01    1.500000000000000e+01 
0000200     1.600000000000000e+01    1.700000000000000e+01 
0000220     1.800000000000000e+01    1.900000000000000e+01 
0000240     2.000000000000000e+01    2.100000000000000e+01 
0000260     2.200000000000000e+01    2.300000000000000e+01 
0000300 

Finally, just to confirm that this is OpenMPI-related, I built the latest
MPICH2, with these results:

$ ~/local/mpich2/bin/mpiexec -n 1 mpiio-mpich2; od -e mpi.out 
0000000     0.000000000000000e+00    1.000000000000000e+00 
0000020     2.000000000000000e+00    3.000000000000000e+00 
0000040     4.000000000000000e+00    5.000000000000000e+00 
0000060     6.000000000000000e+00    7.000000000000000e+00 
0000100     8.000000000000000e+00    9.000000000000000e+00 
0000120     1.000000000000000e+01    1.100000000000000e+01 
0000140     1.200000000000000e+01    1.300000000000000e+01 
0000160     1.400000000000000e+01    1.500000000000000e+01 
0000200     1.600000000000000e+01    1.700000000000000e+01 
0000220     1.800000000000000e+01    1.900000000000000e+01 
0000240     2.000000000000000e+01    2.100000000000000e+01 
0000260     2.200000000000000e+01    2.300000000000000e+01 
0000300 

$ ~/local/mpich2/bin/mpiexec -n 2 mpiio-mpich2; od -e mpi.out 
0000000     0.000000000000000e+00    1.000000000000000e+00 
0000020     2.000000000000000e+00    3.000000000000000e+00 
0000040     4.000000000000000e+00    5.000000000000000e+00 
0000060     6.000000000000000e+00    7.000000000000000e+00 
0000100     8.000000000000000e+00    9.000000000000000e+00 
0000120     1.000000000000000e+01    1.100000000000000e+01 
0000140     1.200000000000000e+01    1.300000000000000e+01 
0000160     1.400000000000000e+01    1.500000000000000e+01 
0000200     1.600000000000000e+01    1.700000000000000e+01 
0000220     1.800000000000000e+01    1.900000000000000e+01 
0000240     2.000000000000000e+01    2.100000000000000e+01 
0000260     2.200000000000000e+01    2.300000000000000e+01 
0000300 

Clearly something is wrong (perhaps with the file pointers/offsets).  I hope
this helps,

Scott 



Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1 
From: Scott Collis (sscollis_at_[hidden]) 
Date: 2009-04-06 14:16:18 

I have been a user of MPI-IO for 4+ years and have a code that has run
correctly with MPICH, MPICH2, and OpenMPI 1.2.*.

I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my
MPI-IO-generated output files are corrupted.  I have not yet had a chance to
debug this in detail, but it appears that MPI_File_write_all() is not placing
data correctly according to its file view when running with more than one
processor (everything is okay with -np 1).
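
For what it's worth, one way to localize this might be to compare against a
collective write at an explicit offset, which bypasses the file-view
displacement entirely.  A rough sketch (the helper below is hypothetical, not
taken from my code):

#include <mpi.h>
#include <vector>

// Hypothetical helper: write this rank's slice of doubles at an explicit
// byte offset instead of through a displaced file view.  If this path
// produces correct files while MPI_File_set_view + MPI_File_write_all does
// not, the problem likely lies in the view/displacement handling.
void write_at_explicit_offset(const char* fname,
                              const std::vector<double>& buf,
                              MPI_Offset byte_offset)
{
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, const_cast<char*>(fname),
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  // Collective write at an explicit offset; no MPI_File_set_view involved.
  MPI_File_write_at_all(fh, byte_offset, const_cast<double*>(&buf[0]),
                        static_cast<int>(buf.size()), MPI_DOUBLE,
                        MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
}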

Note that I have observed the same incorrect behavior on both Linux and OS X.
I have also gone back and verified that the same code works with MPICH,
MPICH2, and OpenMPI 1.2.*, so I'm fairly confident that something changed or
broke as of OpenMPI 1.3.*.  Just today, I checked out the SVN repository
version of OpenMPI, built and tested my code against it, and the results are
incorrect just as with the 1.3.1 tarball.

While I plan to continue to debug this and will try to put together a 
small test that demonstrates the issue, I thought that I would first 
send out this message to see if this might trigger a thought within 
the OpenMPI development team as to where this issue might be. 

Please let me know if you have any ideas as I would very much 
appreciate it! 

Thanks in advance, 

Scott 

-- 
Scott Collis 
sscollis_at_[hidden] 

Attachment: mpiio.cpp