Hi all,

Around a year ago, I posted the note attached below regarding apparently incorrect file output when using OpenMPI >= 1.3.0. I was asked to put together a small, self-contained piece of code that demonstrates the issue. I have attached that code to this posting (mpiio.cpp).
You can build this with:

    mpicxx mpiio.cpp -o mpiio

and I execute it with the following commands:

sh-3.2$ mpiexec -n 1 ~/dgm/src/mpiio; od -e mpi.out
0000000   0.000000000000000e+00   1.000000000000000e+00
0000020   2.000000000000000e+00   3.000000000000000e+00
0000040   4.000000000000000e+00   5.000000000000000e+00
0000060   6.000000000000000e+00   7.000000000000000e+00
0000100   8.000000000000000e+00   9.000000000000000e+00
0000120   1.000000000000000e+01   1.100000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

sh-3.2$ mpiexec -n 2 ~/dgm/src/mpiio; od -e mpi.out
0000000   1.200000000000000e+01   1.300000000000000e+01
0000020   1.400000000000000e+01   1.500000000000000e+01
0000040   1.600000000000000e+01   1.700000000000000e+01
0000060   1.800000000000000e+01   1.900000000000000e+01
0000100   2.000000000000000e+01   2.100000000000000e+01
0000120   2.200000000000000e+01   2.300000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

The program should write out doubles 0-23, and on one processor this is true. For n=2, however, the second rank's data is incorrectly written over top of the first rank's data (the od dump above shows doubles 12-23 in both halves of the file). For larger problems it sometimes also drops information entirely -- i.e., one rank's data never appears in the file at all. I suspect the two problems are closely related. To see this behavior, use 100 elements (instead of the default 2):

mpiexec -n 4 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 2400 Apr 19 12:19 mpi.out
mpiexec -n 1 ~/dgm/src/mpiio 100; ls -l mpi.out
-rw-r----- 1 user user 9600 Apr 19 12:19 mpi.out

Note how the -n 4 file is too small: 2400 bytes is only a quarter of the expected 9600.
Note that with OpenMPI 1.2.7, I have verified that we get the correct results:

$ mpiexec -n 1 mpiio; od -e mpi.out
0000000   0.000000000000000e+00   1.000000000000000e+00
0000020   2.000000000000000e+00   3.000000000000000e+00
0000040   4.000000000000000e+00   5.000000000000000e+00
0000060   6.000000000000000e+00   7.000000000000000e+00
0000100   8.000000000000000e+00   9.000000000000000e+00
0000120   1.000000000000000e+01   1.100000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

$ mpiexec -n 2 mpiio; od -e mpi.out
0000000   0.000000000000000e+00   1.000000000000000e+00
0000020   2.000000000000000e+00   3.000000000000000e+00
0000040   4.000000000000000e+00   5.000000000000000e+00
0000060   6.000000000000000e+00   7.000000000000000e+00
0000100   8.000000000000000e+00   9.000000000000000e+00
0000120   1.000000000000000e+01   1.100000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

Finally, just to prove that this is OpenMPI-related, I built the latest MPICH2, with these results:

$ ~/local/mpich2/bin/mpiexec -n 1 mpiio-mpich2; od -e mpi.out
0000000   0.000000000000000e+00   1.000000000000000e+00
0000020   2.000000000000000e+00   3.000000000000000e+00
0000040   4.000000000000000e+00   5.000000000000000e+00
0000060   6.000000000000000e+00   7.000000000000000e+00
0000100   8.000000000000000e+00   9.000000000000000e+00
0000120   1.000000000000000e+01   1.100000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

$ ~/local/mpich2/bin/mpiexec -n 2 mpiio-mpich2; od -e mpi.out
0000000   0.000000000000000e+00   1.000000000000000e+00
0000020   2.000000000000000e+00   3.000000000000000e+00
0000040   4.000000000000000e+00   5.000000000000000e+00
0000060   6.000000000000000e+00   7.000000000000000e+00
0000100   8.000000000000000e+00   9.000000000000000e+00
0000120   1.000000000000000e+01   1.100000000000000e+01
0000140   1.200000000000000e+01   1.300000000000000e+01
0000160   1.400000000000000e+01   1.500000000000000e+01
0000200   1.600000000000000e+01   1.700000000000000e+01
0000220   1.800000000000000e+01   1.900000000000000e+01
0000240   2.000000000000000e+01   2.100000000000000e+01
0000260   2.200000000000000e+01   2.300000000000000e+01
0000300

Clearly something is wrong (perhaps with the file pointers/offsets).

Hope that this helps,

Scott

For reference, the original note follows:

Subject: [OMPI users] Incorrect results with MPI-IO under OpenMPI v1.3.1
From: Scott Collis (sscollis_at_[hidden])
Date: 2009-04-06 14:16:18

I have been a user of MPI-IO for 4+ years and have a code that has run correctly with MPICH, MPICH2, and OpenMPI 1.2.*. I recently upgraded to OpenMPI 1.3.1 and immediately noticed that my MPI-IO-generated output files are corrupted.
I have not yet had a chance to debug this in detail, but it appears that MPI_File_write_all() calls are not placing data correctly according to their file view when running with more than one processor (everything is okay with -np 1). Note that I have observed the same incorrect behavior on both Linux and OS X. I have also gone back and made sure that the same code works with MPICH, MPICH2, and OpenMPI 1.2.*, so I'm fairly confident that something has been changed or broken as of OpenMPI 1.3.*. Just today I checked out the SVN repository version of OpenMPI, built it, and tested my code with it; the results are incorrect, just as with the 1.3.1 tarball.

While I plan to continue debugging this and will try to put together a small test that demonstrates the issue, I thought I would first send out this message to see if it triggers a thought within the OpenMPI development team as to where the problem might be. Please let me know if you have any ideas; I would very much appreciate it!

Thanks in advance,

Scott

--
Scott Collis
sscollis_at_[hidden]
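For readers unfamiliar with the calls mentioned above, the pattern in question is roughly the collective-write sequence below. This is an illustrative fragment with hypothetical names, assuming every rank writes the same number of doubles; it is not the actual application code.

#include <mpi.h>

// Illustrative fragment only: each rank positions its file view at its own
// byte displacement and then all ranks write collectively.
void write_my_block(const char *fname, double *local_buf, int local_count)
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, fname,
                MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  // Rank r's data should land at byte offset r * local_count * sizeof(double),
  // independent of how many ranks participate in the collective write.
  MPI_Offset disp = (MPI_Offset)rank * local_count * sizeof(double);
  MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

  MPI_File_write_all(fh, local_buf, local_count, MPI_DOUBLE, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);
}

The output reported in this thread looks as if that displacement is being ignored or mishandled for some ranks when more than one process is involved, which is consistent with the "file pointer/offsets" suspicion above.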
mpiio.cpp
Description: Binary data
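The attached mpiio.cpp is stored as binary data and is not reproduced in the archive. For anyone who wants to experiment, below is a minimal, self-contained sketch of a reproducer that is consistent with the behavior and file sizes reported above: 12 doubles per "element", a block decomposition of the elements across ranks, and the set_view/write_all pattern from the fragment earlier. It is a reconstruction under those assumptions, not the actual attached code.

#include <mpi.h>
#include <cstdlib>
#include <vector>

// Minimal reproducer sketch (a reconstruction, not the actual mpiio.cpp):
// N "elements" of 12 doubles each are block-distributed across the ranks,
// filled with the global values 0, 1, 2, ..., and written collectively so
// that a correct run produces the doubles 0 .. 12*N-1 in order.
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank, nproc;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  const int nelem = (argc > 1) ? std::atoi(argv[1]) : 2;  // total elements
  const int ndof  = 12;                                   // doubles per element

  // Block decomposition (assumes nproc <= nelem).
  const int base   = nelem / nproc;
  const int rem    = nelem % nproc;
  const int lelem  = base + (rank < rem ? 1 : 0);             // local elements
  const int offset = rank * base + (rank < rem ? rank : rem); // first global element

  std::vector<double> buf(lelem * ndof);
  for (int i = 0; i < lelem * ndof; ++i)
    buf[i] = double(offset * ndof + i);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "mpi.out",
                MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
  MPI_File_set_size(fh, 0);  // truncate any previous contents

  // Each rank's file view starts at its byte offset within the global array.
  MPI_Offset disp = (MPI_Offset)offset * ndof * sizeof(double);
  MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

  // Collective write: rank r contributes its lelem*ndof doubles.
  MPI_File_write_all(fh, &buf[0], lelem * ndof, MPI_DOUBLE, MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}

Built with mpicxx and run as shown at the top of this post, a correct MPI-IO implementation should produce a file containing the doubles 0 through 12*N-1 in order, for any process count.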