On Wed, Dec 1, 2010 at 8:28 AM, Rob Latham <r...@mcs.anl.gov> wrote:
> On Mon, Nov 22, 2010 at 04:40:14PM -0700, James Overfelt wrote:
>> Hello,
>>
>> I have a small test case where a file created with MPI_File_open
>> is still open at the time MPI_Finalize is called. In the actual
>> program there are lots of open files, and it would be nice to avoid the
>> resulting "Your MPI job will now abort." by either having MPI_Finalize
>> close the files or honor the error handler and return an error code
>> without an abort.
>>
>> I've tried this with OpenMPI 1.4.3 and 1.5 with the same results.
>> Attached are the configure, compile and source files, and the whole
>> program follows.
>
> Under MPICH2, this simple test program does not abort. You leak a lot
> of resources (e.g. the allocated info structure is never freed), but it
> sounds like you are well aware of that.
>
> Under OpenMPI, this test program fails because OpenMPI is trying to
> help you out. I'm going to need some help from the OpenMPI folks
> here, but the backtrace makes it look like MPI_Finalize sets the
> "no more MPI calls allowed" flag and then goes and calls some MPI
> routines to clean up the opened files:
>
> Breakpoint 1, 0xb7f7c346 in PMPI_Barrier () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> (gdb) where
> #0  0xb7f7c346 in PMPI_Barrier () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #1  0xb78a4c25 in mca_io_romio_dist_MPI_File_close () from
>    /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #2  0xb787e8b3 in mca_io_romio_file_close () from
>    /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #3  0xb7f591b1 in file_destructor () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #4  0xb7f58f28 in ompi_file_finalize () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #5  0xb7f67eb3 in ompi_mpi_finalize () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #6  0xb7f82828 in PMPI_Finalize () from
>    /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #7  0x0804f9c2 in main (argc=1, argv=0xbfffed94) at file_error.cc:17
>
> Why is there an MPI_Barrier in the close path? It has to do with our
> implementation of shared file pointers. If you run this test on a file
> system that does not support shared file pointers (PVFS, for example),
> you might get a little further.
>
> So, I think the ball is back in the OpenMPI court: they have to
> re-jigger the order of the destructors so that closing files comes a
> little earlier in the shutdown process.
>
> ==rob
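[Editor's note: the attached file_error.cc is not reproduced in this thread. A minimal sketch of the kind of program being described, with the filename "test_file", the open flags, and the error-handler call assumed rather than taken from the original attachment, might look like this:]

/* Sketch of a reproducer: open a file with MPI_File_open and never
 * close it, so it is still open when MPI_Finalize runs.  The filename
 * and access mode here are assumptions, not the original test case. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* The original report mentions an info object that is allocated
     * and (deliberately) leaked. */
    MPI_Info_create(&info);

    MPI_File_open(MPI_COMM_WORLD, "test_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Ask for error codes instead of aborts on this file handle; in
     * the failing case the abort happens later, inside MPI_Finalize's
     * own cleanup, not at this call site. */
    MPI_File_set_errhandler(fh, MPI_ERRORS_RETURN);

    /* No MPI_File_close(&fh) and no MPI_Info_free(&info): the still-open
     * file is what sends OpenMPI down the ompi_file_finalize ->
     * file close -> MPI_Barrier path after the "no more MPI calls
     * allowed" flag has already been set. */
    MPI_Finalize();
    return 0;
}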
Rob,

   Thank you, that is the answer I was hoping for: I'm not crazy, and it
should be an easy fix. I'll look through the OpenMPI source code and
perhaps suggest a patch.

jro