On Wed, Dec 1, 2010 at 8:28 AM, Rob Latham <r...@mcs.anl.gov> wrote:
> On Mon, Nov 22, 2010 at 04:40:14PM -0700, James Overfelt wrote:
>> Hello,
>>
>>     I have a small test case where a file created with MPI_File_open
>> is still open at the time MPI_Finalize is called.  In the actual
>> program there are lots of open files and it would be nice to avoid the
>> resulting "Your MPI job will now abort." by either having MPI_Finalize
>> close the files or honor the error handler and return an error code
>> without an abort.
>>
>>   I've tried with OpenMPI 1.4.3 and 1.5 with the same results.
>> Attached are the configure, compile and source files and the whole
>> program follows.
>
> under MPICH2, this simple test program does not abort.  You leak a lot
> of resources (e.g. the allocated info structure is not freed), but it
> sounds like you are well aware of that.
>
> under openmpi, this test program fails because openmpi is trying to
> help you out.  I'm going to need some help from the openmpi folks
> here, but the backtrace makes it look like MPI_Finalize is setting the
> "no more mpi calls allowed" flag, and then goes and calls some mpi
> routines to clean up the opened files:
>
> Breakpoint 1, 0xb7f7c346 in PMPI_Barrier () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> (gdb) where
> #0  0xb7f7c346 in PMPI_Barrier () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #1  0xb78a4c25 in mca_io_romio_dist_MPI_File_close () from /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #2  0xb787e8b3 in mca_io_romio_file_close () from /home/robl/work/soft/openmpi-1.4/lib/openmpi/mca_io_romio.so
> #3  0xb7f591b1 in file_destructor () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #4  0xb7f58f28 in ompi_file_finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #5  0xb7f67eb3 in ompi_mpi_finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #6  0xb7f82828 in PMPI_Finalize () from /home/robl/work/soft/openmpi-1.4/lib/libmpi.so.0
> #7  0x0804f9c2 in main (argc=1, argv=0xbfffed94) at file_error.cc:17
>
> Why is there an MPI_Barrier in the close path?  It has to do with our
> implementation of shared file pointers.  If you run this test on a file system
> that does not support shared file pointers (PVFS, for example), you might get
> a little further.
>
> So, I think the ball is back in the OpenMPI court: they have to
> re-jigger the order of the destructors so that closing files comes a
> little earlier in the shutdown process.
>
> ==rob
>


Rob,

  Thank you, that is the answer I was hoping for:  I'm not crazy and
it should be an easy fix.  I'll look through the OpenMPI source code
and maybe suggest a fix.
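
For anyone else hitting this, the failing pattern is roughly the following
(a minimal sketch from memory, not the exact attached test case; the file
name, open mode, and choice of error handler here are just placeholders):

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* Ask for errors to be returned rather than aborting (placeholder
     choice; the real test case sets its own handler). */
  MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, (char *) "test.dat",
                MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

  /* Deliberately no MPI_File_close(&fh): the file is still open when
     MPI_Finalize runs.  MPICH2 cleans it up quietly; per the backtrace
     above, Open MPI's cleanup path ends up calling MPI_Barrier after
     the "no more MPI calls" flag is set, which is what aborts the job. */
  MPI_Finalize();
  return 0;
}

Closing each file explicitly before MPI_Finalize is the obvious workaround,
but with lots of open files in the real program it would still be nice if
MPI_Finalize handled it gracefully.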

jro
