Brian --

While I was on a plane today, I took a whack at making OMPI behave better when you forget to MPI_File_close() a file.  Can you try this patch (should apply cleanly to OMPI trunk, v1.6, or v1.7):

    https://svn.open-mpi.org/trac/ompi/changeset/28177
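For reference, here's a minimal sketch of the pairing the patch guards against -- an MPI_File_open() without its matching MPI_File_close() before MPI_Finalize().  The filename and communicator are just placeholders, and error checking is elided:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;

        MPI_Init(&argc, &argv);

        /* "output.dat" and MPI_COMM_WORLD are placeholders */
        MPI_File_open(MPI_COMM_WORLD, "output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* ... MPI_File_write_all() / MPI_File_write_at() etc. ... */

        MPI_File_close(&fh);   /* every open needs a matching close... */
        MPI_Finalize();        /* ...before MPI_Finalize() */
        return 0;
    }

If the close is skipped, the file object is still alive when ompi_mpi_finalize() tears things down, which is how you end up in file_destructor() in your stack trace.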
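And since you mentioned possibly missing a wait() or close() pnetCDF call: with the pnetCDF non-blocking API, every posted operation has to be completed with a wait before the close.  A hedged sketch of that pattern (the helper and its arguments are made up for illustration, not taken from your code):

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Hypothetical helper; ncid/varid/start/count/buf come from
       whatever your real setup produces. */
    static void write_and_close(int ncid, int varid,
                                const MPI_Offset start[],
                                const MPI_Offset count[],
                                const double *buf)
    {
        int req, status;

        /* The non-blocking call only *posts* the write... */
        ncmpi_iput_vara_double(ncid, varid, start, count, buf, &req);

        /* ...so it must be completed before the file is closed.  A
           missing wait (or a missing close) is exactly the kind of
           thing that surfaces as I/O deferred into MPI_Finalize(). */
        ncmpi_wait_all(ncid, 1, &req, &status);
        ncmpi_close(ncid);
    }

Auditing that every ncmpi_i*() call in your code reaches a matching ncmpi_wait_all() / ncmpi_close() would be where I'd start looking.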
On Mar 18, 2013, at 12:42 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> I *believe* that this means that you didn't MPI_File_close a file.
>
> We're not giving a very helpful error message here (it's downright
> misleading, actually), but I'm pretty sure that this is the case.
>
>
> On Mar 6, 2013, at 10:28 AM, "Smith, Brian E." <smit...@ornl.gov> wrote:
>
>> Hi all,
>>
>> I have some code that uses parallel netCDF. I've run it successfully on
>> Titan (using the Cray MPICH derivative) and on my laptop (also running
>> MPICH). However, when I run on one of our clusters running OMPI, the
>> code barfs in MPI_Finalize() and doesn't write the complete/expected
>> output files:
>>
>> [:17472] *** An error occurred in MPI_File_set_errhandler
>> [:17472] *** on a NULL communicator
>> [:17472] *** Unknown error
>> [:17472] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> --------------------------------------------------------------------------
>> An MPI process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly. You should
>> double check that everything has shut down cleanly.
>>
>>   Reason:     After MPI_FINALIZE was invoked
>>   Local host:
>>   PID:        17472
>> --------------------------------------------------------------------------
>>
>> The stacks are:
>>   PMPI_Finalize (pfinalize.c:46)
>>   ompi_mpi_finalize (ompi_mpi_finalize.c:272)
>>   ompi_file_finalize (file.c:196)
>>   opal_obj_run_destructors (opal_object.h:448)
>>   file_destructor (file.c:273)
>>   mca_io_romio_file_close (io_romio_file_open.c:59)
>>   PMPI_File_set_errhandler (pfile_set_errhandler.c:47)
>>   ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:52)
>>
>> This is with OMPI 1.6.2. It is pnetCDF 1.3.1 on all three platforms.
>>
>> The code appears to have the right participants opening/closing the
>> right files on the right communicators (a mixture of rank 0s on
>> subcomms opening across their subcomms, and some nodes opening on
>> MPI_COMM_SELF). It looks to me like some I/O is getting delayed until
>> MPI_Finalize(), suggesting perhaps I missed a wait() or close()
>> pnetCDF call.
>>
>> I don't necessarily think this is a bug in OMPI; I just don't know
>> where to start looking in my code, since it works fine on the two
>> different versions of MPICH.
>>
>> Thanks.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/