Hi all,

I have some code that uses parallel netCDF (pnetCDF). I've run it successfully 
on Titan (using the Cray MPICH derivative) and on my laptop (also running 
MPICH). However, when I run on one of our clusters running OMPI, the code 
aborts in MPI_Finalize() and doesn't write the complete/expected output files:

[:17472] *** An error occurred in MPI_File_set_errhandler
[:17472] *** on a NULL communicator
[:17472] *** Unknown error
[:17472] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     After MPI_FINALIZE was invoked
  Local host:
  PID:        17472
--------------------------------------------------------------------------

The stacks are:
PMPI_Finalize (pfinalize.c:46)
  ompi_mpi_finalize (ompi_mpi_finalize.c:272)
    ompi_file_finalize (file.c:196)
      opal_obj_run_destructors (opal_object.h:448)
        file_destructor (file.c:273)
          mca_to_romio_file_close (io_romio_file_open.c:59)
            PMPI_File_set_errhandler (pfile_set_errhandler.c:47)
              ompi_mpi_errors_are_fatal_comm_handler (errhandler_predefined.c:52)

This is with OMPI 1.6.2; it is pnetCDF 1.3.1 on all three platforms.

The code appears to have the right participants opening/closing the right files 
on the right communicators (a mixture of subcomm rank 0s opening across their 
subcomms and some ranks opening on MPI_COMM_SELF). It looks to me like some I/O 
is being deferred until MPI_Finalize(), which suggests I may have missed a 
pnetCDF wait (ncmpi_wait_all()) or close (ncmpi_close()) call somewhere.
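
For reference, here is the stripped-down pattern I believe every pnetCDF file 
needs before MPI_Finalize(): complete any outstanding non-blocking requests 
with ncmpi_wait_all(), then close the file with ncmpi_close(). The file name, 
dimension, and variable below are placeholders rather than my actual code -- 
just a sketch of the rule I'm trying to check my code against.

#include <stdio.h>
#include <mpi.h>
#include <pnetcdf.h>

#define CHECK(e) do { if ((e) != NC_NOERR) \
    fprintf(stderr, "pnetcdf: %s\n", ncmpi_strerror(e)); } while (0)

int main(int argc, char **argv)
{
    int err, rank, ncid, dimid, varid, req, st;
    MPI_Offset start[1], count[1];
    double val;
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one file per rank on MPI_COMM_SELF, like some of my ranks do;
     * the same wait/close rules apply to files opened on a subcomm */
    snprintf(fname, sizeof(fname), "out.%d.nc", rank);
    err = ncmpi_create(MPI_COMM_SELF, fname, NC_CLOBBER, MPI_INFO_NULL, &ncid);
    CHECK(err);

    err = ncmpi_def_dim(ncid, "n", 1, &dimid);                     CHECK(err);
    err = ncmpi_def_var(ncid, "x", NC_DOUBLE, 1, &dimid, &varid);  CHECK(err);
    err = ncmpi_enddef(ncid);                                      CHECK(err);

    /* non-blocking put: only posts the request, does no I/O yet */
    start[0] = 0; count[0] = 1; val = (double)rank;
    err = ncmpi_iput_vara_double(ncid, varid, start, count, &val, &req);
    CHECK(err);

    /* complete all pending non-blocking requests ... */
    err = ncmpi_wait_all(ncid, 1, &req, &st);                      CHECK(err);
    /* ... and close the file, both BEFORE MPI_Finalize() */
    err = ncmpi_close(ncid);                                       CHECK(err);

    MPI_Finalize();
    return 0;
}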

I don't necessarily think this is a bug in OMPI; I just don't know where to 
start looking in my code, since it works fine with the two MPICH-based 
implementations.

Thanks.


