It happens with or without a rankfile. Started with: mpirun -np 16 ./somecode
MCA parameters:
  btl = self,sm,openib
  mpi_maffinity_alone = 1
  rmaps_base_no_oversubscribe = 1   (rmaps_base_no_oversubscribe = 0 doesn't change it)
I tested with both "btl = self,sm" on the 16-core nodes and "btl = self,sm,openib" on the 8x dual-core nodes; the result is the same.
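For reference, the same settings expressed directly on the mpirun command line (with the btl value adjusted per node type as above) would be:

  mpirun --mca btl self,sm,openib \
         --mca mpi_maffinity_alone 1 \
         --mca rmaps_base_no_oversubscribe 1 \
         -np 16 ./somecode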
It looks like it always occurs at exactly the same point in the execution, not at the beginning; it is not the first MPI_Comm_dup in the code.
I can't say much about the particular piece of code where it happens, because it is inside a third-party library (MUMPS). When the error occurs, MPI_Comm_dup in every task is operating on a single-task communicator (an MPI_Comm_split of the initial MPI_COMM_WORLD that splits the 16 processes into 16 groups, 1 process per group). And my guess is that before this error, MPI_Comm_dup has already been called something like 100 times by the same piece of code on the same communicators without any problem.
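To illustrate the pattern (only a rough sketch of what I understand MUMPS to be doing, not the actual library code; in particular I am assuming here that the duplicated communicators are freed again, which I have not verified), the calling sequence is roughly this, using the F77 bindings:

      program split_dup_test
      implicit none
      include 'mpif.h'
      integer ierr, rank, splitcomm, dupcomm, i
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
c     split the 16-process MPI_COMM_WORLD into 16 groups,
c     one process per group (color = rank)
      call MPI_COMM_SPLIT(MPI_COMM_WORLD, rank, 0, splitcomm, ierr)
c     dup the single-task communicator many times; in my run
c     ~100 dups succeed before the failure
c     (I do not actually know whether MUMPS frees its duplicates)
      do i = 1, 200
         call MPI_COMM_DUP(splitcomm, dupcomm, ierr)
         call MPI_COMM_FREE(dupcomm, ierr)
      end do
      call MPI_COMM_FREE(splitcomm, ierr)
      call MPI_FINALIZE(ierr)
      end

I have not tried whether this standalone sketch reproduces the failure on its own.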
I can say that it used to work correctly with all previous versions of Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also works correctly on other platforms/MPI implementations.
All environment variables (PATH, LD_LIBRARY_PATH) are correct. I recompiled the code and the third-party libraries with this version of Open MPI.
Attachments: config.log.gz, ompi-info.txt.gz
--
Anton Starikov
Computational Material Science, Faculty of Science and Technology, University of Twente
Phone: +31 (0)53 489 2986
Fax: +31 (0)53 489 2910

On May 12, 2009, at 12:35 PM, Jeff Squyres wrote:
> Can you send all the information listed here: http://www.open-mpi.org/community/help/
>
> On May 11, 2009, at 10:03 PM, Anton Starikov wrote:
>
>> By the way, this is Fortran code, which uses the F77 bindings.
>>
>> --
>> Anton Starikov.
>>
>> On May 12, 2009, at 3:06 AM, Anton Starikov wrote:
>>
>>> Due to rankfile fixes I switched to SVN r21208; now my code dies with the error:
>>>
>>> [node037:20519] *** An error occurred in MPI_Comm_dup
>>> [node037:20519] *** on communicator MPI COMMUNICATOR 32 SPLIT FROM 4
>>> [node037:20519] *** MPI_ERR_INTERN: internal error
>>> [node037:20519] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> --
> Jeff Squyres
> Cisco Systems