As a pure guess, it might actually be this one:

- Fix to detect and avoid overlapping memcpy().  Thanks to Francis
  Pellegrini for identifying the issue.
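To illustrate the class of bug that changelog entry refers to, here is a
generic sketch (not the actual Open MPI patch): memcpy() has undefined
behavior when the source and destination regions overlap, and memmove()
is the safe replacement in that case.

/* Generic illustration only -- not code from Open MPI. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[16] = "abcdefghij";

    /* Shifting the buffer "in place" by two bytes: source and destination
     * overlap, so memcpy(buf + 2, buf, 10) would be undefined behavior. */
    memmove(buf + 2, buf, 10);   /* memmove() handles the overlap correctly */

    printf("%s\n", buf);         /* prints "ababcdefghij" */
    return 0;
}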
We're actually very close to releasing 1.4.4 -- using the latest RC should be
pretty safe.

On Sep 23, 2011, at 5:51 AM, Paul Kapinos wrote:

> Hi Open MPI folks,
>
> we see some quite strange effects with our installations of Open MPI 1.4.3
> with the Intel 12.x compilers, which have us puzzled: different programs
> reproducibly deadlock or die with errors like the ones listed below.
>
> Some of the errors look like programming issues at first glance (well, a
> deadlock *is* usually a programming error), but we do not believe that is
> the case: the errors arise in many well-tested codes, including HPL (*),
> only with a specific compiler + Open MPI combination (Intel 12.x compilers
> + Open MPI 1.4.3), and only at particular process counts (usually high
> ones). For example, HPL reproducibly deadlocks with 72 processes and dies
> with error message #2 with 384 processes.
>
> All of these errors seem to be somehow related to MPI communicators;
> 1.4.4rc3, 1.5.3, and 1.5.4 do not seem to have this problem. 1.4.3 also
> seems to be unproblematic when used with the Intel 11.x compiler series.
> So perhaps this:
>
> (1.4.4 release notes:)
> - Fixed a segv in MPI_Comm_create when called with GROUP_EMPTY.
>   Thanks to Dominik Goeddeke for finding this.
>
> is also the fix for our issues? Or maybe not, because 1.5.3 is _older_
> than this fix?
>
> Since we have worked around the problem by switching our production to
> 1.5.3, this issue is not a "burning" one; but I decided to post it anyway,
> because any issue in such fundamental things may be interesting for the
> developers.
>
> Best wishes,
> Paul Kapinos
>
>
> (*) http://www.netlib.org/benchmark/hpl/
>
> ################################################################
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(111): MPI_Comm_size(comm=0x0, size=0x6f4a90) failed
> MPI_Comm_size(69).: Invalid communicator
>
> ################################################################
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** An error occurred in MPI_Comm_split
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERR_IN_STATUS: error code in status
> [linuxbdc05.rz.RWTH-Aachen.DE:23219] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> ################################################################
> forrtl: severe (71): integer divide by zero
> Image              PC                Routine   Line      Source
> libmpi.so.0        00002AAAAD9EDF52  Unknown   Unknown   Unknown
> libmpi.so.0        00002AAAAD9EE45D  Unknown   Unknown   Unknown
> libmpi.so.0        00002AAAAD9C3375  Unknown   Unknown   Unknown
> libmpi_f77.so.0    00002AAAAD75C37A  Unknown   Unknown   Unknown
> vasp_mpi_gamma     000000000057E010  Unknown   Unknown   Unknown
> vasp_mpi_gamma     000000000059F636  Unknown   Unknown   Unknown
> vasp_mpi_gamma     0000000000416C5A  Unknown   Unknown   Unknown
> vasp_mpi_gamma     0000000000A62BEE  Unknown   Unknown   Unknown
> libc.so.6          0000003EEB61EC5D  Unknown   Unknown   Unknown
> vasp_mpi_gamma     0000000000416A29  Unknown   Unknown   Unknown
>
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
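For reference, a minimal call pattern that exercises the MPI_Comm_create /
GROUP_EMPTY case quoted above might look like the sketch below; it is not
taken from HPL or any of the codes mentioned in the report, only an
illustration of the affected call path.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);

    /* Every rank passes the predefined empty group.  The 1.4.4 release
     * notes report a segv on this call path in earlier 1.4.x releases;
     * a correct implementation simply returns MPI_COMM_NULL on all ranks. */
    MPI_Comm_create(MPI_COMM_WORLD, MPI_GROUP_EMPTY, &newcomm);

    if (newcomm == MPI_COMM_NULL) {
        printf("got MPI_COMM_NULL, as expected for an empty group\n");
    } else {
        MPI_Comm_free(&newcomm);
    }

    MPI_Finalize();
    return 0;
}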