On Wed, 23 May 2012, Lisandro Dalcin wrote:

> On 23 May 2012 19:04, Jeff Squyres <jsquy...@cisco.com> wrote:
>> Thanks for all the info!
>>
>> But still, can we get a copy of the test in C?  That would make it
>> significantly easier for us to tell if there is a problem with Open MPI --
>> mainly because we don't know anything about the internals of mpi4py.
>
> FYI, this test ran fine with previous (but recent, say 1.5.4)
> Open MPI versions, but fails with 1.6. The test also runs fine with
> MPICH2.
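
For anyone following along without the attachment: the test boils down to an MPI_Allgatherv call in which each rank contributes a different number of elements. A minimal sketch of that pattern -- my reconstruction, not Lisandro's actual code, which may use different counts or types -- looks like this:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* NOTE: hypothetical reconstruction of the failing pattern;
   the real test posted to the list may differ. */
int main(int argc, char *argv[])
{
    int rank, size, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* rank i contributes i+1 ints, so counts and displacements
       differ across ranks */
    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (i = 0; i < size; i++) {
        counts[i] = i + 1;
        displs[i] = total;
        total += counts[i];
    }

    int *sendbuf = malloc(counts[rank] * sizeof(int));
    int *recvbuf = malloc(total * sizeof(int));
    for (i = 0; i < counts[rank]; i++)
        sendbuf[i] = rank;

    MPI_Allgatherv(sendbuf, counts[rank], MPI_INT,
                   recvbuf, counts, displs, MPI_INT,
                   MPI_COMM_WORLD);

    if (rank == 0)
        printf("OK\n");

    free(sendbuf); free(recvbuf); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}

Compile and run with, e.g., "mpicc test.c && mpirun -np 5 a.out".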

I compiled the C example Lisandro provided using openmpi/1.4.3 built against the Intel 11.0 compilers, and it ran without error. I then recompiled using gcc 4.6.2 and openmpi 1.4.4, and it produced the following errors:

$ mpirun -np 5 a.out
[hostname:6601] *** An error occurred in MPI_Allgatherv
[hostname:6601] *** on communicator
[hostname:6601] *** MPI_ERR_COUNT: invalid count argument
[hostname:6601] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 6601 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

I then recompiled using the Intel compilers, and it ran without error 10 out of 10 times.

I then recompiled with the gcc 4.6.2/openmpi 1.4.4 combination, and it failed consistently.

On the second and subsequent runs, it printed the following additional messages:

$ mpirun -np 5 a.out
[hostname:7168] *** An error occurred in MPI_Allgatherv
[hostname:7168] *** on communicator
[hostname:7168] *** MPI_ERR_COUNT: invalid count argument
[hostname:7168] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 7168 on
node hostname exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[hostname:07163] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[hostname:07163] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages
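
As that last help message suggests, the suppressed duplicate reports can be shown by turning off aggregation on the command line:

$ mpirun --mca orte_base_help_aggregate 0 -np 5 a.out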

Not sure if that information is helpful or not.

I am still completely puzzled why the number 5 is magic....
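
One guess, which I haven't verified: Open MPI's tuned collective component selects the allgatherv algorithm based on communicator and message size, so 5 ranks may cross a threshold into a different code path. If this build exposes the tuned-collective MCA parameters (an assumption on my part), pinning the algorithm would test that, e.g.:

$ mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_allgatherv_algorithm 1 -np 5 a.out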

                        -- bennet
