Hello,

I'm trying to run Open MPI with multithreading support enabled.

I'm getting these error messages before MPI_Init_thread finishes:
[node011:61627] PSM returned unhandled/unknown connect error: Operation timed out
[node011:61627] PSM EP connect error (unknown connect error):

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node005:51948] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node039:57062] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node012:64036] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node008:14098] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node011:61627] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[node005:51887] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
[node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

The library was configured with:
./configure \
--prefix=/home/opt \
--enable-static \
--enable-mpi-thread-multiple \
--with-threads

The compiler is gcc 4.8.2.

On Linux:
Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

The job was started with:
sbatch --nodes=6 --ntasks=30 --mem=4096 -o result/TOn6t30.txt -e result/TEn6t30.txt job.sh


job.sh contains:
mpirun --mca btl tcp,self \
       --mca btl_tcp_if_include 172.24.38.0/24 \
       --mca oob_tcp_if_include eth0 \
/home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 --columns=1000 --rows=1000

I call MPI_Init_thread with:
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

The program is a simple Game of Life simulation. It runs fine on a single node (with one or many tasks), but fails on random nodes when distributed across several nodes.

Any hint would help.

Best Regards,

Nilo Menezes
