Hello,
I'm trying to run Open MPI with multithread support enabled.
I'm getting these error messages before init finishes:
[node011:61627] PSM returned unhandled/unknown connect error: Operation
timed out
[node011:61627] PSM EP connect error (unknown connect error):
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node005:51948] Local abort before MPI_INIT completed successfully; not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node039:57062] Local abort before MPI_INIT completed successfully; not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node012:64036] Local abort before MPI_INIT completed successfully; not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node008:14098] Local abort before MPI_INIT completed successfully; not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node011:61627] Local abort before MPI_INIT completed successfully; not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!
[node005:51887] 1 more process has sent help message help-mpi-runtime /
mpi_init:startup:internal-failure
[node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
The library was configured with:
./configure \
--prefix=/home/opt \
--enable-static \
--enable-mpi-thread-multiple \
--with-threads
It was built with gcc 4.8.2 on Linux:
Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT
2012 x86_64 x86_64 x86_64 GNU/Linux
The job was started with:
sbatch --nodes=6 --ntasks=30 --mem=4096 -o result/TOn6t30.txt -e
result/TEn6t30.txt job.sh
job.sh contains:
mpirun --mca btl tcp,self \
--mca btl_tcp_if_include 172.24.38.0/24 \
--mca oob_tcp_if_include eth0 \
/home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
--columns=1000 --rows=1000
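The log above suggests setting the MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages. A sketch of what that run could look like, assuming the same command as in job.sh with only the orte_base_help_aggregate line added (I have not run this variant):

```shell
# Sketch only: the job.sh command plus one extra MCA parameter.
# orte_base_help_aggregate=0 is the setting named in the log output;
# every other option is copied unchanged from job.sh.
mpirun --mca btl tcp,self \
  --mca btl_tcp_if_include 172.24.38.0/24 \
  --mca oob_tcp_if_include eth0 \
  --mca orte_base_help_aggregate 0 \
  /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
  --columns=1000 --rows=1000
```

This should only change how much diagnostic output is printed, not the behaviour of the job itself.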
I call MPI_Init_thread with:
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
The program is a simple game of life simulation. It runs fine on a
single node (with one or many tasks), but fails on random nodes when
distributed across several.
Any hints would be appreciated.
Best Regards,
Nilo Menezes