Thank you,
That seems to solve the problem.
Best Regards,
Nilo Menezes
On 5/19/2015 3:34 PM, Ralph Castain wrote:
It looks like you have PSM-enabled cards on your system as well as
Ethernet, and we are picking that up. Try adding "-mca pml ob1" to
your command line and see if that helps.
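Applied to the mpirun line from the original post below, that would look roughly like this (a sketch only; the btl/oob settings and program path are unchanged from that post):

mpirun --mca pml ob1 \
       --mca btl tcp,self \
       --mca btl_tcp_if_include 172.24.38.0/24 \
       --mca oob_tcp_if_include eth0 \
       /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
       --columns=1000 --rows=1000

Forcing the ob1 PML keeps point-to-point traffic on the BTLs (here tcp and self) instead of the cm PML, which is what pulls in the PSM MTL. If the setting should stick, it can also go in the per-user MCA parameters file (normally $HOME/.openmpi/mca-params.conf) as a line reading "pml = ob1".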
On Tue, May 19, 2015 at 5:04 AM, Nilo Menezes <n...@nilo.pro.br> wrote:
Hello,
I'm trying to run Open MPI with multithreading support enabled.
I'm getting these error messages before init finishes:
[node011:61627] PSM returned unhandled/unknown connect error: Operation timed out
[node011:61627] PSM EP connect error (unknown connect error):
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node005:51948] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node039:57062] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node012:64036] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node008:14098] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[node011:61627] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[node005:51887] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
[node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
The library was configured with:
./configure \
--prefix=/home/opt \
--enable-static \
--enable-mpi-thread-multiple \
--with-threads
gcc 4.8.2
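To double-check that this build actually ended up with MPI_THREAD_MULTIPLE support, one quick way (not something from the original post) is:

ompi_info | grep -i thread

which should include a "Thread support" line indicating whether MPI_THREAD_MULTIPLE is available.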
On Linux:
Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
The job was started with:
sbatch --nodes=6 --ntasks=30 --mem=4096 -o result/TOn6t30.txt -e result/TEn6t30.txt job.sh
job.sh contains:
mpirun --mca btl tcp,self \
       --mca btl_tcp_if_include 172.24.38.0/24 \
       --mca oob_tcp_if_include eth0 \
       /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
       --columns=1000 --rows=1000
I call MPI_Init_thread with:
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
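A minimal sketch of that initialization, with a check that the library actually granted MPI_THREAD_MULTIPLE (the check is not in the original program; the returned level may be lower than the one requested):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multithreading; the library may grant a lower level. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Bail out early if the requested level is not actually available. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... rest of the game of life simulation ... */

    MPI_Finalize();
    return 0;
}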
The program is a simple game of life simulation. It runs fine on a
single node (with one or many tasks), but fails on random nodes
when distributed.
Any hint would help.
Best Regards,
Nilo Menezes