It looks like you have PSM-enabled cards on your system as well as
Ethernet, and we are picking that up. Try adding "--mca pml ob1" to your
mpirun command line and see if that helps.
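For reference, here is a sketch of the amended invocation, reusing the
options and paths from the job.sh quoted below (it is a command fragment,
not something I have run on your cluster):

```shell
# Force the ob1 point-to-point layer so Open MPI does not try to
# initialize the PSM MTL on the InfiniPath/PSM cards.
mpirun --mca pml ob1 \
       --mca btl tcp,self \
       --mca btl_tcp_if_include 172.24.38.0/24 \
       --mca oob_tcp_if_include eth0 \
       /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
       --columns=1000 --rows=1000
```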


On Tue, May 19, 2015 at 5:04 AM, Nilo Menezes <n...@nilo.pro.br> wrote:

> Hello,
>
> I'm trying to run Open MPI with multithread support enabled.
>
> I'm getting these error messages before init finishes:
> [node011:61627] PSM returned unhandled/unknown connect error: Operation
> timed out
> [node011:61627] PSM EP connect error (unknown connect error):
>
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node005:51948] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node039:57062] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node012:64036] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node008:14098] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [node011:61627] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages, and not able to guarantee that all other
> processes were killed!
> [node005:51887] 1 more process has sent help message help-mpi-runtime /
> mpi_init:startup:internal-failure
> [node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> The library was configured with:
> ./configure \
> --prefix=/home/opt \
> --enable-static \
> --enable-mpi-thread-multiple \
> --with-threads
>
> gcc 4.8.2
>
> On Linux:
> Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT
> 2012 x86_64 x86_64 x86_64 GNU/Linux
>
> The job was started with:
> sbatch --nodes=6 --ntasks=30 --mem=4096 -o result/TOn6t30.txt -e result/TEn6t30.txt job.sh
>
>
> job.sh contains:
> mpirun --mca btl tcp,self \
>        --mca btl_tcp_if_include 172.24.38.0/24 \
>        --mca oob_tcp_if_include eth0 \
> /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 --columns=1000 --rows=1000
>
> I call MPI_Init_thread with:
>     int provided;
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>
> The program is a simple game of life simulation. It runs fine on a single
> node (with one or many tasks), but fails at random nodes when distributed.
>
> Any hint may help.
>
> Best Regards,
>
> Nilo Menezes
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26879.php
>
