It looks like you have PSM-enabled cards on your system as well as Ethernet, and we are picking that up. Try adding "-mca pml ob1" to your command line and see if that helps.
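For concreteness, here is roughly how that flag might be added to the mpirun line from the quoted job.sh below (a sketch only — paths and the other MCA parameters are taken from the original post, and forcing the ob1 PML simply bypasses PSM in favor of the BTLs already listed):

```shell
mpirun --mca pml ob1 \
  --mca btl tcp,self \
  --mca btl_tcp_if_include 172.24.38.0/24 \
  --mca oob_tcp_if_include eth0 \
  /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 \
  --columns=1000 --rows=1000
```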
On Tue, May 19, 2015 at 5:04 AM, Nilo Menezes <n...@nilo.pro.br> wrote:
> Hello,
>
> I'm trying to run Open MPI with multithread support enabled.
>
> I'm getting these error messages before init finishes:
>
> [node011:61627] PSM returned unhandled/unknown connect error: Operation timed out
> [node011:61627] PSM EP connect error (unknown connect error):
>
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node005:51948] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node039:57062] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node012:64036] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node008:14098] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [node011:61627] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> [node005:51887] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure
> [node005:51887] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
> The library was configured with:
>
>     ./configure \
>         --prefix=/home/opt \
>         --enable-static \
>         --enable-mpi-thread-multiple \
>         --with-threads
>
> gcc 4.8.2
>
> On Linux:
>
>     Linux node001 2.6.32-279.14.1.el6.x86_64 #1 SMP Mon Oct 15 13:44:51 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
>
> The job was started with:
>
>     sbatch --nodes=6 --ntasks=30 --mem=4096 -o result/TOn6t30.txt -e result/TEn6t30.txt job.sh
>
> job.sh contains:
>
>     mpirun --mca btl tcp,self \
>         --mca btl_tcp_if_include 172.24.38.0/24 \
>         --mca oob_tcp_if_include eth0 \
>         /home/umons/info/menezes/drsim/build/NameResolution/gameoflife_mpi2 --columns=1000 --rows=1000
>
> I call MPI_Init with:
>
>     int provided;
>     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>
> The program is a simple Game of Life simulation. It runs fine on a single node (with one or many tasks), but fails at random nodes when distributed.
>
> Any hint may help.
>
> Best Regards,
>
> Nilo Menezes
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26879.php
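One general point worth noting alongside the quoted MPI_Init_thread call: the MPI standard allows the library to grant a lower thread level than the one requested, and the call can still return MPI_SUCCESS. A minimal sketch (not from the original post — only the two-line call above appears there) of checking `provided` before relying on MPI_THREAD_MULTIPLE:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* MPI_Init_thread may succeed yet grant less than the requested
       level (e.g. only MPI_THREAD_SERIALIZED); fail fast instead of
       hitting undefined behavior later when multiple threads call MPI. */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... application code ... */

    MPI_Finalize();
    return 0;
}
```

This check would not fix the PSM connect timeout itself, but it distinguishes "the transport failed during init" from "the build silently lacks full thread support".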