Hi all,
has anyone ever seen an error like this? It seems I have some setting
wrong in Open MPI. I thought I had it set up like the other machines,
but apparently I have missed something. I only get the error when
adding machine "fs1" to the hostfile list; the other 40+ machines seem
fine.
[fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer
[[2469,1],0] on compute-0-0 selected pml ob1
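Since the complaint is that fs1 picked the cm PML while the other nodes
picked ob1, I assume I could try forcing every rank onto the same PML at
launch as a workaround, something along these lines (untested, just a
guess on my part):
[admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca pml ob1 --mca btl ^tcp --hostfile machines -np 7 greetings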
When I run ompi_info, the output looks the same as on my other machines:
[root@fs1 openmpi-1.3]# ompi_info | grep btl
MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
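Since the error is about the pml rather than the btl, I suppose I should
also compare the PML components between fs1 and the other nodes, e.g.
with the same kind of grep (just a thought, I haven't chased it down yet):
[root@fs1 openmpi-1.3]# ompi_info | grep pml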
The whole error is below; any help would be greatly appreciated.
Gary
[admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp
--hostfile machines -np 7 greetings
[fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer
[[2212,1],0] on compute-0-0 selected pml ob1
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
Process 2 ([[2212,1],0]) is on host: compute-0-0
BTLs attempted: openib self sm
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully;
not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-0.local:1591] Abort before MPI_INIT completed successfully;
not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not
able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 16884 on
node dahl.calvin.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[dahl.calvin.edu:16879] 3 more processes have sent help message
help-mpi-runtime / mpi_init:startup:internal-failure
[dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to
0 to see all help / error messages
[dahl.calvin.edu:16879] 2 more processes have sent help message
help-mca-bml-r2.txt / unreachable proc