fs1 is selecting the "cm" PML whereas the other nodes are selecting the "ob1" PML component. You can force ob1 to be used on all nodes by passing "--mca pml ob1" to mpirun.
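Applied to the mpirun invocation from your message (same hostfile and executable), that would look something like:

  /usr/local/bin/mpirun --mca pml ob1 --mca btl ^tcp --hostfile machines -np 7 greetings

If you want the setting to persist rather than specifying it per command, you can also export OMPI_MCA_pml=ob1 in the environment or add "pml = ob1" to the openmpi-mca-params.conf file under your installation's etc/ directory.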
What kind of hardware/NIC does fs1 have? (See the note after the quoted log below for one way to check.)

--Nysal

On Wed, 2009-03-18 at 17:17 -0400, Gary Draving wrote:
> Hi all,
>
> Anyone ever seen an error like this? Seems like I have some setting
> wrong in Open MPI. I thought I had it set up like the other machines but
> it seems as though I have missed something. I only get the error when
> adding machine "fs1" to the hostfile list. The other 40+ machines seem
> fine.
>
> [fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer
> [[2469,1],0] on compute-0-0 selected pml ob1
>
> When I use ompi_info the output looks like my other machines:
>
> [root@fs1 openmpi-1.3]# ompi_info | grep btl
> MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
> MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
>
> The whole error is below; any help would be greatly appreciated.
>
> Gary
>
> [admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp
> --hostfile machines -np 7 greetings
> [fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer
> [[2212,1],0] on compute-0-0 selected pml ob1
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> PML add procs failed
> --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
> Process 2 ([[2212,1],0]) is on host: compute-0-0
> BTLs attempted: openib self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully;
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-0.local:1591] Abort before MPI_INIT completed successfully;
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 3 with PID 16884 on
> node dahl.calvin.edu exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [dahl.calvin.edu:16879] 3 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> [dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to
> 0 to see all help / error messages
> [dahl.calvin.edu:16879] 2 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
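To see why fs1 prefers cm, it may help to check which MTL components are installed there (the cm PML runs over an MTL such as psm for QLogic InfiniPath or mx for Myrinet MX), mirroring the ompi_info check you already ran for BTLs:

  [root@fs1 openmpi-1.3]# ompi_info | grep mtl

If an MTL such as psm shows up on fs1 but not on the other nodes, that would explain why fs1 picks cm while the rest pick ob1, and forcing "--mca pml ob1" as above works around the mismatch.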