fs1 is selecting the "cm" PML whereas the other nodes are selecting the
"ob1" PML component. You can force ob1 to be used via "--mca pml ob1".

What kind of hardware/NIC does fs1 have?
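
You could also compare "ompi_info | grep mtl" on fs1 and on one of the
compute nodes. The cm PML is normally picked when an MTL component
(e.g. psm or mx) is usable, so a difference in that output should point
at what fs1 has that the other nodes do not (assuming nothing else in
your MCA configuration differs between them).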

--Nysal

On Wed, 2009-03-18 at 17:17 -0400, Gary Draving wrote:
> Hi all,
> 
> Has anyone ever seen an error like this? It seems I have some setting 
> wrong in Open MPI. I thought I had it set up like the other machines, but 
> it seems as though I have missed something. I only get the error when 
> adding the machine "fs1" to the hostfile list. The other 40+ machines seem 
> fine.
> 
> [fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer 
> [[2469,1],0] on compute-0-0 selected pml ob1
> 
> When I use ompi_info, the output looks the same as on my other machines:
> 
> [root@fs1 openmpi-1.3]# ompi_info | grep btl
>                  MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3)
>                  MCA btl: openib (MCA v2.0, API v2.0, Component v1.3)
>                  MCA btl: self (MCA v2.0, API v2.0, Component v1.3)
>                  MCA btl: sm (MCA v2.0, API v2.0, Component v1.3)
> 
> The whole error is below, any help would be greatly appreciated.
> 
> Gary
> 
> [admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp 
> --hostfile machines -np 7 greetings
> [fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer 
> [[2212,1],0] on compute-0-0 selected pml ob1
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not 
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>   Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu
>   Process 2 ([[2212,1],0]) is on host: compute-0-0
>   BTLs attempted: openib self sm
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully; 
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-0.local:1591] Abort before MPI_INIT completed successfully; 
> not able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not 
> able to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 3 with PID 16884 on
> node dahl.calvin.edu exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [dahl.calvin.edu:16879] 3 more processes have sent help message 
> help-mpi-runtime / mpi_init:startup:internal-failure
> [dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to 
> 0 to see all help / error messages
> [dahl.calvin.edu:16879] 2 more processes have sent help message 
> help-mca-bml-r2.txt / unreachable proc
> 
