Check ompi_info -- was Open MPI built with openib support? Then check that the mca_btl_openib plugin is present in the prefix/lib/openmpi directory.
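Something like the following sketch covers both checks; the install prefix is an assumption (set OMPI_PREFIX or edit it to match your installation):

```shell
# Hedged sketch: verify the openib BTL was built and its plugin is installed.
# The prefix below is an assumption; point it at your Open MPI install tree.
prefix=${OMPI_PREFIX:-/usr/local}

# 1) Does ompi_info list the openib BTL component?
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i "btl: openib" || echo "openib BTL not listed by ompi_info"
else
    echo "ompi_info not found in PATH"
fi

# 2) Is the plugin file present in the component directory?
ls "$prefix/lib/openmpi"/mca_btl_openib* 2>/dev/null \
    || echo "mca_btl_openib plugin not found under $prefix/lib/openmpi"
```

If ompi_info doesn't list openib at all, the build never had IB support; if it lists it but the plugin file is missing, the installation is broken or you're picking up a different prefix at run time.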
Sounds like it isn't finding the openib plugin.

On Aug 12, 2013, at 11:57 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Dear Open MPI pros
>
> On one of the clusters here, which has Infiniband,
> I am getting this type of error from
> OpenMPI 1.4.3 (OK, I know it is old ...):
>
> *********************************************************
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[907,1],68]) is on host: node11.cluster
>   Process 2 ([[907,1],0]) is on host: node15
>   BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *********************************************************
>
> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
> The same error also happens if I force --mca btl openib,sm,self
> in mpiexec.
>
> ** Why is it attempting only the self and sm BTLs, but not openib? **
>
> I don't understand the initial errors either:
> "Tcl_InitNotifier: unable to start notifier thread".
> Are they coming from Torque, perhaps?
>
> As I said, the cluster has Infiniband,
> which is what we've been using forever, until
> these errors started today.
>
> When I divert the traffic to tcp
> (--mca btl tcp,sm,self), the jobs run normally.
>
> I am using the examples/connectivity_c.c program
> to troubleshoot this problem.
>
> ***
>
> I checked a few things on the IB side.
>
> The output of ibstat on all nodes seems OK (links up, etc.),
> and so is the output of ibhosts and ibchecknet.
>
> Only two connected ports had errors, as reported by ibcheckerrors,
> and I cleared them with ibclearerrors.
>
> The IB subnet manager is running on the head node.
> I restarted the daemon, but nothing changed; the jobs continue to
> fail with the same errors.
>
> **
>
> Any hints of what is going on, how to diagnose it, and how to fix it?
> Any gentler way than rebooting everything and power-cycling
> the IB switch? (And would this brute-force method work, at least?)
>
> Thank you,
> Gus Correa
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
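If the component does turn out to be installed, a next step is to rerun the connectivity test with BTL selection verbosity turned up, so Open MPI prints why openib is being excluded. A hedged sketch (the hostnames, process count, and connectivity_c binary name are placeholders based on the report above):

```shell
# Hedged sketch: ask the BTL framework to explain its component selection.
# btl_base_verbose makes Open MPI log which BTLs open, initialize, or are
# skipped. Hostnames and the binary name are assumptions for illustration.
cmd="mpiexec -np 2 --host node11.cluster,node15 \
    --mca btl openib,sm,self \
    --mca btl_base_verbose 100 \
    ./connectivity_c"

if command -v mpiexec >/dev/null 2>&1; then
    eval "$cmd"
else
    echo "mpiexec not available; would run: $cmd"
fi
```

The verbose output typically shows whether mca_btl_openib fails to load (missing library), fails to initialize (no usable IB port), or is filtered out by an MCA parameter.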