Check the output of ompi_info: was this Open MPI built with openib support?
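
A minimal sketch of that first check, assuming ompi_info is on the PATH and belongs to the same install that mpiexec uses. Component lines in ompi_info output normally look like "MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)". The helper name has_openib_btl is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if ompi_info output on stdin lists an openib BTL.
has_openib_btl() {
    grep -Eq 'MCA[[:space:]]+btl:[[:space:]]+openib'
}

# Only query the real tool if it is available on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_openib_btl; then
        echo "openib BTL is listed by ompi_info"
    else
        echo "openib BTL is NOT listed; this build may lack openib support"
    fi
else
    echo "ompi_info not found on PATH"
fi
```

Run it on a compute node as well as on the head node, since the two may not see the same install.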

Then check that the mca_btl_openib library is present in the
prefix/lib/openmpi directory.
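
A rough sketch of the plugin check, assuming a Linux-style shared object name (mca_btl_openib.so). OMPI_PREFIX and the helper openib_plugin_present are placeholders, not part of Open MPI; set the prefix to wherever your build was installed. Some installs may use lib64 instead of lib.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the openib BTL plugin is readable under
# <prefix>/lib/openmpi. The .so file name is an assumption for Linux builds.
openib_plugin_present() {
    prefix=$1
    [ -r "$prefix/lib/openmpi/mca_btl_openib.so" ]
}

# OMPI_PREFIX is a placeholder; /usr/local is only a fallback guess.
OMPI_PREFIX=${OMPI_PREFIX:-/usr/local}
if openib_plugin_present "$OMPI_PREFIX"; then
    echo "found mca_btl_openib under $OMPI_PREFIX/lib/openmpi"
else
    echo "mca_btl_openib is missing under $OMPI_PREFIX/lib/openmpi"
fi
```

The check is worth repeating on the nodes named in the error message, in case the install directory is not visible there.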

It sounds like Open MPI isn't finding the openib plugin.


On Aug 12, 2013, at 11:57 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Dear Open MPI pros
> 
> On one of the clusters here, which has InfiniBand,
> I am getting this type of error from
> Open MPI 1.4.3 (OK, I know it is old ...):
> 
> *********************************************************
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> Tcl_InitNotifier: unable to start notifier thread
> Abort: Command not found.
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>  Process 1 ([[907,1],68]) is on host: node11.cluster
>  Process 2 ([[907,1],0]) is on host: node15
>  BTLs attempted: self sm
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> *********************************************************
> 
> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
> The same error also happens if I force --mca btl openib,sm,self
> in mpiexec.
> 
> ** Why is it attempting only the self and sm BTLs, but not openib? **
> 
> I don't understand the initial errors either:
> "Tcl_InitNotifier: unable to start notifier thread".
> Are they coming from Torque perhaps?
> 
> As I said, the cluster has InfiniBand,
> which is what we've been using forever, until
> these errors started today.
> 
> When I divert the traffic to tcp
> (--mca btl tcp,sm,self), the jobs run normally.
> 
> I am using the examples/connectivity_c.c program
> to troubleshoot this problem.
> 
> ***
> I checked a few things on the IB side.
> 
> The output of ibstat on all nodes seems OK (links up, etc.),
> and so is the output of ibhosts and ibchecknet.
> 
> Only two connected ports had errors, as reported by ibcheckerrors,
> and I cleared them with ibclearerrors.
> 
> The IB subnet manager is running on the head node.
> I restarted the daemon, but nothing changed; the jobs continue to
> fail with the same errors.
> 
> **
> 
> Any hints of what is going on, how to diagnose it, and how to fix it?
> Is there any gentler way than rebooting everything and power cycling
> the IB switch? (And would this brute force method work, at least?)
> 
> Thank you,
> Gus Correa
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
