No, this has nothing to do with the registration limit. For some reason, the 
system is refusing to create a thread - i.e., it is pthread_create that is 
failing. I have no idea what would be causing that to happen.
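
If you want to see exactly which error pthread_create is hitting, here is a
quick standalone sketch (just an illustration, not Open MPI code) that you
could run on a compute node inside a Torque job. Note that pthread_create
returns the error number directly rather than setting errno:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Minimal sketch: create one thread and report why it fails, if it does.
 * pthread_create() returns an error number (e.g. EAGAIN or ENOMEM)
 * as its return value; it does not set errno. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s (rc=%d)\n",
                strerror(rc), rc);
        return 1;
    }
    pthread_join(tid, NULL);
    printf("thread created and joined fine\n");
    return 0;
}

Compile it with something like "gcc -pthread check_thread.c -o check_thread"
and run it through the same launcher as the failing job.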

Try setting the memlock limit back to unlimited and see if it allows the thread to start, I guess.
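
And to double-check what limit the MPI processes actually see, something like
the sketch below might help; limits set in /etc/security/limits.conf don't
always propagate to processes launched by a daemon (e.g. pbs_mom), so it is
worth printing RLIMIT_MEMLOCK from inside a job. Again, this is just a sketch,
not anything shipped with Open MPI:

#include <stdio.h>
#include <sys/resource.h>

/* Sketch: print the soft and hard RLIMIT_MEMLOCK limits this process sees.
 * Launch it the same way the MPI job is launched (e.g. under Torque)
 * to confirm which locked-memory limit is actually in effect. */
int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("memlock soft limit: unlimited\n");
    else
        printf("memlock soft limit: %llu bytes\n",
               (unsigned long long) rl.rlim_cur);
    if (rl.rlim_max == RLIM_INFINITY)
        printf("memlock hard limit: unlimited\n");
    else
        printf("memlock hard limit: %llu bytes\n",
               (unsigned long long) rl.rlim_max);
    return 0;
}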


On Aug 12, 2013, at 2:20 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Ralph, all
> 
> I include more information below,
> after turning on btl_openib_verbose 30.
> As you can see, OMPI tries, and fails, to load openib.
> 
> Last week I reduced the memlock limit from unlimited
> to ~12GB, as part of a general attempt to rein in memory
> use/abuse by jobs sharing a node.
> No parallel job ran until today, when the problem showed up.
> Could the memlock limit be the root of the problem?
> 
> The OMPI FAQ says the memlock limit
> should be a "large number (or better yet, unlimited)":
> 
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> 
> The next two FAQ entries kind of indicate that
> it should be set to "unlimited", but don't say it clearly:
> 
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
> 
> QUESTION:
> Is "unlimited" a must, or is there any (magic) "large number"
> that would be OK for openib?
> 
> I thought a 12GB memlock limit would be OK, but maybe it is not.
> The nodes have 64GB RAM.
> 
> Thank you,
> Gus Correa
> 
> *************************************************
> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
> 
>  Local host:   node15.cluster
>  Local device: mlx4_0
> --------------------------------------------------------------------------
> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
>  Failed to create async event thread
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
> 
>  Process 1 ([[8097,1],4]) is on host: node15.cluster
>  Process 2 ([[8097,1],16]) is on host: node14
>  BTLs attempted: self sm
> 
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> 
> *************************************************
> 
> On 08/12/2013 03:32 PM, Gus Correa wrote:
>> Thank you for the prompt help, Ralph!
>> 
>> Yes, it is OMPI 1.4.3 built with openib support:
>> 
>> $ ompi_info | grep openib
>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>> 
>> There are only three libraries in prefix/lib/openmpi,
>> and no mca_btl_openib library.
>> 
>> $ ls $PREFIX/lib/openmpi/
>> libompi_dbg_msgq.a libompi_dbg_msgq.la libompi_dbg_msgq.so
>> 
>> 
>> However, this may just be because it is an older OMPI version in
>> the 1.4 series: those are exactly the files I have on another
>> cluster with IB and OMPI 1.4.3, where there isn't a problem.
>> The library organization may have changed from
>> the 1.4 to the 1.6 series, right?
>> I only have mca_btl_openib libraries in the 1.6 series, but it
>> would be a hardship to migrate this program to OMPI 1.6.
>> 
>> (OK, I have newer OMPI installed, but I also need the old one
>> for some programs.)
>> 
>> Why the heck is it not detecting the Infiniband hardware?
>> [It used to detect it! :( ]
>> 
>> Thank you,
>> Gus Correa
>> 
>> 
>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>> Check ompi_info - was it built with openib support?
>>> 
>>> Then check that the mca_btl_openib library is present in the
>>> prefix/lib/openmpi directory.
>>> 
>>> Sounds like it isn't finding the openib plugin.
>>> 
>>> 
>>> On Aug 12, 2013, at 11:57 AM, Gus Correa<g...@ldeo.columbia.edu> wrote:
>>> 
>>>> Dear Open MPI pros
>>>> 
>>>> On one of the clusters here, which has Infiniband,
>>>> I am getting this type of error from
>>>> Open MPI 1.4.3 (OK, I know it is old ...):
>>>> 
>>>> *********************************************************
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> --------------------------------------------------------------------------
>>>> 
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>> 
>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>> Process 2 ([[907,1],0]) is on host: node15
>>>> BTLs attempted: self sm
>>>> 
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> 
>>>> *********************************************************
>>>> 
>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>> The same error also happens if I force --mca btl openib,sm,self
>>>> in mpiexec.
>>>> 
>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>> 
>>>> I also don't understand the initial errors,
>>>> "Tcl_InitNotifier: unable to start notifier thread".
>>>> Are they coming from Torque, perhaps?
>>>> 
>>>> As I said, the cluster has Infiniband,
>>>> which is what we've been using forever, until
>>>> these errors started today.
>>>> 
>>>> When I divert the traffic to tcp
>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>> 
>>>> I am using the examples/connectivity_c.c program
>>>> to troubleshoot this problem.
>>>> 
>>>> ***
>>>> I checked a few things on the IB side.
>>>> 
>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>> and so does the output of ibhosts and ibchecknet.
>>>> 
>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>> and I cleared them with iblclearerrors.
>>>> 
>>>> The IB subnet manager is running on the head node.
>>>> I restarted the daemon, but nothing changed; the jobs continue to
>>>> fail with the same errors.
>>>> 
>>>> **
>>>> 
>>>> Any hints on what is going on, how to diagnose it, and how to fix it?
>>>> Is there any gentler way than rebooting everything and power cycling
>>>> the IB switch? (And would this brute-force method work, at least?)
>>>> 
>>>> Thank you,
>>>> Gus Correa
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
