No, this has nothing to do with the registration limit. For some reason, the system is refusing to create a thread - i.e., it is pthread_create that is failing. I have no idea what would be causing that to happen.
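If you want to see exactly why it is being refused: pthread_create reports the error in its return value rather than in errno, so the failure is typically EAGAIN (some resource or limit was hit) or EPERM (scheduling policy). A tiny standalone test along these lines -- just a sketch, with a placeholder thread body, not the actual async event thread from btl_openib_component.c -- would tell you which one you are hitting:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    /* Placeholder thread body -- the real async event thread lives in
       btl_openib_component.c; this one exists only so the test compiles. */
    static void *thread_body(void *arg)
    {
        (void) arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int rc = pthread_create(&tid, NULL, thread_body, NULL);
        if (rc != 0) {
            /* pthread_create reports the error in its return value, not in
               errno: EAGAIN usually means a resource limit was hit, EPERM a
               scheduling-policy problem. */
            fprintf(stderr, "pthread_create failed: %s (%d)\n",
                    strerror(rc), rc);
            return 1;
        }
        pthread_join(tid, NULL);
        printf("thread created and joined fine\n");
        return 0;
    }

Build it with something like "cc -o thread_check thread_check.c -lpthread" and run it on one of the compute nodes, ideally from inside a Torque job, since the limits there can differ from what an interactive shell sees.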
Try setting it to unlimited and see if it allows the thread to start, I guess.

On Aug 12, 2013, at 2:20 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Ralph, all
>
> I include more information below,
> after turning on btl_openib_verbose 30.
> As you can see, OMPI tries, and fails, to load openib.
>
> Last week I reduced the memlock limit from unlimited
> to ~12GB, as part of a general attempt to rein in memory
> use/abuse by jobs sharing a node.
> No parallel job ran until today, when the problem showed up.
> Could the memlock limit be the root of the problem?
>
> The OMPI FAQ says the memlock limit
> should be a "large number (or better yet, unlimited)":
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> The next two FAQ entries kind of indicate that
> it should be set to "unlimited", but don't say it clearly:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>
> QUESTION:
> Is "unlimited" a must, or is there any (magic) "large number"
> that would be OK for openib?
>
> I thought a 12GB memlock limit would be OK, but maybe it is not.
> The nodes have 64GB RAM.
>
> Thank you,
> Gus Correa
>
> *************************************************
> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> --------------------------------------------------------------------------
> WARNING: There was an error initializing an OpenFabrics device.
>
> Local host:   node15.cluster
> Local device: mlx4_0
> --------------------------------------------------------------------------
> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread]
> Failed to create async event thread
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[8097,1],4]) is on host: node15.cluster
> Process 2 ([[8097,1],16]) is on host: node14
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> *************************************************
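Regarding the memlock question above: whatever value you settle on, it is worth confirming what limit the MPI processes actually inherit, because limits configured for login shells often do not propagate to processes started by the Torque pbs_mom. A small check like this -- just a sketch; run it over tcp, since that path still works for you -- prints RLIMIT_MEMLOCK from every rank:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/resource.h>
    #include <mpi.h>

    /* Print the memlock limit each MPI rank actually inherits.
       RLIM_INFINITY shows up as -1 (or a huge number), depending on
       how the platform defines it. */
    int main(int argc, char **argv)
    {
        int rank;
        char host[256];
        struct rlimit rl;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));
        getrlimit(RLIMIT_MEMLOCK, &rl);
        printf("rank %d on %s: memlock soft=%lld hard=%lld (bytes)\n",
               rank, host, (long long) rl.rlim_cur, (long long) rl.rlim_max);
        MPI_Finalize();
        return 0;
    }

Something like "mpiexec --mca btl tcp,sm,self ./show_memlock" across both nodes would show whether node15 and node14 really see the ~12GB you configured, or something much smaller.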
> On 08/12/2013 03:32 PM, Gus Correa wrote:
>> Thank you for the prompt help, Ralph!
>>
>> Yes, it is OMPI 1.4.3 built with openib support:
>>
>> $ ompi_info | grep openib
>> MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>>
>> There are three libraries in prefix/lib/openmpi,
>> no mca_btl_openib library:
>>
>> $ ls $PREFIX/lib/openmpi/
>> libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
>>
>> However, this may be just because it is an older OMPI version in
>> the 1.4 series.
>> Those are exactly the libraries I have on another cluster with IB
>> and OMPI 1.4.3, where there isn't a problem.
>> The libraries' organization may have changed from
>> the 1.4 to the 1.6 series, right?
>> I only have mca_btl_openib libraries in the 1.6 series, but it
>> will be a hardship to migrate this program to OMPI 1.6.
>>
>> (OK, I have newer OMPI installed, but I also need the old one
>> for some programs.)
>>
>> Why the heck is it not detecting the Infiniband hardware?
>> [It used to detect it! :( ]
>>
>> Thank you,
>> Gus Correa
>>
>>
>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>> Check ompi_info - was it built with openib support?
>>>
>>> Then check that the mca_btl_openib library is present in the
>>> prefix/lib/openmpi directory
>>>
>>> Sounds like it isn't finding the openib plugin
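One note on that check: in builds where the BTLs are installed as DSOs under prefix/lib/openmpi, the plugin can also be present but fail to load (for example, an unresolved libibverbs symbol), in which case it quietly drops out of the BTL list. A bare dlopen probe -- just a sketch; the default path below is only a placeholder, pass the real path to mca_btl_openib.so as the first argument -- shows whether the DSO loads cleanly and, if not, why:

    #include <dlfcn.h>
    #include <stdio.h>

    /* Try to load an Open MPI component DSO by hand and report why it
       fails, if it does. The default path below is only a placeholder --
       pass the real path to mca_btl_openib.so as argv[1]. */
    int main(int argc, char **argv)
    {
        const char *path = (argc > 1)
            ? argv[1]
            : "/opt/openmpi/lib/openmpi/mca_btl_openib.so";
        void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);

        if (handle == NULL) {
            fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
            return 1;
        }
        printf("%s loaded cleanly\n", path);
        dlclose(handle);
        return 0;
    }

Build it with "cc -o probe probe.c -ldl". If, as seems to be the case with this 1.4.3 install, the components were compiled directly into the main library instead of being built as DSOs, there is no separate .so to probe and ompi_info is the authoritative answer.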
>>> On Aug 12, 2013, at 11:57 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>
>>>> Dear Open MPI pros
>>>>
>>>> On one of the clusters here, which has Infiniband,
>>>> I am getting this type of error from
>>>> Open MPI 1.4.3 (OK, I know it is old ...):
>>>>
>>>> *********************************************************
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> Tcl_InitNotifier: unable to start notifier thread
>>>> Abort: Command not found.
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>> Process 2 ([[907,1],0]) is on host: node15
>>>> BTLs attempted: self sm
>>>>
>>>> Your MPI job is now going to abort; sorry.
>>>> --------------------------------------------------------------------------
>>>> *********************************************************
>>>>
>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>> The same error also happens if I force --mca btl openib,sm,self
>>>> in mpiexec.
>>>>
>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>>
>>>> I also don't understand the initial errors,
>>>> "Tcl_InitNotifier: unable to start notifier thread".
>>>> Are they coming from Torque perhaps?
>>>>
>>>> As I said, the cluster has Infiniband,
>>>> which is what we've been using forever, until
>>>> these errors started today.
>>>>
>>>> When I divert the traffic to tcp
>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>>
>>>> I am using the examples/connectivity_c.c program
>>>> to troubleshoot this problem.
>>>>
>>>> ***
>>>> I checked a few things on the IB side.
>>>>
>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>> and so does the output of ibhosts and ibchecknet.
>>>>
>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>> and I cleared them with ibclearerrors.
>>>>
>>>> The IB subnet manager is running on the head node.
>>>> I restarted the daemon, but nothing changed; the jobs continue to
>>>> fail with the same errors.
>>>>
>>>> **
>>>>
>>>> Any hints of what is going on, how to diagnose it, and how to fix it?
>>>> Is there any gentler way than rebooting everything and power-cycling
>>>> the IB switch? (And would this brute-force method work, at least?)
>>>>
>>>> Thank you,
>>>> Gus Correa
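For anyone who wants to reproduce the test without the Open MPI source tree handy, here is a rough stand-in for the connectivity check Gus mentions above. It is not the actual examples/connectivity_c.c, just a minimal sketch in the same spirit: every rank exchanges one message with every other rank, so an unreachable pair shows up immediately as a hang or an abort.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal connectivity check: every rank exchanges one int with
       every other rank, so a pair that cannot reach each other (the
       error in the logs above) fails right here. */
    int main(int argc, char **argv)
    {
        int rank, size, peer, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (peer = 0; peer < size; peer++) {
            if (peer == rank) continue;
            if (rank < peer) {  /* lower rank sends first to avoid deadlock */
                MPI_Send(&rank, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&rank, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            printf("all %d ranks can reach each other\n", size);
        }
        MPI_Finalize();
        return 0;
    }

Compile it with mpicc and run it with the BTL selection being debugged, e.g. "mpiexec --mca btl openib,sm,self ./conn_check", and again with "--mca btl tcp,sm,self" as a cross-check.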