On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
> I seem to recall that you have an IB-based cluster, right?
>
> From a *very quick* glance at the code, it looks like this might be a simple
> incorrect-finalization issue. That is:
>
> - you run the job on a single server
> - openib disqualifies itself because you're running on a single server
> - openib then goes to finalize/close itself
> - but openib didn't fully initialize itself (because it disqualified itself
>   early in the initialization process), and something in the finalization
>   process didn't take that into account
>
> Nathan -- is that anywhere close to correct?

Nope. udcm_module_finalize is being called because there was an error
setting up the udcm state. See btl_openib_connect_udcm.c:476. The
opal_list_t destructor is hitting an assert failure, probably because
the constructor was never called.

I can rearrange the constructors to be called first, but there appears
to be a deeper issue with the user's system: udcm_module_init should not
be failing! It creates a couple of CQs, allocates a small number of
registered buffers, and starts monitoring the fd for the completion
channel. All these things are also done in the setup of the openib btl
itself.

Keep in mind that the openib btl will not disqualify itself when running
on a single server. Openib may be used to communicate on-node and is
needed for the dynamics case.

The user might try adding -mca btl_base_verbose 100 to shed some light
on what the real issue is.

BTW, I no longer monitor the user mailing list. If something needs my
attention, forward it to me directly.

-Nathan