On Mon, 2010-04-26 at 14:43 -0400, Jeff Squyres wrote:
> On Apr 24, 2010, at 10:14 PM, Nev wrote:
>
> > void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);
>
> This line is the problem: change RTLD_LOCAL to RTLD_GLOBAL and it'll work.
> There's another option, too -- keep reading...
>
> <highly complex linker voodoo>
>
> Before discussing why this happens, know that Open MPI plugins call functions
> back up in the main Open MPI libraries. As a
> crass-and-not-really-correct-but-close-enough example, consider that OMPI
> plugins are created (sorta) like this:
>
>     gcc my_plugin_source.c ... -L<dir> -lmpi --shared -o
>     mca_framework_component.so
>
> where libmpi.so is a shared library. These plugins are not making MPI
> standardized API function calls; they're calling internal functions inside
> libmpi.so (i.e., OMPI's internal implementation API). This is because
> libmpi.so (and friends) have a whole lotta infrastructure that the plugins
> need in order to be able to do their work.
>
> It's a fun use of the intelligence of linkers -- a normal MPI app is linked
> against OMPI's libmpi.so, but so is mca_framework_component.so. When your
> app calls MPI_Init, the normal run-time linker semantics take over, resolve
> the symbol, and then call it. Later, mca_framework_component.so is
> dlopen()'ed. The run-time linker sees that it needs libmpi.so, but realizes
> that libmpi.so is already loaded -- so it doesn't load it again. When
> mca_framework_component.so calls OMPI_do_something(), the same run-time
> resolution occurs, and (this is key) it calls the function in the same
> instance of libmpi.so that your app is using.
>
> Nifty. Without this concept, OMPI's plugin concept wouldn't work.
>
> Your code is dlopening liba2lib as LOCAL. The run-time linker pulls in
> libmpi.so at the same time as liba2lib (because MPI_Init needs it) -- and
> therefore libmpi.so is loaded into the same private space as liba2lib. But
> then later, the innards of Open MPI dlopen() mca_framework_component.so.
> This plugin is loaded into a DIFFERENT symbol space than libmpi.so. The key
> point here is that LOCAL is not "inherited", so to speak. If you dlopen()
> libfoo as LOCAL, if libfoo then dlopen()s more DSOs, those newly-opened DSOs
> are in a different space than libfoo.
>
> The best I can guess is that when mca_framework_component.so is dlopen()'ed,
> the linker says "ya, we have libmpi.so loaded" and it allows the load to
> complete successfully. But later when it tries to actually resolve
> OMPI_do_something(), it fails -- because OMPI_do_something() is in the
> private/LOCAL symbol space. And therefore OMPI_do_something has a value of
> 0. And it segv's when we try to call through it. (this paragraph may not be
> exactly right; but it's probably close -- every time I think I understand
> linkers, I find out that I don't understand them at all...)
>
> It works for you in the static case because Open MPI slurps up all the
> components *into* libmpi.so in that case. Hence, all the components *and*
> all the internal libmpi symbols are loaded into the same LOCAL symbol space.
> There's no dlopen'ing of plugins in this case. And it all works fine because
> everything can resolve nicely, yadda yadda yadda.
>
> So I think your options are 1) to change that LOCAL to GLOBAL, 2) use
> "--enable-static --disable-shared", or 3) use --disable-dlopen. #2 builds
> libmpi.a *and* slurps all of OMPI's components up into libmpi.a. #3 builds
> libmpi.so *and* slurps all of OMPI's components up into libmpi.so. So you
> get the benefits of a shared library, but all the components are physically
> inside libmpi.so as opposed to being standalone DSO's.
>
> </highly complex linker voodoo>
>
> I hope that made sense!

Hi Jeff,

Thank you very much for your very clear and detailed explanation. I have
verified that all 3 options work with the minimal example. I will now verify
against the real code, but that will take a little longer.

Thanks again for the time and effort you spent addressing my problem.

Thanks,
Nev
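
For anyone finding this thread later, option 1 (swapping RTLD_LOCAL for RTLD_GLOBAL) can be sketched roughly as below. This is only an illustration, not the actual code from my application: the helper name open_with_global_symbols is hypothetical, and the file name liba2lib.so in the usage note is assumed from the library name mentioned in the thread. On older glibc versions you may also need to link with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of Open MPI or of the original code).
 * Opens a shared library with RTLD_GLOBAL instead of RTLD_LOCAL, so that
 * the symbols it brings in are visible to DSOs that are dlopen()'ed later,
 * such as Open MPI's plugins.
 *
 * Returns the dlopen() handle, or NULL on failure (the reason is printed
 * to stderr).
 */
static void *open_with_global_symbols(const char *lib_name)
{
    assert(lib_name != NULL);

    /* Clear any stale error state before the call. */
    (void)dlerror();

    /* The only change from the original line: RTLD_LOCAL -> RTLD_GLOBAL. */
    void *handle = dlopen(lib_name, RTLD_LAZY | RTLD_GLOBAL);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen(%s) failed: %s\n",
                lib_name, err != NULL ? err : "unknown error");
    }
    return handle;
}
```

In the minimal example this would be called as open_with_global_symbols("liba2lib.so") in place of the original dlopen() line. The trade-off is that everything the library exports lands in the global symbol namespace, so options 2 and 3 may be preferable if symbol clashes are a concern.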