On Mon, 2010-04-26 at 14:43 -0400, Jeff Squyres wrote:
> On Apr 24, 2010, at 10:14 PM, Nev wrote:
> 
> >   void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);
> 
> This line is the problem: change RTLD_LOCAL to RTLD_GLOBAL and it'll work.  
> There's another option, too -- keep reading...
> 
> <highly complex linker voodoo>
> 
> Before discussing why this happens, know that Open MPI plugins call functions 
> back up in the main Open MPI libraries.  As a 
> crass-and-not-really-correct-but-close-enough example, consider that OMPI 
> plugins are created (sorta) like this:
> 
>     gcc my_plugin_source.c ... -L<dir> -lmpi --shared -o mca_framework_component.so
> 
> where libmpi.so is a shared library.  These plugins are not making MPI 
> standardized API function calls; they're calling internal functions inside 
> libmpi.so (i.e., OMPI's internal implementation API).  This is because 
> libmpi.so (and friends) have a whole lotta infrastructure that the plugins 
> need in order to be able to do their work.
> 
> It's a fun use of the intelligence of linkers -- a normal MPI app is linked 
> against OMPI's libmpi.so, but so is mca_framework_component.so.  When your 
> app calls MPI_Init, the normal run-time linker semantics take over, resolve 
> the symbol, and then call it.  Later, mca_framework_component.so is 
> dlopen()'ed.  The run-time linker sees that it needs libmpi.so, but realizes 
> that libmpi.so is already loaded -- so it doesn't load it again.  When 
> mca_framework_component.so calls OMPI_do_something(), the same run-time 
> resolution occurs, and (this is key) it calls the function in the same 
> instance of libmpi.so that your app is using.
> 
> Nifty.  Without this concept, OMPI's plugin concept wouldn't work.
> 
> Your code is dlopening liba2lib as LOCAL.  The run-time linker pulls in 
> libmpi.so at the same time as liba2lib (because MPI_Init needs it) -- and 
> therefore libmpi.so is loaded into the same private space as liba2lib.  But 
> then later, the innards of Open MPI dlopen() mca_framework_component.so.  
> This plugin is loaded into a DIFFERENT symbol space than libmpi.so.  The key 
> point here is that LOCAL is not "inherited", so to speak.  If you dlopen() 
> libfoo as LOCAL, if libfoo then dlopen()s more DSOs, those newly-opened DSOs 
> are in a different space than libfoo.
> 
> The best I can guess is that when mca_framework_component.so is dlopen()'ed, 
> the linker says "ya, we have libmpi.so loaded" and it allows the load to 
> complete successfully.  But later when it tries to actually resolve 
> OMPI_do_something(), it fails -- because OMPI_do_something() is in the 
> private/LOCAL symbol space.  And therefore OMPI_do_something has a value of 
> 0.  And it segv's when we try to call through it.  (This paragraph may not be 
> exactly right; but it's probably close -- every time I think I understand 
> linkers, I find out that I don't understand them at all...)
> 
> It works for you in the static case because Open MPI slurps up all the 
> components *into* libmpi.so in that case.  Hence, all the components *and* 
> all the internal libmpi symbols are loaded into the same LOCAL symbol space.  
> There's no dlopen'ing of plugins in this case.  And it all works fine because 
> everything can resolve nicely, yadda yadda yadda.
> 
> So I think your options are 1) to change that LOCAL to GLOBAL, 2) use 
> "--enable-static --disable-shared", or 3) use --disable-dlopen.  #2 builds 
> libmpi.a *and* slurps all of OMPI's components up into libmpi.a.  #3 builds 
> libmpi.so *and* slurps all of OMPI's components up into libmpi.so.  So you 
> get the benefits of a shared library, but all the components are physically 
> inside libmpi.so as opposed to being standalone DSOs.
> 
> </highly complex linker voodoo>
> 
> I hope that made sense!
> 

Hi Jeff,
Thank you very much for your very clear and detailed explanation. I have
verified that all 3 options work with the minimal example. I will now
verify against the real code, but that will take a little longer.
Thanks again for your time and effort in addressing my problem.
Thanks,
Nev
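
For anyone finding this thread later, option 1 amounts to roughly the following. This is only a minimal sketch, not the original test program: the helper name open_global and its error handling are hypothetical, added for illustration. The only change that matters is RTLD_GLOBAL in place of RTLD_LOCAL, so that (per the explanation above) the symbols of the opened library and of the libmpi.so it pulls in are visible to plugins that Open MPI dlopen()s later.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not from the original program): open a shared
 * library so that its symbols join the global symbol scope instead of
 * a private one. */
static void *open_global(const char *lib_name)
{
    /* RTLD_GLOBAL instead of RTLD_LOCAL is the one-word change
     * described as option 1. */
    void *handle = dlopen(lib_name, RTLD_LAZY | RTLD_GLOBAL);

    if (handle == NULL) {
        /* dlerror() describes the most recent dl* failure. */
        const char *msg = dlerror();
        fprintf(stderr, "dlopen failed: %s\n",
                msg != NULL ? msg : "unknown error");
    }
    return handle;
}
```

In the minimal example, the original dlopen(libName, RTLD_LAZY | RTLD_LOCAL) line would become a call such as open_global(libName). On older glibc systems the program also needs to be linked with -ldl.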
