On Apr 24, 2010, at 10:14 PM, Nev wrote:

>   void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);

This line is the problem: change RTLD_LOCAL to RTLD_GLOBAL and it'll work.  
There's another option, too -- keep reading...
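
For reference, here is a minimal sketch of that one-flag change.  The helper 
name open_globally() is made up for illustration and is not from your code; 
only the dlopen() flags matter:

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (name is an assumption, not from the original code):
 * opens a library with RTLD_GLOBAL so that its symbols, and those of its
 * dependencies such as libmpi.so, are visible to DSOs loaded later. */
static void *open_globally(const char *libName)
{
    void *const result = dlopen(libName, RTLD_LAZY | RTLD_GLOBAL);
    if (result == NULL) {
        /* dlerror() describes the most recent dlopen/dlsym failure. */
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
    }
    return result;
}
```

On older glibc versions you may also need to link with -ldl.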

<highly complex linker voodoo>

Before discussing why this happens, know that Open MPI plugins call functions 
back up in the main Open MPI libraries.  As a 
crass-and-not-really-correct-but-close-enough example, consider that OMPI 
plugins are created (sorta) like this:

    gcc my_plugin_source.c ... -L<dir> -lmpi --shared -o mca_framework_component.so

where libmpi.so is a shared library.  These plugins are not making MPI 
standardized API function calls; they're calling internal functions inside 
libmpi.so (i.e., OMPI's internal implementation API).  This is because 
libmpi.so (and friends) have a whole lotta infrastructure that the plugins need 
in order to be able to do their work.

It's a fun use of the intelligence of linkers -- a normal MPI app is linked 
against OMPI's libmpi.so, but so is mca_framework_component.so.  When your app 
calls MPI_Init, the normal run-time linker semantics take over, resolve the 
symbol, and then call it.  Later, mca_framework_component.so is dlopen()'ed.  
The run-time linker sees that it needs libmpi.so, but realizes that libmpi.so 
is already loaded -- so it doesn't load it again.  When 
mca_framework_component.so calls OMPI_do_something(), the same run-time 
resolution occurs, and (this is key) it calls the function in the same instance 
of libmpi.so that your app is using.

Nifty.  Without this concept, OMPI's plugin concept wouldn't work.

Your code is dlopening liba2lib as LOCAL.  The run-time linker pulls in 
libmpi.so at the same time as liba2lib (because MPI_Init needs it) -- and 
therefore libmpi.so is loaded into the same private space as liba2lib.  But 
then later, the innards of Open MPI dlopen() mca_framework_component.so.  This 
plugin is loaded into a DIFFERENT symbol space than libmpi.so.  The key point 
here is that LOCAL is not "inherited", so to speak.  If you dlopen() libfoo as 
LOCAL and libfoo then dlopen()s more DSOs, those newly-opened DSOs are in a 
different space than libfoo.

The best I can guess is that when mca_framework_component.so is dlopen()'ed, 
the linker says "ya, we have libmpi.so loaded" and it allows the load to 
complete successfully.  But later when it tries to actually resolve 
OMPI_do_something(), it fails -- because OMPI_do_something() is in the 
private/LOCAL symbol space.  OMPI_do_something therefore resolves to address 0, 
and we segv when we try to call through it.  (This paragraph may not be 
exactly right, but it's probably close -- every time I think I understand 
linkers, I find out that I don't understand them at all...)
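
If you want to test that guess, here is a rough diagnostic sketch.  It 
assumes the POSIX behavior that dlopen(NULL, ...) returns a handle covering 
the main program, its start-up libraries, and anything loaded with 
RTLD_GLOBAL.  The helper name symbol_globally_visible() is made up for 
illustration:

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical diagnostic helper: returns 1 if `name` can be found through
 * the process-wide handle (main program, start-up libraries, and anything
 * loaded with RTLD_GLOBAL), and 0 otherwise.  A symbol whose value is
 * genuinely NULL would also be reported as 0. */
static int symbol_globally_visible(const char *name)
{
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL) {
        return 0;
    }
    void *addr = dlsym(self, name);
    dlclose(self);
    return addr != NULL;
}
```

If my reading is right, calling symbol_globally_visible("MPI_Init") after 
loading liba2lib should return 0 with RTLD_LOCAL and 1 with RTLD_GLOBAL, but 
I have not verified that on your setup.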

It works for you in the static case because Open MPI slurps up all the 
components *into* libmpi.so in that case.  Hence, all the components *and* all 
the internal libmpi symbols are loaded into the same LOCAL symbol space.  
There's no dlopen'ing of plugins in this case.  And it all works fine because 
everything can resolve nicely, yadda yadda yadda.

So I think your options are 1) change that LOCAL to GLOBAL, 2) configure 
Open MPI with "--enable-static --disable-shared", or 3) configure Open MPI 
with "--disable-dlopen".  #2 builds 
libmpi.a *and* slurps all of OMPI's components up into libmpi.a.  #3 builds 
libmpi.so *and* slurps all of OMPI's components up into libmpi.so.  So you get 
the benefits of a shared library, but all the components are physically inside 
libmpi.so as opposed to being standalone DSOs.

</highly complex linker voodoo>

I hope that made sense!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

