[OMPI users] unresolved symbol mca_base_param_reg_int
Hi,

I am having a problem running an application with Open MPI version 1.4.1. The system works with version 1.2.7, but fails with versions 1.3.4 and 1.4.1. (These are the only versions I have tried.)

My application is linked against a shared library which does a dlopen of a second shared "C" library, which is compiled and linked using mpicc. The application and the first shared library are C++. I rebuild and relink the second shared library each time I change the Open MPI build.

When MPI_Init is called I get the following error:

    symbol lookup error: /opt/openmpi/lib/openmpi/mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

This does NOT occur with Open MPI version 1.2.7, or if I build Open MPI as a static library and then link against this static library.

I am building a default Open MPI except for --prefix=/opt/openmpi, plus --enable-static --disable-shared for the static library build.

I would like to be able to use a non-static Open MPI build.

Any suggestions on what I am doing wrong?

Thanks
Nev
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
Hi Jeff,

I have tried --disable-visibility but get the same results. Any other ideas? I am not able to remove the dlopen, but I may be able to change it to dlopen the MPI library directly, instead of my library that is linked to MPI. Is this likely to help?

Nev

On Mon, 2010-04-19 at 09:21 -0400, Jeff Squyres wrote:
> It could well be because of the additional dlopen in your application (we
> changed some things from the 1.2 series with regards to this kind of stuff).
>
> Try configuring Open MPI with the --disable-visibility switch and see if that
> helps.
>
> On Apr 17, 2010, at 9:05 PM, Nev wrote:
>
> > Hi,
> > I am having a problem running application with OpenMpi version 1.4.1.
> > The system works with version 1.2.7, but fails with version 1.3.4 and
> > 1.4.1. (These are the only version I have tried).
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
Hi Jeff,

I did the install to the "same place". I always use /opt/openmpi. The procedure I use when building is:

    configure --prefix=/opt/openmpi ...
    rm -r /opt/openmpi/*
    make clean
    make all
    make install

Is this sufficient to uninstall the previous version, or is more required?

On Tue, 2010-04-20 at 07:59 -0400, Jeff Squyres wrote:
> Gah! I didn't look at your error message closely enough the first time --
> sorry!
>
> Did you perchance upgrade an existing Open MPI installation in place? I.e.,
> have Open MPI 1.2.7 installed in /somewhere and then install Open MPI
> 1.3.x/1.4.x into the same /somewhere?
>
> If so, try a full uninstall of Open MPI 1.2.7 from /somewhere first -- or
> install Open MPI 1.4.x into /somewhere_else.
>
> The reason is that Open MPI has a set of plugins that are not necessarily
> compatible between versions, and are not necessarily removed if you just
> install a new version over an old version.
>
> On Apr 19, 2010, at 6:52 PM, Nev wrote:
>
> > Hi Jeff,
> > I have tried --disable-visibility but get the same results. Any other
> > ideas? I am not able to remove the dlopen, but maybe able to move it to
> > directly dlopen the mpi library, instead of my library that is linked to
> > mpi. Is this likely to help?
> > Nev
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
On Tue, 2010-04-20 at 20:22 -0400, Jeff Squyres wrote:
> On Apr 20, 2010, at 6:16 PM, Nev wrote:
>
> > Hi Jeff,
> > I did the install to the "same place". I always use /opt/openmpi, the
> > procedure I use when building is
> >     configure --prefix=/opt/openmpi ...
> >     rm -r /opt/openmpi/*
> >     make clean
> >     make all
> >     make install
> > is this sufficient to un-install previous version, or is more required.
>
> Yes, that should be sufficient. Is that what you did this time?
>
> If so, is there any way you can provide a small code example of the problem
> you're seeing?

OK, I will attempt to reduce this to a minimal code set, but will not be able to do so until the weekend.
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
On Thu, 2010-04-22 at 08:09 +1000, Nev wrote:
> OK, I will attempt to reduce to minimal code set, but will not be able
> to do so until the week end.

Hi Jeff,

Hopefully I have included sufficient information for you to identify what I am doing incorrectly.

I have created a minimal set of code which was built, linked and run against version 1.2.7 shared, version 1.4.1 shared and version 1.4.1 static. However, I have not been able to reproduce the error message reported earlier.

v1.4.1 static WORKS with no error or warning messages.
v1.4.1 shared FAILS with the message "mpirun noticed that process rank 0 with PID 31115 on node dingo3 exited on signal 11 (Segmentation fault)".
v1.2.7 shared WORKS but with the message "[dingo3:31123] mca: base: component_find: unable to open osc pt2pt: file not found (ignored)".

I have also run the above 3 configurations with actual comms between the processes, and that works except for 1.4.1 shared. 1.4.1 shared always fails in the call to MPI_Init(...).

To run, I used the command:

    /opt/openmpi/bin/mpirun -np 2 -mca btl tcp,self \
        -x LD_LIBRARY_PATH=/opt/openmpi/lib:/work/lib \
        -x PATH=/opt/openmpi/bin:/work/bin:/usr/bin \
        -host dingo3 a3exec

Setting LD_LIBRARY_PATH and PATH is not my normal habit, but I did so here to minimise any external dependencies.
This test machine is a newly installed (i.e. very clean) Ubuntu 9.10 64-bit desktop with the server kernel. It is a dual-socket, 8-core, hyperthreaded Intel box. It has the following installed:

  a. openssh + freenx
  b. KVM
  c. build-essential
  d. 32 bit libraries
  e. bridge-utils
  f. uml-utilities

Open MPI was built with

    ./configure prefix=/opt/openmpi CFLAGS=-m32 CXXFLAGS=-m32

plus --enable-static --disable-shared for static builds.

I have also tested on 32-bit Ubuntu 9.10 and 8.04 (not clean) with the same results.

Minimal files:

init.c, built as "liba1lib.so" using mpicc

    #include "mpi.h"
    #include "stdio.h"

    static int mpiRank = -1;
    static int mpiSize = -1;

    int connect(int * const pArgc, char * * pArgv[])
    {
        printf("ENTER connect *pArgc=%d, *pArgv[0]=%s\n", *pArgc, (*pArgv)[0]);
        fflush(0);
        MPI_Init(pArgc, pArgv);
        // <<<>>> get to here for version 1.4.1 shared build
        printf("DONE MPI_init\n");
        fflush(0);
        MPI_Comm_rank(MPI_COMM_WORLD, &mpiRank);
        MPI_Comm_size(MPI_COMM_WORLD, &mpiSize);
        printf("MPI_rank = %d, MPI_size = %d\n", mpiRank, mpiSize);
        fflush(0);
        MPI_Finalize();
        printf("%d EXITING connect\n", mpiRank);
        fflush(0);
        return 0;
    }

load.cpp, built as "liba2lib.so" using g++

    extern "C"
    {
    #include <stdio.h>   /* printf, fprintf, fflush */
    #include <stdlib.h>  /* exit */
    #include <dlfcn.h>   /* dlopen, dlsym, dlerror */
    /* connect() in init.c returns int, so the pointer type does too */
    typedef int (*tConnect)(int * pArgc, char * * pArgv[]);
    void load(int * pArgc, char * * pArgv[]);
    }

    void load(int * pArgc, char * * pArgv[])
    {
        printf("ENTER load\n");
        fflush(0);
        dlerror();  /* clear any stale error string */
        char const * const libName = "liba1lib.so";
        void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);
        if (result == 0)
        {
            fprintf(stderr, "Failed to load library %s error = %s\n", libName, dlerror());
            fflush(0);
            exit(1);
        }
        char const * const symbolName = "connect";
        void * symbol = dlsym(result, symbolName);
        if (symbol == 0)
        {
            fprintf(stderr, "Failed to load symbol %s from %s error = %s\n", symbolName, libName, dlerror());
            fflush(0);
            exit(1);
        }
        ((tConnect)symbol)(pArgc, pArgv);
        printf("DONE load\n");
        fflush(0);
        return;
    }

main.cpp, built as "a3exec" using g++

    extern "C"
    {
    #include <stdio.h>   /* printf */
    void load(int * pArgc, char * * pArgv[]);
    }

    int main(int argc, char * argv[])
    {
        printf("ENTER main\n");
        load(&argc, &argv);
        printf("EXIT main\n");
        return 0;
    }

Thanks
Nev
Re: [OMPI users] unresolved symbol mca_base_param_reg_int
On Mon, 2010-04-26 at 14:43 -0400, Jeff Squyres wrote:
> On Apr 24, 2010, at 10:14 PM, Nev wrote:
>
> > void * const result = dlopen(libName, RTLD_LAZY | RTLD_LOCAL);
>
> This line is the problem: change RTLD_LOCAL to RTLD_GLOBAL and it'll work.
> There's another option, too -- keep reading...
>
> Before discussing why this happens, know that Open MPI plugins call functions
> back up in the main Open MPI libraries. As a
> crass-and-not-really-correct-but-close-enough example, consider that OMPI
> plugins are created (sorta) like this:
>
>     gcc my_plugin_source.c ... -L -lmpi --shared -o mca_framework_component.so
>
> where libmpi.so is a shared library. These plugins are not making MPI
> standardized API function calls; they're calling internal functions inside
> libmpi.so (i.e., OMPI's internal implementation API). This is because
> libmpi.so (and friends) have a whole lotta infrastructure that the plugins
> need in order to be able to do their work.
>
> It's a fun use of the intelligence of linkers -- a normal MPI app is linked
> against OMPI's libmpi.so, but so is mca_framework_component.so. When your
> app calls MPI_Init, the normal run-time linker semantics take over, resolve
> the symbol, and then call it. Later, mca_framework_component.so is
> dlopen()'ed. The run-time linker sees that it needs libmpi.so, but realizes
> that libmpi.so is already loaded -- so it doesn't load it again. When
> mca_framework_component.so calls OMPI_do_something(), the same run-time
> resolution occurs, and (this is key) it calls the function in the same
> instance of libmpi.so that your app is using.
>
> Nifty. Without this concept, OMPI's plugin concept wouldn't work.
>
> Your code is dlopening liba1lib as LOCAL. The run-time linker pulls in
> libmpi.so at the same time as liba1lib (because MPI_Init needs it) -- and
> therefore libmpi.so is loaded into the same private space as liba1lib. But
> then later, the innards of Open MPI dlopen() mca_framework_component.so.
> This plugin is loaded into a DIFFERENT symbol space than libmpi.so. The key
> point here is that LOCAL is not "inherited", so to speak. If you dlopen()
> libfoo as LOCAL, if libfoo then dlopen()s more DSOs, those newly-opened DSOs
> are in a different space than libfoo.
>
> The best I can guess is that when mca_framework_component.so is dlopen()'ed,
> the linker says "ya, we have libmpi.so loaded" and it allows the load to
> complete successfully. But later when it tries to actually resolve
> OMPI_do_something(), it fails -- because OMPI_do_something() is in the
> private/LOCAL symbol space. And therefore OMPI_do_something has a value of
> 0. And it segv's when we try to call through it. (this paragraph may not be
> exactly right; but it's probably close -- every time I think I understand
> linkers, I find out that I don't understand them at all...)
>
> It works for you in the static case because Open MPI slurps up all the
> components *into* libmpi.so in that case. Hence, all the components *and*
> all the internal libmpi symbols are loaded into the same LOCAL symbol space.
> There's no dlopen'ing of plugins in this case. And it all works fine because
> everything can resolve nicely, yadda yadda yadda.
>
> So I think your options are 1) to change that LOCAL to GLOBAL, 2) use
> "--enable-static --disable-shared", or 3) use --disable-dlopen. #2 builds
> libmpi.a *and* slurps all of OMPI's components up into libmpi.a. #3 builds
> libmpi.so *and* slurps all of OMPI's components up into libmpi.so. So you
> get the benefits of a shared library, but all the components are physically
> inside libmpi.so as opposed to being standalone DSO's.
>
> I hope that made sense!

Hi Jeff,

Thank you very much for your very clear and detailed explanation. I have verified that all 3 options work with the minimal example. I will now verify against the real code, but that will take a little longer.

Thanks again for your time and effort in addressing my problem.

Thanks
Nev.
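
For readers who land on this thread with the same symptom, the first option Jeff lists (opening the MPI-linked library with RTLD_GLOBAL instead of RTLD_LOCAL) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread: the helper names open_global and find_symbol are hypothetical, and the only calls used are the standard dlopen, dlsym and dlerror from dlfcn.h.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not from the thread): open a shared library so that
 * its symbols, and those of the dependencies it pulls in, are added to the
 * global symbol scope. Libraries dlopen()'ed later can then resolve against
 * them. Returns NULL on failure. */
void *open_global(const char *lib_name)
{
    assert(lib_name != NULL);
    dlerror(); /* clear any stale error string */
    void *handle = dlopen(lib_name, RTLD_LAZY | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "Failed to load library %s error = %s\n",
                lib_name, dlerror());
    }
    return handle;
}

/* Hypothetical helper (not from the thread): look up an entry point in an
 * already opened library. Returns NULL if the lookup fails. */
void *find_symbol(void *handle, const char *symbol_name)
{
    assert(handle != NULL && symbol_name != NULL);
    dlerror(); /* clear any stale error string */
    void *symbol = dlsym(handle, symbol_name);
    const char *err = dlerror();
    if (err != NULL) {
        fprintf(stderr, "Failed to load symbol %s error = %s\n",
                symbol_name, err);
        return NULL;
    }
    return symbol;
}
```

In the minimal example, load() would call open_global("liba1lib.so") and then find_symbol(handle, "connect") in place of the direct dlopen and dlsym calls. On older glibc versions the program needs to be linked with -ldl.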