After a bunch of off-list communication, it turns out that the OMPI warning cited in the first mail of this chain was indeed the culprit.
OMPI was warning that it could only register about half the memory in the machine due to limitations in the OFED driver. Once those limits were raised to cover the entire memory in the machine, the problems went away.

That was the main/root issue; there were a small number of other side issues that crept in during debugging and troubleshooting. We fixed all those configuration issues along the way (e.g., we did a clean, new install into a fresh installation tree to avoid any stale cruft from prior builds, etc.).

On Jan 24, 2014, at 6:24 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Greg and I are chatting off list; there's something definitely weird going on
> in his setup.
>
> We'll report back to the list when we figure it out.
>
>
> On Jan 24, 2014, at 1:26 PM, Gus Correa <g...@ldeo.columbia.edu>
> wrote:
>
>> On 01/24/2014 12:50 PM, Fischer, Greg A. wrote:
>>> Yep. That was the problem. It works beautifully now.
>>>
>>> Thanks for prodding me to take another look.
>>>
>>> With regards to openmpi-1.6.5, the system that I'm compiling and running on,
>>> SLES10, contains some pretty dated software (e.g. Linux 2.6.x, python 2.4,
>>> gcc 4.1.2). Is it possible there's simply an
>>> incompatibility lurking in there somewhere that would trip
>>> openmpi-1.6.5 but not openmpi-1.4.3?
>>>
>>> Greg
>>>
>>
>> Hi Greg
>>
>> FWIW, we have OpenMPI 1.6.5 installed
>> (and we have used OMPI 1.4.5, 1.4.4, 1.4.3, ..., 1.2.8, before)
>> in our older cluster that has CentOS 5.2, Linux kernel 2.6.18,
>> gcc 4.1.2, Python 2.4.3, etc.
>> Parallel programs compile and run with OMPI 1.6.5 without problems.
>>
>> I hope this helps,
>> Gus Correa
>>
>>>> -----Original Message-----
>>>> From: Fischer, Greg A.
>>>> Sent: Friday, January 24, 2014 11:41 AM
>>>> To: 'Open MPI Users'
>>>> Cc: Fischer, Greg A.
>>>> Subject: RE: [OMPI users] simple test problem hangs on mpi_finalize and
>>>> consumes all system resources
>>>>
>>>> Hmm...
>>>> It looks like CMAKE was somehow finding openmpi-1.6.5 instead of
>>>> openmpi-1.4.3, despite the environment variables being set otherwise. This
>>>> is likely the explanation. I'll try to chase that down.
>>>>
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>>> Squyres (jsquyres)
>>>>> Sent: Friday, January 24, 2014 11:39 AM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>>>>> consumes all system resources
>>>>>
>>>>> Ok. I only mention this because the "mca_paffinity_linux.so: undefined
>>>>> symbol: mca_base_param_reg_int" type of message is almost always an
>>>>> indicator of two different versions being installed into the same tree.
>>>>>
>>>>>
>>>>> On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A."
>>>>> <fisch...@westinghouse.com> wrote:
>>>>>
>>>>>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>>>>>
>>>>>> 1003 fischega@lxlogin2[~]> ls /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>>>>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>>>>>> bin etc include lib share
>>>>>>
>>>>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>>>>>> bin etc include lib share
>>>>>>
>>>>>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was
>>>>>> set correctly, but I'll check again.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>>>>> Squyres (jsquyres)
>>>>>>> Sent: Friday, January 24, 2014 11:07 AM
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>>>>>> and consumes all system resources
>>>>>>>
>>>>>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
>>>>>>> <fisch...@westinghouse.com> wrote:
>>>>>>>
>>>>>>>> The reason for deleting the openmpi-1.6.5 installation was that I
>>>>>>>> went back and installed openmpi-1.4.3 and the problem (mostly) went
>>>>>>>> away. Openmpi-1.4.3 can run the simple tests without issue, but on
>>>>>>>> my "real" program, I'm getting symbol lookup errors:
>>>>>>>>
>>>>>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>>>>>
>>>>>>> This sounds like you are mixing 1.6.x and 1.4.x in the same
>>>>>>> installation tree. This can definitely lead to sadness.
>>>>>>>
>>>>>>> More specifically: installing 1.6 over an existing 1.4 installation
>>>>>>> (and vice versa) is definitely NOT supported. The set of plugins
>>>>>>> that the two install are different, and can lead to all manner of
>>>>>>> weird/undefined behavior.
>>>>>>>
>>>>>>> FWIW: I typically install Open MPI into a tree by itself. And if I
>>>>>>> later want to remove that installation, I just "rm -rf" that tree.
>>>>>>> Then I can install a different version of OMPI into that same tree
>>>>>>> (because the prior tree is completely gone).
>>>>>>>
>>>>>>> However, if you can't install OMPI into a tree by itself, you can
>>>>>>> "make uninstall" from the source tree, and that should surgically
>>>>>>> completely remove OMPI from the installation tree. Then it is safe
>>>>>>> to install a different version of OMPI into that same tree.
>>>>>>>
>>>>>>> Can you verify that you had installed OMPI into completely clean
>>>>>>> trees? If you didn't, I can imagine that causing the kinds of
>>>>>>> errors that you described.
>>>>>>>
>>>>>>> --
>>>>>>> Jeff Squyres
>>>>>>> jsquy...@cisco.com
>>>>>>> For corporate legal information go to:
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
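
For anyone who finds this thread later: here is a rough sketch of the arithmetic behind the "can only register about half the memory" warning. It assumes a Mellanox mlx4-style driver, where the registerable memory is commonly computed as (2^log_num_mtt) * (2^log_mtts_per_seg) * page size. The thread does not say which driver or which values Greg's machine used, so the parameter values and the helper names below are illustrative assumptions only.

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registerable memory for an mlx4-style driver (assumed formula)."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


def min_log_num_mtt(target_bytes, log_mtts_per_seg, page_size=4096):
    """Smallest log_num_mtt whose limit covers target_bytes (hypothetical helper)."""
    bytes_per_entry = (2 ** log_mtts_per_seg) * page_size
    entries = -(-target_bytes // bytes_per_entry)  # ceiling division
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1
    return (entries - 1).bit_length()


GIB = 2 ** 30

if __name__ == "__main__":
    # Illustrative values only: a machine with 64 GiB of RAM.
    limit = max_registerable_bytes(log_num_mtt=20, log_mtts_per_seg=3)
    print("current limit (GiB):", limit // GIB)
    print("log_num_mtt needed for 64 GiB:", min_log_num_mtt(64 * GIB, 3))
```

With those example values the limit comes out to half of the 64 GiB in the machine, which is the kind of situation the warning describes. Check your own driver's documentation for the actual parameter names and defaults before changing anything.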