[OMPI users] Low performance of Open MPI-1.3 over Gigabit
Dear All,

A Fortran application is installed with Open MPI-1.3 + Intel compilers on a Rocks-4.3 cluster with dual-socket, quad-core Intel Xeon processors @ 3 GHz (8 cores/node). The times consumed for different tests over Gigabit-connected nodes are as follows (each node has 8 GB of memory):

No of nodes used: 6, cores used/node: 4, total MPI processes: 24
CPU TIME: 1 hours 19 minutes 14.39 seconds
ELAPSED TIME: 2 hours 41 minutes 8.55 seconds

No of nodes used: 6, cores used/node: 8, total MPI processes: 48
CPU TIME: 4 hours 19 minutes 19.29 seconds
ELAPSED TIME: 9 hours 15 minutes 46.39 seconds

No of nodes used: 3, cores used/node: 8, total MPI processes: 24
CPU TIME: 2 hours 41 minutes 27.98 seconds
ELAPSED TIME: 4 hours 21 minutes 0.24 seconds

But the same application performs well on another Linux cluster with LAM-MPI-7.1.3:

No of nodes used: 6, cores used/node: 4, total MPI processes: 24
CPU TIME: 1 hours 30 min 37.25 s
ELAPSED TIME: 1 hours 51 min 10.00 s

No of nodes used: 12, cores used/node: 4, total MPI processes: 48
CPU TIME: 0 hours 46 min 13.98 s
ELAPSED TIME: 1 hours 02 min 26.11 s

No of nodes used: 6, cores used/node: 8, total MPI processes: 48
CPU TIME: 1 hours 13 min 09.17 s
ELAPSED TIME: 1 hours 47 min 14.04 s

So there is a huge difference between CPU TIME and ELAPSED TIME for the Open MPI jobs. Note: on the same cluster, Open MPI gives better performance on the InfiniBand nodes. What could be the problem for Open MPI over Gigabit? Do any flags need to be used? Or is Open MPI simply not a good choice over Gigabit?

Thanks,
Sangamesh
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
Thanks for the hints.

> You have some possible workarounds:
>
> - We recommended to the PyMPI author a while ago that he add his own dlopen() of libmpi before calling MPI_INIT, but specifically using RTLD_GLOBAL, so that the library is opened in the global process space (not a private space in the process). Then libmpi's (and friends) symbols will be available to its plugins. If you're unhappy with the non-portability of dlopen, try lt_dlopen_advise() -- it's a portable version that is linked inside Open MPI.

This is the solution we chose with our "modified Python" approach. We do not exactly dlopen libmpi but simply link the Python binary against it, which has the same effect.

> - Another option is to configure/compile Open MPI with "--disable-dlopen" or "--enable-static --disable-shared" configure options. Either of these options will cause Open MPI to slurp all of its plugins up into libmpi (etc.) and not dynamically open them at run-time, thereby avoiding the problem of Python opening libmpi in a private scope.

This sounds good, I gotta try this.

> - Get Python to give you the possibility of opening dependent libraries in the global scope. This may be somewhat controversial; there are good reasons to open plugins in private scopes. But I have to imagine that OMPI is not the only python extension out there that wants to open plugins of its own; other such projects should be running into similar issues.

That would involve patching Python in some nifty places, which would probably lead to less platform independence, so it is not an option yet.

---
Michael Meinel
German Aerospace Center
Center for Computer Applications in Aerospace Science and Engineering
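As a rough sketch of the first workaround (not from the original exchange), the same effect can also be approximated from within Python itself by pre-loading libmpi into the global symbol namespace before the MPI extension module is imported. The library name "libmpi.so.0" and the module name "minimpi" below are assumptions and will differ per installation (on Mac OS X the library would be libmpi.dylib):

import ctypes

# Pre-load libmpi with RTLD_GLOBAL so the MCA plugins that Open MPI
# dlopen()s later can resolve MPI symbols from the global scope.
# "libmpi.so.0" is an assumed name -- adjust for your installation.
ctypes.CDLL("libmpi.so.0", mode=ctypes.RTLD_GLOBAL)

import minimpi  # hypothetical MPI extension module; calls MPI_INIT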
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
On Tue, 2009-02-24 at 13:30 -0500, Jeff Squyres wrote:

> - Get Python to give you the possibility of opening dependent libraries in the global scope. This may be somewhat controversial; there are good reasons to open plugins in private scopes. But I have to imagine that OMPI is not the only python extension out there that wants to open plugins of its own; other such projects should be running into similar issues.

Can you check if the following works:

import dl
import sys

flags = sys.getdlopenflags()
sys.setdlopenflags(flags | dl.RTLD_GLOBAL)

import minimpi

--Nysal
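A side note, not from the original exchange: the dl module is only available on a handful of platforms and was later removed from Python, so a rough, untested variant that takes the flag value from ctypes may be more portable (minimpi is just the example module name from above):

import sys
import ctypes

# Same idea: make the interpreter dlopen() extension modules with
# RTLD_GLOBAL, so symbols in libmpi become visible to OMPI's plugins.
sys.setdlopenflags(sys.getdlopenflags() | ctypes.RTLD_GLOBAL)

import minimpi  # example MPI extension module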
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
On Feb 25, 2009, at 4:02 AM, wrote:

>> - Get Python to give you the possibility of opening dependent libraries in the global scope. This may be somewhat controversial; there are good reasons to open plugins in private scopes. But I have to imagine that OMPI is not the only python extension out there that wants to open plugins of its own; other such projects should be running into similar issues.
>
> That would involve patching Python in some nifty places, which would probably lead to less platform independence, so it is not an option yet.

I should have been more clear: what I meant was to engage the Python community to get such a feature to be implemented upstream in Python itself. Since I would find it easy to believe that other Python extension projects may run into similar issues, it may be worth raising this issue to the Python community and opening the debate there.

That being said, Nysal also posted an interesting approach. :-)

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
>> That would involve patching Python in some nifty places, which would probably lead to less platform independence, so it is not an option yet.
>
> I should have been more clear: what I meant was to engage the Python community to get such a feature to be implemented upstream in Python itself. Since I would find it easy to believe that other Python extension projects may run into similar issues, it may be worth raising this issue to the Python community and opening the debate there.
>
> That being said, Nysal also posted an interesting approach. :-)

I think what Nysal wrote gives every Python user exactly the chance to configure the dynamic loading to their own needs. I'll try that. But I think that obsoletes the need to open a discussion in the Python community.

--
Michael Meinel
German Aerospace Center
Center for Computer Applications in Aerospace Science and Engineering
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
If you simply want to call is "Problems in 1.3" I might have some things to add, though!

gerry

Jeff Squyres wrote:

> On Feb 23, 2009, at 8:59 PM, Jeff Squyres wrote:
>
>> Err... I'm a little confused. We've been emailing about this exact issue for a week or two (off list); you just re-started the conversation from the beginning, moved it to the user's list, and dropped all the CC's (which include several people who are not on this list). Why did you do that?
>
> GAAH!! Mea maxima culpa. :-(
>
> My stupid mail program did something strange (exact details unimportant) that made me think you re-sent your message to the users list yesterday -- thereby re-starting the whole conversation, etc. Upon double checking, I see that this is *not* what you did at all -- my mail program was showing me your original post from Feb 4 and making it look like you re-sent it yesterday. I just wasn't careful in my reading. Sorry about that; the fault and confusion was entirely mine. :-(
>
> (We're continuing the conversation off-list just because it's gnarly and full of details about Vampir that most people probably don't care about; they're working on a small example to send to me that replicates the problem -- will post back here when we have some kind of solution...)
>
> We now return you to your regularly scheduled programming...

--
Gerry Creager -- gerry.crea...@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843
Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer
On Feb 25, 2009, at 8:43 AM, Gerry Creager wrote:

> If you simply want to call is "Problems in 1.3" I might have some things to add, though!

I'm not quite sure how to parse this sentence -- are you saying that you have found some problems with Open MPI v1.3? If so, yes, we'd like to know what they are (so that we can fix them!).

--
Jeff Squyres
Cisco Systems
[OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)
Dear Open MPI gurus,

We have F90 code which compiles with MPICH on a dual-core PC laptop using the Intel compiler. We are trying to compile the code with Open MPI on a Mac Pro with 2 quad-core Xeons using gfortran. The code seems to be running ... for the most part. Unfortunately we keep getting a segfault which spits out a variant of the following message:

[oblix:21522] *** Process received signal ***
[oblix:21522] Signal: Segmentation fault (11)
[oblix:21522] Signal code: Address not mapped (1)
[oblix:21522] Failing at address: 0xc710
[oblix:21522] [ 0] 2 libSystem.B.dylib 0x92a892bb _sigtramp + 43
[oblix:21522] [ 1] 3 ??? 0x 0x0 + 4294967295
[oblix:21522] [ 2] 4 exe.out 0x0001281b MAIN__ + 4875
[oblix:21522] [ 3] 5 exe.out 0x00013c38 main + 40
[oblix:21522] [ 4] 6 exe.out 0x1936 start + 54
[oblix:21522] *** End of error message ***

After some researching of the error message, and digging around in the Open MPI users mailing list, it appears that the bug may be in Open MPI. Someone has offered what might be a workaround: rebuild Open MPI configured with the following flag:

--with-memory-manager=none

Sadly, the segfaults still occur. What should we try next?

Best regards,
-Ken Mighell

P.S. The ompi_info result was attached: ompi_info.result (binary data)
[OMPI users] openmpi 1.2.9 with Xgrid support more information
Hi,

I have checked the crash log; the result is below. If I am reading it and following the mpirun code correctly, the release of the last mca_pls_xgrid_component.client by orte_pls_xgrid_finalize causes a call to the dealloc method of PlsXGridClient, where a [connection finalize] is called that ends up as a [NSObject finalize]. I think this is as intended (does anyone know if that is correct?), but for some unknown reason it is not liked by my configuration. The only thing that I can find is that the behaviour of the finalize method in NSObject depends on the status of garbage collection. I am using gcc-4.4 and Xcode 3.1.2.

Ricardo

Process: mpirun [854]
Path: /opt/openmpi/bin/mpirun
Identifier: mpirun
Version: ??? (???)
Code Type: X86 (Native)
Parent Process: bash [829]
Date/Time: 2009-02-25 17:09:53.411 +0100
OS Version: Mac OS X Server 10.5.6 (9G71)
Report Version: 6

Exception Type: EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0002, 0x
Crashed Thread: 0

Application Specific Information:
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x216910> finalize]: called when collecting not enabled'

Thread 0 Crashed:
0 com.apple.CoreFoundation 0x917dffb4 ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ + 4
1 libobjc.A.dylib 0x91255e3b objc_exception_throw + 40
2 com.apple.CoreFoundation 0x917e701d -[NSObject finalize] + 157
3 mca_pls_xgrid.so 0x0019bf8b -[PlsXGridClient dealloc] + 59 (opal_object.h:403)
4 mca_pls_xgrid.so 0x0019a120 orte_pls_xgrid_finalize + 48 (pls_xgrid_module.m:219)
5 libopen-rte.0.dylib 0x0007b093 orte_pls_base_close + 35
6 libopen-rte.0.dylib 0x0005cb5e orte_system_finalize + 142
7 libopen-rte.0.dylib 0x0005932f orte_finalize + 47
8 mpirun 0x2702 orterun + 2202 (orterun.c:496)
9 mpirun 0x1b06 main + 24 (main.c:14)
10 mpirun 0x1ac2 start + 54
Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)
On Feb 25, 2009, at 12:25 PM, Ken Mighell wrote:

> We are trying to compile the code with Open MPI on a Mac Pro with 2 quad-core Xeons using gfortran. The code seems to be running ... for the most part. Unfortunately we keep getting a segfault which spits out a variant of the following message:
>
> [oblix:21522] *** Process received signal ***
> [oblix:21522] Signal: Segmentation fault (11)
> [oblix:21522] Signal code: Address not mapped (1)
> [oblix:21522] Failing at address: 0xc710
> [oblix:21522] [ 0] 2 libSystem.B.dylib 0x92a892bb _sigtramp + 43
> [oblix:21522] [ 1] 3 ??? 0x 0x0 + 4294967295
> [oblix:21522] [ 2] 4 exe.out 0x0001281b MAIN__ + 4875
> [oblix:21522] [ 3] 5 exe.out 0x00013c38 main + 40
> [oblix:21522] [ 4] 6 exe.out 0x1936 start + 54
> [oblix:21522] *** End of error message ***
>
> After some researching of the error message, and digging around in the Open MPI users mailing list, it appears that the bug may be in Open MPI.

I'm not sure what you mean by this -- getting a stack trace out of Open MPI doesn't necessarily mean a bug in Open MPI. Can you get a corefile and look and see what exactly failed? Or run under a debugger to see where/how exactly the process fails? From the stack trace above, it looks like the failure occurs in application code, not Open MPI...?

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] openmpi 1.2.9 with Xgrid support more information
Ricardo -

That's really interesting. This is on a Leopard system, right?

I'm the author/maintainer of the xgrid code. Unfortunately, I've been hiding trying to finish my dissertation the last couple of months. I can't offer much advice without digging into it in more detail than I have time to do in the near future.

Brian

On Wed, 25 Feb 2009, Ricardo Fernández-Perea wrote:

> Hi,
>
> I have checked the crash log; the result is below. If I am reading it and following the mpirun code correctly, the release of the last mca_pls_xgrid_component.client by orte_pls_xgrid_finalize causes a call to the dealloc method of PlsXGridClient, where a [connection finalize] is called that ends up as a [NSObject finalize]. I think this is as intended (does anyone know if that is correct?), but for some unknown reason it is not liked by my configuration. The only thing that I can find is that the behaviour of the finalize method in NSObject depends on the status of garbage collection. I am using gcc-4.4 and Xcode 3.1.2.
>
> Ricardo
>
> Process: mpirun [854]
> Path: /opt/openmpi/bin/mpirun
> Identifier: mpirun
> Version: ??? (???)
> Code Type: X86 (Native)
> Parent Process: bash [829]
> Date/Time: 2009-02-25 17:09:53.411 +0100
> OS Version: Mac OS X Server 10.5.6 (9G71)
> Report Version: 6
>
> Exception Type: EXC_BREAKPOINT (SIGTRAP)
> Exception Codes: 0x0002, 0x
> Crashed Thread: 0
>
> Application Specific Information:
> *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x216910> finalize]: called when collecting not enabled'
>
> Thread 0 Crashed:
> 0 com.apple.CoreFoundation 0x917dffb4 ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ + 4
> 1 libobjc.A.dylib 0x91255e3b objc_exception_throw + 40
> 2 com.apple.CoreFoundation 0x917e701d -[NSObject finalize] + 157
> 3 mca_pls_xgrid.so 0x0019bf8b -[PlsXGridClient dealloc] + 59 (opal_object.h:403)
> 4 mca_pls_xgrid.so 0x0019a120 orte_pls_xgrid_finalize + 48 (pls_xgrid_module.m:219)
> 5 libopen-rte.0.dylib 0x0007b093 orte_pls_base_close + 35
> 6 libopen-rte.0.dylib 0x0005cb5e orte_system_finalize + 142
> 7 libopen-rte.0.dylib 0x0005932f orte_finalize + 47
> 8 mpirun 0x2702 orterun + 2202 (orterun.c:496)
> 9 mpirun 0x1b06 main + 24 (main.c:14)
> 10 mpirun 0x1ac2 start + 54
Re: [OMPI users] 3.5 seconds before application launches
Vittorio wrote:

> Hi! I'm using OpenMPI 1.3 on two nodes connected with Infiniband; I'm using Gentoo Linux x86_64. I've noticed that before any application starts there is a variable amount of time (around 3.5 seconds) in which the terminal just hangs with no output, and then the application starts and works well. I imagined that there might have been some initialization routine somewhere in the Infiniband layer or in the software stack, but as I continued my tests I observed that this "latency" time is not present in other MPI implementations (like mvapich2), where my application starts immediately (but performs worse). Is my MPI configuration/installation broken or is this expected behaviour?
>
> thanks a lot!
> Vittorio

Hi,

I'm not really qualified to answer this question, but I know that in contrast to other MPI implementations (MPICH), the modular structure of Open MPI is based on shared libs that are dlopened at startup. As symbol relocation can be costly, this might be a reason why the startup time is higher. Have you checked whether this is an mpiexec start issue or the MPI_Init call?

Regards,
Dorian
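One rough way to separate the two, not from the original exchange and assuming mpi4py is available (which the thread does not mention), is to defer MPI initialization and time it explicitly; whatever delay remains before the script even starts printing is then launch overhead rather than MPI_Init:

import time
import mpi4py
mpi4py.rc.initialize = False   # don't call MPI_Init at import time
mpi4py.rc.finalize = False
from mpi4py import MPI

t0 = time.time()
MPI.Init()                     # time only the MPI initialization step
t1 = time.time()
if MPI.COMM_WORLD.Get_rank() == 0:
    print("MPI_Init took %.2f seconds" % (t1 - t0))
MPI.Finalize()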
Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)
Based on this info from the error report, it appears that the segfault is generated directly in your application's main function. Somehow, you call a function at address 0x, which doesn't make much sense.

george.

On Feb 25, 2009, at 12:25, Ken Mighell wrote:

> [oblix:21522] [ 0] 2 libSystem.B.dylib 0x92a892bb _sigtramp + 43
> [oblix:21522] [ 1] 3 ??? 0x 0x0 + 4294967295
> [oblix:21522] [ 2] 4 exe.out 0x0001281b MAIN__ + 4875
Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)
Dear Jeff and George, The problem was in our code. Thanks for your help interpreting the error message. Best regards, -Ken Mighell
Re: [OMPI users] 3.5 seconds before application launches
Dorian raises a good point. You might want to try some simple tests of launching non-MPI codes (e.g., hostname, uptime, etc.) and see how they fare. Those will more accurately depict OMPI's launching speeds. Getting through MPI_INIT is another matter (although on 2 nodes, the startup should be pretty darn fast).

Two other things that *may* impact you:

1. Is your ssh speed between the machines slow? OMPI uses ssh by default, but will fall back to rsh (or you can force rsh if you want). MVAPICH may use rsh by default...? (I don't actually know)

2. OMPI may be spending time creating shared memory files. You can disable OMPI's use of shared memory by running with:

mpirun --mca btl ^sm ...

Meaning "use anything except the 'sm' (shared memory) transport for MPI messages".

On Feb 25, 2009, at 4:01 PM, doriankrause wrote:

> Vittorio wrote:
>
>> Hi! I'm using OpenMPI 1.3 on two nodes connected with Infiniband; I'm using Gentoo Linux x86_64. I've noticed that before any application starts there is a variable amount of time (around 3.5 seconds) in which the terminal just hangs with no output, and then the application starts and works well. I imagined that there might have been some initialization routine somewhere in the Infiniband layer or in the software stack, but as I continued my tests I observed that this "latency" time is not present in other MPI implementations (like mvapich2), where my application starts immediately (but performs worse). Is my MPI configuration/installation broken or is this expected behaviour?
>>
>> thanks a lot!
>> Vittorio
>
> Hi,
>
> I'm not really qualified to answer this question, but I know that in contrast to other MPI implementations (MPICH), the modular structure of Open MPI is based on shared libs that are dlopened at startup. As symbol relocation can be costly, this might be a reason why the startup time is higher. Have you checked whether this is an mpiexec start issue or the MPI_Init call?
>
> Regards,
> Dorian

--
Jeff Squyres
Cisco Systems