[OMPI users] Low performance of Open MPI-1.3 over Gigabit

2009-02-25 Thread Sangamesh B
Dear All,

A Fortran application is installed with Open MPI 1.3 + Intel
compilers on a Rocks 4.3 cluster with dual-socket quad-core Intel Xeon
processors @ 3 GHz (8 cores/node).

The times for different tests over Gigabit-connected
nodes are as follows (each node has 8 GB of memory):

No of Nodes used:6  No of cores used/node:4 total mpi processes:24
   CPU TIME :1 HOURS 19 MINUTES 14.39 SECONDS
   ELAPSED TIME :2 HOURS 41 MINUTES  8.55 SECONDS

No of Nodes used:6  No of cores used/node:8 total mpi processes:48
   CPU TIME :4 HOURS 19 MINUTES 19.29 SECONDS
   ELAPSED TIME :9 HOURS 15 MINUTES 46.39 SECONDS

No of Nodes used:3  No of cores used/node:8 total mpi processes:24
   CPU TIME :2 HOURS 41 MINUTES 27.98 SECONDS
   ELAPSED TIME :4 HOURS 21 MINUTES  0.24 SECONDS

But the same application performs well on another Linux cluster with
LAM-MPI-7.1.3

No of Nodes used:6  No of cores used/node:4 total mpi processes:24
CPU TIME :1hours:30min:37.25s
ELAPSED TIME  1hours:51min:10.00S

No of Nodes used:12  No of cores used/node:4 total mpi processes:48
CPU TIME :0hours:46min:13.98s
ELAPSED TIME  1hours:02min:26.11s

No of Nodes used:6  No of cores used/node:8 total mpi processes:48
CPU TIME : 1hours:13min:09.17s
ELAPSED TIME  1hours:47min:14.04s

So there is a huge difference between CPU TIME & ELAPSED TIME for Open MPI jobs.

Note: On the same cluster, Open MPI gives better performance on
InfiniBand-connected nodes.

What could be the problem with Open MPI over Gigabit?
Do any flags need to be used?
Or is Open MPI simply not a good fit for Gigabit?

Thanks,
Sangamesh


Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread michael.meinel

Thanks for the hints.

> You have some possible workarounds:
>
> - We recommended to the PyMPI author a while ago that he add his own
> dlopen() of libmpi before calling MPI_INIT, but specifically using
> RTLD_GLOBAL, so that the library is opened in the global process space
> (not a private space in the process).  Then libmpi's (and friends)
> symbols will be available to its plugins.  If you're unhappy with the
> non-portability of dlopen, try lt_dlopen_advise() -- it's a portable
> version that is linked inside Open MPI.

This is the solution we went with in our "modified Python" approach. We
do not dlopen libmpi directly but simply link the Python binary against
it, which has the same effect.

> - Another option is to configure/compile Open MPI with "--disable-dlopen"
> or "--enable-static --disable-shared" configure options.  Either of
> these options will cause Open MPI to slurp all of its plugins up into
> libmpi (etc.) and not dynamically open them at run-time, thereby
> avoiding the problem of Python opening libmpi in a private scope.

This sounds good, I gotta try this.

> - Get Python to give you the possibility of opening dependent
> libraries in the global scope.  This may be somewhat controversial;
> there are good reasons to open plugins in private scopes.  But I have
> to imagine that OMPI is not the only python extension out there that
> wants to open plugins of its own; other such projects should be
> running into similar issues.

That would involve patching Python in some nifty places, which would
probably reduce platform independence, so it is not an option yet.

---
Michael Meinel
German Aerospace Center
Center for Computer Applications in Aerospace Science and Engineering



Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread Nysal Jan
On Tue, 2009-02-24 at 13:30 -0500, Jeff Squyres wrote:
> - Get Python to give you the possibility of opening dependent  
> libraries in the global scope.  This may be somewhat controversial;  
> there are good reasons to open plugins in private scopes.  But I have  
> to imagine that OMPI is not the only python extension out there that  
> wants to open plugins of its own; other such projects should be  
> running into similar issues.
> 
Can you check if the following works:
import dl
import sys

# Merge RTLD_GLOBAL into the interpreter's dlopen flags before importing
# the MPI extension, so libmpi's symbols end up in the global scope.
flags = sys.getdlopenflags()
sys.setdlopenflags(flags | dl.RTLD_GLOBAL)
import minimpi


--Nysal



Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread Jeff Squyres

On Feb 25, 2009, at 4:02 AM,  wrote:


>> - Get Python to give you the possibility of opening dependent
>> libraries in the global scope.  This may be somewhat controversial;
>> there are good reasons to open plugins in private scopes.  But I have
>> to imagine that OMPI is not the only python extension out there that
>> wants to open plugins of its own; other such projects should be
>> running into similar issues.
>
> That would involve patching Python in some nifty places which would
> probably lead to less Platform independence, so no option yet.



I should have been more clear: what I meant was to engage the Python  
community to get such a feature to be implemented upstream in Python  
itself.  Since I would find it easy to believe that other Python  
Extension projects may run into similar issues, it may be worth  
raising this issue to the Python community and opening the debate there.


That being said, Nysal also posted an interesting approach.  :-)

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread michael.meinel
>> That would involve patching Python in some nifty places which would
>> probably lead to less Platform independence, so no option yet.

> I should have been more clear: what I meant was to engage the Python  
> community to get such a feature to be implemented upstream in Python
> itself.  Since I would find it easy to believe that other Python  
> Extension projects may run into similar issues, it may be worth  
> raising this issue to the Python community and opening the debate 
> there.
>
> That being said, Nysal also posted an interesting approach.  :-)

I think what Nysal wrote gives every Python user the chance to
configure dynamic loading to their own needs. I'll try that. I think it
obviates the need to open a discussion in the Python community.

--
Michael Meinel
German Aerospace Center
Center for Computer Applications in Aerospace Science and Engineering



Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread Gerry Creager
If you simply want to call it "Problems in 1.3" I might have some things
to add, though!


gerry

Jeff Squyres wrote:

On Feb 23, 2009, at 8:59 PM, Jeff Squyres wrote:

Err... I'm a little confused.  We've been emailing about this exact 
issue for a week or two (off list); you just re-started the 
conversation from the beginning, moved it to the user's list, and 
dropped all the CC's (which include several people who are not on this 
list).  Why did you do that?



GAAH!!  Mea maxima culpa.  :-(

My stupid mail program did something strange (exact details unimportant) 
that made me think you re-sent your message to the users list yesterday 
-- thereby re-starting the whole conversation, etc.  Upon double 
checking, I see that this is *not* what you did at all -- my mail 
program was showing me your original post from Feb 4 and making it look 
like you re-sent it yesterday.  I just wasn't careful in my reading.  
Sorry about that; the fault and confusion was entirely mine.  :-(


(we're continuing the conversation off-list just because it's gnarly and 
full of details about Vampir that most people probably don't care about; 
they're working on a small example to send to me that replicates the 
problem -- will post back here when we have some kind of solution...)


We now return you to your regularly scheduled programming...



--
Gerry Creager -- gerry.crea...@tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


Re: [OMPI users] Problems in 1.3 loading shared libs when using VampirServer

2009-02-25 Thread Jeff Squyres

On Feb 25, 2009, at 8:43 AM, Gerry Creager wrote:

If you simply want to call it "Problems in 1.3" I might have some
things to add, though!


I'm not quite sure how to parse this sentence -- are you saying that  
you have found some problems with Open MPI v1.3?  If so, yes, we'd  
like to know what they are (so that we can fix them!).


--
Jeff Squyres
Cisco Systems



[OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)

2009-02-25 Thread Ken Mighell

Dear Open MPI gurus,

We have F90 code which compiles with MPICH on a dual-core PC laptop
using the Intel compiler.

We are trying to compile the code with Open MPI on a Mac Pro with 2
quad-core Xeons using gfortran.

The code seems to be running ... for the most part.  Unfortunately we
keep getting a segfault which spits out a variant of the following
message:

[oblix:21522] *** Process received signal ***
[oblix:21522] Signal: Segmentation fault (11)
[oblix:21522] Signal code: Address not mapped (1)
[oblix:21522] Failing at address: 0xc710
[oblix:21522] [ 0] 2   libSystem.B.dylib   0x92a892bb _sigtramp + 43
[oblix:21522] [ 1] 3   ???                 0x 0x0 + 4294967295
[oblix:21522] [ 2] 4   exe.out             0x0001281b MAIN__ + 4875
[oblix:21522] [ 3] 5   exe.out             0x00013c38 main + 40
[oblix:21522] [ 4] 6   exe.out             0x1936 start + 54

[oblix:21522] *** End of error message ***

After researching the error message and digging around in the Open MPI
users' mailing list, it appears that the bug may be in Open MPI.

Someone offered what might be a workaround: recompile with the
following flag:

--with-memory-manager=none

Sadly, the segfaults still occur.

What should we try next?

Best regards,

-Ken Mighell

P.S. The ompi_info output is attached.








[OMPI users] openmpi 1.2.9 with Xgrid support more information

2009-02-25 Thread Ricardo Fernández-Perea
Hi,

I have checked the crash log; the result is below.

If I am reading it and following the mpirun code correctly, the release
of the last mca_pls_xgrid_component.client by orte_pls_xgrid_finalize
causes a call to the dealloc method of PlsXGridClient, where a

[connection finalize]

is made that ends up as an [NSObject finalize]. I think this is as
intended (does anyone know whether that is correct?), but for some
unknown reason it is not accepted by my configuration. The only thing I
can find is that the behaviour of the finalize method in NSObject
depends on the status of garbage collection.


I am using gcc-4.4 and Xcode 3.1.2.

Ricardo

Process: mpirun [854]
Path:/opt/openmpi/bin/mpirun
Identifier:  mpirun
Version: ??? (???)
Code Type:   X86 (Native)
Parent Process:  bash [829]

Date/Time:   2009-02-25 17:09:53.411 +0100
OS Version:  Mac OS X Server 10.5.6 (9G71)
Report Version:  6

Exception Type:  EXC_BREAKPOINT (SIGTRAP)
Exception Codes: 0x0002, 0x
Crashed Thread:  0

Application Specific Information:
*** Terminating app due to uncaught exception 'NSInvalidArgumentException',
reason: '*** -[NSKVONotifying_XGConnection<0x216910> finalize]: called when
collecting not enabled'

Thread 0 Crashed:
0   com.apple.CoreFoundation   0x917dffb4 ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ + 4
1   libobjc.A.dylib            0x91255e3b objc_exception_throw + 40
2   com.apple.CoreFoundation   0x917e701d -[NSObject finalize] + 157
3   mca_pls_xgrid.so           0x0019bf8b -[PlsXGridClient dealloc] + 59 (opal_object.h:403)
4   mca_pls_xgrid.so           0x0019a120 orte_pls_xgrid_finalize + 48 (pls_xgrid_module.m:219)
5   libopen-rte.0.dylib        0x0007b093 orte_pls_base_close + 35
6   libopen-rte.0.dylib        0x0005cb5e orte_system_finalize + 142
7   libopen-rte.0.dylib        0x0005932f orte_finalize + 47
8   mpirun                     0x2702 orterun + 2202 (orterun.c:496)
9   mpirun                     0x1b06 main + 24 (main.c:14)
10  mpirun                     0x1ac2 start + 54


Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)

2009-02-25 Thread Jeff Squyres

On Feb 25, 2009, at 12:25 PM, Ken Mighell wrote:

We are trying to compile the code with Open MPI on a Mac Pro with 2  
quad-core Xeons using gfortran.


The code seem to be running ... for the most part.  Unfortunately we  
keep getting a segfault

which spits out a variant of the following message:

[oblix:21522] *** Process received signal ***
[oblix:21522] Signal: Segmentation fault (11)
[oblix:21522] Signal code: Address not mapped (1)
[oblix:21522] Failing at address: 0xc710
[oblix:21522] [ 0] 2   libSystem.B.dylib   0x92a892bb _sigtramp + 43
[oblix:21522] [ 1] 3   ???                 0x 0x0 + 4294967295
[oblix:21522] [ 2] 4   exe.out             0x0001281b MAIN__ + 4875
[oblix:21522] [ 3] 5   exe.out             0x00013c38 main + 40
[oblix:21522] [ 4] 6   exe.out             0x1936 start + 54

[oblix:21522] *** End of error message ***

After some researching of the error message, and digging around in   
the Open MPI user's mailing list,

it appears that the bug may be in Open MPI.


I'm not sure what you mean by this -- getting a stack trace out of  
Open MPI doesn't necessarily mean a bug in Open MPI.


Can you get a corefile and see what exactly failed?  Or run it under a
debugger to see where/how exactly the process fails?  From the stack
trace above, it looks like the failure occurs in application code, not
Open MPI...?


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] openmpi 1.2.9 with Xgrid support more information

2009-02-25 Thread Brian W. Barrett

Ricardo -

That's really interesting.  This is on a Leopard system, right?  I'm the
author/maintainer of the xgrid code.  Unfortunately, I've been hiding 
trying to finish my dissertation the last couple of months.  I can't offer 
much advice without digging into it in more detail than I have time to do 
in the near future.


Brian




Re: [OMPI users] 3.5 seconds before application launches

2009-02-25 Thread doriankrause

Vittorio wrote:

Hi!
I'm using Open MPI 1.3 on two nodes connected with InfiniBand; I'm using
Gentoo Linux x86_64.

I've noticed that before any application starts there is a variable amount
of time (around 3.5 seconds) in which the terminal just hangs with no
output, and then the application starts and works well.

I imagined that there might be some initialization routine somewhere in
the InfiniBand layer or in the software stack, but as I continued my
tests I observed that this "latency" is not present in other MPI
implementations (like MVAPICH2), where my application starts immediately
(but performs worse).

Is my MPI configuration/installation broken, or is this expected behaviour?
  


Hi,

I'm not really qualified to answer this question, but I know that, in
contrast to other MPI implementations (MPICH), the modular structure of
Open MPI is based on shared libs that are dlopened at startup. As symbol
relocation can be costly, this might be a reason why the startup time is
higher.

Have you checked whether this is an mpiexec start issue or the MPI_Init call?

Regards,
Dorian


thanks a lot!
Vittorio

  



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)

2009-02-25 Thread George Bosilca
Based on this info from the error report, it appears that the segfault
is generated directly in your application's main function. Somehow, you
call a function at address 0x, which doesn't make much sense.


  george.

On Feb 25, 2009, at 12:25 , Ken Mighell wrote:

[oblix:21522] [ 0] 2   libSystem.B.dylib   0x92a892bb _sigtramp + 43
[oblix:21522] [ 1] 3   ???                 0x 0x0 + 4294967295
[oblix:21522] [ 2] 4   exe.out             0x0001281b MAIN__ + 4875




Re: [OMPI users] Signal: Segmentation fault (11); Signal code: Address not mapped (1)

2009-02-25 Thread Ken Mighell

Dear Jeff and George,

The problem was in our code.

Thanks for your help interpreting the error message.

Best regards,

-Ken Mighell



Re: [OMPI users] 3.5 seconds before application launches

2009-02-25 Thread Jeff Squyres

Dorian raises a good point.

You might want to try some simple tests of launching non-MPI codes  
(e.g., hostname, uptime, etc.) and see how they fare.  Those will more  
accurately depict OMPI's launching speeds.  Getting through MPI_INIT  
is another matter (although on 2 nodes, the startup should be pretty  
darn fast).


Two other things that *may* impact you:

1. Is your ssh speed between the machines slow?  OMPI uses ssh by  
default, but will fall back to rsh (or you can force rsh if you  
want).  MVAPICH may use rsh by default...?  (I don't actually know)


2. OMPI may be spending time creating shared memory files.  You can  
disable OMPI's use of shared memory by running with:


mpirun --mca btl ^sm ...

Meaning "use anything except the 'sm' (shared memory) transport for  
MPI messages".






--
Jeff Squyres
Cisco Systems