[OMPI users] 1.2b1 make failed on Mac 10.4

2006-11-22 Thread Iannetti, Anthony C. (GRC-RTB0)
Dear OpenMPI List:

 

My attempt at compiling the prerelease of OpenMPI 1.2
failed.  Attached are the logs of the configure and make process.

 

I am running:

Darwin Cortland 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:45:30
PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_PPC Power Macintosh powerpc

 

Thanks,

Tony

 

Anthony C. Iannetti, P.E.

NASA Glenn Research Center

Propulsion Systems Division, Combustion Branch

21000 Brookpark Road, MS 5-10

Cleveland, OH 44135

phone: (216)433-5586

email: anthony.c.ianne...@nasa.gov

 

Please note:  All opinions expressed in this message are my own and NOT
of NASA.  Only the NASA Administrator can speak on behalf of NASA.

 



ompi-output.tar
Description: ompi-output.tar


Re: [OMPI users] MX performance problem on two processor nodes

2006-11-22 Thread Brock Palen

Feel free to correct me if I'm wrong.

OMPI assumes you have a fast network and checks for one.  If none is
found, it falls back to TCP.


So if you leave out the --mca arguments, it should use MX if it is
available.  I'm not sure how MX responds if one of the hosts does not
have a working card (not activated), because the MPI job will still
run; it just won't use MX to that host.  All other hosts will use MX.


If Open MPI sees that a node has more than one CPU (SMP), it will use
the sm (shared memory) method, rather than MX, to communicate on-node,
and if a process sends to itself, the self method is used.  So it's
like a priority order.


I know there is a way (it's in the archives) to see the priorities by
which OMPI chooses what method to use; it picks the highest-priority
method that will allow the communication to complete.
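
For example, ompi_info can dump the BTL parameters, including the
exclusivity/priority values that drive the selection order (a sketch
from memory; exact parameter names can vary between releases):

  # List the parameters of every BTL component, including the
  # "exclusivity" values used to order them:
  ompi_info --param btl all

  # Or narrow the output to a single component, e.g. MX:
  ompi_info --param btl mx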


I know there is also some magic being worked on/implemented that will
stripe large messages over multiple networks when more bandwidth is
needed.  I don't know if OMPI will have this ability or not; someone
else can chime in on that.


Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985


On Nov 21, 2006, at 11:28 PM, Iannetti, Anthony C. (GRC-RTB0) wrote:


Dear OpenMPI List:



From looking at a recent thread, I see an mpirun command with  
shared memory and mx:




mpirun --mca btl mx,sm,self -np 2 pi3f90.x



This works.  I may have forgotten to mention it, but I am using
1.1.2.  I see there is an --mca mtl in version 1.2b1; I do not
think this exists in 1.1.2.


Still, I would like to know which --mca settings are used automatically.



Thanks,

Tony











From: Iannetti, Anthony C. (GRC-RTB0)
Sent: Tuesday, November 21, 2006 8:39 PM
To: 'us...@open-mpi.org'
Subject: MX performance problem on two processor nodes



Dear OpenMPI List:



I am running the Myrinet MX btl with Open MPI on Mac OS X
10.4.  I am running into a problem.  When I run on one processor
per node, OpenMPI runs just fine.  When I run on two processors
per node (slots=2), it seems to take forever (something is hanging).




Here is the command:

mpirun --mca btl mx,self -np 2 pi3f90.x



However, if I give the command:

mpirun -np 2 pi3f90.x



The process runs normally, but I do not know if it is using the
Myrinet network.  Is there a way to diagnose this problem?  mpirun -v
and -d do not seem to indicate which MCA component is actually being used.
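
A way to check (a sketch, assuming the standard MCA verbosity
parameters behave in 1.1.2 as in other releases) is to raise the BTL
framework's verbosity so that component selection is logged at startup:

  # btl_base_verbose makes the BTL framework report which
  # components it opens and selects:
  mpirun --mca btl mx,self --mca btl_base_verbose 30 -np 2 pi3f90.x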




Thanks,

Tony







___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Advice for a cluster software

2006-11-22 Thread Epitropakis Mixalis 00064

On Mon, 20 Nov 2006, Reuti wrote:


Hi,

On 20.11.2006, at 13:12, Epitropakis Mixalis 00064 wrote:


Hello everyone!




Hello,


I think this question is of broader audience on the beowulf.org
mailing list, but anyway: what are you using in the cluster besides
Yes, I think you are right, but we are going to use OpenMPI and we wanted
to hear your opinion as well (the OpenMPI experts' opinion) :)



OpenMPI? Although I'm biased, I would suggest SGE GridEngine, as it
supports more parallel libs than Torque by its qrsh replacement; e.g.
Linda or PVM. Also the integration between the qmaster and scheduler


At this moment we use MPI and PVM, but we would like to test and use other
technologies, projects, and ideas as well. I think that SGE GridEngine is a
very good project and maybe it will be our final choice!



is tighter. In Torque you have two commands: "qstat" and "showq". The
former is the view of the cluster by Torque, the latter that of
the Maui scheduler - and sometimes I observe that they disagree about
what's running in the cluster and what's not (we use SGE, but we have
access to some clusters in other locations which prefer Torque).

Support for SGE will be in Open MPI 1.2, AFAIK.

Question: do you have a central file server in the cluster to serve
the home directories to the nodes, which could also act as the NIS,
NTP, and SGE qmaster server? You mentioned only the nodes.


Yes, at this time we plan to use an additional node as the master node,
with a better HDD, for these jobs.




-- Reuti


Thank you for your help and your time :)

Michael



Thanks very much for your time and I am sure that your opinion will be
of great help to us!

Michael


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Build OpenMPI for SHM only

2006-11-22 Thread Adam Moody
Tim, yes, your suggestion makes sense.  I didn't realize that would be a
safe thing to do.

Brian, I've verified that configuring with
"--enable-mca-no-build=btl-tcp" prevents the tcp btl component from
being built in the first place.
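
For the record, a sketch of both approaches (the install prefix
/opt/openmpi below is only an example):

  # Build-time: exclude the TCP BTL from the build entirely
  ./configure --prefix=/opt/openmpi --enable-mca-no-build=btl-tcp
  make all install

  # Run-time alternative: disable the TCP BTL for all users by default
  # via the system-wide MCA parameter file.  Unlike the build-time
  # option, users can still override this on the mpirun command line:
  echo "btl = ^tcp" >> /opt/openmpi/etc/openmpi-mca-params.conf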

Thanks for the help,
-Adam


Tim Prins wrote:


Hi,

I don't know if there is a way to do it in configure, but after installing you 
can go into the $prefix/lib/openmpi directory and delete mca_btl_tcp.*


This will remove the tcp component and thus users will not be able to use it.
Note that you must NOT delete the mca_oob_tcp.* files, as these are used for
our internal administrative messaging and we currently require them to be
there.
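
Concretely, a sketch (assuming a default installation layout):

  # Remove only the TCP BTL plugin; leave the mca_oob_tcp.* files in
  # place, since the run-time environment still needs them:
  rm $prefix/lib/openmpi/mca_btl_tcp.*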


Thanks,

Tim Prins


On Tuesday 21 November 2006 07:49 pm, Adam Moody wrote:
 


Hello,
We have some clusters which consist of a large pool of 8-way nodes
connected via ethernet.  On these particular machines, we'd like our
users to be able to run 8-way MPI jobs on node, but we *don't* want them
to run MPI jobs across nodes via the ethernet.  Thus, I'd like to
configure and build OpenMPI to provide shared memory support (or TCP
loopback) but disable general TCP support.

I realize that you can run without tcp via something like "mpirun --mca
btl ^tcp", but this is up to the user's discretion.  I need a way to
disable it systematically.  Is there a way to configure it out at build
time or is there some runtime configuration file I can modify to turn it
off?  Also, when we configure "--without-tcp", the configure script
doesn't complain, but TCP support is added anyway.

Thanks,
-Adam Moody
MPI Support @ LLNL


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

 



[OMPI users] openmpi, mx

2006-11-22 Thread Lydia Heck

I have - again - successfully built and installed
MX and Open MPI, and I can run 64- and 128-CPU jobs on a 256-CPU cluster.
The version of Open MPI is 1.2b1.

compiler used: Studio 11

The code is the benchmark b_eff, which usually runs fine - I have used
it extensively for benchmarking.

When I try 192 CPUs I get:
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
(the same line repeats many more times)
...

The Myrinet ports have been opened and the job is running,
as one of the nodes shows:

 ps -eaf | grep dph0elh
 dph0elh  1068     1  0 20:40:00 ??     0:00 /opt/ompi/bin/orted
            --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
 root     1110  1106  0 20:43:46 pts/4  0:00 grep dph0elh
 dph0elh  1070  1068  0 20:40:02 ??     0:00 ../b_eff
 dph0elh  1074  1068  0 20:40:02 ??     0:00 ../b_eff
 dph0elh  1072  1068  0 20:40:02 ??     0:00 ../b_eff

Any ideas?

Lydia


--
Dr E L  Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
___


Re: [OMPI users] openmpi, mx

2006-11-22 Thread Rolf Vandevaart


Hi Lydia:

errno 24 means "Too many open files".  When we have seen this, I believe
we increased the number of file descriptors available to the mpirun process
to get past this.

In my case, my shell (tcsh) defaults to 256.  I increase it with
"limit descriptors", as shown below.  I think other shells have their
own commands for this.

burl-ct-v40z-0 41 => limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize 0 kbytes
vmemoryuse   unlimited
descriptors  256
burl-ct-v40z-0 42 => limit descriptors 64000
burl-ct-v40z-0 43 => limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize 0 kbytes
vmemoryuse   unlimited
descriptors  64000
burl-ct-v40z-0 44 =>
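
For sh/bash users the equivalent is ulimit (a sketch; raising the value
beyond the system's hard limit may require root):

  ulimit -n          # show the current open-file-descriptor limit
  ulimit -n 64000    # raise it for this shell, and thus for mpirun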


Lydia Heck wrote On 11/22/06 15:45:

> I have - again - successfully built and installed
> mx and openmpi and I can run 64 and 128 cpus jobs on a 256 CPU cluster
> version of openmpi is 1.2b1
> [...]
> When I try 192 CPUs I get
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [... the same line repeated many times ...]
>
> Any ideas?
>
> Lydia



--

=
rolf.vandeva...@sun.com
781-442-3043
=



Re: [OMPI users] openmpi, mx

2006-11-22 Thread Ralph Castain
One of our users/friends has also sent us some example code to do this
internally - I hope to find the time to include that capability in the code
base shortly. I'll advise when we do.


On 11/22/06 2:16 PM, "Rolf Vandevaart" wrote:

> Hi Lydia:
> 
> errno 24 means "Too many open files".  When we have seen this, I believe
> we increased the number of file descriptors available to the mpirun process
> to get past this.
> 
> In my case, my shell (tcsh) defaults to 256.  I increase it with
> "limit descriptors", as shown below.  I think other shells have their
> own commands for this.
> 
> [... limit output and Lydia's quoted message trimmed ...]




Re: [OMPI users] 1.2b1 make failed on Mac 10.4

2006-11-22 Thread Iannetti, Anthony C. (GRC-RTB0)
Dear OpenMPI List:

 

OpenMPI 1.2b1 will compile in 32-bit mode (-arch ppc), but it
will not compile in 64-bit mode (-arch ppc64).  So my previous email was
about compiling in 64-bit mode (-arch ppc64) on Mac OS X 10.4.
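
For reference, a 64-bit build attempt on this platform would typically
pass the arch flags through configure like so (a sketch, assuming
Apple's gcc; the Fortran compiler needs its own equivalent 64-bit
option, which varies by vendor):

  ./configure CFLAGS="-arch ppc64" CXXFLAGS="-arch ppc64" \
              LDFLAGS="-arch ppc64"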

 

Thanks,

Tony

 

 

Anthony C. Iannetti, P.E.

NASA Glenn Research Center

Propulsion Systems Division, Combustion Branch

21000 Brookpark Road, MS 5-10

Cleveland, OH 44135

phone: (216)433-5586

email: anthony.c.ianne...@nasa.gov

 

Please note:  All opinions expressed in this message are my own and NOT
of NASA.  Only the NASA Administrator can speak on behalf of NASA.

 



From: Iannetti, Anthony C. (GRC-RTB0) 
Sent: Wednesday, November 22, 2006 12:08 AM
To: 'us...@open-mpi.org'
Subject: 1.2b1 make failed on Mac 10.4

 

Dear OpenMPI List:

 

My attempt at compiling the prerelease of OpenMPI 1.2
failed.  Attached are the logs of the configure and make process.

 

I am running:

Darwin Cortland 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:45:30
PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_PPC Power Macintosh powerpc

 

Thanks,

Tony

 
