[OMPI users] 1.2b1 make failed on Mac 10.4
Dear OpenMPI List:

My attempt at compiling the prerelease of OpenMPI 1.2 failed. Attached are the logs of the configure and make process. I am running:

Darwin Cortland 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:45:30 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_PPC Power Macintosh powerpc

Thanks,
Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Propulsion Systems Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.

Attachment: ompi-output.tar
Re: [OMPI users] MX performance problem on two processor nodes
Feel free to correct me if I'm wrong. OMPI assumes you have a fast network and checks for one; if none is found, it falls back to TCP. So if you leave out the --mca option entirely, it should use MX if it is available. I'm not sure how MX responds if one of the hosts does not have a working (activated) card; the MPI job will still run, it just won't use MX to that host, while all the other hosts will use MX. If Open MPI sees that a node has more than one CPU (SMP), it will use the sm (shared memory) method for communication within that node in preference to MX, and if a process sends to itself, the self method is used. So it is like a priority order. I know there is a way (it's in the archives) to see the priority by which OMPI chooses which method to use; it picks the highest-priority method that will allow the communication to complete. I also know there is some magic being worked on/implemented that will stripe large messages over multiple networks when more bandwidth is needed. I don't know whether OMPI will have this ability or not; someone else can chime in on that.

Brock Palen
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Nov 21, 2006, at 11:28 PM, Iannetti, Anthony C. (GRC-RTB0) wrote:

Dear OpenMPI List:

From looking at a recent thread, I see an mpirun command with shared memory and mx:

mpirun --mca btl mx,sm,self -np 2 pi3f90.x

This works. I may have forgotten to mention it, but I am using 1.1.2. I see there is an --mca mtl in version 1.2b1; I do not think this exists in 1.1.2. Still, I would like to know which --mca settings are chosen automatically.

Thanks,
Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Propulsion Systems Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.

From: Iannetti, Anthony C. (GRC-RTB0)
Sent: Tuesday, November 21, 2006 8:39 PM
To: 'us...@open-mpi.org'
Subject: MX performance problem on two processor nodes

Dear OpenMPI List:

I am running the Myrinet MX btl with OpenMPI on MacOSX 10.4. I am running into a problem. When I run on one processor per node, OpenMPI runs just fine. When I run on two processors per node (slots=2), it seems to take forever (something is hanging). Here is the command:

mpirun --mca btl mx,self -np 2 pi3f90.x

However, if I give the command:

mpirun -np 2 pi3f90.x

the process runs normally. But I do not know whether it is using the Myrinet network. Is there a way to diagnose this problem? mpirun -v and -d do not seem to indicate which MCA component is actually being used.

Thanks,
Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Propulsion Systems Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.
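To see which BTLs were built and how they rank, and to watch the selection happen at run time, something along these lines should work (the verbosity level is arbitrary, and exact parameter names can differ slightly between 1.1.x and 1.2, so treat this as a sketch):

    # list every BTL component that was built and its MCA parameters,
    # including the values that control selection order
    ompi_info --param btl all

    # turn up BTL selection debugging so the run reports which BTL it picked
    mpirun --mca btl mx,sm,self --mca btl_base_verbose 30 -np 2 pi3f90.x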
Re: [OMPI users] Advice for a cluster software
On Mon, 20 Nov 2006, Reuti wrote:

> Hi,
>
> On 20 Nov 2006 at 13:12, Epitropakis Mixalis 00064 wrote:
>
> > Hello everyone!
>
> Hello,
>
> I think this question would find a broader audience on the beowulf.org mailing list, but anyway: what are you using in the cluster besides OpenMPI?

Yes, I think you are right, but we are going to use OpenMPI and we wanted to hear your opinion (the OpenMPI experts' opinion) :)

> Although I'm biased, I would suggest SGE GridEngine, as it supports more parallel libs than Torque through its qrsh replacement, e.g. Linda or PVM.

At this moment we use MPI and PVM, but we would like to test and use other technologies, projects and ideas as well. I think that SGE GridEngine is a very good project and maybe that will be our final choice!

> Also the integration between the qmaster and scheduler is tighter. In Torque you have two commands, "qstat" and "showq": the former is Torque's view of the cluster, the latter the Maui scheduler's - and sometimes I observe that they disagree about what is and is not running in the cluster (we use SGE, but we have access to some clusters in other locations which prefer Torque). Support for SGE will be in OpenMPI 1.2, AFAIK.
>
> Question: do you have a central file server in the cluster, to serve the home directories to the nodes, which could also act as a NIS, NTP and SGE qmaster server? You mentioned only the nodes.

Yes, at this time we plan to use an additional node, with a better HDD, as the master node for these jobs.

> -- Reuti
>
> > Thank you for your help and your time :)
> > Michael

Thanks very much for your time, and I am sure that your opinion will be of great help to us!

Michael
Re: [OMPI users] Build OpenMPI for SHM only
Tim, yes, your suggestion makes sense. I didn't realize that would be a safe thing to do.

Brian, I've verified that configuring with "--enable-mca-no-build=btl-tcp" prevents the tcp btl component from being built in the first place.

Thanks for the help,
-Adam

Tim Prins wrote:

Hi,

I don't know if there is a way to do it in configure, but after installing you can go into the $prefix/lib/openmpi directory and delete mca_btl_tcp.* This will remove the tcp component, and thus users will not be able to use it. Note that you must NOT delete the mca_oob_tcp.* files, as these are used for our internal administrative messaging and we currently require them to be there.

Thanks,

Tim Prins

On Tuesday 21 November 2006 07:49 pm, Adam Moody wrote:

Hello,

We have some clusters which consist of a large pool of 8-way nodes connected via ethernet. On these particular machines, we'd like our users to be able to run 8-way MPI jobs within a node, but we *don't* want them to run MPI jobs across nodes via the ethernet. Thus, I'd like to configure and build OpenMPI to provide shared memory support (or TCP loopback) but disable general TCP support. I realize that you can run without tcp via something like "mpirun --mca btl ^tcp", but this is up to the user's discretion; I need a way to disable it systematically. Is there a way to configure it out at build time, or is there some runtime configuration file I can modify to turn it off? Also, when we configure "--without-tcp", the configure script doesn't complain, but TCP support is added anyway.

Thanks,
-Adam Moody
MPI Support @ LLNL
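For reference, the two approaches discussed above look roughly like this (the install prefix is only an example, not Adam's actual path):

    # build time: never build the tcp BTL component at all
    ./configure --prefix=/opt/openmpi --enable-mca-no-build=btl-tcp
    make all install

    # after the fact: remove the installed tcp BTL component
    # (do NOT remove mca_oob_tcp.* -- that one is required for
    # Open MPI's internal administrative messaging)
    rm /opt/openmpi/lib/openmpi/mca_btl_tcp.*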
[OMPI users] openmpi, mx
I have - again - successfully built and installed mx and openmpi, and I can run 64 and 128 cpu jobs on a 256 CPU cluster.

The version of openmpi is 1.2b1; compiler used: studio11.

The code is a benchmark, b_eff, which usually runs fine - I have used it extensively for benchmarking.

When I try 192 CPUs I get

[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
[... the same message repeated many times ...]

The myrinet ports have been opened and the job is running, as one of the nodes shows:

ps -eaf | grep dph0elh
dph0elh  1068     1  0 20:40:00 ??     0:00 /opt/ompi/bin/orted --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
root     1110  1106  0 20:43:46 pts/4  0:00 grep dph0elh
dph0elh  1070  1068  0 20:40:02 ??     0:00 ../b_eff
dph0elh  1074  1068  0 20:40:02 ??     0:00 ../b_eff
dph0elh  1072  1068  0 20:40:02 ??     0:00 ../b_eff

Any idea?

Lydia

--
Dr E L Heck

University of Durham
Institute for Computational Cosmology
Ogden Centre
Department of Physics
South Road

DURHAM, DH1 3LE
United Kingdom

e-mail: lydia.h...@durham.ac.uk

Tel.: + 44 191 - 334 3628
Fax.: + 44 191 - 334 3645
Re: [OMPI users] openmpi, mx
Hi Lydia:

errno 24 means "Too many open files". When we have seen this, I believe we increased the number of file descriptors available to the mpirun process to get past it.

In my case, my shell (tcsh) defaults to 256. I increase it with a call to "limit descriptors" as shown below. I think other shells may have other commands.

burl-ct-v40z-0 41 =>limit
cputime       unlimited
filesize      unlimited
datasize      unlimited
stacksize     10240 kbytes
coredumpsize  0 kbytes
vmemoryuse    unlimited
descriptors   256
burl-ct-v40z-0 42 =>limit descriptors 64000
burl-ct-v40z-0 43 =>limit
cputime       unlimited
filesize      unlimited
datasize      unlimited
stacksize     10240 kbytes
coredumpsize  0 kbytes
vmemoryuse    unlimited
descriptors   64000
burl-ct-v40z-0 44 =>

Lydia Heck wrote on 11/22/06 15:45:

> I have - again - successfully built and installed mx and openmpi, and I can
> run 64 and 128 cpu jobs on a 256 CPU cluster. The version of openmpi is
> 1.2b1; compiler used: studio11.
>
> The code is a benchmark, b_eff, which usually runs fine - I have used it
> extensively for benchmarking.
>
> When I try 192 CPUs I get
>
> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
> [... the same message repeated many times ...]
>
> The myrinet ports have been opened and the job is running, as one of the
> nodes shows:
>
> ps -eaf | grep dph0elh
> dph0elh  1068     1  0 20:40:00 ??     0:00 /opt/ompi/bin/orted --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
> root     1110  1106  0 20:43:46 pts/4  0:00 grep dph0elh
> dph0elh  1070  1068  0 20:40:02 ??     0:00 ../b_eff
> dph0elh  1074  1068  0 20:40:02 ??     0:00 ../b_eff
> dph0elh  1072  1068  0 20:40:02 ??     0:00 ../b_eff
>
> any idea ?
>
> Lydia

--
= rolf.vandeva...@sun.com 781-442-3043 =
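For sh/bash users, the rough equivalent of the tcsh "limit descriptors" call is "ulimit -n"; whether it can be raised past the hard limit without root privileges depends on the system, so treat this as a generic sketch rather than the exact commands used above:

    ulimit -n          # show the current per-process file descriptor limit
    ulimit -n 64000    # raise the limit in this shell before launching mpirun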
Re: [OMPI users] openmpi, mx
One of our users/friends has also sent us some example code to do this internally - I hope to find the time to include that capability in the code base shortly. I'll advise when we do.

On 11/22/06 2:16 PM, "Rolf Vandevaart" wrote:

> Hi Lydia:
>
> errno 24 means "Too many open files". When we have seen this, I believe
> we increased the number of file descriptors available to the mpirun
> process to get past this.
>
> In my case, my shell (tcsh) defaults to 256. I increase it with a call
> to "limit descriptors" as shown below. I think other shells may have
> other commands.
>
> burl-ct-v40z-0 41 =>limit
> cputime       unlimited
> filesize      unlimited
> datasize      unlimited
> stacksize     10240 kbytes
> coredumpsize  0 kbytes
> vmemoryuse    unlimited
> descriptors   256
> burl-ct-v40z-0 42 =>limit descriptors 64000
> burl-ct-v40z-0 43 =>limit
> cputime       unlimited
> filesize      unlimited
> datasize      unlimited
> stacksize     10240 kbytes
> coredumpsize  0 kbytes
> vmemoryuse    unlimited
> descriptors   64000
> burl-ct-v40z-0 44 =>
>
> Lydia Heck wrote on 11/22/06 15:45:
>
>> I have - again - successfully built and installed mx and openmpi, and I
>> can run 64 and 128 cpu jobs on a 256 CPU cluster. The version of openmpi
>> is 1.2b1; compiler used: studio11.
>>
>> The code is a benchmark, b_eff, which usually runs fine - I have used it
>> extensively for benchmarking.
>>
>> When I try 192 CPUs I get
>>
>> [m2001:16147] mca_oob_tcp_accept: accept() failed with errno 24.
>> [... the same message repeated many times ...]
>>
>> The myrinet ports have been opened and the job is running, as one of the
>> nodes shows:
>>
>> ps -eaf | grep dph0elh
>> dph0elh  1068     1  0 20:40:00 ??     0:00 /opt/ompi/bin/orted --bootproxy 1 --name 0.0.64 --num_procs 65 --vpid_start 0 -
>> root     1110  1106  0 20:43:46 pts/4  0:00 grep dph0elh
>> dph0elh  1070  1068  0 20:40:02 ??     0:00 ../b_eff
>> dph0elh  1074  1068  0 20:40:02 ??     0:00 ../b_eff
>> dph0elh  1072  1068  0 20:40:02 ??     0:00 ../b_eff
>>
>> any idea ?
>>
>> Lydia
>>
>> --
>> Dr E L Heck
>> University of Durham
>> Institute for Computational Cosmology
>> Ogden Centre, Department of Physics, South Road
>> DURHAM, DH1 3LE, United Kingdom
>> e-mail: lydia.h...@durham.ac.uk
>> Tel.: + 44 191 - 334 3628
>> Fax.: + 44 191 - 334 3645
Re: [OMPI users] 1.2b1 make failed on Mac 10.4
Dear OpenMPI List:

OpenMPI 1.2b1 will compile in 32 bit (-arch ppc), but it will not compile in 64 bit (-arch ppc64). So my previous email was about compiling in 64 bit (-arch ppc64) on Mac OSX 10.4.

Thanks,
Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Propulsion Systems Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.

From: Iannetti, Anthony C. (GRC-RTB0)
Sent: Wednesday, November 22, 2006 12:08 AM
To: 'us...@open-mpi.org'
Subject: 1.2b1 make failed on Mac 10.4

Dear OpenMPI List:

My attempt at compiling the prerelease of OpenMPI 1.2 failed. Attached are the logs of the configure and make process. I am running:

Darwin Cortland 8.8.1 Darwin Kernel Version 8.8.1: Mon Sep 25 19:45:30 PDT 2006; root:xnu-792.13.8.obj~1/RELEASE_PPC Power Macintosh powerpc

Thanks,
Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Propulsion Systems Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.
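The actual configure and make logs are in the attached archive rather than quoted here; for context, a 64-bit build attempt on 10.4/PowerPC would typically be driven by something like the following. These flags are an illustration, not taken from the attached logs, and the Fortran bindings are disabled here only to sidestep questions of Fortran compiler -arch support:

    ./configure --prefix=/opt/openmpi-1.2b1 \
        CFLAGS="-arch ppc64" CXXFLAGS="-arch ppc64" \
        OBJCFLAGS="-arch ppc64" LDFLAGS="-arch ppc64" \
        --disable-mpi-f77 --disable-mpi-f90
    make all 2>&1 | tee make.log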