Re: [OMPI users] How to use a wrapper for ssh?
Hi Ralph,

>> 2. use MCA parameters described in
>> http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh
>> to bend the call to my wrapper, e.g.
>> export OMPI_MCA_plm_rsh_agent=WrapPer
>> export OMPI_MCA_orte_rsh_agent=WrapPer
>>
>> The odd thing is that the OMPI_MCA_orte_rsh_agent envvar seems not to have
>> any effect, whereas OMPI_MCA_plm_rsh_agent works. Why do I believe so?

> orte_rsh_agent doesn't exist in the 1.4 series :-)
> Only plm_rsh_agent is available in 1.4. "ompi_info --param orte all" and
> "ompi_info --param plm rsh" will confirm that fact.

If so, then the Wiki is not correct. Maybe someone can correct it? This would
save some time for people like me...

Best wishes
Paul Kapinos

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915
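For illustration, a minimal sketch of how such a wrapper can be wired in on the
1.4 series; the wrapper path and log file are hypothetical, and only
plm_rsh_agent is consulted there:

    #!/bin/sh
    # $HOME/bin/WrapPer -- hypothetical ssh wrapper: log the call, then
    # hand everything over to the real ssh unchanged.
    echo "$(date): ssh $*" >> "$HOME/ssh-wrapper.log"
    exec /usr/bin/ssh "$@"

    # Point Open MPI 1.4 at the wrapper and verify which parameters exist:
    export OMPI_MCA_plm_rsh_agent=$HOME/bin/WrapPer
    ompi_info --param plm rsh
    mpirun -np 4 -H nodeA,nodeB ./a.out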
Re: [OMPI users] How to use a wrapper for ssh?
Yes, I guess it looks like http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh
is a little out of date. Thanks for the heads-up...

On Jul 13, 2011, at 4:35 AM, Paul Kapinos wrote:

> Hi Ralph,
>
>>> 2. use MCA parameters described in
>>> http://www.open-mpi.org/faq/?category=rsh#rsh-not-ssh
>>> to bend the call to my wrapper, e.g.
>>> export OMPI_MCA_plm_rsh_agent=WrapPer
>>> export OMPI_MCA_orte_rsh_agent=WrapPer
>>>
>>> The odd thing is that the OMPI_MCA_orte_rsh_agent envvar seems not to have
>>> any effect, whereas OMPI_MCA_plm_rsh_agent works. Why do I believe so?
>
>> orte_rsh_agent doesn't exist in the 1.4 series :-)
>> Only plm_rsh_agent is available in 1.4. "ompi_info --param orte all" and
>> "ompi_info --param plm rsh" will confirm that fact.
>
> If so, then the Wiki is not correct. Maybe someone can correct it? This would
> save some time for people like me...
>
> Best wishes
> Paul Kapinos

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
On Jul 12, 2011, at 1:37 PM, Steve Kargl wrote:

> (many lines removed)
> checking prefix for function in .type... @
> checking if .size is needed... yes
> checking if .align directive takes logarithmic value... no
> configure: error: No atomic primitives available for amd64-unknown-freebsd9.0

Hmm; this is quite odd. This worked in v1.4, but didn't work in trunk? There are
a bunch of changes to our configure assembly tests between v1.4 and the trunk,
but I don't see any that should affect AMD vs. Intel. Weird.

I wonder if this has to do with the versions of the config.* scripts. What does
config/config.guess report from the trunk tarball, and what does it report from
the v1.4 tarball?

--
Jeff Squyres
jsquy...@cisco.com
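For reference, a quick way to do that comparison; the directory names below are
just examples of the two unpacked tarballs:

    # Compare the platform triplet each tree detects.
    cd openmpi-1.4.3    && ./config/config.guess    # e.g. amd64-unknown-freebsd9.0
    cd ../openmpi-trunk && ./config/config.guess    # does the trunk copy agree?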
Re: [OMPI users] Mpirun only works when n< 3
Got it. Building a new Open MPI solved it. I don't know if the standard Ubuntu
install was the problem or if it just didn't like the slightly later kernel.
Seems to be reason to be suspicious of Ubuntu 10.10 Open MPI builds if you have
anything unusual in your system.

Thanks.

--- On Tue, 12/7/11, Jeff Squyres wrote:

From: Jeff Squyres
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users"
Received: Tuesday, 12 July, 2011, 10:29 PM

On Jul 11, 2011, at 11:31 AM, Randolph Pullen wrote:

> There are no firewalls by default. I can ssh between both nodes without a
> password, so I assumed that all is good with the comms.

FWIW, ssh'ing is different than "comms" (by which I assume you mean opening
arbitrary TCP sockets between two servers).

> I can also get both nodes to participate in the ring program at the same
> time. It's just that I am limited to only 2 processes if they are split
> between the nodes, i.e.:
> mpirun -H A,B ring (works)
> mpirun -H A,A,A,A,A,A,A ring (works)
> mpirun -H B,B,B,B ring (works)
> mpirun -H A,B,A ring (hangs)

It is odd that A,B works and A,B,A does not.

> I have discovered slightly more information:
> When I replace node 'B' from the new cluster with node 'C' from the old
> cluster, I get similar behavior but with an error message:
> mpirun -H A,A,A,A,A,A,A ring (works from either node)
> mpirun -H C,C,C ring (works from either node)
> mpirun -H A,C ring (fails from either node:)
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> [C:23465] *** An error occurred in MPI_Recv
> [C:23465] *** on communicator MPI_COMM_WORLD
> [C:23465] *** MPI_ERRORS_ARE_FATAL (your job will now abort)
> Process 0 sent to 1
> --
> Running this on either node A or C produces the same result.
> Node C runs Open MPI 1.4.1 and is an ordinary dual core on FC10, not an i5
> 2400 like the others.
> All the binaries are compiled on FC10 with gcc 4.3.2.

Are you sure that all the versions of Open MPI being used on all nodes are
exactly the same? I.e., are you finding/using Open MPI v1.4.1 on all nodes? Are
the nodes homogeneous in terms of software?

If they're heterogeneous in terms of hardware, you *might* need to have
separate OMPI installations on each machine (vs., for example, a
network-filesystem-based install shared to all 3) because the compiler's
optimizer may produce code tailored for one of the machines, and it may
therefore fail in unexpected ways on the other(s). The same is true for your
executable.

See this FAQ entry about heterogeneous setups:

http://www.open-mpi.org/faq/?category=building#where-to-install

...hmm. I could have sworn we had more on the FAQ about heterogeneity, but
perhaps not. The old LAM/MPI FAQ on heterogeneity is somewhat outdated, but
most of its concepts are directly relevant to Open MPI as well:

http://www.lam-mpi.org/faq/category11.php3

I should probably copy most of that LAM/MPI heterogeneous FAQ to the Open MPI
FAQ, but it'll be waaay down on my priority list. :-( If anyone could help out
here, I'd be happy to point them in the right direction to convert the LAM/MPI
FAQ PHP to Open MPI FAQ PHP... To be clear: the PHP conversion will be pretty
trivial; I stole heavily from the LAM/MPI FAQ PHP to create the Open MPI FAQ
PHP -- but there are points where the LAM/MPI heterogeneity text needs to be
updated; that'll take an hour or two.

--
Jeff Squyres
jsquy...@cisco.com
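A quick sanity check along these lines, sketched with the placeholder hostnames
A, B and C used in the thread, is to ask each node which Open MPI it actually
resolves and which version that installation reports:

    # Verify that every node picks up the same mpirun and the same version.
    for host in A B C; do
      echo "== $host =="
      ssh "$host" 'which mpirun; ompi_info | grep "Open MPI:"'
    done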
[OMPI users] Running your MPI application on a Computer Cluster in the Cloud - cloudnumbers.com
Dear MPI users and experts,

cloudnumbers.com provides researchers and companies with the resources to
perform high performance calculations in the cloud. As cloudnumbers.com's
community manager I would like to invite you to register and test your MPI
application on a computer cluster in the cloud for free:
http://my.cloudnumbers.com/register

Our aim is to change the way research collaboration is done today by bringing
together scientists and businesses from all over the world on a single
platform. cloudnumbers.com is a Berlin (Germany) based international high-tech
startup striving to enable everyone to benefit from the High Performance
Computing related advantages of the cloud. We provide easy access to
applications running on any kind of computer hardware: from single-core
high-memory machines up to 1000-core computer clusters.

Our platform provides several advantages:

* Turn fixed into variable costs and pay only for the capacity you need. Watch
  our latest cost-savings video:
  http://www.youtube.com/watch?v=ln_BSVigUhg&feature=player_embedded
* Enter the cloud using an intuitive and user-friendly platform. Watch our
  latest "cloudnumbers.com in a nutshell" video:
  http://www.youtube.com/watch?v=0ZNEpR_ElV0&feature=player_embedded
* Be released from ongoing technological obsolescence and continuous
  maintenance costs (e.g. linking to libraries or system dependencies).
* Accelerate your C, C++, Fortran, R, Python, ... calculations through parallel
  processing and great computing capacity - more than 1000 cores are available
  and GPUs are coming soon.
* Share your results worldwide (coming soon).
* Get high-speed access to public databases (please let us know if your
  favorite database is missing!).
* We have developed a security architecture that meets high requirements of
  data security and privacy. Read our security white paper:
  http://d1372nki7bx5yg.cloudfront.net/wp-content/uploads/2011/06/cloudnumberscom-security.whitepaper.pdf

This is only a selection of our top features. To get more information, check
out our web page (http://www.cloudnumbers.com/) or follow our blog about cloud
computing, HPC and HPC applications: http://cloudnumbers.com/blog

Register and test for free now at cloudnumbers.com:
http://my.cloudnumbers.com/register

We look forward to your feedback and consumer insights. Take the chance and
have an impact on the development of a new cloud computing calculation
platform.

Best
Markus

--
Dr. rer. nat. Markus Schmidberger
Senior Community Manager
Cloudnumbers.com GmbH
Chausseestraße 6
10119 Berlin
www.cloudnumbers.com
E-Mail: markus.schmidber...@cloudnumbers.com

Amtsgericht München, HRB 191138
Geschäftsführer: Erik Muttersbach, Markus Fensterer, Moritz v. Petersdorff-Campen
Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
On Jul 12, 2011, at 3:26 PM, Steve Kargl wrote:

> % /usr/local/ompi/bin/mpiexec -machinefile mf --mca btl self,tcp \
>     --mca btl_base_verbose 30 ./z
>
> with mf containing
>
> node11 slots=1 (node11 contains a single bge0=168.192.0.11)
> node16 slots=1 (node16 contains a single bge0=168.192.0.16)
>
> or
>
> node11 slots=2 (communication on memory bus)
>
> However, if mf contains
>
> node10 slots=1 (node10 contains bge0=10.208.xx and bge1=192.168.0.10)
> node16 slots=1 (node16 contains a single bge0=192.168.0.16)
>
> I see the same problem where node10 cannot communicate with node16.

If you ever get the time to check into the code to see why this is happening,
I'd be curious to hear what you find (per my explanation of the TCP BTL here:
http://www.open-mpi.org/community/lists/users/2011/07/16872.php).

> Good news:
>
> Adding 'btl_tcp_if_include=192.168.0.0/16' to my ~/.openmpi/mca-params.conf
> file seems to cure the communication problem.

Good.

> Thanks for the help. If I run into any other problems with trunk,
> I'll report those here.

Keep in mind the usual disclaimers with development trunks -- it's *usually*
stable, but sometimes it does break.

--
Jeff Squyres
jsquy...@cisco.com
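For reference, a sketch of the two equivalent ways to pin the TCP BTL to one
network, using the 192.168.0.0/16 subnet and the machinefile/program names from
this thread:

    # Persistent: per-user MCA parameter file.
    cat >> ~/.openmpi/mca-params.conf <<'EOF'
    btl_tcp_if_include = 192.168.0.0/16
    EOF

    # One-off: pass the same parameter on the command line instead.
    mpiexec -machinefile mf --mca btl self,tcp \
            --mca btl_tcp_if_include 192.168.0.0/16 ./z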
Re: [OMPI users] Mixed Mellanox and Qlogic problems
I finally got access to the systems again (the original ones are part of our
real time system). I thought I would try one other test I had set up first. I
went to OFED 1.6 and it started running with no errors. It must have been an
OFED bug. Now I just have the speed problem. Anyone have a way to make the
mixture of mlx4 and qlogic work together without slowing down?

On 07/07/11 17:19, Jeff Squyres wrote:

Huh; wonky. Can you set the MCA parameter "mpi_abort_delay" to -1 and run your
job again? This will prevent all the processes from dying when MPI_ABORT is
invoked. Then attach a debugger to one of the still-live processes after the
error message is printed. Can you send the stack trace? It would be interesting
to know what is going on here -- I can't think of a reason that would happen
offhand.

On Jun 30, 2011, at 5:03 PM, David Warren wrote:

I have a cluster with mostly Mellanox ConnectX hardware and a few nodes with
QLogic QLE7340s. After looking through the web, FAQs, etc., I built
openmpi-1.5.3 with psm and openib. If I run within the same hardware it is fast
and works fine. If I run between them without specifying an MTL (e.g. mpirun
-np 24 -machinefile dwhosts --byslot --bind-to-core --mca btl ^tcp ...) it dies
with

*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[n16:9438] Abort before MPI_INIT completed successfully; not able to guarantee
that all other processes were killed!
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
...

I can make it run by giving a bad MTL, e.g. -mca mtl psm,none. All the
processes run after complaining that MTL "none" does not exist. However, they
run just as slow (about 10% slower than either set alone).

Pertinent info:
On the QLogic nodes, OFED: QLogic-OFED.SLES11-x86_64.1.5.3.0.22
On the Mellanox nodes: OFED-1.5.2.1-20101105-0600
All: Debian lenny, kernel 2.6.32.41, OpenSM
"limit | grep memorylocked" gives unlimited on all nodes.

Configure line:
./configure --with-libnuma --with-openib --prefix=/usr/local/openmpi-1.5.3
--with-psm=/usr --enable-btl-openib-failover --enable-openib-connectx-xrc
--enable-openib-rdmacm

I thought that with 1.5.3 I am supposed to be able to do this. Am I just wrong?
Does anyone see what I am doing wrong?

Thanks
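A sketch of the debugging procedure Jeff suggests, assuming gdb is available on
the failing node and ./a.out stands in for the real application:

    # Keep aborted processes alive so they can be inspected after the error.
    mpirun --mca mpi_abort_delay -1 -np 24 -machinefile dwhosts \
           --mca btl ^tcp ./a.out

    # On the node that printed the error, attach gdb to one surviving rank
    # and capture a backtrace (the pgrep pattern is just an example).
    gdb -p "$(pgrep -n a.out)" -ex "bt" -ex "detach" -ex "quit"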