Re: [OMPI users] open-mpi ssh hostname problem
Thanks for the hint. If I set the hostname with the console command hostname, it does not work, but if I change the name through the GUI instead, it works fine (problem solved). Maybe more commands than just hostname are needed to make it work from the console?

Bernhard

On Fri, 6 Feb 2009 17:48:44 -0500, Jeff Squyres wrote:
> I'm not quite sure what you did here; did you set the IP address and hostname to something that is resolvable via gethostbyname()? E.g., does the hostname exist in DNS or in /etc/hosts and match the IP address that you set?
>
> On Feb 6, 2009, at 6:18 AM, Bernhard Knapp wrote:
> > Dear users
> >
> > I am using the parallel software Gromacs on Fedora8 nodes. I installed the software and ran it without problems, but thereafter I moved the node to our server room and did the following:
> >
> > - set IP address, subnet mask and gateway
> > - changed the ssh port in /etc/ssh/sshd_config, since we use port forwarding on our router, and ran /usr/sbin/semanage port -a -t inetd_child_port_t -p tcp 5101
> > - changed the firewall settings to additionally allow the new port
> > - changed the hostname via the hostname command
> >
> > Then I started exactly the same simulation (same command, same data) as before the network configuration, and it comes up with the following error:
> >
> > ssh: quoVadis01: Name or service not known
> > --
> > A daemon (pid 5039) died unexpectedly with status 255 while attempting to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> > --
> > --
> > mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
> > --
> > mpirun: clean termination accomplished
> >
> > Currently the simulation is only running in parallel on the local 4 cores and not using the network at all. Why is it a problem for open-mpi to change the hostname from "localhost" to "quoVadis01"? If I change the hostname back, it works again. How can I make open-mpi run using a hostname different from localhost? Simply reinstalling MPI after changing the hostname does not help.
> >
> > cheers
> > Bernhard
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> Cisco Systems
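
The resolvability check Jeff asks about can be sketched as follows. This is a minimal illustration that assumes Python is available on the node; it is not part of Open MPI, and the helper name resolve_hostname is made up. It asks the system resolver (typically /etc/hosts and DNS) for the address of the machine's own hostname.

```python
import socket


def resolve_hostname(name):
    """Return the IPv4 address the system resolver gives for name, or None on failure."""
    try:
        # Uses the system resolver, which typically consults /etc/hosts and DNS.
        return socket.gethostbyname(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return None


if __name__ == "__main__":
    name = socket.gethostname()
    address = resolve_hostname(name)
    if address is None:
        print(f"{name}: not resolvable; check /etc/hosts or DNS")
    else:
        print(f"{name} resolves to {address}")
```

If the lookup fails after the hostname was changed, that would be consistent with the "Name or service not known" message from ssh quoted above; comparing the printed address with the one configured on the interface covers the second half of Jeff's question.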
Re: [OMPI users] Job hangs when daemon does not report back from remote machine
The default launcher is ssh - the "rsh" things you see are the name of the particular component, not the name of the actual command being used. That launcher looks for "ssh" first, and then falls back to "rsh" if ssh isn't found.

OMPI currently doesn't support restricted port ranges. We are working on a new release that does, but it won't be out for a few weeks. Until then, my only suggestion would be to look at removing the firewall on every node in favor of a firewall on the outside of the cluster. I'm not sure any other solution is available just yet.

Ralph

On Feb 8, 2009, at 2:08 PM, Kersey Black wrote:
> Many thanks. The firewall is the issue.
>
> On Feb 9, 2009, at 5:56 AM, Ralph Castain wrote:
> > It sounds to me like TCP communication isn't getting through for some reason. Try the following:
> >
> > mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
>
> black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
> [ccn3:26932] mca:base:select:( plm) Querying component [rsh]
> [ccn3:26932] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [ccn3:26932] mca:base:select:( plm) Querying component [slurm]
> [ccn3:26932] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [ccn3:26932] mca:base:select:( plm) Selected component [rsh]
> -hangs here
>
> But, when I turn off the firewall for a moment on both machines, local and remote, everything works:
>
> black@ccn3:~/Documents/mp> mpirun --mca plm_base_verbose 5 --hostfile myh3 -pernode hostname
> [ccn3:26442] mca:base:select:( plm) Querying component [rsh]
> [ccn3:26442] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [ccn3:26442] mca:base:select:( plm) Querying component [slurm]
> [ccn3:26442] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [ccn3:26442] mca:base:select:( plm) Selected component [rsh]
> ccn3
> ccn4
>
> 2 Questions:
>
> 1) Is it really trying to use 'rsh', or is that just part of the language in the debugging reporting? I assume it is actually using ssh under the hood, but it is worth asking. I am using the default configuration on this.
>
> black@ccn3:~/Documents/mp> ompi_info --param all all | grep pls
> MCA plm: parameter "plm_rsh_agent" (current value: "ssh : rsh", data source: default value, synonyms: pls_rsh_agent)
>
> 2) Since it is a firewall issue, I read what I could find, and it seems there is no means of restricting port ranges. Right now, each node in this small cluster is running its own firewall rather than all being hidden behind some other machine or switch. Any pointers for handling this most easily?
>
> Cheers, Kersey
>
> > You should see output from the receipt of a daemon callback for each daemon, then the sending of the launch command. My guess is that you won't see all the daemons call back, which is why you hang.
> >
> > This should tell you which node isn't getting a message back to wherever mpirun is executing. You might then check to ensure no firewalls are in the way to that node, there is a TCP path back from it, etc.
> >
> > I can help with additional diagnostics once we get that far.
> > Ralph
> >
> > On Feb 7, 2009, at 2:40 PM, Kersey Black wrote:
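
The "is there a TCP path back" check mentioned above can be sketched like this. It is a hypothetical helper, not an Open MPI tool: it only tests whether one host can open a TCP connection to a given port on another. Since the thread says Open MPI cannot yet be restricted to a port range, the port is something you would have to supply yourself, for example by starting a listener by hand on the node where mpirun runs.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. a firewall silently
        # dropping packets) and name-resolution failures.
        return False


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: tcpcheck.py HOST PORT")
    else:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

Run from the remote node toward the mpirun node, with and without the firewall enabled, this gives a quick yes/no answer for one port; it does not tell you which ports Open MPI itself will choose.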
[OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
I have attached the ./configure and make all output for version 1.2.8 as directed in the Open MPI "Getting Help" section. Hopefully, this will guide us on what is going on with the 1.3 assembler code.

Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Aeropropulsion Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.

Attachment: LinuxItanium-output.tar.gz
Re: [OMPI users] Job hangs when daemon does not report back from remote machine
Thanks much for all the help. I will work to wall things off, but as the means of doing that is not obvious with the way the network is configured, I will also be watchful for new versions which might provide options for this situation.

Cheers, Kersey

On Feb 10, 2009, at 2:54 AM, Ralph Castain wrote:
> The default launcher is ssh - the "rsh" things you see are the name of the particular component, not the name of the actual command being used. That launcher looks for "ssh" first, and then falls back to "rsh" if ssh isn't found.
>
> OMPI currently doesn't support restricted port ranges. We are working on a new release that does, but it won't be out for a few weeks. Until that time, my only suggestion would be to look at removing the firewall on every node in favor of a firewall on the outside of the cluster. I'm not sure any other solution is available just yet.
>
> Ralph
Re: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
Tony,

My compile line with the error was the following. I believe the one you had with the error was similar:

icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. \
  -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o

However, your 1.2.8 output had:

icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. \
  -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o

If I use these options, the error goes away. Here is output from my screen:

ia64b <94> pwd
/scratch/open13/openmpi-1.3/opal/asm

ia64b <95> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o
/scratch/icc777XKf.s(1) : error A2040: Unexpected token: Unary Diez Operator at: Start
/scratch/icc777XKf.s(2) : error A2040: Unexpected token: Unary Diez Operator at: Start
/scratch/icc777XKf.s(3) : error A2040: Unexpected token: Unary Diez Operator at: Start
/scratch/icc777XKf.s(4) : error A2040: Unexpected token: Unary Diez Operator at: Start
.libs/atomic-asm.o - 4 error(s), 0 warning(s)

ia64b <96> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o

ia64b <97> ls -l .libs/asm.o
-rw-r--r-- 1 jjg develop 472 Feb 9 16:27 .libs/asm.o

So ... for some reason the compiler options changed. Can you please

1. cd into the .../opal/asm directory
2. Issue the BAD command I have at my prompt 95 and verify the error.
3. Issue the GOOD command I have at my prompt 96 and verify it works.

Now ... as to why the options are different ... I don't know.

Just trying to help,
Joe

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Iannetti, Anthony C. (GRC-RTB0)
Sent: Monday, February 09, 2009 6:10 AM
To: Open MPI Users
Subject: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8

I have attached the ./configure and make all output for version 1.2.8 as directed in the Open MPI "Getting Help" section. Hopefully, this will guide us on what is going on with the 1.3 assembler code.

Tony

Anthony C. Iannetti, P.E.
NASA Glenn Research Center
Aeropropulsion Division, Combustion Branch
21000 Brookpark Road, MS 5-10
Cleveland, OH 44135
phone: (216)433-5586
email: anthony.c.ianne...@nasa.gov

Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.
Re: [OMPI users] Open MPI 1.3 segfault on amd64 with Rmpi
To bring closure to this thread, we found that the following simple patch to Rmpi/src/Rmpi.c fixes the problem:

--- rmpi-0.5-6.orig/src/Rmpi.c
+++ rmpi-0.5-6/src/Rmpi.c
@@ -63,7 +63,7 @@
 else {
 #ifdef OPENMPI
- dlopen("libmpi.so.0", RTLD_GLOBAL);
+ dlopen("libmpi.so.0", RTLD_GLOBAL | RTLD_LAZY);
 #endif
 #ifndef MPI2

The fix has been applied to Debian's package and should also be forthcoming in future releases of Rmpi.

Big thanks to Jeff Squyres for patient help with the debugging.

Dirk

--
Three out of two people have difficulties with fractions.
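
The message does not spell out why the extra flag matters. One likely reading, offered here as an assumption rather than something stated in the thread: POSIX requires the mode passed to dlopen() to include either RTLD_LAZY or RTLD_NOW, and RTLD_GLOBAL on its own supplies neither. The sketch below uses the same constants as exposed by Python's os module on Unix to show the distinction; has_binding_mode is a made-up helper name.

```python
import os


def has_binding_mode(mode):
    """Return True if mode includes RTLD_LAZY or RTLD_NOW, as POSIX dlopen() requires."""
    return bool(mode & (os.RTLD_LAZY | os.RTLD_NOW))


if __name__ == "__main__":
    before = os.RTLD_GLOBAL                 # mode passed before the patch
    after = os.RTLD_GLOBAL | os.RTLD_LAZY   # mode passed after the patch
    print("before patch has a binding mode:", has_binding_mode(before))
    print("after patch has a binding mode: ", has_binding_mode(after))
```

RTLD_GLOBAL only controls symbol visibility to libraries loaded later; when symbols are bound is a separate choice, which is what the added RTLD_LAZY provides.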
Re: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
Joe -

There are two different files being discussed, which might be the cause of the confusion. And this is really complicated, undocumented code I'm shamefully responsible for, so the confusion is quite understandable :).

There's asm.c, which on all non-Sparc v8 platforms just pre-processes down to the line:

#include "opal/sys/atomic.h"

That header file includes all the inlined versions of the assembly, if the compiler is detected as supporting inline assembly.

There's then atomic-asm.S, which on all platforms is an assembly file (obviously) of all the functions which would be defined by opal/sys/atomic.h, to help deal with weird compilerisms. This file is generated from opal/sys/atomic.h by hand, which is a pain. The file is then preprocessed at configure time to generate a file that should work with the given compiler.

Anyway, that describes the difference between your two commands, the one that works and the one that doesn't. Why there's a failure, I'm not sure, and unfortunately I don't have time to look into it in detail for the next month or so (in that mad, must finish dissertation this month, mode).

Brian

On Mon, 9 Feb 2009, Joe Griffin wrote:
> Tony,
>
> My compile line with the error was the following. I believe the one you had with the error was similar:
>
> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. \
>   -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o
>
> However, your 1.2.8 output had:
>
> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. \
>   -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o
>
> If I use these options, the error goes away. Here is output from my screen:
>
> ia64b <94> pwd
> /scratch/open13/openmpi-1.3/opal/asm
>
> ia64b <95> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o
> /scratch/icc777XKf.s(1) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(2) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(3) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(4) : error A2040: Unexpected token: Unary Diez Operator at: Start
> .libs/atomic-asm.o - 4 error(s), 0 warning(s)
>
> ia64b <96> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o
>
> ia64b <97> ls -l .libs/asm.o
> -rw-r--r-- 1 jjg develop 472 Feb 9 16:27 .libs/asm.o
>
> So ... for some reason the compiler options changed. Can you please
>
> 1. cd into the .../opal/asm directory
> 2. Issue the BAD command I have at my prompt 95 and verify the error.
> 3. Issue the GOOD command I have at my prompt 96 and verify it works.
>
> Now ... as to why the options are different ... I don't know.
>
> Just trying to help,
> Joe
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Iannetti, Anthony C. (GRC-RTB0)
> Sent: Monday, February 09, 2009 6:10 AM
> To: Open MPI Users
> Subject: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
>
> I have attached the ./configure and make all output for version 1.2.8 as directed in the Open MPI "Getting Help" section. Hopefully, this will guide us on what is going on with the 1.3 assembler code.
>
> Tony
>
> Anthony C. Iannetti, P.E.
> NASA Glenn Research Center
> Aeropropulsion Division, Combustion Branch
> 21000 Brookpark Road, MS 5-10
> Cleveland, OH 44135
> phone: (216)433-5586
> email: anthony.c.ianne...@nasa.gov
>
> Please note: All opinions expressed in this message are my own and NOT of NASA. Only the NASA Administrator can speak on behalf of NASA.
[OMPI users] undefined symbol: tm_init
Hey,

I've just installed OpenMPI 1.3 on our cluster, and am getting this issue on jobs > 1 node:

mpiexec: symbol lookup error: /usr/local/openmpi/1.3-pgi/lib/openmpi/mca_plm_tm.so: undefined symbol: tm_init

This has been reported before, and I saw someone say that they solved it with:

--enable-mca-static=plm:tm

A new install using this configure option does work for me, but only for code recompiled with the new mpicc. Existing code doesn't spawn properly. As such, I'd much rather get the existing install working again.

It was suggested that I need the torque libraries on the compute nodes, and they are there. However, adding them to ld.so.conf has not solved this, so I'm not sure what more needs to be done to fix it without recompiling openmpi.

Thanks in advance for any help.

/ Brett

--
Brett Pemberton - VPAC Senior Systems Administrator
http://www.vpac.org/ - (03) 9925 4899
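
One way to narrow this down, sketched here as an assumption rather than a known fix: an undefined tm_init at load time suggests that none of the plugin's dynamic dependencies provides the symbol, so it can help to list what mca_plm_tm.so actually links against. The helper below parses the text printed by ldd; the function name parse_ldd and the "torque" search string are illustrative only.

```python
import subprocess
import sys


def parse_ldd(output):
    """Map each 'name => path' entry in ldd output to its path (None if no path is given)."""
    libs = {}
    for line in output.splitlines():
        if "=>" not in line:
            continue  # entries without a mapping, such as the dynamic loader
        name, _, target = line.partition("=>")
        target = target.strip()
        if target.startswith("not found"):
            libs[name.strip()] = None
        else:
            # Drop the trailing load address, e.g. "(0x00002b...)".
            libs[name.strip()] = target.split("(")[0].strip() or None
    return libs


if __name__ == "__main__" and len(sys.argv) == 2:
    result = subprocess.run(["ldd", sys.argv[1]], capture_output=True, text=True)
    libs = parse_ldd(result.stdout)
    torque = {name: path for name, path in libs.items() if "torque" in name}
    print(torque if torque else "no torque library among the dynamic dependencies")
```

Called with the path to mca_plm_tm.so on a compute node, this shows whether a torque library is listed at all and, if so, whether the loader can find it. If none is listed, changing ld.so.conf would not be expected to help, since the loader has no reason to pull that library in.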
Re: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
Hi Brian,

First off I want to thank you and Jeff for all the work you do.

The issue was actually Tony's. I got involved just because I have a few Itaniums and was willing to try. Sorry I did not notice the asm.c and atomic-asm.S ... argh ... been too long a day.

So ... the real question is why atomic-asm.S is now in the compile.

I am good, as I do not have the problem. Perhaps when we upgrade (probably 2010) I will hunt more.

Joe

From: users-boun...@open-mpi.org on behalf of Brian W. Barrett
Sent: Mon 2/9/2009 5:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8

Joe -

There are two different files being discussed, which might be the cause of the confusion. And this is really complicated, undocumented code I'm shamefully responsible for, so the confusion is quite understandable :).

There's asm.c, which on all non-Sparc v8 platforms just pre-processes down to the line:

#include "opal/sys/atomic.h"

That header file includes all the inlined versions of the assembly, if the compiler is detected as supporting inline assembly.

There's then atomic-asm.S, which on all platforms is an assembly file (obviously) of all the functions which would be defined by opal/sys/atomic.h, to help deal with weird compilerisms. This file is generated from opal/sys/atomic.h by hand, which is a pain. The file is then preprocessed at configure time to generate a file that should work with the given compiler.

Anyway, that describes the difference between your two commands, the one that works and the one that doesn't. Why there's a failure, I'm not sure, and unfortunately I don't have time to look into it in detail for the next month or so (in that mad, must finish dissertation this month, mode).

Brian

On Mon, 9 Feb 2009, Joe Griffin wrote:

> Tony,
>
> My compile line with the error was the following. I believe the one you had with the error was similar:
>
> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. \
>   -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o
>
> However, your 1.2.8 output had:
>
> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. \
>   -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o
>
> If I use these options, the error goes away. Here is output from my screen:
>
> ia64b <94> pwd
> /scratch/open13/openmpi-1.3/opal/asm
>
> ia64b <95> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD -MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -fPIC -DPIC -o .libs/atomic-asm.o
> /scratch/icc777XKf.s(1) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(2) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(3) : error A2040: Unexpected token: Unary Diez Operator at: Start
> /scratch/icc777XKf.s(4) : error A2040: Unexpected token: Unary Diez Operator at: Start
> .libs/atomic-asm.o - 4 error(s), 0 warning(s)
>
> ia64b <96> icc -DHAVE_CONFIG_H -I. -I../../opal/include -I../../orte/include -I../../ompi/include -I../.. -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -fPIC -DPIC -o .libs/asm.o
>
> ia64b <97> ls -l .libs/asm.o
> -rw-r--r-- 1 jjg develop 472 Feb 9 16:27 .libs/asm.o
>
> So ... for some reason the compiler options changed. Can you please
>
> 1. cd into the .../opal/asm directory
> 2. Issue the BAD command I have at my prompt 95 and verify the error.
> 3. Issue the GOOD command I have at my prompt 96 and verify it works.
>
> Now ... as to why the options are different ... I don't know.
>
> Just trying to help,
> Joe
>
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Iannetti, Anthony C. (GRC-RTB0)
> Sent: Monday, February 09, 2009 6:10 AM
> To: Open MPI Users
> Subject: [OMPI users] Linux Itanium Configure and Make Logs for 1.2.8
>
> I have attached the ./configure and make all output for version 1.2.8 as directed in the Open MPI "Getting Help" section. Hopefully, this will guide us on what is going on with the 1.3 assembler code.
>
> Tony
>
> Anthony C. Iannetti, P.E.
> NASA Glenn Research Center
> Aeropropulsion Division, Combustion Branch
>