Re: [OMPI users] mpirun hangs when launching job on remote node
Hi Ron, Ron Babich wrote: Thanks for your response. I had noticed your thread, which is why I'm embarrassed (but happy) to say that it looks like my problem was the same as yours. I mentioned in my original email that there was no firewall running, which it turns out was a lie. I think that when I checked before, I must have forgotten "sudo." Instead of "permission denied" or the like, I got this misleading response: Oops! My turn to apologize -- I have to admit that I read through your post very quickly and skipped the sentence about the lack of a firewall. However, everything else sounded exactly like what I was getting. Actually, it would be good to find out what the problem with the firewall is and not simply turn it off. As I'm not the sysadmin, I can't really play with the settings to find out. In your message (which I will now read more carefully :-) ), it says that you are using CentOS, which is based on RH. Our systems are using Fedora. Perhaps it has something to do with RH's defaults for the firewall settings? Another system that worked "immediately" was a Debian system. Anyway, if you find out a solution that doesn't require the firewall to be turned off, please let me know -- I think our sysadmin would be interested, too. Ray
Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
Hi, it shouldn't be necessary to supply a machinefile, as the one generated by SGE is taken automatically (i.e. the granted nodes are honored). You submitted the job requesting a PE? -- Reuti Am 18.03.2009 um 04:51 schrieb Salmon, Rene: Hi, I have looked through the list archives and google but could not find anything related to what I am seeing. I am simply trying to run the basic cpi.c code using SGE and tight integration. If run outside SGE i can run my jobs just fine: hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out Process 0 on hpcp7781 Process 1 on hpcp7782 pi is approximately 3.1416009869231241, Error is 0.0809 wall clock time = 0.032325 If I submit to SGE I get this: [hpcp7781:08527] mca: base: components_open: Looking for plm components [hpcp7781:08527] mca: base: components_open: opening plm components [hpcp7781:08527] mca: base: components_open: found loaded component rsh [hpcp7781:08527] mca: base: components_open: component rsh has no register function [hpcp7781:08527] mca: base: components_open: component rsh open function successful [hpcp7781:08527] mca: base: components_open: found loaded component slurm [hpcp7781:08527] mca: base: components_open: component slurm has no register function [hpcp7781:08527] mca: base: components_open: component slurm open function successful [hpcp7781:08527] mca:base:select: Auto-selecting plm components [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh] [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/ lx24-amd64/qrsh for launching [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10 [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm] [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh] [hpcp7781:08527] mca: base: close: component slurm closed [hpcp7781:08527] mca: base: close: unloading component slurm Starting server daemon at host "hpcp7782" error: executing task of job 1702026 failed: -- A daemon (pid 8528) died unexpectedly with status 1 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. -- -- mpirun noticed that the job aborted, but has no info as to the process that caused that situation. -- mpirun: clean termination accomplished [hpcp7781:08527] mca: base: close: component rsh closed [hpcp7781:08527] mca: base: close: unloading component rsh Seems to me orted is not starting on the remote node. I have LD_LIBRARY_PATH set on my shell startup files. If I do an ldd on orted i see this: hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen- rte.so.0 (0x2ac5b14e2000) libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen- pal.so.0 (0x2ac5b1628000) libdl.so.2 => /lib64/libdl.so.2 (0x2ac5b17a9000) libnsl.so.1 => /lib64/libnsl.so.1 (0x2ac5b18ad000) libutil.so.1 => /lib64/libutil.so.1 (0x2ac5b19c4000) libm.so.6 => /lib64/libm.so.6 (0x2ac5b1ac7000) libpthread.so.0 => /lib64/libpthread.so.0 (0x2ac5b1c1c000) libc.so.6 => /lib64/libc.so.6 (0x2ac5b1d34000) /lib64/ld-linux-x86-64.so.2 (0x2ac5b13c6000) Looks like gridengine is using qrsh to start orted on the remote nodes. 
qrsh might not be reading my shell startup file and setting LD_LIBRARY_PATH. Thanks for any help with this. Rene
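A quick way to narrow this down, sketched under the assumption that plain ssh to the compute node is allowed (qrsh, like ssh, starts a non-interactive shell that may skip the usual login startup files), is to ask a non-interactive remote shell what it actually sees:

  # does a non-interactive remote shell inherit the setting at all?
  ssh hpcp7782 'echo $LD_LIBRARY_PATH'

  # and can that environment resolve orted's libraries?
  ssh hpcp7782 'ldd /bphpc7/vol0/salmr0/ompi/bin/orted | grep "not found"'

If the first command prints nothing, the startup file that sets LD_LIBRARY_PATH is not being read in that code path, which matches the symptom above.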
Re: [OMPI users] open mpi on non standard ssh port
come on, it must be somehow possible to use openmpi not on port 22!? ;-) -- Message: 3 Date: Tue, 17 Mar 2009 09:45:29 +0100 From: Bernhard Knapp Subject: [OMPI users] open mpi on non standard ssh port To: us...@open-mpi.org Hi I want to start a gromacs simulation on a small cluster where non-standard ports are used for ssh. If I just use a "normal" machine-list file (with the IPs of the nodes), the following error comes up: ssh: connect to host 192.168.0.103 port 22: Connection refused I guess that I need to somehow tell it to use the other ports. I tried it in the following way (machine-list file): 192.168.0.101 -p 5101 192.168.0.102 -p 5102 192.168.0.103 -p 5103 192.168.0.104 -p 5104 But it seems this is not the correct way to specify the port: Open RTE detected a parse error in the hostfile: /home/bknapp/scripts/machinefile.txt It occured on line number 1 on token 5: -p How can I tell it to use port 5101 on machine 192.168.0.101? Maybe the question is stupid, but I could not find a solution via google or the search function ... cheers Bernhard
Re: [OMPI users] open mpi on non standard ssh port
Bernhard, on 18.03.2009 at 09:19, Bernhard Knapp wrote: come on, it must be somehow possible to use openmpi not on port 22!? ;-) It's not an issue of Open MPI but of ssh. You need in your home directory a file ~/.ssh/config with two lines: host * port 1234 or whatever port you need. -- Reuti
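A minimal sketch of that ~/.ssh/config, filled in with the addresses and ports from Bernhard's machine file (ssh uses the first matching Host block, so per-host entries like these should come before any wildcard entry):

  Host 192.168.0.101
      Port 5101
  Host 192.168.0.102
      Port 5102
  Host 192.168.0.103
      Port 5103
  Host 192.168.0.104
      Port 5104

With that in place the hostfile can go back to listing just the plain IP addresses.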
Re: [OMPI users] mpirun hangs when launching job on remote node
On Wed, 18 Mar 2009, Raymond Wan wrote: Perhaps it has something to do with RH's defaults for the firewall settings? If your sysadmin uses kickstart to configure the systems, (s)he has to add 'firewall --disabled'; similarly for SELinux, which seems to have caused problems for another person on this list. OTOH, if (s)he blindly copied the config for a workstation to a cluster node, maybe some more education is needed first... Another system that worked "immediately" was a Debian system. That's because Debian doesn't configure a firewall or SELinux, leaving the admin with the responsibility of doing it. Anyway, if you find out a solution that doesn't require the firewall to be turned off, please let me know -- I think our sysadmin would be interested, too. Depending on your definition of 'firewall turned off', the new feature of restricting the ports used by Open MPI will help. The firewall can stay on, but it should be configured to open a range of ports used by Open MPI. -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8240, Fax: +49 6221 54 8850 E-mail: bogdan.coste...@iwr.uni-heidelberg.de
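As a sketch of the "firewall stays on" variant with iptables, assuming 10000-10100/tcp is the range set aside for Open MPI and 192.168.0.0/24 stands in for whatever subnet the cluster nodes share (both values are placeholders):

  # on every node, insert ACCEPT rules for ssh and the reserved MPI range
  # ahead of the default REJECT/DROP rule
  iptables -I INPUT -p tcp -s 192.168.0.0/24 --dport 22 -j ACCEPT
  iptables -I INPUT -p tcp -s 192.168.0.0/24 --dport 10000:10100 -j ACCEPT

This only closes the loop once Open MPI is actually told to stay inside that range; the MCA parameters for pinning the TCP ports are sketched further down, in the non-standard-ssh-port thread.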
Re: [OMPI users] mpirun hangs when launching job on remote node
Hi Bogdan, Thanks for the information, and looking forward to the new Open MPI feature of port restriction... About Debian, I was wondering about that... I've had no problems with it and I was thinking everything was just done for me; of course, another possibility is that there was no firewall to begin with and I didn't know about it. Alas, it's the latter... I'd better look into it, as I was basically oblivious to the lack of a firewall... Ray
[OMPI users] [Fwd: Re: open mpi on non standard ssh port]
Hey again, I tried to build a workaround via port redirection: iptables -t nat -A PREROUTING -i eth1 -p tcp --dport 22 -j REDIRECT --to-port 5101 If I do that then I can start the job: mpirun -np 2 -machinefile /home/bknapp/scripts/machinefile.txt mdrun -np 2 -nice 0 -s 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr -o 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.trr -c 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.pdb -g 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.log -e 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.edr -v bknapp@192.168.0.104's password: NNODES=2, MYRANK=0, HOSTNAME=quoVadis01 NNODES=2, MYRANK=1, HOSTNAME=quoVadis04 but it comes up with "[quoVadis01][[24802,1],0][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] connect() failed: No route to host (113)". The CPUs are calculating on both (physically different) machines, but unfortunately no results are written ... Was the port redirection of 22 not enough, or is there another problem? thx Bernhard -- Dipl.-Ing. (FH) Bernhard Knapp Univ.-Ass.postgrad. Unit for Medical Statistics and Informatics - Section for Biomedical Computersimulation and Bioinformatics Medical University of Vienna - General Hospital Spitalgasse 23 A-1090 WIEN / AUSTRIA Room: BT88 - 88.03.712 Phone: +43(1) 40400-6673
Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
Hi, Thanks for the help. I only use the machine file to run outside of SGE, just to test/prove that things work outside of SGE. When I run within SGE, here is what the job script looks like: hpcp7781(salmr0)128:cat simple-job.sh #!/bin/csh # #$ -S /bin/csh setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out We are using PEs. Here is what the PE looks like: hpcp7781(salmr0)129:qconf -sp pavtest pe_name pavtest slots 16 user_lists NONE xuser_lists NONE start_proc_args /bin/true stop_proc_args /bin/true allocation_rule 8 control_slaves FALSE job_is_first_task FALSE urgency_slots min Here is the qsub line to submit the job: qsub -pe pavtest 16 simple-job.sh The job seems to run fine with no problems within SGE if I contain the job within one node. As soon as the job has to use more than one node, things stop working with the message I posted about LD_LIBRARY_PATH, and orted seems not to start on the remote nodes. Thanks Rene
Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
Hi, on 18.03.2009 at 14:25, Rene Salmon wrote: Thanks for the help. I only use the machine file to run outside of SGE just to test/prove that things work outside of SGE. Aha. Did you compile Open MPI 1.3 with the SGE option? When I run within SGE, here is what the job script looks like: hpcp7781(salmr0)128:cat simple-job.sh #!/bin/csh # #$ -S /bin/csh -S will only work if the queue configuration is set to posix_compliant. If it's set to unix_behavior, the first line of the script is already sufficient. setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's known automatically on the nodes. mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out Do you use --mca... only for debugging, or why is it added here? -- Reuti
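A sketch of the .cshrc change Reuti suggests, using the install path from Rene's job script; the $?LD_LIBRARY_PATH guard is the usual csh test for whether the variable is already set:

  # ~/.cshrc -- read by csh/tcsh for non-interactive shells (rsh/qrsh) as well
  if ( $?LD_LIBRARY_PATH ) then
      setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib:${LD_LIBRARY_PATH}
  else
      setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
  endif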
Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
> aha. Did you compile Open MPI 1.3 with the SGE option? Yes I did. hpcp7781(salmr0)142:ompi_info |grep grid MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3) > > setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib > Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's known automatically on the nodes. Yes. I also have "setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib" in my .cshrc as well. I just wanted to make double sure that it was there. I even tried putting "/bphpc7/vol0/salmr0/ompi/lib" in /etc/ld.so.conf system-wide just to test and see if that would help, but still the same results. > > mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out > Do you use --mca... only for debugging or why is it added here? I only put that there for debugging. Is there a different flag I should use to get more debug info? Thanks Rene
Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
On 03/18/09 09:52, Reuti wrote: On 18.03.2009 at 14:25, Rene Salmon wrote: We are using PEs. Here is what the PE looks like: hpcp7781(salmr0)129:qconf -sp pavtest pe_name pavtest slots 16 user_lists NONE xuser_lists NONE start_proc_args /bin/true stop_proc_args /bin/true allocation_rule 8 control_slaves FALSE job_is_first_task FALSE urgency_slots min At this FAQ, we show an example of a parallel environment setup. http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge I am wondering if control_slaves needs to be TRUE. Also double-check that the PE (pavtest) is on the list for the queue (also mentioned at the FAQ). And perhaps start by trying to run hostname first. Rolf
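A sketch of the PE change the FAQ describes, keeping Rene's existing values where the FAQ does not insist otherwise; control_slaves TRUE is what lets Open MPI start orted on the slave nodes under SGE's control. The all.q queue name below is a placeholder for whatever queue the job actually lands in:

  # "qconf -mp pavtest" opens the PE definition in $EDITOR; the relevant change:
  pe_name            pavtest
  slots              16
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /bin/true
  stop_proc_args     /bin/true
  allocation_rule    8
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min

  # verify the PE is attached to the queue the job runs in
  qconf -sq all.q | grep pe_list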
[OMPI users] Fwd: New MPI-2.1 standard in hardcover - the yellow book
I can't remember if I've forwarded this to the OMPI lists before; pardon if you have seen this before. I have one of these books and I find it quite handy. IMHO: it's quite a steal for US$25 (~600 pages). Begin forwarded message: From: "Rolf Rabenseifner" Date: March 18, 2009 10:21:31 AM EDT Subject: New MPI-2.1 standard in hardcover - the yellow book - Please forward this announcement to any colleagues who may be interested. Our apologies if you receive multiple copies. - Dear MPI user, now, the new official MPI-2.1 standard (June 2008) in hardcover can be shipped to anywhere in the world. As a service (at costs) for users of the Message Passing Interface, HLRS has printed the new Standard, Version 2.1, June 23, 2008 in hardcover. The price is only 17 Euro or 25 US-$. You can find a picture of the book and the text of the standard at http://www.mpi-forum.org/docs/docs.html Selling & shipping of the book is done through https://fs.hlrs.de/projects/par/mpi/mpi21/ It is the complete MPI in one book! This was one of the goals of MPI-2.1. I was responsible for this project in the international MPI Forum. HLRS is not a commercial publisher. Therefore, in the first months, the book was only available at some conferences. Now, the international shipping is organized through a web store: https://fs.hlrs.de/projects/par/mpi/mpi21/ Only a limited number of books is available. Best regards Rolf Rabenseifner - Dr. Rolf Rabenseifner .. . . . . . . . . . email rabenseif...@hlrs.de High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530 University of Stuttgart .. . . . . . . . . fax : ++49(0)711/685-65832 Head of Dpmt Parallel Computing .. .. www.hlrs.de/people/rabenseifner Nobelstr. 19, D-70550 Stuttgart, Germany . . (Office: Allmandring 30) - -- Jeff Squyres Cisco Systems
Re: [OMPI users] openmpi 1.3 and gridengine tight integrationproblem
Changing control_slaves to true did not make things work, but it did provide me with a bit more info on this -- enough to figure things out. Now when I run I get a message about "rcmd: socket: Permission denied": Starting server daemon at host "hpcp7782" Server daemon successfully started with task id "1.hpcp7782" Establishing /hpc/SGE/utilbin/lx24-amd64/rsh session to host hpcp7782 ... rcmd: socket: Permission denied /hpc/SGE/utilbin/lx24-amd64/rsh exited with exit code 1 reading exit code from shepherd ... timeout (60 s) expired while waiting on socket fd 4 error: error reading returncode of remote command -- A daemon (pid 31961) died unexpectedly with status 255 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. -- -- mpirun noticed that the job aborted, but has no info as to the process that caused that situation. -- mpirun: clean termination accomplished [hpcp7781:31960] mca: base: close: component rsh closed [hpcp7781:31960] mca: base: close: unloading component rsh So it turns out the NFS mount for SGE on the clients had the option "nosuid" set, which does not allow the qrsh/rsh SGE binaries to run because they are setuid. Got rid of the "nosuid" and now things work just fine. Thank you for the help Rene
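For reference, a sketch of what the corrected client-side mount might look like; the server name sgeserver and the exact NFS options are placeholders, the only essential point being the absence of nosuid so that the setuid rsh/qrsh helpers under /hpc/SGE/utilbin can run:

  # /etc/fstab entry on the compute nodes (no "nosuid" in the option list)
  sgeserver:/hpc/SGE   /hpc/SGE   nfs   rw,hard,intr   0 0

  # pick up the change without rebooting (or simply umount and mount again)
  mount -o remount,suid /hpc/SGE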
Re: [OMPI users] open mpi on non standard ssh port
FWIW, two other people said the same thing already: http://www.open-mpi.org/community/lists/users/2009/03/8479.php http://www.open-mpi.org/community/lists/users/2009/03/8481.php :-) On Mar 18, 2009, at 4:51 AM, Reuti wrote: it's not an issue of Open MPI but ssh. You need in your home a file ~/.ssh/config with two lines: host * port 1234 or whatever port you need. -- Reuti -- Jeff Squyres Cisco Systems
Re: [OMPI users] [Fwd: Re: open mpi on non standard ssh port]
It means you started the jobs ok (via ssh) but Open MPI wasn't able to open TCP sockets between the two MPI processes. Open MPI needs to be able to communicate via random TCP ports between its MPI processes. On Mar 18, 2009, at 8:39 AM, Bernhard Knapp wrote: but it comes up with "[quoVadis01][[24802,1],0][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] connect() failed: No route to host (113)". Was the port redirection of 22 not enough or is there another problem? -- Jeff Squyres Cisco Systems
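A sketch of one way to make that workable behind a restrictive setup on an Open MPI 1.3-era build: pin the TCP BTL to a fixed port range and open (or redirect) just that range between the nodes. The range 10000-10049 is arbitrary, and the out-of-band (oob) channel has its own, similarly named parameters, so it is worth confirming the exact names on your build first:

  # list the port-related parameters this build actually supports
  ompi_info --param btl tcp | grep port
  ompi_info --param oob tcp | grep port

  # remaining mdrun arguments unchanged from the command line above
  mpirun -np 2 -machinefile /home/bknapp/scripts/machinefile.txt \
         --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 50 \
         mdrun -np 2 -nice 0 -s 1fyt_PKYVKQNTLELAT_bindingRegionsOnly.md.tpr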
[OMPI users] selected pml cm, but peer [[2469, 1], 0] on compute-0-0 selected pml ob1
Hi all, anyone ever seen an error like this? Seems like I have some setting wrong in Open MPI. I thought I had it set up like the other machines but it seems as though I have missed something. I only get the error when adding machine "fs1" to the hostfile list. The other 40+ machines seem fine. [fs1.calvin.edu:01750] [[2469,1],6] selected pml cm, but peer [[2469,1],0] on compute-0-0 selected pml ob1 When I use ompi_info the output looks like my other machines: [root@fs1 openmpi-1.3]# ompi_info | grep btl MCA btl: ofud (MCA v2.0, API v2.0, Component v1.3) MCA btl: openib (MCA v2.0, API v2.0, Component v1.3) MCA btl: self (MCA v2.0, API v2.0, Component v1.3) MCA btl: sm (MCA v2.0, API v2.0, Component v1.3) The whole error is below, any help would be greatly appreciated. Gary [admin@dahl 00.greetings]$ /usr/local/bin/mpirun --mca btl ^tcp --hostfile machines -np 7 greetings [fs1.calvin.edu:01959] [[2212,1],6] selected pml cm, but peer [[2212,1],0] on compute-0-0 selected pml ob1 -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): PML add procs failed --> Returned "Unreachable" (-12) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [fs1.calvin.edu:1959] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! -- At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL. Process 1 ([[2212,1],3]) is on host: dahl.calvin.edu Process 2 ([[2212,1],0]) is on host: compute-0-0 BTLs attempted: openib self sm Your MPI job is now going to abort; sorry. -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [dahl.calvin.edu:16884] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [compute-0-0.local:1591] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) [fs2.calvin.edu:8826] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed! -- mpirun has exited due to process rank 3 with PID 16884 on node dahl.calvin.edu exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). -- [dahl.calvin.edu:16879] 3 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure [dahl.calvin.edu:16879] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [dahl.calvin.edu:16879] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
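A hedged way to confirm the diagnosis: the cm PML is normally chosen only when an MTL component (psm, mx, portals) is available, so comparing the installed components on fs1 against the other machines, and forcing every rank onto the same PML, should both be informative. The second command reuses Gary's own mpirun line; --mca pml ob1 is a standard MCA override:

  # compare what fs1 and the other machines actually have installed
  ompi_info | grep -E ' (pml|mtl|btl):'

  # force the BTL-based PML everywhere as a test
  /usr/local/bin/mpirun --mca pml ob1 --mca btl ^tcp --hostfile machines -np 7 greetings

If fs1 alone lists an mtl component, that difference (rather than the btl list, which looks identical) would explain why it alone picks cm.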