[OMPI users] can't run parallel job on cluster
Hi all,

I am a first-time user of Open MPI; I have used MPICH before. I found that there are many differences between them, so I am confused.

I built Open MPI on a PS3 using the default options, that is:

  $ ./configure --prefix=
  $ make all install

I modified my .bash_profile to add the Open MPI lib and executable directories to LD_LIBRARY_PATH and PATH. I use an NFS file system between the server and the nodes, and I installed Open MPI only on the server.

I checked the mailing list and the FAQ and learned that the default launcher is ssh, but I still added "pls_rsh_agent = ssh" to openmpi-mca-params.conf.

I tested the hello_c.c example. When I run:

  $ mpiexec -host ps3-2 -n 4 ./hello

it runs correctly (ps3-2 is the hostname of the server). I tried it on each node. But when I run:

  $ mpiexec -hostfile host.txt -n 4 ./hello

with host.txt containing:

  ps3-1
  ps3-2

I get this error message:

  bash: orted: command not found
  [ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
  [ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
  [ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
  [ps3-1:25154] ERROR: A daemon on node ps3-2 failed to start as expected.
  [ps3-1:25154] ERROR: There may be more information available from
  [ps3-1:25154] ERROR: the remote shell (see above).
  [ps3-1:25154] ERROR: The daemon exited unexpectedly with status 127.
  [ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
  [ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
  --
  mpiexec was unable to cleanly terminate the daemons for this job.
  Returned value Timeout instead of ORTE_SUCCESS.
  --

I searched the mailing list and the FAQ for the same problem. They say that PATH and LD_LIBRARY_PATH are not set correctly, but I made sure both are set in my path.

This is my first time using Open MPI, so I hope somebody can help me. Thanks a lot!
--
Li, Chanjuan
Lanzhou University
Distributed & Embedded System Lab
http://dslab.lzu.edu.cn
School of Information Science and Engineering
lichanjua...@lzu.cn
Tianshui South Road 222. Lanzhou 73 .P.R.China
Tel: +86-931-8912025  Fax: +86-931-8912022
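A note on the "status 127" in the log above: in POSIX shells, 127 is the exit status reported when a command cannot be found, which matches the "bash: orted: command not found" line printed by the remote shell. The following is only a minimal local sketch of that behavior; the command name is a made-up placeholder that is assumed not to exist on your system.

```shell
# 'orted_demo_missing_cmd' is a hypothetical name, assumed not to be installed.
# Run it in a child shell and capture that shell's exit status.
sh -c 'orted_demo_missing_cmd' 2>/dev/null
status=$?
echo "exit status: $status"
```

Seeing 127 from a remote daemon launch therefore points at the remote PATH rather than at ssh itself.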
[OMPI users] can't run parallel job on cluster
On Wed, 2007-06-13 at 11:47 +0800, lichanjua...@lzu.cn wrote:
> [...]

Sorry, I forgot some information. I use Open MPI 1.2. I also tried running the command against a remote host; for example, on ps3-1 I ran:

  $ mpiexec -host ps3-2 -n 2 ./a.out

and the same error message appears. I think there is something wrong with rsh/ssh, but I don't know what to modify or which file I missed. If someone has met the same problem, please tell me the solution. I will be grateful. Thanks very much!

Li Chanjuan
Re: [OMPI users] can't run parallel job on cluster
Thanks for your help; I solved my problem. I made a silly mistake: I had only modified .bash_profile. After I modified .bashrc, it works. How stupid of me. Anyway, thanks for your reply.

On Thu, 2007-06-14 at 07:10 -0400, Jeff Squyres wrote:
> You have two options:
>
> 1. Ensure that your PATH and LD_LIBRARY_PATH are exactly what you
>    think they are on the remote nodes. A common problem that some
>    people run into is that they set up their PATH/LD_LIBRARY_PATH in
>    the "interactive" portions of their .bashrc, meaning that they are
>    only set for interactive logins (and therefore not set for
>    non-interactive logins). Try the following:
>
>        ssh othernode 'echo $PATH'
>
>    Note the single quotes; they are necessary to ensure that "echo
>    $PATH" is evaluated on the *remote* node. Do the same with
>    $LD_LIBRARY_PATH and ensure that they are really set to the values
>    that you think they are. Check out the following FAQ entry:
>
>    http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
>
> 2. Use the --prefix functionality in mpirun to automatically set the
>    PATH / LD_LIBRARY_PATH values for the remote node. Check out this
>    FAQ entry:
>
>    http://www.open-mpi.org/faq/?category=running#mpirun-prefix
>
>    Note that a synonym to the --prefix functionality that is not [yet]
>    mentioned in that FAQ entry is that you can use the absolute
>    pathname to mpirun. For example:
>
>        /path/to/mpirun ...
>
>    Or you can use OMPI 1.2's --enable-mpirun-prefix-by-default option
>    to OMPI's configure, which will tell mpirun to always assume that
>    it needs to use --prefix-like behavior (without you needing to
>    specify it on the mpirun command line).
>
> Hope that helps.
>
> On Jun 12, 2007, at 11:58 PM, lichanjua...@lzu.cn wrote:
>
> > On Wed, 2007-06-13 at 11:47 +0800, lichanjua...@lzu.cn wrote:
> >> [...]
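For anyone who finds this thread with the same "orted: command not found" error, the remote PATH check described in this thread can be sketched as a small script. This is only an illustration under stated assumptions: the install directory /opt/openmpi/bin is a hypothetical placeholder, and the helper names path_contains and check_remote_path are made up for this sketch. It relies on the point made in this thread that a non-interactive ssh command may see a different PATH than an interactive login does; on many systems bash reads ~/.bashrc rather than ~/.bash_profile in that situation.

```shell
#!/bin/sh
# Sketch only: helper names and the example directory are hypothetical.

path_contains() {
    # $1 = directory to look for, $2 = colon-separated PATH value.
    # Wrapping both sides in colons matches whole entries only.
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

check_remote_path() {
    # $1 = remote host, $2 = directory expected in the remote PATH.
    # The single quotes make $PATH expand on the remote node, inside the
    # non-interactive shell that ssh starts there.
    remote_path=$(ssh "$1" 'echo $PATH') || return 2
    if path_contains "$2" "$remote_path"; then
        echo "$1: $2 is in the non-interactive PATH"
    else
        echo "$1: $2 is MISSING from the non-interactive PATH"
        return 1
    fi
}

# Example call (not run automatically; host and directory are placeholders):
#   check_remote_path ps3-2 /opt/openmpi/bin
```

If the directory turns out to be missing, either move the PATH and LD_LIBRARY_PATH settings to a file that the non-interactive shell reads, or pass the install prefix on the command line, for example `mpirun --prefix /opt/openmpi -hostfile host.txt -n 4 ./hello` (again with a placeholder prefix).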