[OMPI users] can't run parallel job on cluster

2007-06-12 Thread lichanjua...@lzu.cn
Hi all,
I am a first-time user of Open MPI; I have used MPICH before. I found
that there are many differences between them, so I am confused.
I built Open MPI on a PS3 using the default options, that is:
  $ ./configure --prefix=
  $ make all install
I modified my .bash_profile to add the Open MPI library and executable
directories to LD_LIBRARY_PATH and PATH.
I use an NFS file system shared between the server and the nodes, so I
installed Open MPI only on the server.
I checked the mailing list and the FAQ and learned that the default
launcher is ssh, but I still added "pls_rsh_agent = ssh" to
openmpi-mca-params.conf.

I tested the hello_c.c example. When I run:
$ mpiexec -host ps3-2 -n 4 ./hello
it runs correctly (ps3-2 is the hostname of the server). I tried this on
each node.
But when I run:
$ mpiexec -hostfile host.txt -n 4 ./hello

Contents of host.txt:
ps3-1
ps3-2

I get this error message:

bash: orted: command not found
[ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[ps3-1:25154] ERROR: A daemon on node ps3-2 failed to start as expected.
[ps3-1:25154] ERROR: There may be more information available from
[ps3-1:25154] ERROR: the remote shell (see above).
[ps3-1:25154] ERROR: The daemon exited unexpectedly with status 127.
[ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[ps3-1:25154] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196

--
mpiexec was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.


--
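The two key lines in the output above are "bash: orted: command not found" and "exited unexpectedly with status 127". In POSIX shells, 127 is the exit status reported when a command name cannot be found, so both lines point at the same cause: the remote shell could not locate orted in its PATH. A minimal sketch that reproduces the status locally, assuming no command named no_such_command_for_demo exists on the machine (the name is made up for illustration):

```shell
# Ask a fresh shell to run a command name that is assumed not to exist,
# then capture the exit status the shell reports for it.
sh -c 'no_such_command_for_demo' 2>/dev/null
status=$?
echo "exit status: $status"
```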
I searched the mailing list and the FAQ for the same problem. They say
it happens when PATH and LD_LIBRARY_PATH are not set correctly, but I
have checked that both include the Open MPI directories.
This is my first time using Open MPI, so I hope somebody can help me.
Thanks a lot!
-- 
Li, Chanjuan                                    Lanzhou University
Distributed & Embedded System Lab               http://dslab.lzu.edu.cn
School of Information Science and Engineering   lichanjua...@lzu.cn
Tianshui South Road 222. Lanzhou 73  .P.R.China
Tel: +86-931-8912025   Fax: +86-931-8912022




[OMPI users] can't run parallel job on cluster

2007-06-12 Thread lichanjua...@lzu.cn
Sorry, I forgot some information. I am using Open MPI 1.2. I also tried
launching onto a remote host, for example running this command on ps3-1:
$ mpiexec -host ps3-2 -n 2 ./a.out
The same error message appears. I think something is wrong with
rsh/ssh, but I don't know what to modify or which file I have missed.
If anyone has met the same problem, please tell me the solution. I would
be grateful. Thanks very much!

Li chanjuan


Re: [OMPI users] can't run parallel job on cluster

2007-06-14 Thread lichanjua...@lzu.cn
Thanks for your help; I solved my problem. I made a stupid mistake: I
had only modified .bash_profile before. After I modified .bashrc
instead, it works.
Anyway, thanks for your reply.
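
For readers with the same symptom: bash reads ~/.bash_profile only for login shells, while the non-interactive shell that ssh starts to launch orted typically reads ~/.bashrc instead (this depends on how bash was built on the system). Below is a minimal sketch of lines that could go near the top of ~/.bashrc, before any "interactive only" guard. The install prefix /opt/openmpi is an assumption for illustration; the thread does not state the real one.

```shell
# Hypothetical install prefix; replace with the value passed to ./configure --prefix=
OMPI_PREFIX=/opt/openmpi

# Prepend the Open MPI directories only if they are not already present,
# so that nested shells do not keep growing the variables.
case ":$PATH:" in
  *":$OMPI_PREFIX/bin:"*) ;;
  *) PATH="$OMPI_PREFIX/bin:$PATH" ;;
esac
case ":${LD_LIBRARY_PATH:-}:" in
  *":$OMPI_PREFIX/lib:"*) ;;
  *) LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
export PATH LD_LIBRARY_PATH
```

Because the home directory is shared over NFS in this setup, one edit to ~/.bashrc should take effect on every node.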
On Thu, 2007-06-14 at 07:10 -0400, Jeff Squyres wrote:
> You have two options:
> 
> 1. Ensure that your PATH and LD_LIBRARY_PATH are exactly what you  
> think they are on the remote nodes.  A common problem that some  
> people run into is that they set up their PATH/LD_LIBRARY_PATH in the  
> "interactive" portions of their .bashrc, meaning that they are only  
> set for interactive logins (and therefore not set for non-interactive  
> logins).  Try the following:
> 
>   ssh othernode 'echo $PATH'
> 
> Note the single quotes; they are necessary to ensure that "echo  
> $PATH" is evaluated on the *remote* node.  Do the same with  
> $LD_LIBRARY_PATH and ensure that they are really set to the values  
> that you think they are.  Check out the following FAQ entry:
> 
>  http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
> 
> 2. Use the --prefix functionality in mpirun to automatically set the  
> PATH / LD_LIBRARY_PATH values for the remote node.  Check out this  
> FAQ entry:
> 
>  http://www.open-mpi.org/faq/?category=running#mpirun-prefix
> 
> Note that an equivalent of the --prefix functionality that is not [yet]  
> mentioned in that FAQ entry is to invoke mpirun by its absolute  
> pathname.  For example:
> 
>  /path/to/mpirun ...
> 
> Or you can use OMPI 1.2's --enable-mpirun-prefix-by-default option to  
> OMPI's configure, which will tell mpirun to always assume that it  
> needs to use --prefix-like behavior (without you needing to specify  
> it on the mpirun command line).
> 
> Hope that helps.
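
The check described in option 1 can be repeated for every host in a hostfile. The following is a minimal sketch, not part of Open MPI: check_orted and the REMOTE variable are names invented here, and the function assumes password-less ssh to each host. REMOTE exists only so the remote-shell command can be swapped out (for example, for rsh).

```shell
# check_orted HOSTFILE
# For each host listed in HOSTFILE, ask a non-interactive remote shell
# whether it can find orted in its PATH, and print one line per host.
check_orted() {
  hostfile=$1
  # Read the first word of each line; also handle a final line without a newline.
  while read -r host _ || [ -n "$host" ]; do
    case $host in ''|'#'*) continue ;; esac   # skip blank lines and comments
    # Single quotes: the command is evaluated on the remote node.
    # stdin comes from /dev/null so the remote shell cannot consume the hostfile.
    if ${REMOTE:-ssh} "$host" 'command -v orted' </dev/null >/dev/null 2>&1; then
      echo "$host: orted found"
    else
      echo "$host: orted NOT found"
    fi
  done < "$hostfile"
}
```

Running "check_orted host.txt" from the node where mpiexec is started would be one way to use it; any host reported as "NOT found" is a candidate for the PATH problem discussed in this thread.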