Hi again,
Tim Prins wrote:
Hi,
On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
Hi again,
Yes the error output is the same:
root@sun:~# mpirun --hostfile hostfile main
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1164
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:23748] ERROR: A daemon on node saturn failed to start as expected.
[sun:23748] ERROR: There may be more information available from
[sun:23748] ERROR: the remote shell (see above).
[sun:23748] ERROR: The daemon exited unexpectedly with status 255.
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
Can you try:
mpirun --debug-daemons --hostfile hostfile main
I did, but it doesn't give me any special output (as far as I can see).
Here's the output:
root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
Daemon [0,0,1] checking in as pid 27168 on host sun
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27167] ERROR: A daemon on node saturn failed to start as expected.
[sun:27167] ERROR: There may be more information available from
[sun:27167] ERROR: the remote shell (see above).
[sun:27167] ERROR: The daemon exited unexpectedly with status 255.
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received exit
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
This may give more output about the error. Also, try
mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
Here's the output, but I can't decipher it ^^
root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
[sun:27175] pls:rsh: local csh: 0, local sh: 1
[sun:27175] pls:rsh: assuming same remote shell as local shell
[sun:27175] pls:rsh: remote csh: 0, remote sh: 1
[sun:27175] pls:rsh: final template argv:
[sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
[sun:27175] pls:rsh: launching on node sun
[sun:27175] pls:rsh: sun is a LOCAL node
[sun:27175] pls:rsh: changing to directory /root
[sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] pls:rsh: launching on node saturn
[sun:27175] pls:rsh: saturn is a REMOTE node
[sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1164
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27175] ERROR: A daemon on node saturn failed to start as expected.
[sun:27175] ERROR: There may be more information available from
[sun:27175] ERROR: the remote shell (see above).
[sun:27175] ERROR: The daemon exited unexpectedly with status 255.
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at
line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
This will print out the exact command that is used to launch the orted.
Also, I would highly recommend not running Open MPI as root. It is just a bad
idea.
Yes, I know. I'm only doing it for now, for testing.
I wrote the following to my .ssh/environment (on all machines):
LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
PATH=$PATH:/usr/local/lib;
export LD_LIBRARY_PATH;
export PATH;
and added the statement you told me to the sshd_config (on all machines):
PermitUserEnvironment yes
And it seems to me that the paths are correct now.
My shell is bash (/bin/bash)
When running locate orted (to find out where exactly my Open MPI
installation is; I compiled with the defaults), I saw that on sun there
was a /usr/bin/orted, while there wasn't one on saturn.
I deleted /usr/bin/orted on sun and tried again with the option --prefix
/usr/local/ (which seems to be my installation directory) but it
didn't work (same error).
Is it possible that you are mixing two different installations of Open MPI? You
may consider installing Open MPI to an NFS drive to make these things a bit
easier.
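One way to check for a second installation is to list every orted that the search path can reach on each machine. A minimal sketch, assuming a POSIX shell; find_all_in_path is a hypothetical helper, not an Open MPI tool:

```shell
#!/bin/sh
# find_all_in_path NAME [LIST]: print every executable file called NAME that
# is found in the colon-separated LIST (default: $PATH), in search order.
# Hypothetical helper for illustration only.
find_all_in_path() {
    name=$1
    list=${2:-$PATH}
    old_ifs=$IFS
    IFS=:                      # split the list on colons
    for dir in $list; do
        if [ -n "$dir" ] && [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS=$old_ifs               # restore normal word splitting
}
```

Run `find_all_in_path orted` on both sun and saturn. More than one line of output on a host, or different locations on the two hosts, would suggest that two installations are being mixed.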
Is there a script or anything like that with which I can uninstall
Open MPI? I might try a fresh build installed to /opt/openmpi, since
it doesn't look like I will be able to solve the problem otherwise.
If you still have the tree around that you used to 'make' Open MPI, you can
just go into that tree and type 'make uninstall'.
Hope this helps,
Tim
jody wrote:
Now that the PATHs seem to be set correctly for
ssh, I don't know what the problem could be.
Is the error message still the same as in the first mail?
Did you make the environment/sshd_config changes on both machines?
What shell are you using?
One other test you could make is to start your application
with the --prefix option:
$ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
(assuming your Open MPI installation lies in /opt/openmpi
on both machines)
Jody
On 10/1/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
Hi Jody,
I followed the steps you described, but it didn't work for me.
I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
made the changes to sshd_config.
But none of this solved my problem, although the paths seemed to be
set correctly (judging by what ssh saturn `printenv >> test` says). I also
restarted the ssh server; the error is the same.
Hope you can help me out here, and thanks for your help so far.
dino
jody wrote:
Dino -
I had a similar problem.
I was only able to solve it by setting PATH and LD_LIBRARY_PATH
in the file ~/.ssh/environment on the client and setting
PermitUserEnvironment yes
in /etc/ssh/sshd_config on the server (for this you need root
privileges though).
To be on the safe side, I did both on all my nodes.
Jody
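Note that sshd reads ~/.ssh/environment literally: the file may contain only blank lines, comment lines starting with '#', and NAME=value lines. Shell syntax such as export, a trailing ';', or a reference like $PATH is not interpreted the way it would be in a shell script. A change to sshd_config also takes effect only after sshd is restarted or reloaded. A minimal sketch of a format check, assuming a POSIX shell; check_env_file is a hypothetical helper, not part of OpenSSH:

```shell
#!/bin/sh
# check_env_file FILE: print every line of FILE that is not a plain
# NAME=value assignment, or whose value looks like shell syntax.
# Returns 0 if nothing suspicious was found, 1 otherwise.
# Hypothetical helper for illustration only.
check_env_file() {
    status=0
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) ;;   # blank line or comment: allowed
            export\ *) echo "not NAME=value: $line"; status=1 ;;
            *'$'*)     echo "contains '\$' (not expanded): $line"; status=1 ;;
            *';')      echo "trailing semicolon: $line"; status=1 ;;
            *=*) ;;       # plain assignment: allowed
            *)         echo "not NAME=value: $line"; status=1 ;;
        esac
    done < "$1"
    return $status
}
```

For example, `check_env_file ~/.ssh/environment` on each node prints the lines that deserve a second look and prints nothing if the file contains only plain assignments.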
On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
Hi Jody,
Thanks for your help. It really is the case that the path to the libs is
set correctly in neither PATH nor LD_LIBRARY_PATH. I'll try it out;
hope it works.
jody wrote:
Hi Dino
Try
ssh saturn printenv | grep PATH
from your host sun to see what your environment variables are when
ssh is run without a shell.
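The same check can be scripted. A minimal sketch, assuming a POSIX shell and passwordless ssh; path_contains and check_remote_path are hypothetical helpers, and the directory to look for (for example /usr/local/bin) depends on where Open MPI was installed:

```shell
#!/bin/sh
# path_contains DIR LIST: succeed if DIR is one of the entries of the
# colon-separated LIST. Hypothetical helper for illustration only.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# check_remote_path HOST DIR: ask HOST for the PATH that a non-interactive
# ssh command sees, and report whether DIR is part of it.
check_remote_path() {
    remote_path=$(ssh "$1" 'printenv PATH') || return 2
    if path_contains "$2" "$remote_path"; then
        echo "$1: $2 is in PATH"
    else
        echo "$1: $2 is MISSING from PATH"
        return 1
    fi
}
```

For example, `check_remote_path saturn /usr/local/bin` run from sun reports whether that directory is visible to commands started over ssh.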
On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
Hi,
I have a problem running a simple program, mpihello.cpp.
Here is an excerpt of the error and the command:
root@sun:~# mpirun -H sun,saturn main
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:25213] ERROR: A daemon on node saturn failed to start as expected.
[sun:25213] ERROR: There may be more information available from
[sun:25213] ERROR: the remote shell (see above).
[sun:25213] ERROR: The daemon exited unexpectedly with status 255.
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
The program is runnable from each node alone (mpirun -np 2 main)
My path variables:
echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
echo $LD_LIBRARY_PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
Passwordless ssh is up 'n running
I walked through the FAQ and Mailing Lists but couldn't find any
solution for my problem.
Thanks
Dino R.
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users