Hi again,

Tim Prins wrote:
> Hi,
>
> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>> Hi again,
>>
>> Yes, the error output is the same:
>> root@sun:~# mpirun --hostfile hostfile main
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:23748] ERROR: There may be more information available from
>> [sun:23748] ERROR: the remote shell (see above).
>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>>
>> --------------------------------------------------------------------------
> Can you try:
> mpirun --debug-daemons --hostfile hostfile main
>

I did, but it doesn't give me any special output (as far as I can see).
Here's the output:

root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
Daemon [0,0,1] checking in as pid 27168 on host sun
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27167] ERROR: A daemon on node saturn failed to start as expected.
[sun:27167] ERROR: There may be more information available from
[sun:27167] ERROR: the remote shell (see above).
[sun:27167] ERROR: The daemon exited unexpectedly with status 255.
[sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
[sun:27168] [0,0,1] orted_recv_pls: received exit
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

> This may give more output about the error. Also, try
> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main

Here's the output, but I can't decipher it ^^

root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
[sun:27175] pls:rsh: local csh: 0, local sh: 1
[sun:27175] pls:rsh: assuming same remote shell as local shell
[sun:27175] pls:rsh: remote csh: 0, remote sh: 1
[sun:27175] pls:rsh: final template argv:
[sun:27175] pls:rsh: /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
[sun:27175] pls:rsh: launching on node sun
[sun:27175] pls:rsh: sun is a LOCAL node
[sun:27175] pls:rsh: changing to directory /root
[sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] pls:rsh: launching on node saturn
[sun:27175] pls:rsh: saturn is a REMOTE node
[sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:27175] ERROR: A daemon on node saturn failed to start as expected.
[sun:27175] ERROR: There may be more information available from
[sun:27175] ERROR: the remote shell (see above).
[sun:27175] ERROR: The daemon exited unexpectedly with status 255.
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------

> This will print out the exact command that is used to launch the orted.
>
> Also, I would highly recommend not running Open MPI as root. It is just a bad
> idea.

Yes, I know. I'm only doing it for testing right now.

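The debug output above shows the remote daemon being started as "/usr/bin/ssh saturn orted ...", so orted has to be found through the PATH of a non-interactive ssh session on saturn. ssh itself exits with status 255 when it hits an error of its own, which may be where the reported status 255 comes from, although a remote command can return the same value. Below is a minimal Python sketch of a check along those lines. The function names are made up for illustration and are not part of Open MPI, and the sketch assumes an ssh client on the local side and a POSIX shell on the remote side.

```python
import shlex
import subprocess


def build_probe(host, program="orted"):
    """Build the argv for a non-interactive ssh lookup of `program` on `host`.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is closer to how mpirun's rsh launcher invokes it.
    """
    return ["ssh", "-o", "BatchMode=yes", host,
            "command -v " + shlex.quote(program)]


def interpret(returncode, stdout):
    """Turn the probe's exit status and output into a short diagnosis."""
    if returncode == 255:
        # 255 is what ssh returns for its own errors; the remote command
        # could in principle return it too, so treat this as a hint only.
        return "ssh itself probably failed (connection, authentication or host key)"
    if returncode != 0 or not stdout.strip():
        return "program not found in the PATH of the non-interactive session"
    return "found at " + stdout.strip()


def probe(host, program="orted", run=subprocess.run):
    """Run the lookup on `host`; `run` can be swapped out for testing."""
    result = run(build_probe(host, program), capture_output=True, text=True)
    return interpret(result.returncode, result.stdout)
```

Calling probe("saturn") from sun would then report whether the daemon binary is visible to the kind of session mpirun opens, independent of what an interactive login shell on saturn sees.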
>> I wrote the following to my .ssh/environment (on all machines):
>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>
>> PATH=$PATH:/usr/local/lib;
>>
>> export LD_LIBRARY_PATH;
>> export PATH;
>>
>> and added the statement you told me to the sshd_config (on all machines):
>> PermitUserEnvironment yes
>>
>> And it seems to me that the paths are correct now.
>>
>> My shell is bash (/bin/bash).
>>
>> When running locate orted (to find out where exactly my Open MPI
>> installation is, since I used the compilation defaults), I saw that on
>> sun there was a /usr/bin/orted while there wasn't one on saturn.
>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>> /usr/local/ (which seems to be my installation directory), but it
>> didn't work (same error).
> Is it possible that you are mixing 2 different installations of Open MPI? You
> may consider installing Open MPI to an NFS drive to make these things a bit
> easier.
>> Is there a script or anything like that with which I can uninstall
>> Open MPI? I might try a new compilation to /opt/openmpi, since
>> it doesn't look like I will be able to solve the problem.
> If you still have the tree around that you used to 'make' Open MPI, you can
> just go into that tree and type 'make uninstall'.
>
> Hope this helps,
>
> Tim
>
>> jody wrote:
>>> Now that the PATHs seem to be set correctly for
>>> ssh, I don't know what the problem could be.
>>>
>>> Is the error message still the same as in the first mail?
>>> Did you do the environment/sshd_config changes on both machines?
>>> What shell are you using?
>>>
>>> One other test you could make is to start your application
>>> with the --prefix option:
>>>
>>> $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>
>>> (assuming your Open MPI installation lies in /opt/openmpi
>>> on both machines)
>>>
>>>
>>> Jody
>>>
>>> On 10/1/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>> Hi Jody,
>>>> I did the steps as you said, but it didn't work for me.
>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>> made the changes to sshd_config.
>>>>
>>>> But all this didn't solve my problem, although the paths seemed to be
>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I also
>>>> restarted the ssh server; the error is the same.
>>>>
>>>> Hope you can help me out here, and thanks for your help so far.
>>>> dino
>>>>
>>>> jody wrote:
>>>>> Dino -
>>>>> I had a similar problem.
>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>> in the file ~/.ssh/environment on the client and setting
>>>>> PermitUserEnvironment yes
>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>> privileges, though).
>>>>>
>>>>> To be on the safe side, I did both on all my nodes.
>>>>>
>>>>> Jody
>>>>>
>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>> Hi Jody,
>>>>>>
>>>>>> Thanks for your help. It really is the case that neither in PATH nor in
>>>>>> LD_LIBRARY_PATH is the path to the libs set correctly. I'll try it out,
>>>>>> hope it works.
>>>>>>
>>>>>> jody wrote:
>>>>>>> Hi Dino
>>>>>>>
>>>>>>> Try
>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>
>>>>>>> from your host sun to see what your environment variables are when
>>>>>>> ssh is run without a shell.
>>>>>>>
>>>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>
>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>> root@sun:~# mpirun -H sun,saturn main
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main).
>>>>>>>>
>>>>>>>> My path variables:
>>>>>>>> $PATH
>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>
>>>>>>>> Passwordless ssh is up and running.
>>>>>>>>
>>>>>>>> I walked through the FAQ and the mailing lists but couldn't find any
>>>>>>>> solution for my problem.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dino R.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
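
On the ~/.ssh/environment contents quoted earlier in this thread: as far as the sshd documentation describes it, that file may only contain empty lines, comment lines starting with '#', and assignments of the form name=value. It is not run by a shell, so "export" lines are not assignments, "$PATH" is not expanded, and a trailing ';' ends up inside the value. Below is a minimal Python sketch that flags such lines. The helper name and the exact checks are illustrative assumptions and do not reproduce sshd's own parser.

```python
import re

# A plain assignment starts with a variable name followed directly by '='.
ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def check_environment_lines(lines):
    """Return (line_number, text, problem) for each suspicious line.

    Empty lines and '#' comments are accepted; everything else is expected
    to be a literal NAME=value assignment without shell syntax.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not ASSIGNMENT.match(line):
            problems.append((number, line, "not a NAME=value assignment"))
            continue
        value = line.split("=", 1)[1]
        if "$" in value:
            problems.append((number, line, "'$' is not expanded, the value is taken literally"))
        elif value.endswith(";"):
            problems.append((number, line, "trailing ';' becomes part of the value"))
    return problems
```

Run over the lines from that message, a check like this would flag the two "export" lines, the "$PATH" reference and the trailing semicolons, which suggests writing the file as bare assignments with fully spelled-out paths.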