Here is the syntax and output of the command:

root@sun:~# mpirun --hostfile hostfile saturn
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[sun:28777] ERROR: A daemon on node saturn failed to start as expected.
[sun:28777] ERROR: There may be more information available from
[sun:28777] ERROR: the remote shell (see above).
[sun:28777] ERROR: The daemon exited unexpectedly with status 255.
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
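The hostfile itself is never shown in this thread. For Open MPI's rsh launcher, a minimal two-node file for these machines could look like the sketch below; the slot counts are an assumption, not taken from the thread:

    # hostfile: one node per line; slots = processes to place there
    sun slots=2
    saturn slots=2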
I'm using version 1.2.3, got it from open-mpi.org. I'm using the same
version of Open MPI on all nodes.

Thanks
dino

Tim Prins wrote:
> This is very odd. The daemon is being launched properly, but then things
> get strange. It looks like mpirun is sending a message to kill
> application processes on saturn.
>
> What version of Open MPI are you using?
>
> Are you sure that the same version of Open MPI is being used everywhere?
>
> Can you try:
>   mpirun --hostfile hostfile hostname
>
> Thanks,
>
> Tim
>
> Dino Rossegger wrote:
>> Hi again,
>>
>> Tim Prins wrote:
>>> Hi,
>>>
>>> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>>>> Hi again,
>>>>
>>>> Yes, the error output is the same:
>>>> root@sun:~# mpirun --hostfile hostfile main
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 275
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>>>> at line 1164
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
>>>> line 90
>>>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:23748] ERROR: There may be more information available from
>>>> [sun:23748] ERROR: the remote shell (see above).
>>>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>> base/pls_base_orted_cmds.c at line 188
>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>>>> at line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>> --------------------------------------------------------------------------
>>> Can you try:
>>>   mpirun --debug-daemons --hostfile hostfile main
>>>
>> I did, but it doesn't give me any special output, as far as I can see.
>> Here's the output:
>> root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
>> Daemon [0,0,1] checking in as pid 27168 on host sun
>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>> at line 1164
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
>> line 90
>> [sun:27167] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:27167] ERROR: There may be more information available from
>> [sun:27167] ERROR: the remote shell (see above).
>> [sun:27167] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>> [sun:27168] [0,0,1] orted_recv_pls: received exit
>>
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 188
>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>> at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>> --------------------------------------------------------------------------
>>
>>> This may give more output about the error.
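Since the --debug-daemons output shows a daemon checking in on sun but never one on saturn, a generic check worth making at this point (a sketch, not from the original thread) is to run the same kind of non-interactive ssh command that mpirun uses and see whether the remote side can find the daemon at all:

    root@sun:~# ssh saturn which orted
    root@sun:~# ssh saturn 'echo $LD_LIBRARY_PATH'

If `which orted` prints nothing, the PATH that sshd hands to non-interactive commands does not include the Open MPI binaries, and the remote orted launch will fail exactly as shown above.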
>>> Also, try
>>>   mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>> Here's the output, but I can't decipher it ^^
>> root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>> [sun:27175] pls:rsh: local csh: 0, local sh: 1
>> [sun:27175] pls:rsh: assuming same remote shell as local shell
>> [sun:27175] pls:rsh: remote csh: 0, remote sh: 1
>> [sun:27175] pls:rsh: final template argv:
>> [sun:27175] pls:rsh:     /usr/bin/ssh <template> orted --bootproxy 1
>> --name <template> --num_procs 3 --vpid_start 0 --nodename <template>
>> --universe root@sun:default-universe-27175 --nsreplica
>> "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica
>> "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>> [sun:27175] pls:rsh: launching on node sun
>> [sun:27175] pls:rsh: sun is a LOCAL node
>> [sun:27175] pls:rsh: changing to directory /root
>> [sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted
>> --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun
>> --universe root@sun:default-universe-27175 --nsreplica
>> "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica
>> "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid
>> [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash
>> SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root
>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>> SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root
>> PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib
>> PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root
>> SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun
>> OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals
>> OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>> [sun:27175] pls:rsh: launching on node saturn
>> [sun:27175] pls:rsh: saturn is a REMOTE node
>> [sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn
>> orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
>> --nodename saturn --universe root@sun:default-universe-27175
>> --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>> --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>> [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash
>> SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root
>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>> SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root
>> PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib
>> PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root
>> SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun
>> OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals
>> OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>> at line 1164
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:27175] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:27175] ERROR: There may be more information available from
>> [sun:27175] ERROR: the remote shell (see above).
>> [sun:27175] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 188
>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
>> at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>> --------------------------------------------------------------------------
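Two things follow from this transcript. First, per ssh(1), ssh itself exits with status 255 when an error occurs, which is the same status mpirun reports for the failed daemon, so the raw connection is worth ruling out before blaming Open MPI; a sketch:

    root@sun:~# ssh -v saturn true

Second, since pls_rsh_debug prints the exact remote launch line, that `/usr/bin/ssh saturn orted ...` command can be run by hand from sun; any error message from saturn's shell then appears directly instead of surfacing as a timeout on sun.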
>>> This will print out the exact command that is used to launch the orted.
>>>
>>> Also, I would highly recommend not running Open MPI as root. It is
>>> just a bad idea.
>> Yes, I know; I'm only doing it for testing right now.
>>>> I wrote the following to my .ssh/environment (on all machines):
>>>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>>>
>>>> PATH=$PATH:/usr/local/lib;
>>>>
>>>> export LD_LIBRARY_PATH;
>>>> export PATH;
>>>>
>>>> and added the statement you told me about to the sshd_config (on all
>>>> machines):
>>>> PermitUserEnvironment yes
>>>>
>>>> And it seems to me that the paths are correct now.
>>>>
>>>> My shell is bash (/bin/bash).
>>>>
>>>> When running locate orted (to find out where exactly my Open MPI
>>>> installation is; compilation defaults), I saw that there was a
>>>> /usr/bin/orted on sun while there wasn't one on saturn.
>>>> I deleted /usr/bin/orted on sun and tried again with the option
>>>> --prefix /usr/local/ (which seems to be my installation directory),
>>>> but it didn't work (same error).
>>> Is it possible that you are mixing 2 different installations of Open
>>> MPI? You may consider installing Open MPI to an NFS drive to make
>>> these things a bit easier.
>>>> Is there a script or anything like that with which I can uninstall
>>>> Open MPI? I might try a fresh compilation to /opt/openmpi, since it
>>>> doesn't look like I will be able to solve the problem otherwise.
>>> If you still have the tree around that you used to 'make' Open MPI,
>>> you can just go into that tree and type 'make uninstall'.
>>>
>>> Hope this helps,
>>>
>>> Tim
>>>
>>>> jody wrote:
>>>>> Now that the PATHs seem to be set correctly for
>>>>> ssh, I don't know what the problem could be.
>>>>>
>>>>> Is the error message still the same as in the first mail?
>>>>> Did you make the environment/sshd_config changes on both machines?
>>>>> What shell are you using?
>>>>>
>>>>> One other test you could make is to start your application
>>>>> with the --prefix option:
>>>>>
>>>>> $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>>>
>>>>> (assuming your Open MPI installation lies in /opt/openmpi
>>>>> on both machines)
>>>>>
>>>>> Jody
>>>>>
>>>>> On 10/1/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>> Hi Jody,
>>>>>> I did the steps as you said, but it didn't work for me.
>>>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment
>>>>>> and made the changes to sshd_config.
>>>>>>
>>>>>> But all this didn't solve my problem, although the paths seemed to
>>>>>> be set correctly (judging from what `ssh saturn printenv >> test`
>>>>>> says). I also restarted the ssh server; the error is the same.
>>>>>>
>>>>>> Hope you can help me out here, and thanks for your help so far.
>>>>>> dino
>>>>>>
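A format note on the ~/.ssh/environment file quoted above: sshd reads it as plain name=value lines, one per line; it is not a shell script, so `$PATH` is not expanded there, and `export` statements and trailing semicolons are not interpreted. A corrected sketch reusing directories from this thread (the exact values to use depend on the actual install):

    PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
    LD_LIBRARY_PATH=/usr/local/lib:/usr/lib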
>>>>>> jody wrote:
>>>>>>> Dino -
>>>>>>> I had a similar problem.
>>>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>>>> in the file ~/.ssh/environment on the client and setting
>>>>>>>   PermitUserEnvironment yes
>>>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>>>> privileges, though).
>>>>>>>
>>>>>>> To be on the safe side, I did both on all my nodes.
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>> Hi Jody,
>>>>>>>>
>>>>>>>> Thanks for your help. It really is the case that the path to the
>>>>>>>> libs is not set correctly in either PATH or LD_LIBRARY_PATH. I'll
>>>>>>>> try it out; hope it works.
>>>>>>>>
>>>>>>>> jody wrote:
>>>>>>>>> Hi Dino
>>>>>>>>>
>>>>>>>>> Try
>>>>>>>>>   ssh saturn printenv | grep PATH
>>>>>>>>> from your host sun to see what your environment variables are
>>>>>>>>> when ssh is run without a shell.
>>>>>>>>>
>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>>>
>>>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>>>> root@sun:~# mpirun -H sun,saturn main
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>>> base/pls_base_orted_cmds.c at line 275
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>>> pls_rsh_module.c at line 1164
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c
>>>>>>>>>> at line 90
>>>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as
>>>>>>>>>> expected.
>>>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>>> base/pls_base_orted_cmds.c at line 188
>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>>>>> pls_rsh_module.c at line 1196
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main).
>>>>>>>>>>
>>>>>>>>>> My path variables:
>>>>>>>>>> echo $PATH
>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>
>>>>>>>>>> Passwordless ssh is up 'n running.
>>>>>>>>>>
>>>>>>>>>> I walked through the FAQ and mailing lists but couldn't find any
>>>>>>>>>> solution for my problem.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dino R.
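For completeness: the mpihello.cpp / main program being launched is never shown in the thread. A minimal MPI hello of that kind might look like the sketch below; the MPI_Get_processor_name call is an addition that is handy in exactly this situation, because it prints which node each rank actually landed on.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                  // start the MPI runtime
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // rank of this process
        MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of ranks
        char host[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(host, &len);      // node this rank runs on
        std::printf("Hello from rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();                          // shut the runtime down
        return 0;
    }

Built with the Open MPI compiler wrapper, e.g. `mpic++ mpihello.cpp -o main`, so the right headers and libraries are picked up automatically.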