I'll try to reinstall Open MPI on an NFS device; maybe it will work then. Thanks for your help.
dino
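For reference, a reinstall onto an NFS share usually comes down to the steps below. This is a minimal sketch, assuming the openmpi-1.2.3 source tree is still around and that /nfs/openmpi is a hypothetical mount point visible on every node:

    # on every node that has an old install and still has its build tree
    cd openmpi-1.2.3
    make uninstall

    # rebuild once, into a prefix that lives on the NFS share
    ./configure --prefix=/nfs/openmpi
    make all install

    # point each node's environment at the shared install
    export PATH=/nfs/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/nfs/openmpi/lib:$LD_LIBRARY_PATH

With a single shared prefix, every node runs the same orted and links the same libraries, which rules out the version-mismatch possibility raised below.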
Tim Prins wrote:
> Unfortunately, I am out of ideas on this one. It is very strange. Maybe
> someone else has an idea.
>
> I would recommend trying to install Open MPI again. First be sure to get
> rid of all of the old installs of Open MPI from all your nodes, then
> reinstall and try again.
>
> Tim
>
> Dino Rossegger wrote:
>> Here is the syntax and output of the command:
>> root@sun:~# mpirun --hostfile hostfile saturn
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>> [sun:28777] ERROR: A daemon on node saturn failed to start as expected.
>> [sun:28777] ERROR: There may be more information available from
>> [sun:28777] ERROR: the remote shell (see above).
>> [sun:28777] ERROR: The daemon exited unexpectedly with status 255.
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>> [sun:28777] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons for this job.
>> Returned value Timeout instead of ORTE_SUCCESS.
>> --------------------------------------------------------------------------
>>
>> I'm using version 1.2.3; I got it from openmpi.org. I'm using the same
>> version of Open MPI on all nodes.
>>
>> Thanks
>> dino
>>
>> Tim Prins wrote:
>>> This is very odd. The daemon is being launched properly, but then things
>>> get strange. It looks like mpirun is sending a message to kill
>>> application processes on saturn.
>>>
>>> What version of Open MPI are you using?
>>>
>>> Are you sure that the same version of Open MPI is being used everywhere?
>>>
>>> Can you try:
>>> mpirun --hostfile hostfile hostname
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>> Dino Rossegger wrote:
>>>> Hi again,
>>>>
>>>> Tim Prins wrote:
>>>>> Hi,
>>>>>
>>>>> On Monday 01 October 2007 03:56:16 pm Dino Rossegger wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> Yes, the error output is the same:
>>>>>> root@sun:~# mpirun --hostfile hostfile main
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>> [sun:23748] ERROR: A daemon on node saturn failed to start as expected.
>>>>>> [sun:23748] ERROR: There may be more information available from
>>>>>> [sun:23748] ERROR: the remote shell (see above).
>>>>>> [sun:23748] ERROR: The daemon exited unexpectedly with status 255.
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>> [sun:23748] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Can you try:
>>>>> mpirun --debug-daemons --hostfile hostfile main
>>>>>
>>>> I did, but it doesn't give me any special output (as far as I can see).
>>>> Here's the output:
>>>> root@sun:~# mpirun --debug-daemons --hostfile hostfile ./main
>>>> Daemon [0,0,1] checking in as pid 27168 on host sun
>>>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>> [sun:27168] [0,0,1] orted_recv_pls: received kill_local_procs
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>> [sun:27167] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:27167] ERROR: There may be more information available from
>>>> [sun:27167] ERROR: the remote shell (see above).
>>>> [sun:27167] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:27168] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>>> [sun:27168] [0,0,1] orted_recv_pls: received exit
>>>>
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>> [sun:27167] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>> --------------------------------------------------------------------------
>>>>
>>>>> This may give more output about the error. Also, try
>>>>> mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>>>> Here's the output, but I can't decipher it ^^
>>>> root@sun:~# mpirun -mca pls rsh -mca pls_rsh_debug 1 --hostfile hostfile main
>>>> [sun:27175] pls:rsh: local csh: 0, local sh: 1
>>>> [sun:27175] pls:rsh: assuming same remote shell as local shell
>>>> [sun:27175] pls:rsh: remote csh: 0, remote sh: 1
>>>> [sun:27175] pls:rsh: final template argv:
>>>> [sun:27175] pls:rsh:     /usr/bin/ssh <template> orted --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733"
>>>> [sun:27175] pls:rsh: launching on node sun
>>>> [sun:27175] pls:rsh: sun is a LOCAL node
>>>> [sun:27175] pls:rsh: changing to directory /root
>>>> [sun:27175] pls:rsh: executing: (/usr/local/bin/orted) orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename sun --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --set-sid [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>>>> [sun:27175] pls:rsh: launching on node saturn
>>>> [sun:27175] pls:rsh: saturn is a REMOTE node
>>>> [sun:27175] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh saturn orted --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename saturn --universe root@sun:default-universe-27175 --nsreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" --gprreplica "0.0.0;tcp://192.168.1.254:4733;tcp://172.16.0.202:4733" [SSH_AGENT_PID=24793 TERM=xterm SHELL=/bin/bash SSH_CLIENT=10.2.56.124 21001 22 SSH_TTY=/dev/pts/0 USER=root LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib SSH_AUTH_SOCK=/tmp/ssh-sxbbH24792/agent.24792 MAIL=/var/mail/root PATH=/usr/local/bin:/usr/bin:/bin:/usr/games:/opt/c3-4/:/usr/local/lib PWD=/root LANG=en_US.UTF-8 SHLVL=1 HOME=/root LOGNAME=root SSH_CONNECTION=10.2.56.124 21001 172.16.0.202 22 _=/usr/local/bin/mpirun OMPI_MCA_rds_hostfile_path=hostfile orte-job-globals OMPI_MCA_pls_rsh_debug=1 OMPI_MCA_seed=0]
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>> [sun:27175] ERROR: A daemon on node saturn failed to start as expected.
>>>> [sun:27175] ERROR: There may be more information available from
>>>> [sun:27175] ERROR: the remote shell (see above).
>>>> [sun:27175] ERROR: The daemon exited unexpectedly with status 255.
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>> [sun:27175] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>> --------------------------------------------------------------------------
>>>>
>>>>> This will print out the exact command that is used to launch the orted.
>>>>>
>>>>> Also, I would highly recommend not running Open MPI as root. It is just
>>>>> a bad idea.
>>>> Yes, I know; I'm only doing it for testing right now.
>>>>>> I wrote the following to my .ssh/environment (on all machines):
>>>>>> LD_LIBRARY_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib;
>>>>>>
>>>>>> PATH=$PATH:/usr/local/lib;
>>>>>>
>>>>>> export LD_LIBRARY_PATH;
>>>>>> export PATH;
>>>>>>
>>>>>> and added the statement you told me about to sshd_config (on all machines):
>>>>>> PermitUserEnvironment yes
>>>>>>
>>>>>> And it seems to me that the paths are correct now.
>>>>>>
>>>>>> My shell is bash (/bin/bash).
>>>>>>
>>>>>> When running locate orted (to find out where exactly my Open MPI
>>>>>> installation is (compilation defaults)), I saw that there was a
>>>>>> /usr/bin/orted on sun while there wasn't one on saturn.
>>>>>> I deleted /usr/bin/orted on sun and tried again with the option --prefix
>>>>>> /usr/local/ (which seems to be my installation directory), but it
>>>>>> didn't work (same error).
>>>>> Is it possible that you are mixing 2 different installations of Open MPI?
>>>>> You may consider installing Open MPI to an NFS drive to make these things
>>>>> a bit easier.
>>>>>> Is there a script or anything like that with which I can uninstall
>>>>>> Open MPI? I might try a new compilation to /opt/openmpi, since it
>>>>>> doesn't look like I will be able to solve the problem otherwise.
>>>>> If you still have the tree around that you used to 'make' Open MPI, you
>>>>> can just go into that tree and type 'make uninstall'.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Tim
>>>>>
>>>>>> jody wrote:
>>>>>>> Now that the PATHs seem to be set correctly for
>>>>>>> ssh, I don't know what the problem could be.
>>>>>>>
>>>>>>> Is the error message still the same as in the first mail?
>>>>>>> Did you make the environment/sshd_config changes on both machines?
>>>>>>> What shell are you using?
>>>>>>>
>>>>>>> One other test you could make is to start your application
>>>>>>> with the --prefix option:
>>>>>>>
>>>>>>> $ mpirun -np 2 --prefix /opt/openmpi -H sun,saturn ./main
>>>>>>>
>>>>>>> (assuming your Open MPI installation lies in /opt/openmpi
>>>>>>> on both machines)
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On 10/1/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>> Hi Jody,
>>>>>>>> I did the steps as you said, but it didn't work for me.
>>>>>>>> I set LD_LIBRARY_PATH in /etc/environment and ~/.ssh/environment and
>>>>>>>> made the changes to sshd_config.
>>>>>>>>
>>>>>>>> But all this didn't solve my problem, although the paths seem to be
>>>>>>>> set correctly (judging by what ssh saturn `printenv >> test` says). I
>>>>>>>> also restarted the ssh server; the error is the same.
>>>>>>>>
>>>>>>>> Hope you can help me out here, and thanks for your help so far.
>>>>>>>> dino
>>>>>>>>
>>>>>>>> jody wrote:
>>>>>>>>> Dino -
>>>>>>>>> I had a similar problem.
>>>>>>>>> I was only able to solve it by setting PATH and LD_LIBRARY_PATH
>>>>>>>>> in the file ~/.ssh/environment on the client and setting
>>>>>>>>> PermitUserEnvironment yes
>>>>>>>>> in /etc/ssh/sshd_config on the server (for this you need root
>>>>>>>>> privileges, though).
>>>>>>>>>
>>>>>>>>> To be on the safe side, I did both on all my nodes.
>>>>>>>>>
>>>>>>>>> Jody
>>>>>>>>>
>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>>>> Hi Jody,
>>>>>>>>>>
>>>>>>>>>> Thanks for your help. It really is the case that the path to the
>>>>>>>>>> libs is set correctly in neither PATH nor LD_LIBRARY_PATH. I'll try
>>>>>>>>>> it out; hope it works.
>>>>>>>>>>
>>>>>>>>>> jody wrote:
>>>>>>>>>>> Hi Dino
>>>>>>>>>>>
>>>>>>>>>>> Try
>>>>>>>>>>> ssh saturn printenv | grep PATH
>>>>>>>>>>>
>>>>>>>>>>> from your host sun to see what your environment variables are when
>>>>>>>>>>> ssh is run without a shell.
>>>>>>>>>>>
>>>>>>>>>>> On 9/27/07, Dino Rossegger <dino.rosseg...@gmx.at> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have a problem running a simple program, mpihello.cpp.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is an excerpt of the error and the command:
>>>>>>>>>>>> root@sun:~# mpirun -H sun,saturn main
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>>>>>>>>>>> [sun:25213] ERROR: A daemon on node saturn failed to start as expected.
>>>>>>>>>>>> [sun:25213] ERROR: There may be more information available from
>>>>>>>>>>>> [sun:25213] ERROR: the remote shell (see above).
>>>>>>>>>>>> [sun:25213] ERROR: The daemon exited unexpectedly with status 255.
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>>>>>>>>>>> [sun:25213] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons for this job.
>>>>>>>>>>>> Returned value Timeout instead of ORTE_SUCCESS.
>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The program is runnable from each node alone (mpirun -np 2 main).
>>>>>>>>>>>>
>>>>>>>>>>>> My path variables:
>>>>>>>>>>>> echo $PATH
>>>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>>> echo $LD_LIBRARY_PATH
>>>>>>>>>>>> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/c3-4/:/usr/lib:/usr/local/lib
>>>>>>>>>>>>
>>>>>>>>>>>> Passwordless ssh is up and running.
>>>>>>>>>>>>
>>>>>>>>>>>> I walked through the FAQ and mailing lists but couldn't find any
>>>>>>>>>>>> solution for my problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dino R.
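One detail worth noting about the ~/.ssh/environment approach quoted above: sshd expects that file to contain plain NAME=value lines, so shell constructs like `export VAR;` statements or trailing semicolons will not behave there as they would in a shell rc file. A minimal sketch of the setup jody describes, using the library path from this thread:

    # ~/.ssh/environment on each node (plain NAME=value lines, no 'export')
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
    LD_LIBRARY_PATH=/usr/local/lib

    # /etc/ssh/sshd_config on each node, then restart sshd
    PermitUserEnvironment yes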
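Finally, a quick way to verify the whole chain before launching an MPI job, collecting the tests suggested in this thread (the grep pattern and the `which orted` check are illustrative additions, not commands from the thread):

    # run from sun: a non-interactive ssh does not source your usual shell rc files
    ssh saturn printenv | grep -E 'PATH|LD_LIBRARY_PATH'

    # the daemon should resolve to the same install prefix on every node
    which orted
    ssh saturn which orted

    # a launch test that involves no MPI application at all
    mpirun --hostfile hostfile hostname

If `which orted` reports different prefixes on different nodes, that is exactly the mixed-installation situation Tim describes, and a stray binary like the /usr/bin/orted mentioned earlier would show up here.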