Hi Tim,

Just a quick update on my ssh/LD_LIBRARY_PATH problem.
Apparently, sshd on my system was configured not to permit user-defined
environment variables (for security reasons?). To fix that I had to edit
/etc/ssh/sshd_config, changing the entry

    #PermitUserEnvironment no

to

    PermitUserEnvironment yes

and add these lines to the file ~/.ssh/environment:

    PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
    LD_LIBRARY_PATH=/opt/openmpi/lib

Maybe it is overkill, but at least ssh now makes the two variables
available, and simple Open MPI test applications run. I have made these
fixes on all seven of my Gentoo machines (nano_00 - nano_06), and simple
Open MPI test applications run with any number of processes.

But the Fedora machine (plankton) still has problems in some cases. In
the test application I use, process #0 broadcasts a number to all other
processes. This always works in the following cases (in each case
calling from nano_02):

    mpirun -np 3 --host nano_00 ./MPITest
    mpirun -np 3 --host plankton ./MPITest
    mpirun -np 3 --host plankton,nano_00 ./MPITest

But it doesn't work like this:

    mpirun -np 4 --host nano_00,plankton ./MPITest

As soon as the MPI_Bcast call is reached, I get an error message:

    [nano_00][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
    connect() failed with errno=113

Does this still agree with your firewall hypothesis?

Thanks
Jody
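P.S. In case it helps someone else who finds this thread, this is
roughly how I verified the sshd change on each node (the init script
path is what my Gentoo boxes use; other distributions may differ):

    # restart sshd so it re-reads sshd_config (Gentoo-style init script)
    /etc/init.d/sshd restart

    # check that a non-interactive session (the kind mpirun's ssh
    # launcher starts) really sees the variable; with the
    # ~/.ssh/environment file above, this should print /opt/openmpi/lib
    ssh nano_00 printenv LD_LIBRARY_PATH

Also, errno=113 on Linux is EHOSTUNREACH ("No route to host"), which
would fit a firewall rejecting the connection. To list the packet
filter rules on plankton, I would try something like this (as root):

    /sbin/iptables -L -n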
On 8/14/07, Tim Prins <tpr...@open-mpi.org> wrote:
> Jody,
>
> jody wrote:
> > Hi Tim
> > thanks for the suggestions.
> >
> > I now set both paths in .zshenv but it seems that LD_LIBRARY_PATH
> > still does not get set.
> > The ldd experiment shows that none of the openmpi libraries are
> > found, and indeed printenv shows that PATH is there but
> > LD_LIBRARY_PATH is not.
> Are you setting LD_LIBRARY_PATH anywhere else in your scripts? I have,
> on more than one occasion, forgotten that I needed to do:
>   export LD_LIBRARY_PATH="/foo:$LD_LIBRARY_PATH"
> instead of just:
>   export LD_LIBRARY_PATH="/foo"
>
> > It is rather unclear why this happens...
> >
> > As to the second problem:
> > $ mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 130.60.49.134:40618 failed: (103)
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 130.60.49.134:40618 failed, connecting over all interfaces failed!
> > [aim-nano_02:05455] OOB: Connection to HNP lost
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1164
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
> > [aim-plankton.unizh.ch:24222] ERROR: A daemon on node nano_02 failed to start as expected.
> > [aim-plankton.unizh.ch:24222] ERROR: There may be more information available from
> > [aim-plankton.unizh.ch:24222] ERROR: the remote shell (see above).
> > [aim-plankton.unizh.ch:24222] ERROR: The daemon exited unexpectedly with status 1.
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1196
> >
> > The strange thing is that nano_02's address is 130.60.49.130 and
> > plankton's (the caller) is 130.60.49.134.
> > I also made sure that nano_02 can ssh to plankton without a password,
> > but that didn't change the output.
> What is happening here is that the daemon launched on nano_02 is trying
> to contact mpirun on plankton, and is failing for some reason.
>
> Do you have any firewalls/port filtering enabled on nano_02? Open MPI
> generally cannot be run when there are any firewalls on the machines
> being used.
>
> Hope this helps,
>
> Tim
>
> > Does this message give any hints as to the problem?
> >
> > Jody
> >
> > On 8/14/07, Tim Prins <tpr...@open-mpi.org> wrote:
> >
> > Hi Jody,
> >
> > jody wrote:
> > > Hi
> > > I installed openmpi 1.2.2 on a quad-core Intel machine running
> > > Fedora 6 (hostname plankton).
> > > I set PATH and LD_LIBRARY_PATH in the .zshrc file:
> > Note that .zshrc is only used for interactive logins. You need to
> > set up your system so that LD_LIBRARY_PATH and PATH are also set for
> > non-interactive logins. See this zsh FAQ entry for which files you
> > need to modify:
> > http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19
> >
> > (BTW: I do not use zsh, but my assumption is that the file you want
> > to set the PATH and LD_LIBRARY_PATH in is .zshenv)
> > > $ echo $PATH
> > > /opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin
> > >
> > > $ echo $LD_LIBRARY_PATH
> > > /opt/openmpi/lib:
> > >
> > > When I run
> > > $ mpirun -np 2 ./MPITest2
> > > I get the message
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> > > cannot open shared object file: No such file or directory
> > >
> > > However
> > > $ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
> > > works. Any explanation?
> > Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
> >   mpirun -np 2 ldd ./MPITest2
> > This should show what libraries your executable is using. Make sure
> > all of the libraries are resolved.
> >
> > Also, try running:
> >   mpirun -np 1 printenv | grep LD_LIBRARY_PATH
> > to see what the LD_LIBRARY_PATH is for your executables. Note that
> > you can NOT simply run "mpirun echo $LD_LIBRARY_PATH", as the
> > variable will be interpreted in the executing shell.
> >
> > > Second problem:
> > > I have also installed openmpi 1.2.2 on an AMD machine running
> > > Gentoo Linux (hostname nano_02).
> > > Here as well, PATH and LD_LIBRARY_PATH are set correctly, and
> > > $ mpirun -np 2 ./MPITest2
> > > works locally on nano_02.
> > >
> > > If, however, from plankton I call
> > > $ mpirun -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > > the call hangs with no output whatsoever.
> > > Any pointers on how to solve this problem?
> > Try running:
> >   mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > This should give some more output as to what is happening.
> > Hope this helps,
> >
> > Tim
> >
> > > Thank You
> > > Jody
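P.P.S. For anyone who wants to try Tim's .zshenv route instead of (or
in addition to) the PermitUserEnvironment change: I would expect the
file to contain something like the following sketch (the /opt/openmpi
prefix is from my installation; adjust as needed):

    # ~/.zshenv is read by zsh for every shell, including the
    # non-interactive ones that mpirun starts over ssh
    export PATH=/opt/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

Note the prepending, per Tim's remark above: a plain
export LD_LIBRARY_PATH="/foo" would clobber anything already set.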