Hi Tim

Just a quick update about my ssh/LD_LIBRARY_PATH problem.

Apparently the sshd on my system was configured not to permit
user-defined environment variables (for security reasons?).
To fix that, I had to edit the file
  /etc/ssh/sshd_config
changing the entry
  #PermitUserEnvironment no
to
  PermitUserEnvironment yes
and then add these lines to the file ~/.ssh/environment:
  PATH=/opt/openmpi/bin:/usr/local/bin:/bin:/usr/bin
  LD_LIBRARY_PATH=/opt/openmpi/lib
Maybe this is overkill, but at least ssh now makes the two variables
available for non-interactive logins as well.
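A quick way to verify this is to restart sshd (on Gentoo, for example,
with /etc/init.d/sshd restart) and then check what a non-interactive
login actually sees:
  ssh nano_00 printenv LD_LIBRARY_PATH
This is exactly the kind of session mpirun uses to start remote
processes.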

I have applied this fix on all seven of my Gentoo machines
(nano_00 - nano_06), and simple Open MPI test applications now run
with any number of processes.

But the Fedora machine (plankton) still has problems in some cases.
In the test application I use, process #0 broadcasts a number to all
other processes.
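In essence the test does the following (a minimal sketch, not the
literal source; my actual MPITest does a bit more):

  #include <mpi.h>
  #include <iostream>

  int main(int argc, char** argv) {
      MPI::Init(argc, argv);
      int rank = MPI::COMM_WORLD.Get_rank();
      int value = 0;
      if (rank == 0) value = 42;   // process #0 chooses the number
      // broadcast from process #0 to all others
      MPI::COMM_WORLD.Bcast(&value, 1, MPI::INT, 0);
      std::cout << "rank " << rank << " got " << value << std::endl;
      MPI::Finalize();
      return 0;
  }
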
This works in the following cases (always calling from nano_02):
 mpirun -np 3 --host nano_00 ./MPITest
 mpirun -np 3 --host plankton ./MPITest
 mpirun -np 3 --host plankton,nano_00 ./MPITest
But it does not work like this:
 mpirun -np 4 --host nano_00,plankton ./MPITest
As soon as the MPI_Bcast call is reached,
I get an error message:
[nano_00][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
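(errno 113 is EHOSTUNREACH, "No route to host", which would fit a
blocked connection.) Fedora enables a firewall by default; the active
rules on plankton can be listed with, for example:
  /sbin/iptables -L -n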

Does this still agree with your firewall hypothesis?

Thanks
  Jody


On 8/14/07, Tim Prins <tpr...@open-mpi.org> wrote:
> Jody,
>
> jody wrote:
> > Hi Tim
> > thanks for the suggestions.
> >
> > I now set both paths in .zshenv, but it seems that LD_LIBRARY_PATH
> > still does not get set.
> > The ldd experiment shows that none of the Open MPI libraries are found,
> > and indeed printenv shows that PATH is there but LD_LIBRARY_PATH is not.
> Are you setting LD_LIBRARY_PATH anywhere else in your scripts? I have,
> on more than one occasion, forgotten that I needed to do:
> export LD_LIBRARY_PATH="/foo:$LD_LIBRARY_PATH"
>
> Instead of just:
> export LD_LIBRARY_PATH="/foo"
>
> >
> > It is rather unclear why this happens...
> >
> > As to the second problem:
> > $ mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed: (103)
> > [aim-nano_02:05455] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect:
> > connect to 130.60.49.134:40618 failed,
> > connecting over all interfaces failed!
> > [aim-nano_02:05455] OOB: Connection to HNP lost
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 275
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1164
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > errmgr_hnp.c at line 90
> > [aim-plankton.unizh.ch:24222] ERROR: A daemon on node nano_02 failed
> > to start as expected.
> > [aim-plankton.unizh.ch:24222] ERROR: There may be more information
> > available from
> > [aim-plankton.unizh.ch:24222] ERROR: the remote shell (see above).
> > [aim-plankton.unizh.ch:24222] ERROR: The daemon exited unexpectedly
> > with status 1.
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > base/pls_base_orted_cmds.c at line 188
> > [aim-plankton.unizh.ch:24222] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> > pls_rsh_module.c at line 1196
> >
> > The strange thing is that nano_02's address is 130.60.49.130 and
> > plankton's (the caller) is 130.60.49.134.
> > I also made sure that nano_02 can ssh to plankton without a password,
> > but that didn't change the output.
>
> What is happening here is that the daemon launched on nano_02 is trying
> to contact mpirun on plankton, and is failing for some reason.
>
> Do you have any firewalls/port filtering enabled on nano_02? Open MPI
> generally cannot be run when there are any firewalls on the machines
> being used.
>
> Hope this helps,
>
> Tim
>
> >
> > Does this message give any hints as to the problem?
> >
> > Jody
> >
> >
> > On 8/14/07, Tim Prins <tpr...@open-mpi.org> wrote:
> >
> >     Hi Jody,
> >
> >     jody wrote:
> >      > Hi
> >      > I installed Open MPI 1.2.2 on a quad-core Intel machine running
> >      > Fedora 6 (hostname plankton).
> >      > I set PATH and LD_LIBRARY_PATH in the .zshrc file:
> >     Note that .zshrc is only used for interactive logins. You need to set
> >     up your system so that LD_LIBRARY_PATH and PATH are also set for
> >     non-interactive logins. See this zsh FAQ entry for what files you need
> >     to modify:
> >     http://zsh.sourceforge.net/FAQ/zshfaq03.html#l19
> >
> >     (BTW: I do not use zsh, but my assumption is that the file you want to
> >     set the PATH and LD_LIBRARY_PATH in is .zshenv)
> >      > $ echo $PATH
> >      > /opt/openmpi/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/home/jody/bin
> >      >
> >      > $ echo $LD_LIBRARY_PATH
> >      > /opt/openmpi/lib:
> >      >
> >      > When I run
> >      > $ mpirun -np 2 ./MPI2Test2
> >      > I get the message
> >      > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> >      > cannot open shared object file: No such file or directory
> >      > ./MPI2Test2: error while loading shared libraries: libmpi_cxx.so.0:
> >      > cannot open shared object file: No such file or directory
> >      >
> >      > However
> >      > $ mpirun -np 2 --prefix /opt/openmpi ./MPI2Test2
> >      > works.  Any explanation?
> >     Yes, the LD_LIBRARY_PATH is probably not set correctly. Try running:
> >     mpirun -np 2 ldd ./MPI2Test2
> >
> >     This should show what libraries your executable is using. Make sure all
> >     of the libraries are resolved.
> >
> >     Also, try running:
> >     mpirun -np 1 printenv |grep LD_LIBRARY_PATH
> >     to see what the LD_LIBRARY_PATH is for your executables. Note that you
> >     can NOT simply run mpirun echo $LD_LIBRARY_PATH, as the variable will
> >     be interpreted in the executing shell.
> >
> >      >
> >      > Second problem:
> >      > I have also installed Open MPI 1.2.2 on an AMD machine running
> >      > Gentoo Linux (hostname nano_02).
> >      > Here as well, PATH and LD_LIBRARY_PATH are set correctly,
> >      > and
> >      > $ mpirun -np 2 ./MPI2Test2
> >      > works locally on nano_02.
> >      >
> >      > If, however, from plankton I call
> >      > $ mpirun -np 2 --prefix /opt/openmpi --host nano_02 ./MPI2Test2
> >      > the call hangs with no output whatsoever.
> >      > Any pointers on how to solve this problem?
> >     Try running:
> >     mpirun --debug-daemons -np 2 --prefix /opt/openmpi --host nano_02
> >     ./MPI2Test2
> >
> >     This should give some more output as to what is happening.
> >
> >     Hope this helps,
> >
> >     Tim
> >
> >      >
> >      > Thank You
> >      >   Jody
> >      >
>
