[OMPI users] no reaction of remote hosts after ompi reinstall

2008-06-10 Thread jody
Hi
After a crash I reinstalled Open MPI 1.2.5 on my machines. I configured with
  ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
and set PATH and LD_LIBRARY_PATH in .bashrc:
  PATH=/opt/openmpi/bin:$PATH
  export PATH
  LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH

First problem: the output of
  ssh nano_00 printenv
does not contain the correct PATH (and no LD_LIBRARY_PATH at all),
but after a normal interactive ssh login both variables are set correctly.
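
(A quick way to compare what a non-interactive remote shell actually sees,
assuming bash is the login shell on nano_00; whether bash reads ~/.bashrc for
a single ssh command at all varies between distributions, and many stock
~/.bashrc files return early for non-interactive shells, so the exports have
to sit above any such guard:)

  # what does a non-interactive ssh command pick up?
  ssh nano_00 'echo $PATH; echo $LD_LIBRARY_PATH'

  # in ~/.bashrc on nano_00, keep the exports above an early-return guard
  # such as:  [ -z "$PS1" ] && return
  PATH=/opt/openmpi/bin:$PATH
  export PATH
  LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH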

When I run a test application on one computer, it works.

As soon as an additional computer is involved, there is no output at all,
and everything just hangs.

Explicitly adding --prefix doesn't change anything, even though Open MPI is
installed in the same directory (/opt/openmpi) on every computer.

Running mpirun with --debug-daemons doesn't help very much:

$ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch

(and nothing happens anymore)
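
(For completeness, mpirun also has a developer-level debug switch, -d, that
prints more of the ORTE start-up; something like the following might show
where the launch stalls:)

  mpirun -d -np 4 --hostfile testhosts --debug-daemons MPITest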

On the remote host, I see the following three processes come up
after I start mpirun on the local machine:
30603 ?S  0:00 sshd: jody@notty
30604 ?Ss 0:00 bash -c  PATH=/opt/openmpi/bin:$PATH ;
export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons
--bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
30605 ?S  0:00 /opt/openmpi/bin/orted --debug-daemons
--bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934
--nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl

So it looks as if the correct paths are set on the remote side (probably
thanks to --enable-mpirun-prefix-by-default).
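
That configure option is meant to make mpirun behave as if --prefix had been
given explicitly, e.g. (assuming the same install path on every node):

  mpirun --prefix /opt/openmpi -np 4 --hostfile testhosts --debug-daemons MPITest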

If I interrupt on the local machine (Ctrl-C):

[aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
[aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
[aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
[aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
errmgr_hnp.c at line 90
[aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start
as expected.
[aim-plankton:14982] ERROR: There may be more information available from
[aim-plankton:14982] ERROR: the remote shell (see above).
[aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166
--
WARNING: mpirun has exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--
[aim-plankton:14983] OOB: Connection to HNP lost

On the remote machine, the "sshd: jody@notty" process is gone, but the
other two stay.
I would be grateful for any suggestions!

Jody


Re: [OMPI users] Different CC for orte and opmi?

2008-06-10 Thread Ashley Pittman

Sorry, I'll try and fill in the background.  I'm attempting to package
openmpi for a number of customers we have; whenever possible on our
clusters we use modules to provide users with a choice of MPI
environment.

I'm using the 1.2.6 stable release and have built the code twice, once
to /opt/openmpi-1.2.6/gnu and once to /opt/openmpi-1.2.6/intel. I have
created two module environments called openmpi-gnu and openmpi-intel and
am also using an existing one called intel-compiler.  The build was
successful in both cases.

If I load the openmpi-gnu module I can compile and run code using
mpicc/mpirun as expected. If I load openmpi-intel and intel-compiler I
find I can compile code, but I get an error about a missing libimf.so when
I try to run it (reproduced below).

The application *will* run if I add the line "module load
intel-compiler" to my .bashrc, as this allows orted to link.  What I think
I want to do is to compile the actual library with icc but to compile
orted with gcc, so that I don't need to load the Intel environment by
default.  I'm assuming that the link problems only exist with orted and
not with the actual application, as LD_LIBRARY_PATH is set correctly
in the shell which launches the program.
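
A possible alternative (only a sketch, and assuming the Intel runtime really
lives in /opt/intel/compiler_10.1/x86_64/lib as in the ldd output below) would
be to keep building everything with icc but embed that directory as an rpath,
so orted can find libimf.so without any module loaded:

  ./configure --prefix=/opt/openmpi-1.2.6/intel CC=icc CXX=icpc \
      LDFLAGS="-Wl,-rpath,/opt/intel/compiler_10.1/x86_64/lib"
  make all install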

Ashley Pittman.

sccomp@demo4-sles-10-1-fe:~/benchmarks/IMB_3.0/src> mpirun -H comp00,comp01 
./IMB-MPI1
/opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
/opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
[demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp01 failed to start as 
expected.
[demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
[demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
[demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 
127.
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 275
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1166
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c 
at line 90
[demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp00 failed to start as 
expected.
[demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
[demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
[demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 
127.
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
base/pls_base_orted_cmds.c at line 188
[demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
pls_rsh_module.c at line 1198
--
mpirun was unable to cleanly terminate the daemons for this job. Returned value 
Timeout instead of ORTE_SUCCESS.
--

$ ldd /opt/openmpi-1.2.6/intel/bin/orted
linux-vdso.so.1 =>  (0x7fff877fe000)
libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 
(0x7fe97f3ac000)
libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 
(0x7fe97f239000)
libdl.so.2 => /lib64/libdl.so.2 (0x7fe97f135000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe97f01f000)
libutil.so.1 => /lib64/libutil.so.1 (0x7fe97ef1c000)
libm.so.6 => /lib64/libm.so.6 (0x7fe97edc7000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe97ecba000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe97eba3000)
libc.so.6 => /lib64/libc.so.6 (0x7fe97e972000)
libimf.so => /opt/intel/compiler_10.1/x86_64/lib/libimf.so 
(0x7fe97e61)
libsvml.so => /opt/intel/compiler_10.1/x86_64/lib/libsvml.so 
(0x7fe97e489000)
libintlc.so.5 => /opt/intel/compiler_10.1/x86_64/lib/libintlc.so.5 
(0x7fe97e35)
/lib64/ld-linux-x86-64.so.2 (0x7fe97f525000)
$ ssh comp00 ldd /opt/openmpi-1.2.6/intel/bin/orted
libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 
(0x2b1f0c0c5000)
libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 
(0x2b1f0c23e000)
libdl.so.2 => /lib64/libdl.so.2 (0x2b1f0c3bc000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x2b1f0c4c)
libutil.so.1 => /lib64/libutil.so.1 (0x2b1f0c5d7000)
libm.so.6 => /lib64/libm.so.6 (0x2b1f0c6da000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b1f0c82f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2b1f0c93d000)
libc.so.6 => /lib64/libc.so.6 (0x2b1f0ca54000)
/lib64/ld-linux-x86-64.so.2 (0x2b1f0bfa9000)
libimf.so => not found
libsvml.so => not found
libintlc.so.5 => not found
libimf.so => not found
libsvml.so => not found
libi

Re: [OMPI users] no reaction of remote hosts after ompi reinstall [follow up]

2008-06-10 Thread jody
Interestingly, I can start mpirun from any of the remote machines,
running processes on other remote machines and on the local machine.
But from the local machine I cannot start any process on a remote machine -
it just shows the behavior detailed in the previous mail.

remote1 -> remote1 ok
remote1 -> remote2 ok
remote1 -> local  ok

remote2 -> remote1 ok
remote2 -> remote2 ok
remote2 -> local  ok

local  -> local  ok
local  -> remote1 fails
local  -> remote2 fails

My remote machines are freshly updated Gentoo machines (AMD);
my local machine is a freshly installed Fedora 8 (Intel Quadro).
All use a freshly installed Open MPI 1.2.5.
Before my Fedora machine crashed it ran Fedora 6,
and everything worked great (with 1.2.2 on all machines).

Does anybody have a suggestion where I should look?

Thanks
  Jody




Re: [OMPI users] Different CC for orte and opmi?

2008-06-10 Thread Doug Reeder

Ashley,

I had a similar situation linking to the Intel libraries and used the
following in the link step:

  -L/opt/intel/compiler_10.1/x86_64/lib -Wl,-non_shared -limf -lsvml -lintlc -Wl,-call_shared


This created binaries statically linked to the Intel compiler
libraries, so I didn't have to push the Intel libraries out to the
nodes or worry about LD_LIBRARY_PATH.
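
For an Open MPI build, the same idea could presumably be applied at configure
time by passing those flags through LDFLAGS (a sketch only, reusing the paths
from Ashley's mail; I have not checked that the whole 1.2.6 tree links cleanly
this way):

  ./configure --prefix=/opt/openmpi-1.2.6/intel CC=icc CXX=icpc \
      LDFLAGS="-L/opt/intel/compiler_10.1/x86_64/lib -Wl,-non_shared -limf -lsvml -lintlc -Wl,-call_shared"
  make all install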


Doug Reeder