Re: [OMPI users] Absoft compilation problem

2007-07-09 Thread Jeff Squyres
You didn't say, but I assume that my other two statements were  
therefore correct (1.1.2 works with a static F90 library, 1.2.3 does  
not work).


Do you need the MPI F90 bindings?  If not, does --disable-mpi-f90  
work for you?
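For example, something along these lines should skip the F90 bindings entirely (the prefix is just a placeholder; add whatever other options you normally use):

  ./configure --prefix=/opt/openmpi --disable-mpi-f90
  make all install

If that builds and runs cleanly, at least we know the problem is isolated to the shared F90 library support.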




On Jul 5, 2007, at 1:37 PM, Yip, Elizabeth L wrote:



1.2.1 does NOT work for me.

-Original Message-
From: Jeff Squyres [mailto:jsquy...@cisco.com]
Sent: Thu 7/5/2007 2:39 AM
To: Open MPI Users
Subject: Re: [OMPI users] Absoft compilation problem

On Jul 2, 2007, at 7:31 PM, Yip, Elizabeth L wrote:

> I downloaded openmpi-1.2.3rc2r15098 from your "nightly snapshot",
> same problem.
> I notice in version 1.1.2, you generate libmpi_f90.a instead of
> the .so files.

Brian clarified for me off-list that we use the same LT for nightly
trunk and OMPI 1.2.x tarballs.

In 1.1.2, you're correct that we only made static F90 libraries.

Can you clarify/confirm:

- 1.1.2 works for you (static F90 library)
- 1.2.1 works for you
- 1.2.3 does not work for you

If this is correct, then something is really, really weird.


>
> -Original Message-
> From: Jeff Squyres [mailto:jsquy...@cisco.com]
> Sent: Sun 7/1/2007 4:03 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Absoft compilation problem
>
> I unfortunately do not have access to an Absoft compiler to test this
> with; it looks like GNU Libtool is getting the wrong arguments to
> pass to the f95 compiler to build a shared library.
>
> A quick workaround for this issue would be to disable the MPI F90
> bindings with the --disable-mpi-f90 switch to configure.
>
> Could you try the Open MPI nightly trunk tarball and see if it works
> there?  We use a different version of Libtool to make those tarballs.

>
>
> On Jun 30, 2007, at 2:09 AM, Yip, Elizabeth L wrote:
>
> >
> > The attachment shows my problems when I tried to compile openmpi
> > 1.2.3 with absoft 95
> > (Absoft 64-bit Fortran 95 9.0 with Service Pack 1).  I have similar
> > problems with version 1.2.1, but
> > no problem with version 1.1.2.
> >
> > Elizabeth Yip
> > 
> > 
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> 


--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
Jeff Squyres
Cisco Systems



Re: [OMPI users] mpi with icc, icpc and ifort :: segfault (Jeff Squyres)

2007-07-09 Thread Jeff Squyres
Ok, that unfortunately doesn't make much sense -- I don't know why a call to opal_event_set() inside opal_event_init() would cause a segv.


Can you recompile OMPI with -g and re-run this test?  The "where"  
information from gdb will then give us more information.
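A minimal sketch of what I mean (the prefix is just a placeholder; adjust to your setup):

  ./configure --prefix=/usr/local/openmpi-1.2.3-dbg CFLAGS=-g CXXFLAGS=-g
  make all install

Then re-run your test against that install; the "where" output from gdb should then show file names and line numbers instead of bare addresses.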



On Jul 5, 2007, at 12:38 PM, Ricardo Reis wrote:



As requested:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1214711408 (LWP 23581)]
0xb7eb9d98 in opal_event_set ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0

(gdb) where
#0  0xb7eb9d98 in opal_event_set ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#1  0xb7ebb86f in opal_evsignal_init ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#2  0x0006 in ?? ()
#3  0x0002 in ?? ()
#4  0xb7ebb78a in opal_evsignal_add ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#5  0x0800 in ?? ()
#6  0xb7ed44b8 in ?? ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#7  0xb7ebc577 in opal_poll_init ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#8  0x0023 in ?? ()
#9  0xb7eb9f61 in opal_event_init ()
   from /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0
#10 0x0804d22a in ompi_info::open_components ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) shared
Symbols already loaded for /lib/ld-linux.so.2
Symbols already loaded for /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libmpi.so.0
Symbols already loaded for /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-rte.so.0
Symbols already loaded for /usr/local/share/openmpi-1.2.3.icc.ifort/lib/libopen-pal.so.0

Symbols already loaded for /lib/i686/cmov/libnsl.so.1
Symbols already loaded for /lib/i686/cmov/libutil.so.1
Symbols already loaded for /lib/i686/cmov/libm.so.6
Symbols already loaded for /usr/lib/libstdc++.so.6
Symbols already loaded for /lib/libgcc_s.so.1
Symbols already loaded for /opt/intel/cc/10.0.023/lib/libcxaguard.so.5
Symbols already loaded for /lib/i686/cmov/libpthread.so.0
Symbols already loaded for /lib/i686/cmov/libc.so.6
Symbols already loaded for /lib/i686/cmov/libdl.so.2
Symbols already loaded for /opt/intel/cc/10.0.023/lib/libimf.so
Symbols already loaded for /opt/intel/cc/10.0.023/lib/libintlc.so.5


 Ricardo Reis

 'Non Serviam'

 PhD student @ Lasef
 Computational Fluid Dynamics, High Performance Computing, Turbulence
 

 &

 Cultural Instigator @ Rádio Zero
 http://radio.ist.utl.pt
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems




Re: [OMPI users] Can't get TotalView to find main program

2007-07-09 Thread Jeff Squyres

On Jul 5, 2007, at 4:02 PM, Dennis McRitchie wrote:

Any idea why the main program can't be found when running under  
mpirun?


Just to be sure: you compiled your test MPI application with -g, right?


Does openmpi need to be built with either --enable-debug or
--enable-mem-debug? The "configure --help" says the former is not for
general MPI users. Unclear about the latter.


No, both of those are just for OMPI developers; you should not
need them for user installations.  Open MPI automatically compiles the
parts relevant for TV support with -g (i.e., the relevant .c files in
libmpi), so you shouldn't need to build OMPI itself with -g.
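So the usual sequence is just something like the following (a sketch; the file and program names are made up):

  mpicc -g -o my_app my_app.c
  totalview mpirun -a -np 2 ./my_app

i.e., -g goes on your own code, and nothing special is needed when building Open MPI itself.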


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread Tim Prins

Hi Jody,

Sorry for the super long delay. I don't know how this one got lost...

I run like this all the time. Unfortunately, it is not as simple as I 
would like. Here is what I do:


1. Log into the machine using ssh -X
2. Run mpirun with the following parameters:
	-mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh launcher. 
It may not be necessary depending on your setup)
	-mca pls_rsh_agent "ssh -X" (To make sure X information is forwarded. 
This might not be necessary if you have ssh set up to always forward X 
information)
	--debug-daemons (This ensures that the ssh connections to the backend
nodes are kept open. Otherwise, they are closed and X information cannot 
be forwarded. Unfortunately, this will also cause some debugging output 
to be printed, but right now there is no other way :( )


So, the complete command is:
mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons 
xterm -e gdb my_prog
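If you use a wrapper script like the run_gdb.sh from your original mail instead of invoking xterm directly, the same parameters apply, e.g.:

mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons run_gdb.sh ./TestApp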


I hope this helps. Let me know if you are still experiencing problems.

Tim


jody wrote:

Hi
For debugging i usually run each process in a separate X-window.
This works well if i set the DISPLAY variable to the computer
from which i am starting my OpenMPI application.

This method fails however, if i log in (via ssh) to my workstation
from a third computer and then start my OpenMPI application,
only the processes running on the workstation i logged into can
open their windows on the third computer. The processes on
the other computers can't open their windows.

This is how i start the processes

mpirun -np 4 -x DISPLAY run_gdb.sh ./TestApp

where run_gdb.sh looks like this
-
#!/bin/csh -f

echo "Running GDB on node `hostname`"
xterm -e gdb $*
exit 0
-
The output from the processes on the other computer:
xterm Xt error: Can't open display: localhost:12.0

Is there a way to tell OpenMPI to forward the X windows
over yet another ssh connection?

Thanks
  Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi fails on mx endpoint busy

2007-07-09 Thread SLIM H.A.

Dear Tim and Scott

I followed the suggestions made:

> 
> So you should either pass '-mca btl mx,sm,self', or just pass 
> nothing at all. 
> Open MPI is fairly smart at figuring out what components to 
> use, so you really should not need to specify anything.
> 

Using

node001>mpirun --mca btl mx,sm,self  -np 4 -hostfile ompi_machinefile
./cpi

connects to some of the mx ports, not all 4, but the program runs:

[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with
status=20

It spawned 4 processes on node001. Passing nothing at all gave the same
problem.

> Also, could you try creating a host file named "hosts" with 
> the names of your machines and then try:
> 
> $ mpirun -np 2 --hostfile hosts ./cpi
> 
> and then
> 
> $ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi

node001>mpirun -np 2 -hostfile ompi_machinefile  ./cpi_gcc_ompi_mx

works but increasing to 4 cores again uses less than 4 ports.
Finally

node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

is successful even for -np 4. From here I tried 2 nodes:

node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

This gave:

orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as
expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1196

--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.


--

Apparently orted is not started up properly. Something missing in the
installation?

Thanks

Henk


> -Original Message-
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins
> Sent: 06 July 2007 15:59
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi fails on mx endpoint busy
> 
> Henk,
> 
> On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:
> > Dear Tim
> >
> > I followed the use of "--mca btl mx,self" as suggested in the FAQ
> >
> > http://www.open-mpi.org/faq/?category=myrinet#myri-btl
> Yeah, that FAQ is wrong. I am working right now to fix it up. 
> It should be updated this afternoon.
> 
> >
> > When I use your extra mca value I get:
> > >mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi
> >
> > --
> >
> > > WARNING: A user-supplied value attempted to override the read-only
> > > MCA parameter named "btl_mx_shared_mem".
> > >
> > > The user-supplied value was ignored.
> Oops, on the 1.2 branch this is a read-only parameter. On the 
> current trunk the user can change it. Sorry for the 
> confusion. Oh well, you should probably use Open MPI's shared 
> memory support instead anyways.
> 
> So you should either pass '-mca btl mx,sm,self', or just pass 
> nothing at all. 
> Open MPI is fairly smart at figuring out what components to 
> use, so you really should not need to specify anything.
> 
> > followed by the same error messages as before.
> >
> > Note that although I add "self" the error messages complain about it
> >
> > missing:
> > > > Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
> > > > If you specified the use of a BTL component, you may have
> > >
> > > forgotten a
> > >
> > > > component (such as "self") in the list of usable components.
> >
> > I checked the output from mx_info for both the current node and 
> > another, there seems not to be a problem.
> > I attach the output from ompi_info --all.  Also
> >
> > >ompi_info | grep mx
> >
> >   Prefix:
> > /usr/local/Cluster-Apps/openmpi/mx/gcc/64/1.2.3
> >  MCA btl: mx (MCA v1.0, API v1.0.1, 
> Component v1.2.3)
> >  MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
> >
> > As a further check, I rebuilt the exe with mpich and that works fine
> > on the same node over myrinet. I wonder whether mx is properly included
> > in my openmpi build.
> > Use of ldd -v on the mpich exe gives references to libmyriexpress.so,
> > which is not the case for the ompi built exe, suggesting something is
> > missing?
> No, this is expected behavior. The Open MPI executables are
> not linked to libmyriexpress.so, only the mx component is.

Re: [OMPI users] openmpi fails on mx endpoint busy

2007-07-09 Thread Tim Prins

SLIM H.A. wrote:
 
Dear Tim and Scott


I followed the suggestions made:

So you should either pass '-mca btl mx,sm,self', or just pass 
nothing at all. 
Open MPI is fairly smart at figuring out what components to 
use, so you really should not need to specify anything.




Using

node001>mpirun --mca btl mx,sm,self  -np 4 -hostfile ompi_machinefile
./cpi

connects to some of the mx ports, not all 4, but the program runs:

[node001:01562] mca_btl_mx_init: mx_open_endpoint() failed with
status=20
[node001:01564] mca_btl_mx_init: mx_open_endpoint() failed with
status=20


I finally figured out the problem here. Open MPI now has 2 different 
network stacks, only one of which can be used at a time: the mtl and the 
btl. Both the mx btl and the mx mtl are being opened, and each of them 
opens an endpoint. The mtl is then closed because it will not be used, 
which releases its endpoint, but in the meantime the available endpoints 
are exhausted while other processes are still trying to open theirs.


There are two solutions:
1. Increase the number of available endpoints. According to the Myrinet 
documentation, upping the limit to 16 or so should have no performance 
impact.


2. Alternatively, you can tell the mx mtl not to run with -mca mtl ^mx

So, you should just be able to run:
mpirun --mca btl mx,sm,self -mca mtl ^mx -np 4 -hostfile ompi_machinefile ./cpi


And it should work.



It spawned 4 processes on node001. Passing nothing at all gave the same
problem.

Also, could you try creating a host file named "hosts" with 
the names of your machines and then try:


$ mpirun -np 2 --hostfile hosts ./cpi

and then

$ mpirun -np 2 --hostfile hosts --mca pml cm ./cpi


node001>mpirun -np 2 -hostfile ompi_machinefile  ./cpi_gcc_ompi_mx

works but increasing to 4 cores again uses less than 4 ports.
Finally

node001>mpirun -np 4 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

is successful even for -np 4. From here I tried 2 nodes:

node001>mpirun -np 8 -hostfile ompi_machinefile --mca pml cm
./cpi_gcc_ompi_mx

This gave:

orted: Command not found.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 275
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1164
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
line 90
[node001:04585] ERROR: A daemon on node node002 failed to start as
expected.
[node001:04585] ERROR: There may be more information available from
[node001:04585] ERROR: the remote shell (see above).
[node001:04585] ERROR: The daemon exited unexpectedly with status 1.
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file
base/pls_base_orted_cmds.c at line 188
[node001:04585] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1196

--
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.


--


The problem is that on the remote node Open MPI cannot find the 'orted' 
executable. Is the Open MPI install available on the remote node?


Try:
ssh remote_node which orted

This should locate the 'orted' program. If it does not, you may need to 
modify your PATH, as described here:

http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
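In practice that usually amounts to adding something like the following (adjust the prefix to wherever Open MPI actually lives on your nodes) to a startup file that is read for non-interactive shells, e.g. ~/.bashrc:

export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH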

Hope this helps,

Tim



Apparently orted is not started up properly. Something missing in the
installation?

Thanks

Henk



-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Tim Prins

Sent: 06 July 2007 15:59
To: Open MPI Users
Subject: Re: [OMPI users] openmpi fails on mx endpoint busy

Henk,

On Friday 06 July 2007 05:34:35 am SLIM H.A. wrote:

Dear Tim

I followed the use of "--mca btl mx,self" as suggested in the FAQ

http://www.open-mpi.org/faq/?category=myrinet#myri-btl
Yeah, that FAQ is wrong. I am working right now to fix it up. 
It should be updated this afternoon.



When I use your extra mca value I get:

mpirun --mca btl mx,self --mca btl_mx_shared_mem 1 -np 4 ./cpi



--

WARNING: A user-supplied value attempted to override the read-only
MCA parameter named "btl_mx_shared_mem".

The user-supplied value was ignored.
Oops, on the 1.2 branch this is a read-only parameter. On the 
current trunk the user can change it. Sorry for the 
confusion. Oh well, you should probably use Open MPI's shared 
memory support instead anyways.


So you should either pass '-mca btl mx,sm,self', or just pass 
nothing at all. 
Open MPI is fairly smart at figuring out what components to 
use, so you really should not need to specify anything.



followed by the same error messages as before.

Note that although I add "self" the error messages complain about it

missing:

Process 0.1.0 is unable to reach 0.1.1 for MPI communication.

Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread jody

Hi Tim

Thank You for your reply.
Unfortunately my workstation has died,
and even when i try to run an openmpi application
in a simple way, i get errors:

jody@aim-nano_02 /home/aim-cari/jody $  mpirun -np 2 --hostfile hostfile ./a.out
bash: orted: command not found
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.129 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.128 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.

However, i set PATH and LD_LIBRARY_PATH to the correct paths both in
.bashrc AND .bash_profile.

For example:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 echo $PATH
/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin

But:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 orted
bash: orted: command not found

Do You have any suggestions?

Thank You
 Jody



On 7/9/07, Tim Prins  wrote:

Hi Jody,

Sorry for the super long delay. I don't know how this one got lost...

I run like this all the time. Unfortunately, it is not as simple as I
would like. Here is what I do:

1. Log into the machine using ssh -X
2. Run mpirun with the following parameters:
-mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh launcher.
It may not be necessary depending on your setup)
-mca pls_rsh_agent "ssh -X" (To make sure X information is forwarded.
This might not be necessary if you have ssh setup to always forward X
information)
--debug-daemons (This ensures that the ssh connections to the backend
nodes are kept open. Otherwise, they are closed and X information cannot
be forwarded. Unfortunately, this will also cause some debugging output
to be printed, but right now there is no other way :( )

So, the complete command is:
mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons
xterm -e gdb my_prog

I hope this helps. Let me know if you are still experiencing problems.

Tim


jody wrote:
> Hi
> For debugging i usually run each process in a separate X-window.
> This works well if i set the DISPLAY variable to the computer
> from which i am starting my OpenMPI application.
>
> This method fails however, if i log in (via ssh) to my workstation
> from a third computer and then start my OpenMPI application,
> only the processes running on the workstation i logged into can
> open their windows on the third computers. The processes on
> the other computers can't open their windows.
>
> This is how i start the processes
>
> mpirun -np 4 -x DISPLAY run_gdb.sh ./TestApp
>
> where run_gdb.sh looks like this
> -
> #!/bin/csh -f
>
> echo "Running GDB on node `hostname`"
> xterm -e gdb $*
> exit 0
> -
> The output from the processes on the other computer:
> xterm Xt error: Can't open display: localhost:12.0
>
> Is there a way to tell OpenMPI to forward the X windows
> over yet another ssh connection?
>
> Thanks
>   Jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Open MPI 1.2.3 spec file

2007-07-09 Thread Jeff Squyres

On Jul 6, 2007, at 12:05 PM, Alex Tumanov wrote:


Eureka! I managed to get it working despite the incorrect _initial_
./configure invocation. For those interested, here are my compilation
options:
# cat ompi_build.sh
#!/bin/sh

rpmbuild --rebuild  -D "configure_options \
--prefix=%{_prefix} \
--with-openib=/usr/include/infiniband \
--with-openib-libdir=/usr/lib64 \
--sysconfdir=%{_prefix}/etc" \
-D "install_in_opt 1" \
-D "_name openmpi_vendor" \
-D "_defaultdocdir %{_prefix}/share" \
-D "mflags all" openmpi-1.2.3-1.src.rpm


Is that where the docdir is supposed to be these days?  Shouldn't it  
actually be $prefix/share/doc/$name-$version?  When I didn't override  
the docdir but did use install_in_opt, I got the following in the  
resulting RPM:


/opt/openmpi/1.3a1r15304/share/openmpi/mpif90-wrapper-data.txt
/usr/share/doc/openmpi-1.3a1r15304/INSTALL

So I'm thinking that the doc files (LICENSE and friends) should be in

/opt/openmpi/1.3a1r15304/share/openmpi-1.3a1r15304/INSTALL

Which actually seems kinda weird, since there's an /opt/openmpi/1.3a1r15304/share/openmpi/ directory.


I know there were changes to conventional thinking about where docdir  
should be these days, but I couldn't find any specific references to  
it in the FHS (http://www.pathname.com/fhs/pub/fhs-2.3.html), for  
example.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread jody

Hi Tim
(I accidentally sent the previous message before it was ready - here's
the complete one)
Thank You for your reply.
Unfortunately my workstation, on which i could successfully run openmpi
applications, has died. But on my replacement machine (which
i assume i have set up in an equivalent way) i now get errors even when i try
to run an openmpi application in a simple way:

jody@aim-nano_02 /home/aim-cari/jody $  mpirun -np 2 --hostfile hostfile ./a.out
bash: orted: command not found
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.129 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.128 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.

However, i set PATH and LD_LIBRARY_PATH to the correct paths both in
.bashrc AND .bash_profile.

For example:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 echo $PATH
/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin

But:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 orted
bash: orted: command not found

Do You have any suggestions?

Thank You
Jody

On 7/9/07, Tim Prins  wrote:

Hi Jody,

Sorry for the super long delay. I don't know how this one got lost...

I run like this all the time. Unfortunately, it is not as simple as I
would like. Here is what I do:

1. Log into the machine using ssh -X
2. Run mpirun with the following parameters:
-mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh launcher.
It may not be necessary depending on your setup)
-mca pls_rsh_agent "ssh -X" (To make sure X information is forwarded.
This might not be necessary if you have ssh setup to always forward X
information)
--debug-daemons (This ensures that the ssh connections to the backend
nodes are kept open. Otherwise, they are closed and X information cannot
be forwarded. Unfortunately, this will also cause some debugging output
to be printed, but right now there is no other way :( )

So, the complete command is:
mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons
xterm -e gdb my_prog

I hope this helps. Let me know if you are still experiencing problems.

Tim


jody wrote:
> Hi
> For debugging i usually run each process in a separate X-window.
> This works well if i set the DISPLAY variable to the computer
> from which i am starting my OpenMPI application.
>
> This method fails however, if i log in (via ssh) to my workstation
> from a third computer and then start my OpenMPI application,
> only the processes running on the workstation i logged into can
> open their windows on the third computers. The processes on
> the other computers can't open their windows.
>
> This is how i start the processes
>
> mpirun -np 4 -x DISPLAY run_gdb.sh ./TestApp
>
> where run_gdb.sh looks like this
> -
> #!/bin/csh -f
>
> echo "Running GDB on node `hostname`"
> xterm -e gdb $*
> exit 0
> -
> The output from the processes on the other computer:
> xterm Xt error: Can't open display: localhost:12.0
>
> Is there a way to tell OpenMPI to forward the X windows
> over yet another ssh connection?
>
> Thanks
>   Jody
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread Tim Prins

jody wrote:

Hi Tim
(I accidentally sent the previous message before it was ready - here's
the complete one)
Thank You for your reply.
Unfortunately my workstation, on which i could successfully run openmpi
applications, has died. But on my replacement machine (which
i assume i have set up in an equivalent way) i now get errors even when i try
to run an openmpi application in a simple way:

jody@aim-nano_02 /home/aim-cari/jody $  mpirun -np 2 --hostfile hostfile ./a.out
bash: orted: command not found
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.129 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.
[aim-nano_02:22145] ERROR: A daemon on node 130.60.49.128 failed to
start as expected.
[aim-nano_02:22145] ERROR: There may be more information available from
[aim-nano_02:22145] ERROR: the remote shell (see above).
[aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status 127.

However, i set PATH and LD_LIBRARY_PATH to the correct paths both in
.bashrc AND .bash_profile.

I assume you are using bash. You might try changing your .profile as well.



For example:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 echo $PATH
/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin


When you do this, $PATH gets interpreted on the local host, not the 
remote host. Try instead:


ssh 130.60.49.128 printenv |grep PATH
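(Or quote the command so that the expansion happens on the remote side, e.g.:

ssh 130.60.49.128 'echo $PATH'

With the single quotes, $PATH is expanded by the shell on 130.60.49.128 rather than by your local shell.)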



But:
jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 orted
bash: orted: command not found


You could also do:
ssh 130.60.49.128 which orted

This will show you the paths it looked in for the orted.


Do You have any suggestions?
To avoid dealing with paths (assuming everything is installed in the 
same directory on all nodes) you can also try the suggestion here 
(although I think that once it is set up, modifying PATHs is the easier 
way to go, less typing :):

http://www.open-mpi.org/faq/?category=running#mpirun-prefix
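With --prefix the invocation would look roughly like this (assuming /opt/openmpi is the install root on every node):

mpirun --prefix /opt/openmpi -np 2 --hostfile hostfile ./a.out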


Hope this helps,

Tim


Thank You
 Jody

On 7/9/07, Tim Prins  wrote:

Hi Jody,

Sorry for the super long delay. I don't know how this one got lost...

I run like this all the time. Unfortunately, it is not as simple as I
would like. Here is what I do:

1. Log into the machine using ssh -X
2. Run mpirun with the following parameters:
-mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh launcher.
It may not be necessary depending on your setup)
-mca pls_rsh_agent "ssh -X" (To make sure X information is forwarded.
This might not be necessary if you have ssh setup to always forward X
information)
--debug-daemons (This ensures that the ssh connections to the backend
nodes are kept open. Otherwise, they are closed and X information cannot
be forwarded. Unfortunately, this will also cause some debugging output
to be printed, but right now there is no other way :( )

So, the complete command is:
mpirun -np 4 -mca pls rsh -mca pls_rsh_agent "ssh -X" --debug-daemons
xterm -e gdb my_prog

I hope this helps. Let me know if you are still experiencing problems.

Tim


jody wrote:

Hi
For debugging i usually run each process in a separate X-window.
This works well if i set the DISPLAY variable to the computer
from which i am starting my OpenMPI application.

This method fails however, if i log in (via ssh) to my workstation
from a third computer and then start my OpenMPI application,
only the processes running on the workstation i logged into can
open their windows on the third computers. The processes on
> the other computers can't open their windows.

This is how i start the processes

mpirun -np 4 -x DISPLAY run_gdb.sh ./TestApp

where run_gdb.sh looks like this
-
#!/bin/csh -f

echo "Running GDB on node `hostname`"
xterm -e gdb $*
exit 0
-
The output from the processes on the other computer:
xterm Xt error: Can't open display: localhost:12.0

> Is there a way to tell OpenMPI to forward the X windows
over yet another ssh connection?

Thanks
  Jody
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread jody

Tim,
thanks for your suggestions.
There seems to be something wrong with the PATH:
jody@aim-nano_02 ~/progs $ ssh 130.60.49.128 printenv | grep PATH
PATH=/usr/bin:/bin:/usr/sbin:/sbin

which i don't understand. Logging via ssh into 130.60.49.128 i get:

jody@aim-nano_02 ~/progs $ ssh 130.60.49.128
Last login: Mon Jul  9 18:26:11 2007 from 130.60.49.129
jody@aim-nano_00 ~ $ cat .bash_profile
# /etc/skel/.bash_profile

# This file is sourced by bash for login shells.  The following line
# runs your .bashrc and is recommended by the bash info pages.
[[ -f ~/.bashrc ]] && . ~/.bashrc

PATH=/opt/openmpi/bin:$PATH
export PATH
LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH


jody@aim-nano_00 ~ $ echo $PATH
/opt/openmpi/bin:/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/3.4.5:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin

(aim-nano_00 is the name of 130.60.49.128)
So why is the path set when i ssh by hand,
but not otherwise?

The suggestion with the --prefix option also didn't work:
jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2 --prefix /opt/openmpi --hostfile hostfile ./a.out
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59
[aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
dss/dss_peek.c at line 59

(after which the thing seems to hang)

If i use aim-nano_02 (130.60.49.130) instead of a hostfile,
jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2 --prefix /opt/openmpi --host 130.60.49.130 ./a.out
it works, as it does if i run it on the machine itself the standard way:
jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2 --host 130.60.49.130 ./a.out

Is there anything else i could try?

Jody

On 7/9/07, Tim Prins  wrote:

jody wrote:
> Hi Tim
> (I accidentally sent the previous message before it was ready - here's
> the complete one)
> Thank You for your reply.
> Unfortunately my workstation, on which i could successfully run openmpi
> applications, has died. But on my replacement machine (which
> i assume i have set up in an equivalent way) i now get errors even when i

try

> to run an openmpi application in a simple way:
>
> jody@aim-nano_02 /home/aim-cari/jody $  mpirun -np 2 --hostfile hostfile

./a.out

> bash: orted: command not found
> [aim-nano_02:22145] ERROR: A daemon on node 130.60.49.129 failed to
> start as expected.
> [aim-nano_02:22145] ERROR: There may be more information available from
> [aim-nano_02:22145] ERROR: the remote shell (see above).
> [aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status

127.

> [aim-nano_02:22145] ERROR: A daemon on node 130.60.49.128 failed to
> start as expected.
> [aim-nano_02:22145] ERROR: There may be more information available from
> [aim-nano_02:22145] ERROR: the remote shell (see above).
> [aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status

127.

>
> However, i set PATH and LD_LIBRARY_PATH to the correct paths both in
> .bashrc AND .bash_profile.
I assume you are using bash. You might try changing your .profile as well.

>
> For example:
> jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 echo $PATH
>

/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin


When you do this, $PATH gets interpreted on the local host, not the
remote host. Try instead:

ssh 130.60.49.128 printenv |grep PATH

>
> But:
> jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 orted
> bash: orted: command not found
>
You could also do:
ssh 130.60.49.128 which orted

This will show you the paths it looked in for the orted.

> Do You have any suggestions?
To avoid dealing with paths (assuming everything is installed in the
same directory on all nodes) you can also try the suggestion here
(although I think that once it is set up, modifying PATHs is the easier
way to go, less typing :):
http://www.open-mpi.org/faq/?category=running#mpirun-prefix


Hope this helps,

Tim
>
> Thank You
>  Jody
>
> On 7/9/07, Tim Prins  wrote:
>> Hi Jody,
>>
>> Sorry for the super long delay. I don't know how this one got lost...
>>
>> I run like this all the time. Unfortunately, it is not as simple as I
>> would like. Here is what I do:
>>
>> 1. Log into the machine using ssh -X
>> 2. Run mpirun with the following parameters:
>> -mca pls rsh  (This makes sure that Open MPI uses the rsh/ssh

launcher.

>> It may not be necessary depending on your setup)

Re: [OMPI users] OpenMPI output over several ssh-hops?

2007-07-09 Thread Tim Prins
On Monday 09 July 2007 12:52:29 pm jody wrote:
> Tim,
> thanks for your suggestions.
> There seems to be something wrong with the PATH:
> jody@aim-nano_02 ~/progs $ ssh 130.60.49.128 printenv | grep PATH
> PATH=/usr/bin:/bin:/usr/sbin:/sbin
>
> which i don't understand. Logging via ssh into 130.60.49.128 i get:
>
> jody@aim-nano_02 ~/progs $ ssh 130.60.49.128
> Last login: Mon Jul  9 18:26:11 2007 from 130.60.49.129
> jody@aim-nano_00 ~ $ cat .bash_profile
> # /etc/skel/.bash_profile
>
> # This file is sourced by bash for login shells.  The following line
> # runs your .bashrc and is recommended by the bash info pages.
> [[ -f ~/.bashrc ]] && . ~/.bashrc
>
> PATH=/opt/openmpi/bin:$PATH
> export PATH
> LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH
>
>
> jody@aim-nano_00 ~ $ echo $PATH
> /opt/openmpi/bin:/opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/3.4.5:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin
>
> (aim-nano_00 is the name of 130.60.49.128)
> So why is the path set when i ssh by hand,
> but not otherwise?
You must set the path in .bashrc. See 
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

Make sure:
ssh 130.60.49.128 which orted
works. If it doesn't, there is something wrong with the PATH.
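One common gotcha (just a guess at what might be going on here): many default .bashrc files return early for non-interactive shells, so the PATH settings have to go near the top, before any such test, e.g.:

# near the top of ~/.bashrc, before any "skip if not interactive" guard
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH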

>
> The suggestion with  the --prefix option also didn't work:
> jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2 --prefix /opt/openmpi
> --hostfile hostfile ./a.out
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
> [aim-nano_02:13733] [0,0,0] ORTE_ERROR_LOG: Data unpack failed in file
> dss/dss_peek.c at line 59
Often this means that there is a version mismatch. Do all the nodes have the 
same version of Open MPI installed? Did you compile your application with 
this version of Open MPI?
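A quick way to check is to compare what ompi_info reports on each node, e.g.:

ompi_info | grep "Open MPI:"
ssh 130.60.49.128 /opt/openmpi/bin/ompi_info | grep "Open MPI:"

(using the full path on the remote side, so the PATH problem above doesn't get in the way).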

Tim

>
> (after which the thing seems to hang)
>
> If i use the aim-nano_02 (130.60.49.130) instead of a hostfile,
> jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2 --prefix /opt/openmpi
> --host 130.60.49.130 ./a.out
> it works, as it does if i run it on the machine itself the standard way
> jody@aim-nano_02 /home/aim-cari/jody $ mpirun -np 2  --host
> 130.60.49.130 ./a.out
>
> Is there anything else i could try?
>
> Jody
>
> On 7/9/07, Tim Prins  wrote:
> > jody wrote:
> > > Hi Tim
> > > (I accidentally sent the previous message before it was ready - here's
> > > the complete one)
> > > Thank You for your reply.
> > > Unfortunately my workstation, on which i could successfully run openmpi
> > > applications, has died. But one my replacement machine (which
> > > i assume i have setup in an equivalent way) i now get errors even when
> > > i
>
> try
>
> > > to run an openmpi application in a simple way:
> > >
> > > jody@aim-nano_02 /home/aim-cari/jody $  mpirun -np 2 --hostfile
> > > hostfile
>
> ./a.out
>
> > > bash: orted: command not found
> > > [aim-nano_02:22145] ERROR: A daemon on node 130.60.49.129 failed to
> > > start as expected.
> > > [aim-nano_02:22145] ERROR: There may be more information available from
> > > [aim-nano_02:22145] ERROR: the remote shell (see above).
> > > [aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status
>
> 127.
>
> > > [aim-nano_02:22145] ERROR: A daemon on node 130.60.49.128 failed to
> > > start as expected.
> > > [aim-nano_02:22145] ERROR: There may be more information available from
> > > [aim-nano_02:22145] ERROR: the remote shell (see above).
> > > [aim-nano_02:22145] ERROR: The daemon exited unexpectedly with status
>
> 127.
>
> > > However, i set PATH and LD_LIBRARY_PATH to the correct paths both in
> > > .bashrc AND .bash_profile.
> >
> > I assume you are using bash. You might try changing your .profile as
> > well.
> >
> > > For example:
> > > jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 echo $PATH
>
> /opt/openmpi/bin:/usr/local/bin:/usr/bin:/bin:/opt/bin:/usr/i686-pc-linux-gnu/gcc-bin/4.1.2:/opt/sun-jdk-1.4.2.10/bin:/opt/sun-jdk-1.4.2.10/jre/bin:/opt/sun-jdk-1.4.2.10/jre/javaws:/usr/qt/3/bin
>
> > When you do this, $PATH gets interpreted on the local host, not the
> > remote host. Try instead:
> >
> > ssh 130.60.49.128 printenv |grep PATH
> >
> > > But:
> > > jody@aim-nano_02 /home/aim-cari/jody $ ssh 130.60.49.128 orted
> > > bash: orted: command not found
> >
> > You could also do:
> > ssh 130.60.49.128 which orted
> >
> > This will show you the paths it looked in for the orted.
> >
> > > Do You have any suggestions?
> >
> > To avoid deal