[OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread yanyg
Hello All,

I installed Open MPI 1.4.3 on our new HPC blades, which use an
InfiniBand interconnect.

My system environment is as follows:

1) uname -a output:
Linux gulftown 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

2) /home is mounted on all nodes, and mpirun is started from under
/home/...

Open MPI and the application code are compiled with the Intel(R)
compilers, V11. The InfiniBand stack is Mellanox OFED 1.5.2.

I have two questions about mpirun:

a) How can I find out which network interconnect protocol the MPI
application is actually using?

I pass "--mca btl openib,self,sm,tcp" to mpirun, but I want to make
sure it really uses the InfiniBand interconnect.

b) when I run mpirun, I get the following message:
== Quote begin
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
--
A daemon (pid 15120) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
ibnode001 - daemon did not report back when launched
ibnode002 - daemon did not report back when launched
ibnode003 - daemon did not report back when launched

== Quote end
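
The "status 127" in the quoted output matches the exit status a POSIX
shell conventionally returns when it cannot find a command, which fits the
"orted: command not found" lines. A minimal sketch that reproduces the
status locally; the command name below is made up for illustration and is
assumed not to exist on the machine:

```shell
#!/bin/sh
# Ask a shell to run a command that is assumed not to exist.
# "no_such_command_orted_demo" is a hypothetical name, not a real tool.
sh -c 'no_such_command_orted_demo' 2>/dev/null
status=$?

# A shell reports 127 when the command cannot be found.
echo "exit status: $status"
```

Seeing 127 from the remote daemon launch therefore points at PATH on the
remote side rather than at a crash inside orted itself.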

It seems orted is not found on the slave nodes. Setting PATH and
LD_LIBRARY_PATH through mpirun's --prefix, --path, or -x options, so
that orted and its dynamic libraries are available on the slave
nodes, does not work as the mpirun manual page describes. The only
setup that works is setting PATH and LD_LIBRARY_PATH in ~/.bashrc,
which the slave nodes also read for the login shell. I would rather
not set PATH and LD_LIBRARY_PATH in ~/.bashrc; I want to pass them
through mpirun options directly.

Thanks,
Yiguang



[OMPI users] InfiniBand, different OpenFabrics transport types

2011-06-28 Thread Bill Johnstone
Hello all.

I have a heterogeneous network of InfiniBand-equipped hosts which are all 
connected to the same backbone switch, an older SDR 10 Gb/s unit.

One set of nodes uses the Mellanox "ib_mthca" driver, while the other uses the 
"mlx4" driver.


This is on Linux 2.6.32, with Open MPI 1.5.3.


When I run Open MPI across these node types, I get an error message of the form:

Open MPI detected two different OpenFabrics transport types in the same 
Infiniband network. 
Such mixed network trasport configuration is not supported by Open MPI.

Local host: compute-chassis-1-node-01
Local adapter: mthca0 (vendor 0x5ad, part ID 25208) 
Local transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN 

Remote host: compute-chassis-3-node-01
Remote Adapter: (vendor 0x2c9, part ID 26428) 
Remote transport type: MCA_BTL_OPENIB_TRANSPORT_IB

Two questions:

1. Why is this occurring if both adapters have all the OpenIB software set up?  
Is it because Open MPI is trying to use functionality such as ConnectX with the 
newer hardware, which is incompatible with older hardware, or is it something 
more mundane?

2. How can I use IB amongst these heterogeneous nodes?

Thank you.




Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread Ralph Castain

On Jun 28, 2011, at 9:05 AM, ya...@adina.com wrote:

> Hello All,
> 
> I installed Open MPI 1.4.3 on our new HPC blades, which use an
> InfiniBand interconnect.
> 
> My system environment is as follows:
> 
> 1) uname -a output:
> Linux gulftown 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> 2) /home is mounted on all nodes, and mpirun is started from under
> /home/...
> 
> Open MPI and the application code are compiled with the Intel(R)
> compilers, V11. The InfiniBand stack is Mellanox OFED 1.5.2.
> 
> I have two questions about mpirun:
> 
> a) How can I find out which network interconnect protocol the MPI
> application is actually using?
> 
> I pass "--mca btl openib,self,sm,tcp" to mpirun, but I want to make
> sure it really uses the InfiniBand interconnect.

Why specify tcp if you don't want it used? Just leave that off and it will have 
no choice but to use IB.
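
As a rough sketch of the suggestion above, the snippet below drops "tcp"
from the BTL list given in the original question and prints the resulting
mpirun command line rather than running it. The helper name strip_tcp and
the appfile name are assumptions made for illustration; only the
"--mca btl" list itself comes from the thread.

```shell
#!/bin/sh
# Remove the "tcp" entry from a comma-separated BTL list.
# strip_tcp is a hypothetical helper, not part of Open MPI.
strip_tcp() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -v -x 'tcp' | paste -s -d ',' -
}

# BTL list from the original question, minus tcp.
btl_list=$(strip_tcp "openib,self,sm,tcp")

# Print the command for inspection; "appfile" is a placeholder name.
echo "mpirun --mca btl ${btl_list} --app appfile"
```

With tcp removed from the list, a job that cannot connect over the
remaining transports should abort instead of quietly using Ethernet.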

> 
> b) when I run mpirun, I get the following message:

> It seems orted is not found on the slave nodes. Setting PATH and
> LD_LIBRARY_PATH through mpirun's --prefix, --path, or -x options, so
> that orted and its dynamic libraries are available on the slave
> nodes, does not work as the mpirun manual page describes. The only
> setup that works is setting PATH and LD_LIBRARY_PATH in ~/.bashrc,
> which the slave nodes also read for the login shell. I would rather
> not set PATH and LD_LIBRARY_PATH in ~/.bashrc; I want to pass them
> through mpirun options directly.

It should work with either the --prefix or the -x option, assuming the right 
syntax for the latter.

I take it your default shell is bash, and that you are using the rsh launcher 
(as opposed to something like torque)? Are you launching from your default 
shell, or did you perhaps change shell?

Can you send the actual mpirun command you typed?

> 
> Thanks,
> Yiguang
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problems with Mpi Accept - ORTE_ERROR_LOG

2011-06-28 Thread Ralph Castain
How are you passing the port info between the server and client? You're hitting 
a race condition between the two sides.
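
The thread does not say how the port name reaches the client. Purely as an
illustration of one way to avoid an application-level race, the sketch
below assumes the port string is exchanged through a file on a shared
filesystem: the server publishes it with a write-then-rename, and the
client waits until the file is present and non-empty before reading it.
The file location and both function names are hypothetical, and the MPI
calls themselves are not shown.

```shell
#!/bin/sh
# Hypothetical location for the exchanged port name; not from the thread.
PORT_FILE="${PORT_FILE:-/tmp/mpi_port_name.txt}"

# Server side: write to a temporary file, then rename it into place, so a
# reader never sees a partially written port string.
publish_port() {
    tmp="${PORT_FILE}.tmp.$$"
    printf '%s\n' "$1" > "$tmp" && mv "$tmp" "$PORT_FILE"
}

# Client side: poll once per second, up to $1 attempts, until the file
# exists and is non-empty, then print its contents.
wait_for_port() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        if [ -s "$PORT_FILE" ]; then
            cat "$PORT_FILE"
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

In this scheme the server would call publish_port only after it has opened
the port, and the client would call MPI_Comm_connect only after
wait_for_port succeeds.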

On Jun 27, 2011, at 9:29 AM, Rodrigo Oliveira wrote:

> Hi there.
> I am developing a server/client application using Open MPI 1.5.3. At one 
> point in the server code I open a port to receive connections from a 
> client. After that, I call MPI_Comm_accept, and on the client side I call 
> MPI_Comm_connect. Sometimes I get an ORTE_ERROR_LOG, as shown below.
> before accept in host hydra9 port name = 
> 4108386304.0;tcp://150.164.3.204:48761;tcp://192.168.63.9:48761+4108386305.0tcp://150.164.3.204:49211;tcp://192.168.63.9:49211:300
>  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> base/grpcomm_base_allgather.c at line 220  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> base/grpcomm_base_modex.c at line 116  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> grpcomm_bad_module.c at line 608   
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file dpm_orte.c at 
> line 379 
> MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code   
> 
> after accept in host hydra9 error code = 17   
> 
> MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code
> The mpi_errno is 17, and I could not find a clear explanation of this 
> error. It occurs sporadically: sometimes the application works, sometimes 
> it does not.
> 
> Any ideas?
> 
> Thanks



Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread yanyg
Thanks, Ralph!

a) Yes, I know I could restrict it to IB with "--mca btl openib", but
I just want to confirm that I am actually using the IB interfaces. I
am looking for an mpirun option that prints the interconnect protocol
actually in use, like --prot for mpirun in MPICH2.

b) Yes, my default shell is bash, but I run a C-shell script from a
bash terminal, and mpirun is invoked inside that C-shell script. I am
using the rsh launcher, exactly as you guessed. I tried different
mpirun commands in the C-shell script; one of them is

/path/to/bin/mpirun --mca btl openib --app appfile

mpirun and orted are under /path/to/bin, and the necessary libraries
are under /path/to/lib. I tried -x, --prefix, and --path; none of
them propagates PATH and LD_LIBRARY_PATH as expected, since orted is
not found on the slave nodes, although it should be, since it is on
the shared NFS partition.

Thanks,
Yiguang




Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread Ralph Castain

On Jun 28, 2011, at 3:52 PM, ya...@adina.com wrote:

> Thanks, Ralph!
> 
> a) Yes, I know I could restrict it to IB with "--mca btl openib", but
> I just want to confirm that I am actually using the IB interfaces. I
> am looking for an mpirun option that prints the interconnect protocol
> actually in use, like --prot for mpirun in MPICH2.

Afraid it doesn't exist - OMPI will -only- use the specified interfaces and 
will abort if it can't connect processes across at least one of them.

> 
> b) Yes, my default shell is bash, but I run a C-shell script from a
> bash terminal, and mpirun is invoked inside that C-shell script. I am
> using the rsh launcher, exactly as you guessed. I tried different
> mpirun commands in the C-shell script; one of them is
> 
> /path/to/bin/mpirun --mca btl openib --app appfile
> 
> mpirun and orted are under /path/to/bin, and the necessary libraries
> are under /path/to/lib. I tried -x, --prefix, and --path; none of
> them propagates PATH and LD_LIBRARY_PATH as expected, since orted is
> not found on the slave nodes, although it should be, since it is on
> the shared NFS partition.


I suspect the code is getting confused by the different shells. I've seen other 
reports of this, and have observed it myself - suggest you avoid using the 
c-shell and launch from your default shell. I know that works.


> 
> Thanks,
> Yiguang
> 
> 




Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread Ralph Castain

On Jun 28, 2011, at 3:52 PM, ya...@adina.com wrote:

> Thanks, Ralph!
> 
> a) Yes, I know I could restrict it to IB with "--mca btl openib", but
> I just want to confirm that I am actually using the IB interfaces. I
> am looking for an mpirun option that prints the interconnect protocol
> actually in use, like --prot for mpirun in MPICH2.
> 
> b) Yes, my default shell is bash, but I run a C-shell script from a
> bash terminal, and mpirun is invoked inside that C-shell script. I am
> using the rsh launcher, exactly as you guessed. I tried different
> mpirun commands in the C-shell script; one of them is
> 
> /path/to/bin/mpirun --mca btl openib --app appfile
> 
> mpirun and orted are under /path/to/bin, and the necessary libraries
> are under /path/to/lib. I tried -x, --prefix, and --path; none of
> them propagates PATH and LD_LIBRARY_PATH as expected, since orted is
> not found on the slave nodes, although it should be, since it is on
> the shared NFS partition.
> 

I looked a little deeper into this. I keep forgetting that we changed our 
default settings a few years ago. In the dim past, OMPI would always probe the 
remote node to find out which shell it was using, and then use the proper 
command syntax for that shell. However, people complained about the extra time 
during launch, and very few people actually used mismatched shells.

So we flipped the default: OMPI now assumes the remote shell is the same as 
the local one. For those like yourself who actually do have a mismatch, we 
left a parameter you can set to override that assumption. Just add 
"-mca plm_rsh_assume_same_shell 0" to your mpirun cmd line and it should 
resolve the problem.
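
A minimal sketch of what that could look like, taking the mpirun command
quoted earlier in the thread and adding the parameter. The command is only
printed, not executed; /path/to and appfile are the placeholders from the
thread, and the OMPI_PREFIX and APPFILE variable names and the
build_mpirun_cmd function are hypothetical.

```shell
#!/bin/sh
# Placeholder install prefix and appfile name, as used in the thread.
OMPI_PREFIX="${OMPI_PREFIX:-/path/to}"
APPFILE="${APPFILE:-appfile}"

# Print the mpirun command line with the shell-probing override added.
build_mpirun_cmd() {
    echo "${OMPI_PREFIX}/bin/mpirun" \
        "-mca plm_rsh_assume_same_shell 0" \
        "--mca btl openib" \
        "--app ${APPFILE}"
}

# Show the command for inspection before using it in a launch script.
build_mpirun_cmd
```

Printing the command first makes it easy to check the exact syntax when
mpirun is invoked from a script written for a different shell.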



> Thanks,
> Yiguang
> 
> 




Re: [OMPI users] Problems with Mpi Accept - ORTE_ERROR_LOG

2011-06-28 Thread Ralph Castain
Looking deeper, I believe we may have a race condition in the code. Sadly, that 
error message is actually irrelevant, but causes the code to abort.

It can be triggered by race conditions in the app as well, but ultimately is 
something we need to clean up.


On Jun 27, 2011, at 9:29 AM, Rodrigo Oliveira wrote:

> Hi there.
> I am developing a server/client application using Open MPI 1.5.3. At one 
> point in the server code I open a port to receive connections from a 
> client. After that, I call MPI_Comm_accept, and on the client side I call 
> MPI_Comm_connect. Sometimes I get an ORTE_ERROR_LOG, as shown below.
> before accept in host hydra9 port name = 
> 4108386304.0;tcp://150.164.3.204:48761;tcp://192.168.63.9:48761+4108386305.0tcp://150.164.3.204:49211;tcp://192.168.63.9:49211:300
>  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> base/grpcomm_base_allgather.c at line 220  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> base/grpcomm_base_modex.c at line 116  
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file 
> grpcomm_bad_module.c at line 608   
> [hydra9:11199] [[62689,1],0] ORTE_ERROR_LOG: Not found in file dpm_orte.c at 
> line 379 
> MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code   
> 
> after accept in host hydra9 error code = 17   
> 
> MPI 2 C++ exception throwing is disabled, MPI::mpi_errno has the error code
> The mpi_errno is 17, and I could not find a clear explanation of this 
> error. It occurs sporadically: sometimes the application works, sometimes 
> it does not.
> 
> Any ideas?
> 
> Thanks