[OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-03-21 Thread yanyg
Hi,

I am trying to compile our codes with Open MPI 1.4.3, using Intel 
compilers 8.1.

(1) For open mpi 1.4.3 installation on linux beowulf cluster, I use:

./configure --prefix=/home/yiguang/dmp-setup/openmpi-1.4.3 \
    CC=icc CXX=icpc F77=ifort FC=ifort --enable-static \
    LDFLAGS="-i-static -static-libcxa" \
    --with-wrapper-ldflags="-i-static -static-libcxa" 2>&1 | tee config.log

and 

make all install 2>&1 | tee install.log

The issue is that I am trying to build Open MPI 1.4.3 with the Intel 
compiler libraries statically linked into it, so that when we run 
mpirun/orterun it does not need to dynamically load any Intel 
libraries. What I get instead is that mpirun always asks for some Intel 
library (e.g. libsvml.so) unless I put the Intel library path on the 
library search path ($LD_LIBRARY_PATH). I checked the Open MPI user 
archive; it seems one kind user mentioned using "-i-static" (in my case) 
or "-static-intel" in LDFLAGS, which is what I did, but it does not seem 
to work, and I could not find any confirmation in the archive that it 
works for anyone else. Could anyone help me with this? Thanks!
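
For reference, this is how I check which Intel runtime libraries mpirun 
still picks up dynamically (the install prefix is the one from the 
configure line above; the grep pattern is just a guess at the Intel 
library names):

ldd /home/yiguang/dmp-setup/openmpi-1.4.3/bin/mpirun | grep -E 'svml|imf|intel'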

(2) After compiling and linking our in-house codes with Open MPI 
1.4.3, we want to package a minimal set of executables for our codes, 
together with a few files from the Open MPI 1.4.3 installation, without 
any dependence on external settings such as environment variables.

I organize my directory as follows:

parent/
 |-- package/
 |-- bin/
 |-- lib/
 `-- tools/

The package/ directory holds the executables built from our codes. 
bin/ has mpirun and orted, copied from the Open MPI installation. lib/ 
contains the Open MPI libraries and the Intel libraries. tools/ contains 
some C-shell scripts that launch MPI jobs using the mpirun in bin/.

The parent/ directory is on an NFS share mounted by all nodes of the 
cluster. In ~/.bashrc (also shared by all nodes), I clear PATH and 
LD_LIBRARY_PATH so that they do not point to any directory of the 
Open MPI 1.4.3 installation.

First, if I add the bin/ directory above to PATH and lib/ to 
LD_LIBRARY_PATH in ~/.bashrc, our parallel codes (started by the 
C-shell script in tools/) run AS EXPECTED without any problem, so 
I know the rest of the setup is correct.

Then, to avoid modifying ~/.bashrc or ~/.profile, I instead set bin/ on 
PATH and lib/ on LD_LIBRARY_PATH in the C-shell script under the 
tools/ directory, as:

setenv PATH /path/to/bin:$PATH
setenv LD_LIBRARY_PATH /path/to/lib:$LD_LIBRARY_PATH

When I then start our codes from the C-shell script in tools/, I get 
the message "orted: command not found", which comes from the slave 
nodes, even though orted is in /path/to/bin. So I guess the $PATH 
variable, or more generally the environment variables set in the script, 
are not passed to the slave nodes by mpirun (I use the absolute path to 
mpirun in the script). After checking the Open MPI FAQ, I tried adding 
"--prefix /path/to/parent" to the mpirun command in the C-shell 
script; it still does not work. Does anyone have any hints? Thanks!
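
For completeness, the relevant part of the launch script now looks 
roughly like this (the host file, process count, and solver name are 
placeholders):

setenv PATH /path/to/bin:$PATH
setenv LD_LIBRARY_PATH /path/to/lib:$LD_LIBRARY_PATH
/path/to/bin/mpirun --prefix /path/to/parent -np 4 --hostfile hosts /path/to/package/solver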

I have tried my best to describe the issues; if anything is unclear, 
please let me know. Thanks a lot for your help!

Sincerely,
Yiguang



Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3 (Tim Prince)

2011-03-22 Thread yanyg


Thank you very much for the comments and hints. I will try to 
upgrade our Intel compiler collection. As for my second issue: with 
Open MPI, is there any way to propagate the environment variables 
of the current process on the master node to the slave nodes, 
so that the orted daemon can run on the slave nodes too?
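
For what it's worth, what I have in mind is something like mpirun's -x 
option (the host file and executable below are placeholders):

mpirun -x PATH -x LD_LIBRARY_PATH -np 4 --hostfile hosts /path/to/package/solver

though it is not clear to me whether variables exported this way take 
effect early enough for orted itself on the remote node.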

Thanks,
Yiguang

> On 3/21/2011 5:21 AM, ya...@adina.com wrote:
> 
> > I am trying to compile our codes with open mpi 1.4.3, by intel
> > compilers 8.1.
> >
> > (1) For open mpi 1.4.3 installation on linux beowulf cluster, I use:
> >
> > ./configure --prefix=/home/yiguang/dmp-setup/openmpi-1.4.3
> > CC=icc
> > CXX=icpc F77=ifort FC=ifort --enable-static LDFLAGS="-i-static -
> > static-libcxa" --with-wrapper-ldflags="-i-static -static-libcxa"
> > 2>&1 | tee config.log
> >
> > and
> >
> > make all install 2>&1 | tee install.log
> >
> > The issue is that I am trying to build open mpi 1.4.3 with intel
> > compiler libraries statically linked to it, so that when we run
> > mpirun/orterun, it does not need to dynamically load any intel
> > libraries. But what I got is mpirun always asks for some intel
> > library(e.g. libsvml.so) if I do not put intel library path on
> > library search path($LD_LIBRARY_PATH). I checked the open mpi user
> > archive, it seems only some kind user mentioned to use
> > "-i-static"(in my case) or "-static-intel" in ldflags, this is what
> > I did, but it seems not working, and I did not get any confirmation
> > whether or not this works for anyone else from the user archive.
> > could anyone help me on this? thanks!
> >
> 
> If you are to use such an ancient compiler (apparently a 32-bit one),
> you must read the docs which come with it, rather than relying on
> comments about a more recent version.  libsvml isn't included
> automatically at link time by that 32-bit compiler, unless you specify
> an SSE option, such as -xW. It's likely that no one has verified
> OpenMPI with a compiler of that vintage.  We never used the 32-bit
> compiler for MPI, and we encountered run-time library bugs for the
> ifort x86_64 which weren't fixed until later versions.
> 
> 
> -- 
> Tim Prince
> 
> 
> --



Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-03-24 Thread yanyg
Thanks for your information. In my Open MPI installation the 
executables such as mpirun and orted do depend on the dynamic Intel 
libraries; when I run ldd on mpirun, several of them show up. I am 
trying to get these Open MPI executables statically linked against the 
Intel libraries, but I have made no progress, even when I use 
"--with-gnu-ld" and put the specific static Intel libraries in LIBS when 
configuring Open MPI 1.4.3. It seems there is something in the Open MPI 
build process that I do not control, or I am just missing something. I 
will try different things and will report back once I have a conclusive 
answer. In the meantime, any hints on how to statically link the Open 
MPI executables against the Intel libraries with the Intel compilers are 
very welcome. Thanks!

As for the issue that environment variables set in a script do not 
propagate to the remote slave nodes: I use an rsh connection for 
simplicity. If I set PATH and LD_LIBRARY_PATH in ~/.bashrc 
(which is shared by all nodes, master and slave), my MPI application 
works as expected, which confirms Ralph's suggestion. 
The thing is that I want to avoid setting the environment variables in 
.bashrc or .profile and instead set them in the script, and have them 
propagate to the slave nodes when I run mpirun, as I can do with 
MPICH. I also tried using the prefix path before mpirun, as suggested 
by Jeff; that does not work either. Any hints on how to solve this?

Thanks,
Yiguang


On 23 Mar 2011, at 12:00, users-requ...@open-mpi.org wrote:

> On Mar 21, 2011, at 8:21 AM, ya...@adina.com wrote:
> 
> > The issue is that I am trying to build open mpi 1.4.3 with intel
> > compiler libraries statically linked to it, so that when we run
> > mpirun/orterun, it does not need to dynamically load any intel
> > libraries. But what I got is mpirun always asks for some intel
> > library(e.g. libsvml.so) if I do not put intel library path on
> > library search path($LD_LIBRARY_PATH). I checked the open mpi user
> > archive, it seems only some kind user mentioned to use
> > "-i-static"(in my case) or "-static-intel" in ldflags, this is what
> > I did, but it seems not working, and I did not get any confirmation
> > whether or not this works for anyone else from the user archive.
> > could anyone help me on this? thanks!
> 
> Is it Open MPI's executables that require the intel shared libraries
> at run time, or your application?  Keep in mind the difference:
> 
> 1. Compile/link flags that you specify to OMPI's configure script are
> used to compile/link Open MPI itself (including executables such as
> mpirun).
> 
> 2. mpicc (and friends) use a similar-but-different set of flags to
> compile and link MPI applications.  Specifically, we try to use the
> minimal set of flags necessary to compile/link, and let the user
> choose to add more flags if they want to.  See this FAQ entry for more
> details:
> 
> http://www.open-mpi.org/faq/?category=mpi-apps#override-wrappers-after-v1.0
> 
> > (2) After compiling and linking our in-house codes  with open mpi
> > 1.4.3, we want to make a minimal list of executables for our codes
> > with some from open mpi 1.4.3 installation, without any dependent on
> > external setting such as environment variables, etc.
> > 
> > I orgnize my directory as follows:
> > 
> > parent/
> >  |-- package/
> >  |-- bin/
> >  |-- lib/
> >  `-- tools/
> > 
> > In package/ directory are executables from our codes. bin/ has
> > mpirun and orted, copied from openmpi installation. lib/ includes
> > open mpi libraries, and intel libraries. tools/ includes some
> > c-shell scripts to launch mpi jobs, which uses mpirun in bin/.
> 
> FWIW, you can use the following OMPI options to configure to eliminate
> all the OMPI plugins (i.e., locate all that code up in libmpi and
> friends, vs. being standalone-DSOs):
> 
> --disable-shared --enable-static
> 
> This will make libmpi.a (vs. libmpi.so and a bunch of plugins) which
> your application can statically link against.  But it does make a
> larger executable.  Alternatively, you can:
> 
> --disable-dlopen
> 
> (instead of disable-shared/enable-static) which will make a giant
> libmpi.so (vs. libmpi.so and all the plugin DSOs).  So your MPI app
> will still dynamically link against libmpi, but all the plugins will
> be physically located in libmpi.so vs. being dlopen'ed at run time.
> 
> > The parent/ directory is on a NFS shared by all nodes of the 
> > cluster. In ~/.bashrc(shared by all nodes too), I clear PATH and
> > LD_LIBRARY_PATH without direct to any directory of open mpi 1.4.3
> > installation. 
> > 
> > First, if I set above bin/ directory  to PATH and lib/ 
> > LD_LIBRARY_PATH in ~/.bashrc, our parallel codes(starting by the C
> > shell script in tools/) run AS EXPECTED without any problem [...]

Re: [OMPI users] intel compiler linking issue and issue of environment variable on remote node, with open mpi 1.4.3

2011-04-22 Thread yanyg
Open MPI 1.4.3 + Intel compilers V8.1 summary:
(in case someone would like to refer to it later)

(1) To make all Open MPI executables statically linked and 
independent of any dynamic libraries, the "--disable-shared" and 
"--enable-static" options should BOTH be passed to configure, and 
the "-i-static" option should be specified for the Intel compilers 
as well.
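
A configure invocation along these lines is what this comes down to 
(prefix and compilers as in my earlier posts; treat it as a sketch 
rather than the exact command I ran):

./configure --prefix=/home/yiguang/dmp-setup/openmpi-1.4.3 \
    CC=icc CXX=icpc F77=ifort FC=ifort \
    --disable-shared --enable-static \
    LDFLAGS="-i-static" --with-wrapper-ldflags="-i-static" \
    2>&1 | tee config.log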

(2) It is confirmed that environment variables such as $PATH and 
$LD_LIBRARY_PATH can be forwarded to the slave nodes by specifying 
options to mpirun. However, mpirun first invokes the orted daemon on 
the master and slave nodes, and the environment variables passed to 
the slave nodes via mpirun options do not take effect before orted has 
started. So if the orted daemon itself needs these environment 
variables to run, the only way is to set them in a shared .bashrc or 
.profile file visible to both master and slave nodes, say on a shared 
NFS partition. There seems to be no other way to resolve this kind 
of dependence.
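
Concretely (all paths are placeholders): forwarding variables to the 
application processes looks like

/path/to/bin/mpirun -x PATH -x LD_LIBRARY_PATH --app appfile

while the settings that orted itself needs have to live in the shared 
~/.bashrc on every node:

# shared ~/.bashrc, visible to master and slave nodes
export PATH=/path/to/bin:$PATH
export LD_LIBRARY_PATH=/path/to/lib:$LD_LIBRARY_PATH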

Regards,
Yiguang




[OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread yanyg
Hello All,

I installed Open MPI 1.4.3 on our new HPC blades, with an InfiniBand 
interconnect.

My system environments are as:

1) uname -a output:
Linux gulftown 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 
2010 x86_64 x86_64 x86_64 GNU/Linux

2) /home is mounted over all nodes, and mpirun is started under 
/home/...

Open MPI and the application codes are compiled with Intel(R) 
compilers V11. The InfiniBand stack is Mellanox OFED 1.5.2.

I have two questions about mpirun:

a) How can I find out which network interconnect protocol is actually 
used by the MPI application?

I specify "--mca btl openib,self,sm,tcp" to mpirun, but I want to 
make sure it really uses the InfiniBand interconnect.
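
One thing I can try, if I understand the MCA verbosity parameters 
correctly, is to raise the BTL selection verbosity, e.g.:

mpirun --mca btl openib,self,sm --mca btl_base_verbose 30 -np 4 --hostfile hosts /path/to/app

which should report which BTL components are chosen for each peer; 
dropping tcp from the list would also guarantee it is never used.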

b) When I run mpirun, I get the following message:
== Quote begin
bash: orted: command not found
bash: orted: command not found
bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 15120) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
ibnode001 - daemon did not report back when launched
ibnode002 - daemon did not report back when launched
ibnode003 - daemon did not report back when launched

== Quote end

It seems orted is not found on the slave nodes. If I pass the PATH and 
LD_LIBRARY_PATH through the --prefix, --path, or -x options to mpirun, 
so that orted and the related dynamic libraries can be found on the 
slave nodes, it does not work as the mpirun manual page says it should. 
The only working case is when I set PATH and LD_LIBRARY_PATH in 
~/.bashrc, which is also sourced by the login shells on the slave 
nodes. I do not want to set PATH and LD_LIBRARY_PATH in ~/.bashrc; 
I want to pass options to mpirun directly.

Thanks,
Yiguang



Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-06-28 Thread yanyg
Thanks, Ralph!

a) Yes, I know I could restrict it to IB with "--mca btl openib", but I 
just want to make sure the IB interfaces are being used. I am looking 
for an option to mpirun that prints the actual interconnect protocol, 
like --prot for mpirun in MPICH2.

b) Yes, my default shell is bash, but I run a c-shell script from a 
bash terminal, and mpirun is invoked inside this c-shell script. I am 
using the rsh launcher, exactly as you guessed. I tried different 
mpirun commands in the c-shell script; one of them is

/path/to/bin/mpirun --mca btl openib --app appfile

mpirun and orted are under /path/to/bin, and the necessary libraries 
are under /path/to/lib. I tried the -x, --prefix, and --path options; 
none of them works as expected to propagate PATH and 
LD_LIBRARY_PATH, since orted is not found on the slave nodes, 
although it should be, since it is on the shared NFS partition.

Thanks,
Yiguang


On Jun 28, 2011, at 9:05 AM, yanyg_at_[hidden] wrote:

> Hello All,
>
> I installed Open MPI 1.4.3 on our new HPC blades, with Infiniband
> interconnection.
>
> My system environments are as:
>
> 1)uname -a output:
> Linux gulftown 2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT
> 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> 2) /home is mounted over all nodes, and mpirun is started under
> /home/...
>
> Open MPI and application codes are compiled with intel(R)
> compilers V11. Infiniband stack is Mellanox OFED 1.5.2.
>
> I have two questions about mpirun:
>
> a) how could I get to know what is the network interconnect
> protocol used by the MPI application?
>
> I specify "--mca btl openib,self,sm,tcp" to mpirun, but I want to
> make sure it really uses infiniband interconnect.

Why specify tcp if you don't want it used? Just leave that off and it 
will have no choice but to use IB.



>
> b) when I run mpirun, I get the following message:

> It seems orted is not found on slave nodes. If I set the PATH and
> LD_LIBRARY_PATH through --prefix to mpirun, or --path, or -x
> options to mpirun, to make the orted and related dynamic libs
> available on slave nodes, it does not work as expected from mpirun
> manual page. The only working case is that I set PATH and
> LD_LIBRARY_PATH in ~/.bashrc for mpirun, and this .bashrc is
> invoked by slave nodes too for login shell. I do not want to set PATH
> and LD_LIBRARY_PATH in ~/.bashrc, but instead to set options to
> mpirun directly.

Should work with either prefix or -x options, assuming the right 
syntax with the latter.

I take it your default shell is bash, and that you are using the rsh 
launcher (as opposed to something like torque)? Are you launching 
from your default shell, or did you perhaps change shell?

Can you send the actual mpirun command you typed? 


Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-07-05 Thread yanyg
Thanks, Ralph.
Your information is very detailed and helpful.

I tried your suggestion to set "-mca plm_rsh_assume_same_shell 0", 
but it still does not work. My situation is that we start a c-shell 
script from a bash shell, which in turn invokes mpirun onto the slave 
nodes. These slave nodes have bash as their default login shell, and 
mpirun executes another c-shell script on each node. Could this mess 
things up a bit and be related to the missing-orted message?
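
For reference, the mpirun line I am testing looks roughly like this 
(paths are placeholders):

/path/to/bin/mpirun -mca plm_rsh_assume_same_shell 0 --mca btl openib,self --app appfile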

Thanks again,
Yiguang

On Jun 28, 2011, at 3:52 PM, yanyg_at_[hidden] wrote: 

I looked a little deeper into this. I keep forgetting that we changed 
our default settings a few years ago. In the dim past, OMPI would 
always probe the remote node to find out what shell it was using, 
and then use the proper command syntax for that shell. However, 
people complained about the extra time during launch, and very 
very few people actually used mis-matched shells.

So we changed the setting the other way to default to assuming the 
remote shell is the same as the local one. For those like yourself 
that actually do have a mismatch, we left a parameter you can set 
to override that assumption. Just add "-mca 
plm_rsh_assume_same_shell 0" to your mpirun cmd line and it 
should resolve the problem. 


[OMPI users] MPI_Reduce error over Infiniband or TCP

2011-07-05 Thread yanyg
Dear all,

We are testing Open MPI over InfiniBand, and we get an MPI_Reduce 
error message when we run our codes over either the TCP or the 
InfiniBand interface, as follows:

---
[gulftown:25487] *** An error occurred in MPI_Reduce
[gulftown:25487] *** on communicator MPI COMMUNICATOR 3 CREATE FROM 0
[gulftown:25487] *** MPI_ERR_ARG: invalid argument of some other kind
[gulftown:25487] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

Elapsed time: 6:33.78

--
mpirun has exited due to process rank 0 with PID 25428 on
node gulftown exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

---

Any hints?

Thanks,
Yiguang


Re: [OMPI users] mpirun does not propagate environment from master node to slave nodes

2011-07-08 Thread yanyg
Thanks, Ralph.

*** quote begin *
Let me get this straight. You are executing mpirun from inside a c-
shell script, launching onto nodes where you will by default be 
running bash. The param I gave you should support that mode - it 
basically tells OMPI to probe the remote node to discover what 
shell it will run under there, and then formats the orted cmd line 
accordingly. If that isn't working (and it almost never gets used, so 
may have bit-rotted), then your only option is to convert the c-shell 
to bash.

However, you are saying that the app you are asking us to run is a 
c-shell script??? Have you included the #!/bin/csh directive in the top 
of that file so the system will automatically exec it using csh?

Note that the orted comes alive and running prior to your "app" 
being executed, so the fact that your "app" is a c-shell script is 
irrelevant. 
*** quote end *

You described exactly my case, and I agree that the app being a 
c-shell script should not matter here. I checked that I do have 
#!/bin/csh at the head of the c-shell scripts. I guess I will have to 
rewrite the c-shell script in bash to solve this issue completely, 
although that is not easy.

Thanks again,
Yiguang




[OMPI users] Error-Open MPI over Infiniband: polling LP CQ with status LOCAL LENGTH ERROR

2011-07-08 Thread yanyg
Hi all,

The message says :

[[17549,1],0][btl_openib_component.c:3224:handle_wc] from 
gulftown to: gulftown error polling LP CQ with status LOCAL 
LENGTH ERROR status number 1 for wr_id 492359816 opcode 
32767  vendor error 105 qp_idx 3

This is very arcane to me. The same test runs when there is only one 
MPI process on each node, but when we switch to two MPI processes per 
node, this error message comes up. Is there anything I can do? Is it 
related to the InfiniBand configuration, as guessed from the string 
"vendor error 105 qp_idx 3"?

Thanks,
Yiguang



Re: [OMPI users] Error-Open MPI over Infiniband: polling LP CQ with status LOCAL LENGTH ERROR

2011-07-11 Thread yanyg
Hi Yevgeny,

Thanks.

Here is the output of /usr/bin/ibv_devinfo:


hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.8.000
node_guid:  0002:c903:0010:a85a
sys_image_guid: 0002:c903:0010:a85d
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   HP_016009
phys_port_cnt:  2
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid:   1
port_lmc:   0x00
link_layer: IB

port:   2
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid:   6
port_lmc:   0x00
link_layer: IB


Each node has an HCA card with two active ports. The network 
controller is an MT26428, at 09:00.0.

I am running Open MPI 1.4.3; the command line is:

/path/to/mpirun -mca btl_openib_warn_default_gid_prefix 0 --mca btl openib,self -app appfile

Thanks again,
Yiguang


On 10 Jul 2011, at 9:55, Yevgeny Kliteynik wrote:

> Hi Yiguang,
> 
> On 08-Jul-11 4:38 PM, ya...@adina.com wrote:
> > Hi all,
> > 
> > The message says :
> > 
> > [[17549,1],0][btl_openib_component.c:3224:handle_wc] from
> > gulftown to: gulftown error polling LP CQ with status LOCAL
> > LENGTH ERROR status number 1 for wr_id 492359816 opcode
> > 32767  vendor error 105 qp_idx 3
> > 
> > This is very arcane to me, the same test ran when only one MPI
> > process on each node, but when we switch to two MPI processes
> > on each node, then this error message comes up. Anything I could do?
> > Anything related to infiniband configuration, as guessed form the
> > string "vendor error 105 qp_idx 3"?
> 
> What OMPI version are you using and what kind of HCAs do you have? You
> can get details about HCA with "ibv_devinfo" command. Also, can you
> post here all the OMPI command line parameters that you use when you
> run your test?
> 
> Thanks.
> 
> -- YK
> 
> > Thanks,
> > Yiguang
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> 




[OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-09 Thread yanyg
Hi all,

Good morning!

I am having trouble communicating through the sm btl in Open MPI; 
please check the attached file for my system information. I am using 
Open MPI 1.4.3 and Intel compilers V11.1, on Linux RHEL 5.4 with 
kernel 2.6.

The tests are the following: 

(1) If I specify the btls to mpirun with "--mca btl self,sm,openib" 
and do not list any of my computing nodes more than once in the node 
list, my job runs fine. However, if I list any of the computing nodes 
twice or more in the node list, it hangs there forever.

(2) If I leave the sm btl out, i.e. "--mca btl self,openib", my job 
runs smoothly whether or not any computing node appears more than 
once in the node list. (See the concrete commands below.)
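
To make the two cases concrete (host names and executable are 
placeholders), the hanging case looks like

mpirun --mca btl self,sm,openib --host node01,node01,node02 ./a.out

while the same command with "--mca btl self,openib", or with each 
host listed only once, runs fine.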

From the above two tests, apparently something is wrong with the sm 
btl on my system. From the user archive I see that sm btl issues have 
been reported with comm_spawned parent/child processes, but that does 
not seem to be the case here: even if I do not use any of my MPI-based 
solvers and only call the MPI initialization and finalization 
procedures, the issue still occurs.

Any comments?

Thanks,
Yiguang


    File information ---
 File:  ompiinfo-config-uname-output.tgz
 Date:  9 Feb 2012, 8:58
 Size:  126316 bytes.
 Type:  Unknown




Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-13 Thread yanyg
Hi Jeff,

Thank you very much for your help!

I tried to run the ring_c test from the standard examples in the 
Open MPI 1.4.3 distribution. If I run it as you described, from the 
command line, it works without any problem with the sm btl included 
(--mca btl self,sm,openib). However, if I use the sm btl 
(--mca btl self,sm,openib) and run ring_c from an in-house script, it 
shows the same issue I described in my previous email: it hangs in the 
MPI_Init(...) call. I think this issue is related to some environment 
setting in the script. Do you have any hints, or any prerequisites on 
the system environment configuration for the sm btl layer in Open MPI 
to work?
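
In case it helps, this is how I am comparing the two environments 
(run once from the interactive shell and once from inside the in-house 
script; the file names are arbitrary):

env | sort > env.interactive
env | sort > env.script
diff env.interactive env.script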

Thanks again,
Yiguang



Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-14 Thread yanyg
Hi Jeff,

The command "env | grep OMPI" output nothing but a blank line 
from my script. Anything I should set for mpirun?

On the other hand, you may get reminded that I found you 
discussed some similar issue with Jonathan Dursi. The difference 
is that when I tried with --mca btl_sm_num_fifos #(np-1), it does 
not work with me, and I did find those files in the tmp directory that 
sm mmaped in(shared_mem_pool.ibnode001, etc), but for some 
mysterious reason, it hang at MPI_Init, so these files are created 
when we call MPI_Init?

Thanks,
Yiguang



Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-14 Thread yanyg
Hi Ralph,

Could you please tell me which OMPI envars are broken, or which 
OMPI envars should be present for OMPI to work properly?

Although I start my c-shell script from a bash command line (not 
sure if this matters), I only add the Open MPI executable and library 
paths to $PATH and $LD_LIBRARY_PATH; as far as I can tell, no other 
OMPI environment variables are set on my system (in bash or csh).

Thanks,
Yiguang



Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-14 Thread yanyg
Yes, in short, I start a c-shell script from the bash command line, in 
which I mpirun another c-shell script that starts the computing 
processes. The only OMPI-related envars I set are PATH and 
LD_LIBRARY_PATH. Are there any other OMPI envars I should set?


Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-15 Thread yanyg
> No, there are no others you need to set. Ralph's referring to the fact
> that we set OMPI environment variables in the processes that are
> started on the remote nodes.
> 
> I was asking to ensure you hadn't set any MCA parameters in the
> environment that could be creating a problem. Do you have any set in
> files, perchance?
> 
> And can you run "env | grep OMPI" from the script that you invoked via
> mpirun?
> 
> So just to be clear on the exact problem you're seeing:
> 
> - you mpirun on a single node and all works fine
> - you mpirun on multiple nodes and all works fine (e.g., mpirun --host
>   a,b,c your_executable)
> - you mpirun on multiple nodes and list a host more than once and it
>   hangs (e.g., mpirun --host a,a,b,c your_executable)
> 
> Is that correct?
> 
> If so, can you attach a debugger to one of the hung processes and see
> exactly where it's hung? (i.e., get the stack traces)
> 
> Per a question from your prior mail: yes, Open MPI does create mmapped
> files in /tmp for use with shared memory communication. They *should*
> get cleaned up when you exit, however, unless something disastrous
> happens. 

Thank you very much!

Now I am clearer about what Ralph asked.

Yes, what you described is exactly what I see with the sm btl layer. 
As I double-checked again, the problem is that when I use the sm btl 
for MPI communication on the same host (--mca btl openib,sm,self), the 
issues come up as you described: everything runs well on a single 
node, everything runs well on multiple distinct nodes, but it hangs in 
the MPI_Init() call if I run on multiple nodes and list a host more 
than once. However, if I instead use the tcp or openib btl without the 
sm layer (--mca btl openib,self), all three cases run just fine.

I do set the MCA parameters "plm_rsh_agent" to "rsh:ssh" and 
"btl_openib_warn_default_gid_prefix" to 0 in all cases, with or 
without the sm btl layer. The OMPI environment variables set for each 
process are quoted below (as output by "env | grep OMPI" in my 
script invoked via mpirun):

--
//process #0:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0

//process #1:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_local_daemon_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=1
OMPI_UNIVERSE_SIZE=4
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=1
OMPI_COMM_WORLD_RANK=1
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #3:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=3
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=3
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=3
OMPI_COMM_WORLD_LOCAL_RANK=1

//process #2:

OMPI_MCA_plm_rsh_agent=rsh:ssh
OMPI_MCA_btl_openib_warn_default_gid_prefix=0
OMPI_MCA_btl=openib,sm,self
OMPI_MCA_orte_precondition_transports=3a07553f5dca58b5-21784eac1fc85294
OMPI_MCA_orte_daemonize=1
OMPI_MCA_orte_hnp_uri=195559424.0;tcp://198.177.146.70:53997;tcp://10.10.10.4:53997;tcp://172.23.10.1:53997;tcp://172.33.10.1:53997
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=195559425
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=195559424.1;tcp://198.177.146.71:53290;tcp://10.10.10.1:53290;tcp://172.23.10.2:53290;tcp://172.33.10.2:53290
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_

Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-15 Thread yanyg

> So the real issue is: the sm BTL is not working for you.
>

Yes.

> What version of Open MPI are you using?
> 

It is 1.4.3 I am using.

> Can you rm -rf any Open MPI directories that may be left over in /tmp?

Yes, I have tried that. The cleanup does not make the sm btl work.




Re: [OMPI users] help: sm btl does not work when I specify the same host twice or more in the node list

2012-02-16 Thread yanyg

OK, with Jeff's kind help, I solved this issue in a very simple way. 
Now I would like to report back the cause of the issue and the 
solution.

(1) The scenario under which this issue happened:

In my OMPI environment, the $TMPDIR envar is set to a different 
scratch directory for each MPI process, even for MPI processes running 
on the same host. This is not troublesome if we use the openib, self, 
or tcp btl layers for communication. However, if we use the sm btl 
layer then, as Jeff said:

"""
Open MPI creates its shared memory files in $TMPDIR. It implicitly 
expects all shared memory files to be found under the same 
$TMPDIR for all procs on a single machine.  

More specifically, Open MPI creates what we call a "session 
directory" under $TMPDIR that is an implicit rendezvous point for all 
processes on the same machine.  Some meta data is put in there, 
to include the shared memory mmap files.

So if the different processes have a different idea of where the
rendezvous session directory exists, they'll end up blocking waiting 
for others to show up at their (individual) rendezvous points... but 
that will never happen, because each process is waiting at their 
own rendezvous point.

"""

So in this case, the MPI processes that share data through shared 
memory block and wait on each other, and that wait is never satisfied; 
hence the hang at the MPI_Init call.

(2) Solution to this issue:

You may set $TMPDIR to the same directory for all processes on the 
same host, if possible; or you may setenv OMPI_PREFIX_ENV to a common 
directory for the MPI processes on the same host while keeping your 
$TMPDIR setting. Either way is verified and works fine for me!
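
In my launch script (csh) it comes down to one of the following, where 
the directory names are just placeholders:

# option 1: same scratch directory for every rank on a host
setenv TMPDIR /tmp/ompi-scratch

# option 2: keep the per-rank $TMPDIR, but point Open MPI at a common directory
setenv OMPI_PREFIX_ENV /tmp/ompi-session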

Thanks,
Yiguang



[OMPI users] orted daemon no found! --- environment not passed to slave nodes?

2012-02-27 Thread yanyg
Greetings!

I have tried to run the ring_c example test from a bash script. In 
this bash script, I set up PATH and LD_LIBRARY_PATH (I do not want to 
disturb ~/.bashrc, etc.), then invoke the MPI processes with the full 
path to mpirun; both mpirun and orted are on that PATH. However, 
judging from the Open MPI message, orted was not found, and to me it 
was not found only on the slave nodes. I then tried "--prefix" and 
"-x PATH -x LD_LIBRARY_PATH", hoping these envars would be passed to 
the slave nodes, but it turns out they are not forwarded.

On the other hand, if I set the same PATH and LD_LIBRARY_PATH in the 
~/.bashrc shared by all nodes, mpirun from the bash script runs fine 
and orted is found. This is easy to understand, but I really do not 
want to change ~/.bashrc.

It seems the non-interactive bash shell does not pass envars to 
slave nodes. 

Any comments and solutions?

Thanks,
Yiguang