[OMPI users] Fault Tolerant Method

2006-07-28 Thread bdickinson
I have implemented the fault tolerance method in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance (as 
described by William Gropp and Ewing Lusk in their 2004 paper),
but am having some problems getting it to work the way I intended.
Basically, when it runs, it is spawning all the processes on the
same machine (as it always starts at the top of the machine_list
when spawning a process).  Is there a way that I can get these
processes to spawn on different machines?

One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file.  Will
this work?  Is this the best route to go?

Thanks for any help with this.

Byron



Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Josh Hursey
> I have implemented the fault tolerance method in which you would use
> MPI_COMM_SPAWN to dynamically create communication groups and use
> those communicators for a form of process fault tolerance (as
> described by William Gropp and Ewing Lusk in their 2004 paper),
> but am having some problems getting it to work the way I intended.
> Basically, when it runs, it is spawning all the processes on the
> same machine (as it always starts at the top of the machine_list
> when spawning a process).  Is there a way that I can get these
> processes to spawn on different machines?
>

In Open MPI (and most other MPI implementations) you will be restricted to
using only the machines in your allocation when you use MPI_Comm_spawn*.
The standard allows you to suggest to MPI_Comm_spawn where to place the
'children' that it creates using an MPI_Info key -- specifically the
{host} key/value referenced here:
http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
MPI_Info is described here:
http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53

Open MPI, in the current release, does not do anything with this key.
This has been fixed in subversion (as of r11039) and will be in the next
release of Open MPI.

If you want to use this functionality in the near term I would suggest
using the nightly build of the subversion trunk available here:
http://www.open-mpi.org/nightly/trunk/
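
For illustration, here is a minimal sketch of that usage (my own example, not
from this thread; the host name "node02" and the child program "./worker" are
made-up placeholders):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        MPI_Info info;
        int errcode;

        MPI_Init(&argc, &argv);

        /* Suggest a placement for the child via the {host} key.
           "node02" is a hypothetical machine from the allocation. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "node02");

        /* Spawn one copy of a hypothetical child program "./worker".
           The child side would call MPI_Comm_get_parent() and MPI_Finalize(). */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_WORLD, &children, &errcode);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

As noted above, Open MPI releases prior to the r11039 fix ignore this key, so
it only takes effect with the trunk or the upcoming release.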


> One possible route I considered was using something like SLURM to
> distribute the jobs, and just putting '+' in the machine file.  Will
> this work?  Is this the best route to go?

Off the top of my head, I'm not sure whether that would work.  The
best/cleanest route would be to use an MPI_Info object with the {host}
key.

Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
this scenario.

Hope that helps,
Josh

>
> Thanks for any help with this.
>
> Byron
>



Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Edgar Gabriel
Don't forget, furthermore, that to use this fault-tolerance approach 
successfully, the parent and the other child processes must not be affected 
by the death/failure of any one child process. Right now in Open MPI, if one 
of the child processes (which you spawned using MPI_Comm_spawn) fails, the 
whole application will fail. [To be more precise: the MPI standard does not 
mandate the behavior described in the paper you mentioned.]
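
As a sketch of what the parent side might look like once an implementation
does tolerate child failures (this is an illustration and an assumption, not
something current Open MPI supports), the parent would at least switch the
spawn intercommunicator away from the default MPI_ERRORS_ARE_FATAL handler so
that failures are reported as error codes rather than aborts:

    #include <mpi.h>

    /* Hedged sketch: let the parent observe errors on the intercommunicator
       returned by MPI_Comm_spawn instead of aborting.  As noted above,
       current Open MPI still takes the whole job down if a spawned child
       dies, so this alone does not provide fault tolerance. */
    void watch_children(MPI_Comm children)
    {
        int token = 0;

        /* The default handler is MPI_ERRORS_ARE_FATAL; ask for return
           codes instead. */
        MPI_Comm_set_errhandler(children, MPI_ERRORS_RETURN);

        if (MPI_Send(&token, 1, MPI_INT, 0, 0, children) != MPI_SUCCESS) {
            /* React here: retire this communicator, respawn workers, etc. */
            MPI_Comm_free(&children);
        }
    }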


Thanks
Edgar

Josh Hursey wrote:

I have implemented the fault tolerance method in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but am having some problems getting it to work the way I intended.
Basically, when it runs, it is spawning all the processes on the
same machine (as it always starts at the top of the machine_list
when spawning a process).  Is there a way that I can get these
processes to spawn on different machines?



In Open MPI (and most other MPI implementations) you will be restricted to
using only the machines in your allocation when you use MPI_Comm_spawn*.
The standard allows you to suggest to MPI_Comm_spawn where to place the
'children' that it creates using an MPI_Info key -- specifically the
{host} key/value referenced here:
http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
MPI_Info is described here:
http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53

Open MPI, in the current release, does not do anything with this key.
This has been fixed in subversion (as of r11039) and will be in the next
release of Open MPI.

If you want to use this functionality in the near term I would suggest
using the nightly build of the subversion trunk available here:
http://www.open-mpi.org/nightly/trunk/



One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file.  Will
this work?  Is this the best route to go?


Off the top of my head, I'm not sure whether that would work.  The
best/cleanest route would be to use an MPI_Info object with the {host}
key.

Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
this scenario.

Hope that helps,
Josh


Thanks for any help with this.

Byron






Re: [OMPI users] Problem with Openmpi 1.1

2006-07-28 Thread Jeff Squyres
Trolling through some really old mails that never got replies... :-(

I'm afraid that the guy who did the GM code in Open MPI is currently on
vacation, but we have made a small number of changes since 1.1 that may have
fixed your issue.

Could you try one of the 1.1.1 release candidate tarballs and see if you
still have the problem?

http://www.open-mpi.org/software/ompi/v1.1/


On 7/3/06 12:58 PM, "Borenstein, Bernard S"
 wrote:

> I've built and successfully run the NASA Overflow 2.0aa program with
> Openmpi 1.0.2.  I'm running on an Opteron Linux cluster running SLES 9
> and GM 2.0.24.  I built Openmpi 1.1 with the Intel 9 compilers, and when I
> try to run Overflow 2.0aa with Myrinet, I get what looks like a data
> corruption error and the program dies quickly.
> There are no MPI errors at all.  If I run using GigE (--mca btl self,tcp),
> the program runs to completion correctly.  Here is my ompi_info output :
> 
> bsb3227@mahler:~/openmpi_1.1/bin> ./ompi_info
> Open MPI: 1.1
>Open MPI SVN revision: r10477
> Open RTE: 1.1
>Open RTE SVN revision: r10477
> OPAL: 1.1
>OPAL SVN revision: r10477
>   Prefix: /home/bsb3227/openmpi_1.1
>  Configured architecture: x86_64-unknown-linux-gnu
>Configured by: bsb3227
>Configured on: Fri Jun 30 07:08:54 PDT 2006
>   Configure host: mahler
> Built by: bsb3227
> Built on: Fri Jun 30 07:54:46 PDT 2006
>   Built host: mahler
>   C bindings: yes
> C++ bindings: yes
>   Fortran77 bindings: yes (all)
>   Fortran90 bindings: yes
>  Fortran90 bindings size: small
>   C compiler: icc
>  C compiler absolute: /opt/intel/cce/9.0.25/bin/icc
> C++ compiler: icpc
>C++ compiler absolute: /opt/intel/cce/9.0.25/bin/icpc
>   Fortran77 compiler: ifort
>   Fortran77 compiler abs: /opt/intel/fce/9.0.25/bin/ifort
>   Fortran90 compiler: /opt/intel/fce/9.0.25/bin/ifort
>   Fortran90 compiler abs: /opt/intel/fce/9.0.25/bin/ifort
>  C profiling: yes
>C++ profiling: yes
>  Fortran77 profiling: yes
>  Fortran90 profiling: yes
>   C++ exceptions: no
>   Thread support: posix (mpi: no, progress: no)
>   Internal debug support: no
>  MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>  libltdl support: yes
>   MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1)
>MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1)
>MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1)
>MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1)
>MCA timer: linux (MCA v1.0, API v1.0, Component v1.1)
>MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
>   MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
>MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
>MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1)
>  MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
>  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
>   MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
>  MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
>  MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
>  MCA btl: gm (MCA v1.0, API v1.0, Component v1.1)
>  MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
>  MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
>  MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
>  MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
>  MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
>  MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
>  MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
>   MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
>   MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
>  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>  MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
>  MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
>  MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
>  MCA ras: slurm (MCA v

Re: [OMPI users] OS X, OpenMPI 1.1: An error occurred in MPI_Allreduce on, communicator MPI_COMM_WORLD (Jeff Squyres (jsquyres))

2006-07-28 Thread Jeff Squyres
Trolling through some really old messages that never got replies... :-(

The behavior that you are seeing is happening as the result of a really long
discussion among the OMPI developers when we were writing the TCP device.
The problem is that there is ambiguity when connecting peers across TCP in
Open MPI.  Specifically, since OMPI can span multiple TCP networks, each MPI
process may be able to use multiple IP addresses to reach each other MPI
process (and vice versa).  So we have to try to figure out which IP
addresses can speak to which others.

For example, say that you have a cluster with 16 nodes on a private ethernet
network.  One of these nodes doubles as the head node for the cluster and
therefore has 2 ethernet NICs -- one to the external network and one to the
internal cluster network.  But since 16 is a nice number, you also want to
use it for computation.  So when you mpirun spanning all 16 nodes,
OMPI has to figure out *not* to use the external NIC on the head node and
to only use the internal NIC.

TCP connections are only made on demand, which is why you only see this
behavior if two processes actually attempt to communicate via MPI (i.e.,
"hello world" with no sending/receiving works fine, but adding the
MPI_SEND/MPI_RECV makes it fail).

We make connections by having all MPI processes exchange their IP
address(es) and port number(s) during MPI_INIT (via a common rendezvous
point, typically mpirun).  Then, whenever a connection is requested between
two processes, we apply a small set of rules to all pair combinations of IP
addresses of those processes:

1. If the two IP addresses match after the subnet mask is applied, assume
that they are mutually routable and allow the connection
2. If the two IP addresses are public, assume that they are mutually
routable and allow the connection
3. Otherwise, the connection is disallowed (this is not an error -- we just
disallow this connection in the hope that some other device can be used to
make a connection).  A sketch of this pairwise check appears below.
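
For concreteness, here is a small sketch (my own illustration, not the actual
Open MPI source) of how rules 1 and 2 can be applied to a single pair of IPv4
addresses:

    #include <stdint.h>

    /* a is an IPv4 address in host byte order */
    static int is_private(uint32_t a)
    {
        return (a >> 24) == 10                       /* 10.0.0.0/8     */
            || (a >> 20) == ((172u << 4) | 1)        /* 172.16.0.0/12  */
            || (a >> 16) == ((192u << 8) | 168);     /* 192.168.0.0/16 */
    }

    /* Returns 1 if a connection between 'local' and 'peer' would be allowed
       by rules 1 and 2 above, 0 if it falls through to rule 3. */
    static int allow_pair(uint32_t local, uint32_t peer, uint32_t netmask)
    {
        if ((local & netmask) == (peer & netmask))    /* rule 1: same subnet */
            return 1;
        if (!is_private(local) && !is_private(peer))  /* rule 2: both public */
            return 1;
        return 0;                                     /* rule 3: disallow    */
    }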

What is happening in your case is that you're falling through to #3 for all
IP address pair combinations and there is no other device that can reach
these processes.  Therefore OMPI thinks that it has no channel to reach the
remote process.  So it bails (in a horribly non-descriptive way :-( ).

We actually have a very long comment about this in the TCP code and mention
that your scenario (lots of hosts in a single cluster with private addresses
and relatively narrow subnet masks, even though all addresses are, in fact,
routable to each other) is not currently supported -- and that we need to do
something "better".  "Better" in this case probably means having a
configuration file that specifies what hosts are mutually routable when the
above rules don't work.  Do you have any suggestions on this front?



On 7/5/06 1:15 PM, "Frank Kahle"  wrote:

> users-requ...@open-mpi.org wrote:
>> A few clarifying questions:
>> 
>> What is your netmask on these hosts?
>> 
>> Where is the MPI_ALLREDUCE in your app -- right away, or somewhere deep
>> within the application?  Can you replicate this with a simple MPI
>> application that essentially calls MPI_INIT, MPI_ALLREDUCE, and
>> MPI_FINALIZE?
>> 
>> Can you replicate this with a simple MPI app that does an MPI_SEND /
>> MPI_RECV between two processes on the different subnets?
>> 
>> Thanks.
>> 
>>   
> 
> @ Jeff,
> 
> netmask 255.255.255.0
> 
> Running a simple "hello world" yields no error on each subnet, but
> running "hello world" on both subnets yields the error
> 
> [g5dual.3-net:00436] *** An error occurred in MPI_Send
> [g5dual.3-net:00436] *** on communicator MPI_COMM_WORLD
> [g5dual.3-net:00436] *** MPI_ERR_INTERN: internal error
> [g5dual.3-net:00436] *** MPI_ERRORS_ARE_FATAL (goodbye)
> 
> Hope this helps!
> 
> Frank
> 
> 
> Just in case you wanna check the source:
> cFortran example hello_world
>   program hello
>   include 'mpif.h'
>   integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
>   character*12 message
> 
>   call MPI_INIT(ierror)
>   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>   tag = 100
> 
>   if (rank .eq. 0) then
> message = 'Hello, world'
> do i=1, size-1
>   call MPI_SEND(message, 12, MPI_CHARACTER, i, tag,
>  &  MPI_COMM_WORLD, ierror)
> enddo
> 
>   else
> call MPI_RECV(message, 12, MPI_CHARACTER, 0, tag,
>  &MPI_COMM_WORLD, status, ierror)
>   endif
> 
>   print*, 'node', rank, ':', message
>   call MPI_FINALIZE(ierror)
>   end
> 
> 
> or the full output:
> 
> [powerbook:/Network/CFD/hello] motte% mpirun -d -np 5 --hostfile
> ./hostfile /Network/CFD/hello/hello_world
> [powerbook.2-net:00606] [0,0,0] setting up session dir with
> [powerbook.2-net:00606] universe default-universe
> [powerbook.2-net:00606] user motte
> [powerbook.2-net:00606]   

Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-07-28 Thread Jeff Squyres
Tony --

My apologies for taking so long to answer.  :-(

I was unfortunately unable to replicate your problem.  I ran your source
code across 32 machines connected by TCP with no problem:

  mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8

I tried this on two different clusters with the same results -- it didn't
hang.  :-(

Can you try again with a recent nightly tarball, or the 1.1.1 beta tarball
that has been posted?

  http://www.open-mpi.org/software/ompi/v1.1/


On 6/30/06 8:35 AM, "Tony Ladd"  wrote:

> Jeff
> 
> Thanks for the reply; I realize you guys must be really busy with the recent
> release of openmpi. I tried 1.1 and I don't get error messages any more. But
> the code now hangs; no error or exit. So I am not sure if this is the same
> issue or something else. I am enclosing my source code. I compiled with icc
> and linked against an icc compiled version of openmpi-1.1.
> 
> My program is a set of network benchmarks (a crude kind of netpipe) that
> checks typical message passing patterns in my application codes.
> Typical output is:
> 
>  32 CPU's: sync call time = 1003.0
>
>                            |------------- time -------------|  |------- rate (Mbytes/s) --------|  |----- bandwidth (MBits/s) ------|
>  loop  buffers    size     XC       XE       GS       MS        XC       XE       GS       MS        XC       XE       GS       MS
>     1       64   16384     2.48e-02 1.99e-02 1.21e+00 3.88e-02  4.23e+01 5.28e+01 8.65e-01 2.70e+01  1.08e+04 1.35e+04 4.43e+02 1.38e+04
>     2       64   16384     2.17e-02 2.09e-02 1.21e+00 4.10e-02  4.82e+01 5.02e+01 8.65e-01 2.56e+01  1.23e+04 1.29e+04 4.43e+02 1.31e+04
>     3       64   16384     2.20e-02 1.99e-02 1.01e+00 3.95e-02  4.77e+01 5.27e+01 1.04e+00 2.65e+01  1.22e+04 1.35e+04 5.33e+02 1.36e+04
>     4       64   16384     2.16e-02 1.96e-02 1.25e+00 4.00e-02  4.85e+01 5.36e+01 8.37e-01 2.62e+01  1.24e+04 1.37e+04 4.28e+02 1.34e+04
>     5       64   16384     2.25e-02 2.00e-02 1.25e+00 4.07e-02  4.66e+01 5.24e+01 8.39e-01 2.57e+01  1.19e+04 1.34e+04 4.30e+02 1.32e+04
>     6       64   16384     2.19e-02 1.99e-02 1.29e+00 4.05e-02  4.79e+01 5.28e+01 8.14e-01 2.59e+01  1.23e+04 1.35e+04 4.17e+02 1.33e+04
>     7       64   16384     2.19e-02 2.06e-02 1.25e+00 4.03e-02  4.79e+01 5.09e+01 8.38e-01 2.60e+01  1.23e+04 1.30e+04 4.29e+02 1.33e+04
>     8       64   16384     2.24e-02 2.06e-02 1.25e+00 4.01e-02  4.69e+01 5.09e+01 8.39e-01 2.62e+01  1.20e+04 1.30e+04 4.30e+02 1.34e+04
>     9       64   16384     4.29e-01 2.01e-02 6.35e-01 3.98e-02  2.45e+00 5.22e+01 1.65e+00 2.64e+01  6.26e+02 1.34e+04 8.46e+02 1.35e+04
>    10       64   16384     2.16e-02 2.06e-02 8.87e-01 4.00e-02  4.85e+01 5.09e+01 1.18e+00 2.62e+01  1.24e+04 1.30e+04 6.05e+02 1.34e+04
> 
> Time is total for all 64 buffers. Rate is one way across one link (# of
> bytes/time).
> 1) XC is a bidirectional ring exchange. Each processor sends to the right
> and receives from the left
> 2) XE is an edge exchange. Pairs of nodes exchange data, with each one
> sending and receiving
> 3) GS is the MPI_AllReduce
> 4) MS is my version of MPI_AllReduce. It splits the vector into Np blocks
> (Np is # of processors); each processor then acts as a head node for one
> block. This uses the full bandwidth all the time, unlike AllReduce which
> thins out as it gets to the top of the binary tree. On a 64 node Infiniband
> system MS is about 5X faster than GS - in theory it would be 6X, i.e., log_2(64).
> Here it is 25X - not sure why so much. But MS seems to be the cause of the
> hangups with messages > 64K. I can run the other benchmarks OK, but this one
> seems to hang for large messages. I think the problem is at least partly due
> to the switch. All MS is doing is point-to-point communication, but
> unfortunately it sometimes requires high bandwidth between ASICs. At
> first it exchanges data between near neighbors in MPI_COMM_WORLD, but it
> must progressively span wider gaps between nodes as it goes up the various
> binary trees. After a while this requires extensive traffic between ASICs.
> This seems to be a problem on both my HP 2724 and the Extreme Networks
> Summit400t-48. I am currently working with Extreme to try to resolve the
> switch issue. As I say, the code ran great on Infiniband, but I think those
> switches have hardware flow control. Finally, I checked the code again under
> LAM and it ran OK. Slow, but no hangs.
> 
> To run the code compile and type:
> mpirun -np 32 -machinefile hosts src/netbench 8
> The 8 means 2^8 bytes (ie 256K). This was enough to hang every time on my
> boxes.
> 
> You can also edit the header file (header.h). MAX_LOOPS is how many times it
> runs each test (currently 10); NUM_BUF is the number of buffers in each test
> (must be more than number of processors), SYNC defines the global sync
> frequency-every SYNC buffers. NUM_SYNC is the number of sequential barrier
> calls it uses to determine the mean barrier call time. You can also switch
> the verious te
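
As an aside on the "MS" scheme described in item 4 above (each rank acting as
the head node for one block of the vector): conceptually this is a
reduce-scatter followed by an allgather. A hedged sketch of that pattern --
not Tony's actual code -- assuming the vector length divides evenly among the
ranks:

    #include <mpi.h>
    #include <stdlib.h>

    /* In-place allreduce built from a reduce-scatter plus an allgather. */
    void blocked_allreduce(double *vec, int n, MPI_Comm comm)
    {
        int np, i;
        MPI_Comm_size(comm, &np);

        int block = n / np;                      /* assumes n % np == 0 */
        int *counts = malloc(np * sizeof(int));
        double *mine = malloc(block * sizeof(double));
        for (i = 0; i < np; i++)
            counts[i] = block;

        /* each rank ends up holding the fully reduced block it "heads" */
        MPI_Reduce_scatter(vec, mine, counts, MPI_DOUBLE, MPI_SUM, comm);

        /* every rank then gathers all of the reduced blocks back into vec */
        MPI_Allgather(mine, block, MPI_DOUBLE, vec, block, MPI_DOUBLE, comm);

        free(mine);
        free(counts);
    }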

[OMPI users] error while loading shared libraries: libmpi.so.0: cannot open shared object file

2006-07-28 Thread Dan Lipsitt

I get the following error when I attempt to run an MPI program (called
"first", in this case) across several nodes (it works on a single
node):

$ mpirun -np 3 --hostfile /tmp/nodes ./first
./first: error while loading shared libraries: libmpi.so.0: cannot
open shared object file: No such file or directory

My library path looks okay and I am able to run other programs,
including listing the supposedly missing library:

$ echo $LD_LIBRARY_PATH
/opt/openmpi/1.1/lib/
$ mpirun -np 3 --hostfile /tmp/nodes uptime
16:42:51 up 22 days,  3:14, 10 users,  load average: 0.01, 0.02, 0.04
19:49:32 up  1:36,  0 users,  load average: 0.00, 0.00, 0.00
19:40:01 up  1:37,  0 users,  load average: 0.00, 0.00, 0.00
$ mpirun -np 3 --hostfile /tmp/nodes ls -l /opt/openmpi/1.1/lib/libmpi.so*
lrwxrwxrwx  1 root root  15 Jul 13 15:44
/opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
lrwxrwxrwx  1 root root  15 Jul 13 15:44
/opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
-rwxr-xr-x  1 root root 6157698 Jul 12 18:08
/opt/openmpi/1.1/lib/libmpi.so.0.0.0
lrwxrwxrwx  1 root root  15 Jul 26 16:17
/opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
lrwxrwxrwx  1 root root  15 Jul 26 16:17
/opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
-rwxr-xr-x  1 root root 6157698 Jul 12 18:08
/opt/openmpi/1.1/lib/libmpi.so.0.0.0
lrwxrwxrwx  1 root root  15 Jul 26 13:50
/opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
lrwxrwxrwx  1 root root  15 Jul 26 13:50
/opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
-rwxr-xr-x  1 root root 6157698 Jul 12 18:08
/opt/openmpi/1.1/lib/libmpi.so.0.0.0

Any suggestions?

Thanks,
Dan


Re: [OMPI users] error while loading shared libraries: libmpi.so.0: cannot open shared object file

2006-07-28 Thread Jeff Squyres
A few notes:

1. I'm guessing that your LD_LIBRARY_PATH is not set properly on the remote
nodes, which is why the loader can't find libmpi.so there.  Ensure that it is
set properly on the remote side (you'll likely need to modify your shell
startup files), or use the --prefix functionality in mpirun (which ensures
that your PATH and LD_LIBRARY_PATH are set properly on the remote nodes),
like this:

mpirun --prefix /opt/openmpi/1.1 -np 3 --hostfile /tmp/hosts ./first

Or simply supply the full pathname to mpirun (exactly equivalent to
--prefix):

/opt/openmpi/1.1/bin/mpirun -np 3 --hostfile /tmp/hosts ./first

Or if you're lazy (like me):

`which mpirun` -np 3 --hostfile /tmp/hosts ./first

2. Note that your "ls" command was actually shell expanded on the node where
you ran mpirun, and *then* it was executed on the remote nodes.  This was
not a problem because the files are actually the same on all nodes, but I
thought you might want to know that for future reference.

Hope that helps!


On 7/28/06 4:55 PM, "Dan Lipsitt"  wrote:

> I get the following error when I attempt to run an MPI program (called
> "first", in this case) across several nodes (it works on a single
> node):
> 
> $ mpirun -np 3 --hostfile /tmp/nodes ./first
> ./first: error while loading shared libraries: libmpi.so.0: cannot
> open shared object file: No such file or directory
> 
> My library path looks okay and I am able to run other programs,
> including listing the supposedly missing library:
> 
> $ echo $LD_LIBRARY_PATH
> /opt/openmpi/1.1/lib/
> $ mpirun -np 3 --hostfile /tmp/nodes uptime
>  16:42:51 up 22 days,  3:14, 10 users,  load average: 0.01, 0.02, 0.04
>  19:49:32 up  1:36,  0 users,  load average: 0.00, 0.00, 0.00
>  19:40:01 up  1:37,  0 users,  load average: 0.00, 0.00, 0.00
> $ mpirun -np 3 --hostfile /tmp/nodes ls -l /opt/openmpi/1.1/lib/libmpi.so*
> lrwxrwxrwx  1 root root  15 Jul 13 15:44
> /opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
> lrwxrwxrwx  1 root root  15 Jul 13 15:44
> /opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
> -rwxr-xr-x  1 root root 6157698 Jul 12 18:08
> /opt/openmpi/1.1/lib/libmpi.so.0.0.0
> lrwxrwxrwx  1 root root  15 Jul 26 16:17
> /opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
> lrwxrwxrwx  1 root root  15 Jul 26 16:17
> /opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
> -rwxr-xr-x  1 root root 6157698 Jul 12 18:08
> /opt/openmpi/1.1/lib/libmpi.so.0.0.0
> lrwxrwxrwx  1 root root  15 Jul 26 13:50
> /opt/openmpi/1.1/lib/libmpi.so -> libmpi.so.0.0.0
> lrwxrwxrwx  1 root root  15 Jul 26 13:50
> /opt/openmpi/1.1/lib/libmpi.so.0 -> libmpi.so.0.0.0
> -rwxr-xr-x  1 root root 6157698 Jul 12 18:08
> /opt/openmpi/1.1/lib/libmpi.so.0.0.0
> 
> Any suggestions?
> 
> Thanks,
> Dan


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] Error sending large number of small messages

2006-07-28 Thread Jeff Squyres
Marcelo --

Can you send your code that is failing?  I'm unable to reproduce with some
toy programs here.

I also notice that you're running a somewhat old OMPI SVN checkout of the
trunk.  Can you update to the most recent version?  The
trunk is not guaranteed to be stable, and we did have some stability
problems recently -- you might want to upgrade to the most recent version
(today seems to be ok) and/or try one of the nightly or prerelease tarballs
in the 1.1 branch.
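
For reference, a self-contained version of the send/receive flood described in
the quoted message below (the message count and size are taken from Marcelo's
description; run it with exactly two ranks):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        const int n = 4 * 1024;          /* number of messages */
        const int len = 8;               /* bytes per message */
        char buf[8] = {0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 floods rank 1 with small eager messages */
            for (i = 0; i < n; i++)
                MPI_Send(buf, len, MPI_BYTE, 1, i, MPI_COMM_WORLD);
        } else if (rank == 1) {
            for (i = 0; i < n; i++)
                MPI_Recv(buf, len, MPI_BYTE, 0, i, MPI_COMM_WORLD, &status);
            printf("received %d messages\n", n);
        }

        MPI_Finalize();
        return 0;
    }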


On 7/26/06 6:18 PM, "Marcelo Stival"  wrote:

> Hi,
> 
> I have a problem with ompi when sending a large number of messages from
> process A to process B.
> Process A only sends... and B only receives (the buffers are reused).
> 
> int n = 4 * 1024;//number of iterations (messages to be sent) consecutively
> int len = 8; //len of each message
> 
> Process A (rank 0):
> for (i=0; i < n; i++){
> MPI_Send( sbuffer, len, MPI_BYTE,to,i,MPI_COMM_WORLD);
> }
> Process B (rank 1):
> for (i=0; i < n; i++){
> MPI_Recv(rbuffer,len,MPI_BYTE,recv_from , i,MPI_COMM_WORLD, &status);
> }
> (It's a benchmark program... it will run with increasing message sizes.)
> (I tried with the same tag on all iterations - and got the same)
> 
> It works fine for n (number of messages) equal to 3k (for example), but does
> not work with n equal to 4k (for messages of 8 bytes, 4k iterations seems to
> be the threshold).
> 
> The error messages (complete output attached):
> malloc debug: Request for 8396964 bytes failed (class/ompi_free_list.c, 142)
> mpptest: btl_tcp_endpoint.c:624: mca_btl_tcp_endpoint_recv_handler:
> Assertion `0
>  == btl_endpoint->endpoint_cache_length' failed.
> Signal:6 info.si_errno:0(Success) si_code:-6()
> 
> 
> Considerations:
> It works for synchronous send (MPI_Ssend).
> It  works with MPICH2 ( 1.0.3).
> It is a benchmark program, I want to flood the network to measure the
> bandwidth ... (for different message sizes)
> 
> 
> Thanks
> 
> Marcelo


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] Open-MPI running os SMP cluster

2006-07-28 Thread Jeff Squyres
On 7/26/06 5:55 PM, "Michael Kluskens"  wrote:

>> How is the message passing of Open-MPI implemented when I have
>> say 4 nodes with 4 processors (SMP) each, nodes connected by a gigabit
>> ethernet ?... in other words, how does it manage SMP nodes when I
>> want to use all CPUs, but each with its own process. Does it take
>> any advantage of the SMP at each node?
> 
> Someone can give you a more complete/correct answer but I'll give you
> my understanding.
> 
> All communication in OpenMPI is handled via the MCA module (term?)

We call them "components" or "plugins"; a "module" is typically an instance
of those plugins (e.g., if you have 2 ethernet NICs with TCP interfaces,
you'll get 2 instances -- modules -- of the TCP BTL component).

> self - process communicating with itself
> sm - ... via shared memory to other processes
> tcp - ... via tcp
> openib - ... via Infiniband OpenIB stack
> gm & mx - ... via Myrinet GM/MX
> mvapi - ... via Infiniband Mellanox Verbs

All correct.

> If you launch your process so that four processes are on a node then
> those would use shared memory to communicate.

Also correct.

Just chiming in with verifications!  :-)

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] Runtime Error

2006-07-28 Thread Jeff Squyres
This question has come up a few times now, so I've added it to the faq,
which should make the "mca_pml_teg.so:undefined symbol" message
web-searchable for others who run into this issue.


On 7/26/06 8:36 AM, "Michael Kluskens"  wrote:

> Summary: You have to properly uninstall OpenMPI 1.0.2 before
> installing OpenMPI 1.1
> 
> 
> On Jul 26, 2006, at 7:05 AM,  wrote:
> 
>> Updated to open_mpi-1.1.   I get a runtime error on the application as
>> follows
>> 
>> mca:base:component_find:unable to
>> open:/usr/local/lip/openmpi/mca_pml_teg.so:undefined
>> symbol:mca_ptl_base_modules_initialized
>> 
>> Open MPI is compiled with g95 and gcc 4.0.3
> 
> I use that combination all the time on OS X 10.4.7 and under Debian
> Sarge.
> 
> Since you did not specify how you updated to OpenMPI 1.1 I'm copying
> the instructions posted previously on the list:
> 
> 
> On Jun 26, 2006, at 5:56 PM, Benjamin Landsteiner wrote:
>> Strange.  I had actually done this before I emailed (several times,
>> in fact), but for the sake of completeness, I did it once more.  This
>> time, it worked!  No clue why it worked this time around.
>> 
>> For those of you who in the future come across this problem, here are
>> the (more or less exact) steps I took to recover from the problem:
>> 
>> PROBLEM:  You installed v1.1 of Open MPI and experience keyval parse
>> errors upon running mpicc, mpif77, mpic++, and so forth.
>> 
>> SOLUTION:
>> 1.  Go to your v1.1 directory, and type './configure' if you have not
>> already done so
>> 2.  Type 'make uninstall'
>> 3.  Go to your v1.0.2 directory, and reconfigure using the same
>> settings as you installed with (if you still have the install
>> directory, you probably don't need to do this as it has already been
>> configured)
>> 4.  In the v1.0.2 directory, type 'make uninstall'
>> 5.  For good measure, I went back to the v1.1 directory and typed
>> 'make uninstall' again
>> 6.  Find lingering Open MPI directories and files and then delete
>> them (for instance, empty Open MPI-related folders remained in my /
>> usr/local/* directories)
>> 7.  At this point, I restarted my machine.  Not sure if it's
>> necessary or not.
>> 8.  Go back to the v1.1 directory.  Type 'make clean', then
>> reconfigure, then recompile and reinstall
>> 9.  Things should work now.
>> 
>> 
>> Thank you Michael,
>> ~Ben
>> 
>> ++
>> Benjamin Landsteiner
>> lands...@stolaf.edu
>> 
>> On 2006/06/26, at 3:48 PM, Michael Kluskens wrote:
>> 
>>> You may have to properly uninstall OpenMPI 1.0.2 before installing
>>> OpenMPI 1.1
>>> 
>>> This was an issue in the past.
>>> 
>>> I would recommend you go into your OpenMPI 1.1 directory and type
>>> "make uninstall", then if you have it go into your OpenMPI 1.0.2
>>> directory and do the same.  If you don't have a directory with
>>> OpenMPI 1.0.2 configured already then either rebuild OpenMPI 1.0.2 or
>>> go into /usr/local and identify all remaining OpenMPI directories and
>>> components and remove them.  Basically you should find directories
>>> modified when you installed OpenMPI 1.1 (or when you uninstalled it)
>>> and you may find components dated from when you installed OpenMPI
>>> 1.0.2.
>>> 
>>> Michael
>>> 
>>> On Jun 26, 2006, at 4:34 PM, Benjamin Landsteiner wrote:
>>> 
 Hello all,
 I recently upgraded to v1.1 of Open MPI and ran into a problem
 on my
 head node that I can't seem to solve.  Upon running mpicc, mpiCC,
 mpic++, and so forth, I get an error like this:
>>> 
> 
> 


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Ralph Castain

Actually, we had a problem in our implementation that caused the system to
continually reuse the same machine allocations for each "spawn" request. In
other words, we always started with the top of the machine_list whenever
your program called comm_spawn. This appears to have been the source of the
behavior you describe.

You don't need to use the MPI_Info key to solve that problem - it has been
fixed in the subversion repository, and will be included in the next
release. If all you want is to have your new processes be placed beginning
with the next process slot in your allocation (as opposed to overlaying the
existing processes), then you don't need to do anything.

On the other hand, if you want the new processes to go to a specific set of
hosts, then you need to follow Josh's suggestions.

Hope that helps
Ralph


On 7/28/06 8:38 AM, "Josh Hursey"  wrote:

>> I have implemented the fault tolerance method in which you would use
>> MPI_COMM_SPAWN to dynamically create communication groups and use
>> those communicators for a form of process fault tolerance (as
>> described by William Gropp and Ewing Lusk in their 2004 paper),
>> but am having some problems getting it to work the way I intended.
>> Basically, when it runs, it is spawning all the processes on the
>> same machine (as it always starts at the top of the machine_list
>> when spawning a process).  Is there a way that I can get these
>> processes to spawn on different machines?
>> 
> 
> In Open MPI (and most other MPI implementations) you will be restricted to
> using only the machines in your allocation when you use MPI_Comm_spawn*.
> The standard allows you to suggest to MPI_Comm_spawn where to place the
> 'children' that it creates using an MPI_Info key -- specifically the
> {host} key/value referenced here:
> http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
> MPI_Info is described here:
> http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53
> 
> Open MPI, in the current release, does not do anything with this key.
> This has been fixed in subversion (as of r11039) and will be in the next
> release of Open MPI.
> 
> If you want to use this functionality in the near term I would suggest
> using the nightly build of the subversion trunk available here:
> http://www.open-mpi.org/nightly/trunk/
> 
> 
>> One possible route I considered was using something like SLURM to
>> distribute the jobs, and just putting '+' in the machine file.  Will
>> this work?  Is this the best route to go?
> 
> Off the top of my head, I'm not sure whether that would work.  The
> best/cleanest route would be to use an MPI_Info object with the {host}
> key.
> 
> Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
> this scenario.
> 
> Hope that helps,
> Josh
> 
>> 
>> Thanks for any help with this.
>> 
>> Byron
>> 
>> 
> 