[OMPI users] Fault Tolerant Method

2006-07-28 Thread bdickinson
I have implemented the fault tolerance method in which you would use MPI_COMM_SPAWN to dynamically create communication groups and use those communicators for a form of process fault tolerance (as described by William Gropp and Ewing Lusk in their 2004 paper), but am having some problems getting i
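
The approach being referenced (Gropp and Lusk, 2004) has a manager spawn each worker through MPI_Comm_spawn so that every spawn returns its own intercommunicator, and a failure can then be confined to that communicator rather than taking down MPI_COMM_WORLD. A minimal C sketch of the manager side, assuming a worker executable named "worker" (the executable name and the error-handler choice are illustrative, not taken from the posting):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Spawn one worker; errcode reports whether the launch itself worked. */
        MPI_Comm worker_comm;            /* intercommunicator to the child */
        int errcode;
        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0 /* root */, MPI_COMM_SELF, &worker_comm, &errcode);

        /* Ask for error codes instead of aborting, so a dead child can be
           detected and replaced with a fresh MPI_Comm_spawn. */
        MPI_Comm_set_errhandler(worker_comm, MPI_ERRORS_RETURN);

        int work = 42, result = 0;
        if (MPI_Send(&work, 1, MPI_INT, 0, 0, worker_comm) != MPI_SUCCESS ||
            MPI_Recv(&result, 1, MPI_INT, 0, 0, worker_comm,
                     MPI_STATUS_IGNORE) != MPI_SUCCESS) {
            /* Communication with this child failed: discard the communicator
               and (in a real code) spawn a replacement worker here. */
            fprintf(stderr, "worker lost, would re-spawn here\n");
        }

        MPI_Finalize();
        return 0;
    }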

Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Josh Hursey
> I have implemented the fault tolerance method in which you would use MPI_COMM_SPAWN to dynamically create communication groups and use those communicators for a form of process fault tolerance (as described by William Gropp and Ewing Lusk in their 2004 paper), but am having some problems

Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Edgar Gabriel
Don't forget, furthermore, that for this fault-tolerance approach to work, the parents and the other child processes must not be affected by the death/failure of another child process. Right now in Open MPI, if one of the child processes (which you spawned using MPI_Comm_spawn) fails, th
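
The child side of such a scheme typically retrieves the intercommunicator to its parent with MPI_Comm_get_parent and installs MPI_ERRORS_RETURN on it, so that an error is reported back instead of aborting the job -- provided the MPI implementation actually survives the failure, which is exactly the caveat raised above. A hypothetical worker sketch (the doubling "computation" is a placeholder):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);    /* intercommunicator to the spawner */
        if (parent == MPI_COMM_NULL) {
            fprintf(stderr, "not started via MPI_Comm_spawn\n");
            MPI_Finalize();
            return 1;
        }

        /* Report errors instead of aborting the whole job; whether a failure
           is actually contained depends on the MPI implementation. */
        MPI_Comm_set_errhandler(parent, MPI_ERRORS_RETURN);

        int work = 0;
        if (MPI_Recv(&work, 1, MPI_INT, 0, 0, parent,
                     MPI_STATUS_IGNORE) == MPI_SUCCESS) {
            int result = work * 2;       /* placeholder computation */
            MPI_Send(&result, 1, MPI_INT, 0, 0, parent);
        }

        MPI_Finalize();
        return 0;
    }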

Re: [OMPI users] Problem with Openmpi 1.1

2006-07-28 Thread Jeff Squyres
Trolling through some really old mails that never got replies... :-( I'm afraid that the guy who did the GM code in Open MPI is currently on vacation, but we have made a small number of changes since 1.1 that may have fixed your issue. Could you try one of the 1.1.1 release candidate tarballs and

Re: [OMPI users] OS X, OpenMPI 1.1: An error occurred in MPI_Allreduce on, communicator MPI_COMM_WORLD (Jeff Squyres (jsquyres))

2006-07-28 Thread Jeff Squyres
Trolling through some really old messages that never got replies... :-( The behavior that you are seeing is the result of a really long discussion among the OMPI developers back when we were writing the TCP device. The problem is that there is ambiguity when connecting peers across TCP in

Re: [OMPI users] mca_btl_tcp_frag_send: writev failed with errno=110

2006-07-28 Thread Jeff Squyres
Tony -- My apologies for taking so long to answer. :-( I was unfortunately unable to replicate your problem. I ran your source code across 32 machines connected by TCP with no problem: mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8 I tried this on two different clusters wit

[OMPI users] error while loading shared libraries: libmpi.so.0: cannot open shared object file

2006-07-28 Thread Dan Lipsitt
I get the following error when I attempt to run an MPI program (called "first", in this case) across several nodes (it works on a single node): $ mpirun -np 3 --hostfile /tmp/nodes ./first ./first: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or di

Re: [OMPI users] error while loading shared libraries: libmpi.so.0: cannot open shared object file

2006-07-28 Thread Jeff Squyres
A few notes: 1. I'm guessing that your LD_LIBRARY_PATH is not set properly on the remote nodes, which is why it can't find libmpi.so there. Ensure that it's set properly on the other side (you'll likely need to modify your shell startup files), or use the --prefix functionality in m

Re: [OMPI users] Error sending large number of small messages

2006-07-28 Thread Jeff Squyres
Marcelo -- Can you send your code that is failing? I'm unable to reproduce with some toy programs here. I also notice that you're running a somewhat old version of an OMPI SVN checkout of the trunk. Can you update to the most recent version? The trunk is not guaranteed to be stable, and we di

Re: [OMPI users] Open-MPI running os SMP cluster

2006-07-28 Thread Jeff Squyres
On 7/26/06 5:55 PM, "Michael Kluskens" wrote: >> How is the message passing of Open-MPI implemented when I have, say, 4 nodes with 4 processors (SMP) each, nodes connected by a gigabit ethernet? ... in other words, how does it manage SMP nodes when I want to use all CPUs, but each with its

Re: [OMPI users] Runtime Error

2006-07-28 Thread Jeff Squyres
This question has come up a few times now, so I've added it to the FAQ, which should make the "mca_pml_teg.so:undefined symbol" message web-searchable for others who run into this issue. On 7/26/06 8:36 AM, "Michael Kluskens" wrote: > Summary: You have to properly uninstall OpenMPI 1.0.2 before

Re: [OMPI users] Fault Tolerant Method

2006-07-28 Thread Ralph Castain
Actually, we had a problem in our implementation that caused the system to continually reuse the same machine allocations for each "spawn" request. In other words, we always started with the top of the machine_list whenever your program called comm_spawn. This appears to have been the source of th
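
For anyone experimenting with where replacement processes land, the MPI standard reserves info keys such as "host" for MPI_Comm_spawn, which can be used to request a particular machine for the spawned process; how such a hint interacts with the allocator behavior described above is implementation-dependent, and the host name below ("node03") is only a placeholder. A minimal sketch:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Info info;
        MPI_Comm child;

        MPI_Info_create(&info);
        /* "host" is a reserved info key for process spawning; "node03" stands
           in for whichever machine the replacement should run on. */
        MPI_Info_set(info, "host", "node03");

        MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info,
                       0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }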