I have implemented the fault tolerance method in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but am having some problems getting i
> I have implemented the fault tolerance method in which you would use
> MPI_COMM_SPAWN to dynamically create communication groups and use
> those communicators for a form of process fault tolerance (as
> described by William Gropp and Ewing Lusk in their 2004 paper),
> but am having some problems
Don't forget, furthermore, that to use this fault-tolerance approach
successfully, the parent and the other child processes must not be
affected by the death/failure of another child process. Right now
in Open MPI, if one of the child processes (which you spawned using
MPI_Comm_spawn) fails, th
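(For anyone reading along later, a minimal sketch of the parent side of the
spawn-and-recover pattern being discussed might look like the following. This
is not the poster's code: "./worker", the single MPI_Recv, and the respawn
step are placeholders.)

  /* Hedged sketch, not the original poster's program: the parent spawns one
   * worker with MPI_Comm_spawn, switches the resulting intercommunicator to
   * MPI_ERRORS_RETURN so a failed child shows up as an error code instead of
   * aborting the parent, and could then free the communicator and respawn.
   * "./worker" and the recovery step are placeholders. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm child;
      int err, answer = 0;

      MPI_Init(&argc, &argv);

      MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

      /* Ask for error codes rather than the default abort-on-error. */
      MPI_Comm_set_errhandler(child, MPI_ERRORS_RETURN);

      err = MPI_Recv(&answer, 1, MPI_INT, 0, 0, child, MPI_STATUS_IGNORE);
      if (err != MPI_SUCCESS) {
          /* The child apparently died: drop the communicator; a real
           * program would spawn a replacement here. */
          fprintf(stderr, "worker lost (error %d); would respawn here\n", err);
          MPI_Comm_free(&child);
      }

      MPI_Finalize();
      return 0;
  }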
Trolling through some really old mails that never got replies... :-(
I'm afraid that the guy who did the GM code in Open MPI is currently on
vacation, but we have made a small number of changes since 1.1 that may have
fixed your issue.
Could you try one of the 1.1.1 release candidate tarballs and
Trolling through some really old messages that never got replies... :-(
The behavior that you are seeing is the result of a really long
discussion among the OMPI developers when we were writing the TCP device.
The problem is that there is ambiguity when connecting peers across TCP in
Tony --
My apologies for taking so long to answer. :-(
I was unfortunately unable to replicate your problem. I ran your source
code across 32 machines connected by TCP with no problem:
mpirun --hostfile ~/mpi/cdc -np 32 -mca btl tcp,self netbench 8
I tried this on two different clusters wit
I get the following error when I attempt to run an MPI program (called
"first", in this case) across several nodes (it works on a single
node):
$ mpirun -np 3 --hostfile /tmp/nodes ./first
./first: error while loading shared libraries: libmpi.so.0: cannot
open shared object file: No such file or di
A few notes:
1. I'm guessing that your LD_LIBRARY_PATH is not set properly on the remote
nodes, which is why the remote processes can't find libmpi.so.  Ensure
that it's set properly on the other side (you'll likely need to modify your
shell startup files), or use the --prefix functionality in m
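(For concreteness: assuming Open MPI were installed under /opt/openmpi on
every node (the path is only a placeholder), the two fixes would look
roughly like this:

  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

in the remote nodes' shell startup files, or

  mpirun --prefix /opt/openmpi -np 3 --hostfile /tmp/nodes ./first

where --prefix tells Open MPI to set PATH and LD_LIBRARY_PATH appropriately
on the remote side for you.)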
Marcelo --
Can you send your code that is failing? I'm unable to reproduce with some
toy programs here.
I also notice that you're running a somewhat old OMPI SVN
checkout of the trunk. Can you update to the most recent version? The
trunk is not guaranteed to be stable, and we di
On 7/26/06 5:55 PM, "Michael Kluskens" wrote:
>> How is the message passing of Open MPI implemented when I have,
>> say, 4 nodes with 4 processors (SMP) each, nodes connected by gigabit
>> Ethernet? ... in other words, how does it manage SMP nodes when I
>> want to use all CPUs, but each with its
This question has come up a few times now, so I've added it to the FAQ,
which should make the "mca_pml_teg.so:undefined symbol" message
web-searchable for others who run into this issue.
On 7/26/06 8:36 AM, "Michael Kluskens" wrote:
> Summary: You have to properly uninstall OpenMPI 1.0.2 before
Actually, we had a problem in our implementation that caused the system to
continually reuse the same machine allocations for each "spawn" request. In
other words, we always started with the top of the machine_list whenever
your program called comm_spawn. This appears to have been the source of th