Actually, we had a problem in our implementation that caused the system to
continually reuse the same machine allocations for each "spawn" request. In
other words, we always started with the top of the machine_list whenever
your program called comm_spawn. This appears to have been the source of th
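
To illustrate the machine_list behavior described above, here is a minimal sketch in plain C. It is not Open MPI's actual allocator: the host_pool struct and the alloc_from_top and alloc_from_cursor functions are hypothetical names made up for this example. It only contrasts restarting at the top of the list on every request with keeping a cursor across requests.

```c
#include <stddef.h>

/* Hypothetical allocator state; NOT Open MPI's real data structures. */
typedef struct {
    const char **machine_list; /* hosts, in the order they were listed */
    size_t count;              /* number of hosts, assumed to be > 0 */
    size_t next;               /* cursor remembered across spawn requests */
} host_pool;

/* The behavior described above: every request starts again at the top of
 * machine_list, so repeated spawns keep landing on the same hosts. */
static void alloc_from_top(const host_pool *pool, size_t nprocs,
                           const char **out)
{
    for (size_t i = 0; i < nprocs; i++)
        out[i] = pool->machine_list[i % pool->count];
}

/* One possible alternative (an assumption, not the actual fix): resume
 * where the previous request stopped and wrap around at the end. */
static void alloc_from_cursor(host_pool *pool, size_t nprocs,
                              const char **out)
{
    for (size_t i = 0; i < nprocs; i++) {
        out[i] = pool->machine_list[pool->next];
        pool->next = (pool->next + 1) % pool->count;
    }
}
```

With three hosts and two requests of two processes each, alloc_from_top hands out the first two hosts both times, while alloc_from_cursor hands out the first two hosts and then the third host followed by the first.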
Furthermore, don't forget that for this fault-tolerance approach to
work, the parent and the other child processes must not be affected
by the death or failure of another child process. Right now
in Open MPI, if one of the child processes (which you spawned using
MPI_Comm_spawn) fails, th
> I have implemented the fault tolerance method in which you would use
> MPI_COMM_SPAWN to dynamically create communication groups and use
> those communicators for a form of process fault tolerance (as
> described by William Gropp and Ewing Lusk in their 2004 paper),
> but am having some problems
I have implemented the fault tolerance method in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but am having some problems getting i