I have implemented the fault tolerance method that uses
MPI_COMM_SPAWN to dynamically create communication groups and then
uses those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but I am having some problems getting it to work the way I intended.
When it runs, it spawns all the processes on the same machine,
because it always starts at the top of the machine_list when
spawning a process.  Is there a way that I can get these
processes to spawn on different machines?

One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file.  Will
this work?  Is this the best route to go?

Thanks for any help with this.

Byron
