I have implemented the fault tolerance method that uses MPI_COMM_SPAWN to dynamically create communication groups and then uses those communicators for a form of process fault tolerance (as described by William Gropp and Ewing Lusk in their 2004 paper), but I am having trouble getting it to work the way I intended. When it runs, it spawns all the processes on the same machine, because each spawn starts at the top of the machine_list. Is there a way I can get these processes to spawn on different machines?
One possible route I considered was using something like SLURM to distribute the jobs and just putting '+' in the machine file. Will this work? Is this the best route to go? Thanks for any help with this. Byron