Don't forget, furthermore, that for this fault-tolerance approach to
succeed, the parent and the other child processes must not be affected
by the death or failure of another child process. Right now in Open
MPI, if one of the child processes (which you spawned using
MPI_Comm_spawn) fails, the whole application will fail. [To be more
precise: the MPI standard does not mandate the behavior described in
the paper you mentioned.]
Thanks
Edgar
Josh Hursey wrote:
I have implemented the fault tolerance method in which you would use
MPI_COMM_SPAWN to dynamically create communication groups and use
those communicators for a form of process fault tolerance (as
described by William Gropp and Ewing Lusk in their 2004 paper),
but am having some problems getting it to work the way I intended.
Basically, when it runs, it is spawning all the processes on the
same machine (it always starts at the top of the machine_list
when spawning a process). Is there a way that I can get these
processes to spawn on different machines?
In Open MPI (and most other MPI implementations) you will be restricted to
using only the machines in your allocation when you use MPI_Comm_spawn*.
The standard allows you to suggest to MPI_Comm_spawn where to place the
'children' that it creates using an MPI_Info key -- specifically the
{host} keyvalue referenced here:
http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
MPI_Info is described here:
http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53
Open MPI, in the current release, does not do anything with this key.
This has been fixed in subversion (as of r11039) and will be in the next
release of Open MPI.
If you want to use this functionality in the near term I would suggest
using the nightly build of the subversion trunk available here:
http://www.open-mpi.org/nightly/trunk/
One possible route I considered was using something like SLURM to
distribute the jobs, and just putting '+' in the machine file. Will
this work? Is this the best route to go?
Off the top of my head, I'm not sure if that would work or not. The
best/cleanest route would be to use MPI_Info with the {host}
key.
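To sketch what that looks like in code: below is a minimal example of passing a {host} hint to MPI_Comm_spawn via MPI_Info. The child executable name "./worker" and the host name "node02" are placeholders -- substitute a binary and a host from your own allocation.

```c
/* Sketch: suggesting a placement host for spawned children via MPI_Info.
 * "./worker" and "node02" are placeholder names, not part of any API. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm;
    MPI_Info info;
    int errcodes[1];

    MPI_Init(&argc, &argv);

    /* Build an info object carrying the placement hint.  The named
     * host must be part of your current allocation. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "node02");

    /* Spawn one child running ./worker, suggesting it land on node02.
     * Root rank 0 of MPI_COMM_SELF does the spawn; the resulting
     * intercommunicator connects parent and child. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, errcodes);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

Note that the key is only a hint: an implementation is allowed to ignore it, which is exactly what the current Open MPI release does (fixed as of r11039, per above).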
Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
this scenario.
Hope that helps,
Josh
Thanks for any help with this.
Byron
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users