Re: [OMPI users] Fault Tolerant Method

Josh Hursey Fri, 28 Jul 2006 10:39:29 -0400

> I have implemented the fault tolerance method in which you would use
> MPI_COMM_SPAWN to dynamically create communication groups and use
> those communicators for a form of process fault tolerance (as
> described by William Gropp and Ewing Lusk in their 2004 paper),
> but am having some problems getting it to work the way I intended.
> Basically, when it runs, it is spawning all the processes on the
> same machine (as it always starts at the top of the machine_list
> when spawning a process).  Is there a way that I can get these
> processes to spawn on different machines?
>


In Open MPI (and most other MPI implementations) you will be restricted to
using only the machines in your allocation when you use MPI_Comm_spawn*.
The standard lets you suggest to MPI_Comm_spawn where to place the
'children' that it creates by passing an MPI_Info object, specifically
one with the {host} key set, as described here:
http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
MPI_Info is described here:
http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53
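
As a rough sketch of what I mean (not tested against any particular
release), something along these lines should work. The helper names
pick_host and spawn_on_host, the HAVE_MPI guard macro, and the node names
in the note below are made up for illustration; only the MPI_* calls and
the "host" key come from the standard. Whether the key is honored depends
on the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pick a host round-robin so that successive
 * spawns are directed at different machines in the allocation. */
const char *pick_host(const char *hosts[], size_t nhosts, size_t i)
{
    assert(hosts != NULL && nhosts > 0);
    return hosts[i % nhosts];
}

#ifdef HAVE_MPI  /* assumed guard: build with an MPI compiler wrapper and -DHAVE_MPI */
#include <mpi.h>

/* Hypothetical helper: spawn one copy of `command`, asking the MPI
 * library (via the reserved "host" info key) to place it on `host`.
 * Returns the code reported by MPI_Comm_spawn. */
int spawn_on_host(const char *command, const char *host, MPI_Comm *child)
{
    MPI_Info info;
    int rc;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);   /* placement hint for the child */

    /* One child, spawned by this process alone (root 0 of MPI_COMM_SELF). */
    rc = MPI_Comm_spawn(command, MPI_ARGV_NULL, 1, info, 0,
                        MPI_COMM_SELF, child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    return rc;
}
#endif
```

After MPI_Init, the parent would call something like
spawn_on_host("./worker", pick_host(hosts, nhosts, i), &child) for each
child i, where hosts holds machine names from your allocation (for
example "node0", "node1", "node2"). Spawning one child per call keeps one
intercommunicator per child, which fits the per-process fault tolerance
scheme you describe.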

Open MPI, in the current release, does not do anything with this key.
This has been fixed in subversion (as of r11039) and will be in the next
release of Open MPI.

If you want to use this functionality in the near term I would suggest
using the nightly build of the subversion trunk available here:
http://www.open-mpi.org/nightly/trunk/


> One possible route I considered was using something like SLURM to
> distribute the jobs, and just putting '+' in the machine file.  Will
> this work?  Is this the best route to go?

Off the top of my head, I'm not sure whether that would work or not. The
best/cleanest route would be to pass an MPI_Info object with the {host}
key set to MPI_Comm_spawn.

Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
this scenario.

Hope that helps,
Josh

>
> Thanks for any help with this.
>
> Byron
