Re: [OMPI users] Fault Tolerant Method

Ralph Castain Fri, 28 Jul 2006 18:21:33 -0400

Actually, we had a problem in our implementation that caused the system to
continually reuse the same machine allocations for each "spawn" request. In
other words, we always started with the top of the machine_list whenever
your program called comm_spawn. This appears to have been the source of the
behavior you describe.


You don't need to use the MPI_Info key to solve that problem - it has been
fixed in the subversion repository, and will be included in the next
release. If all you want is to have your new processes be placed beginning
with the next process slot in your allocation (as opposed to overlaying the
existing processes), then you don't need to do anything.

On the other hand, if you want the new processes to go to a specific set of
hosts, then you need to follow Josh's suggestions.

Hope that helps
Ralph


On 7/28/06 8:38 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

>> I have implemented the fault tolerance method in which you would use
>> MPI_COMM_SPAWN to dynamically create communication groups and use
>> those communicators for a form of process fault tolerance (as
>> described by William Gropp and Ewing Lusk in their 2004 paper),
>> but am having some problems getting it to work the way I intended.
>> Basically, when it runs, it is spawning all the processes on the
>> same machine (as it always starts at the top of the machine_list
>> when spawning a process).  Is there a way that I can get these
>> processes to spawn on different machines?
>> 
> 
> In Open MPI (and most other MPI implementations) you will be restricted to
> using only the machines in your allocation when you use MPI_Comm_spawn*.
> The standard lets you suggest to MPI_Comm_spawn where to place the
> 'children' that it creates by passing an MPI_Info key -- specifically the
> {host} key referenced here:
> http://www.mpi-forum.org/docs/mpi-20-html/node97.htm#Node97
> MPI_Info is described here:
> http://www.mpi-forum.org/docs/mpi-20-html/node53.htm#Node53
> 
> Open MPI, in the current release, does not do anything with this key.
> This has been fixed in subversion (as of r11039) and will be in the next
> release of Open MPI.
> 
> If you want to use this functionality in the near term I would suggest
> using the nightly build of the subversion trunk available here:
> http://www.open-mpi.org/nightly/trunk/
> 
> 
>> One possible route I considered was using something like SLURM to
>> distribute the jobs, and just putting '+' in the machine file.  Will
>> this work?  Is this the best route to go?
> 
> Off the top of my head, I'm not sure if that would work or not. The
> best/cleanest route would be to use the MPI_Info argument and the {host}
> key.
> 
> Let us know if you have any trouble with MPI_Comm_spawn or MPI_Info in
> this scenario.
> 
> Hope that helps,
> Josh
> 
>> 
>> Thanks for any help with this.
>> 
>> Byron
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
