On Apr 16, 2013, at 15:51 , Ralph Castain <r...@open-mpi.org> wrote:

> Just curious: I thought ULFM dealt with recovering an MPI job where one or 
> more processes fail. Is this correct?

It depends on which definition of "recovering" you take. ULFM is about 
leaving the processes that remain (after a fault or a disconnect) in a state 
that allows them to continue to make progress. It is not about recovering 
processes or user data, but it does provide the minimal set of 
functionality an application needs to do that itself, if needed (revoke, 
agreement and shrink).
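
To make the three operations concrete, here is a minimal sketch of what they 
mean, written as a toy in-process Python model rather than real MPI code. The 
ToyComm class and its method names are hypothetical, invented for this 
illustration, and modeling agreement as a bitwise AND over the survivors' 
flags is an assumption of the sketch; the actual interface is defined in the 
ULFM document.

```python
# Toy model of revoke / agreement / shrink. NOT the ULFM API:
# ToyComm and all of its methods are hypothetical names.

class ToyComm:
    def __init__(self, ranks):
        self.ranks = list(ranks)   # process ids that belong to this communicator
        self.failed = set()        # ids known to have failed or left
        self.revoked = False       # True once a survivor has revoked it

    def fail(self, pid):
        """Record that one process died or disconnected."""
        self.failed.add(pid)

    def survivors(self):
        """Return the ids of the processes that are still alive."""
        return [p for p in self.ranks if p not in self.failed]

    def revoke(self):
        """Mark the communicator unusable so every survivor notices the fault."""
        self.revoked = True

    def agree(self, flags):
        """Survivors settle on one value (assumed here: AND of their flags)."""
        result = ~0
        for pid in self.survivors():
            result &= flags[pid]
        return result

    def shrink(self):
        """Build a new communicator that contains only the survivors."""
        return ToyComm(self.survivors())


comm = ToyComm(range(4))
comm.fail(2)                          # process 2 leaves without finalizing
comm.revoke()                         # a survivor interrupts pending communication
ok = comm.agree({0: 1, 1: 1, 3: 1})   # survivors agree to continue
new_comm = comm.shrink()              # work goes on with processes 0, 1 and 3
```

Nothing here restores process 2 or its data; the survivors only end up with a 
communicator they can keep using, which is the point made above.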

> HLA/RTI consists of processes that start at random times, run to completion, 
> and then exit normally. While a failure could occur, most process 
> terminations are normal and there is no need/intent to revive them.

As I said above, there is no revival of processes in ULFM, and it was never our 
intent to provide such a feature. The dynamic world is to be constructed using 
MPI-2 constructs (MPI_Comm_spawn, MPI_Comm_connect/MPI_Comm_accept, or even 
MPI_Comm_join).

> So it's mostly a case of massively exercising MPI's dynamic 
> connect/accept/disconnect functions.
> 
> Do ULFM's structures have some utility for that purpose?

Absolutely. If a departing process calls exit() instead of MPI_Finalize, the 
ULFM version of the runtime interprets this as an event that triggers a 
report. All the ensuing mechanisms are then activated, and the application can 
react to the event in whatever way is most meaningful for it.

  George.

> 
> 
> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> There is an ongoing effort to address the potential volatility of processes 
>> in MPI called ULFM. There is a working version available at 
>> http://fault-tolerance.org. It supports TCP, sm and IB (mostly). You will 
>> find some examples, and the document explaining the additional constructs 
>> needed in MPI to achieve this.
>> 
>>   George.
>> 
>> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzin...@gmail.com> 
>> wrote:
>> 
>>> That would seem to preclude its use for an RTI.  Unless you have a card up 
>>> your sleeve?
>>>  
>>> ---John
>>> 
>>> 
>>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> It isn't the fact that there are multiple programs being used - we support 
>>> that just fine. The problem with HLA/RTI is that it allows programs to 
>>> come/go at will - i.e., not every program has to start at the same time, 
>>> nor complete at the same time. MPI requires that all programs be executing 
>>> at the beginning, and that all call finalize prior to anyone exiting.
>>> 
>>> 
>>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <john.chludzin...@gmail.com> 
>>> wrote:
>>> 
>>>> I just received an e-mail notifying me that MPI-2 supports MPMD.  This 
>>>> would seem to be just what the doctor ordered?
>>>>  
>>>> ---John
>>>> 
>>>> 
>>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> FWIW: some of us are working on a variant of MPI that would indeed support 
>>>> what you describe - it would support send/recv (i.e., MPI-1), but not 
>>>> collectives, and so would allow communication between arbitrary programs.
>>>> 
>>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that 
>>>> conformed to that standard could be created.
>>>> 
>>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <john.chludzin...@gmail.com> 
>>>> wrote:
>>>> 
>>>> > This would be a departure from the SPMD paradigm that seems central to
>>>> > MPI's design. Each process would be a completely different program
>>>> > (piece of code) and I'm not sure how well that would work using
>>>> > MPI?
>>>> >
>>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>>> > communication between LPs (federates in HLA). But these LPs are
>>>> > usually the same program.
>>>> >
>>>> > ---John
>>>> >
>>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>>> > <john.chludzin...@gmail.com> wrote:
>>>> >> Is anyone aware of an MPI based HLA/RTI (DoD High Level Architecture
>>>> >> (HLA) / Runtime Infrastructure)?
>>>> >>
>>>> >> ---John
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > us...@open-mpi.org
>>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
