So the apparent conclusion of this thread is that an (Open)MPI-based RTI is very doable, provided we allow for the future development of dynamic joining and leaving of the MPI collective?
---John

On Wed, Apr 17, 2013 at 12:45 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Thanks for the clarification - very interesting indeed! I'll look at it
> more closely.
>
> On Apr 17, 2013, at 9:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> On Apr 16, 2013, at 15:51, Ralph Castain <r...@open-mpi.org> wrote:
>
> Just curious: I thought ULFM dealt with recovering an MPI job where one or
> more processes fail. Is this correct?
>
> It depends on which definition of "recovering" you take. ULFM is about
> leaving the processes that remain (after a fault or a disconnect) in a
> state that allows them to continue to make progress. It is not about
> recovering processes or user data, but it does provide the minimal set of
> functionalities to allow an application to do this, if needed (revoke,
> agreement, and shrink).
>
> HLA/RTI consists of processes that start at random times, run to
> completion, and then exit normally. While a failure could occur, most
> process terminations are normal and there is no need/intent to revive them.
>
> As I said above, there is no revival of processes in ULFM, and it was
> never our intent to have such a feature. The dynamic world is to be
> constructed using MPI-2 constructs (MPI_Spawn or MPI_Connect/Accept or even
> MPI_Join).
>
> So it's mostly a case of massively exercising MPI's dynamic
> connect/accept/disconnect functions.
>
> Do ULFM's structures have some utility for that purpose?
>
> Absolutely. If the process that leaves calls exit() instead of calling
> MPI_Finalize, this will be interpreted by the version of the runtime in
> ULFM as an event triggering a report. All the ensuing mechanisms are then
> activated, and the application can react to this event with the most
> meaningful approach it can envision.
>
> George.
>
> On Apr 16, 2013, at 3:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> There is an ongoing effort to address the potential volatility of
> processes in MPI, called ULFM.
> There is a working version available at http://fault-tolerance.org. It
> supports TCP, sm, and IB (mostly). You will find some examples, and the
> document explaining the additional constructs needed in MPI to achieve
> this.
>
> George.
>
> On Apr 15, 2013, at 17:29, John Chludzinski <john.chludzin...@gmail.com>
> wrote:
>
> That would seem to preclude its use for an RTI. Unless you have a card up
> your sleeve?
>
> ---John
>
> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> It isn't the fact that there are multiple programs being used - we
>> support that just fine. The problem with HLA/RTI is that it allows
>> programs to come/go at will - i.e., not every program has to start at the
>> same time, nor complete at the same time. MPI requires that all programs
>> be executing at the beginning, and that all call finalize prior to anyone
>> exiting.
>>
>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <john.chludzin...@gmail.com>
>> wrote:
>>
>> I just received an e-mail notifying me that MPI-2 supports MPMD. This
>> would seem to be just what the doctor ordered?
>>
>> ---John
>>
>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> FWIW: some of us are working on a variant of MPI that would indeed
>>> support what you describe - it would support send/recv (i.e., MPI-1),
>>> but not collectives, and so would allow communication between arbitrary
>>> programs.
>>>
>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that
>>> conformed to that standard could be created.
>>>
>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <
>>> john.chludzin...@gmail.com> wrote:
>>>
>>> > This would be a departure from the SPMD paradigm that seems central to
>>> > MPI's design. Each process would be a completely different program
>>> > (piece of code), and I'm not sure how well that would work using MPI?
>>> >
>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>> > communication between LPs (federates in HLA). But these LPs are
>>> > usually the same program.
>>> >
>>> > ---John
>>> >
>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>> > <john.chludzin...@gmail.com> wrote:
>>> >> Is anyone aware of an MPI-based HLA/RTI (DoD High Level Architecture
>>> >> (HLA) / Runtime Infrastructure)?
>>> >>
>>> >> ---John
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users