Just curious: I thought ULFM dealt with recovering an MPI job where one or more processes fail. Is this correct?
HLA/RTI consists of processes that start at random times, run to completion, and then exit normally. While a failure could occur, most process terminations are normal and there is no need/intent to revive them. So it's mostly a case of massively exercising MPI's dynamic connect/accept/disconnect functions. Do ULFM's structures have some utility for that purpose?

On Apr 16, 2013, at 3:20 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> There is an ongoing effort to address the potential volatility of processes
> in MPI called ULFM. There is a working version available at
> http://fault-tolerance.org. It supports TCP, sm and IB (mostly). You will
> find some examples, and the document explaining the additional constructs
> needed in MPI to achieve this.
>
> George.
>
> On Apr 15, 2013, at 17:29 , John Chludzinski <john.chludzin...@gmail.com>
> wrote:
>
>> That would seem to preclude its use for an RTI. Unless you have a card up
>> your sleeve?
>>
>> ---John
>>
>>
>> On Mon, Apr 15, 2013 at 11:23 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> It isn't the fact that there are multiple programs being used - we support
>> that just fine. The problem with HLA/RTI is that it allows programs to
>> come/go at will - i.e., not every program has to start at the same time, nor
>> complete at the same time. MPI requires that all programs be executing at
>> the beginning, and that all call finalize prior to anyone exiting.
>>
>>
>> On Apr 15, 2013, at 8:14 AM, John Chludzinski <john.chludzin...@gmail.com>
>> wrote:
>>
>>> I just received an e-mail notifying me that MPI-2 supports MPMD. This
>>> would seem to be just what the doctor ordered?
>>>
>>> ---John
>>>
>>>
>>> On Mon, Apr 15, 2013 at 11:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> FWIW: some of us are working on a variant of MPI that would indeed support
>>> what you describe - it would support send/recv (i.e., MPI-1), but not
>>> collectives, and so would allow communication between arbitrary programs.
>>>
>>> Not specifically targeting HLA/RTI, though I suppose a wrapper that
>>> conformed to that standard could be created.
>>>
>>> On Apr 15, 2013, at 7:50 AM, John Chludzinski <john.chludzin...@gmail.com>
>>> wrote:
>>>
>>> > This would be a departure from the SPMD paradigm that seems central to
>>> > MPI's design. Each process would be a completely different program
>>> > (piece of code) and I'm not sure how well that would work using
>>> > MPI?
>>> >
>>> > BTW, MPI is commonly used in the parallel discrete event world for
>>> > communication between LPs (federates in HLA). But these LPs are
>>> > usually the same program.
>>> >
>>> > ---John
>>> >
>>> > On Mon, Apr 15, 2013 at 10:22 AM, John Chludzinski
>>> > <john.chludzin...@gmail.com> wrote:
>>> >> Is anyone aware of an MPI-based HLA/RTI (DoD High Level Architecture
>>> >> (HLA) / Runtime Infrastructure)?
>>> >>
>>> >> ---John
>>> > _______________________________________________
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users