Re: [OMPI users] MPI based HLA/RTI ?

2013-04-16 Thread George Bosilca
There is an ongoing effort, called ULFM, to address the potential volatility of processes in MPI. A working version is available at http://fault-tolerance.org. It supports TCP, sm, and IB (mostly). You will find some examples, and the document explaining the additional constructs needed in M

Re: [OMPI users] MPI based HLA/RTI ?

2013-04-16 Thread Ralph Castain
Just curious: I thought ULFM dealt with recovering an MPI job where one or more processes fail. Is this correct? HLA/RTI consists of processes that start at random times, run to completion, and then exit normally. While a failure could occur, most process terminations are normal and there is no