There is an ongoing effort, called ULFM, to address the potential volatility of
processes in MPI. A working version is available at
http://fault-tolerance.org. It supports TCP, sm and IB (mostly). You will find
some examples and the document explaining the additional constructs needed in
M
Just curious: I thought ULFM dealt with recovering an MPI job where one or more
processes fail. Is this correct?
HLA/RTI consists of processes that start at random times, run to completion,
and then exit normally. While a failure could occur, most process terminations
are normal and there is no