Hi, George!
I studied the ULFM document before beginning the tests with failure
detection in Open MPI, and it seems to me a good choice.
But I'm having trouble with the ULFM-enabled version of Open MPI
(openmpi-1.7ft_b3.tar.gz). I followed the ULFM setup (at
http://fault-tolerance.org/ulfm/ulfm-setup/). T
Edson,
Based on your questions I would suggest you take a look at the ULFM-enabled
version of Open MPI. You can find it at http://fault-tolerance.org/.
George.
On Aug 11, 2013, at 15:33, Edson Tavares de Camargo
wrote:
> Thanks a lot for your reply, Ralph!
>
> Could you tell me in what si
On Aug 11, 2013, at 6:33 AM, Edson Tavares de Camargo
wrote:
> Thanks a lot for your reply, Ralph!
>
> Could you tell me in what situation the error handler would be called in
> the 1.6.5 version?
Only when an error is detected in the MPI layer
>
> I had thought that a failure in a process
Thanks a lot for your reply, Ralph!
Could you tell me in what situation the error handler would be called in
the 1.6.5 version?
I had thought that a failure in a process would be caught by the error
handler. Wouldn't killing or aborting the process lead to the same behaviour?
In the 1.7.4 release if a proc
The error handler wouldn't be called in that situation - we simply abort the
job. We expect to provide that integration in something like the 1.7.4 release
milestone.
On Aug 10, 2013, at 11:07 AM, Edson Tavares de Camargo
wrote:
> Hi All,
>
> I was looking for posts about fault tolerance in
Hi All,
I was looking for posts about fault tolerance in MPI and I found the post
below:
http://www.open-mpi.org/community/lists/users/2012/06/19658.php
I am trying to understand all the work on failure detection present in
Open MPI. So, I began with a simple application, a ring application
(rin
The official support page for the C/R features is hosted by Indiana
University (linked from the Open MPI FAQs):
http://osl.iu.edu/research/ft/ompi-cr/
The instructions probably need to be cleaned up (some of the release
references are not quite correct any longer). But the following should
give
THANK YOU for your detailed answer.
[quote]If you want a fault tolerance feature, such as automatic
checkpoint/restart recovery, you will need to create a build of OpenMPI
with that feature enabled. There are instructions on the various links
above about how to do so.[/quote]
Could you give me some