Tim,

FT-MPI is gone, but the ideas it put forward have been refined, and the
software algorithms behind them improved, in a newer (and supported)
project, ULFM. It features a smaller API and a much more flexible approach.
You can find more information about it at http://fault-tolerance.org/. The
corresponding implementation (based on an older version of Open MPI, 1.6) is
available at https://bitbucket.org/icldistcomp/ulfm

  George.



On Thu, Jun 8, 2017 at 9:17 AM, Tim Burgess <ozburgess+o...@gmail.com>
wrote:

> Hi!
>
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
>
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure?  This is
> using the tcp btl, under slurm.
>
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining.  So from an application
> standpoint I believe we should be able to handle this.  However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
>
> If not, do any users know of another MPI implementation that might
> work for this use case?  As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
>
> Thanks in advance,
>
> Tim
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>