Tim,

FT-MPI is gone, but the ideas it put forward have been refined, and the software algorithms behind them improved, in a newer (and supported) project, ULFM. It features a smaller API and a much more flexible approach. You can find more information about it at http://fault-tolerance.org/. The corresponding implementation (based on an older version of Open MPI 1.6) is available at https://bitbucket.org/icldistcomp/ulfm
George.

On Thu, Jun 8, 2017 at 9:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> Hi!
>
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
>
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure? This is
> using the tcp btl, under slurm.
>
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining. So from an application
> standpoint I believe we should be able to handle this. However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
>
> If not, do any users know of another MPI implementation that might
> work for this use case? As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
>
> Thanks in advance,
>
> Tim
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
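To make the recovery pattern described in the quoted message concrete, here is a minimal sketch of a master that drops workers which have silently disappeared and reissues their tasks to the remaining ones. It is a toy model in plain Python, not MPI or ULFM code: the function `run_master`, the `fails_after` failure model, and the worker names are all hypothetical, and a real application would detect failures through its communication layer rather than a lookup table.

```python
from collections import deque


def run_master(tasks, workers, fails_after):
    """Toy master loop (hypothetical, no MPI involved).

    tasks       -- iterable of task identifiers
    workers     -- list of worker names
    fails_after -- dict mapping a worker name to the number of tasks it
                   completes before silently disappearing (assumed
                   failure model for this sketch only)
    Returns (results, survivors), where results maps task -> worker.
    """
    pending = deque(tasks)
    alive = deque(workers)
    completed = {w: 0 for w in workers}
    results = {}

    while pending:
        if not alive:
            raise RuntimeError("no workers left to run the remaining tasks")
        task = pending.popleft()
        worker = alive.popleft()
        if completed[worker] >= fails_after.get(worker, float("inf")):
            # The worker has disappeared: do not put it back in rotation,
            # and requeue the task so a surviving worker picks it up.
            pending.appendleft(task)
            continue
        results[task] = worker
        completed[worker] += 1
        alive.append(worker)  # round-robin over the surviving workers

    return results, list(alive)


if __name__ == "__main__":
    results, survivors = run_master(range(6), ["w0", "w1", "w2"], {"w1": 1})
    print("task assignment:", results)
    print("surviving workers:", survivors)
```

The point of the sketch is only that the master never depends on any single worker: every task ends up assigned as long as at least one worker survives. Whether the MPI runtime lets the job keep running after a node failure is a separate question, which is what the original message asks about.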