It has been a while since I tested it, but I believe the --enable-recovery option might do what you want.
> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>
> Hi!
>
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
>
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure? This is
> using the tcp btl, under slurm.
>
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining. So from an application
> standpoint I believe we should be able to handle this. However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
>
> If not, do any users know of another MPI implementation that might
> work for this use case? As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
>
> Thanks in advance,
>
> Tim
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users