It has been a while since I tested it, but I believe the --enable-recovery
option to mpirun might do what you want.
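
As a rough sketch of how that might look on the command line, assuming your
mpirun still accepts the option (I have not re-verified this against a current
release): the process count and the ./your_app path below are placeholders,
not taken from your setup, and the btl selection just mirrors the tcp btl you
mentioned. The snippet only prints the command so it can be checked before
running it inside the slurm allocation.

```shell
# Placeholders (assumptions): adjust the process count and application path.
NP=4
APP=./your_app

# --enable-recovery is the option suggested above; whether it keeps the job
# alive after a node failure in your version is untested here.
# --mca btl tcp,self selects the TCP BTL plus the loopback component.
CMD="mpirun --enable-recovery --mca btl tcp,self -np $NP $APP"

# Print the command for review instead of launching it.
echo "$CMD"
```

If the job still aborts with this, the option is probably not honored by the
version you are running.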

> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
> 
> Hi!
> 
> So I know from searching the archive that this is a repeated topic of
> discussion here, and apologies for that, but since it's been a year or
> so I thought I'd double-check whether anything has changed before
> really starting to tear my hair out too much.
> 
> Is there a combination of MCA parameters or similar that will prevent
> ORTE from aborting a job when it detects a node failure?  This is
> using the tcp btl, under slurm.
> 
> The application, not written by us and too complicated to re-engineer
> at short notice, has a strictly master-slave communication pattern.
> The master never blocks on communication from individual slaves, and
> apparently can itself detect slaves that have silently disappeared and
> reissue the work to those remaining.  So from an application
> standpoint I believe we should be able to handle this.  However, in
> all my testing so far the job is aborted as soon as the runtime system
> figures out what is going on.
> 
> If not, do any users know of another MPI implementation that might
> work for this use case?  As far as I can tell, FT-MPI has been pretty
> quiet the last couple of years?
> 
> Thanks in advance,
> 
> Tim
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
