Hi!

I know from searching the archive that this is a recurring topic of
discussion here, and apologies for raising it again, but since it's
been a year or so I thought I'd check whether anything has changed
before I start tearing my hair out.

Is there a combination of MCA parameters or similar that will prevent
ORTE from aborting a job when it detects a node failure?  This is
with the tcp BTL, running under Slurm.

The application, not written by us and too complicated to re-engineer
at short notice, has a strictly master-slave communication pattern.
The master never blocks on communication from individual slaves, and
apparently can itself detect slaves that have silently disappeared and
reissue the work to those remaining.  So from an application
standpoint I believe we should be able to handle this.  However, in
all my testing so far the job is aborted as soon as the runtime system
detects the node failure.
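
To make the pattern concrete, here is a rough sketch of what I
understand the master to be doing.  It is a toy simulation with no MPI
in it, not the application's actual code: the function name, the
logical "steps", the one-step task duration and the timeout rule are
all assumptions made up for illustration.

```python
from collections import deque


def run_master(tasks, workers, dies_at, timeout=3):
    """Toy model of a master that reissues work lost to silent worker failures.

    tasks   -- list of task ids to complete
    workers -- list of worker ids
    dies_at -- dict: worker id -> logical step from which it stops replying
    timeout -- steps without a reply before the master gives up on a worker

    Assumption of this model: a live worker finishes a task in one step.
    """
    pending = deque(tasks)   # work not yet handed out
    in_flight = {}           # worker -> (task, step at which it was issued)
    alive = set(workers)     # workers the master still trusts
    done = {}                # task -> worker that completed it
    step = 0

    while len(done) < len(tasks):
        if not alive:
            raise RuntimeError("all workers lost")

        # Hand pending work to every idle worker the master still trusts.
        for w in sorted(alive):
            if w not in in_flight and pending:
                in_flight[w] = (pending.popleft(), step)

        step += 1

        # Poll outstanding work; the master never waits on a single worker.
        for w, (task, issued) in list(in_flight.items()):
            if step < dies_at.get(w, float("inf")):
                # Worker is still up and has replied with its result.
                done[task] = w
                del in_flight[w]
            elif step - issued >= timeout:
                # No reply for too long: treat the worker as gone
                # and put its task back in the queue for the others.
                alive.discard(w)
                pending.append(task)
                del in_flight[w]

    return done, sorted(set(workers) - alive)


if __name__ == "__main__":
    done, lost = run_master(list(range(6)), ["a", "b", "c"], {"b": 2})
    print("completed by:", done)
    print("workers given up on:", lost)
```

The point is that nothing in the master's logic needs the lost worker
to come back, so the application could carry on if the runtime let the
surviving processes keep running.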

If there is no such option, does anyone know of another MPI
implementation that might work for this use case?  As far as I can
tell, FT-MPI has been pretty quiet for the last couple of years.

Thanks in advance,

Tim
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
