Hi! I know from searching the archive that this is a recurring topic here, and apologies for that, but since it's been a year or so I thought I'd double-check whether anything has changed before I start tearing my hair out.
Is there a combination of MCA parameters or similar that will prevent ORTE from aborting a job when it detects a node failure? This is using the tcp btl, under Slurm.

The application, not written by us and too complicated to re-engineer at short notice, has a strictly master-slave communication pattern. The master never blocks on communication from individual slaves, and apparently can itself detect slaves that have silently disappeared and reissue their work to those remaining. So from an application standpoint I believe we should be able to handle this. However, in all my testing so far the job is aborted as soon as the runtime system figures out what is going on.

If not, do any users know of another MPI implementation that might work for this use case? As far as I can tell, FT-MPI has been pretty quiet the last couple of years?

Thanks in advance,
Tim
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users