Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Actually, the error message is coming from mpirun to indicate that it lost connection to one (or more) of its daemons. This happens because slurm only knows about the remote daemons - mpirun was started outside of “srun”, and so slurm doesn’t know it exists. Thus, when slurm kills the job, it on…
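
A minimal sketch of the launch pattern being described (the script contents, node counts, and application name are illustrative, not from the thread):

```
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
# mpirun itself runs on the first allocated node, outside of srun, so
# slurm has no record of it. slurm only sees the orted daemons that
# mpirun launches on the other nodes (via srun under the hood). When
# slurm tears those daemons down, mpirun loses its connections to them
# and prints the error described above.
mpirun -np 128 ./my_app
```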

Re: [OMPI users] Node failure handling

2017-06-27 Thread George Bosilca
I would also be interested in having slurm keep the remaining processes around; we have been struggling with this on many of the NERSC machines. That being said, the error message comes from orted, and it suggests that they are giving up because they lose connection to a peer. I was not aware…

Re: [OMPI users] Node failure handling

2017-06-26 Thread r...@open-mpi.org
Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a bug if we can’t.

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph, Thanks for the quick response. Just tried again not under slurm, but the same result... (though I just did kill -9 orted on the remote node this time) Any ideas? Do you think my multiple-mpirun idea is worth trying? Cheers, Tim ``` [user@bud96 mpi_resilience]$ /d/home/user/2017/ope…
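
A minimal sketch of the failure-injection test described above (bud96 appears in the quoted prompt; the second hostname and the program name are placeholders):

```
# Terminal 1: run across two nodes
mpirun --host bud96,bud97 -np 2 ./resilience_test

# Terminal 2: simulate a node failure by killing the Open MPI daemon
# (orted) on the remote node, as described in the message above
ssh bud97 'pkill -9 orted'
```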

Re: [OMPI users] Node failure handling

2017-06-26 Thread r...@open-mpi.org
Ah - you should have told us you are running under slurm. That does indeed make a difference. When we launch the daemons, we do so with “srun --kill-on-bad-exit” - this means that slurm automatically kills the job if any daemon terminates. We take that measure to avoid leaving zombies behind in…
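
For reference, slurm's own behavior for the flag mentioned here (paraphrased from the srun man page; exact wording varies by slurm version, and the stand-alone command below is only an illustration):

```
# -K, --kill-on-bad-exit[=0|1]
#     Terminate the whole job step if any task exits with a non-zero
#     exit code. Open MPI launches its orted daemons through srun with
#     this behavior enabled, so the loss of a single daemon takes the
#     rest of the step down with it.
srun --kill-on-bad-exit=1 -N 2 hostname
```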

Re: [OMPI users] Node failure handling

2017-06-26 Thread Tim Burgess
Hi Ralph, George, Thanks very much for getting back to me. Alas, neither of these options seems to accomplish the goal. Both in OpenMPI v2.1.1 and on a recent master (7002535), with slurm's "--no-kill" and openmpi's "--enable-recovery", once the node reboots one gets the following error…
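
A sketch of the combination being reported here (node count and application name are illustrative; per the message, this combination still aborted once the node went down):

```
# Ask slurm not to terminate the allocation when a node fails ...
salloc --nodes=4 --no-kill
# ... and ask Open MPI's runtime to attempt recovery on failures
mpirun --enable-recovery -np 64 ./my_app
```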

Re: [OMPI users] Node failure handling

2017-06-09 Thread George Bosilca
Tim, FT-MPI is gone, but the ideas it put forward have been refined, and the software algorithms behind them improved, in a newer (and supported) project, ULFM. It features a smaller API, with a much more flexible approach. You can find more information about it at http://fault-tolerance.org/. The…

Re: [OMPI users] Node failure handling

2017-06-09 Thread r...@open-mpi.org
It has been a while since I tested it, but I believe the --enable-recovery option might do what you want.
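
A sketch of the option being suggested (process count and application name are illustrative; the MCA-parameter spelling is a recollection and worth verifying on your build):

```
# Command-line flag form mentioned in the message:
mpirun --enable-recovery -np 16 ./my_app

# MCA-parameter form (name as recalled; verify with ompi_info):
mpirun -mca orte_enable_recovery 1 -np 16 ./my_app
```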

[OMPI users] Node failure handling

2017-06-08 Thread Tim Burgess
Hi! So I know from searching the archive that this is a repeated topic of discussion here, and apologies for that, but since it's been a year or so I thought I'd double-check whether anything has changed before really starting to tear my hair out too much. Is there a combination of MCA parameters…
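
One way to survey which fault-handling parameters a given Open MPI build actually exposes, as a first step toward the question asked here (the grep pattern is only a starting point; names and availability differ between versions):

```
# List every MCA parameter and filter for fault-tolerance-related knobs
ompi_info --all | grep -i -E 'recovery|abort|ft'
```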