Hi Ralph, George,
Thanks very much for getting back to me. Alas, neither of these
options seems to accomplish the goal. Both in OpenMPI v2.1.1 and on a
recent master (7002535), with slurm's "--no-kill" and openmpi's
"--enable-recovery", once the node reboots one gets the following
error:
```
[user@bud96 mpi_resilience]$
/d/home/user/2017/ope
```

---

Ah - you should have told us you are running under slurm. That does indeed make
a difference. When we launch the daemons, we do so with "srun
--kill-on-bad-exit" - this means that slurm automatically kills the job if any
daemon terminates. We take that measure to avoid leaving zombies behind in

---

Hi Ralph,

Thanks for the quick response.

Just tried again not under slurm, but the same result... (though I
just did kill -9 orted on the remote node this time)

Any ideas? Do you think my multiple-mpirun idea is worth trying?

Cheers,
Tim

---

Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s a
bug if we can’t.
> On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
>
> Hi Ralph,
>
> Thanks for the quick response.
>
> Just tried again not under slurm, but the same result... (though I
> just did kill -9 orted o