On Fri, 2019-09-27 at 14:58:40 +0200, Rafał Kędziorski wrote:
> On Fri, 27 Sept 2019 at 13:50, Steffen Grunewald
> <steffen.grunew...@aei.mpg.de> wrote:
>
> > On Fri, 2019-09-27 at 11:19:16 +0200, Juergen Salk wrote:
> > >
> > > you may try setting `ReturnToService=2` in slurm.conf.
> > >
> >
> > Caveat: A spontaneously rebooting machine may create a "black hole" this
> > way.
> >
>
> How do you mean this? Could ReturnToService=2 be a problem?

For us it was - we had (and still have) nodes spontaneously rebooting.
If they come back up as idle, they will eat the next job, and so on ad
infinitum - thus we've set ReturnToService=0.
"Black hole" is meant figuratively: such a node keeps swallowing all it
can get its hands on.

You've got to decide what's worse: have full control over machines
rebooted intentionally, or have full control over misbehaving ones.
My own choice is clear.

- S
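
For reference, a minimal sketch of the setting discussed above. The descriptions of the three values are paraphrased from the slurm.conf documentation as I understand it, and the node name in the comment is made up for illustration; check the man page for your Slurm version before relying on it.

```
# slurm.conf (fragment)
# ReturnToService decides when a DOWN node goes back into service:
#   0 - node stays DOWN until an administrator explicitly changes its state
#   1 - node returns on its own only if it was set DOWN for being non-responsive
#   2 - node returns as soon as it registers with a valid configuration,
#       whatever the reason it was DOWN (this includes an unexpected reboot)
ReturnToService=0

# With 0, a rebooted node has to be released by hand once it has been checked, e.g.
#   scontrol update NodeName=node042 State=RESUME
# (node042 is a hypothetical node name)
```

The trade-off is the one described above: with 2, intentionally rebooted nodes come back without intervention, but so do nodes that rebooted on their own; with 0, every reboot needs a manual resume.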