Hi,

I'm wondering where the UnkillableStepProgram is actually executed. According to Chris, it has to be available on every compute node. That only makes sense if it is executed there.
But the slurm.conf man page for 21.08.x states:

UnkillableStepProgram
    Must be executable by user SlurmUser. The file must be accessible by the primary and backup control machines.

So I would expect it to be executed on the controller node.

Best,
Stefan

On Tuesday, 23 March 2021, 05:30:01 CET, Chris Samuel wrote:
> Hi Mike,
>
> On 22/3/21 7:12 pm, Yap, Mike wrote:
> > # I presume UnkillableStepTimeout is set in slurm.conf. and it act as a
> > timer to trigger UnkillableStepProgram
>
> That is correct.
>
> > # UnkillableStepProgram can be use to send email or reboot compute node
> > – question is how do we configure it ?
>
> Also - or to automate collecting debug info (which is what we do) and
> then we manually intervene to reboot the node once we've determined
> there's no more useful info to collect.
>
> It's just configured in your slurm.conf.
>
> UnkillableStepProgram=/path/to/the/unkillable/step/script.sh
>
> Of course this script has to be present on every compute node.
>
> All the best,
> Chris

--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany
E-Mail : staeg...@informatik.uni-freiburg.de
WWW    : ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
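
One way to settle the question above empirically might be a small probe script that records which host and user ran it. This is only a minimal sketch: the log location and the UNKILLABLE_PROBE_LOG variable are hypothetical, and the script would have to be installed at whatever path UnkillableStepProgram points to in slurm.conf.

```shell
#!/bin/sh
# Hypothetical probe for UnkillableStepProgram (not taken from the Slurm docs).
# It appends one line with the executing host and user, so the log shows
# where Slurm actually runs the program.

# Assumed log location; override with the (hypothetical) UNKILLABLE_PROBE_LOG variable.
LOG_FILE="${UNKILLABLE_PROBE_LOG:-/tmp/unkillable_step_probe.log}"

log_probe() {
    # uname -n prints the node name of the machine executing this script.
    printf '%s host=%s user=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$(uname -n)" \
        "$(id -un)" >> "$LOG_FILE"
}

log_probe
```

After the next unkillable step, checking on which machine the log file shows up, and which host name it contains, should indicate whether the program ran on the compute node or on the controller.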