Hi,

Check the UnkillableStepProgram and UnkillableStepTimeout options in slurm.conf. We use them to drain the stuck nodes and send us mail, since, as in your case, stuck processes usually require a reboot.

Because the drained strigger never fires in this situation, we also set a finished (--fini) trigger on the next RUNNING job on the node. When that trigger fires, it either sends us mail if only the stuck processes are left, or sets another --fini strigger on the next RUNNING job.
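For illustration, here is a minimal sketch of what such an UnkillableStepProgram could look like. It is not our actual script: the file paths, the mail address, the timeout value, the helper names, and the assumption that the short hostname equals the Slurm NodeName are all made up for the example. It prints the commands by default (DRY_RUN=1) so it can be tried safely; change the default to 0 once it does what you want.

```shell
#!/bin/sh
# Sketch of an UnkillableStepProgram. Assumed slurm.conf entries
# (the path and the timeout value are examples, not from this thread):
#   UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
#   UnkillableStepTimeout=120

ADMIN_MAIL="${ADMIN_MAIL:-root}"    # assumption: where alerts should go
# Hypothetical follow-up script run when a job on the node finishes.
FINI_PROGRAM="${FINI_PROGRAM:-/usr/local/sbin/stuck_node_check.sh}"
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print commands, 0 = run them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drain the node so that no new jobs are scheduled onto it.
drain_node() {
    run scontrol update "NodeName=$1" State=DRAIN "Reason=unkillable step"
}

# Mail the admins; the node will most likely need a reboot.
notify_admins() {
    echo "Node $1 has an unkillable job step and was set to drain." |
        run mail -s "Slurm: unkillable step on $1" "$ADMIN_MAIL"
}

# Set a --fini trigger on a job still running on the node.
# Note: strigger --set normally has to be run as SlurmUser.
arm_fini_trigger() {
    run strigger --set "--jobid=$1" --fini "--program=$FINI_PROGRAM"
}

main() {
    # Assumption: short hostname == Slurm NodeName.
    node=$(uname -n)
    node=${node%%.*}
    drain_node "$node"
    notify_admins "$node"
}

main "$@"
```

The job ID for arm_fini_trigger would have to come from something like squeue restricted to the node and to RUNNING jobs. The follow-up script (FINI_PROGRAM here) would then check whether any RUNNING job remains on the node and either mail or set the next trigger.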

    Yair.

On Tue, May 28, 2019 at 7:58 PM mercan <ahmet.mer...@uhem.itu.edu.tr> wrote:

> Hi;
>
> If you did not use the epilog script, you can set the epilog script to
> clean up all residues from the finished jobs:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-prolog-and-epilog-scripts
>
> Ahmet M.
>
> On 28.05.2019 at 19:03, Matthew BETTINGER wrote:
> > We use triggers for the obvious alerts, but is there a way to make a
> > trigger for nodes stuck in CG (completing) state? Some user jobs, mostly
> > Julia notebooks, can get hung in completing state if the user kills the
> > running job or cancels it with cntrl. When this happens we can have many,
> > many nodes stuck in CG. Slurm 17.02.6. Thanks!

-- 
Yair Yarom | Senior DevOps Architect
The Rachel and Selim Benin School of Computer Science and Engineering
The Hebrew University of Jerusalem
T +972-2-5494522 | F +972-2-5494522
ir...@cs.huji.ac.il