On 5/11/20 9:52 am, Alastair Neil wrote:
> [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
This caught my eye. Googling for it found a single instance, from 2019 on this list, again about jobs on a node mysteriously dying.
The resolution was (courtesy of Uwe Seher):

# The system is an opensuse leap 15 installation and slurm
# comes from the repository. By default a slurm.epilog.clean
# script is installed which kills everything that belongs to
# the user when a job is finished including other jobs,
# ssh-sessions and so on. I do not know if other distributions
# do the same or if the script is broken, but removing it
# solved the problem.

Hope that helps!

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA