Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-29 Thread Uwe Seher
Hi all! I think I solved the problem. The system is an openSUSE Leap 15 installation and Slurm comes from the repository. By default a slurm.epilog.clean script is installed, which kills everything that belongs to the user when a job finishes, including other jobs, SSH sessions and so on. I do not

Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-28 Thread Uwe Seher
Hello! I cannot find any hints of OOM kills, but it is systemd, so I may need a little more time searching. We have 128GB of memory on the node and, as far as we know, the tasks do not use it to the limit; dependencies have also worked fine with the same tasks. Monitoring does not show any problems with memory. T

Re: [slurm-users] Running job is canceled when starting a new job from queue

2019-10-28 Thread Lech Nieroda
Hello Uwe, when the requested time limit of a job runs out, the job is cancelled and terminated with signal SIGTERM (15) and, if that should fail, later with SIGKILL (9); the job then gets the state „TIMEOUT“. However, job 161 gets killed immediately by SIGKILL and gets the state „FAILED“. That sugges

[slurm-users] Running job is canceled when starting a new job from queue

2019-10-28 Thread Uwe Seher
Hello group! While running our first jobs I got a strange issue when running multiple jobs on a single partition. The partition is a single node with 32 cores and 128GB memory. There is a queue with three jobs, each of which should use 15 cores; memory usage is not important. As planned, 2 jobs are running, s