Hi all!
I think I solved the problem.
The system is an openSUSE Leap 15 installation and Slurm comes from the
repository. By default, a slurm.epilog.clean script is installed which kills
everything that belongs to the user when a job finishes, including other
jobs, ssh sessions and so on. I do not
Hello!
I cannot find any hints of OOM kills, but it is systemd, so I may need a
little more time searching. We have 128GB of memory on the node and, as far
as we know, the tasks do not use it to the limit; dependencies have also
worked fine with the same tasks. Monitoring does not show any problems with
memory. T
Hello Uwe,
When the requested time limit of a job runs out, the job is cancelled and
terminated with signal SIGTERM (15) and, if that should fail, later on with
SIGKILL (9); the job then gets the state "TIMEOUT".
However, job 161 gets killed immediately by SIGKILL and gets the state
"FAILED". That sugges
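
For readers unfamiliar with the SIGTERM-then-SIGKILL escalation described in this message, here is a minimal Python sketch of the general pattern. It is not Slurm's implementation: the helper names `stop_with_grace` and `run_demo` and the grace periods are made up for illustration, and it assumes a POSIX system. One child exits on SIGTERM, the other ignores it and is only stopped by the SIGKILL fallback.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Hypothetical helper for illustration only; returns the child's return code.
    """
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        proc.wait()
    return proc.returncode


def run_demo():
    # Child with the default SIGTERM behaviour: it dies on the first signal.
    polite = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Child that ignores SIGTERM; it reports "ready" once the handler is set.
    stubborn = subprocess.Popen(
        [
            sys.executable,
            "-c",
            "import signal, time; "
            "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
            "print('ready', flush=True); time.sleep(60)",
        ],
        stdout=subprocess.PIPE,
    )
    stubborn.stdout.readline()  # wait until SIGTERM is actually being ignored
    polite_code = stop_with_grace(polite, 5.0)
    stubborn_code = stop_with_grace(stubborn, 0.5)
    stubborn.stdout.close()
    return polite_code, stubborn_code


if __name__ == "__main__":
    for code in run_demo():
        # On POSIX a negative return code is the number of the fatal signal.
        print(signal.Signals(-code).name)
```

A process that dies directly from SIGKILL without a preceding SIGTERM, as reported for job 161, therefore did not go through this kind of graceful escalation.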
Hello group!
While running our first jobs I got a strange issue when running multiple
jobs on a single partition.
The partition is a single node with 32 cores and 128GB memory. There is a
queue with three jobs, each of which should use 15 cores; memory usage is
not important. As planned, 2 jobs are running, s
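
The expectation that two of the three jobs run at the same time follows from the core counts alone. A small sketch of that arithmetic, assuming cores are the only constraint (memory is ignored, as stated in the message) and using a hypothetical helper name `jobs_that_fit`:

```python
def jobs_that_fit(total_cores, cores_per_job, queued_jobs):
    """Return (running, waiting) when cores are the only limiting resource."""
    # Only whole jobs can be placed, so use integer division.
    running = min(queued_jobs, total_cores // cores_per_job)
    return running, queued_jobs - running


if __name__ == "__main__":
    # Values from the message: 32 cores, 15 cores per job, 3 queued jobs.
    print(jobs_that_fit(32, 15, 3))
```

With these numbers two jobs occupy 30 of the 32 cores, and the third has to wait until one of them finishes.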