>
> I've been having the same issue with BCM (CentOS 8.2, BCM 9.0, Slurm 20.02.3).
> It seems to have started to occur when I enabled proctrack/cgroup and changed
> select/linear to select/cons_tres.
>
Our slurm.conf has the same settings:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
SchedulerTime
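Spelled out, the combination under discussion amounts to roughly these
slurm.conf lines (a sketch for comparison, not a verbatim copy of our file):
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
# the cgroup process tracking change mentioned above
ProctrackType=proctrack/cgroup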
Hello Robert,
I've been having the same issue with BCM (CentOS 8.2, BCM 9.0, Slurm
20.02.3). It seems to have started to occur when I enabled
proctrack/cgroup and changed select/linear to select/cons_tres.
Are you using cgroup process tracking and have you manipulated the
cgroup.conf file? Do jo
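For comparison, a mostly stock cgroup.conf for cgroup process tracking looks
something like the following; this is only an illustrative sketch, and the
Constrain* choices are assumptions rather than anything confirmed in this
thread:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no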
This may be more "cargo cult", but I've advised users to add a "sleep 60" to
the end of their job scripts if they are "I/O intensive". Sometimes they
manage to generate I/O in a way that makes slurm think the job is finished
while the OS is still catching up on the I/O, and then slurm tries to clean
up the job while processes are still stuck in that I/O, which is when the
node ends up drained.
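Concretely, something along these lines at the bottom of the batch script
(the application name is just a placeholder):
#!/bin/bash
#SBATCH --ntasks=1
./my_io_heavy_app   # placeholder for the user's actual workload
sleep 60            # give the filesystem time to finish flushing before the job exits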
Sure, I've seen that in some of the posts here, e.g., with a NAS. But in this
case it's an NFS share to the local RAID10 storage. Aren't there any other
settings that deal with this so a node doesn't get drained?
On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon wrote:
That can help. Usually this happens because the laggy storage the job is
using takes time to flush the job's data. So making sure that your
storage is up, responsive, and stable will also cut these down.
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen where this was a bug that was fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens
occasionally. A user cancels his/her job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf
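The value actually in effect can be double-checked with scontrol rather than
by reading the file:
scontrol show config | grep -i UnkillableStepTimeout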
Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2
Slurm Job_i