Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-02 Thread Robert Kudyba
> been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It seems to have started to occur when I enabled proctrack/cgroup and changed select/linear to select/cons_tres.
Our slurm.conf has the same setting: SelectType=select/cons_tres SelectTypeParameters=CR_CPU SchedulerTime

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-12-01 Thread William Markuske
Hello Robert, I've been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It seems to have started to occur when I enabled proctrack/cgroup and changed select/linear to select/cons_tres. Are you using cgroup process tracking and have you manipulated the cgroup.conf file? Do jo

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Alex Chekholko
This may be more "cargo cult" but I've advised users to add a "sleep 60" to the end of their job scripts if they are "I/O intensive". Sometimes they are somehow able to generate I/O in a way that slurm thinks the job is finished, but the OS is still catching up on the I/O, and then slurm tries to
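A minimal sketch of the "sleep 60" advice above, assuming a bash batch script. The job name, output file name, and the run_workload and settle helpers are hypothetical placeholders, not anything from the thread; only the 60-second pause at the end of the script comes from the message.

```shell
#!/bin/bash
#SBATCH --job-name=io-heavy        # hypothetical job name
#SBATCH --output=io-heavy-%j.out   # %j expands to the job ID

# Hypothetical stand-in for the real I/O-intensive workload.
run_workload() {
    echo "workload done"
}

# Pause before the script exits so pending writes can reach storage.
# 60 seconds is the value suggested in the thread. Outside a Slurm job
# (SLURM_JOB_ID unset) the pause is skipped so a dry run returns at once.
settle() {
    local seconds="${1:-60}"
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        sleep "$seconds"
    fi
    echo "settled"
}

run_workload
settle 60
```

Whether the pause actually prevents the node from draining depends on the storage behind the job, so treat it as a workaround to try rather than a fix.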

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this case it's an NFS share to the local RAID10 storage. Aren't there any other settings that deal with this so that a node is not drained?
On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon wrote:
> That can help. Usually this happens due to lagg

Re: [slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Paul Edmon
That can help. Usually this happens due to laggy storage the job is using taking time flushing the job's data. So making sure that your storage is up, responsive, and stable will also cut these down.
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen where this was a bug tha

[slurm-users] Kill task failed, state set to DRAINING, UnkillableStepTimeout=120

2020-11-30 Thread Robert Kudyba
I've seen that this was a bug that was fixed (https://bugs.schedmd.com/show_bug.cgi?id=3941), but it still happens occasionally. A user cancels his/her job and a node gets drained. UnkillableStepTimeout=120 is set in slurm.conf. Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2. Slurm Job_i