Adrian and Diego,

Are you using AMD EPYC processors on the systems where you see this issue? I've been having the same problem, but only on dual AMD EPYC systems. I haven't tried moving the core file location off an NFS mount yet, though, so perhaps the core simply isn't being written out in time.
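For what it's worth, the first thing I plan to check on our nodes is where the kernel is actually sending the cores; whether that really points at the NFS mount is only my assumption at this point:

    # show the current core dump destination/pattern on a compute node
    sysctl kernel.core_pattern
    # equivalently
    cat /proc/sys/kernel/core_pattern

If it turns out to be a piped handler (e.g. systemd-coredump), the NFS path may not even be involved.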

How did you disable core files?
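My guess is something along the lines of the snippet below (limits.conf on the nodes, or ulimit in a prolog), but I'd like to confirm, since I'm not sure which method you actually used:

    # /etc/security/limits.conf on the compute nodes: disable core dumps for all users
    *    hard    core    0

    # or per shell / in a prolog script:
    ulimit -c 0

Note that Slurm propagates the submitting shell's limits by default, so PropagateResourceLimitsExcept=CORE in slurm.conf may also be needed before the node-side limit takes effect.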

Regards,

Willy Markuske

HPC Systems Engineer


Research Data Services

P: (619) 519-4435

On 8/6/21 6:16 AM, Adrian Sevcenco wrote:
On 8/6/21 3:19 PM, Diego Zuccato wrote:
IIRC we increased SlurmdTimeout to 7200.
Thanks a lot!

Adrian


On 06/08/2021 13:33, Adrian Sevcenco wrote:
On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core files) and solved it by increasing the Slurm timeouts
Oh, I see... well, in principle I should not have core files, and I do not find any...

to the point that even the slowest core wouldn't trigger it. Then, once the need for core files was over, I disabled core files and restored the timeouts.
And how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300
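So I suppose the change amounts to something like this in slurm.conf; the numbers below are just placeholders pending your actual values:

    # slurm.conf: give slow core dumps more time before slurmctld declares the node dead
    SlurmdTimeout=3600      # placeholder: well above the longest observed core-write time
    SlurmctldTimeout=300    # unchanged

followed by an 'scontrol reconfigure' (or a daemon restart) to apply it.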

Thank you!
Adrian



On 06/08/2021 12:46, Adrian Sevcenco wrote:
On 8/6/21 1:27 PM, Diego Zuccato wrote:
Hi.
Hi!

Might it be due to a timeout (maybe the killed job is creating a core file, or caused heavy swap usage)?
I will have to search for the culprit...
The problem is: why would the node be put into drain just because killing a task failed? And how can I control/disable
this behaviour?
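From a quick look at the slurm.conf man page, the relevant knob seems to be UnkillableStepTimeout (how long slurmd waits for a step to die before it declares the kill failed and the node gets drained); a sketch of what I would try, not tested here yet:

    # slurm.conf: give slow-to-die job steps more time before the node is drained
    UnkillableStepTimeout=300       # seconds; the default is 60
    #UnkillableStepProgram=/path/to/script   # optional hook run when a step is deemed unkillable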

Thank you!
Adrian



BYtE,
  Diego

On 06/08/2021 09:02, Adrian Sevcenco wrote:
Having just implemented some triggers, I noticed this:

NODELIST    NODES PARTITION    STATE     CPUS  S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47      1    alien*    draining    48 48:1:1  193324   214030      1 rack-0,4 Kill task failed
alien-0-56      1    alien*    drained     48 48:1:1  193324   214030      1 rack-0,4 Kill task failed

I was wondering why a node is drained when killing a task fails, and how can I disable that? (I use cgroups.) Moreover, how can killing a task fail in the first place? (This is on Slurm 19.05.)
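For now I assume the drained nodes can be put back in service by hand, something like:

    # clear the drain once the node is otherwise healthy
    scontrol update NodeName=alien-0-47,alien-0-56 State=RESUME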

Thank you!
Adrian


