IIRC we increased SlurmdTimeout to 7200.
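In slurm.conf that is just (7200 is what worked for us; adapt it to the
slowest case you expect on your nodes):
   SlurmdTimeout=7200
and then push the new value to the daemons with:
   scontrol reconfigure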
On 06/08/2021 13:33, Adrian Sevcenco wrote:
On 8/6/21 1:56 PM, Diego Zuccato wrote:
We had a similar problem some time ago (slow creation of big core
files) and solved it by increasing the Slurm timeouts
Oh, I see... well, in principle I should not have core files, and I do
not find any...
to the point that even the slowest core dump wouldn't trigger them. Then,
once the need for core files was over, I disabled core files and
restored the timeouts.
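One standard way to turn core files off on the compute nodes, just a
sketch and not necessarily what we used (the file name is only an example):
   # /etc/security/limits.d/99-no-core.conf
   *    hard    core    0
or simply "ulimit -c 0" in the jobs' environment.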
And by how much did you increase them? I have
SlurmctldTimeout=300
SlurmdTimeout=300
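To see the values the daemons are actually running with, something like:
   scontrol show config | grep -i timeout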
Thank you!
Adrian
On 06/08/2021 12:46, Adrian Sevcenco wrote:
On 8/6/21 1:27 PM, Diego Zuccato wrote:
Hi.
Hi!
Might it be due to a timeout (maybe the killed job is creating a
core file, or causing heavy swap usage)?
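A quick way to check both on the affected node, just a sketch:
   # core files written on the root filesystem in the last few hours
   find / -xdev -type f -name 'core*' -mmin -180 2>/dev/null
   # current memory/swap pressure
   free -h; vmstat 1 5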
I will have to search for the culprit...
The problem is: why would the node be put in drain because killing a
task failed? And how can I control/disable this?
Thank you!
Adrian
BYtE,
Diego
On 06/08/2021 09:02, Adrian Sevcenco wrote:
Having just implemented some triggers, I noticed this:
NODELIST   NODES PARTITION STATE    CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
alien-0-47     1 alien*    draining   48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
alien-0-56     1 alien*    drained    48 48:1:1 193324   214030      1 rack-0,4 Kill task failed
I was wondering why a node is drained when killing a task fails,
and how can I disable that? (I use cgroups.)
Moreover, how can killing a task fail in the first place? (This is on Slurm 19.05.)
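For now the drain flag can at least be cleared by hand once the cause is
dealt with, e.g. for the first node in the listing above:
   scontrol update NodeName=alien-0-47 State=RESUME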
Thank you!
Adrian
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786