[slurm-users] Re: Node in drain state

2025-09-21 Thread Patrick Begou via slurm-users
Hi, I have also seen a node reach this "drain state" twice in the last few weeks. It is the first time on this cluster (Slurm is 24.05 on the latest setup), and I have been running Slurm for many years (with Slurm 20.11 on the oldest cluster). I found no user processes, so I simply resumed the node. Patrick Le 19

[slurm-users] Re: Node in drain state

2025-09-21 Thread Ole Holm Nielsen via slurm-users
Hi Patrick, On 9/22/25 07:39, Patrick Begou via slurm-users wrote: I have also seen a node reach this "drain state" twice in the last few weeks. It is the first time on this cluster (Slurm is 24.05 on the latest setup), and I have been running Slurm for many years (with Slurm 20.11 on the oldest cluster). No u

[slurm-users] Re: Node in drain state

2025-09-19 Thread Ole Holm Nielsen via slurm-users
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote: Is there any way to reset the node to “state=idle” after errors in the same way? First you have to investigate whether the job's user has left any processes behind on the compute node. It may very well be stale I/O from the job to a network
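The order of operations described here (look at the drain reason, check for leftover processes, then resume) can be sketched as a short checklist. This is only an illustration: the node name `node01` and user `alice` are hypothetical placeholders, and the function merely prints the commands so they can be reviewed and run by hand.

```shell
# Sketch: print the commands an admin might run, in order, for a drained node.
# NODE and JOB_USER are hypothetical placeholders, not values from this thread.
NODE="${NODE:-node01}"
JOB_USER="${JOB_USER:-alice}"

print_drain_checklist() {
    # 1. Show the reason Slurm recorded for the drain.
    echo "sinfo -R -n $NODE"
    # 2. List any processes the job's user left behind on the node.
    echo "ssh $NODE ps -u $JOB_USER -o pid,stat,etime,args"
    # 3. After the node is confirmed clean, return it to service.
    echo "scontrol update NodeName=$NODE State=RESUME"
}

print_drain_checklist
```

Resuming is deliberately the last step: if step 2 still shows processes, for example ones stuck in uninterruptible I/O, resuming the node would let new jobs land next to them.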

[slurm-users] Re: Node in drain state

2025-09-19 Thread Lorenzo Bosio via slurm-users
Hello, as an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special. Best regards, -- *Lorenzo Bosio* Tecnico di Ricerca - Laboratorio HPC4AI Dipartimento di Informatica Università degli Studi di Torino Corso
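A script of the kind described might look roughly like the sketch below. It is not the poster's script: the choice of commands, the `ADMIN_MAIL` recipient, the `SEND_MAIL` switch, and the availability of a `mail` command on the node are all assumptions made for illustration.

```shell
# Hypothetical UnkillableStepProgram sketch: gather context and mail it.
# ADMIN_MAIL, SEND_MAIL and the commands chosen are assumptions, not from the thread.
ADMIN_MAIL="${ADMIN_MAIL:-root}"

build_report() {
    echo "Unkillable step reported on $(uname -n) at $(date)"
    echo "--- processes in uninterruptible sleep (state D) ---"
    # Keep the header line plus every process whose state starts with D.
    ps -eo pid,user,stat,etime,args | awk 'NR == 1 || $3 ~ /^D/'
    echo "--- recent kernel messages ---"
    dmesg 2>/dev/null | tail -n 50
}

# Mail the report only when explicitly enabled; otherwise print it.
if [ "${SEND_MAIL:-0}" = "1" ]; then
    build_report | mail -s "Unkillable step on $(uname -n)" "$ADMIN_MAIL"
else
    build_report
fi
```

Processes in state D are listed because an unkillable step is often one blocked on I/O, which matches the stale network I/O explanation given elsewhere in this thread.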

[slurm-users] Re: Node in drain state

2025-09-19 Thread Ole Holm Nielsen via slurm-users
On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote: as an example, my UnkillableStepProgram is just a bash script collecting recent logs and processes and mailing me about the error. Nothing special. We use Slurm "triggers" to get alerts from many different types of events, see https://githu
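As a rough illustration of the trigger mechanism, a permanent trigger for drained nodes could be registered with `strigger`. The program path below is a hypothetical placeholder, and triggers generally have to be set by the SlurmUser account, so treat this as a sketch and not as the setup used by the poster.

```shell
# Sketch: register a permanent trigger that runs a program when a node drains.
# PROGRAM is a hypothetical path; the script it points to is not shown here.
PROGRAM="/usr/local/sbin/notify_drained.sh"
trigger_cmd="strigger --set --node --drained --flags=PERM --program=$PROGRAM"

if command -v strigger >/dev/null 2>&1; then
    # Register the trigger, then list what is currently registered.
    $trigger_cmd
    strigger --get
else
    echo "strigger not found; would run: $trigger_cmd"
fi
```

Without `--flags=PERM` a trigger fires once and is then removed, so a one-shot trigger would have to be set again after every event.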

[slurm-users] Re: Node in drain state

2025-09-18 Thread Gestió Servidors via slurm-users
Hi, After reading the answer from Ole Holm Nielsen, I have increased "MessageTimeout" to 20s (the default is 5s) and "UnkillableStepTimeout" to 150s (the default is 60s, and it should always be at least 5 times "MessageTimeout"). However, I have also read that UnkillableStepProgram indicates the program to use
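The relationship between the two timeouts mentioned in this message can be checked with a few lines of shell. The values are the ones quoted above; the function name `timeouts_ok` is hypothetical, and the default values for a given Slurm release are best confirmed against its slurm.conf man page.

```shell
# Sketch: check that UnkillableStepTimeout is at least 5 times MessageTimeout.
# Values are the ones quoted in the message; timeouts_ok is a hypothetical helper.
MESSAGE_TIMEOUT=20
UNKILLABLE_STEP_TIMEOUT=150

timeouts_ok() {
    # Succeeds when the second argument is at least 5 times the first.
    [ "$2" -ge $(( $1 * 5 )) ]
}

if timeouts_ok "$MESSAGE_TIMEOUT" "$UNKILLABLE_STEP_TIMEOUT"; then
    echo "OK: $UNKILLABLE_STEP_TIMEOUT >= 5 * $MESSAGE_TIMEOUT"
else
    echo "WARNING: UnkillableStepTimeout is below 5 * MessageTimeout"
fi

# On a live cluster the configured values can be read with:
#   scontrol show config | grep -E 'MessageTimeout|UnkillableStepTimeout'
```

With the quoted settings the check passes, since 150 is above 5 * 20 = 100. Raising MessageTimeout to 20s while leaving UnkillableStepTimeout at 60s would fail it.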

[slurm-users] Re: Node in drain state

2025-09-16 Thread Ole Holm Nielsen via slurm-users
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
[root@login-node ~]# sinfo
PARTITION  TIMELIMIT  AVAIL  STATE    NODELIST  CPU_LOAD  NODES(A/I)  NODES(A/I/O/T)  CPUS  CPUS(A/I/O/T)  REASON
*node.q*   4:00:00    up     drained