Hi,
I have also seen a node reach this "drain state" twice in the last few weeks. It
is the first time on this cluster (Slurm 24.05 on the latest setup), and
I have been running Slurm for many years (with Slurm 20.11 on the oldest
cluster).
No user process was found, so I just resumed the node.
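(For reference, resuming such a node is a one-line scontrol command; the node name below is a placeholder, not from this cluster.)

```shell
# Clear the DRAIN flag and return the node to service
# (node01 is a placeholder name)
scontrol update NodeName=node01 State=RESUME
```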
Patrick
Hi Patrick,
On 9/22/25 07:39, Patrick Begou via slurm-users wrote:
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
Is there any way to reset a node to “state=idle” after errors in the
same way?
First you have to investigate whether the job's user has any processes left
behind on the compute node. It may very well be stale I/O from the job
to a network
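One way to sketch that check, run directly on the drained node: processes stuck in uninterruptible sleep (state "D") are the usual sign of stale network-filesystem I/O left behind by a killed job step.

```shell
# List processes in uninterruptible sleep (STAT starting with "D");
# these are commonly blocked on stale network I/O and cannot be killed.
ps -eo pid,user,stat,comm | awk 'NR > 1 && $3 ~ /^D/ { print }'
```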
Hello,
As an example, my UnkillableStepProgram is just a bash script that collects
recent logs and processes and mails me about the error. Nothing special.
Best regards,
--
*Lorenzo Bosio*
Tecnico di Ricerca - Laboratorio HPC4AI
Dipartimento di Informatica
Università degli Studi di Torino
Corso
On 9/18/25 12:39, Lorenzo Bosio via slurm-users wrote:
as an example, my UnkillableStepProgram is just a bash script collecting
recent logs and processes and mailing me about the error. Nothing special.
We use Slurm "triggers" to get alerts for many different types of events; see
https://githu
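For example, a trigger that runs a notification script when a node goes down can be registered with strigger; the script path below is a placeholder.

```shell
# Register a trigger: run the given program (placeholder path)
# when a node enters the DOWN state.
strigger --set --node --down --program=/usr/local/sbin/node_down_alert.sh
```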
Hi,
After reading the answer from Ole Holm Nielsen, I have increased "MessageTimeout"
to 20s (the default is 5s) and "UnkillableStepTimeout" to 150s (the default is
60s, and it should always be at least 5 times larger than "MessageTimeout"). However,
I have also read that UnkillableStepProgram indicates the program to use
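The timeout changes described above correspond to a slurm.conf fragment like this; the UnkillableStepProgram path is an illustrative placeholder, not a value from the thread.

```
# slurm.conf excerpt matching the values described above
MessageTimeout=20
UnkillableStepTimeout=150
# Placeholder path; run when a job step cannot be killed
UnkillableStepProgram=/usr/local/sbin/unkillable_report.sh
```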
On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
[root@login-node ~]# sinfo
PARTITION  TIMELIMIT  AVAIL  STATE    NODELIST  CPU_LOAD  NODES(A/I)  NODES(A/I/O/T)  CPUS  CPUS(A/I/O/T)  REASON
*node.q*   4:00:00    up     drained
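When a node shows up as drained like this, the recorded reason can be inspected before deciding how to recover it; the node name below is a placeholder.

```shell
# Show why nodes are down/drained, including the reason string and timestamp
sinfo -R
# Full details for a single node (node01 is a placeholder name)
scontrol show node node01
```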