The nodes are being removed because their names no longer resolve in DNS. Are 
you using a dynamic system where only active hosts' names resolve?
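
A quick way to confirm this from the controller host is to test each node name against the system resolver. The snippet below is only a rough sketch: the function names `resolves` and `check_nodes` are made up for illustration, and it assumes `getent` is available (as on most glibc-based Linux systems). `getent hosts` consults the same NSS sources (/etc/hosts, DNS, ...) that the failing getaddrinfo() call in the log would normally use.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given name resolves through
# the system resolver, fail otherwise.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Hypothetical helper: print one status line per node name passed in.
check_nodes() {
    for node in "$@"; do
        if resolves "$node"; then
            echo "$node: resolves"
        else
            echo "$node: does NOT resolve"
        fi
    done
}

# Example (run on the slurmctld host):
#   check_nodes spg-ethx-f4ce
```

If a name that is listed in slurm.conf shows up as not resolving, giving the controller a lookup that always works (for example a static /etc/hosts entry, or possibly an IP address in NodeAddr= for that node) may keep the node visible while it is offline; I have not verified that against 21.08.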

Xand

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Joe 
Teumer <joe.teu...@gmail.com>
Sent: Tuesday, October 25, 2022 7:42:16 PM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] slurmctld removing offline nodes

We noticed that the slurm controller will remove nodes that it cannot reach.
How can this be disabled?
We would like to see the nodes marked down/drain instead of the controller 
removing the nodes from sinfo.

/var/log/slurm/slurmctld.log
[2022-10-25T13:10:01.500] debug:  Log file re-opened
[2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed: Temporary failure in name resolution
[2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve "spg-ethx-f4ce"
[2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not supported
[2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce

cat /etc/slurm/slurm.conf | grep -i f4ce
NodeName=spg-ethx-f4ce ...
PartitionName=debug spg-ethx-f4ce ...

No output in sinfo:
sinfo -N | grep f4ce
sinfo -R | grep f4ce

slurmd -V
slurm 21.08.0
