The nodes are being removed as they aren't resolving in DNS anymore; are you using a dynamic system where only active hosts' names resolve?
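
If it helps to confirm this, one quick check is to try resolving each node name from the controller host with getaddrinfo(), which is the call that fails in your log. A minimal Python sketch under that assumption; the function name unresolved_nodes and its resolve parameter are made up for illustration and are not part of Slurm:

```python
import socket


def unresolved_nodes(names, resolve=socket.getaddrinfo):
    """Return the node names that the resolver cannot look up.

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so
    the check can be exercised without a live DNS server.
    """
    failed = []
    for name in names:
        try:
            # Port is irrelevant here; we only care whether the name resolves.
            resolve(name, None)
        except socket.gaierror:
            # Raised for errors such as "Temporary failure in name resolution".
            failed.append(name)
    return failed


# Example call, run on the slurmctld host (node name taken from the log):
#   unresolved_nodes(["spg-ethx-f4ce"])
```

Any name this returns is one the controller host cannot currently resolve, which would match the slurm_set_addr error in the log.
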

Xand

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Joe Teumer <joe.teu...@gmail.com>
Sent: Tuesday, October 25, 2022 7:42:16 PM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] slurmctld removing offline nodes

We noticed that the slurm controller will remove nodes that it cannot reach.
How can this be disabled?
We would like to see the nodes marked down/drain instead of the controller removing the nodes from sinfo.

/var/log/slurm/slurmctld.log

[2022-10-25T13:10:01.500] debug: Log file re-opened
[2022-10-25T13:10:01.589] error: get_addr_info: getaddrinfo() failed: Temporary failure in name resolution
[2022-10-25T13:10:01.589] error: slurm_set_addr: Unable to resolve "spg-ethx-f4ce"
[2022-10-25T13:10:01.589] error: slurm_get_port: Address family '0' not supported
[2022-10-25T13:10:01.589] error: _set_slurmd_addr: failure on spg-ethx-f4ce

cat /etc/slurm/slurm.conf | grep -i f4ce
NodeName=spg-ethx-f4ce ...
PartitionName=debug spg-ethx-f4ce ...

No output in sinfo:

sinfo -N | grep f4ce
sinfo -R | grep f4ce

slurmd -V
slurm 21.08.0