Hey Nate,

We actually fixed the underlying issue that caused the NOT_RESPONDING
flag - on failures we terminated the node manually ourselves instead of
letting Slurm call the terminate script. That led to Slurm believing
the node should still be there when it had already been terminated.
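
To be concrete, by "terminate script" I mean whatever slurm.conf points at
for powering cloud nodes down. A minimal sketch of the relevant entries
(program paths and timeout values are placeholders, not our real ones):

    # slurm.conf power-saving hooks for cloud nodes
    ResumeProgram=/usr/local/sbin/slurm-launch-node      # slurmctld runs this to create an instance
    SuspendProgram=/usr/local/sbin/slurm-terminate-node  # slurmctld runs this to tear one down
    SuspendTimeout=120
    ResumeTimeout=600

The point is that slurmctld has to be the one invoking SuspendProgram; when
we destroyed the instance ourselves, Slurm never saw the power-down and kept
expecting the node to respond.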

Therefore, we no longer have the issue, as we no longer see nodes
with NOT_RESPONDING.

Nice to hear that you found a solution though.

Best,
Xaver

On 19.09.24 15:04, nate--- via slurm-users wrote:
Hi Xaver,

I found your thread while searching for a solution to the same issue with cloud 
nodes. In the past I have always used POWER_UP to get the node to register and 
clear the NOT_RESPONDING flag, but this necessarily creates an instance 
regardless of whether I need one. It turns out that updating with UNDRAIN
accomplishes the same thing without booting an instance. Setting UNDRAIN allows
the node to be scheduled, which causes the resume program to run; once the node
has booted and registered, NOT_RESPONDING is cleared.
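
For the record, the commands involved are along these lines (the node name is
just a placeholder):

    # old approach: forces an instance to boot purely to re-register the node
    scontrol update NodeName=cloud-node01 State=POWER_UP
    # new approach: makes the node schedulable again without forcing a boot
    scontrol update NodeName=cloud-node01 State=UNDRAIN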

Unfortunately, the node state still displays NOT_RESPONDING, so it still shows
up in sinfo --dead, and as far as I can tell there is no way to separate "will
boot" from "won't boot" nodes. Clearly there is still some internal state that
does not appear to be user-visible, at least from scontrol show node. And if
there is a way to administratively clear NOT_RESPONDING entirely, I have not
found it. But hopefully this helps.
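
For anyone checking their own nodes, the relevant commands look roughly like
this (the node name is a placeholder):

    sinfo --dead                                  # lists only non-responding nodes
    scontrol show node cloud-node01 | grep State  # NOT_RESPONDING appears as a state flag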

--nate

