Hi Xaver,

I found your thread while searching for a solution to the same issue with cloud 
nodes. In the past I have always used POWER_UP to get the node to register and 
clear the NOT_RESPONDING flag, but this necessarily creates an instance 
regardless of whether I need one. It turns out that updating the node with
UNDRAIN accomplishes the same thing without booting an instance right away.
Setting UNDRAIN allows the node to be scheduled again, which causes the resume
program to run when a job needs it; once the node has booted and registered,
NOT_RESPONDING is cleared.
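
For reference, the two updates described above can be sketched roughly as
follows. This is only a minimal sketch: the function names and the SCONTROL
override (handy for dry runs) are hypothetical, and the node name in the usage
note is a placeholder.

```shell
# Hypothetical wrappers around "scontrol update"; the names are illustrative.
# SCONTROL can be overridden (e.g. SCONTROL=echo) to print instead of execute.
SCONTROL=${SCONTROL:-scontrol}

# Previous approach: clears NOT_RESPONDING but always boots an instance.
power_up_node() {
    "$SCONTROL" update NodeName="$1" State=POWER_UP
}

# Alternative: make the node schedulable again without booting it now.
# The resume program should then run only when a job is placed on the node.
undrain_node() {
    "$SCONTROL" update NodeName="$1" State=UNDRAIN
}
```

Usage would look like "undrain_node cloud-node-01", where cloud-node-01 stands
in for a real node name.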

Unfortunately, the node state still displays NOT_RESPONDING, so the node still
shows up in sinfo --dead, and as far as I can tell there is no way to
distinguish "will boot" from "won't boot" nodes. Clearly there is still some
internal state that is not user-visible, at least not from scontrol show node.
And if there is a way to administratively clear NOT_RESPONDING entirely, I have
not found it. But hopefully this helps.
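
To check what is actually reported, something like the following could be used.
Again a sketch under assumptions: the helper names and the SINFO/SCONTROL
overrides are hypothetical, and the grep pattern assumes that scontrol show node
prints a single "State=..." field.

```shell
# Hypothetical helpers for inspecting the state that is reported to users.
SINFO=${SINFO:-sinfo}
SCONTROL=${SCONTROL:-scontrol}

# List non-responding nodes, one per line, as "<node> <state>".
list_dead_nodes() {
    "$SINFO" --dead --Node --noheader --format="%N %T"
}

# Print only the State= field from "scontrol show node <name>".
show_node_state() {
    "$SCONTROL" show node "$1" | grep -o 'State=[^ ]*'
}
```

Neither command separates nodes that will boot on demand from nodes that will
not; they only show the flags that are visible.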

--nate

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com