The problem turned out to be that I had Gres=gpu:gp100:1 on the NodeName line for that node, but the node has neither a GPU nor a gres.conf. Once I moved the Gres= setting to the correct NodeName line in slurm.conf, the node came out of the drain state and became usable again.
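
For anyone hitting the same thing, here is a rough sketch of how one might list which NodeName lines in slurm.conf carry a Gres= setting, so they can be compared by hand against the nodes that really have a GPU and a gres.conf. The function name nodes_with_gres and the node names in the sample are made up for illustration; the parsing is deliberately simple and assumes one NodeName definition per line with no backslash continuations.

```python
def nodes_with_gres(slurm_conf_text):
    """Return {node name expression: Gres value} for NodeName lines that set Gres=."""
    found = {}
    for raw in slurm_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.lower().startswith("nodename="):
            continue
        # Split the line into Key=Value tokens; keys are lower-cased so that
        # "Gres=" and "gres=" are treated the same way by this sketch.
        fields = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields[key.lower()] = value
        if "gres" in fields:
            found[fields["nodename"]] = fields["gres"]
    return found


# Hypothetical slurm.conf excerpt: cpu01 has no GPU but carries the Gres= setting.
sample = """\
NodeName=cpu01 CPUs=16 RealMemory=64000 Gres=gpu:gp100:1 State=UNKNOWN
NodeName=gpu01 CPUs=16 RealMemory=64000 State=UNKNOWN
"""

print(nodes_with_gres(sample))  # {'cpu01': 'gpu:gp100:1'}
```

On a real system you would pass in the contents of your own slurm.conf and check each reported node against its hardware. The script only reports what the file says; it does not query slurmctld or change any node state.
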
Pretty strange that having a Gres= property on a node that doesn't have a GPU would get it stuck in the drain state.

On Thu, Jan 23, 2020 at 2:34 PM Alex Chekholko <a...@calicolabs.com> wrote:

> Hey Dean,
>
> Does 'scontrol show node <nodename>' give any "Reason:"? You can also look
> at 'sinfo -R'.
>
> Make sure the relevant network ports are open:
>
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>
> Also check that slurmd daemons on the compute nodes can talk to each other
> (not just to the master). e.g. bottom of
> https://slurm.schedmd.com/big_sys.html
>
> Regards,
> Alex
>
> On Thu, Jan 23, 2020 at 1:05 PM Dean Schulze <dean.w.schu...@gmail.com>
> wrote:
>
>> I've tried the normal things with scontrol (
>> https://blog.redbranch.net/2015/12/26/resetting-drained-slurm-node/),
>> but I have a node that will not come out of the drain state.
>>
>> I've also done a hard reboot and tried again. Are there any other
>> remedies?
>>
>> Thanks.