I've not seen the IDLE* issue before, but when my nodes got stuck I've always been able to fix them with this:

[root@cloud01 ~]# scontrol update nodename=cloud01 state=down reason=stuck
[root@cloud01 ~]# scontrol update nodename=cloud01 state=idle
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_down
[root@cloud01 ~]# scontrol update nodename=cloud01 state=power_up

Antony

On 17 July 2018 at 18:13, Michael Gutteridge <michael.gutteri...@gmail.com> wrote:

> Hi
>
> I'm running a cluster in a cloud provider and have run up against an odd
> problem with power save. I've got several hundred nodes that Slurm won't
> power up even though they appear idle and in the powered-down state. I
> suspect that they are in a "not-so-idle" state: `scontrol` for all of the
> nodes which aren't being powered up shows the state as
> "IDLE*+CLOUD+POWER". The asterisk is throwing me off here- that state
> doesn't appear to be documented in the scontrol manpage (I want to say I'd
> seen it discussed on the list, but google searches haven't turned up much
> yet).
>
> The other nodes in the cluster are being powered up and down as we'd
> expect. It's just these nodes that Slurm doesn't power up. In fact, it
> appears that the controller doesn't even _try_ to power up the node- the
> logs (both for the controller with DebugFlags=Power and the power
> management script logs) don't indicate even an attempt to start a node when
> requested.
>
> I haven't figured a way to reliably reset the nodes to "IDLE". Some
> relevant configs are:
>
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU
> SuspendProgram=/var/lib/slurm-llnl/suspend
> SuspendTime=300
> SuspendRate=10
> ResumeRate=10
> ResumeProgram=/var/lib/slurm-llnl/resume
> ResumeTimeout=300
> BatchStartTimeout=300
>
> A typical node is configured thus:
>
> NodeName=nodef74 NodeAddr=nodef74.fhcrc.org Feature=c5.2xlarge CPUs=4
> RealMemory=16384 Weight=40 State=CLOUD
>
> Thanks for your time- any advice or hints are greatly appreciated.
>
> Michael